BUG: indexing with missing labels deprecation not applied to MultiIndex #39424

Open
3 tasks done
attack68 opened this issue Jan 26, 2021 · 26 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@attack68
Contributor

attack68 commented Jan 26, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

df = pd.DataFrame({("z","a"): [1, 2, 3], ("z","b"):[4, 5, 6]})
df.loc[pd.IndexSlice[:, pd.IndexSlice["z", ["a", "b", "c"]]]]  # works: DataFrame with columns 'a' and 'b'.

df = pd.DataFrame({("a"): [1, 2, 3], ("b"):[4, 5, 6]})
df.loc[pd.IndexSlice[:, pd.IndexSlice[["a", "b", "c"]]]]  # KeyError: missing label 'c'

Problem description

Should this behaviour be consistent?
A recent decision to move to KeyError: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike

Expected Output

Both return KeyError?

This can be linked to #32125 also.

@attack68 attack68 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2021
@phofl
Member

phofl commented Jan 26, 2021

Hi, thanks for your report.

@jbrockmendel I think this is somehow intended? I remember vaguely that we have discussed this in the past

@jreback
Contributor

jreback commented Jan 27, 2021

this is as intended. if you have a label that is not in the index, you must use .reindex
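As a minimal sketch of the flat-Index behaviour referred to here, assuming pandas 1.0 or later (where a list-like containing a missing label raises) and a made-up two-element Series:

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])

# A list-like that contains a missing label ("c") raises KeyError.
try:
    s.loc[["a", "c"]]
    raised = False
except KeyError:
    raised = True

# reindex is the documented alternative: the missing label shows up as NaN.
filled = s.reindex(["a", "c"])
print(raised)
print(filled)
```

The reindex call returns a float Series because NaN is inserted for the label that was not present.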

@jbrockmendel
Member

@jreback i think the issue is that the first case doesn't raise, not that the second case does

@attack68
Contributor Author

To give this even more context (possibly a bug) @jbrockmendel @jreback :

df = pd.DataFrame({("z","a"): [1, 2, 3], ("z","b"):[4, 5, 6]})

df.loc[:, ("z", ["a", "b", "c"])]                 # NO ERROR: DataFrame returned with columns 'a' and 'b'.
df.loc[:, [("z", "a"), ("z", "b"), ("z", "c")]]   # KeyError: The following labels were missing: MultiIndex([('z', 'c')]

The error message hints that this behaviour is known or at least expected:

Passing list-likes to .loc or [] with any missing labels is no longer supported.

I took a look but the .loc code is quite tricky to get into and well nested. Suggest someone who knows their way around might feel like commenting.

@phofl
Member

phofl commented Jan 30, 2021

@attack68 I might be able to help a bit here. In case of a nested tuple as indexer, we dispatch to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_locs.html

As you already mentioned, this is a bit inconsistent. I'll try to write up the cases we encounter:

df = pd.DataFrame({("z","a"): [1, 2], ("z","b"):[4, 5], ("y", "b"): [6, 7], ("y", "c"): [6, 7]})

df.loc[:, ("x", ["a", "b", "c"])]  # Already raises since the non-list-like key is missing, but KeyError: 'x' is a bit confusing
df.loc[:, ("z", ["a", "b", "c"])]  # should raise probably
df.loc[:, (["z", "y"], ["a", "b"])]  # should raise probably
df.loc[:, (["z", "y", "x"], ["b"])]  # should raise probably

The way get_locs is implemented makes this quite hard to fix with a meaningful error message. Cases where something like ("y","a") is missing but ("y","b"), ("z", "a"), ("z", "b") all exist are especially hard to handle.

@attack68 attack68 added MultiIndex Indexing Related to indexing on series/frames, not to indexes themselves Error Reporting Incorrect or improved errors from pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 10, 2021
@attack68
Contributor Author

@phofl I encountered this problem in a new context and have revised my view. Rather than arguing 'just for consistency's sake', I think there is good reason for Index and MultiIndex to behave differently.
I will close this since there is more info in the issue linked below.

see #39775

@attack68
Contributor Author

attack68 commented May 8, 2021

Hi @phofl: reopening this to discuss how MultiIndex generates KeyError. I have actually changed my mind in the interim for good reason. My original argument was consistency, but I think a better argument, for avoiding KeyError for now, is practical use and the lack of a workable reindex for a MultiIndex (also take a look at #39775).

In summary, for dealing with a missing key in a single index the documented recommendation is .reindex.

For a MultiIndex, say:

    [("a", 1, "z"), 
     ("b", 1, "z"), 
     ("b", 2, "z"), 
     ("c", 3, "y"),
     ("d", 5, "y"), 
     ("e", 5, "y")]

then if you intend to index with say: ["f", :] or [(slice(None), 4), :] then you would get a KeyError, so you specifically have to ensure you don't do that by testing for key presence in the specific level_values and exclude certain keys from the lookup.

You cannot simply use .reindex since there are numerous sensible approaches, e.g. when trying to add 4 to level_1:

  • you could exhaustively concatenate MultiIndex.from_product([["a", "b", "c", "d", "e"], [4], ["z", "y"]]), which is memory error prone and impractical.
  • you could, at most, double the length of the Index by concatenating the new key in every available location and removing duplicates, i.e. MultiIndex.from_arrays([["a","b","b","c","d","e"], [4,4,4,4,4,4], ["z","z","z","y","y","y"]]).drop_duplicates()
  • or you could insert a single key presence determined by some nearest neighbour or 'first' algorithm, i.e. just adding the single key: ("a", 4, "z") or ("c", 4, "y")

Thus, I think the behaviour should be considered very carefully for how MultiIndex slicing should operate.
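To make the size difference between the first two approaches concrete, here is a small sketch using the six-key index from above. The variable names (full_product, per_row) are only illustrative and not part of any pandas API.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1, "z"), ("b", 1, "z"), ("b", 2, "z"),
     ("c", 3, "y"), ("d", 5, "y"), ("e", 5, "y")]
)

# Approach 1: full product of level-0 values, the new key 4, and level-2 values.
lev0 = idx.get_level_values(0).unique()
lev2 = idx.get_level_values(2).unique()
full_product = pd.MultiIndex.from_product([lev0, [4], lev2])

# Approach 2: put 4 next to every existing (level 0, level 2) pair, then dedupe.
per_row = pd.MultiIndex.from_arrays(
    [idx.get_level_values(0), [4] * len(idx), idx.get_level_values(2)]
).drop_duplicates()

print(len(full_product), len(per_row))
```

Even on this tiny index the product is twice the size of the per-row variant, and the gap widens with every additional level.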

@phofl
Member

phofl commented May 8, 2021

Let's take a step back first please, because I have a question about what you would expect when reindexing a MultiIndex. The behavior you are describing basically comes down to:

idx = MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 3), ("b", 4)])
idx.reindex(["a", "c"], level=0)

which returns

(MultiIndex([('a', 1),
            ('a', 2)],
           ), array([0, 1]))

with levels

[['a', 'c'], [1, 2, 3, 4]]

So the issue you mentioned above does not really exist here, since "c" is added as an unused level? Would you expect something different here?
In general I would agree that raising for something like ("a", 3) is not a really good idea, but raising for something like ("c", 1) looks doable with the current implementation without major implications for the reindexing alternative?

@attack68
Contributor Author

attack68 commented May 9, 2021

Let's take a step back first please, because I have a question about what you would expect when reindexing a MultiIndex. The behavior you are describing basically comes down to:

Ok good point, actually I haven't revisited this for a while, and the last time I tried re-indexing a MultiIndex it didn't quite work out. So good to start with the basics.

Yes I think that "c" getting added as an unused level value is probably the best way to go. This, of course, differs to a SingleIndex, where "c" would be visible and the missing data would be inserted as NaN.

With a single index Keys are either present or not-present and that is the determinant in a KeyError. In a MultiIndex level keys are either present or not-present in each level with the potential that a key containing present-level values across all levels is actually not-present in the MultiIndex, e.g. ("a", 3).

Using your example consider the following:

import pandas as pd
from pandas import MultiIndex, Series

ix = pd.IndexSlice
idx = MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 3), ("b", 4)])
s = Series([1, 2, 3, 4], index=idx)

A) s.loc["b", [1,2]] = --> KeyError: ('b', [1, 2])
B) s.loc["b", ix[1:2]] = --> Series([], dtype: float64)
C) s.loc["b", [1,2,3]] = --> Series with one element ("b", 3)
D) s.loc["c", 1] = --> KeyError: ('c', 1)
E) s.loc[["c"], 1] = --> Series([], dtype: int64)
F) s.loc["c", [1]] = --> KeyError: 'c'
G) s.loc[["c"], [1]] = --> Series([], dtype: int64)
H) s.loc[["c"], [1,2]] = --> Series([], dtype: float64)
I) s.loc["b", 100] = --> KeyError: ('b', 100)
J) s.loc[["b"], 100] = --> KeyError: 100
K) s.loc[["b"], [100]] = --> Series([], dtype: float64)
L) s.loc["b", [100]] = --> Series([], dtype: float64)

s2 = s.reindex(["a", "b", "c"], level=0)

M) s2.loc["c", [1,2]] = --> KeyError: 'c'

I think there are lots of combinations here worth reviewing but some obvious ones:

  • A) conflicts with B) and C)
  • A) conflicts with L)
  • D) and G) conflict (maybe good reason)
  • A) conflicts with M) in error message after the reindex.

My risk-averse opinion is that if raising KeyError is to be more useful than returning an empty DataFrame, it must be consistently applied. If not, the tail risk of bad cases is increased. That is why I wanted to raise a cautionary flag against PRs that add more KeyErrors to MultiIndexes.

Is there any kind of developer document on how multiindex .loc is supposed to deal with all the possible combinations?

@phofl
Member

phofl commented May 9, 2021

Is there any kind of developer document on how multiindex .loc is supposed to deal with all the possible combinations?

I don't think so apart from missing keys should raise a KeyError.

I am not sure why you would do the following:

s2 = s.reindex(["a", "b", "c"], level=0)

M) s2.loc["c", [1,2]] = --> KeyError: 'c'

Reindexing already selects the wanted labels, so you don't need loc anymore. You could do s2.loc[:, [1,2]] if you want to select the columns afterwards.

Apart from C I would argue that every one of these cases should raise. But we are not that consistent here. This is something we should work on in the future. I think C should raise too probably, but the underlying implementation makes this hard. Would probably need a do over to be able to do this correctly.

I'll try to look into this in the coming weeks, maybe we can get this more consistent with 1.3

@attack68
Contributor Author

Apart from C I would argue that every one of these cases should raise. But we are not that consistent here. This is something we should work on in the future. I think C should raise too probably, but the underlying implementation makes this hard. Would probably need a do over to be able to do this correctly.

If .loc["b", [1,2,3]] raises, what about .loc[["b"], [1,2,3]], where the cases are separable by a kind of 1-dim vector-product instead of scalar broadcast?

And what about .loc[["a","b"], [1,2,3]] which like the above 1-dim vector-product contains valid and invalid keys through the cross product? If we start raising on that then I have a suspicion (without mathematical proof) that the majority of MultiIndex slicing will become impossible except over complete sets.

If .loc["b", [1,2,3]] raises even when a valid key ("b", 3) exists, would you consider that a case where pandas prioritises errors over returning valid data?

I'll try to look into this in the coming weeks, maybe we can get this more consistent with 1.3

I'll try and add something to the discussion also, maybe a formal working doc on the ideas and working MultiIndexes may prove helpful in the future, especially if new devs come in, since the mechanics is quite complex for a user let alone a developer I expect!

@attack68
Contributor Author

@phofl, I apologise for the long message here, but I think having a developer specification in the code/docs somewhere for how this works is needed and the below provides a starting framework (even if this is greatly re-worked). It also collects most of my opinion into a consistent structure.

Pandas Indexing with .loc

.loc is a flexible indexer for DataFrames and Series, operable either on the index or columns (in a DataFrame) or both simultaneously. The simultaneous case only requires a basic intersection, so, without loss of generality, we can restrict attention to indexing only the index of a Series.

Definitions for an Index

  • An index is a sequence of defined keys, either unique or non-unique, and either ordered, i.e. monotonic, or non-monotonic.

  • A key is a tuple containing values from each level in the index. If the number of levels is greater than one we refer to the index as a MultiIndex. If it is one, then a single element, e.g. "a" as opposed to a one-tuple, ("a",), is a key.

  • A level of a MultiIndex contains two definitions of values:

    • A sequence of level_values which are those values within the sequence of keys associated with the given level. These may be duplicated. For example if keys = [("a",1), ("b",2), ("b",3)] then level_0_values = ["a", "b", "b"]
    • A set of level_elements which are values that may (or may not) be included in at least one key. If a level_element is not included in any key then it is called an unused_level_element otherwise it is a used_level_element. For example if keys = [("a",1), ("b",2), ("b",3)] the level_0_elements might be ["a", "b", "c"] where "c" is unused, but "b", "a" must be included as used elements. Thus set(level_values) is a subset of level_elements.
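The distinction between level_values and level_elements maps onto existing pandas attributes. The sketch below assumes that level_values corresponds to MultiIndex.get_level_values and level_elements to MultiIndex.levels, and builds the unused "c" explicitly through the levels/codes constructor.

```python
import pandas as pd

# Keys ("a", 1), ("b", 2), ("b", 3); level 0 additionally stores an unused "c".
idx = pd.MultiIndex(
    levels=[["a", "b", "c"], [1, 2, 3]],
    codes=[[0, 1, 1], [0, 1, 2]],
)

level_0_values = list(idx.get_level_values(0))  # values actually used by keys
level_0_elements = list(idx.levels[0])          # stored elements, used or not
unused = sorted(set(level_0_elements) - set(level_0_values))

# remove_unused_levels() drops elements that no key refers to.
trimmed = list(idx.remove_unused_levels().levels[0])
print(level_0_values, level_0_elements, unused, trimmed)
```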

We observe that for a MultiIndex, say [("a", 1), ("b", 2), ("b", 3)] with level_elements=[["a", "b", "c"], [1,2,3]] this creates 5 different cases when proposing a key:

  • a) a proposed key contains used_elements on all levels and is present in keys and has a data value, ("a", 1) = 10
  • b) a proposed key contains used_elements on all levels and is present in keys and has missing data, ("b", 2) = NaN
  • c) a proposed key contains used_elements on all levels and is not present in keys, ("a", 2) = <empty1>
  • d) a proposed key contains used_elements and/or unused_elements, with an unused_element on at least one level, and is thus not present in keys, ("c", 3) = <empty2>
  • e) a proposed key contains at least one level_value not in level_elements and is thus not present in keys, ("d", 2) = <empty3>

Levels of Error

This document proposes that there be no discernible difference between <empty1> and <empty2>, in that both keys ("a", 2) and ("c", 3) provide a valid combination from level_elements (be it used or unused), but that it does not result in a valid key. On the other hand ("d", 2) provides a discernible difference since "d" can be definitively shown not to be in level_0_elements and therefore <empty3> is more of an <error> or <undefined> result. Although this test could be done solely for used_elements there is reason to provide more weight to "c" being in level_elements through reindex (shown below).

For a MultiIndex, we then expand the definitions to:

  • valid_key_combination: a combination of level_elements that is contained in keys.
  • invalid_key_combination: a combination of level_elements that happens not to be contained in keys.
  • erroneous_key_combination: a key which is definitively not in keys since at least one value is not in its associated level_elements

If the level_elements were expanded by use of a reindex method to [["a", "b", "c", "d"], [1, 2, 3]] then the interpretation of ("d", 2) should change to now be considered as the case with ("a", 2) or ("c", 3) or simply as result <empty>, as an invalid_key_combination rather than an erroneous_key_combination.
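The three categories can be expressed as a small classifier. This is only a sketch of the proposal: classify_key is a hypothetical helper, not a pandas function, and it assumes full-depth tuple keys.

```python
import pandas as pd

idx = pd.MultiIndex(
    levels=[["a", "b", "c"], [1, 2, 3]],
    codes=[[0, 1, 1], [0, 1, 2]],
)

def classify_key(index, key):
    """Hypothetical helper: label a full-depth key using the proposed terms."""
    # Any value outside its level's stored elements makes the key erroneous.
    for level, value in zip(index.levels, key):
        if value not in set(level):
            return "erroneous_key_combination"
    # All values are known elements; the key itself may or may not exist.
    if key in set(index.tolist()):
        return "valid_key_combination"
    return "invalid_key_combination"

for key in [("a", 1), ("a", 2), ("c", 3), ("d", 2)]:
    print(key, classify_key(idx, key))
```

Under this sketch ("c", 3) is treated the same as ("a", 2), because "c" is a stored (if unused) element of level 0, while ("d", 2) is the only erroneous key.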

Observation for a Single Index

We note that for a single index the concept of unused_elements does not exist. If a single index is reindexed the new element will be added to the keys and thus exist within level_values, ensuring that for a single index the set(level_0_values) = level_0_elements and thus, by definition, none are unused, so there are only 3 possible cases here:

  • a) a proposed key is present in keys and has a data value, "a" = 10
  • b) a proposed key is present in keys and does not have a data value, "a" = NaN
  • e) a proposed key is not present in keys and "d" = <undefined/error>

Definitions for Selections

To use .loc a valid selection of keys must be presented or given in a way in which it can be inferred.

Single level selections (even of MultiIndexes):

  • Single explicit key: for example .loc["a"] or .loc[("a", 1)] for a MultiIndex.
  • List of explicit keys: for example .loc[["a", "b"]] or .loc[[("a", 1), ("b", 2)]].
  • A tuple IndexSlice (measured over a monotonic index), for example .loc[IndexSlice[("a", 1):("b", 2)]], (which is not the same as the multi-level IndexSlice: .loc[(IndexSlice["a":"b"], IndexSlice[1:2])])
    A tuple IndexSlice effectively reduces to a list, i.e. .loc[IndexSlice["a":"c"]] is equivalent to .loc[["a", "b", "c"]]
  • a colon, ':', or none slice is interpreted as a list of all used_element_values for the associated level.

Given the reduction of an IndexSlice we treat it the same as a list, or a single element list, and, without loss of generality, can discuss only the cases of a single key or a list of keys. The none slice is also regarded as a list.

Multi-level selections (only valid for MultiIndexes):

  • inferred single explicit key: for example .loc["a", 1] is equivalent to .loc[("a", 1)]
  • scalar broadcast across levels: for example .loc["a", [1,2]] is equivalent to .loc[[("a", 1), ("a", 2)]]
  • vector product across levels: for example .loc[["a", "b"], [1,2]] is equivalent to .loc[[("a", 1), ("a", 2), ("b", 1), ("b", 2)]]
  • not that there is any difference for this document but .loc[["a", "b"], "x", [1,2]] is regarded as a scalar broadcast of level 1 with the vector-product of levels 0 and 2, and not the vector-product of level 0 with the vector result of the scalar broadcast of levels 1 and 2, or the vector-product of level 2 with the vector result of the scalar broadcast of levels 1 and 0.
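The equivalences listed above amount to a Cartesian product over per-level selectors. A minimal sketch in plain Python follows; expand_selection is a hypothetical name, and it assumes that only lists denote sequences while any other value (including a tuple) is a scalar.

```python
from itertools import product

def expand_selection(*per_level):
    """Hypothetical helper: expand per-level selectors into explicit keys.

    A scalar is broadcast; a list contributes each of its items.
    """
    as_lists = [sel if isinstance(sel, list) else [sel] for sel in per_level]
    return list(product(*as_lists))

print(expand_selection("a", 1))                  # inferred single key
print(expand_selection("a", [1, 2]))             # scalar broadcast
print(expand_selection(["a", "b"], [1, 2]))      # vector product
print(expand_selection(["a", "b"], "x", [1, 2])) # broadcast within a product
```

Because the product is associative, the grouping order discussed in the last bullet does not change the resulting set of keys.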

Defining a KeyError

The issue of raising a KeyError is important for the usefulness of indexing and user-feedback. This document proposes, in general, to raise a KeyError when an erroneous_key_combination is given but not an invalid_key_combination (which only applies to a MultiIndex). The exception to this rule is where a key is explicitly input as a tuple that is an invalid_key_combination, either as part of a list or of a single-level index slice, and this is a special circumstance KeyError.

For example:

  • .loc["a", [1,2,3]] is not a list of explicit keys, it is a scalar broadcast and all level inputs are contained in level_elements therefore it is a permitted indexer which returns only one valid_key_combination and 2 invalids, which are ignored as <empty>.
  • .loc[[("a", 1), ("a", 2), ("a", 3)]] is an explicit list of keys, 2 of which are invalid and thus we raise a special KeyError.
  • .loc[IndexSlice[("a",2):("b",3)]] contains an explicit key which is invalid and thus we raise a special KeyError
  • .loc[IndexSlice["a":"b"], [1,2,3]] is a vector-product of level_elements and yields valid and invalid keys which are <empty> and ignored.
  • .loc[IndexSlice["a":"d"], [1,2,3]] contains a value not in level_0_elements, i.e. "d", and thus a regular KeyError is raised.
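The proposed rule for vector-product selections could be prototyped outside of .loc. The sketch below is not how pandas behaves or is implemented: loc_product is a hypothetical helper that accepts one list per level, raises only for values missing from a level's stored elements, and silently drops combinations that are not in the index.

```python
import numpy as np
import pandas as pd

def loc_product(ser, *per_level):
    """Hypothetical helper sketching the proposed rule for list selectors."""
    mask = np.ones(len(ser), dtype=bool)
    for i, wanted in enumerate(per_level):
        elements = set(ser.index.levels[i])
        missing = [w for w in wanted if w not in elements]
        if missing:
            # erroneous_key_combination: value unknown to this level
            raise KeyError(f"{missing} not in level {i} elements")
        # invalid combinations simply fail the mask and are ignored
        mask &= ser.index.isin(wanted, level=i)
    return ser[mask]

s = pd.Series(
    [10, 20, 30],
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2), ("b", 3)]),
)
result = loc_product(s, ["a", "b"], [1, 2, 3])
print(result)
```

Here the call returns the three existing keys, whereas loc_product(s, ["a", "d"], [1, 2, 3]) raises because "d" is not a level 0 element.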

Motivation for this Specification

One of the strengths of pandas is the MultiIndex and data selection procedures. We must be careful not to remove this functionality by inadvertently introducing KeyErrors across the board and greatly restricting data accessibility.

For a single index where a KeyError is raised the suggestion to counteract this is to reindex. For example, if .loc["c"] raises then s.reindex(["a", "b", "c"]).loc["c"] will yield NaN, and allow the user to avoid the KeyError if his indexer is based on some other dynamic execution.

For a MultiIndex under the above framework this reindexing becomes consistent. For example:

s = Series([1,2,3], index=MultiIndex.from_arrays([["a", "b", "b"], [1,2,3]]))
s.loc["b", [1,2,3]]
b  2    2
   3    3
dtype: int64

s.loc[["c", "b"], [1,2,3]]
KeyError("c" not in level_0 of s.index)

s = s.reindex(["a", "b", "c"], level=0)
a  1    1
b  2    2
   3    3
dtype: int64

s.loc[["c", "b"], [1,2,3]]
b  2    2
   3    3
dtype: int64

It is imperative to avoid the raising of a KeyError in the below in the case the vector-product yields invalid_key_combinations:

s.loc[["a", "b"], [1,2,3]]
KeyError(("a", 2) not in s.index)

Not only is this error non-exhaustive, since ("a", 3) and ("b", 1) are also invalid, but the number of invalids grows exponentially for larger indexes, and reporting the errors, not to mention looking them up, would greatly hinder performance.
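The set of invalid combinations for this example can be enumerated directly, which illustrates why reporting them all would scale poorly. The sketch uses only the three-key Series from above and plain set membership.

```python
from itertools import product

import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.MultiIndex.from_arrays([["a", "b", "b"], [1, 2, 3]]),
)

# Every key implied by the vector product .loc[["a", "b"], [1, 2, 3]].
requested = list(product(["a", "b"], [1, 2, 3]))
present = set(s.index.tolist())
invalid = [key for key in requested if key not in present]
print(invalid)
```

Half of the six requested combinations are absent here, and the requested count is the product of the list lengths across levels.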

@phofl
Member

phofl commented May 11, 2021

No need to apologize. I agree, this would be useful to have in the docs. Thanks for writing this up.

I agree with most of the things you have written. This would make the behavior consistent for MultiIndexes.

For a single index where a KeyError is raised the suggestion to counteract this is to reindex. For example, if .loc["c"] raises then s.reindex(["a", "b", "c"]).loc["c"] will yield NaN, and allow the user to avoid the KeyError if his indexer is based on some other dynamic execution.

I don't think this is the recommended way to do this. You should do s.reindex(["c"]) without a loc afterwards. Loc is not necessary here at all, which solves the unused level issue partly.

Hence doing this allows us to raise when having unused levels without restricting functionality from a user perspective. I think we should raise on unused levels, because reindex is not the only way to create unused levels in a MultiIndex. This can also happen accidentally (see #41362 and associated issues) and I don't think something which is done for performance reasons (see link in #36227) should have impact on the behavior here.

Imo we should move something like this (after we agreed on the content of course) to the docs somewhere. I do think this should be visible to users too.

@attack68
Contributor Author

I don't think this is the recommended way to do this. You should do s.reindex(["c"]) without a loc afterwards. Loc is not necessary here at all, which solves the unused level issue partly.

But reindex does not offer the full indexing capability that loc does. I agree that the docs support your comment, but they miss the following use case for obtaining a specifically structured output (even if the docs example makes the solution apparent):

s = pd.Series([1,2], index=["a", "c"])
s.loc["a":"c"]
a    1
c    2
dtype: int64

s = pd.Series([1,2], index=["a", "c"]).reindex(["a", "b", "c", "d"])
s.loc["a":"c"]
a    1.0
b    NaN
c    2.0
dtype: float64

For trying to exemplify this with a MultiIndex, let me use a working case in finance where 3 assets have different trading calendars, and possibly different collected data:

us_stock = Series([1,2,5,6], 
                  index=MultiIndex.from_product([to_datetime(["2000-01-01", "2000-01-03"]), ["open", "close"]]))
uk_stock = Series([1,2,3,4], 
                  index=MultiIndex.from_product([to_datetime(["2000-01-01", "2000-01-02"]), ["open", "close"]]))
de_stock = Series([4,6], 
                  index=MultiIndex.from_product([to_datetime(["2000-01-02", "2000-01-03"]), ["close"]]))

In my proposal you would get the following:

ix0 = IndexSlice["2000-01-01":"2000-01-03"]
ix1 = ["open", "close"]

de_stock.loc[ix0, ix1] --> KeyError: "2000-01-01" not in level_0_elements, "open" not in level_1_elements
uk_stock.loc[ix0, ix1] --> KeyError: "2000-01-03" not in level_0_elements
us_stock.loc[ix0, ix1]
2000-01-01  open     1
            close    2
2000-01-03  open     5
            close    6

but if you can reindex without necessarily adding any keys:

de_stock = de_stock.reindex(to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]), level=0).reindex(["open", "close"], level=1)
uk_stock = uk_stock.reindex(to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]), level=0).reindex(["open", "close"], level=1)
us_stock = us_stock.reindex(to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]), level=0).reindex(["open", "close"], level=1)

de_stock.loc[ix0, ix1]
2000-01-02  close    4
2000-01-03  close    6
uk_stock.loc[ix0, ix1]
2000-01-01  open     1
            close    2
2000-01-02  open     3
            close    4
us_stock.loc[ix0, ix1]
2000-01-01  open     1
            close    2
2000-01-03  open     5
            close    6

Hence doing this allows us to raise when having unused levels without restricting functionality from a user perspective. I think we should raise on unused levels, because reindex is not the only way to create unused levels in a MultiIndex. This can also happen accidentally (see #41362 and associated issues) and I don't think something which is done for performance reasons (see link in #36227) should have impact on the behavior here.

I see there are downstream issues to this decision, and I'm keeping an open mind! If we raise on unused levels, it would be good and interesting to find out what downstream implications that has as well.

@phofl
Member

phofl commented May 11, 2021

I think I understand what you are trying to do, but I am still not sure what you are needing loc for. Could you give me a usecase why we would need loc after using reindex and the unused level thing becomes a problem? Sorry if this is too obvious, but I can not see it.

Edit: Everything I can think of right now can be done without running into the KeyError, if you really need the loc call after reindexing

@jbrockmendel
Member

The workarounds discussed focus on reindex, but I think isin is a natural candidate given how get_locs actually works.
ser.loc[single_key, slice, listlike] is equivalent to

mask = ser.index.isin(listlike, level=2)
ser[mask].loc[single_key, slice]

Granted that's more verbose than a single loc call. Also it's not quite right if ser.index.levels[2] is an IntervalIndex, which is a whole other can of worms. I'd be open to having a method to make this more concise, but am leaning towards preferring it not be loc (or maybe pd.IndexSet({...}) that could be passed to loc and have .isin semantics?)
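A runnable sketch of the isin masking step, using a made-up three-level Series; it illustrates only the mask and does not reproduce the full .loc equivalence stated above.

```python
import pandas as pd

ser = pd.Series(
    range(4),
    index=pd.MultiIndex.from_tuples(
        [("a", 1, "x"), ("a", 2, "y"), ("b", 1, "x"), ("b", 2, "z")]
    ),
)

# Labels absent from level 2 ("q") are ignored rather than raising.
mask = ser.index.isin(["x", "q"], level=2)
subset = ser[mask]

# A list made up only of absent labels yields an empty result.
empty = ser[ser.index.isin(["q"], level=2)]
print(subset)
print(len(empty))
```

MultiIndex.isin with level= returns a boolean array, so masks for several levels can be combined with & before indexing.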

@attack68
Contributor Author

The workarounds discussed focus on reindex, but I think isin is a natural candidate given how get_locs actually works.
ser.loc[single_key, slice, listlike] is equivalent to

mask = ser.index.isin(listlike, level=2)
ser[mask].loc[single_key, slice]

For big data with multiindexes with a lot of levels the isin solution is impractical and probably ambiguous as to what order you mask your level selections by. It also doesn't help to yield an empty series, over a KeyError if that is the objective (which is solved for a single Index by using reindex). The reindex solution is also impractical because you have to specify a list of tuple indexes, which if generated dynamically with a product approach can explode memory.

Granted that's more verbose than a single loc call. Also it's not quite right if ser.index.levels[2] is an IntervalIndex, which is a whole other can of worms. I'd be open to having a method to make this more concise, but am leaning towards preferring it not be loc (or maybe pd.IndexSet({...}) that could be passed to loc and have .isin semantics?)

Maybe this is the best way. I really like the flexibility of loc but MultiIndexes are complicated and it's difficult to communicate how to use them effectively to a casual user. I hadn't even thought of the isin approach and I use MI a lot.

@jbrockmendel
Member

Another case I've run into (@attack68 LMK if this belongs in a different thread)

In #27591 we have a case where a level contains a tuple, but incorrectly goes through the get_locs path and returns an empty Series

lev1 = ["a", "b", "c"]
lev2 = [(0, 1), (1, 0)]
lev3 = [0, 1]
cols = pd.MultiIndex.from_product([lev1, lev2, lev3], names=["x", "y", "z"])
df = pd.DataFrame(index=range(5), columns=cols)

# the lev2[0] here should be treated as a single label, not as a sequence
#  of labels
result = df.loc[:, (lev1[0], lev2[0], lev3[0])]

expected = df.iloc[:, :1]
tm.assert_frame_equal(result, expected)

We can fix this by checking for this case before dispatching to get_locs. But what do we do if instead of lev2[0] we passed (0, 2)? If we're treating it as a single label, then we should raise KeyError. If we're falling through to get_locs, we get an empty Series.

@jbrockmendel
Member

Shoot, it gets even worse: what if we have multiple levels like lev2?

@attack68
Contributor Author

attack68 commented Jun 30, 2021

@jbrockmendel here is a pathological example for you:

>>> df = pd.DataFrame([[1,2,3]], columns=pd.MultiIndex.from_tuples([("a", 0), (("a", "b"), (0, 1)), ("b", 1)]))
    a   (a, b)   b
    0   (0, 1)   1
0   1        2   3

>>> df.loc[:, (("a", "b"), (0,1))]
    a    b
    0    1
0   1    3

!

I would suggest that, since tuples can be used as immutable index level values, it should not be possible to provide a tuple as a sequence input to a level within the loc indexer; sequences should only be passed as a list (or set).

Since the tuple (and only the tuple) is used as the notation by which to identify axes and levels within loc this then provides a complete separation.

This is also supported by the fact it is not possible to construct index levels values from lists or sets:

>>> df = pd.DataFrame([[1,2,3]], columns=pd.MultiIndex.from_tuples([("a", 0), (["a", "b"], [0, 1]), ("b", 1)])) 
## TypeError: unhashable type: list
>>> df = pd.DataFrame([[1,2,3]], columns=pd.MultiIndex.from_tuples([("a", 0), ({"a", "b"}, {0, 1}), ("b", 1)])) 
## TypeError: unhashable type: set

In this regime the following two commands have separate and distinct meaning:

df.loc[:, (["a", "b"], [0, 1])]
df.loc[:, (("a", "b"), (0, 1))]

(not entirely sure why you can't use df.loc[:, ({"a", "b"}, {0, 1})] as an equivalent to lists actually)

Note the workaround for this pathological example is:

>>> df.loc[:, ([("a", "b")], [(0, 1)])]
    (a, b)   
    (0, 1)   
0        2   

edit: with so much complexity involved in these matters I also think it is advantageous to restrict input format to a very specific set of instructions to help not only with coding but with educating how to use.

@jbrockmendel
Member

Sounds like you're suggesting we deprecate the behavior for a tuple but retain it for other listlike?

@attack68
Contributor Author

attack68 commented Jul 1, 2021

Sounds like you're suggesting we deprecate the behavior for a tuple but retain it for other listlike?

Besides backwards compatibility, is there anything that would be affected by deprecating being able to use tuples in loc to also mean a sequence of labels, as opposed to just defining the axis and levels structure, and specific level value keys?

@jbrockmendel
Member

Besides backwards compatibility, is there anything that would be affected by deprecating being able to use tuples in loc to also mean a sequence of labels, as opposed to just defining the axis and levels structure, and specific level value keys?

The only downside that comes to mind is that we would be deprecating it here but not in a bunch of other places where tuples can be used to indicate sequences (I'd be on board for deprecating that across the board, but that's a daunting task)

@jbrockmendel
Member

@attack68 is this closed by #42351?

@jbrockmendel
Member

@attack68 gentle ping is this closed by #42351?

@attack68
Contributor Author

attack68 commented Dec 21, 2021

@jbrockmendel

Ignoring some of the digressive comments in this thread and just analysing the title, of implementing "KeyError" consistently, running my script reveals the following open areas for improvement:

  • Using an index slice where one of the slice elements is invalid, no KeyError is reported for Monotonic Indexes (either Unique or Duplicated), but one is reported for Unsorted Indexes:
indexes = [
    Index(['a','c','b','d'], name='Unique Unsorted<br>["a","c","b","d"]'),
    Index(['a','c','b','d'], name='Unique Monotonic<br>["a","b","c","d"]').sort_values(),
    Index(['a','b','c','b'], name='Duplicated Unsorted<br>["a","b","c","b"]'),
    Index(['a','c','b','b'], name='Duplicated Monotonic<br>["a","b","b","d"]').sort_values(),
]
s = Series([1,2,3,4], index=indexes[j])
s.loc[IndexSlice["a":"!"]]
# for j = 0 to 3: KeyError, Empty Series, KeyError, Empty Series

This also shows the same issue for Monotonic MultiIndexes.

  • When slicing with a unique key for an Index there is a return in all cases, for Duplicated or Unique Indexes, but with a MultiIndex this is not consistent (although this one might have a simple explanation actually)
s.loc[IndexSlice["a":"c"]]
# for j = 0 to 3: Series, Series, Series, Series
multi_indexes = [
    MultiIndex.from_tuples(names=['Unique Levels Unsorted', None],
        tuples=[('a', 0), ('c', 2), ('b', 1), ('d', 3)], ),
    MultiIndex.from_tuples(names=['Unique Index Unsorted', None],
        tuples=[('a', 0), ('b', 1), ('c', 2), ('b', 3)], ),
    MultiIndex.from_tuples(names=['Unique Levels Monotonic', None],
        tuples=[('a', 0), ('c', 2), ('b', 1), ('d', 3)], ).sort_values(),
    MultiIndex.from_tuples(names=['Unique Index Monotonic', None],
        tuples=[('a', 0), ('b', 1), ('c', 2), ('b', 3)], ).sort_values(),
    MultiIndex.from_tuples(names=['Duplicated Unsorted', None],
        tuples=[('a', 0), ('b', 1), ('c', 2), ('b', 1)], ),
    MultiIndex.from_tuples(names=['Duplicated Monotonic', None],
        tuples=[('a', 0), ('b', 1), ('c', 2), ('b', 1)], ).sort_values(),
]

ms = Series([1,2,3,4], index=multi_indexes[j])
# for j = 0 to 5: IndexErr, IndexErr, Series, Series, IndexErr, Series
  • The inconsistencies that were present with nested sequences of labels are now deprecated with a warning, so for now this is OK. Will test again when it changes in a future version (e.g. .loc[(["a"], ["a", "!"])])
