Overview of [] (__getitem__) API #9595

Open
jorisvandenbossche opened this Issue Mar 5, 2015 · 11 comments

Member

jorisvandenbossche commented Mar 5, 2015

some examples (on Series only) in #12890

I started making an overview of the indexing semantics with http://nbviewer.ipython.org/gist/jorisvandenbossche/7889b389a21b41bc1063 (only for series/frame, not for panel)

Conclusion: it is a mess :-)


Summary for slicing

  • Slicing with integer labels is:
    • always integer location based
    • except for a float indexer where it is label based
  • Slicing with other types of labels is always label based if it is of appropriate type for the indexer.

So the behaviour is equivalent to .ix, except that the two are swapped for integer labels on an integer index: [] slices by integer location, whereas .ix on an integer axis is always label based, with no fallback to integer location.
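
A minimal sketch of the integer-slicing rule above, assuming a Series whose integer labels differ from its positions. The variable names are illustrative. Since .ix has been removed from pandas, .loc and .iloc stand in here for explicit label-based and position-based access.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0..3.
s = pd.Series(["a", "b", "c", "d"], index=[2, 3, 4, 5])

by_getitem = s[2:4]        # integer slice through []: positions 2 and 3
by_position = s.iloc[2:4]  # explicit position-based slice, same rows
by_label = s.loc[2:4]      # explicit label-based slice, end label included

print(list(by_getitem))  # ['c', 'd']
print(list(by_label))    # ['a', 'b', 'c']
```

The float-index exception mentioned in the summary is left out on purpose, since its behaviour has not been the same across pandas versions.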

Summary for single label

  • Indexing with a single label is always label based
  • But, there is fallback to integer location based, except for integer and float indexers
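
A small sketch of the single-label rule, assuming an integer index stored in reverse order so that labels and positions disagree. The names are illustrative. The positional fallback for non-numeric indexes is not exercised, because it depends on the pandas version in use.

```python
import pandas as pd

# Integer index in reverse order, so label 2 sits at position 0.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s[2])       # 10: the key is matched against the labels
print(s.iloc[2])  # 30: explicit position-based lookup

# On a string index, a string key is likewise matched against the labels.
t = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(t["y"])     # 20
```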

Summary for indexing with list of labels

  • It is primarily label based, but:
    • There is fallback to integer location based, except for an integer or float axis
    • It is a pure reindex: even if no label of the list is found, you just get an all-NaN series (which contrasts with loc, where at least one label must be found)
    • String parsing for a datetime index does not seem to work

This mainly follows ix, apart from points 2 and 3
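
The "pure reindex" behaviour in point 2 can be illustrated with Series.reindex, which is the explicit spelling of that operation. This is only a sketch with made-up labels. It deliberately does not pass missing labels to [], because whether that returns NaN or raises depends on the pandas version.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# reindex returns one row per requested label and fills unknown labels with NaN.
partial = s.reindex(["a", "zz"])
nothing_found = s.reindex(["yy", "zz"])

print(partial.isna().tolist())     # [False, True]
print(nothing_found.isna().all())  # True
```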

Summary for boolean indexing

  • This is simple, it just works as expected
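
For completeness, a minimal boolean-indexing example with illustrative names and data:

```python
import pandas as pd

s = pd.Series([5, 15, 25], index=["a", "b", "c"])

mask = s > 10       # boolean Series aligned with s
selected = s[mask]  # keeps the rows where the mask is True

print(selected.index.tolist())  # ['b', 'c']
```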

Summary for DataFrames

  • It uses the 'information' axis (axis 1) for:
    • single labels
    • list of labels
  • It uses the rows (axis 0) for:
    • slicing
    • boolean indexing

This is as documented (only the boolean case is not explicitly documented, I think).
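
A short sketch of which axis each kind of key selects on a DataFrame. The frame and its labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

one_column = df["a"]          # single label: selects a column (Series)
two_columns = df[["a", "b"]]  # list of labels: selects columns (DataFrame)
first_rows = df[0:2]          # slice: selects rows
filtered = df[df["a"] > 1]    # boolean mask: selects rows

print(first_rows.index.tolist())  # ['x', 'y']
print(filtered.index.tolist())    # ['y', 'z']
```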

For the rest (on the chosen axis), it follows the same semantics as [] on a series, but:

  • for a list of labels, now all labels must be present (no pure reindex as with series)
  • for single labels: no fallback to integer location based for a non-numeric index (but it does fall back for a list of labels ...)
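
The first point can be checked with a small sketch. The column name "missing" is made up and intentionally absent from the frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

try:
    df[["a", "missing"]]  # one of the requested columns does not exist
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```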

Questions are here:

jreback Mar 5, 2015

Contributor

@jorisvandenbossche this is a really nice summary.

I think in general we can move []/.ix closer (maybe even make them identical), so as not to have any confusion. (Of course, we may have to eliminate fallback, which is not a bad thing anyhow.)

I suppose we should prepare any changes for 0.17.0 as these will technically be API changes.

jreback Mar 6, 2015

Contributor

xref #7501 , #8976, #7187

shoyer Mar 6, 2015

Member

xref #9213, CC @hugadams @dandavison

@jorisvandenbossche Indeed, this is a nice summary of current behavior. Thanks!

I think we should consider radical API changes for __getitem__ if we want pandas to have a lasting influence.

My two cents on indexing is that "fallback indexing" is a really bad idea. It starts with the best of intentions, but leads to special cases such as the distinction between integer and float indexes (e.g., see #9213). In the face of ambiguity, refuse the temptation to guess.

So if I were reinventing indexing rules from scratch, I would consider something like this (for DataFrame):

  • Indexing with a string or list of strings does label based selection on columns.
  • All other indexing is position based, NumPy style. (This includes indexing with a boolean array.)

That's it. Two simple rules that probably cover 90% of existing uses of __getitem__, at least the only ones that I could ever keep straight (string column labels and boolean arrays). Importantly, indexing would never depend on the type of the index and there would be no reindexing/NaN-filling behavior. We could also eliminate the need for .iloc as a separate indexer entirely.
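
The two rules could be sketched roughly as follows. proposed_getitem is a hypothetical helper written for this illustration, not a pandas API, and it assumes that "position based, NumPy style" means selecting along the first axis (rows).

```python
import numpy as np
import pandas as pd


def proposed_getitem(df, key):
    """Hypothetical dispatcher for the two rules above; not a pandas API."""
    is_str_list = (
        isinstance(key, list)
        and len(key) > 0
        and all(isinstance(k, str) for k in key)
    )
    if isinstance(key, str) or is_str_list:
        # Rule 1: strings select columns by label.
        return df.loc[:, key]
    if isinstance(key, (pd.Series, list)):
        # Drop any index from masks and position lists so no label alignment happens.
        key = np.asarray(key)
    # Rule 2: everything else is position based along the rows.
    return df.iloc[key]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 20, 30])

print(proposed_getitem(df, "a").tolist())                # [1, 2, 3]
print(proposed_getitem(df, 0).tolist())                  # [1, 4]
print(proposed_getitem(df, slice(0, 2)).index.tolist())  # [10, 20]
print(proposed_getitem(df, df["a"] > 1).index.tolist())  # [20, 30]
```

Note that the result never depends on the type of the index: the integer labels 10, 20, 30 are ignored by every non-string key.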

This sort of change would require a serious deprecation cycle or perhaps need to wait until pandas 1.0 (likely both), but something needs to change. The fact that even pandas developers need to run extensive experiments to figure out how __getitem__ works indicates just how wrong things are. Indexing should be simple enough that its behavior can be relied on in production code. The current state of indexing is, frankly, embarrassing.

shoyer Sep 9, 2016

Member

@jorisvandenbossche Did you ever figure out how __setitem__ works? :)

jorisvandenbossche Sep 9, 2016

Member

@shoyer nope :-) I would suspect it is largely the same, but you never know ... Will try to look at it next week

matthewgilbert Aug 28, 2017

Contributor

I wanted to add this here since it is somewhat related to "String parsing for a datetime index does not seem to work" mentioned above and I have not seen it come up anywhere else. For a MultiIndex, string parsing for a datetime index with a scalar does not result in dropping the MultiIndex level.

In [2]: dfm = pd.DataFrame([1, 2, 3], index=pd.MultiIndex.from_arrays([pd.date_range("2015-01-01", "2015-01-03"), ['A', 'A', 'B']]))

In [3]: dfm.loc["2015-01-01"]
Out[3]: 
              0
2015-01-01 A  1

In [4]: dfm.loc[pd.Timestamp("2015-01-01")]
Out[4]: 
   0
A  1

this seems like somewhat unintuitive behaviour (to me at least)

jreback Aug 29, 2017

Contributor

@matthewgilbert this is just how partial string indexing works, see the docs here. The first is treated as a slice, while the second is an exact match.
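
The slice-versus-exact-match distinction can be seen on a plain DatetimeIndex. This sketch assumes an index with sub-daily timestamps (so a date-only string is coarser than the index) and does not reproduce the MultiIndex case reported above.

```python
import pandas as pd

# Sub-daily timestamps, so a date-only string is coarser than the index.
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 12:00", "2015-01-02 00:00"])
s = pd.Series([1, 2, 3], index=idx)

by_string = s.loc["2015-01-01"]                   # slice: every row on that day
by_timestamp = s.loc[pd.Timestamp("2015-01-01")]  # exact match: one scalar

print(by_string.tolist())  # [1, 2]
print(by_timestamp)        # 1
```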

aavanian Sep 18, 2017

I came across this, which seems related but could also be a bug in the above interacting with CategoricalIndex. Using the same example as #15470:

pandas 0.20.3

s = pd.Series([2, 1, 0], index=pd.CategoricalIndex([2, 1, 0]))
s[2]  # works (interpreting as label)
s.loc[2]  # fails with TypeError: cannot do label indexing on <class 'pandas.core.indexes.category.CategoricalIndex'> with these indexers [2] of <class 'int'>

# of course the below works!
s = pd.Series([2, 1, 0], index=[2, 1, 0])
s[2]  # works (interpreting as label)
s.loc[2]  # works (interpreting as label)

TomAugspurger Sep 18, 2017

Contributor

@aavanian that looks like a bug. Could you open a separate issue for it?

aavanian Sep 18, 2017

Sure, done in #17569

tdpetrou Nov 27, 2017

Contributor

If I were to rebuild pandas, I would make indexing as simple as possible and only use .loc and .iloc. I would not implement __getitem__. There would be no ambiguity. I also wouldn't allow attribute access to columns. It would be a pain to select a single column with df.loc[:, 'col'], but pandas really needs to focus on being explicit.
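
For reference, a sketch of the explicit spellings next to the shorthands they would replace, using an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

explicit_label = df.loc[:, "col"]  # label-based column selection
explicit_position = df.iloc[:, 0]  # position-based column selection
shorthand = df["col"]              # the __getitem__ shorthand
attribute = df.col                 # attribute access

print(explicit_label.equals(shorthand))  # True
```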
