Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: what should a 2D indexing operation into a 1D Index do? (eg idx[:, None]) #27837

Closed
jorisvandenbossche opened this issue Aug 9, 2019 · 8 comments · Fixed by #35141
Closed
Labels
API Design Deprecate Functionality to remove in pandas Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Follow-up on #27775 and #27818.

Short recap of what those issues were about:

Currently, indexing into an Index with a 2D (or multiple D) indexer results in an "invalid" Index with an underlying ndarray:

In [1]: idx = pd.Index([1, 2, 3])  

In [2]: idx2 = idx[:, None] 

In [3]: idx2
Out[3]: Int64Index([1, 2, 3], dtype='int64')

In [4]: idx2.values
Out[4]: 
array([[1],
       [2],
       [3]])

So from the repr it looks like a proper index, but the underlying values of an Index should always be 1D (such an invalid index will also lead to errors once you do operations on them).

Before pandas 0.25.0, the shape attribute of the index "correctly" returned the shape of the underlying values: (3, 1), but in 0.25.0 this was changed to (3,) (only checking the length). This caused a regression matplotlib (#27775), and will be "fixed" in 0.25.1 returning again the 2D shape of the underlying values (#27818). Of course, this is only about the shape attribute, while the root cause is this invalid Index.

I think it is clear that we should not allow such invalid Index object to exist.
I currently know of two ways to end up such situation:

So let's use this issue to discuss what to do for this second way: a 2D indexing operation on a 1D object.

This is relevant for the Index, but we should probably try to have it consistent with Series as well.

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves API Design Compat pandas objects compatability with Numpy or Python functions Index Related to the Index class or subclasses labels Aug 9, 2019
@jorisvandenbossche
Copy link
Member Author

Now on the actual issue: I suppose the current behaviour has historical reasons (certainly from when Series/Index where ndarray subclasses): when doing a 2D operation on a 1D object, we returned a 2D array. This provides some array duck abilities (it behaves as an array-like for code that expects a numpy-like array, as matplotlib did).

For example, Series still does this (for plain numpy types):

In [16]: pd.Series([1, 2, 3])[:, None] 
Out[16]: 
array([[1],
       [2],
       [3]])

and the source of Index.__getitem__ actually mentions that for such a case, a plain ndarray should be returned:

If resulting ndim != 1, plain ndarray is returned instead of
corresponding `Index` subclass.

but for Index, that does not happen any more (although, under the hood, we still do create a 2D array, but then wrap it again in the 1D Index).

Now, what to do with this longer term. I think there are two obvious solutions:

  • Consistently return a 2D numpy array (so let Index follow Series (and its own old) behaviour).
    • Advantages:
      • This is the least invasive change, and probably does not break anything (or does not even require a change) in downstream libraries (such as the matplotlib case)
      • The duck-array ability can be useful to write array-implementation agnostic code?
    • Disadvantages:
      • This is a very implicit way to convert to a numpy array. It's probably better to do this explicit.
      • What with non-numpy dtypes? (eg currently for Series[category] this fails)
  • Raise an error (after a deprecation period):
    • This will eventually break code and will require changes in eg matplotlib (but, we can do this with a proper deprecation period), but we don't get an implicit type change when indexing an object (Series -> array change due to a certain indexing operation is rather surprising)

@jorisvandenbossche
Copy link
Member Author

cc @tacaswell @shoyer

@TomAugspurger
Copy link
Contributor

+1 for deprecate and raise.

@jbrockmendel
Copy link
Member

I think there is special handling for this in DatetimelikeArrayMixin getitem.

@shoyer
Copy link
Member

shoyer commented Aug 9, 2019

Yes, this should definitely raise an error in the long term.

@jorisvandenbossche
Copy link
Member Author

OK, my preference is also to raise in the future, so let's go that way.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2019

+1 to deprecate and raise

@jorisvandenbossche
Copy link
Member Author

So in the meantime, the Index behaviour is deprecated (#30588). I tagged the issue with 1.1 to not forget that we should also deprecate the same behaviour for Series.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Deprecate Functionality to remove in pandas Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants