This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

Simplifying indexing (DataFrame.__getitem__) #22

Open
shoyer opened this issue Sep 9, 2016 · 2 comments

@shoyer

shoyer commented Sep 9, 2016

The rules for exactly what DataFrame.__getitem__/__setitem__ does (pandas-dev/pandas#9595) are sufficiently complex and inconsistent that they are impossible to understand without extensive experimentation.

This makes for a rather embarrassing situation that we really should fix for pandas 2.0.

I made a proposal when this came up last year:

  • Indexing with a string or list of strings does label based selection on columns.
  • All other indexing is position based, NumPy style. (This includes indexing with a boolean array.)

I still like my proposal, but more importantly, it satisfies two important criteria:

  1. The most common uses of DataFrame indexing work unchanged (df['foo'], df[['foo', 'bar']], and df[df['foo'] == 'bar'] might cover 80% of use cases).
  2. It's short and simple, with no exceptions.
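
For illustration, the two rules could be sketched roughly as follows. `proposed_getitem` is a hypothetical helper, not part of pandas, and converting a Series key to a plain array is an assumption about how "position based" would treat a boolean Series.

```python
import pandas as pd


def proposed_getitem(df, key):
    """Hypothetical helper mimicking the proposed rules; not a pandas API."""
    is_str = isinstance(key, str)
    is_str_list = (
        isinstance(key, list)
        and len(key) > 0
        and all(isinstance(k, str) for k in key)
    )
    if is_str or is_str_list:
        # Rule 1: a string or list of strings selects columns by label.
        return df.loc[:, key]
    if isinstance(key, (pd.Series, pd.Index)):
        # Assumption: drop the labels so the key is used purely by position.
        key = key.to_numpy()
    # Rule 2: everything else is positional along the rows, NumPy style.
    return df.iloc[key]


df = pd.DataFrame({"foo": ["bar", "baz", "bar"], "x": [1, 2, 3]})

col = proposed_getitem(df, "foo")                   # like df['foo']
sub = proposed_getitem(df, ["foo", "x"])            # like df[['foo', 'x']]
masked = proposed_getitem(df, df["foo"] == "bar")   # like df[df['foo'] == 'bar']
first_row = proposed_getitem(df, 0)                 # first row, not a column
rows = proposed_getitem(df, slice(1, 3))            # like df[1:3]
```

The first three calls and the slice match what `df[...]` does today; only the integer key changes meaning.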
@jorisvandenbossche
Contributor

@shoyer Thanks for starting this discussion. I will try to update my overview next week.
I included your proposal in the top post for convenience here, but I recommend that others still read my overview and @shoyer's proposal at pandas-dev/pandas#9595.

I think we should certainly consider this (which does not mean that it will eventually turn out to be possible or desirable to change this so radically).

I like the simplicity of the proposal and how it covers the most important use cases (I think you can also count 'slicing with integers', e.g. df[5:10], among the common use cases that stay unchanged).
I am trying to think about the most important drawbacks or consequences / backward incompatibilities. The main use cases that would be impacted that I can think of now:

  • integer/numeric column names:
    • df[0] would no longer give you the first column (or, in general, the column with name 0, which is not necessarily the first), but rather the first row
    • unless we decide to keep the behaviour of single indexers targeting the information axis (columns for a DataFrame) for integer indexers as well (deviating from the NumPy rule of targeting axis 0)
    • Note that when having string columns, something like df[0] now raises a KeyError, so here such a change is less of a problem
  • float index:
    • This is the exception among the index types when it comes to slicing (as it is now label based instead of integer position based). However, this is an inconsistency we should try to solve in any case IMO
  • Do we also allow strings in slicing? (you now only mentioned single string or list of strings)
    • For example, cases like s['a':'d'] or with timeseries df['2012-01-01':] are quite convenient as well (certainly the string-date slicing is something I use).
  • Using other objects (apart from numeric or string) as indexer.
    • Consider the case of a DatetimeIndex where you now can index with Timestamp objects, eg s[pd.Timestamp('2012-01-01 09:00:00')]
  • Series loses a bit of its pure dict-like behaviour:
    • Considering s = pd.Series({1: 1, 2: 2}), s[1] would no longer give you the element with label/key 1
  • ...
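
For reference, here is a sketch of how a few of the cases listed above behave today. The data is made up, and the Series values differ from the `pd.Series({1: 1, 2: 2})` example only so that label and position can be told apart.

```python
import pandas as pd

# Integer column names: df[0] selects the column labelled 0.
df_int = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 1])
col0 = df_int[0]

# String column names: df[0] raises KeyError.
df_str = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
try:
    df_str[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# String slicing on a Series is label based and includes both endpoints.
s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))
a_to_d = s["a":"d"]

# Indexing a DatetimeIndex with a Timestamp object.
ts = pd.Series(
    [10, 20],
    index=pd.DatetimeIndex(["2012-01-01 09:00:00", "2012-01-02 09:00:00"]),
)
at_nine = ts[pd.Timestamp("2012-01-01 09:00:00")]

# Dict-like lookup: s2[1] is the element with label 1, not position 1.
s2 = pd.Series({1: 10, 2: 20})
by_label = s2[1]
by_position = s2.iloc[1]
```

Under the proposal, `df_int[0]` and `s2[1]` are the lookups that would silently change meaning, while `df_str[0]` would start returning a row instead of raising.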

If we want to do this, the question is also (apart from the exact semantics): how can we facilitate the transition?
A __future__ like import is probably not possible (your SO question), but in principle we could raise warnings, in some releases before 2.0, for a bunch of specific cases that we know will change (although that implies putting those in the correct places in the complex indexing code).
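
A minimal sketch of what such a transition warning could look like for one of those cases (an integer key that matches a column label). The wrapper name and the warning text are hypothetical; in practice the check would have to live inside the indexing code itself.

```python
import warnings

import pandas as pd


def getitem_with_transition_warning(df, key):
    """Hypothetical wrapper around df[key]; not a pandas API."""
    is_plain_int = isinstance(key, int) and not isinstance(key, bool)
    if is_plain_int and key in df.columns:
        # This lookup currently returns a column, but would return a row
        # under the proposed position-based rules.
        warnings.warn(
            "df[int] currently selects a column by label; under the proposed "
            "rules it would select a row by position. Use .loc or .iloc to "
            "be explicit.",
            FutureWarning,
            stacklevel=2,
        )
    return df[key]


df = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 1])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    column = getitem_with_transition_warning(df, 0)
```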

@shoyer
Author

shoyer commented Sep 26, 2016

Here's another proposal, more similar to existing rules and without type dependent logic for indexer keys:

  1. Indexing a DataFrame with a non-pandas object (including slice objects) implicitly indexes columns rather than rows (e.g., df[k] -> df[:, k] rather than df[k, :] as in NumPy).
  2. Indexing a DataFrame with a pandas.Series implicitly indexes by rows instead.
  3. Indexing always uses labels like .loc.

We need rule (2) because otherwise boolean indexing like df[df.foo == 'bar'] breaks. This does make for a potentially awkward distinction between pandas and non-pandas objects, and it's not entirely clear where types like numpy.ndarray should fall. Alternatively, we could continue to use an indexer-dtype-based distinction between rows and columns, like the current behavior (boolean arrays and slice objects do rows, everything else does columns), which is mostly reasonable but does break badly if booleans are used as column names.

If we make labels optional (#17), we would use integer indexing instead when there is no index for both __getitem__ and .loc (but importantly, never as a fallback).

A downside of this alternative is that it does break slicing with integers.
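
The three rules above could be sketched as follows. `alt_getitem` is a hypothetical helper, not a pandas API, and the sketch leaves open where numpy.ndarray keys should fall (here they would go to columns, like any other non-pandas object).

```python
import pandas as pd


def alt_getitem(df, key):
    """Hypothetical helper mimicking the alternative rules; not a pandas API."""
    if isinstance(key, pd.Series):
        # Rule 2: a pandas.Series indexes rows (label based, like .loc).
        return df.loc[key]
    # Rules 1 and 3: any non-pandas key, including a slice, indexes columns
    # by label.
    return df.loc[:, key]


df = pd.DataFrame({"foo": ["bar", "baz", "bar"], "x": [1, 2, 3], "y": [4, 5, 6]})

masked = alt_getitem(df, df.foo == "bar")       # rows where foo == 'bar'
x_col = alt_getitem(df, "x")                    # single column
foo_to_x = alt_getitem(df, slice("foo", "x"))   # label slice over columns
```

In this sketch an integer slice such as `slice(0, 2)` would be interpreted as column labels rather than row positions, which is the downside noted above.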
