ENH: per axis and per level indexing (orig GH6134) #6301

Merged
merged 9 commits into from
Feb 13, 2014

Conversation

jreback
Contributor

@jreback jreback commented Feb 8, 2014

This is a reprise of #6134, with tests, and multi-axis support; it is dependent on #6299

closes #4036
closes #4116
closes #3057
closes #2598
closes #5641
closes #3738

This is the whatsnew/docs

MultiIndexing Using Slicers

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers. You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None). See the docs.

Warning 

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....,:]
rather than this:

df.loc[(slice('A1','A3'),.....]
Warning

You will need to make sure that the selection axes are fully lexsorted!
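
A minimal sketch of what "fully lexsorted" means in practice. It assumes a current pandas release, where `sort_index()` takes the place of the `sortlevel()` call used in the session below; the small frame and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Labels are given out of order on purpose, so the row index is not lexsorted.
index = pd.MultiIndex.from_product([["A1", "A0"], ["B1", "B0"]])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["x", "y"])

was_sorted = df.index.is_monotonic_increasing  # False: labels are out of order

# sort_index() sorts the rows lexicographically across all index levels.
df_sorted = df.sort_index()
now_sorted = df_sorted.index.is_monotonic_increasing  # True after sorting

# Per-level slicing on the sorted frame: a label slice on level 0, "B0" on level 1.
selected = df_sorted.loc[(slice("A0", "A1"), "B0"), :]
print(selected)
```

Sorting once up front and checking `is_monotonic_increasing` is one way to make sure label slices on the index behave as expected.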
In [7]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 

In [8]: index = MultiIndex.from_product([mklbl('A',4),
   ...:                                  mklbl('B',2),
   ...:                                  mklbl('C',4),
   ...:                                  mklbl('D',2)])
   ...: 

In [9]: columns = MultiIndex.from_tuples([('a','foo'),('a','bar'),
   ...:                                   ('b','foo'),('b','bah')],
   ...:                                    names=['lvl0', 'lvl1'])
   ...: 

In [10]: df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),
   ....:                index=index,
   ....:                columns=columns).sortlevel().sortlevel(axis=1)
   ....: 

In [11]: df
Out[11]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C0 D0   33   32   35   34
         D1   37   36   39   38
      C1 D0   41   40   43   42
         D1   45   44   47   46
      C2 D0   49   48   51   50
         D1   53   52   55   54
      C3 D0   57   56   59   58
             ...  ...  ...  ...

[64 rows x 4 columns]
In [12]: df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[12]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
             ...  ...  ...  ...

[16 rows x 4 columns]

In [13]: df.loc[(slice(None),slice(None), ['C1','C3']),:]
Out[13]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
         D1   61   60   63   62
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
             ...  ...  ...  ...

[32 rows x 4 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [14]: df.loc['A1',(slice(None),'foo')]
Out[14]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
      D1   92   94
B1 C0 D0   96   98
      D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
          ...  ...

[16 rows x 2 columns]

In [15]: df.loc[(slice(None),slice(None), ['C1','C3']),(slice(None),'foo')]
Out[15]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
         D1   60   62
A1 B0 C1 D0   72   74
         D1   76   78
      C3 D0   88   90
         D1   92   94
   B1 C1 D0  104  106
         D1  108  110
      C3 D0  120  122
             ...  ...

[32 rows x 2 columns]
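
For readers on a current pandas release, the session above can be reproduced roughly as follows. This is a sketch, not the code from the PR: it assumes `sort_index()` in place of `sortlevel()`, and the names `rows` and `cols` are introduced here only for readability.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    """Build labels such as A0, A1, ... as in the session above."""
    return [f"{prefix}{i}" for i in range(n)]


index = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
columns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
df = (
    pd.DataFrame(
        np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
        index=index,
        columns=columns,
    )
    .sort_index()        # lexsort the rows
    .sort_index(axis=1)  # lexsort the columns
)

# One indexer per axis; each indexer is a tuple with one entry per level.
rows = (slice(None), slice(None), ["C1", "C3"])
cols = (slice(None), "foo")
result = df.loc[rows, cols]
print(result.shape)
```

Naming the two per-axis tuples separately keeps the row selection and the column selection easy to tell apart.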

Furthermore, you can set the values using these methods:

In [16]: df2 = df.copy()

In [17]: df2.loc[(slice(None),slice(None), ['C1','C3']),:] = -10

In [18]: df2
Out[18]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
   B1 C0 D0   33   32   35   34
         D1   37   36   39   38
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   49   48   51   50
         D1   53   52   55   54
      C3 D0  -10  -10  -10  -10
             ...  ...  ...  ...

[64 rows x 4 columns]

You can use an alignable object as the right-hand side as well.

In [19]: df2 = df.copy()

In [20]: df2.loc[(slice(None),slice(None), ['C1','C3']),:] = df2*1000

In [21]: df2
Out[21]: 
lvl0             a             b       
lvl1           bar    foo    bah    foo
A0 B0 C0 D0      1      0      3      2
         D1      5      4      7      6
      C1 D0   1000      0   3000   2000
         D1   5000   4000   7000   6000
      C2 D0     17     16     19     18
         D1     21     20     23     22
      C3 D0   9000   8000  11000  10000
         D1  13000  12000  15000  14000
   B1 C0 D0     33     32     35     34
         D1     37     36     39     38
      C1 D0  17000  16000  19000  18000
         D1  21000  20000  23000  22000
      C2 D0     49     48     51     50
         D1     53     52     55     54
      C3 D0  25000  24000  27000  26000
               ...    ...    ...    ...

[64 rows x 4 columns]
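
A small sketch of both kinds of assignment on a made-up two-level frame (the frame and the names `scalar` and `aligned` are illustrative, not from the PR). In Out[21] above, row (A0, B0, C1, D0) holds 1000, 0, 3000, 2000, which is 1000 times row (A0, B0, C0, D0) rather than 1000 times its own original values 9, 8, 11, 10. It is therefore worth printing a selected row to see how the installed version lines up the right-hand side.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["A0", "A1"], ["C0", "C1", "C2", "C3"]])
df = pd.DataFrame(np.arange(16).reshape(8, 2), index=index, columns=["x", "y"])

# Select C1 and C3 on the second level, for every label on the first level.
rows = (slice(None), ["C1", "C3"])

# Scalar right-hand side: every selected cell receives the same value.
scalar = df.copy()
scalar.loc[rows, :] = -10

# Frame right-hand side: only the selected rows are overwritten.
aligned = df.copy()
aligned.loc[rows, :] = df * 1000

# Inspect one selected row; label alignment would give 1000 * [2, 3] here.
print(aligned.loc[("A0", "C1"), :].tolist())
```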

@jreback jreback added this to the 0.14.0 milestone Feb 8, 2014
@ghost

ghost commented Feb 8, 2014

Would it be possible not to take code I (anyone for that matter) spent significant effort writing,
slogging through unreadable, undocumented 2 year-old code, to implement a feature
repeatedly requested and punted on for over a year... and squash it into a commit
titled "add comments in indexing code"?

Granted, OSS contributors come from diverse backgrounds and a varied, richly-textured
mosaic of cultures. Norms and sensitivities differ. But, in my part of the mosaic that's called
"Urinating on someone's shoe".

@jreback
Contributor Author

jreback commented Feb 8, 2014

@ghost

ghost commented Feb 8, 2014

Thanks, much better.

@jreback
Contributor Author

jreback commented Feb 9, 2014

cc @dragoljub

if you could review this PR, would be gr8

@dragoljub

This is a fantastic feature to add and has been long overdue. Thanks to y-p for the coding effort and Jeff for docs, discussion, etc. 👍

All the features look good.

My major feedback would be to add an option to allow multilevel indexing to return the complete index depth (all levels) even if you select one specific level with only one value like this: df.loc['A1',(slice(None),'foo')]. Currently df.loc[('A0', 'B0', 'C1', 'D0')] returns the full index if I recall, which is generally what I want.

Many times I find myself relying on a global indexing scheme that I would like to preserve regardless of the selection I make. This is especially true when I apply multivariate functions on groupbys, since I'm used to having the full index depth.

Quick comments on the first warning:
I think there are missing closing parentheses in both examples:

df.loc[(slice('A1','A3'),.....),:]
rather than this:
df.loc[(slice('A1','A3')),.....]

@jreback
Contributor Author

jreback commented Feb 9, 2014

I think the examples I posted are slightly old,
since I modified the code a bit

we could support something like this

df.loc(drop_level=True)[......]

where drop_level will normally be False

@jreback
Contributor Author

jreback commented Feb 12, 2014

any further comments on the API?....I think this is mergeable

cc @dragoljub, cc @nehalecky, cc @immerrr
@cpcloud @jorisvandenbossche @jtratner

@jreback
Contributor Author

jreback commented Feb 12, 2014

cc @timcera

@jreback
Contributor Author

jreback commented Feb 12, 2014

cc @aharoon123
cc @floux

@immerrr
Contributor

immerrr commented Feb 12, 2014

@jreback, ha, I now get what you meant when you said [0] of the tuple. That's my mistake: __getitem__ is defined as __getitem__(self, x) in the Python reference, so IndexSlice.__getitem__ should read def __getitem__(self, args): return args, without the asterisk and the [0].
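
A minimal pure-Python sketch of the point being made, assuming nothing about the actual pandas source: the class name `IndexSlicerSketch` is hypothetical. Python passes whatever sits between the square brackets to `__getitem__` as a single argument, so returning it unchanged is enough to turn bracket syntax into a tuple of slices.

```python
class IndexSlicerSketch:
    """Hypothetical stand-in for the slicing helper discussed in this thread."""

    def __getitem__(self, args):
        # Python hands over the whole bracket contents as one object:
        # a tuple for comma-separated keys, a bare slice for a single key.
        return args


idx = IndexSlicerSketch()

multi = idx["A1":"A3", :, ["C1", "C3"]]
single = idx["A1":"A3"]

print(multi)
print(single)
```

With `*args` in the signature the key would arrive wrapped in an extra one-element tuple, which is why the earlier version needed the `[0]`.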

@jreback
Contributor Author

jreback commented Feb 12, 2014

@immerrr fixed up....

@jorisvandenbossche
Member

Some questions:

If you have a Series with a multi-index, would something like the following (without using the slice(None) but direct :, since it is directly in the []) work?

s['A1':'A3', :, ['C1','C3']]

I think this could be possible? Or does this make it too complex for users to know when they can use : and when they have to use slice(None)?

And you could also have something where you can specify on which axis you want to slice:

df.loc(axis=0)['A1':'A3', :, ['C1','C3']]

where this would be the same as

df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]

or as (with the IndexSlicer idea):

df.loc[idx['A1':'A3', :, ['C1','C3']],:]

Although I think we should go to one 'preferred' way of doing this (not saying that only one could work, but just choose one to use consistently in the docs).
I also like the IndexSlicer. The only 'problem' that came to my mind is that idx is at the moment sometimes used as a variable name for an index (but only a few times in the docs it seems, so not really a problem I think).

@jreback
Contributor Author

jreback commented Feb 12, 2014

@jorisvandenbossche

Your Series example will work (but will add as a test). This is the sort of ambiguity that
multi-indexes got into from the start; if you don't know that s is a Series you might think
it's a multi-level frame with multi-axis indexing. Blame Python syntax.
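
The ambiguity can be seen without pandas at all. The sketch below uses a hypothetical `ShowKey` class that just echoes the key it receives. A flat key with three entries and a nested key with a row tuple plus a column selector are different objects, but nothing in the flat key says whether its entries are levels of one axis or separate axes.

```python
class ShowKey:
    """Hypothetical helper that returns the key Python builds for obj[...]."""

    def __getitem__(self, key):
        return key


show = ShowKey()

# Series style: one entry per index level, written directly in the brackets.
flat_key = show["A1":"A3", :, ["C1", "C3"]]

# Frame style: a tuple of row-level selectors, then a column selector.
nested_key = show[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]

print(len(flat_key), len(nested_key))

# The colon shorthand is only valid directly inside square brackets,
# which is why slice(None) is needed inside a parenthesised tuple.
try:
    compile("('A1':'A3', :)", "<sketch>", "eval")
    colon_in_parens_ok = True
except SyntaxError:
    colon_in_parens_ok = False
```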

yep...thinking about adding arguments to .loc and friends e.g.

df.loc(axis=None, drop_level=None)[.....] (e.g. an axis can be specified and drop_level is
the argument right now in .xs)...

I think those are the right way to do it, but more 'conveniences' than anything
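
As a point of comparison, here is a sketch of the existing `drop_level` argument on `.xs` that the comment refers to. The frame and level names are made up, and the `df.loc(drop_level=...)` spelling proposed above is not used here since it is only a suggestion in this thread.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["x", "y"])

# Default: the level that was selected on is removed from the result.
dropped = df.xs("A1", level="first")

# drop_level=False keeps the full index depth.
kept = df.xs("A1", level="first", drop_level=False)

print(dropped.index.nlevels, kept.index.nlevels)
```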

I defined only pd.IndexSlicer; it is up to the user to alias it, so for example

df.loc[pd.IndexSlicer['A1':'A3', :, ['C1','C3']],:]

or

from pandas import IndexSlicer as idx
df.loc[idx['A1':'A3', :, ['C1','C3']],:]

(this is how it is in the doc example). I think idx is TOO common.
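
A short sketch of the helper in use. It assumes the name under which the helper is exposed in released pandas, `pd.IndexSlice`, rather than the `IndexSlicer` spelling used in this comment; the frame is made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice  # alias chosen by the user, as suggested above

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2", "A3"], ["B0", "B1"], ["C0", "C1", "C2", "C3"]]
)
df = pd.DataFrame(
    np.arange(len(index) * 2).reshape(len(index), 2),
    index=index,
    columns=["x", "y"],
)

# Bracket syntax through the helper ...
with_helper = df.loc[idx["A1":"A3", :, ["C1", "C3"]], :]

# ... and the equivalent spelled out with explicit slice objects.
with_slices = df.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]

print(with_helper.equals(with_slices))
```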

@jreback
Contributor Author

jreback commented Feb 13, 2014

The following are now possible

In [4]: df.loc(axis=0)['A1':'A3',:,['C1','C3']]
Out[4]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
         D1  189  188  191  190
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]
In [5]: df.loc(axis='columns')[:,'foo']
Out[5]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C0 D0    0    2
         D1    4    6
      C1 D0    8   10
         D1   12   14
      C2 D0   16   18
         D1   20   22
      C3 D0   24   26
         D1   28   30
   B1 C0 D0   32   34
         D1   36   38
      C1 D0   40   42
         D1   44   46
      C2 D0   48   50
...
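
A sketch of the `axis` argument on a smaller, made-up frame, checking that the per-axis call matches the fully spelled-out two-axis form. It assumes a pandas release that includes this feature.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2", "A3"], ["B0", "B1"], ["C0", "C1", "C2", "C3"]]
)
columns = pd.MultiIndex.from_product(
    [["a", "b"], ["bar", "foo"]], names=["lvl0", "lvl1"]
)
df = pd.DataFrame(
    np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
    index=index,
    columns=columns,
)

# axis=0: every entry in the brackets is a selector for one row-index level.
by_axis = df.loc(axis=0)["A1":"A3", :, ["C1", "C3"]]
explicit = df.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]

# axis='columns': the entries are selectors for the column levels instead.
foo_cols = df.loc(axis="columns")[:, "foo"]

print(by_axis.equals(explicit), foo_cols.shape)
```

Because the axis is stated up front, the colon shorthand can be used directly and there is no question of which axis a given entry applies to.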

@jreback
Contributor Author

jreback commented Feb 13, 2014

side issue.. @jorisvandenbossche was thinking that the 'indexing and selecting data' section is getting too long....split off multiindex into another section?


.. code-block:: python

df.loc[(slice('A1','A3'),.....,:]

a missing )? (also in the example below)

y-p and others added 8 commits February 13, 2014 08:37
CLN: add comments in indexing code

CLN: comment out possibly stale kludge fix and wait for explosion

CLN: Mark if clause for handling of per-axis tuple indexing with loc

PERF: vectorize _spec_to_array_indices, for 3-4x speedup

PERF: remove no longer needed list conversion. 1.4x speedup
ENH: add core/indexing.py/_getitem_nested_tuple to handle the nested_tuple cases for partial multi-indexing
… a particular level

ENH: remove get_specs/specs_to_index -> replace with get_locs, to directly compute
     an indexer for a multi-level specification
TST: better error messages when levels are not sorted with core/index/get_locs

ENH: add boolean indexer support on per_axis/per_level

BUG: handle a multi-level indexed series passed like with a nested tuple of selectors
     e.g. something like: s.loc['A1':'A3',:,['C1','C3']]
DOC: release notes and issues for mi_slicing
ENH: allow the axis keyword to short-circuit indexing
jreback added a commit that referenced this pull request Feb 13, 2014
ENH: per axis and per level indexing (orig GH6134)
@jreback jreback merged commit 2a9e994 into pandas-dev:master Feb 13, 2014