ENH: per axis and per level indexing (orig GH6134) #6301

jreback · 2014-02-08T02:00:53Z

This is a reprise of #6134, with tests, and multi-axis support; it is dependent on #6299

closes #4036
closes #4116
closes #3057
closes #2598
closes #5641
closes #3738

docs
v0.14.0 example
release notes
setting assignment to slice #5641, Assignment with MultiIndex replaces dataframe contents with NaNs #3738

This is the whatsnew/docs

MultiIndexing Using Slicers

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers, using slice(None) to select all the contents of a level. You do not need to specify all the deeper levels; they will be implied as slice(None). See the docs.

Warning 

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....,:]
rather than this:

df.loc[(slice('A1','A3'),.....]

Warning

You will need to make sure that the selection axes are fully lexsorted!
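A minimal sketch of what the lexsort requirement means in practice, assuming a recent pandas (where `sort_index` is the sorting call and the error raised is a `KeyError` subclass). The small frame and its labels are hypothetical and are not the example from this PR.

```python
import numpy as np
import pandas as pd

# Deliberately unsorted two-level index (hypothetical labels).
index = pd.MultiIndex.from_tuples(
    [("A1", "B0"), ("A0", "B1"), ("A0", "B0"), ("A1", "B1")]
)
df_unsorted = pd.DataFrame({"x": np.arange(4)}, index=index)

try:
    # Slicing on level 0 of an index that is not lexsorted.
    df_unsorted.loc[(slice("A0", "A1"), "B0"), :]
    raised = False
except KeyError:
    # pandas.errors.UnsortedIndexError derives from KeyError.
    raised = True

# Sorting the index first makes the same selection valid.
df_sorted = df_unsorted.sort_index()
result = df_sorted.loc[(slice("A0", "A1"), "B0"), :]
print(raised, len(result))
```

Sorting once up front is usually cheaper than guarding every selection.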

In [7]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 

In [8]: index = MultiIndex.from_product([mklbl('A',4),
   ...:                                  mklbl('B',2),
   ...:                                  mklbl('C',4),
   ...:                                  mklbl('D',2)])
   ...: 

In [9]: columns = MultiIndex.from_tuples([('a','foo'),('a','bar'),
   ...:                                   ('b','foo'),('b','bah')],
   ...:                                    names=['lvl0', 'lvl1'])
   ...: 

In [10]: df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),
   ....:                index=index,
   ....:                columns=columns).sortlevel().sortlevel(axis=1)
   ....: 

In [11]: df
Out[11]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C0 D0   33   32   35   34
         D1   37   36   39   38
      C1 D0   41   40   43   42
         D1   45   44   47   46
      C2 D0   49   48   51   50
         D1   53   52   55   54
      C3 D0   57   56   59   58
             ...  ...  ...  ...

[64 rows x 4 columns]

In [12]: df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[12]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
             ...  ...  ...  ...

[16 rows x 4 columns]

In [13]: df.loc[(slice(None),slice(None), ['C1','C3']),:]
Out[13]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
         D1   61   60   63   62
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
             ...  ...  ...  ...

[32 rows x 4 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [14]: df.loc['A1',(slice(None),'foo')]
Out[14]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
      D1   92   94
B1 C0 D0   96   98
      D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
          ...  ...

[16 rows x 2 columns]

In [15]: df.loc[(slice(None),slice(None), ['C1','C3']),(slice(None),'foo')]
Out[15]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
         D1   60   62
A1 B0 C1 D0   72   74
         D1   76   78
      C3 D0   88   90
         D1   92   94
   B1 C1 D0  104  106
         D1  108  110
      C3 D0  120  122
             ...  ...

[32 rows x 2 columns]

Furthermore, you can set the values using these methods.

In [16]: df2 = df.copy()

In [17]: df2.loc[(slice(None),slice(None), ['C1','C3']),:] = -10

In [18]: df2
Out[18]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
   B1 C0 D0   33   32   35   34
         D1   37   36   39   38
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   49   48   51   50
         D1   53   52   55   54
      C3 D0  -10  -10  -10  -10
             ...  ...  ...  ...

[64 rows x 4 columns]

You can use a right-hand side of an alignable object as well.

In [19]: df2 = df.copy()

In [20]: df2.loc[(slice(None),slice(None), ['C1','C3']),:] = df2*1000

In [21]: df2
Out[21]: 
lvl0             a             b       
lvl1           bar    foo    bah    foo
A0 B0 C0 D0      1      0      3      2
         D1      5      4      7      6
      C1 D0   1000      0   3000   2000
         D1   5000   4000   7000   6000
      C2 D0     17     16     19     18
         D1     21     20     23     22
      C3 D0   9000   8000  11000  10000
         D1  13000  12000  15000  14000
   B1 C0 D0     33     32     35     34
         D1     37     36     39     38
      C1 D0  17000  16000  19000  18000
         D1  21000  20000  23000  22000
      C2 D0     49     48     51     50
         D1     53     52     55     54
      C3 D0  25000  24000  27000  26000
               ...    ...    ...    ...

[64 rows x 4 columns]
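For readers on a recent pandas, here is a self-contained sketch of the selection and scalar assignment shown above. It assumes `sort_index()` in place of the older `sortlevel()` calls and uses the `pd.` namespace explicitly; the variable names `rows`, `selected`, and `values` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd


def mklbl(prefix, n):
    # Build labels such as A0, A1, ...
    return [f"{prefix}{i}" for i in range(n)]


index = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
columns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")],
    names=["lvl0", "lvl1"],
)
values = np.arange(len(index) * len(columns)).reshape(len(index), len(columns))
# Assumption: sort_index() stands in for sortlevel() on recent pandas.
df = pd.DataFrame(values, index=index, columns=columns).sort_index().sort_index(axis=1)

# Row indexer: every A, every B, only C1 and C3; the D level is implied.
rows = (slice(None), slice(None), ["C1", "C3"])

# Select on both axes at once.
selected = df.loc[rows, (slice(None), "foo")]

# Assign a scalar to the same rows on a copy.
df2 = df.copy()
df2.loc[rows, :] = -10
print(selected.shape)
```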

ghost · 2014-02-08T08:23:52Z

Would it be possible not to take code I (anyone for that matter) spent significant effort writing,
slogging through unreadable, undocumented 2 year-old code, to implement a feature
repeatedly requested and punted on for over a year... and squash it into a commit
titled "add comments in indexing code"?

Granted, OSS contributors come from diverse backgrounds and a varied, richly-textured
mosaic of cultures. Norms and sensitivities differ. But, in my part of the mosaic that's called
"Urinating on someone's shoe".

jreback · 2014-02-08T14:58:20Z

@y-p that was YOUR comment

https://github.com/y-p/pandas/commit/120c4c513feb5318eacbcf1133c8cdadf4dd4bac

ghost · 2014-02-08T15:26:56Z

Thanks, much better.

jreback · 2014-02-09T00:34:19Z

cc @dragoljub

if you could review this PR, would be gr8

dragoljub · 2014-02-09T20:40:02Z

This is a fantastic feature to add and has been long overdue. Thanks to y-p for the coding effort and Jeff for docs, discussion, etc. 👍

All the features look good.

My major feedback would be to add an option to allow multilevel indexing to return the complete index depth (all levels) even if you select one specific level with only one value like this: df.loc['A1',(slice(None),'foo')]. Currently df.loc[('A0', 'B0', 'C1', 'D0')] returns the full index if I recall, which is generally what I want.

Many times I find myself relying on a global indexing scheme that I would like to preserve regardless of the selection I make. This is especially true when I apply multivariate functions on groupby's, since I'm used to having the full index depth.
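To illustrate the difference between dropping and keeping the selected level, here is a small sketch using the existing `drop_level` argument of `.xs` on a recent pandas. The frame and its labels are hypothetical; this is not the option requested above, only the closest existing behaviour.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"], ["C0", "C1"]])
df = pd.DataFrame({"x": np.arange(len(index))}, index=index)

# Default: the level that was selected on is removed from the result.
dropped = df.xs("A1", level=0)

# drop_level=False keeps the full index depth.
kept = df.xs("A1", level=0, drop_level=False)

print(dropped.index.nlevels, kept.index.nlevels)
```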

Quick comments on the first warning:
I think there is a missing closing parenthesis in both examples:

df.loc[(slice('A1','A3'),.....),:]
rather than this:
df.loc[(slice('A1','A3')),.....]

jreback · 2014-02-09T23:59:03Z

I think the examples I posted are slightly old,
since I modified the code a bit

we could support something like this

df.loc(drop_level=True)[......]

where drop_level will normally be False

jreback · 2014-02-12T15:26:50Z

any further comments on the API?....I think this is mergeable

cc @dragoljub, cc @nehalecky, cc @immerrr
@cpcloud @jorisvandenbossche @jtratner

jreback · 2014-02-12T15:27:31Z

cc @timcera

jreback · 2014-02-12T15:27:53Z

cc @aharoon123
cc @floux

immerrr · 2014-02-12T15:55:23Z

@jreback, ha, I now get what you meant when you said [0] of the tuple. That's my mistake: __getitem__ is defined as __getitem__(self, x) in the Python reference, so IndexSlice.__getitem__ should read def __getitem__(self, args): return args without the asterisk and [0].
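A minimal pure-Python sketch of that point: `__getitem__` receives everything between the brackets as a single argument, which is already a tuple when commas are used. The class name `_IndexSliceSketch` is hypothetical and is not the pandas implementation.

```python
class _IndexSliceSketch:
    """Hypothetical helper that returns whatever is written inside []."""

    def __getitem__(self, args):
        # Python packs comma-separated subscripts into one tuple and turns
        # a:b into slice(a, b), so no *args or [0] is needed.
        return args


idx = _IndexSliceSketch()
key = idx["A1":"A3", :, ["C1", "C3"]]
single = idx["A1"]
print(key)
```

A single subscript such as `idx["A1"]` comes back unchanged rather than wrapped in a tuple.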

jreback · 2014-02-12T16:01:27Z

@immerrr fixed up....

jorisvandenbossche · 2014-02-12T23:23:54Z

Some questions:

If you have a Series with a multi-index, would something like the following (without using the slice(None) but direct :, since it is directly in the []) work?

s['A1':'A3', :, ['C1','C3']]

I think this could be possible? Or does this make it too complex for users to know when they can use : and when they have to use slice(None)?

And you could also have something where you can specify on which axis you want to slice:

df.loc(axis=0)['A1':'A3', :, ['C1','C3']]

where this would be the same as

df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]

or as (with the IndexSlicer idea):

df.loc[idx['A1':'A3', :, ['C1','C3']],:]

Although I think we should go to one 'preferred' way of doing this (not saying that only one could work, but just choose one to use consistently in the docs).
I also like the IndexSlicer. The only 'problem' that came to my mind is that idx is at the moment sometimes used as a variable name for an index (but only a few times in the docs it seems, so not really a problem I think).
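A sketch checking that the three spellings discussed above select the same rows, assuming a recent pandas where the shortcut is exposed as `pd.IndexSlice` (the name used in the commit list of this PR) and `.loc` accepts an `axis` keyword. The single-column frame is a simplified, hypothetical stand-in for the example frame.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        [f"A{i}" for i in range(4)],
        ["B0", "B1"],
        [f"C{i}" for i in range(4)],
        ["D0", "D1"],
    ]
)
# from_product with sorted labels yields an already lexsorted index.
df = pd.DataFrame({"x": np.arange(len(index))}, index=index)

idx = pd.IndexSlice  # local alias, as suggested in the discussion

by_tuple = df.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
by_idx = df.loc[idx["A1":"A3", :, ["C1", "C3"]], :]
by_axis = df.loc(axis=0)["A1":"A3", :, ["C1", "C3"]]

print(len(by_tuple), by_tuple.equals(by_idx), by_tuple.equals(by_axis))
```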

jreback · 2014-02-12T23:35:04Z

@jorisvandenbossche

Your Series example will work (but I will add it as a test). This is the sort of ambiguity that
multi-indexes got into from the start; if you don't know that s is a Series you might think
it's a multi-level frame with multi-axis indexing. Blame python syntax.

yep...thinking about adding arguments to .loc and friends e.g.

df.loc(axis=None, drop_level=None)[.....] (e.g. an axis can be specified and drop_level is
the argument right now in .xs)...

I think those are the right way to do it, but more conveniences than anything
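One way such a callable indexer could be structured, as a pure-Python sketch: calling the indexer returns a configured copy, and `__getitem__` then interprets the key according to the stored axis. `LocSketch` is a hypothetical class for illustration, handles only integer axes, and only normalizes the key; it is not the pandas implementation.

```python
class LocSketch:
    """Hypothetical stand-in for a .loc-style indexer; not pandas code."""

    def __init__(self, axis=None):
        self.axis = axis

    def __call__(self, axis=None):
        # Calling the indexer returns a new, configured indexer, so the
        # keyword is supplied before the square brackets.
        return LocSketch(axis=axis)

    def __getitem__(self, key):
        if self.axis == 0:
            # The whole key addresses the rows; all columns are implied.
            return (key, slice(None))
        if self.axis == 1:
            # The whole key addresses the columns; all rows are implied.
            return (slice(None), key)
        # No axis given: the key must name both axes explicitly.
        rows, cols = key
        return (rows, cols)


loc = LocSketch()
explicit = loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
short = loc(axis=0)["A1":"A3", :, ["C1", "C3"]]
print(explicit == short)
```

With the axis fixed up front, the nested tuple and the trailing `, :` are no longer needed to disambiguate rows from columns.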

I defined only pd.IndexSlicer; it is up to the user to alias it, so for example

df.loc[pd.IndexSlicer['A1':'A3', :, ['C1','C3']],:]

or

from pandas import IndexSlicer as idx
df.loc[idx['A1':'A3', :, ['C1','C3']],:]

(this is how it is in the doc example). I think idx is TOO common.

jreback · 2014-02-13T00:44:09Z

The following are now possible

In [4]: df.loc(axis=0)['A1':'A3',:,['C1','C3']]
Out[4]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
         D1  125  124  127  126
A2 B0 C1 D0  137  136  139  138
         D1  141  140  143  142
      C3 D0  153  152  155  154
         D1  157  156  159  158
   B1 C1 D0  169  168  171  170
         D1  173  172  175  174
      C3 D0  185  184  187  186
         D1  189  188  191  190
A3 B0 C1 D0  201  200  203  202
         D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

In [5]: df.loc(axis='columns')[:,'foo']
Out[5]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C0 D0    0    2
         D1    4    6
      C1 D0    8   10
         D1   12   14
      C2 D0   16   18
         D1   20   22
      C3 D0   24   26
         D1   28   30
   B1 C0 D0   32   34
         D1   36   38
      C1 D0   40   42
         D1   44   46
      C2 D0   48   50
...

jreback · 2014-02-13T01:24:54Z

side issue.. @jorisvandenbossche was thinking that the 'indexing and selecting data' section is getting too long....split off multiindex into another section?

jorisvandenbossche · 2014-02-13T08:53:09Z

doc/source/indexing.rst

+
+   .. code-block:: python
+
+      df.loc[(slice('A1','A3'),.....,:]


a missing )? (also in the example below)

CLN: add comments in indexing code
CLN: comment out possibly stale kludge fix and wait for explosion
CLN: Mark if clause for handling of per-axis tuple indexing with loc
PERF: vectorize _spec_to_array_indices, for 3-4x speedup
PERF: remove no longer needed list conversion. 1.4x speedup

TST: better error messages when levels are not sorted with core/index/get_locs
ENH: add boolean indexer support on per_axis/per_level
BUG: handle a multi-level indexed series passed like with a nested tuple of selectors e.g. something like: s.loc['A1':'A3',:,['C1','C3']]

jreback added API Design labels Feb 8, 2014

jreback added this to the 0.14.0 milestone Feb 8, 2014

jreback mentioned this pull request Feb 9, 2014

CLN: implement xs in terms of loc #6249

Closed

jorisvandenbossche reviewed Feb 13, 2014

y-p and others added 8 commits February 13, 2014 08:37

CLN: move indexing loc changes to index.py

30eb6db

TST: tests for per_axis_per_level_getitem

bd2e2a1

ENH: add core/indexing.py/_getitem_nested_tuple to handle the nested_tuple cases for partial multi-indexing

ENH: allow core/index/_get_loc_level to deal with a slice indexer for…

7320263

… a particular level ENH: remove get_specs/specs_to_index -> replace with get_locs, to directly compute an indexer for a multi-level specification

DOC: v0.14.0 and indexing doc updates for mi slicing

65a9976

DOC: release notes and issues for mi_slicing

BUG: Raise a TypeError when trying to assign with a rhs of a multi-in…

1068a44

…dex of differeing levels (GH3738)

API: add in IndexSlice indexer shortcut

03284f3

ENH: make it possible to pass keyword argument to .loc

7d70710

ENH: allow the axis keyword to short-circuit indexing

jreback added a commit that referenced this pull request Feb 13, 2014

Merge pull request #6301 from jreback/mi_indexing

2a9e994

ENH: per axis and per level indexing (orig GH6134)

jreback merged commit 2a9e994 into pandas-dev:master Feb 13, 2014

jreback mentioned this pull request Feb 18, 2014

DOC: MultiIndex Indexing Using Slices #5280

Closed

immerrr mentioned this pull request Feb 22, 2014

API: allow the iloc indexer to run off the end and not raise IndexError (GH6296) #6299

Merged

shoyer mentioned this pull request Jun 18, 2014

API: support multiple indexers for .iloc with a MultiIndex #7490

Closed
