ENH: add per axis, per level indexing with tuples using loc #6134

ghost · 2014-01-27T23:12:15Z

Selection only. Did not check (nor care) about performance at this point.
Full Tests, vbenches and docs to follow in due course.

Please hold off on PEP8 comments or suggestions on how this could be made more complex.
I'd like to get something functional, solid and understandable in master fairly early in 0.14.

@jreback, I don't often wade this deep into the gore. Would welcome your review and suggestions.

Example of setting usage:
http://stackoverflow.com/questions/21539260/assign-to-multiindex-slice/21539479#21539479

xref: #5641, #3057, #4036, and others

In [1]: 
   ...: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 
   ...: ix =pd.MultiIndex.from_product([mklbl('A',5),mklbl('B',7),mklbl('C',4),mklbl('D',2)])
   ...: df=pd.DataFrame(range(len(ix.get_values())),index=ix)

In [5]: df
Out[5]: 
              0
A0 B0 C0 D0   0
         D1   1
      C1 D0   2
         D1   3
      C2 D0   4
         D1   5
      C3 D0   6
         D1   7
   B1 C0 D0   8
         D1   9
      C1 D0  10
         D1  11

In [2]: df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[2]: 
               0
A1 B0 C1 D0   58
         D1   59
      C3 D0   62
         D1   63
   B1 C1 D0   66
         D1   67
      C3 D0   70
         D1   71
   B2 C1 D0   74
         D1   75
      C3 D0   78
         D1   79
   B3 C1 D0   82
         D1   83
      C3 D0   86
         D1   87
   B4 C1 D0   90
         D1   91
      C3 D0   94
         D1   95
   B5 C1 D0   98
         D1   99
      C3 D0  102
         D1  103
   B6 C1 D0  106
         D1  107
      C3 D0  110
         D1  111
A2 B0 C1 D0  114
         D1  115
      C3 D0  118
         D1  119
   B1 C1 D0  122
         D1  123
      C3 D0  126
         D1  127
   B2 C1 D0  130
         D1  131
      C3 D0  134
         D1  135
             ...

[56 rows x 1 columns]

jreback · 2014-01-28T00:04:13Z

df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]

I think accept None as well as slice(None) would be nice (and prob not a big deal)

as a side issue, we can support/eliminate xs by doing this:

df.loc(level=4)[['C1:'C3']],:] e.g. this would pass additional arguments to the .loc constructors (namely level)

jreback · 2014-01-28T00:05:34Z

pandas/core/indexing.py

@@ -725,6 +732,7 @@ def _getitem_lowerdim(self, tup):
                    if len(new_key) == 1:
                        new_key, = new_key

+                # This is an elided recursive call to loc (always?)


no, that's why name is their, you see this with any multi-axis indexing, e.g.df.iloc[3:5,1:2]

jreback · 2014-01-28T00:13:35Z

pandas/core/indexing.py

+
+    return ranges
+
+def _spec_to_array_indices(ix, specs):


this seems like a special case of core/index.py/MultiIndex/_get_loc_level which is a big monster....

your routine nicely encapsulates things and is pretty...should maybe be a method of MultiIndex? (in the future maybe integrate with _get_loc_level, but I think that has lots of special cases

I saw some terrible things in the indexing code. It's practically unreadabele. I fear
that by plugging it in there, we'll remain stuck where we only get a soul brave enough
to improve indexing things once every 18 months. or you have to do it.

_get_loc_level returns a slice right? If I'm not wrong, it returns a slice relative to the
whole array, rather then per level, which I find easier to grok code wise,
and aides the lazy index generation later on. Not that we have a lazy consumer to plug it into.

Ok, how about a compromise. Maybe something like this:

class MultiIndexSlicer(object): def __init__(self, labels): self.labels = labels def get_indexer(self, key): # this basically calls your 2 functions # but keeps the specs internal # also, I think allows one to do _get_loc_level as a separate method call (when this gets refactored

indexer = MultiIndexSlicer(labels).get_indexer(key) return self.obj.iloc[indexer]

(could also pas obj and axis so this MIS object would have more state...

That looks fine. It's a style thing, but I prefer self-contained functions that make
good building blocks over monster objects.

ghost · 2014-01-28T00:13:43Z

I know the None convention is used elsewhere, but it seems awful strange to specify None,
when you're actually requesting everything.

jreback · 2014-01-28T00:15:04Z

actually, maybe you can just to and empty slice, e.g. df.loc[(:,:,:,['C1':'C3'])] but I suspect that doesn't work (maybe elipsis?)

ghost · 2014-01-28T00:23:33Z

You know it doesn't. Ellipsis could work (as the singleton, ... doesn't work inside a tuple either),
but it's not consistent with the rest of the indexing methods. If we did it uniformly as a refactor in the future?

Alternativelty. define something short in the pd namespace to be slice(None) and accept that everywhere.
Similar to np.nan. Not sure.

jreback · 2014-01-28T00:57:48Z

ok...I would be ok with

maybe pd.slice, pd.allslice, pd.islice

need to think of something good

jreback · 2014-01-28T00:58:56Z

But a big 👍 on this....I think sorely needed. also FYI, setting shouldn't be too difficult once getting is done.

jreback · 2014-01-28T00:59:29Z

cc @dragoljub as I know you are wanting this

ghost · 2014-01-28T00:59:56Z

Glad to hear it. I know there's the copy problem, but I'm not sure how you actually get past that.

jreback · 2014-01-28T01:09:00Z

Actually, once the indexer is found setting is a quite striaghtforward assignment of locations to array

kind of similar to

df.iloc[[0,1,3,4,10,12],column_indexer] = 2d numpy arrays here

which is already handled

I agree the copy problem is tricky.

dragoljub · 2014-01-28T01:53:29Z

YES! I do want this. This will make my multi-index DFs much more user friendly, I may be able to avoid having duplicate index and column data moving forward. 👍

Not to get sidetracked but why does the empty slice not work? 😕

jreback · 2014-01-28T02:05:37Z

the ':' syntax only works in the [ ]

which is a problem of we need to have multi axis and multi levels

the slice(None) syntax works fine to indicate give me a full slice for that level (iow don't exclude anything)

I think that is kind of cumbersome and I stead maybe something pd.allslice to represent

more of a syntax thing

ghost · 2014-01-28T02:07:29Z

If you're asking why you can't do df.loc[(foo,:, bar),'baz'], it's because the : token is not
a first class thing, but part of python's grammer (See the subscript rule),
and only recognised inside immediately [] when subscripting, not elsewhere (such as as an element in a tuple).

ghost · 2014-01-28T02:08:59Z

allslice? too spicy.

ghost · 2014-01-28T02:27:33Z

In [6]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 
   ...: ix =pd.MultiIndex.from_product([mklbl('A',25),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
   ...: df=pd.DataFrame(range(len(ix.get_values())),index=ix)
   ...: df=df.sort_index()
   ...: r=pd.core.indexing._tuple_to_mi_locs(df.index,((slice('A1','A3'),slice(None), ['C1','C3'])))
   ...: ixs=pd.core.indexing._spec_to_array_indices(df.index,r)
   ...: 
   ...: print "MultiIndex size: %d" % len(ix)
   ...: %timeit -n10 -r 10 df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
   ...: %timeit -n10 -r 10 df.iloc[ixs]

ix =pd.MultiIndex.from_product([mklbl('A',25),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 113400
10 loops, best of 10: 10.6 ms per loop
10 loops, best of 10: 9.76 ms per loop

ix =pd.MultiIndex.from_product([mklbl('A',45),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 204120
10 loops, best of 10: 13.4 ms per loop
10 loops, best of 10: 12.6 ms per loop

 ix =pd.MultiIndex.from_product([mklbl('A',85),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 385560
10 loops, best of 10: 16.6 ms per loop
10 loops, best of 10: 13.5 ms per loop

After vectorizing it's pretty evenly matched with passing iloc the indices directly.

ghost · 2014-01-28T03:06:05Z

pd.alls?

jreback · 2014-01-28T03:18:54Z

could work

dragoljub · 2014-01-28T04:36:52Z

Just having the option to do slice(None) or None is a great start.

Will this support indexing with a shorter tuple than the number of levels? That would be a nice feature to have.

df.loc[(level0,level1)] for a 5-level index will fill the remaining levels with slice(None).

maybe: pd.uncut, pd.whole

ghost · 2014-01-28T04:55:53Z

You can omit trailing levels right now, it'll work. The example shows this, look at the last, 'D', level.

hayd · 2014-01-28T22:59:52Z

This is really good, MI slicing atm is fiddly...

FYI The only toplevel constant atm is pd.NaT. I think pd.Alls would work be better, and less confusing with all.

I was going to (jokingly) suggest passing the function all (or maybe not) :s, but interestingly...

In [20]: df = pd.DataFrame([[1, 2], [3, 4]])

In [21]: df.loc[(1, all, 2),1]
Out[21]: 
1                           4
<built-in function all>   NaN
2                         NaN
Name: 1, dtype: float64

Users would have to be careful not to accidentally type in wrong thing here if we go with pd.something (I guess that's always the case).

dragoljub · 2014-01-28T23:07:31Z

What about leveraging the fact that Python is case sensitive and using pd.ALL It could be confusing but all caps will make it stand out df.loc[(1, ALL, 2),1]

Maybe accepting a string argument(s) like 'all' in place of a direct alias of slice(None)?

ghost · 2014-01-28T23:15:42Z

Let's leave that for now and pick it up in a subsequent PR. no clear winner.
afaic, @jreback can have the last word but it needs to be made consistent
across pandas.

ghost · 2014-02-04T11:44:22Z

on the allslice,alls, ALL, I suggest we punt and use a non-official _things_to_treat_as_slice_none = (,) with the only guarantee being that that it contains slice(None) by default.

@jtratner has used this trick in several places. I like it.

jreback · 2014-02-04T11:44:39Z

it moves a collection of functions into an object where they have a logical functionality, grouping them for ease of understanding. One of the functions is purely internal (even to the indexing routines) and thus just makes the indexing module even more complicated to understand.

This breaks with the pattern all of the functionality grouped according to what it does in an object. Furthermore you may need to delegate functionality for different index types and/or setting which you might need to keep some state.

I get your approach and what I am suggesting is almost a stylistic change, except to someoneone looking at the module will see a bunch of objects and then a bunch of not-short functions.

jreback · 2014-02-04T11:46:17Z

how would you actually write the all_slice_none??

are we against just using None?

ghost · 2014-02-04T11:53:11Z

That makes sense, basically we already agreed on this here. Grouping by object is fine (and true, consistent with the rest of pandas) and I like the idea of keeping the dispatch logic inside such a thing, that
only hold a clear dispatch logic (with comments). It then only invokes self-contained functions
that do exactly one thing in a readable manner. I like that a lot in fact.

re how to write all_slice_none:

if tuple: foo # means slice
elif list: foo # means discrete list
else isinstance(theThing, all_slice_none_is_any_of_these) # or is in.. etc'

I dislike None, yeah. And I don't think we need to make a call on the matter this way.

We don't really have a WE you know, we just try our best to...

ghost · 2014-02-04T11:54:28Z

The only issue with the slice object thing is that I don't want to hold this up for
a larger supposed refactor with an unknown arrival date. The change I'm fine with,
holding things for it, less so. If you put it in place, I'll play along, or we can refactor later,
which do you prefer?

jreback · 2014-02-04T11:58:45Z

u can refactor would be great

I think this warrants merging soon

as their are a number of things to test/fix after (setting, doc changes, tests with all index types ) - maybe open a new issue with a checklist for this as a master issue

let's open a new issue to defer the ALL change (or not change)

ghost · 2014-02-04T12:06:45Z

u can refactor would be great

I suggested refactroring later or waiting till you put it in place. Where did
me writing it come from? I'm not sure what you ultimately had in mind and wouldn't
want to write it anyway.
The logic is currently inside the index functions already, and pandas is full of pure functions
in util modules and cython files. So I'm willing to go along with you since you feel
strongly about it. Writing it? no.

We can stick with slice(None) only for now, fine.

jreback · 2014-02-04T12:09:44Z

ok np
I'll do a follow up then as soon as you merge

ghost · 2014-02-04T12:14:52Z

I'll add tests and a vbench, and ping you before merging. If you choose to
turn everything into objects later on, I can't stop you. But I really find self-contained
functions much better.
Even if it's inconsistent with the current indexing code (not really)... worth remembering
that the current indexing code is horrifying.

ghost · 2014-02-04T13:28:55Z

@jreback as discussed, closing this for you to pick up and rework as you see fit.

jreback · 2014-02-04T13:30:39Z

fine...go ahead and merge when you are ready with tests and such

jreback · 2014-02-04T13:38:46Z

oh...thought you were going to add tests/vbench...i can pick up after its in master

jreback · 2014-02-04T13:52:37Z

did you have any tests / vbench already done?

immerrr · 2014-02-11T13:24:37Z

@y-p, @jreback, I'm sorry for barging in without looking deeply into the back-story, but did you consider something like this?

In [1]: class IndexMaker(object):
    def __getitem__(self, *args):
        return args
   ...:     

In [2]: idx = IndexMaker()

In [3]: idx[:,:,'foo':'bar']
Out[3]: 
((slice(None, None, None),
  slice(None, None, None),
  slice('foo', 'bar', None)),)

jreback · 2014-02-11T13:28:05Z

@immerrr what does this buy you? It's not a problem creating the syntax at all

immerrr · 2014-02-11T14:08:22Z

@jreback, it seemed to me that the fact that colon-syntax wasn't directly supported in multilevel lookups forcing users to write something like df.loc[(slice(None), slice(None), slice('foo', 'bar'))] ignited a discussion of aliases for slice(None). The idx singleton from above allows to write it as df.loc[idx[:,:,'foo':'bar']] which IMHO is more idiomatic (colons as take-all slices as usual) and avoids the necessity to introduce and handle aliases in special way (reducing the burden of universality on indexing functions).

Given how easy is the implementation, it surprised me that I hadn't seen that come up in the discussion and I suspected I'd missed something in the back-story which ruled out this option. Hence the question.

jreback · 2014-02-12T00:20:12Z

@immerrr with a slight modification that would work (needs to return the [0] of the tuple, but that's just a detail.

The reason it won't work directly is since multi-axis selection needs to work, the following

df.loc['A':'B',:,,'foo'] can mean too many things...

however rewriting

df.loc[idx['A:'B],idx[:,:,'foo']] does make it nicer....

maybe do this as something like:

e.g.

idx = pd.IndexSlice
df.loc[idx['A:'B],idx[:,:,'foo']]

jreback · 2014-02-12T00:20:37Z

@cpcloud @jorisvandenbossche @dragoljub thoughts on this suggestion by @immerrr ?

dragoljub · 2014-02-12T01:35:32Z

It does looks compelling as a convince function to generate tuples of slices. 👍

My most common use case would probably be to specify fixed values for a few levels and just return all the other levels. The key of pd.All = slice(None) for me is that the index level order no longer matters.

Because multi-index is (for lack of a better word) a bunch of tuples under the hood, I like the idea of passing explicit tuples for each axis indexer. That being said if I need to start slicing and dicing my levels I could get used to @immerrr 's suggesion. 😄

Looking at them side by side I'm not sure which I prefer. Can we have both? 😬

df.loc['A', idx[:,:,'foo']]
vs
df.loc['A', (All, All, 'foo')]

jreback · 2014-02-12T01:43:00Z

jreback@33e3ba0

already added....I am calling it pd.IndexSlice.....(which you can then use directly or rename or whatever)...

jreback · 2014-02-12T01:44:43Z

@dragoljub FYI, I have put up some dev builds of master for windows....pretty easy now that its automated.....so http://pandas.pydata.org/pandas-build/dev/

cpcloud · 2014-02-12T02:17:53Z

I'm +1 on IndexSlice a la df.loc[idx[...], ...].

nehalecky · 2014-02-12T04:27:37Z

Wow all, really nice thread! I learned a lot reading it, thank you. I too am looking forward to such a feature for slicing into MultiIndexes like this. IndexSlice makes sense to me. +1

immerrr · 2014-02-12T08:18:14Z

@jreback, sure, I was aiming for a drop-in replacement for tuple of slices, but if you need that self parameter outside you can have it as well. In fact your suggestion has sparkled an idea that I'm going to post as a separate issue for discussion.

jreback · 2014-02-12T10:56:36Z

@dragoljub you are advocating for also allowing All or ALL as a null slice?

easy to create a pd.IndexAll that u can then alias

dragoljub · 2014-02-12T15:20:16Z

Yea that's fine. Let's stick with one good way to do it: pd.IndexSlice. I like that I can always build my one tuples if needed.

ghost self-assigned this Jan 27, 2014

jreback reviewed Jan 28, 2014
View reviewed changes

y-p added 4 commits January 28, 2014 02:10

CLN: add comments in indexing code

120c4c5

CLN: comment out possibly stale kludge fix and wait for explosion

6923ec1

CLN: Mark if clause for handling of per-axis tuple indexing with loc

7070dd1

ENH: support per-axis, per-level indexing with loc[]

e375684

jreback reviewed Jan 28, 2014
View reviewed changes

PERF: vectorize _spec_to_array_indices, for 3-4x speedup

6635b26

PERF: remove no longer needed list conversion. 1.4x speedup

6bee17a

ghost closed this Feb 4, 2014

ghost removed their assignment Feb 7, 2014

jreback mentioned this pull request Feb 8, 2014

ENH: per axis and per level indexing (orig GH6134) #6301

Merged

4 tasks

immerrr mentioned this pull request Feb 12, 2014

ENH: fastpath indexer API proposal (draft) #6328

Closed

This pull request was closed.

ENH: add per axis, per level indexing with tuples using loc #6134

ENH: add per axis, per level indexing with tuples using loc #6134

Conversation

ghost commented Jan 27, 2014

jreback commented Jan 28, 2014

jreback Jan 28, 2014

Choose a reason for hiding this comment

ghost Jan 28, 2014

Choose a reason for hiding this comment

jreback Jan 28, 2014

Choose a reason for hiding this comment

ghost Jan 28, 2014

Choose a reason for hiding this comment

jreback Jan 28, 2014

Choose a reason for hiding this comment

ghost Jan 28, 2014

Choose a reason for hiding this comment

ghost commented Jan 28, 2014

jreback commented Jan 28, 2014

ghost commented Jan 28, 2014

jreback commented Jan 28, 2014

jreback commented Jan 28, 2014

jreback commented Jan 28, 2014

ghost commented Jan 28, 2014

jreback commented Jan 28, 2014

dragoljub commented Jan 28, 2014

jreback commented Jan 28, 2014

ghost commented Jan 28, 2014

ghost commented Jan 28, 2014

ghost commented Jan 28, 2014

ghost commented Jan 28, 2014

jreback commented Jan 28, 2014

dragoljub commented Jan 28, 2014

ghost commented Jan 28, 2014

hayd commented Jan 28, 2014

dragoljub commented Jan 28, 2014

ghost commented Jan 28, 2014

ghost commented Feb 4, 2014

jreback commented Feb 4, 2014

jreback commented Feb 4, 2014

ghost commented Feb 4, 2014

ghost commented Feb 4, 2014

jreback commented Feb 4, 2014

ghost commented Feb 4, 2014

jreback commented Feb 4, 2014

ghost commented Feb 4, 2014

ghost commented Feb 4, 2014

jreback commented Feb 4, 2014

jreback commented Feb 4, 2014

jreback commented Feb 4, 2014

immerrr commented Feb 11, 2014

jreback commented Feb 11, 2014

immerrr commented Feb 11, 2014

jreback commented Feb 12, 2014

jreback commented Feb 12, 2014

dragoljub commented Feb 12, 2014

jreback commented Feb 12, 2014

jreback commented Feb 12, 2014

cpcloud commented Feb 12, 2014

nehalecky commented Feb 12, 2014

immerrr commented Feb 12, 2014

jreback commented Feb 12, 2014

dragoljub commented Feb 12, 2014