Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add per axis, per level indexing with tuples using loc #6134

Closed
wants to merge 6 commits into from
Closed

ENH: add per axis, per level indexing with tuples using loc #6134

wants to merge 6 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jan 27, 2014

Selection only. Did not check (nor care) about performance at this point.
Full Tests, vbenches and docs to follow in due course.

Please hold off on PEP8 comments or suggestions on how this could be made more complex.
I'd like to get something functional, solid and understandable in master fairly early in 0.14.

@jreback, I don't often wade this deep into the gore. Would welcome your review and suggestions.

Example of setting usage:
http://stackoverflow.com/questions/21539260/assign-to-multiindex-slice/21539479#21539479

xref: #5641, #3057, #4036, and others

In [1]: 
   ...: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 
   ...: ix =pd.MultiIndex.from_product([mklbl('A',5),mklbl('B',7),mklbl('C',4),mklbl('D',2)])
   ...: df=pd.DataFrame(range(len(ix.get_values())),index=ix)

In [5]: df
Out[5]: 
              0
A0 B0 C0 D0   0
         D1   1
      C1 D0   2
         D1   3
      C2 D0   4
         D1   5
      C3 D0   6
         D1   7
   B1 C0 D0   8
         D1   9
      C1 D0  10
         D1  11
In [2]: df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[2]: 
               0
A1 B0 C1 D0   58
         D1   59
      C3 D0   62
         D1   63
   B1 C1 D0   66
         D1   67
      C3 D0   70
         D1   71
   B2 C1 D0   74
         D1   75
      C3 D0   78
         D1   79
   B3 C1 D0   82
         D1   83
      C3 D0   86
         D1   87
   B4 C1 D0   90
         D1   91
      C3 D0   94
         D1   95
   B5 C1 D0   98
         D1   99
      C3 D0  102
         D1  103
   B6 C1 D0  106
         D1  107
      C3 D0  110
         D1  111
A2 B0 C1 D0  114
         D1  115
      C3 D0  118
         D1  119
   B1 C1 D0  122
         D1  123
      C3 D0  126
         D1  127
   B2 C1 D0  130
         D1  131
      C3 D0  134
         D1  135
             ...

[56 rows x 1 columns]

@ghost ghost self-assigned this Jan 27, 2014
@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]

I think accept None as well as slice(None) would be nice (and prob not a big deal)

as a side issue, we can support/eliminate xs by doing this:

df.loc(level=4)[['C1:'C3']],:] e.g. this would pass additional arguments to the .loc constructors (namely level)

@@ -725,6 +732,7 @@ def _getitem_lowerdim(self, tup):
if len(new_key) == 1:
new_key, = new_key

# This is an elided recursive call to loc (always?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that's why name is their, you see this with any multi-axis indexing, e.g.df.iloc[3:5,1:2]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


return ranges

def _spec_to_array_indices(ix, specs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a special case of core/index.py/MultiIndex/_get_loc_level which is a big monster....

your routine nicely encapsulates things and is pretty...should maybe be a method of MultiIndex? (in the future maybe integrate with _get_loc_level, but I think that has lots of special cases

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw some terrible things in the indexing code. It's practically unreadabele. I fear
that by plugging it in there, we'll remain stuck where we only get a soul brave enough
to improve indexing things once every 18 months. or you have to do it.

_get_loc_level returns a slice right? If I'm not wrong, it returns a slice relative to the
whole array, rather then per level, which I find easier to grok code wise,
and aides the lazy index generation later on. Not that we have a lazy consumer to plug it into.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, how about a compromise. Maybe something like this:

class MultiIndexSlicer(object):

    def __init__(self, labels):
              self.labels = labels

     def get_indexer(self, key):
             # this basically calls your 2 functions
             # but keeps the specs internal
             # also, I think allows one to do _get_loc_level as a separate method call (when this gets refactored
indexer = MultiIndexSlicer(labels).get_indexer(key)
return self.obj.iloc[indexer]

(could also pas obj and axis so this MIS object would have more state...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks fine. It's a style thing, but I prefer self-contained functions that make
good building blocks over monster objects.

@ghost
Copy link
Author

ghost commented Jan 28, 2014

I know the None convention is used elsewhere, but it seems awful strange to specify None,
when you're actually requesting everything.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

actually, maybe you can just to and empty slice, e.g. df.loc[(:,:,:,['C1':'C3'])] but I suspect that doesn't work (maybe elipsis?)

@ghost
Copy link
Author

ghost commented Jan 28, 2014

You know it doesn't. Ellipsis could work (as the singleton, ... doesn't work inside a tuple either),
but it's not consistent with the rest of the indexing methods. If we did it uniformly as a refactor in the future?

Alternativelty. define something short in the pd namespace to be slice(None) and accept that everywhere.
Similar to np.nan. Not sure.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

ok...I would be ok with

maybe pd.slice, pd.allslice, pd.islice

need to think of something good

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

But a big 👍 on this....I think sorely needed. also FYI, setting shouldn't be too difficult once getting is done.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

cc @dragoljub as I know you are wanting this

@ghost
Copy link
Author

ghost commented Jan 28, 2014

Glad to hear it. I know there's the copy problem, but I'm not sure how you actually get past that.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

Actually, once the indexer is found setting is a quite striaghtforward assignment of locations to array

kind of similar to

df.iloc[[0,1,3,4,10,12],column_indexer] = 2d numpy arrays here

which is already handled

I agree the copy problem is tricky.

@dragoljub
Copy link

YES! I do want this. This will make my multi-index DFs much more user friendly, I may be able to avoid having duplicate index and column data moving forward. 👍

Not to get sidetracked but why does the empty slice not work? 😕

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

the ':' syntax only works in the [ ]

which is a problem of we need to have multi axis and multi levels

the slice(None) syntax works fine to indicate give me a full slice for that level (iow don't exclude anything)

I think that is kind of cumbersome and I stead maybe something pd.allslice to represent

more of a syntax thing

@ghost
Copy link
Author

ghost commented Jan 28, 2014

If you're asking why you can't do df.loc[(foo,:, bar),'baz'], it's because the : token is not
a first class thing, but part of python's grammer (See the subscript rule),
and only recognised inside immediately [] when subscripting, not elsewhere (such as as an element in a tuple).

@ghost
Copy link
Author

ghost commented Jan 28, 2014

allslice? too spicy.

@ghost
Copy link
Author

ghost commented Jan 28, 2014

In [6]: def mklbl(prefix,n):
   ...:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ...: 
   ...: ix =pd.MultiIndex.from_product([mklbl('A',25),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
   ...: df=pd.DataFrame(range(len(ix.get_values())),index=ix)
   ...: df=df.sort_index()
   ...: r=pd.core.indexing._tuple_to_mi_locs(df.index,((slice('A1','A3'),slice(None), ['C1','C3'])))
   ...: ixs=pd.core.indexing._spec_to_array_indices(df.index,r)
   ...: 
   ...: print "MultiIndex size: %d" % len(ix)
   ...: %timeit -n10 -r 10 df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
   ...: %timeit -n10 -r 10 df.iloc[ixs]

ix =pd.MultiIndex.from_product([mklbl('A',25),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 113400
10 loops, best of 10: 10.6 ms per loop
10 loops, best of 10: 9.76 ms per loop

ix =pd.MultiIndex.from_product([mklbl('A',45),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 204120
10 loops, best of 10: 13.4 ms per loop
10 loops, best of 10: 12.6 ms per loop

 ix =pd.MultiIndex.from_product([mklbl('A',85),mklbl('B',27),mklbl('C',14),mklbl('D',12)])
MultiIndex size: 385560
10 loops, best of 10: 16.6 ms per loop
10 loops, best of 10: 13.5 ms per loop

After vectorizing it's pretty evenly matched with passing iloc the indices directly.

@ghost
Copy link
Author

ghost commented Jan 28, 2014

pd.alls?

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

could work

@dragoljub
Copy link

Just having the option to do slice(None) or None is a great start.

Will this support indexing with a shorter tuple than the number of levels? That would be a nice feature to have.

df.loc[(level0,level1)] for a 5-level index will fill the remaining levels with slice(None).

maybe: pd.uncut, pd.whole

@ghost
Copy link
Author

ghost commented Jan 28, 2014

You can omit trailing levels right now, it'll work. The example shows this, look at the last, 'D', level.

@hayd
Copy link
Contributor

hayd commented Jan 28, 2014

This is really good, MI slicing atm is fiddly...

FYI The only toplevel constant atm is pd.NaT. I think pd.Alls would work be better, and less confusing with all.

I was going to (jokingly) suggest passing the function all (or maybe not) :s, but interestingly...

In [20]: df = pd.DataFrame([[1, 2], [3, 4]])

In [21]: df.loc[(1, all, 2),1]
Out[21]: 
1                           4
<built-in function all>   NaN
2                         NaN
Name: 1, dtype: float64

Users would have to be careful not to accidentally type in wrong thing here if we go with pd.something (I guess that's always the case).

@dragoljub
Copy link

What about leveraging the fact that Python is case sensitive and using pd.ALL It could be confusing but all caps will make it stand out df.loc[(1, ALL, 2),1]

Maybe accepting a string argument(s) like 'all' in place of a direct alias of slice(None)?

@ghost
Copy link
Author

ghost commented Jan 28, 2014

Let's leave that for now and pick it up in a subsequent PR. no clear winner.
afaic, @jreback can have the last word but it needs to be made consistent
across pandas.

@ghost
Copy link
Author

ghost commented Feb 4, 2014

on the allslice,alls, ALL, I suggest we punt and use a non-official _things_to_treat_as_slice_none = (,) with the only guarantee being that that it contains slice(None) by default.

@jtratner has used this trick in several places. I like it.

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

it moves a collection of functions into an object where they have a logical functionality, grouping them for ease of understanding. One of the functions is purely internal (even to the indexing routines) and thus just makes the indexing module even more complicated to understand.

This breaks with the pattern all of the functionality grouped according to what it does in an object. Furthermore you may need to delegate functionality for different index types and/or setting which you might need to keep some state.

I get your approach and what I am suggesting is almost a stylistic change, except to someoneone looking at the module will see a bunch of objects and then a bunch of not-short functions.

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

how would you actually write the all_slice_none??

are we against just using None?

@ghost
Copy link
Author

ghost commented Feb 4, 2014

That makes sense, basically we already agreed on this here. Grouping by object is fine (and true, consistent with the rest of pandas) and I like the idea of keeping the dispatch logic inside such a thing, that
only hold a clear dispatch logic (with comments). It then only invokes self-contained functions
that do exactly one thing in a readable manner. I like that a lot in fact.

re how to write all_slice_none:

if tuple: foo # means slice
elif list: foo # means discrete list
else isinstance(theThing, all_slice_none_is_any_of_these) # or is in.. etc'

I dislike None, yeah. And I don't think we need to make a call on the matter this way.

We don't really have a WE you know, we just try our best to...

@ghost
Copy link
Author

ghost commented Feb 4, 2014

The only issue with the slice object thing is that I don't want to hold this up for
a larger supposed refactor with an unknown arrival date. The change I'm fine with,
holding things for it, less so. If you put it in place, I'll play along, or we can refactor later,
which do you prefer?

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

u can refactor would be great

I think this warrants merging soon

as their are a number of things to test/fix after (setting, doc changes, tests with all index types ) - maybe open a new issue with a checklist for this as a master issue

let's open a new issue to defer the ALL change (or not change)

@ghost
Copy link
Author

ghost commented Feb 4, 2014

u can refactor would be great

I suggested refactroring later or waiting till you put it in place. Where did
me writing it come from? I'm not sure what you ultimately had in mind and wouldn't
want to write it anyway.
The logic is currently inside the index functions already, and pandas is full of pure functions
in util modules and cython files. So I'm willing to go along with you since you feel
strongly about it. Writing it? no.

We can stick with slice(None) only for now, fine.

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

ok np
I'll do a follow up then as soon as you merge

@ghost
Copy link
Author

ghost commented Feb 4, 2014

I'll add tests and a vbench, and ping you before merging. If you choose to
turn everything into objects later on, I can't stop you. But I really find self-contained
functions much better.
Even if it's inconsistent with the current indexing code (not really)... worth remembering
that the current indexing code is horrifying.

@ghost
Copy link
Author

ghost commented Feb 4, 2014

@jreback as discussed, closing this for you to pick up and rework as you see fit.

@ghost ghost closed this Feb 4, 2014
@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

fine...go ahead and merge when you are ready with tests and such

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

oh...thought you were going to add tests/vbench...i can pick up after its in master

@jreback
Copy link
Contributor

jreback commented Feb 4, 2014

did you have any tests / vbench already done?

@ghost ghost removed their assignment Feb 7, 2014
@immerrr
Copy link
Contributor

immerrr commented Feb 11, 2014

@y-p, @jreback, I'm sorry for barging in without looking deeply into the back-story, but did you consider something like this?

In [1]: class IndexMaker(object):
    def __getitem__(self, *args):
        return args
   ...:     

In [2]: idx = IndexMaker()

In [3]: idx[:,:,'foo':'bar']
Out[3]: 
((slice(None, None, None),
  slice(None, None, None),
  slice('foo', 'bar', None)),)

@jreback
Copy link
Contributor

jreback commented Feb 11, 2014

@immerrr what does this buy you? It's not a problem creating the syntax at all

@immerrr
Copy link
Contributor

immerrr commented Feb 11, 2014

@jreback, it seemed to me that the fact that colon-syntax wasn't directly supported in multilevel lookups forcing users to write something like df.loc[(slice(None), slice(None), slice('foo', 'bar'))] ignited a discussion of aliases for slice(None). The idx singleton from above allows to write it as df.loc[idx[:,:,'foo':'bar']] which IMHO is more idiomatic (colons as take-all slices as usual) and avoids the necessity to introduce and handle aliases in special way (reducing the burden of universality on indexing functions).

Given how easy is the implementation, it surprised me that I hadn't seen that come up in the discussion and I suspected I'd missed something in the back-story which ruled out this option. Hence the question.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2014

@immerrr with a slight modification that would work (needs to return the [0] of the tuple, but that's just a detail.

The reason it won't work directly is since multi-axis selection needs to work, the following

df.loc['A':'B',:,,'foo'] can mean too many things...

however rewriting

df.loc[idx['A:'B],idx[:,:,'foo']] does make it nicer....

maybe do this as something like:

e.g.

idx = pd.IndexSlice
df.loc[idx['A:'B],idx[:,:,'foo']]

@jreback
Copy link
Contributor

jreback commented Feb 12, 2014

@cpcloud @jorisvandenbossche @dragoljub thoughts on this suggestion by @immerrr ?

@dragoljub
Copy link

It does looks compelling as a convince function to generate tuples of slices. 👍

My most common use case would probably be to specify fixed values for a few levels and just return all the other levels. The key of pd.All = slice(None) for me is that the index level order no longer matters.

Because multi-index is (for lack of a better word) a bunch of tuples under the hood, I like the idea of passing explicit tuples for each axis indexer. That being said if I need to start slicing and dicing my levels I could get used to @immerrr 's suggesion. 😄

Looking at them side by side I'm not sure which I prefer. Can we have both? 😬

df.loc['A', idx[:,:,'foo']]
vs
df.loc['A', (All, All, 'foo')]

@jreback
Copy link
Contributor

jreback commented Feb 12, 2014

jreback@33e3ba0

already added....I am calling it pd.IndexSlice.....(which you can then use directly or rename or whatever)...

@jreback
Copy link
Contributor

jreback commented Feb 12, 2014

@dragoljub FYI, I have put up some dev builds of master for windows....pretty easy now that its automated.....so http://pandas.pydata.org/pandas-build/dev/

@cpcloud
Copy link
Member

cpcloud commented Feb 12, 2014

I'm +1 on IndexSlice a la df.loc[idx[...], ...].

@nehalecky
Copy link
Contributor

Wow all, really nice thread! I learned a lot reading it, thank you. I too am looking forward to such a feature for slicing into MultiIndexes like this. IndexSlice makes sense to me. +1

@immerrr
Copy link
Contributor

immerrr commented Feb 12, 2014

@jreback, sure, I was aiming for a drop-in replacement for tuple of slices, but if you need that self parameter outside you can have it as well. In fact your suggestion has sparkled an idea that I'm going to post as a separate issue for discussion.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2014

@dragoljub you are advocating for also allowing All or ALL as a null slice?

easy to create a pd.IndexAll that u can then alias

@dragoljub
Copy link

Yea that's fine. Let's stick with one good way to do it: pd.IndexSlice. I like that I can always build my one tuples if needed.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants