# iloc with boolean mask #3631

Closed
opened this issue May 17, 2013 · 42 comments · Fixed by #3635

Contributor

### hayd commented May 17, 2013

Currently, when masking by boolean vectors, it doesn't matter which syntax you use:

```
df[mask]
df.iloc[mask]
df.loc[mask]
```

are all equivalent. Should `df.iloc[mask]` mask by position? (This makes sense if the mask has an integer index.) This SO question.
Contributor

### jreback commented May 17, 2013

Normally the generated masks have the same index as what you are indexing, e.g. in your example in the SO question. I think `.iloc` could/should do this; it does make sense. There is an alignment step for the mask in `.where`.
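
To make the alignment point concrete, here is a minimal sketch (assuming a recent pandas; the frame and mask are made up for illustration, not taken from the SO question) of how a boolean Series is matched on labels by `.loc`, while a plain array is matched by position:

```python
import pandas as pd

# Illustrative data, not from the SO question.
df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))

# Boolean Series whose labels are in reverse order relative to df.
mask = pd.Series([True, True, False, False, False], index=list("EDCBA"))

# A labeled mask is aligned on labels first, so the True entries
# attached to "E" and "D" select rows D and E.
by_label = df.loc[mask]

# The bare values carry no labels, so they apply by position
# and select the first two rows instead.
by_position = df.loc[mask.to_numpy()]

print(list(by_label.index))
print(list(by_position.index))
```

The same values therefore pick different rows depending on whether the labels travel with the mask.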
Contributor

### snth commented May 17, 2013

Here's an example to summarise the current behaviours:

```
import numpy as np
import pandas as pd

locs = np.arange(4)
nums = 2**locs
reps = map(bin, nums)
df = pd.DataFrame({'locs':locs, 'nums':nums}, reps)
print df
for idx in [None, 'index', 'locs']:
    mask = (df.nums>2).values
    if idx:
        mask = pd.Series(mask, list(reversed(getattr(df, idx))))
    for method in ['', '.loc', '.iloc']:
        try:
            if method:
                accessor = getattr(df, method[1:])
            else:
                accessor = df
            ans = bin(accessor[mask]['nums'].sum())
        except Exception, e:
            ans = str(e)
        print "{:>5s}: df{}[mask].sum()=={}".format(idx, method, ans)
```

with output

```
        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8
 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==0b11
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==Unalignable boolean Series key provided
```

If I'm understanding your discussion correctly then in the last line the output should also be 0b11.
Contributor

### jreback commented May 17, 2013

yes...that looks right. If you happen to think of a simple example where you would actually use this, pls post it. Keep in mind that masks are the same length as what you are indexing, so I don't think there is ever ambiguity, but I could be wrong.
Contributor

### snth commented May 17, 2013

Fair enough, I can't actually think of anything. It just seems that by symmetry the behaviour should be there. Similarly, it doesn't seem quite right that in the sixth result line of my example above (`index: df.iloc[mask]`), `df.iloc[mask]` is actually realigning the mask based on mask.index rather than throwing an error.

Given these observations, I would probably vote for the following behaviour:

- `df[mask]` should only work with .values in the order they're given and do no realignment, perhaps for performance reasons.
- `df.loc[mask]` should work as it currently does and realign based on mask.index if present. If mask.index is of the wrong type it should throw an error.
- `df.iloc[mask]` should realign based on mask.index if present and of integer type. Otherwise it should throw an error.

It's true that I haven't thought about this for more than 10 minutes though, so if there are many use cases where this causes a problem then never mind.
Contributor Author

### hayd commented May 17, 2013

I think it's possible there could be an ambiguity, if the index is in a different order (e.g. was taken from somewhere else where it may well mean the location rather than the label). Also, I'm given a `/core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index.` somewhere along the way when messing with integer index booleans. Correct me if I'm wrong, but isn't "realign based on mask.index" different from location?
Contributor

### jreback commented May 17, 2013

Still thinking about this, but I think a nice feature of boolean masking is that since we don't ever have an ambiguity whether it's label or position based (as the mask must be the same length as what you are indexing), it can be used in either `.loc` or `.iloc`. I don't think restricting this is a good idea. The basic question is: do we effectively drop the index (and make it not matter) when it's the right length? E.g. should `df[mask_with_index]` and `df[mask_ndarray]` be the same, or should the first align? (For now I retract my earlier statement: I don't think there is an alignment on the index itself.) But this may depend, e.g. `iloc/loc` don't align, but I think `[]` might....
Contributor

### jreback commented May 17, 2013

@hayd you make a good point, but to my knowledge that is the issue with an unlabeled index: the user has to make sure it's in the right order, pandas cannot help (but in the case we are talking about it could help, by making sure the index is aligned)
Contributor

### snth commented May 17, 2013

@jreback With regard to the alignment, see my example above. I threw in the reversed(...) to see whether it realigns based on the labels or not. The results above show that [], .loc and .iloc all perform an alignment step based on the labels when these are in a different order. Therefore I think the fact that the length is the same is lulling you into a false sense of security.

Apologies if my example isn't clear. I thought the binary representation was a nice way of concisely summarising which items were selected or not. The bottom line is that the results differ between ndarray and pd.Series because in the pd.Series case the .index is used to do an alignment step first.

Also there's a mistake in my output because it should read:

```
df[mask]['nums'].sum()==...
```
Contributor

### jreback commented May 17, 2013

@snth you are right....I was eye-balling the code....as it's a fairly tricky path.

OK....so the bottom line is whether to make `iloc` align on the labels or position (for a boolean mask), but that itself is somewhat of an issue: it is quite likely that the index is not integer-based, so then `iloc` should throw an error.

So from the original SO question, this should then raise?

```
In [1]: df = pd.DataFrame(range(5), list('ABCDE'), columns=['a'])

In [2]: mask = (df.a%2 == 0)

In [3]: mask
Out[3]:
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [4]: df.iloc[mask]
Out[4]:
   a
A  0
C  2
E  4
```

What about this?

```
In [5]: mask.nonzero()
Out[5]: (array([0, 2, 4]),)

In [6]: mask.nonzero()[0]
Out[6]: array([0, 2, 4])

In [7]: df.iloc[mask.nonzero()[0]]
Out[7]:
   a
A  0
C  2
E  4

In [8]: df.iloc[Series(mask.nonzero()[0])]
Out[8]:
   a
A  0
C  2
E  4

In [9]: Series(mask.nonzero()[0])
Out[9]:
0    0
1    2
2    4
dtype: int64
```

I think there is NO alignment happening; instead it's using the values to actually index.
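
As far as I know, `Series.nonzero` is no longer available in recent pandas releases. A rough equivalent of the position-based part of this session, assuming NumPy's `flatnonzero` applied to the mask's underlying array, could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df["a"] % 2 == 0  # boolean Series labeled A..E

# Integer positions of the True entries; the mask's labels are discarded here.
positions = np.flatnonzero(mask.to_numpy())

# iloc indexes with the position values themselves, so no alignment is involved.
selected = df.iloc[positions]

print(positions)
print(selected)
```

This keeps the "use the values to index" behaviour explicit, without relying on how `iloc` treats a labeled key.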
Contributor Author

### hayd commented May 17, 2013

I think iloc should throw an error if it's not integer based, it should definitely use position. These integers needn't be in order, just like the labels needn't be in order. Putting an array to `iloc` should do what it does now, but a boolean Series is a different beast:

```
msk = pd.Series([True, True, True, False, False])
df.iloc[msk] == df.iloc[0:3]
```

?
Contributor

### snth commented May 17, 2013

@jreback I agree that in your first example, by my reasoning, In [4] should raise an Exception rather than realign based on labels within .iloc. I don't understand your second example. There's no boolean indexing involved there and they just seem to be examples of .iloc. Is that what's happening in the source code?
Contributor

### jreback commented May 17, 2013

 @snth I was trying to have it index with a Series that had a different index, (that is essentially what boolean masking does)
Contributor Author

### hayd commented May 17, 2013

Ok, if you had the one above, there is an easy workaround (with arrays):

```
msk = pd.Series([True, True, True, False, False])
df.iloc[msk.values]  # instead of df.iloc[msk]
```

But if your index was not in order:

```
msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])
df.iloc[msk] == df[1:4]
```

?
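
For the out-of-order case, one way to spell out the "by position" reading explicitly is to sort the mask by its integer index and then drop the labels. This is only a sketch of that interpretation, assuming the mask's index is a permutation of `0..len(df)-1`; it is not something `iloc` does for you, and the frame below is the `ABCDE` one used elsewhere in this thread:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])

# Treat the mask's integer labels as positions:
# order the mask by them, then keep only the raw boolean values.
positional = msk.sort_index().to_numpy()

# With the labels gone, iloc applies the mask purely by position.
result = df.iloc[positional]

print(result)
```

Passing `msk.values` directly would instead apply the mask in its stored order, which selects the first three rows.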
Contributor

### snth commented May 17, 2013

@hayd In terms of workarounds you could probably do

```
df.iloc[msk.index][msk.values]
```
Contributor

### jreback commented May 17, 2013

Nothing else broke...I'll put up the PR in a minute....pls try out

```
In [3]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [4]: mask = (df.a%2 == 0)

In [5]: df.iloc[mask]
ValueError: iLocation based boolean indexing can only have an integer index [string] was inferred
```
Contributor

### jreback commented May 17, 2013

 @hayd I think if the user tries what you suggest (with an index not in order), they should be shot :)
Contributor Author

### hayd commented May 17, 2013

ha! Well, I'm sulking! Ignoring the index just seems a little dodge... I think we should either:

- raise whatever if you pass in a (boolean) Series to `iloc` (maybe with a NotImplementedError :p).
- implement analogous behaviour to `[]`/`loc` \*

Surely `iloc` (for boolean masking) is only for integer location, so if mask.index isn't integer they are calling the wrong thing... and we shouldn't let them. Otherwise it's just salt for `df.iloc[msk.values]`, and if it is, let people do that.

\*I'm not convinced the (granted, obtuse) example is impossible to envisage.
Contributor

### jreback commented May 17, 2013

 @hayd does the PR not do the first? (e.g. raises)
Contributor Author

### hayd commented May 17, 2013

No... at least I didn't think so (from your pr):

```
df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))
msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:1999: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]:
   a
2  2
1  3

In [4]: df.iloc[msk]  # should raise
Out[4]:
   a
2  2
1  3

In [5]: df.iloc[1:3]  # or should be this
Out[5]:
   a
3  1
2  2
```

At the moment `iloc` isn't getting by location, but rather by label (and that's true in the other examples, just it was getting the same results).. I think it should just be disabled for boolean indexing (at least for now).

? hides
Contributor

### jreback commented May 17, 2013

 hmm...I see what you mean, so essentially eliminate boolean masking from `iloc` entirely?
Contributor Author

### hayd commented May 17, 2013

 Yeah... perhaps an optimistic `NotImplementedError`. :p
Contributor

### jreback commented May 17, 2013

 hahha...ok will fix
Contributor

### jreback commented May 17, 2013

@snth output from your script with my PR

```
        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8
 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
/mnt/home/jreback/pandas/pandas/core/frame.py:2001: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask
```
Contributor

### jreback commented May 17, 2013

@hayd @snth I updated the release notes / v0.11.1 whatsnew. pls take a look and see if they tell what has changed and are not too confusing...thxs
Contributor Author

### hayd commented May 17, 2013

 I think the notes look clear and understandable. I'm happy. :)
Contributor

### jreback commented May 17, 2013

 any other cases you think?
Contributor

### jreback commented May 17, 2013

 will merge maybe in a day or 2....if anyone thinks of anything
Contributor

### jreback commented May 17, 2013

 @y-p any thoughts?

### ghost commented May 18, 2013

Some twists and turns in the discussion, not sure I got it all. Here's my take on the discussion above, does it match the PR?

1. When given a labeled indexer, pandas implicitly aligns. That's the rule. If the user wants the indexing to behave as if he passed in an array, he should pass in an array.
2. Then, wrt what should we align a bool indexer passed in to iloc? Since `.loc` can always be used, `.iloc` obviously shouldn't duplicate the behaviour by aligning wrt the underlying frame labels. Hence, it should raise when passed a labeled indexer with a non-integer index.
3. An indexer with integer labels given to iloc should be realigned (re 1.); since we've established it shouldn't do that wrt labels, the only alternative is wrt position. Since that doesn't ring like a very common use case, raising `NotImplementedError` is perfectly fine until there's a demand for it.
Contributor Author

### hayd commented May 18, 2013

@y-p I think that's an excellent summary. The only thing is that atm (in Jeff's current PR) it raises a `ValueError`, but maybe `NotImplementedError` ("one day") better describes it.
Contributor

### jreback commented May 18, 2013

So I disambiguated these 2 cases, which I think corresponds to @y-p's (and @hayd's) points

```
In [6]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [7]: mask = (df.a%2 == 0)

In [8]: df
Out[8]:
   a
A  0
B  1
C  2
D  3
E  4

In [9]: mask
Out[9]:
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [10]: df.iloc[mask]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

In [11]: mask.index=range(len(mask))

In [12]: mask
Out[12]:
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool

In [13]: df.iloc[mask]
NotImplementedError: iLocation based boolean indexing on an integer type is not available
```
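
The two cases can also be checked side by side with a small script. This is a sketch assuming a pandas version that still raises these two exception types for labeled boolean masks in `iloc`; the exact messages may differ between versions, and `iloc_outcome` is a helper made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df["a"] % 2 == 0  # boolean Series labeled A..E


def iloc_outcome(frame, key):
    """Return the name of the exception raised by frame.iloc[key], or 'ok'."""
    try:
        frame.iloc[key]
    except (ValueError, NotImplementedError) as exc:
        return type(exc).__name__
    return "ok"


# Case 1: mask carrying non-integer labels.
labeled = iloc_outcome(df, mask)

# Case 2: the same mask relabeled with the integers 0..4.
int_mask = mask.copy()
int_mask.index = range(len(int_mask))
integer_labeled = iloc_outcome(df, int_mask)

# Baseline: a bare boolean array has no labels and is applied by position.
unlabeled = iloc_outcome(df, mask.to_numpy())

print(labeled, integer_labeled, unlabeled)
```

Catching the two exception types separately from other errors keeps the check narrow: anything unexpected still propagates.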
Contributor Author

### hayd commented May 18, 2013

 @jreback Yes, that is what we should be doing! :)

### jreback closed this in #3635 May 19, 2013

Contributor Author

### hayd commented May 24, 2013

```
In [28]: df1
Out[28]:
              1    2    3    4
1983-02-16  512  517  510  514
1983-02-17  513  520  513  517
1983-02-18  500  500  500  500
1983-02-21  505  505  496  496

In [29]: msk = df1.apply(lambda col: df[1] != col).any(axis=1)

In [30]: df1.iloc[msk]
```

sad face
Contributor

### jreback commented May 24, 2013

Sad because this PR took this away? I thought that was the point?

```
In [10]: df = DataFrame(np.random.randint(490,520,size=16).reshape(4,4),index=date_range('1983-02-16',periods=4))

In [11]: df.iloc[2] = 500

In [12]: df
Out[12]:
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-18  500  500  500  500
1983-02-19  514  519  498  503

In [13]: msk = df.apply(lambda col: df[1] != col).any(axis=1)

In [14]: msk
Out[14]:
1983-02-16     True
1983-02-17     True
1983-02-18    False
1983-02-19     True
Freq: D, dtype: bool

In [15]: df.iloc[msk]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

In [16]: df.loc[msk]
Out[16]:
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-19  514  519  498  503
```
Contributor Author

### hayd commented May 24, 2013

 I totally did not understand that that worked in your pr. I thought it gave a not implemented error! (and you'd have to change `msk.index`). Hmmm
Contributor

### jreback commented May 24, 2013

Those were the cases we separated above. Using a mask with labels is conceptually wrong, since you can't `iloc` with labels. This one is OK, but we are disallowing it:

```
In [6]: msk.index=range(4)

In [7]: msk
Out[7]:
0     True
1     True
2    False
3     True
dtype: bool

In [8]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available
```
Contributor Author

### hayd commented May 24, 2013

Sorry, this made little sense without the context, which you found: http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame So you can use an integer mask with `loc` if your index is not integer, I didn't realise that. Not sure where I am on the semantics (this stuff is confusing).
Contributor

### jreback commented May 24, 2013

 I am not sure I understand your last?
Contributor Author

### hayd commented May 24, 2013

In the above I'm using position in `loc`. Compare to using `df.loc[0]`, where I get a key error. I realise that it is obviously the intent of the user to mask like that (since the df doesn't have an integer index), but semantically they are using `iloc`, and I worry that one day, if they happen to be using this mask technique with an integer indexed df with `loc`, they'll get unexpected results. This feels like the `ix` thing...
Contributor

### jreback commented May 24, 2013

When you say above, you mean your example where you do `sad face`? Or this example?

```
In [1]: df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))

In [2]: msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:2013: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]:
   a
2  2
1  3

In [4]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available

In [5]: df.iloc[1:3]
Out[5]:
   a
3  1
2  2

In [6]: df.loc[msk]
Out[6]:
   a
2  2
1  3
```

I don't see where there is possible confusion? `iloc` will give an error if you try to index with a mask (either a value error or not implemented). `loc` is by definition label based indexing; again, if you are indexing, it will be `label` based (and NEVER positional), that's the point (I think `ix` would fall back and that's of course why we created `loc`). Can you give an example of where you think there is a problem?
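
A small sketch of the distinction being drawn here, on a made-up frame with string labels (assuming a recent pandas): a boolean array passed to `.loc` acts as a row filter rather than as a set of integer labels, while an integer key is still looked up as a label and fails.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
msk = (df["a"] > 2).to_numpy()  # plain boolean array, no labels attached

# Boolean array of matching length: filters rows, no label lookup involved.
filtered = df.loc[msk]

# Integer key: treated as a label, and 0 is not a label of this frame.
try:
    df.loc[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(filtered)
print(lookup_failed)
```

So the boolean mask never turns `.loc` into positional indexing; it only selects which of the existing labels to keep.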
Contributor Author

### hayd commented May 24, 2013

 I take it all back. I was sure when I did this before I needed to set the index (because somewhere along the line it had gone). Now I see `msk.index == df1.index` anyway, so I was talking utter rubbish. Sorry!
Contributor

### jreback commented May 24, 2013

it's nice having a skeptical eye! thanks