iloc with boolean mask #3631

Closed
hayd opened this issue May 17, 2013 · 42 comments · Fixed by #3635

Comments

@hayd
Contributor

commented May 17, 2013

Currently, when masking by a boolean vector, it doesn't matter which syntax you use:

df[mask]
df.iloc[mask]
df.loc[mask]

are all equivalent. Should df.iloc[mask] mask by position? (This makes sense if the mask has an integer index.)
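
For a plain NumPy boolean array there is no index to align on, so the three spellings can only mean positional masking. A minimal sketch of that baseline case, written against a recent pandas rather than the 0.11 discussed here (the frame and the column name `a` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the labels and the column name are assumptions.
df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))

# A bare ndarray mask carries no labels, so only position can matter.
mask = (df["a"] % 2 == 0).to_numpy()

by_getitem = df[mask]
by_loc = df.loc[mask]
by_iloc = df.iloc[mask]

# True when all three selections are identical.
print(by_getitem.equals(by_loc) and by_loc.equals(by_iloc))
```

The open question in this issue is what should happen when the mask is a Series that brings its own index.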

This SO question.

@jreback

Contributor

commented May 17, 2013

Normally the generated masks have the same index as what you are indexing, e.g. in your example in the SO question.

I think .iloc could/should do this; it does make sense. There is an alignment step for the mask in .where

@snth

Contributor

commented May 17, 2013

Here's an example to summarise the current behaviours:

import numpy as np
import pandas as pd

locs = np.arange(4)
nums = 2**locs
reps = [bin(n) for n in nums]
df = pd.DataFrame({'locs': locs, 'nums': nums}, index=reps)
print(df)
for idx in [None, 'index', 'locs']:
    # plain ndarray mask selecting the rows where nums > 2
    mask = (df.nums > 2).values
    if idx:
        # wrap the mask in a Series whose labels are in reversed order
        mask = pd.Series(mask, index=list(reversed(getattr(df, idx).values)))
    for method in ['', '.loc', '.iloc']:
        try:
            accessor = getattr(df, method[1:]) if method else df
            ans = bin(accessor[mask]['nums'].sum())
        except Exception as e:
            ans = str(e)
        print("{:>5}: df{}[mask].sum()=={}".format(str(idx), method, ans))

with output

        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8

 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==0b11
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==Unalignable boolean Series key provided

If I'm understanding your discussion correctly then in the last line the output should also be 0b11.

@jreback

Contributor

commented May 17, 2013

yes...that looks right, if you happen to think of a simple example where you would actually use this pls post that

keep in mind that masks are the same length as what you are indexing, so I don't think there is ever ambiguity, but I could be wrong

@snth

Contributor

commented May 17, 2013

Fair enough, I can't actually think of anything. It just seems that by symmetry the behaviour should be there.

Similarly, it doesn't seem quite right that in line 6 of the output of my example above, df.iloc[mask] is actually realigning the mask based on mask.index rather than throwing an error.

Given these observations, I would probably vote for the following behaviour:

  • df[mask] should only work with .values in the order they're given and do no realignment. Perhaps for performance reasons.
  • df.loc[mask] should work as it currently does and realign based on mask.index if present. If mask.index is of the wrong type it should throw an Error.
  • df.iloc[mask] should realign based on mask.index if present and of integer type. Otherwise throw an error.

It's true that I haven't thought about this for more than 10 minutes though so if there are many use cases where this causes a problem then nevermind.

@hayd

Contributor Author

commented May 17, 2013

I think it's possible there could be an ambiguity, if the index is in a different order (e.g. was taken from somewhere else where it may well mean the location rather than the label).

Also, I'm given a /core/frame.py:1943: UserWarning: Boolean Series key will be reindexed to match DataFrame index. somewhere along the way messing with integer index booleans.

Correct me if I'm wrong, but isn't "realign based on mask.index" different from location?
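
A small sketch of the distinction being asked about, assuming a recent pandas and a made-up four-row frame. The same booleans select different rows depending on whether the mask's labels are used for alignment or dropped:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(4)})  # default integer index 0..3

# Same length as df, but the labels are in reversed order.
mask = pd.Series([True, True, False, False], index=[3, 2, 1, 0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    aligned = df[mask]  # mask is matched to df by label first

by_position = df[mask.to_numpy()]  # labels dropped, values used in order

# Any reindexing warning pandas emitted while aligning.
print([str(w.message) for w in caught])
print(aligned["a"].tolist(), by_position["a"].tolist())
```

Here alignment picks the rows labelled 3 and 2, while the positional reading picks the first two rows.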

@jreback

Contributor

commented May 17, 2013

Still thinking about this, but I think a nice feature of boolean masking is that since we don't ever have an ambiguity whether it's label or position based (as the mask must be the same length as what you are indexing), then it can be used in either .loc or .iloc. I don't think restricting this is a good idea.

The basic question is: do we effectively drop the index (and make it not matter) when it's the right length?

e.g. should

df[mask_with_index] and df[mask_ndarray] be the same?

or should the first align? (I retract my earlier statement; currently I don't think there is an alignment on the index itself)

but this may depend, e.g. iloc/loc don't align, but I think [] might....

@jreback

Contributor

commented May 17, 2013

@hayd you make a good point, but to my knowledge that is the issue with an unlabeled index: the user has to make sure it's in the right order, and pandas cannot help (but in the case we are talking about it could help, by making sure the index is aligned)

@snth

Contributor

commented May 17, 2013

@jreback With regards to the alignment, see my example above. I threw in the reversed(...) to see whether it realigns based on the labels or not. The results above show that [], .loc and .iloc all perform an alignment step based on the labels when these are in a different order. Therefore I think the fact that the length is the same is lulling you into a false sense of security.

Apologies if my example isn't clear. I thought the binary representation was a nice way of concisely summarising which items were selected or not. The bottom line is that the results differ between ndarray and pd.Series because in the pd.Series case the .index is used to do an alignment step first.
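
A condensed sketch of the same point, assuming a recent pandas and using only .loc, which still accepts a labelled mask. The frame mirrors the one in the example above; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": [1, 2, 4, 8]},
                  index=["0b1", "0b10", "0b100", "0b1000"])
values = (df["nums"] > 2).to_numpy()  # [False, False, True, True]

# ndarray mask: applied by position, so it selects 4 and 8.
positional = bin(int(df.loc[values]["nums"].sum()))

# Same booleans, labelled with the index reversed: .loc aligns on the
# labels first, so the True values land on '0b10' and '0b1' instead.
labelled = pd.Series(values, index=df.index[::-1])
aligned = bin(int(df.loc[labelled]["nums"].sum()))

print(positional, aligned)
```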

Also there's a mistake in my output because it should read:

df[mask]['nums'].sum()==...
@jreback

Contributor

commented May 17, 2013

@snth you are right....I was eye-balling the code....as it's a fairly tricky path

ok....so the bottom line is whether to make iloc align on the labels or on position (for a boolean mask). That itself is somewhat of an issue: it is certainly possible, and even likely, that the index is not integer-based, so then iloc should throw an error

So from the original SO question, this should then raise?

In [1]: df = pd.DataFrame(range(5), list('ABCDE'), columns=['a'])

In [2]: mask = (df.a%2 == 0)

In [3]: mask
Out[3]: 
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [4]: df.iloc[mask]
Out[4]: 
   a
A  0
C  2
E  4

What about this?

In [5]: mask.nonzero()
Out[5]: (array([0, 2, 4]),)

In [6]: mask.nonzero()[0]
Out[6]: array([0, 2, 4])

In [7]: df.iloc[mask.nonzero()[0]]
Out[7]: 
   a
A  0
C  2
E  4

In [8]: df.iloc[Series(mask.nonzero()[0])]
Out[8]: 
   a
A  0
C  2
E  4

In [9]: Series(mask.nonzero()[0])
Out[9]: 
0    0
1    2
2    4
dtype: int64

I think there is NO alignment happening; instead it's using the values to actually index
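
For readers on a newer pandas, where Series.nonzero is, as far as I know, no longer available, the same conversion from a boolean mask to integer positions can be sketched with NumPy. The frame is the one from the transcript above; `positions` and `selected` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df["a"] % 2 == 0

# Positions of the True entries; the mask's labels play no part here.
positions = np.flatnonzero(mask.to_numpy())

# Pure integer-location indexing, no boolean mask involved any more.
selected = df.iloc[positions]
print(selected)
```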

@hayd

Contributor Author

commented May 17, 2013

I think iloc should throw an error if it's not integer based, it should definitely use position.

These integers needn't be in order, just like the labels needn't be in order.

Putting an array to iloc should do what it does now, but a boolean Series is a different beast:

msk = pd.Series([True, True, True, False, False])
df.iloc[msk] == df.iloc[0:3]

?

@snth

Contributor

commented May 17, 2013

@jreback I agree that in your first example, by my reasoning, In[4] should raise an Exception rather than realign based on labels within .iloc.

I don't understand your second example. There's no boolean indexing involved there and they just seem to be examples of .iloc. Is that what's happening in the source code?

@jreback

Contributor

commented May 17, 2013

@snth I was trying to have it index with a Series that had a different index, (that is essentially what boolean masking does)

@hayd

Contributor Author

commented May 17, 2013

Ok, if you had the one above, there is an easy workaround (with arrays):

msk = pd.Series([True, True, True, False, False])
df.iloc[msk.values]  # instead of df.iloc[msk]

But if your index was not in order:

msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])
df.iloc[msk] == df[1:4]

?
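
To make the two possible readings of the out-of-order mask concrete, here is a sketch that sidesteps df.iloc[msk] itself and builds each interpretation explicitly from arrays. The frame is assumed to have a default integer index and one made-up column `a`:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)})
msk = pd.Series([True, True, True, False, False], index=[1, 2, 3, 4, 0])

# Reading 1: ignore the mask's index and use the values in the given order.
ignore_index = df.iloc[msk.to_numpy()]

# Reading 2: treat the mask's integer labels as positions, i.e. put the
# mask into label order before applying it.
labels_as_positions = df.iloc[msk.sort_index().to_numpy()]

print(ignore_index["a"].tolist(), labels_as_positions["a"].tolist())
```

The first reading yields the rows at positions 0 to 2, the second the rows at positions 1 to 3, which is the df[1:4] result asked about.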

@snth

Contributor

commented May 17, 2013

@hayd In terms of workarounds you could probably do

df.iloc[msk.index][msk.values]
@jreback

Contributor

commented May 17, 2013

Nothing else broke...I'll put up the PR in a minute....pls try out

In [3]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [4]: mask = (df.a%2 == 0)

In [5]: df.iloc[mask]
ValueError: iLocation based boolean indexing can only have an integer index [string] was inferred
@jreback

Contributor

commented May 17, 2013

@hayd I think if the user tries what you suggest (with an index not in order), they should be shot :)

@hayd

Contributor Author

commented May 17, 2013

ha! Well, I'm sulking!

Ignoring the index just seems a little dodge... I think we should either:

  • raise whatever if you pass in a (boolean) Series to iloc (maybe with a NotImplementedError :p).
  • implement analogous behaviour to []/loc *

Surely iloc (for boolean masking) is only for integer location, so if mask.index isn't integer they are calling the wrong thing... and we shouldn't let them.

Otherwise it's just salt for df.iloc[msk.values], and if it is, let people do that.

*I'm not convinced the (granted, obtuse) example is impossible to envisage.

@jreback

Contributor

commented May 17, 2013

@hayd does the PR not do the first? (e.g. raises)

@hayd

Contributor Author

commented May 17, 2013

No... at least I didn't think so, (from your pr):

df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))
msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:1999: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]:
   a
2  2
1  3

In [4]: df.iloc[msk]  # should raise
Out[4]:
   a
2  2
1  3

In [5]: df.iloc[1:3]  # or should be this
Out[5]:
   a
3  1
2  2

At the moment iloc isn't getting by location, but rather by label (and that's true in the other examples, just it was getting the same results).. I think it should just be disabled for boolean indexing (at least for now).

?

hides

@jreback

Contributor

commented May 17, 2013

hmm...I see what you mean, so essentially eliminate boolean masking from iloc entirely?

@hayd

Contributor Author

commented May 17, 2013

Yeah... perhaps an optimistic NotImplementedError. :p

@jreback

Contributor

commented May 17, 2013

hahha...ok will fix

@jreback

Contributor

commented May 17, 2013

@snth output from your script with my PR

        locs  nums
0b1        0     1
0b10       1     2
0b100      2     4
0b1000     3     8
 None: df[mask].sum()==0b1100
 None: df.loc[mask].sum()==0b1100
 None: df.iloc[mask].sum()==0b1100
/mnt/home/jreback/pandas/pandas/core/frame.py:2001: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
index: df[mask].sum()==0b11
index: df.loc[mask].sum()==0b11
index: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask
 locs: df[mask].sum()==Unalignable boolean Series key provided
 locs: df.loc[mask].sum()==Unalignable boolean Series key provided
 locs: df.iloc[mask].sum()==iLocation based boolean indexing cannot use an indexable as a mask
@jreback

Contributor

commented May 17, 2013

@hayd @snth I updated the release notes / v0.11.1 whatsnew. pls take a look and see if they tell what has changed and are not too confusing...thxs

@hayd

Contributor Author

commented May 17, 2013

I think the notes look clear and understandable. I'm happy. :)

@jreback

Contributor

commented May 17, 2013

any other cases you think?

@jreback

Contributor

commented May 17, 2013

will merge maybe in a day or 2....if anyone thinks of anything

@jreback

Contributor

commented May 17, 2013

@y-p any thoughts?

@ghost


commented May 18, 2013

Some twists and turns in the discussion, not sure I got it all.

Here's my take on the discussion above, does it match the PR?

  1. When given a labeled indexer, pandas implicitly aligns. That's the rule.
  2. If the user wants the indexing to behave as if he passed in an array, he should pass in an array.
  3. Then, with respect to what should we align a bool indexer passed in to iloc?
    • Since .loc can always be used, .iloc obviously shouldn't duplicate the behaviour by aligning wrt the underlying frame labels. Hence, it should raise when passed a labeled indexer with a non-integer index.
    • An indexer with integer labels given to iloc should be realigned (re 1.); since we've established it shouldn't do that wrt labels, the only alternative is wrt position.
    • Since that doesn't sound like a very common use case, raising NotImplementedError is perfectly fine until there's a demand for it.
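
The policy summarised above can be sketched as a small standalone function. This is only an illustration of the rules as stated: `iloc_with_mask` is a hypothetical helper, not pandas API and not the code in the PR, and the error messages are placeholders.

```python
import numpy as np
import pandas as pd


def iloc_with_mask(df, mask):
    """Hypothetical helper: positional boolean masking only."""
    if isinstance(mask, pd.Series):
        if pd.api.types.is_integer_dtype(mask.index.dtype):
            # Integer labels could be realigned by position; deferred for now.
            raise NotImplementedError("boolean Series with an integer index")
        # Non-integer labels are a job for .loc, not .iloc.
        raise ValueError("boolean Series with a non-integer index")
    # Unlabelled mask: use the values as given, with no realignment.
    return df.iloc[np.asarray(mask, dtype=bool)]


df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
print(iloc_with_mask(df, [True, False, True, False, True]))
```

The point of the split is that an unlabelled mask has exactly one meaning, while a labelled one either belongs to .loc or needs a positional realignment nobody has asked for yet.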
@hayd

Contributor Author

commented May 18, 2013

@y-p I think that's an excellent summary.

The only thing is that atm (in Jeff's current PR) it raises a ValueError, but maybe NotImplementedError ("one day") better describes it.

@jreback

Contributor

commented May 18, 2013

So I disambiguated these 2 cases, which I think corresponds to @y-p (and @hayd) points

In [6]: df = DataFrame(range(5), list('ABCDE'), columns=['a'])

In [7]: mask = (df.a%2 == 0)

In [8]: df
Out[8]: 
   a
A  0
B  1
C  2
D  3
E  4

In [9]: mask
Out[9]: 
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [10]: df.iloc[mask]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask


In [11]: mask.index=range(len(mask))

In [12]: mask
Out[12]: 
0     True
1    False
2     True
3    False
4     True
Name: a, dtype: bool

In [13]: df.iloc[mask]
NotImplementedError: iLocation based boolean indexing on an integer type is not available
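
For someone who hits either error, the two unambiguous spellings are .loc with the labelled mask or .iloc with the bare values. A sketch assuming a recent pandas; `try_iloc` is a hypothetical helper that only reports what .iloc does with a given key, and I would expect both Series masks below to be rejected, though the exact exception may vary by version:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))
mask = df["a"] % 2 == 0  # boolean Series labelled A..E


def try_iloc(frame, key):
    """Return the selection, or the name of the exception .iloc raised."""
    try:
        return frame.iloc[key]
    except (ValueError, NotImplementedError) as exc:
        return type(exc).__name__


print(try_iloc(df, mask))                          # labelled mask
print(try_iloc(df, mask.reset_index(drop=True)))   # integer-labelled mask

# Unambiguous alternatives: label-based, or positional with plain values.
by_label = df.loc[mask]
by_position = df.iloc[mask.to_numpy()]
print(by_label.equals(by_position))
```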
@hayd

Contributor Author

commented May 18, 2013

@jreback Yes, that is what we should be doing! :)

@hayd

Contributor Author

commented May 24, 2013

In [28]: df1
Out[28]:
              1    2    3    4
1983-02-16  512  517  510  514
1983-02-17  513  520  513  517
1983-02-18  500  500  500  500
1983-02-21  505  505  496  496

In [29]: msk = df1.apply(lambda col: df[1] != col).any(axis=1)

In [30]: df1.iloc[msk]

sad face

@jreback

Contributor

commented May 24, 2013

Sad because this PR took this away?
I thought that was the point?

In [10]: df = DataFrame(np.random.randint(490,520,size=16).reshape(4,4),index=date_range('1983-02-16',periods=4))

In [11]: df.iloc[2] = 500

In [12]: df
Out[12]: 
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-18  500  500  500  500
1983-02-19  514  519  498  503

In [13]: msk = df.apply(lambda col: df[1] != col).any(axis=1)

In [14]: msk
Out[14]: 
1983-02-16     True
1983-02-17     True
1983-02-18    False
1983-02-19     True
Freq: D, dtype: bool

In [15]: df.iloc[msk]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask

In [16]: df.loc[msk]
Out[16]: 
              0    1    2    3
1983-02-16  495  510  500  493
1983-02-17  519  517  508  504
1983-02-19  514  519  498  503
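
A deterministic version of the transcript above, using the values printed in Out[12] instead of random data, and assuming a recent pandas. `DataFrame.ne` is swapped in for the `apply` call as one way to build the same mask; that substitution is mine, not something from the thread:

```python
import pandas as pd

df = pd.DataFrame(
    [[495, 510, 500, 493],
     [519, 517, 508, 504],
     [500, 500, 500, 500],
     [514, 519, 498, 503]],
    index=pd.date_range("1983-02-16", periods=4),
)

# True for rows where any column differs from column 1.
msk = df.ne(df[1], axis=0).any(axis=1)

# The mask shares df's date index, so label-based selection is the fit.
rows = df.loc[msk]
print(rows)
```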
@hayd

Contributor Author

commented May 24, 2013

I totally did not understand that that worked in your pr. I thought it gave a not implemented error! (and you'd have to change msk.index). Hmmm

@jreback

Contributor

commented May 24, 2013

Those were the cases we separated above. With labels it is conceptually wrong: you can't iloc with labels. This one is ok, but we are disallowing it:

In [6]: msk.index=range(4)

In [7]: msk
Out[7]: 
0     True
1     True
2    False
3     True
dtype: bool

In [8]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available
@hayd

Contributor Author

commented May 24, 2013

Sorry, this made little sense without the context, which you found: http://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-data-frame

So you can use an integer mask with loc if your index is not integer; I didn't realise that. Not sure where I am on the semantics (this stuff is confusing).

@jreback

Contributor

commented May 24, 2013

I am not sure I understand your last?

@hayd

Contributor Author

commented May 24, 2013

in the above I'm using position in loc. Compare to using df.loc[0] where I get a key error.

I realise that it is obviously the intent of the user to mask like that (since the df doesn't have integer index) but semantically they are using iloc, and I worry that one day if they happen to be using this mask technique with an integer indexed df with loc then they'll get unexpected results. This feels like the ix thing...

@jreback

Contributor

commented May 24, 2013

When you say above, do you mean your example where you do sad face? Or this example?

In [1]: df = pd.DataFrame(range(5), columns=['a'], index=range(4, -1, -1))

In [2]: msk = pd.Series([True, True, False, False, False], index=[1,2,3,4,0])

In [3]: df[msk]
pandas/core/frame.py:2013: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]: 
   a
2  2
1  3

In [4]: df.iloc[msk]
NotImplementedError: iLocation based boolean indexing on an integer type is not available

In [5]: df.iloc[1:3]
Out[5]: 
   a
3  1
2  2

In [6]: df.loc[msk]
Out[6]: 
   a
2  2
1  3

I don't see where there is possible confusion? iloc will give an error if you try to index with a labelled mask (either a ValueError or a NotImplementedError)

loc is by definition label based indexing, again you are indexing, it will be label based (and NEVER positional), that's the point (I think ix would fall back and that's of course why we created loc)
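
A quick sketch of the "never positional" point, using a made-up frame with string labels: an integer key is looked up as a label by .loc and fails, while .iloc treats it as a position.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))

# .loc looks the key up as a label; 0 is not among A..E.
try:
    df.loc[0]
    loc_result = "found"
except KeyError:
    loc_result = "KeyError"

# .iloc treats the same key as a position.
first_row = df.iloc[0]

print(loc_result, first_row.name)
```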

can you give an example of where you think there is a problem?

@hayd

Contributor Author

commented May 24, 2013

I take it all back. I was sure when I did this before I needed to set the index (because somewhere along the line it had gone). Now I see msk.index == df1.index anyway, so I was talking utter rubbish. Sorry!

@jreback

Contributor

commented May 24, 2013

it's nice having a skeptical eye! thanks
