Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Now boolean operators work with NDFrames? #4633

Closed
cpcloud opened this issue Aug 22, 2013 · 22 comments

Comments

Projects
None yet
6 participants
@cpcloud
Copy link
Member

commented Aug 22, 2013

Awkward side effect of the new and improved Series probably, I typed something wrong and was shocked to not get the usual numpy ValueError barf:

In [57]: a, b = Series(rand(10) > 0.5), Series(rand(10) > 0.5)

In [58]: a
Out[58]:
0     True
1     True
2    False
3    False
4     True
5    False
6    False
7     True
8     True
9     True
dtype: bool

In [59]: b
Out[59]:
0     True
1     True
2     True
3     True
4    False
5     True
6    False
7     True
8     True
9     True
dtype: bool

In [60]: a and b # what?! no way!!
Out[60]:
0     True
1     True
2     True
3     True
4    False
5     True
6    False
7     True
8     True
9     True
dtype: bool

In [61]: a & b # whew
Out[61]:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7     True
8     True
9     True
dtype: bool
@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 22, 2013

In [11]: dfa, dfb = DataFrame(randn(10,2)),DataFrame(randn(10,2))

In [13]: dfa
Out[13]: 
          0         1
0  1.032736 -0.981451
1  0.154600  1.458225
2  0.688022  0.793732
3 -1.171007 -0.643027
4 -0.559731 -0.019199
5  1.079594 -0.549702
6 -1.803250  1.319622
7  0.602654  0.550604
8 -0.908063 -1.304829
9  0.103192  1.542427

In [14]: dfb
Out[14]: 
          0         1
0 -0.445965 -1.055636
1  0.929273  0.160086
2  0.302718 -0.653348
3 -1.227953 -1.627101
4  1.832395 -0.036561
5  0.806457  1.730636
6  1.805333 -0.299774
7  0.283562 -0.263815
8  0.749068  0.589955
9  0.523333 -2.085325

In [15]: dfa and dfb
Out[15]: 
          0         1
0 -0.445965 -1.055636
1  0.929273  0.160086
2  0.302718 -0.653348
3 -1.227953 -1.627101
4  1.832395 -0.036561
5  0.806457  1.730636
6  1.805333 -0.299774
7  0.283562 -0.263815
8  0.749068  0.589955
9  0.523333 -2.085325
@hayd

This comment has been minimized.

Copy link
Contributor

commented Aug 22, 2013

This seems like a good thing. Now truthy-ness of Series/DataFrames is consistent, empty Series are Falsey.

In [6]: s1 = pd.Series([1])

In [7]: s_ = pd.Series()

In [8]: s_ or s
Out[8]: 
0    1
dtype: int64

Not sure I can use this anywhere, but I'll try.

@miketkelly

This comment has been minimized.

Copy link
Contributor

commented Aug 22, 2013

Consistency aside, this seems likely to lead to lots of subtle bugs when people use df[a or b] when they mean df[a | b].

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

Along the lines of @mtkni's comment : is it a good thing that Series behaves more like a Python container vs a numpy container?

@hayd

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

Ew:

In [1]: bool(np.array([]))
Out[1]: False

In [2]: bool(np.array([1]))
Out[2]: True

In [3]: bool(np.array([1, 2]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-5ba97928842c> in <module>()
----> 1 bool(np.array([1, 2]))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

That said, you're probably right that and/or should raise. :(

@kjordahl

This comment has been minimized.

Copy link

commented Aug 23, 2013

You also get fun things like this:

>>> a = Series(rand(10))
>>> bool(a == 0)
True
>>> bool(a == 1)
True

I think it should raise as well. It's doing numpy-like comparisons and broadcasting, not throwing an exception can allow subtle bugs to go undetected.

Possibly related: #1069 (old but apparently reverted), #4576, #4537

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

this like this I think are easily caught

I had to add a couple of coercion method which can handle these cases

eg bool , int etc

@kjordahl

This comment has been minimized.

Copy link

commented Aug 23, 2013

What's the use case for allowing bool(s) to return True for any series with nonzero length? It has never been the case in previous releases, since Series inherited from numpy.ndarray. This is new behavior, and IMO it does not add anything useful, only a source of bugs.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

oh I agree

however we have a couple of tests that would fail otherwise so need to fix those

iIIRC there was a couple of places this is relied upon (which is just wrong)

@cpcloud pls mark this and we can take a look

@kjordahl

This comment has been minimized.

Copy link

commented Aug 23, 2013

OK, thanks.

It looks like raising ValueError for truth tests on DataFrames (which had been introduced in PR #1073) was reverted by PR #3696. I would argue that should be restored as well.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

yeh these should all be consistent
I'll take alook tomorrow

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

In [17]: s = Series(randn(4))

In [18]: df = DataFrame(randn(10,2))

In [19]: s_empty = Series()

In [20]: df_empty = DataFrame()

This is from 0.12

In [21]: bool(s)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [22]: bool(s_empty)
Out[22]: False

In [23]: bool(df)
Out[23]: True

In [24]: bool(df_empty)
Out[24]: False

master

In [5]: bool(s)
Out[5]: True

In [6]: bool(s_empty)
Out[6]: False

In [7]: bool(df)
Out[7]: True

In [8]: bool(df_empty)
Out[8]: False

These are the same in 0.12 and master, e.g. a 1-element Series is special cased
to evaluate to return the boolean of a single value

In [9]: bool(Series([True]))
Out[9]: True

In [10]: bool(Series([False]))
Out[10]: False

I am going to argue that 0.12 is inconsistent and master is correct

However, one could have a point of view that ALL of these should raise

any thoughts?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

@wesm ?

@kjordahl

This comment has been minimized.

Copy link

commented Aug 23, 2013

pandas 0.11.0:

>>> bool(s)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> bool(s_empty)
False

>>> bool(df)
ValueError: Cannot call bool() on DataFrame.

>>> bool(df_empty)
ValueError: Cannot call bool() on DataFrame.

To me, it looks like this is the right behavior (with the possible exception of bool(df_empty)).

@hayd

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

I'm strongly of the opinion it should be all or nothing. The numpy behaviour is stupid.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

Changing to this (for all NDFrame, and eliminating the special case from Series)

    def __nonzero__(self):
        raise ValueError("cannot use __nonzero__ in NDframe")
    __bool__ = __nonzero__

causes very few failures, (aside from the ones I am purposely testing), really just a couple of test cases that are not formed well

e.g. things like these, which are simply incorrect usages....will change the PR and eveyone can look

s1 = Series.....
s2 = Series....
assert s1 == s2
if s1:
    .....
@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

this is a revert to #1073/#1069

now a call to nonzero raises TypeError ALWAYS

The following is the behavior

In [17]: s = Series(randn(4))

In [18]: df = DataFrame(randn(10,2))

In [19]: s_empty = Series()

In [20]: df_empty = DataFrame()

In [5]: bool(s)
Out[5]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [6]: bool(s_empty)
Out[6]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [7]: bool(df)
Out[7]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [8]: bool(df_empty)
Out[8]: TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

And prevents these fun ones (same for Series/Panel)

In [4]: df1 = DataFrame(np.ones((4,4)))

In [5]: df2 = DataFrame(np.zeros((4,4)))

In [6]: df1 and df2
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [7]: df1 or df2
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [8]: not df1
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons

In [9]: def f():
   ...:     if df1:
   ...:         print("this is cool")
   ...:         

In [10]: f()
TypeError: cannot use __nonzero__ in NDframe
  try using empty to evaluate whether the object is empty  or .all()/.any() for elementwise comparisons
@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

anyone think this should be a ValueError (like numpy)?

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

I think you should change the error message - the numpy error message is
clearer. (and most users aren't going to understand why bool(...) or if
... ends up with something about nonzero) Whether it should be
TypeError or ValueError: I think it makes sense to follow numpy's lead
here, if only for compatibility - no strong preference for me.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

@jtratner
do you want to suggest something?

numpy

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

make the same? I also want to mention empty?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2013

revised to be the same as numpy

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 24, 2013

see revised PR for updates on the error messages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.