Pandas 0.12 unexpected results when comparing dataframe to list or tuple #4576

ssalonen · 2013-08-15T06:27:50Z

It seems that when comparing a DataFrame to a list or tuple of values (with length equal to the number of DataFrame columns), the resulting boolean DataFrame is incorrect.

```python
>>> import pandas
>>> print pandas.__version__
0.12.0
>>> import numpy as np
>>> print np.__version__
1.6.1
>>> df=pandas.DataFrame([ [-1, 0], [1, 2] ])
>>> df > (0, 1)
       0      1
0  False  False
1  False   True
>>> df > [0, 1]
       0      1
0  False  False
1  False   True
```

Comparison with numpy array works as expected:
```python
>>> df > np.array([0, 1])
       0      1
0  False  False
1   True   True
```

Note that the comparison behaved correctly at least in pandas 0.9.0:
```python
>>> import pandas
>>> print pandas.__version__
0.9.0
>>> df=pandas.DataFrame([ [-1, 0], [1, 2] ])
>>> df > (0, 1)
       0      1
0  False  False
1   True   True
>>> df > [0, 1]
       0      1
0  False  False
1   True   True
```
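
The two outputs differ in a way that plain NumPy can reproduce. A minimal sketch, assuming the 0.12 result comes from treating the list as a column and the 0.9.0 result from treating it as a row (the variable names are illustrative only):

```python
import numpy as np

values = np.array([[-1, 0], [1, 2]])

# Row interpretation: each threshold applies to one column.
as_row = values > np.array([0, 1])

# Column interpretation: each threshold applies to one row.
as_column = values > np.array([[0], [1]])

print(as_row.tolist())     # [[False, False], [True, True]]
print(as_column.tolist())  # [[False, False], [False, True]]
```

Because the frame is 2x2, both interpretations are shape-compatible, which is why the difference only shows up in the values.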

jreback · 2013-08-15T12:03:15Z

These yield different shapes (which is confusing as you are using a 2x2 frame)

A single list/tuple becomes a column, while a list-of-list yields rows

In [25]: df = DataFrame(np.arange(6).reshape(3,2))

In [26]: df>(2,2,2)
Out[26]: 
       0      1
0  False  False
1  False   True
2   True   True

In [30]: pd.DataFrame([2,2,2])
Out[30]: 
   0
0  2
1  2
2  2

In [31]: pd.DataFrame([[2,2,2]])
Out[31]: 
   0  1  2
0  2  2  2

In [29]: pd.DataFrame(np.array([2,2]))
Out[29]: 
   0
0  2
1  2

In [27]: df>np.array((2,2))
Out[27]: 
       0      1
0  False  False
1  False   True
2   True   True
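
When the direction matters, it can be spelled out rather than inferred from the container type. A small sketch, assuming a recent pandas release, where the flex comparison method `DataFrame.gt` accepts an `axis` argument; the threshold values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2))

# One threshold per row: compare along the index.
per_row = df.gt([0, 2, 4], axis="index")

# One threshold per column: compare along the columns.
per_column = df.gt([2, 2], axis="columns")

print(per_row.values.tolist())
print(per_column.values.tolist())
```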

ssalonen · 2013-08-15T17:05:10Z

OK, I see my example was a bit shady because of the 2x2 size. Here's an example with a 3x2 DataFrame:

In [10]: df=pd.DataFrame(np.arange(6).reshape((3,2)))

In [11]: df
Out[11]:
   0  1
0  0  1
1  2  3
2  4  5

If a list/tuple really becomes a column, why doesn't a 1D array?

In [12]: df > [2, 2]
Out[12]:
      0     1
0  True  True
1  True  True
2  True  True

In [13]: df > np.array([2, 2])
Out[13]:
       0      1
0  False  False
1  False   True
2   True   True

I think both should give the same result, since both the list and the 1D array are... well, 1D objects.

This would also match what happens with numpy 2D arrays:

In [17]: df.values > [2, 2]
Out[17]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Similar example that confuses me:

In [24]: row_vector = np.atleast_2d([2,2])
In [25]: df > row_vector
Out[25]:
      0     1
0  True  True
1  True  True
2  True  True

In [26]: df.values > row_vector
Out[26]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Wouldn't it be logical in this case for row_vector, with shape (1,2), to be broadcast to (3,2) before the comparison?

EDIT: these examples were with pandas 0.12 and numpy 1.7.1
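
For reference, the broadcast asked about here can be made explicit in NumPy. A sketch using `np.broadcast_to` (available in NumPy 1.10 and later, so newer than the versions used in this thread):

```python
import numpy as np

values = np.arange(6).reshape((3, 2))
row_vector = np.atleast_2d([2, 2])          # shape (1, 2)

# Stretch the single row to the full (3, 2) shape before comparing.
expanded = np.broadcast_to(row_vector, values.shape)

implicit = values > row_vector
explicit = values > expanded
print(np.array_equal(implicit, explicit))   # True
```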

jreback · 2013-08-15T17:22:09Z

actually that's not correct

a passed ndarray is represented as it comes (as it defines rows/columns).
A tuple or list (which is what you are presenting) is passed as a list-of-lists/tuples.
These are rows.

In [5]: pd.DataFrame([[1,2,3],[4,5,6]])
Out[5]: 
   0  1  2
0  1  2  3
1  4  5  6

In [6]: pd.DataFrame([1,2,3])
Out[6]: 
   0
0  1
1  2
2  3

In [7]: pd.DataFrame([[1,2,3]])
Out[7]: 
   0  1  2
0  1  2  3

This is exactly the same as numpy behavior. There isn't any implicit broadcasting:

In [13]: np.array([[1,2,3],[4,5,6]])
Out[13]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [14]: np.array([[1,2,3],[4,5,6]]).shape
Out[14]: (2, 3)

In [15]: np.array([1,2,3])
Out[15]: array([1, 2, 3])

In [16]: np.array([1,2,3]).shape
Out[16]: (3,)

In [17]: np.array([[1,2,3]])
Out[17]: array([[1, 2, 3]])

In [18]: np.array([[1,2,3]]).shape
Out[18]: (1, 3)

You are just passing a list which is a column, that's it.

Remember that since you are not passing an index/columns, pandas has to follow
a defined behavior. If you had passed an index/columns then it WILL align on the index.
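
To illustrate the alignment point, here is a sketch assuming a recent pandas release. The column labels `a` and `b` and the threshold values are hypothetical; `DataFrame.gt` is used because the flex methods align a labeled Series before comparing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# Labels are given in a different order than the frame's columns.
thresholds = pd.Series({"b": 4, "a": 1})

# Each column is compared with the threshold carrying the same label.
result = df.gt(thresholds, axis="columns")
print(result)
```

Here column `a` is compared with 1 and column `b` with 4, regardless of the order in which the thresholds were written.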

mairas · 2013-08-16T08:44:43Z

I would like to disagree about pandas following numpy behaviour here. Firstly, 1d numpy arrays do not have a defined direction - they are just 1d vectors. For example,

In [30]: a = np.array([1,2,3])

In [31]: a.shape
Out[31]: (3,)

In [32]: a
Out[32]: array([1, 2, 3])

In [33]: a.T
Out[33]: array([1, 2, 3])

In [34]: a==a.T
Out[34]: array([ True,  True,  True], dtype=bool)

Therefore, it is in my opinion rather dangerous to assume that lists or tuples don't have a shape but that 1d arrays would. I believe they should behave identically.

The second issue is that numpy does broadcast with comparison operators, just as ssalonen showed above. I guess it would be OK if pandas didn't, but that should be an explicit and documented deviation from numpy semantics.

Third, regardless of the broadcasts, I believe the comparison operators in pandas are quite broken at the moment:

In [49]: df = pd.DataFrame(np.arange(6).reshape((3,2)))

In [50]: b = np.array([2, 2])

In [51]: b_r = np.atleast_2d([2,2])

In [52]: b_c = b_r.T

In [53]: df > b
Out[53]:
       0      1
0  False  False
1  False   True
2   True   True

In [54]: df > b_r
Out[54]:
      0     1
0  True  True
1  True  True
2  True  True

In [55]: df > b_c
Out[55]:
       0      1
0  False  False
1  False   True
2   True   True

I don't quite understand the element-wise comparisons made in the example above. Some broadcasts are necessarily made, but not in any logical fashion.

Also, the equality operator should work with the same semantics as greater than. However, it does not:

In [60]: df == b
Out[60]:
       0      1
0  False  False
1   True  False
2  False  False

In [61]: df == b_r
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: Could not compare [array([[2, 2]])] with block values

In [62]: df == b_c
Out[62]:
       0      1
0  False  False
1   True  False
2  False  False

To me it would seem that 1d and column vectors behave as broadcast row vectors in the comparison operators, while row vectors are more thoroughly broken. :-)
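
One way to keep track of cases like these is to check each comparison against the same operation on `df.values`. A rough sketch; `check_against_numpy` is a hypothetical helper, not part of pandas, and the outcome depends on the installed pandas version:

```python
import operator

import numpy as np
import pandas as pd


def check_against_numpy(df, rhs, op):
    """Compare `op(df, rhs)` with `op(df.values, rhs)` and describe the outcome."""
    try:
        expected = op(df.values, rhs)
    except ValueError:
        expected = None  # NumPy could not broadcast the shapes
    try:
        actual = op(df, rhs).values
    except (ValueError, TypeError):
        actual = None  # pandas refused the comparison
    if expected is None or actual is None:
        return "both raise" if expected is actual else "only one raises"
    return "match" if np.array_equal(actual, expected) else "mismatch"


df = pd.DataFrame(np.arange(6).reshape((3, 2)))
b = np.array([2, 2])
b_r = np.atleast_2d([2, 2])
b_c = b_r.T

for name, rhs in [("b", b), ("b_r", b_r), ("b_c", b_c)]:
    print(name, check_against_numpy(df, rhs, operator.gt))
```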

jreback · 2013-08-16T20:20:44Z

ok...

so the results from df>b, df>b_r, df>b_c should all be all the same?

df>b_r has a broadcasting issue (it is very strict and won't allow even a transpose, maybe I can relax it)

jreback · 2013-08-16T20:59:00Z

@ssalonen ok....give a try with #4585

mairas · 2013-08-19T08:22:04Z

> so the results from df>b, df>b_r, df>b_c should all be all the same?

I don't think they should be the same. Rather, unless any pandas-specific index alignment is performed in the comparisons, the behaviour should follow that of numpy:

In [64]: df.values > b
Out[64]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [65]: df.values > b_c
[...]

ValueError: operands could not be broadcast together with shapes (3,2) (2,1) 

In [66]: df.values > b_r
Out[66]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

The above would seem to indicate that numpy treats 1d vectors as row vectors, so please disregard anything I wrote about it earlier. ;-)

If you maintain a strong opinion that broadcasts should be avoided, then exceptions should be thrown for df>b and df>b_r, too.
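
The NumPy results above follow from its broadcasting rule: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A short sketch using `np.broadcast` to inspect the resulting shape:

```python
import numpy as np

values = np.empty((3, 2))

# (3, 2) vs (2,): the 1-D array lines up with the last axis, like a row.
print(np.broadcast(values, np.empty(2)).shape)       # (3, 2)

# (3, 2) vs (1, 2): the length-1 axis is stretched to 3.
print(np.broadcast(values, np.empty((1, 2))).shape)  # (3, 2)

# (3, 2) vs (2, 1): the leading dimensions 3 and 2 are neither equal nor 1.
try:
    np.broadcast(values, np.empty((2, 1)))
except ValueError as exc:
    print("cannot broadcast:", exc)
```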

jreback · 2013-08-19T13:35:12Z

Some color. If the rhs is a pure numpy array, there is NO alignment done (as we would
just be guessing how to align), so it IS essentially numpy behavior, but errors are caught
and dealt with. In the current PR, I try the comparison, and if it fails with a broadcasting error
I try the transpose of the rhs, so you get the PR behavior (e.g. df>b_r works).
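
The try-then-transpose idea described here can be sketched against plain NumPy arrays. This is only an illustration of the approach, not the code from the PR, and `compare_with_transpose_fallback` is a hypothetical name:

```python
import operator

import numpy as np


def compare_with_transpose_fallback(values, rhs, op=operator.gt):
    """Apply `op`; if the shapes cannot broadcast, retry with `rhs` transposed."""
    rhs = np.asarray(rhs)
    try:
        return op(values, rhs)
    except ValueError:
        # Broadcasting failed; a second ValueError here propagates to the caller.
        return op(values, rhs.T)


values = np.arange(6).reshape((3, 2))
b_c = np.array([[2], [2]])   # shape (2, 1), does not broadcast against (3, 2)

fallback_result = compare_with_transpose_fallback(values, b_c)
print(fallback_result.tolist())
```

A fallback like this makes a (2, 1) column behave like a (1, 2) row, which is an implicit transpose that NumPy itself does not perform.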

If I turn off the broadcast catching, df>b_r doesn't error, as the default for an error is to pass it through (e.g. if you have an invalid column, say comparing strings and ints or whatever; this can be turned off but is normally on).

I think you would like to see df>b_r and df==b_r always return an error (whereas the others are ok
and return the same results)?

SEE THE PR... I updated it to show what I suggested

(Pdb) df>b
       0      1
0  False  False
1  False   True
2   True   True
(Pdb) df>b_r
      0     1
0  True  True
1  True  True
2  True  True
(Pdb) df>b_c
       0      1
0  False  False
1  False   True
2   True   True

(Pdb) df == b
       0      1
0  False  False
1   True  False
2  False  False
(Pdb) df == b_r
*** ValueError: Could not compare [array([[2, 2]])] with block values
(Pdb) df == b_c
       0      1
0  False  False
1   True  False
2  False  False

jreback · 2013-08-20T20:38:47Z

@mairas did you take a look at the PR? I believe it solves all of the open questions....

ssalonen · 2013-08-21T09:12:46Z

To me it looks like we are going in the right direction.

Since we should have numpy behaviour here (as no alignment is done), expected results should be same as with df.values > x?

df.values > b
Out[247]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

df.values > b_r
Out[248]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Note that the second result above differs from the df>b_r output shown earlier.

Incompatible shapes should result in an exception, since broadcasting is not possible. Numpy does not do an automatic transpose in this case, which I think is a good thing.

df.values > b_c
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-249-4e61d2a85a75> in <module>()
----> 1 df.values > b_c

ValueError: operands could not be broadcast together with shapes (3,2) (2,1)

The numpy implementation is a bit different for == comparison; see the following examples:

df.values == b
Out[259]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

df.values == b_r
Out[260]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

df.values == b_c
Out[261]: False

Especially the final example is interesting; no exception is raised even though inequality comparison raises one.

Examples with numpy 1.6.1

jreback · 2013-08-21T14:39:57Z

In [1]: df = DataFrame(np.arange(6).reshape((3,2)))

In [2]:  b = np.array([2, 2])

In [3]:  b_r = np.atleast_2d([2,2])

In [4]:  b_c = b_r.T

In [5]: df>b
Out[5]: 
       0      1
0  False  False
1  False   True
2   True   True

In [6]: df.values>b
Out[6]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [7]: df>b_r
Out[7]: 
       0      1
0  False  False
1  False   True
2   True   True

In [8]: df.values>b_r
Out[8]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [9]: df>b_c
ValueError: cannot broadcast shape [(3, 2)] with block values [(2, 1)]

In [10]: df.values>b_c
ValueError: operands could not be broadcast together with shapes (3,2) (2,1) 

In [11]: df == b
Out[11]: 
       0      1
0  False  False
1   True  False
2  False  False

In [12]: df.values == b
Out[12]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

In [13]: df==b_r
Out[13]: 
       0      1
0  False  False
1   True  False
2  False  False

In [14]: df.values==b_r
Out[14]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

Numpy does weird things like this (bottom example), but we will raise.
I believe it is actually a NotImplemented type which is interpreted as False (but still very weird).

In [15]: df==b_c
ValueError: cannot broadcast shape [(3, 2)] with block values [(2, 1)]

In [16]: df.values==b_c
Out[16]: False
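
A strict equality check of the kind described here, one that raises instead of returning a scalar for incompatible shapes, could be sketched as follows. `strict_eq` is a hypothetical helper and not how pandas implements it; what `values == b_c` returns on its own depends on the NumPy version:

```python
import numpy as np


def strict_eq(values, rhs):
    """Elementwise `==` that raises ValueError when the shapes cannot broadcast."""
    values = np.asarray(values)
    rhs = np.asarray(rhs)
    np.broadcast(values, rhs)  # raises ValueError on incompatible shapes
    return values == rhs


values = np.arange(6).reshape((3, 2))
b = np.array([2, 2])
b_c = np.atleast_2d([2, 2]).T

print(strict_eq(values, b).tolist())
try:
    strict_eq(values, b_c)
except ValueError as exc:
    print("raised:", exc)
```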

ssalonen · 2013-08-22T06:38:40Z

I agree that pandas should raise with equals-operator.

The examples and pull request test cases did not include dataframe comparison to list/tuple. I believe they behave the same way as numpy 1D array, right?

jreback · 2013-08-22T12:20:07Z

The examples are there now (in the PR page)

jreback · 2013-08-23T14:16:52Z

@mairas @ssalonen what do you think, any more cases?

mairas · 2013-08-23T14:54:43Z

Sorry for not replying earlier - saw shiny things elsewhere. :-)

The semantics look great to me now! Thanks for bearing with us! :-)

Cheers,

ma.

sinhrks · 2016-07-16T22:41:18Z

The behavior may be changed after #13637. Pls comment if any thoughts.

jreback mentioned this issue Aug 16, 2013

BUG: Fix boolean comparison with a DataFrame on the lhs, and a list/tuple on the rhs GH4576 #4585

Merged

kjordahl mentioned this issue Aug 23, 2013

Now boolean operators work with NDFrames? #4633

Closed

jreback closed this as completed in #4585 Aug 26, 2013

sinhrks mentioned this issue Jul 16, 2016

API: Index/Series/DataFrame op 1-d list-like coercion #13637

Closed

jbrockmendel mentioned this issue Sep 28, 2018

Use align_method in comp_method_FRAME #22880

Merged
