Pandas 0.12 unexpected results when comparing dataframe to list or tuple #4576

Closed
ssalonen opened this issue Aug 15, 2013 · 16 comments · Fixed by #4585

Comments

@ssalonen
Contributor

commented Aug 15, 2013

It seems that when comparing a DataFrame to a list or tuple of values (with length equal to the number of DataFrame columns), the resulting boolean DataFrame is incorrect.

```python
>>> import pandas
>>> print pandas.__version__
0.12.0
>>> import numpy as np
>>> print np.__version__
1.6.1
>>> df=pandas.DataFrame([ [-1, 0], [1, 2] ])
>>> df > (0, 1)
       0      1
0  False  False
1  False   True
>>> df > [0, 1]
       0      1
0  False  False
1  False   True
```

Comparison with numpy array works as expected:
```python
>>> df > np.array([0, 1])
       0      1
0  False  False
1   True   True
```

Note that the comparison behaved correctly at least in pandas 0.9.0:
```python
>>> import pandas
>>> print pandas.__version__
0.9.0
>>> df=pandas.DataFrame([ [-1, 0], [1, 2] ])
>>> df > (0, 1)
       0      1
0  False  False
1   True   True
>>> df > [0, 1]
       0      1
0  False  False
1   True   True
```
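
The expectation in this report can be phrased as a small parity check: for a plain list, tuple, or 1D array on the right-hand side, `df > rhs` should give the same booleans as comparing the underlying values with NumPy. The sketch below assumes a reasonably recent pandas (for `DataFrame.to_numpy`); the helper name `compare_like_numpy` is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def compare_like_numpy(df, rhs):
    """Return (pandas_result, numpy_result) for `df > rhs` as boolean arrays.

    Hypothetical helper, used only to put the two results side by side.
    """
    pandas_result = (df > rhs).to_numpy()
    numpy_result = df.to_numpy() > np.asarray(rhs)
    return pandas_result, numpy_result


df = pd.DataFrame([[-1, 0], [1, 2]])

# One entry per right-hand-side type from the report.
agreement = {}
for rhs in ([0, 1], (0, 1), np.array([0, 1])):
    got, want = compare_like_numpy(df, rhs)
    agreement[type(rhs).__name__] = bool(np.array_equal(got, want))

print(agreement)
```

If any entry is `False` on a given installation, that installation shows the discrepancy described above.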
@jreback

Contributor

commented Aug 15, 2013

These yield different shapes (which is confusing as you are using a 2x2 frame)

A single list/tuple becomes a column, while a list-of-list yields rows

In [25]: df = DataFrame(np.arange(6).reshape(3,2))

In [26]: df>(2,2,2)
Out[26]: 
       0      1
0  False  False
1  False   True
2   True   True

In [30]: pd.DataFrame([2,2,2])
Out[30]: 
   0
0  2
1  2
2  2

In [31]: pd.DataFrame([[2,2,2]])
Out[31]: 
   0  1  2
0  2  2  2
In [29]: pd.DataFrame(np.array([2,2]))
Out[29]: 
   0
0  2
1  2

In [27]: df>np.array((2,2))
Out[27]: 
       0      1
0  False  False
1  False   True
2   True   True
@ssalonen

Contributor Author

commented Aug 15, 2013

OK, I see my example was a bit misleading because of the 2x2 size. Here's an example with a 3x2 DataFrame:

In [10]: df=pd.DataFrame(np.arange(6).reshape((3,2)))

In [11]: df
Out[11]:
   0  1
0  0  1
1  2  3
2  4  5

If a list/tuple really becomes a column, why doesn't a 1D array?

In [12]: df > [2, 2]
Out[12]:
      0     1
0  True  True
1  True  True
2  True  True
In [13]: df > np.array([2, 2])
Out[13]:
       0      1
0  False  False
1  False   True
2   True   True

I think both should result in the same since both the list and 1D array are... well, 1D objects.

This would also match what happens with numpy 2D arrays:

In [17]: df.values > [2, 2]
Out[17]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Similar example that confuses me:

In [24]: row_vector = np.atleast_2d([2,2])
In [25]: df > row_vector
Out[25]:
      0     1
0  True  True
1  True  True
2  True  True
In [26]: df.values > row_vector
Out[26]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Wouldn't it be logical in this case for row_vector, with shape (1,2), to be broadcast to (3,2) before the comparison?
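
To make the broadcasting question concrete, here is a small NumPy-only sketch. It assumes a NumPy version that provides `np.broadcast_to`, and it only shows what NumPy does with a (1,2) operand; it says nothing about pandas internals.

```python
import numpy as np

values = np.arange(6).reshape((3, 2))
row_vector = np.atleast_2d([2, 2])  # shape (1, 2)

# Expand the row vector explicitly to the shape of `values`.
expanded = np.broadcast_to(row_vector, values.shape)

implicit = values > row_vector  # NumPy broadcasts (1, 2) -> (3, 2) itself
explicit = values > expanded    # same comparison with the expansion spelled out

print(expanded.shape)
print(np.array_equal(implicit, explicit))
```

Both comparisons give the same element-wise result, which is the behaviour asked for above.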

EDIT: these examples were with pandas 0.12 and numpy 1.7.1

@jreback

Contributor

commented Aug 15, 2013

actually that's not correct

a passed ndarray is represented as it comes (as it defines rows/columns).
A tuple or list (which is what you are presenting) is passed as a list-of-lists/tuples.
These are rows.

In [5]: pd.DataFrame([[1,2,3],[4,5,6]])
Out[5]: 
   0  1  2
0  1  2  3
1  4  5  6

In [6]: pd.DataFrame([1,2,3])
Out[6]: 
   0
0  1
1  2
2  3

In [7]: pd.DataFrame([[1,2,3]])
Out[7]: 
   0  1  2
0  1  2  3

This is exactly the same as numpy behavior. There isn't any implicit broadcasting:

In [13]: np.array([[1,2,3],[4,5,6]])
Out[13]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [14]: np.array([[1,2,3],[4,5,6]]).shape
Out[14]: (2, 3)

In [15]: np.array([1,2,3])
Out[15]: array([1, 2, 3])

In [16]: np.array([1,2,3]).shape
Out[16]: (3,)

In [17]: np.array([[1,2,3]])
Out[17]: array([[1, 2, 3]])

In [18]: np.array([[1,2,3]]).shape
Out[18]: (1, 3)

You are just passing a list which is a column, that's it.

Remember that since you are not passing an index/columns, pandas has to follow
a defined behavior. If you had passed an index/columns then it WILL align on the index.
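
As an illustration of the alignment point, here is a sketch using a labelled `Series` on the right-hand side. It assumes a modern pandas where `DataFrame.gt` accepts an `axis` argument; the column names and thresholds are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=["a", "b"])

# Labels are deliberately given in a different order than the frame's columns.
thresholds = pd.Series({"b": 4, "a": 1})

# The flexible comparison matches thresholds to columns by label, not position.
result = df.gt(thresholds, axis="columns")

print(result["a"].tolist())  # column "a" compared against 1
print(result["b"].tolist())  # column "b" compared against 4
```

With labels available, each column is compared against the value carrying the same label, so no guess about rows versus columns is needed.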

@mairas

commented Aug 16, 2013

I would like to disagree about pandas following numpy behaviour here. Firstly, 1d numpy arrays do not have a defined direction - they are just 1d vectors. For example,

In [30]: a = np.array([1,2,3])

In [31]: a.shape
Out[31]: (3,)

In [32]: a
Out[32]: array([1, 2, 3])

In [33]: a.T
Out[33]: array([1, 2, 3])

In [34]: a==a.T
Out[34]: array([ True,  True,  True], dtype=bool)

Therefore, it is in my opinion rather dangerous to assume that lists or tuples don't have a shape but that 1d arrays would. I believe they should behave identically.

Second issue is that numpy does broadcasts with comparison operators, just as ssalonen showed above. I guess it would be OK if pandas didn't, but that should be an explicit and documented deviation from numpy semantics.

Third, regardless of the broadcasts, I believe the comparison operators in pandas are quite broken at the moment:

In [49]: df = pd.DataFrame(np.arange(6).reshape((3,2)))

In [50]: b = np.array([2, 2])

In [51]: b_r = np.atleast_2d([2,2])

In [52]: b_c = b_r.T

In [53]: df > b
Out[53]:
       0      1
0  False  False
1  False   True
2   True   True

In [54]: df > b_r
Out[54]:
      0     1
0  True  True
1  True  True
2  True  True

In [55]: df > b_c
Out[55]:
       0      1
0  False  False
1  False   True
2   True   True

I don't quite understand the element-wise comparisons made in the example above. Some broadcasts are necessarily made, but not in any logical fashion.

Also, the equality operator should work with the same semantics as greater than. However, it does not:

In [60]: df == b
Out[60]:
       0      1
0  False  False
1   True  False
2  False  False

In [61]: df == b_r
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: Could not compare [array([[2, 2]])] with block values

In [62]: df == b_c
Out[62]:
       0      1
0  False  False
1   True  False
2  False  False

To me it would seem that 1d and column vectors behave as broadcast row vectors in the comparison operators, while row vectors are more thoroughly broken. :-)

@jreback

Contributor

commented Aug 16, 2013

ok...

so the results from df>b, df>b_r, df>b_c should all be all the same?

df>b_r has a broadcasting issue (it is very strict and won't allow even a transpose, maybe I can relax it)

@jreback

Contributor

commented Aug 16, 2013

@ssalonen ok....give a try with #4585

@mairas

commented Aug 19, 2013

> so the results from df>b, df>b_r, df>b_c should all be all the same?

I don't think they should be the same. Rather, unless some pandas-specific index alignment is performed in the comparisons, the behaviour should follow that of numpy:

In [64]: df.values > b
Out[64]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [65]: df.values > b_c
[...]

ValueError: operands could not be broadcast together with shapes (3,2) (2,1) 

In [66]: df.values > b_r
Out[66]:
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

The above would seem to indicate that numpy treats 1d vectors as row vectors, so please disregard anything I wrote about it earlier. ;-)
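
The "1d vectors behave as row vectors" observation follows from NumPy's rule of matching shapes from the trailing dimension. A minimal sketch of which shapes are compatible with a (3,2) array (the helper `broadcast_shape_or_none` is hypothetical, written only for this illustration):

```python
import numpy as np


def broadcast_shape_or_none(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or None if incompatible."""
    try:
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        # NumPy raises ValueError when the shapes cannot be broadcast.
        return None


frame_shape = (3, 2)
shapes = {
    "b": broadcast_shape_or_none(frame_shape, (2,)),      # 1D vector
    "b_r": broadcast_shape_or_none(frame_shape, (1, 2)),  # row vector
    "b_c": broadcast_shape_or_none(frame_shape, (2, 1)),  # column of wrong length
    "col3": broadcast_shape_or_none(frame_shape, (3, 1)), # column matching the rows
}
print(shapes)
```

A (2,) operand is treated like (1,2), so it lines up with the columns; a (2,1) operand has no compatible interpretation against three rows.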

If you maintain a strong opinion that broadcasts should be avoided, then exceptions should be thrown for df>b and df>b_r, too.

@jreback

Contributor

commented Aug 19, 2013

Some background: if the rhs is a pure numpy array, there is NO alignment done (as we would
just be guessing how to align), so it IS essentially numpy behavior, but errors are caught
and dealt with. In the current PR, I try the comparison, and if it fails with a broadcasting error
I try the transpose of the rhs, so you get the PR behavior (e.g. df>b_r works).
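
The "try, then retry with the transpose" idea can be sketched on plain NumPy arrays as follows. This is only an illustration of the described strategy, not the code in the PR; the function name `gt_with_transpose_fallback` is made up.

```python
import numpy as np


def gt_with_transpose_fallback(values, rhs):
    """Compare `values > rhs`; on a broadcasting error, retry with `rhs.T`.

    Illustrative sketch only. Silently transposing can hide shape mistakes,
    which is the concern raised elsewhere in this thread.
    """
    rhs = np.asarray(rhs)
    try:
        return values > rhs
    except ValueError:
        # Shapes could not be broadcast; try the transposed operand instead.
        return values > rhs.T


values = np.arange(6).reshape((3, 2))
b_c = np.atleast_2d([2, 2]).T  # shape (2, 1), not broadcastable against (3, 2)

fallback_result = gt_with_transpose_fallback(values, b_c)
print(fallback_result)
```

Here the (2,1) operand fails to broadcast, so the comparison is retried with its (1,2) transpose.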

If I turn off the broadcast catching, df>b_r doesn't error, because the default for an error is to pass it through (e.g. if you have an invalid column, say comparing strings and ints or whatever; this can be turned off but is normally on).

I think you would like to see df>b_r and df==b_r always return an error (whereas the others are ok
and return the same results)?

See the PR... I updated it to show what I suggested.

(Pdb) df>b
       0      1
0  False  False
1  False   True
2   True   True
(Pdb) df>b_r
      0     1
0  True  True
1  True  True
2  True  True
(Pdb) df>b_c
       0      1
0  False  False
1  False   True
2   True   True
(Pdb) df == b
       0      1
0  False  False
1   True  False
2  False  False
(Pdb) df == b_r
*** ValueError: Could not compare [array([[2, 2]])] with block values
(Pdb) df == b_c
       0      1
0  False  False
1   True  False
2  False  False
@jreback

Contributor

commented Aug 20, 2013

@mairas did you take a look at the PR? I believe it solves all of the open questions....

@ssalonen

Contributor Author

commented Aug 21, 2013

To me it looks like we are going in the right direction.

Since we should have numpy behaviour here (as no alignment is done), expected results should be same as with df.values > x?

df.values > b
Out[247]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

df.values > b_r
Out[248]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

Note the different result in the second example above.

Incompatible shapes should result in an exception, since broadcasting is not possible. Numpy does not do an automatic transpose in this case, which I think is a good thing.

df.values > b_c
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-249-4e61d2a85a75> in <module>()
----> 1 df.values > b_c

ValueError: operands could not be broadcast together with shapes (3,2) (2,1) 

The numpy implementation is a bit different for the == comparison; see the following examples:

df.values == b
Out[259]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

df.values == b_r
Out[260]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

df.values == b_c
Out[261]: False

Especially the final example is interesting; no exception is raised even though inequality comparison raises one.
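
One way to avoid the silent scalar `False` is to check broadcast compatibility before comparing. The sketch below is a hypothetical wrapper (`strict_eq` is not a NumPy or pandas function); note also that the scalar result shown above comes from numpy 1.6.1, and other NumPy versions may behave differently for this case.

```python
import numpy as np


def strict_eq(values, rhs):
    """Element-wise `==` that raises ValueError for non-broadcastable shapes."""
    values = np.asarray(values)
    rhs = np.asarray(rhs)
    # np.broadcast raises ValueError if the two shapes are incompatible.
    np.broadcast(values, rhs)
    return values == rhs


values = np.arange(6).reshape((3, 2))
b = np.array([2, 2])
b_c = np.atleast_2d([2, 2]).T  # shape (2, 1)

eq_result = strict_eq(values, b)

try:
    strict_eq(values, b_c)
    eq_raised = False
except ValueError:
    eq_raised = True

print(eq_result)
print(eq_raised)
```

This makes == fail in the same situations where > already fails.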

Examples with numpy 1.6.1

@jreback

Contributor

commented Aug 21, 2013

In [1]: df = DataFrame(np.arange(6).reshape((3,2)))

In [2]:  b = np.array([2, 2])

In [3]:  b_r = np.atleast_2d([2,2])

In [4]:  b_c = b_r.T

In [5]: df>b
Out[5]: 
       0      1
0  False  False
1  False   True
2   True   True

In [6]: df.values>b
Out[6]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [7]: df>b_r
Out[7]: 
       0      1
0  False  False
1  False   True
2   True   True

In [8]: df.values>b_r
Out[8]: 
array([[False, False],
       [False,  True],
       [ True,  True]], dtype=bool)

In [9]: df>b_c
ValueError: cannot broadcast shape [(3, 2)] with block values [(2, 1)]

In [10]: df.values>b_c
ValueError: operands could not be broadcast together with shapes (3,2) (2,1) 

In [11]: df == b
Out[11]: 
       0      1
0  False  False
1   True  False
2  False  False

In [12]: df.values == b
Out[12]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

In [13]: df==b_r
Out[13]: 
       0      1
0  False  False
1   True  False
2  False  False

In [14]: df.values==b_r
Out[14]: 
array([[False, False],
       [ True, False],
       [False, False]], dtype=bool)

Numpy does weird things like this (bottom example), but we will raise.
I believe it is actually a NotImplemented type, which is interpreted as False (but still very weird).

In [15]: df==b_c
ValueError: cannot broadcast shape [(3, 2)] with block values [(2, 1)]

In [16]: df.values==b_c
Out[16]: False
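
The agreed semantics for the > cases above can be condensed into a short check. This is a sketch, assuming a recent pandas in which these semantics still hold; the helper `raises_value_error` is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((3, 2)))
b = np.array([2, 2])          # shape (2,)
b_r = np.atleast_2d([2, 2])   # shape (1, 2)
b_c = b_r.T                   # shape (2, 1)


def raises_value_error(func):
    """Return True if calling `func` raises ValueError (hypothetical helper)."""
    try:
        func()
    except ValueError:
        return True
    return False


# For broadcastable operands, pandas and NumPy should give the same booleans.
matches = {
    name: bool(np.array_equal((df > rhs).to_numpy(), df.to_numpy() > rhs))
    for name, rhs in (("b", b), ("b_r", b_r))
}

# For the non-broadcastable column vector, both should raise.
pandas_raises = raises_value_error(lambda: df > b_c)
numpy_raises = raises_value_error(lambda: df.to_numpy() > b_c)

print(matches, pandas_raises, numpy_raises)
```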
@ssalonen

Contributor Author

commented Aug 22, 2013

I agree that pandas should raise with equals-operator.

The examples and pull request test cases did not include dataframe comparison to list/tuple. I believe they behave the same way as numpy 1D array, right?
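
A quick way to check the list/tuple question, sketched under the assumption of a recent pandas: a list or tuple whose length matches the number of columns should compare like the equivalent 1D array, while one of the wrong length should raise rather than be interpreted as a column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((3, 2)))
reference = (df > np.array([2, 2])).to_numpy()

# A list or tuple of length 2 (one value per column) should match the array case.
same_as_array = all(
    np.array_equal((df > rhs).to_numpy(), reference)
    for rhs in ([2, 2], (2, 2))
)

# Length 3 matches the number of rows, not columns, so it is expected to fail.
try:
    df > [2, 2, 2]
    wrong_length_raises = False
except ValueError:
    wrong_length_raises = True

print(same_as_array, wrong_length_raises)
```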

@jreback

Contributor

commented Aug 22, 2013

The examples are there now (in the PR page)

@jreback

Contributor

commented Aug 23, 2013

@mairas @ssalonen what do you think, any more cases?

@mairas

commented Aug 23, 2013

Sorry for not replying earlier - saw shiny things elsewhere. :-)

The semantics look great to me now! Thanks for bearing with us! :-)

Cheers,

ma.


@sinhrks

Member

commented Jul 16, 2016

The behavior may change after #13637. Please comment if you have any thoughts.
