Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

Closed
dhirschfeld opened this Issue Dec 1, 2017 · 4 comments

Comments

Projects
None yet
2 participants
@dhirschfeld
Copy link
Contributor

dhirschfeld commented Dec 1, 2017

According to my understanding of how boolean indexing should work the below test should pass:

import numpy as np
from numpy.random import randn
import pandas as pd

df = pd.DataFrame(randn(10, 5))
df1 = df.copy(); df2 = df.copy()
INVALID = (df < -1)|(df > 1)
df1[INVALID.values] = NaN
df2[INVALID] = NaN
assert df1.equals(df2)

i.e. it shouldn't matter whether or not the boolean mask is a DataFrame or a plain ndarray

Problem description

As can be seen in the picture below, when indexing a DataFrame with a boolean ndarray, rather than
setting the individual True elements entire rows are being set if any value in the row is True.

i.e. what is happening is the result I would expect from df[INVALID.any(axis=1)] = NaN rather than df[INVALID] = NaN

image

This is different from how indexing an ndarray works and to me at least is very surprising and entirely unexpected. This behaviour was the cause of a very subtle bug in my code :(

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.7.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: 2.5.0b1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 1, 2017

the 2d ndarray input hits a path where a 1d array is expected, so this is a bug

Idiomatically and what the DataFrame input hits is .where()

In [1]: df = pd.DataFrame(np.random.randn(10, 5))
   ...: INVALID = (df < -1)|(df > 1)

In [3]: df.where(INVALID.values)
Out[3]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

In [4]: df.where(INVALID)
Out[4]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

Here's a patch to implement the correct conversion (note might fail some existing tests). Alternatively we could raise

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index d3561f8a0..82d9375e5 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2524,10 +2524,10 @@ class DataFrame(NDFrame):
         if indexer is not None:
             return self._setitem_slice(indexer, value)
 
-        if isinstance(key, (Series, np.ndarray, list, Index)):
-            self._setitem_array(key, value)
-        elif isinstance(key, DataFrame):
+        if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
             self._setitem_frame(key, value)
+        elif isinstance(key, (Series, np.ndarray, list, Index)):
+            self._setitem_array(key, value)
         else:
             # set column
             self._set_item(key, value)
@@ -2560,8 +2560,16 @@ class DataFrame(NDFrame):
     def _setitem_frame(self, key, value):
         # support boolean setting with DataFrame input, e.g.
         # df[df > df2] = 0
+
+        if isinstance(key, np.ndarray):
+            if key.shape != self.shape:
+                raise ValueError('Array conditional must be same shape as '
+                                 'self')
+            key = self._constructor(key, **self._construct_axes_dict())
+
         if key.values.size and not is_bool_dtype(key.values):
-            raise TypeError('Must pass DataFrame with boolean values only')
+            raise TypeError('Must pass DataFrame or 2-d ndarray'
+                            'with boolean values only')
 
         self._check_inplace_setting(value)
         self._check_setitem_copy()
@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 1, 2017

would love a PR!

@jreback jreback added this to the Next Major Release milestone Dec 1, 2017

@jreback jreback changed the title BUG: Boolean ndarray indexing gives incorrect results? BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results Dec 1, 2017

@dhirschfeld

This comment has been minimized.

Copy link
Contributor Author

dhirschfeld commented Dec 1, 2017

would love a PR!

Ha! Seems to be my weekend for getting stuck into pandas so I'll give it a crack...

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 1, 2017

thanks @dhirschfeld

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

@jreback jreback modified the milestones: Next Major Release, 0.22.0, 0.21.1 Dec 9, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.