DataFrame.groupby().std() fails on filtered DataFrame #16174

edhalter · 2017-04-29T18:35:13Z

Code Sample, a copy-pastable example if possible

from pandas import DataFrame

dicts = [{'filter_col': False, 'groupby_col': True, 'bool_col': True, 'float_col': 10.5},
         {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 20.5},
         {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 30.5}]
df = DataFrame(dicts)
df_filter = df[df['filter_col'] == True]
dfgb = df_filter.groupby('groupby_col')
dfgb.std()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/site-packages/pandas/core/groupby.py", line 1055, in std
    return np.sqrt(self.var(ddof=ddof))
AttributeError: 'bool' object has no attribute 'sqrt'

Problem description

Required elements for the error to appear are:

groupby() is applied to a filtered DataFrame, not an original DataFrame
std(), not another aggregate function (e.g. mean()), is called on the DataFrameGroupBy object
the DataFrame contains a column of type bool
there are at least 2 rows with the same value of the .groupby() column (here, 'groupby_col')

In my more-complicated real-world data where I ran into the error, I would also see an Exception complaining about type float:

AttributeError: 'float' object has no attribute 'sqrt'

However, even in that case, deleting the bool column would resolve the issue.

Presumably I'll be able to work around the issue by calling .std() on individual columns of the DataFrameGroupBy object, but it seems like pandas should be able to handle this case without choking.
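The per-column workaround mentioned above might look like the following. This is a minimal sketch, not verified against pandas 0.19.1; it rebuilds the sample frame from the report and selects only the float column, so the bool columns never reach std().

```python
import pandas as pd

df = pd.DataFrame(
    {
        "filter_col": [False, True, True],
        "groupby_col": [True, True, True],
        "bool_col": [True, True, True],
        "float_col": [10.5, 20.5, 30.5],
    }
)
df_filter = df[df["filter_col"]]

# Aggregate a single float column instead of the whole frame, so std()
# only ever sees float64 data.
float_std = df_filter.groupby("groupby_col")["float_col"].std()
print(float_std)
```

The result is a Series indexed by 'groupby_col' rather than a DataFrame, so the bool columns would have to be aggregated separately if their standard deviation is needed.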

Expected Output

             bool_col  filter_col  float_col
groupby_col                                 
True              0.0         0.0    7.07107

Output of `pd.show_versions()`

INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 7.1.2
setuptools: 30.4.0
Cython: 0.25.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: None
pandas_datareader: None

jreback · 2017-05-01T11:50:11Z

we exclude non-numeric columns in aggregations. however, bool is valid for some.

In [8]: df.groupby('groupby_col').sum()
Out[8]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              3.0         2.0       61.5

In [9]: df.groupby('groupby_col').mean()
Out[9]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              1.0    0.666667       20.5

In [10]: df.dtypes
Out[10]: 
bool_col          bool
filter_col        bool
float_col      float64
groupby_col       bool
dtype: object

so we could fix generally, by simply astyping bool columns (we already cast certain columns for computation anyhow), or could pull back and remove bool from numeric aggregations like sum/mean.

@TomAugspurger
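As a user-side illustration of the "astype the bool columns" idea, here is a minimal sketch that casts bool columns to float64 before grouping. It is not the internal pandas change being discussed; the frame and the names bool_cols and df_cast are made up for the example. The grouping key is an integer column here, because casting a bool key (as in the original report) would also change the group labels.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1.0, 1.0, 2.0]})

# Find the bool columns and cast them to float64 up front, so every
# aggregated column is already a float dtype.
bool_cols = df.select_dtypes(include="bool").columns
df_cast = df.astype({col: "float64" for col in bool_cols})

result = df_cast.groupby("A").std()
print(result)
```

After the cast, the zero-variance column "B" is aggregated as ordinary float data.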

TomAugspurger · 2017-05-01T13:07:55Z

sqrt and var can also make sense for booleans, but we seem to fail when the column being aggregated has no variance.

In [5]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-375645985fa5> in <module>()
----> 1 pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()

/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py in std(self, ddof, *args, **kwargs)
   1080         # TODO: implement at Cython level?
   1081         nv.validate_groupby_func('std', args, kwargs)
-> 1082         return np.sqrt(self.var(ddof=ddof, **kwargs))
   1083
   1084     @Substitution(name='groupby')

AttributeError: 'bool' object has no attribute 'sqrt'

In [6]: pd.DataFrame({"A": [1, 1, 2], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
Out[6]:
     B    C
A
1  0.0  0.0
2  NaN  NaN


In [7]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, False], "C": [1, 1, 1]}).groupby("A").std()
Out[7]:
         B    C
A
1  0.57735  0.0

TomAugspurger · 2017-05-01T13:11:04Z

Really, the underlying issue is probably unrelated to groupby.

In [45]: pd.DataFrame({"A": [1, 1, 1, 1], "B": [True, True, True, True], "C": [1, 1, 1, 2]}).groupby("A").var()
Out[45]:
       B     C
A
1  False  0.25

Should the B column there be 0, not False? That'd be consistent with numpy:

In [46]: np.var([1, 1, 1, 1])
Out[46]: 0.0

TomAugspurger · 2017-05-01T13:12:18Z

Whoops, still had a groupby in there. My bad, so it is related to groupby. We do handle the regular case correctly. Still, the issue is that var on a column with 0 variance returns False instead of 0.
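One plausible reading of why a False variance breaks std() can be reproduced with NumPy alone. The sketch below assumes the False ends up as a Python bool inside an object-dtype array, which the thread does not confirm. On object arrays np.sqrt looks for a sqrt method on each element, and bool has none.

```python
import numpy as np

# A "variance" stored as a Python bool inside an object array,
# mimicking the False that var() returned above (an assumption).
var_as_object = np.array([False], dtype=object)

try:
    np.sqrt(var_as_object)
    sqrt_failed = False
except (AttributeError, TypeError) as exc:
    # Older NumPy raises AttributeError here; newer releases raise TypeError.
    sqrt_failed = True
    print(type(exc).__name__, exc)

# With a float variance the same call succeeds.
sqrt_of_float = np.sqrt(np.array([0.0]))
print(sqrt_of_float)
```

If that reading is right, returning 0.0 instead of False from var would be enough for the np.sqrt call in std() to go through.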

simonjayhawkins · 2020-08-09T18:57:30Z

master is giving the expected output

>>> pd.__version__
'1.2.0.dev0+67.gaefae55e1'
>>>
>>> dicts = [
...     {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
... ]
>>> df = pd.DataFrame(dicts)
>>>
>>> df_filter = df[df["filter_col"] == True]
>>> dfgb = df_filter.groupby("groupby_col")
>>> dfgb.std()
             filter_col  bool_col  float_col
groupby_col
True                0.0       0.0   7.071068
>>>

could use a test.
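A regression test for this could look roughly like the sketch below. It is an illustration only, not necessarily the test that was eventually merged; the function name check_std_on_filtered_frame is hypothetical, and the expected values are taken from the output shown above.

```python
import numpy as np
import pandas as pd


def check_std_on_filtered_frame():
    # GH 16174: std() on a filtered frame with zero-variance bool columns.
    df = pd.DataFrame(
        {
            "filter_col": [False, True, True],
            "groupby_col": [True, True, True],
            "bool_col": [True, True, True],
            "float_col": [10.5, 20.5, 30.5],
        }
    )
    df_filter = df[df["filter_col"]]
    result = df_filter.groupby("groupby_col").std()

    expected = pd.DataFrame(
        {
            "filter_col": [0.0],
            "bool_col": [0.0],
            # Sample std of [20.5, 30.5] with ddof=1 is sqrt(50).
            "float_col": [np.sqrt(50.0)],
        },
        index=pd.Index([True], name="groupby_col"),
    )
    pd.testing.assert_frame_equal(result, expected)


check_std_on_filtered_frame()
```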

jreback added the Dtype Conversions, Groupby, and Numeric Operations labels May 1, 2017

jreback added this to the Next Major Release milestone May 1, 2017

simonjayhawkins added the good first issue and Needs Tests labels and removed the Dtype Conversions, Groupby, and Numeric Operations labels Aug 9, 2020

mroeschke mentioned this issue May 21, 2021: TST: Old issues #41607 (merged)

jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021

jreback closed this as completed in #41607 May 24, 2021
