DataFrame.groupby().std() fails on filtered DataFrame #16174

Closed
edhalter opened this issue Apr 29, 2017 · 5 comments · Fixed by #41607
Labels
good first issue, Needs Tests (Unit test(s) needed to prevent regressions)

@edhalter

Code Sample, a copy-pastable example if possible

from pandas import DataFrame

dicts = [
    {'filter_col': False, 'groupby_col': True, 'bool_col': True, 'float_col': 10.5},
    {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 20.5},
    {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 30.5},
]
df = DataFrame(dicts)
df_filter = df[df['filter_col'] == True]  # keep only rows where filter_col is True
dfgb = df_filter.groupby('groupby_col')
dfgb.std()  # raises the AttributeError shown below
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/site-packages/pandas/core/groupby.py", line 1055, in std
    return np.sqrt(self.var(ddof=ddof))
AttributeError: 'bool' object has no attribute 'sqrt'

Problem description

Required elements for the error to appear are:

  • groupby() is applied to a filtered DataFrame, not an original DataFrame
  • std(), not another aggregate function (e.g. mean()), is called on the DataFrameGroupBy object
  • the DataFrame contains a column of type bool
  • there are at least 2 rows with the same value of the .groupby() column (here, 'groupby_col')

In my more-complicated real-world data where I ran into the error, I would also see an Exception complaining about type float:

AttributeError: 'float' object has no attribute 'sqrt'

However, even in that case, deleting the bool column would resolve the issue.

Presumably I'll be able to work around the issue by calling .std() on individual columns of the DataFrameGroupBy object, but it seems like pandas should be able to handle this case without failing.
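
The per-column workaround mentioned above might look like the following minimal sketch. It reuses the frame from the code sample and assumes that selecting only float_col before calling std() keeps the bool columns out of the aggregation. It was not run against pandas 0.19.1, so treat it as illustrative rather than confirmed.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
        {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
        {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
    ]
)
df_filter = df[df["filter_col"]]

# Select the single float column first so the bool columns never reach std().
float_std = df_filter.groupby("groupby_col")["float_col"].std()
print(float_std)
```

The result is a Series indexed by groupby_col rather than a DataFrame, so this only helps when the bool columns' standard deviations are not needed.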

Expected Output

             bool_col  filter_col  float_col
groupby_col
True              0.0         0.0    7.07107

Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 7.1.2
setuptools: 30.4.0
Cython: 0.25.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented May 1, 2017

We exclude non-numeric columns in aggregations. However, bool is valid for some of them.

In [8]: df.groupby('groupby_col').sum()
Out[8]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              3.0         2.0       61.5

In [9]: df.groupby('groupby_col').mean()
Out[9]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              1.0    0.666667       20.5

In [10]: df.dtypes
Out[10]: 
bool_col          bool
filter_col        bool
float_col      float64
groupby_col       bool
dtype: object

So we could fix this generally by simply casting bool columns with astype (we already cast certain columns for computation anyhow), or we could pull back and remove bool from numeric aggregations like sum/mean.
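
As a rough user-side illustration of the first option, the sketch below casts the bool value columns to float64 before aggregating. It is only an analogue of the idea: the proposal above concerns pandas internals, and the choice to leave the grouping key uncast is an assumption made here for readability.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
        {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
        {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
    ]
)

# Cast every bool column except the grouping key to float64. This is a
# user-level stand-in for the internal cast, not code taken from pandas.
bool_cols = [c for c in df.select_dtypes(include="bool").columns if c != "groupby_col"]
df_cast = df.astype({c: "float64" for c in bool_cols})

result = df_cast.groupby("groupby_col").std()
print(result)
```

With the values cast up front, std() only ever sees float columns, so the result cannot carry a Python bool into np.sqrt.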

@TomAugspurger

@jreback added the Dtype Conversions (Unexpected or buggy dtype conversions), Groupby, and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on May 1, 2017
@jreback added this to the Next Major Release milestone on May 1, 2017
@TomAugspurger
Contributor

TomAugspurger commented May 1, 2017

sqrt and var can also make sense for booleans, but we seem to fail when the column being aggregated has no variance.

In [5]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-375645985fa5> in <module>()
----> 1 pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()

/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py in std(self, ddof, *args, **kwargs)
   1080         # TODO: implement at Cython level?
   1081         nv.validate_groupby_func('std', args, kwargs)
-> 1082         return np.sqrt(self.var(ddof=ddof, **kwargs))
   1083
   1084     @Substitution(name='groupby')

AttributeError: 'bool' object has no attribute 'sqrt'

In [6]: pd.DataFrame({"A": [1, 1, 2], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
Out[6]:
     B    C
A
1  0.0  0.0
2  NaN  NaN


In [7]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, False], "C": [1, 1, 1]}).groupby("A").std()
Out[7]:
         B    C
A
1  0.57735  0.0

@TomAugspurger
Contributor

Really, the underlying issue is probably unrelated to groupby.

In [45]: pd.DataFrame({"A": [1, 1, 1, 1], "B": [True, True, True, True], "C": [1, 1, 1, 2]}).groupby("A").var()
Out[45]:
       B     C
A
1  False  0.25

Should the B column there be 0, not False? That would be consistent with numpy:

In [46]: np.var([1, 1, 1, 1])
Out[46]: 0.0

@TomAugspurger
Contributor

Whoops, still had a groupby in there. My bad, so it is related to groupby. We do handle the regular case correctly. Still, the issue is that var on a column with 0 variance returns False instead of 0.
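
A minimal sketch of why a False variance would break the later np.sqrt call. It assumes the var result ends up in an object-dtype array holding the Python value False, which the thread does not state explicitly. The helper name sqrt_raises is hypothetical.

```python
import numpy as np


def sqrt_raises(values):
    """Return True if np.sqrt rejects the given array."""
    try:
        np.sqrt(values)
    except (AttributeError, TypeError):
        # The exception type depends on the NumPy version: the traceback in
        # this thread shows AttributeError, newer releases raise TypeError.
        return True
    return False


# A variance stored as the Python object False (object dtype), as in Out[45].
as_object = np.array([False], dtype=object)
# The same variance stored as a float, which is what np.var returns.
as_float = np.array([0.0])

print(sqrt_raises(as_object))
print(sqrt_raises(as_float))
```

For object arrays NumPy looks for a sqrt method on each element, which would explain the "'bool' object has no attribute 'sqrt'" message in the tracebacks above.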

@simonjayhawkins
Member

master is giving the expected output

>>> pd.__version__
'1.2.0.dev0+67.gaefae55e1'
>>>
>>> dicts = [
...     {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
... ]
>>> df = pd.DataFrame(dicts)
>>>
>>> df_filter = df[df["filter_col"] == True]
>>> dfgb = df_filter.groupby("groupby_col")
>>> dfgb.std()
             filter_col  bool_col  float_col
groupby_col
True                0.0       0.0   7.071068
>>>

could use a test.
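
One possible shape for such a test, as a sketch only: the function name is hypothetical, and this is not the test that was eventually added. It simply asserts the values shown in the output above.

```python
import pandas as pd


def test_groupby_std_on_filtered_frame_with_bool_columns():
    df = pd.DataFrame(
        [
            {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
            {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
            {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
        ]
    )
    df_filter = df[df["filter_col"]]

    result = df_filter.groupby("groupby_col").std()

    # Both bool columns are constant within the group, so their std is 0.
    assert result["filter_col"].iloc[0] == 0.0
    assert result["bool_col"].iloc[0] == 0.0
    # std of [20.5, 30.5] with ddof=1 is sqrt(50), roughly 7.071068.
    assert abs(result["float_col"].iloc[0] - 50 ** 0.5) < 1e-9


test_groupby_std_on_filtered_frame_with_bool_columns()
```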

@simonjayhawkins added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Dtype Conversions, Groupby, and Numeric Operations labels on Aug 9, 2020
@mroeschke mentioned this issue on May 21, 2021
@jreback modified the milestones: Contributions Welcome, 1.3 on May 21, 2021