BUG: STD modifies groupby target column when as_index=False #10355

Closed
jxrossel opened this issue Jun 15, 2015 · 11 comments · Fixed by #33630

@jxrossel

jxrossel commented Jun 15, 2015

xref #14547 for other tests

In pandas 0.16.2 (and already in 0.16.0), aggregating with std() after groupby('my_column', as_index=False) modifies 'my_column' by taking its square root. Example:

import pandas

df = pandas.DataFrame({
    'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
df.groupby('a', as_index=False).std()
Out[5]: 
          a  b
0  1.000000  1
1  1.414214  1
2  1.732051  1

The square roots of the values in 'a' are returned instead of 1, 2, 3.
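
The numbers in the output above are consistent with a square root being taken over the whole variance result, grouping column included. That is an assumption about the cause, not something stated in this report. A minimal sketch that reproduces the same numbers by hand and contrasts them with a values-only square root (the names `var`, `whole_frame_sqrt`, and `values_only` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# With as_index=False the grouping key 'a' is an ordinary column of the result.
var = df.groupby('a', as_index=False).var()

# Assumed failure mode: sqrt over every column, so the key is transformed too.
whole_frame_sqrt = np.sqrt(var)

# Values-only alternative: leave the key alone, sqrt just the aggregated columns.
value_cols = [c for c in var.columns if c != 'a']
values_only = var.copy()
values_only[value_cols] = np.sqrt(var[value_cols])

print(whole_frame_sqrt)
print(values_only)
```

If the assumption holds, the first frame shows the reported values for 'a' and the second keeps 1, 2, 3.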

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr_CH

pandas: 0.16.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.0.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.7.1
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jun 15, 2015

Something like this would fix it. Care to do a pull request (and add some tests)?
(The other function definition should be removed as well.)

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 4abdd11..3fd2436 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1812,6 +1812,10 @@ class BinGrouper(BaseGrouper):
         'min': 'group_min_bin',
         'max': 'group_max_bin',
         'var': 'group_var_bin',
+        'std': {
+            'name' : 'group_var_bin',
+            'f' : lambda func, a: np.sqrt(func(a)),
+            },
         'ohlc': 'group_ohlc',
         'first': {
             'name': 'group_nth_bin',
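
The idea in the diff above, as far as it can be read from the hunk alone, is to express 'std' as the variance kernel plus a post-processing step that touches only the aggregated values. The following is a rough sketch of that shape in plain NumPy. It is not the pandas implementation: `group_var`, `KERNELS`, `DISPATCH`, and `aggregate` are hypothetical stand-ins written for illustration.

```python
import numpy as np

def group_var(values, labels, ddof=1):
    """Per-group sample variance (hypothetical stand-in for a compiled kernel)."""
    return np.array([values[labels == g].var(ddof=ddof)
                     for g in np.unique(labels)])

KERNELS = {'group_var': group_var}

# A plain string names a kernel; a dict names a kernel plus a wrapper 'f'
# that receives the kernel and the values, mirroring the entry in the diff.
DISPATCH = {
    'var': 'group_var',
    'std': {
        'name': 'group_var',
        'f': lambda func, a: np.sqrt(func(a)),
    },
}

def aggregate(how, values, labels):
    entry = DISPATCH[how]
    if isinstance(entry, dict):
        kernel = KERNELS[entry['name']]
        # Only the value array passes through 'f'; the labels are never transformed.
        return entry['f'](lambda a: kernel(a, labels), values)
    return KERNELS[entry](values, labels)

keys = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

std_b = aggregate('std', b, keys)
print(np.unique(keys), std_b)
```

Because the square root is applied inside the per-column aggregation, the group keys stay as they were.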

@jreback jreback added this to the 0.17.0 milestone Jun 15, 2015
@jxrossel
Author

Hi,
I'm kind of a newbie here (and in Python in general). What do you mean by a pull request?

@jreback
Contributor

jreback commented Jun 15, 2015

see contributing docs here

@jxrossel
Author

Whoa, I didn't consider becoming a code contributor when I reported the bug. I don't think I would be the right person for that. I'd probably do more harm than good.

@jreback
Contributor

jreback commented Jun 15, 2015

best way to start! give it a shot.

@jorisvandenbossche
Member

Another case where it raises an error (when grouping by non-numerical columns): #16799

@alohia

alohia commented Mar 12, 2018

This issue still exists in pandas 0.22. Calling std() after groupby also applies std() to the column being grouped by, and raises an error if that column holds strings, for example. This happens when as_index=False is passed to groupby(). How can I contribute to fixing this issue?

@jreback
Contributor

jreback commented Mar 12, 2018

I posted a patch above that might work; it needs tests. See the contributing docs here: http://pandas-docs.github.io/pandas-docs-travis/contributing.html

@TakaakiFuruse

TakaakiFuruse commented Mar 31, 2018

This code raises "ValueError: cannot insert a, already exists" on pandas 0.22 with Python 3.6.4.
(I also tried master and got the same error.)

import pandas as pd
df = pd.DataFrame({
               'a' : [1,1,1,2,2,2,3,3,3],
               'b' : [1,2,3,4,5,6,7,8,9],
})
df.groupby('a', as_index=False).agg({'a': 'count'})

Do you think the root cause is the same?
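
The error message suggests a label collision: the grouping key and the aggregated column would both be inserted as 'a'. One way to sidestep the collision, sketched here as a possible workaround rather than a fix for the underlying behavior, is to aggregate with the key in the index and give the result its own name when moving the key back to a column (the column name 'count' is chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Count rows per group on the selected Series, then rename the values
# while resetting the index so 'a' (key) and 'count' (result) never clash.
counts = df.groupby('a')['a'].count().reset_index(name='count')
print(counts)
```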

Output of pd.show_versions() is...

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ja_JP.UTF-8
LOCALE: ja_JP.UTF-8

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TakaakiFuruse

Regarding #10355 (comment): since I found another similar behavior, I have created a new issue, #20566.

@mukherjees

I can confirm the same problem as reported by @TakaakiFuruse (see two posts above), with pandas 0.22 and Python 3.6.2. With the same example DataFrame as his, the describe() method applied to the groupby object shows the correct results in the std columns for both a and b:
df.groupby('a', as_index=False).describe()

      a                                        b
  count mean  std  min  25%  50%  75%  max count mean  std  min  25%  50%  75%  max
0   3.0  1.0  0.0  1.0  1.0  1.0  1.0  1.0   3.0  2.0  1.0  1.0  1.5  2.0  2.5  3.0
1   3.0  2.0  0.0  2.0  2.0  2.0  2.0  2.0   3.0  5.0  1.0  4.0  4.5  5.0  5.5  6.0
2   3.0  3.0  0.0  3.0  3.0  3.0  3.0  3.0   3.0  8.0  1.0  7.0  7.5  8.0  8.5  9.0

However, applying std() directly to the groupby object gives the wrong result for a:
df.groupby('a', as_index=False).std()

                    a    b
0                 1.0  1.0
1  1.4142135623730951  1.0
2  1.7320508075688772  1.0

Clearly, std() is not the same as .apply(np.std, ddof=1) [even though I thought they were equivalent], because the latter again gives the right answer for both a and b:
df.groupby('a', as_index=False).apply(np.std, ddof=1)

     a    b
0  0.0  1.0
1  0.0  1.0
2  0.0  1.0
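
The comparison above can be narrowed to the value column alone, which separates "is the standard deviation itself right?" from "what happens to the key?". A minimal sketch, assuming that selecting 'b' before aggregating keeps 'a' out of the computation and that reset_index() is an acceptable way to get 'a' back as a column (the names `builtin`, `manual`, and `result` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Built-in aggregation on the value column only; 'a' lives in the index.
builtin = df.groupby('a')['b'].std()

# The same quantity computed per group with NumPy's sample std (ddof=1).
manual = df.groupby('a')['b'].apply(lambda s: np.std(s.values, ddof=1))

# Move the untouched key back into a regular column.
result = builtin.reset_index()
print(result)
```

Here the two computations agree on 'b', and 'a' comes back as the original group labels because it was never part of the aggregated data.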

@xieyuheng xieyuheng mentioned this issue Feb 14, 2019
alexcwatt added a commit to alexcwatt/pandas that referenced this issue Apr 27, 2019
alexcwatt added a commit to alexcwatt/pandas that referenced this issue Apr 27, 2019
alexcwatt added a commit to alexcwatt/pandas that referenced this issue May 7, 2019
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 May 7, 2019
@jreback jreback modified the milestones: 0.25.0, Contributions Welcome Jul 3, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 19, 2020