Groupby behaves differently when using levels and list of column keys #9344

Closed
dmsul opened this issue Jan 23, 2015 · 2 comments · Fixed by #9177

Comments


dmsul commented Jan 23, 2015

When grouping by several levels of a MultiIndex, groupby evaluates all possible combinations of the groupby keys. When grouping by column name, it only evaluates the combinations that exist in the DataFrame. This behavior does not exist in 0.14.1, but does in all final releases from 0.15.0 on.

This may be a new feature, not a bug, but I couldn't find anything in the docs, open or closed issues, etc. (closest was Issue #8138). If this is the intended behavior, it would be nice to have in the docs.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(-1, 3))
df.index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (3, 4), (5, 6)])
idx_names = ['x', 'y']
df.index.names = idx_names

# Adds NaN rows for (x, y) combinations that aren't in the data
by_levels = df.groupby(level=idx_names).mean()

# This does not add missing combinations of the groupby keys
by_columns = df.reset_index().groupby(idx_names).mean()

print(by_levels)
print(by_columns)

# This passes in 0.14.1, but not >=0.15.0 final
assert by_levels.equals(by_columns)

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.4
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.7.0.dev-161a0f8
IPython: 2.3.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None


dmsul commented Jan 24, 2015

The change in behavior happened in ea0a13c, probably the new _reindex_output function. I understand this is the desired behavior for categoricals, but is it also desired for any groupby on more than one level of a MultiIndex?
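
For comparison, here is a minimal sketch of the categorical case mentioned above. It assumes a later pandas release than the ones discussed in this thread, one that has the `observed` keyword on `groupby`; the column names `x`, `y`, `v` and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: the same (x, y) pairs as the example above,
# but stored as categorical columns instead of index levels.
df = pd.DataFrame({
    'x': pd.Categorical([1, 1, 3, 5]),
    'y': pd.Categorical([1, 2, 4, 6]),
    'v': [0, 3, 6, 9],
})

# observed=False keeps every combination of categories (3 x 4 = 12 groups);
# combinations that never occur in the data get NaN for the mean.
all_combos = df.groupby(['x', 'y'], observed=False)['v'].mean()

# observed=True keeps only the combinations present in the data.
seen_only = df.groupby(['x', 'y'], observed=True)['v'].mean()

print(len(all_combos), len(seen_only))
```

The question in this issue is whether the first form should ever apply to plain (non-categorical) MultiIndex levels.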


jreback commented Feb 4, 2015

this is related to #9177

I think that was an unintended change: by default, a multi-indexed groupby should not reindex to the cartesian product of the levels (i.e. what a categorical does).
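
If the cartesian product is actually wanted, one way to opt in explicitly is to reindex the grouped result. This is only a sketch, not what the fix in #9177 does; it assumes a pandas version where grouping by levels returns just the observed pairs, and the names `observed`, `full_index` and `expanded` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3))
df.index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (3, 4), (5, 6)], names=['x', 'y'])

# Group on the (x, y) pairs that occur in the data.
observed = df.groupby(level=['x', 'y']).mean()

# Build the full cartesian product of the unique level values ...
full_index = pd.MultiIndex.from_product(
    [observed.index.levels[0], observed.index.levels[1]],
    names=['x', 'y'])

# ... and reindex onto it; missing combinations become all-NaN rows.
expanded = observed.reindex(full_index)

print(observed.shape, expanded.shape)
```

Keeping the expansion as a separate `reindex` step makes the NaN rows an explicit choice rather than a side effect of how the keys were passed to groupby.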
