Description
Code Sample, a copy-pastable example if possible
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'a': list(range(10))},
...     index=pd.MultiIndex.from_arrays(
...         [[0,1,0,1,0,1,0,1,0,1], [0,1,2,0,1,2,0,1,2,0]],
...         names=['l1', 'l2'])
... )
>>> df
       a
l1 l2
0  0   0
1  1   1
0  2   2
1  0   3
0  1   4
1  2   5
0  0   6
1  1   7
0  2   8
1  0   9
# explicitly groupby on level names or indices
>>> df.groupby(['l1', 'l2']).sum() # or df.groupby(level=list(range(df.index.nlevels))).sum()
        a
l1 l2
0  0    6
   1    4
   2   10
1  0   12
   1    8
   2    5
# groupby on the multi index itself
# instead of a MultiIndex DataFrame,
# returns a single-level-indexed DataFrame with tuples in the index
>>> df.groupby(df.index).sum()
         a
(0, 0)   6
(0, 1)   4
(0, 2)  10
(1, 0)  12
(1, 1)   8
(1, 2)   5
Problem description
When you group a DataFrame whose index is a MultiIndex on that index itself, the aggregated result is a DataFrame with a single-level index whose entries are the tuples from the original MultiIndex. This is less useful than what you get by passing the level names to df.groupby, which returns a DataFrame that keeps the original MultiIndex levels and names.
Expected Output
When df has a MultiIndex, df.groupby(df.index) should be identical to df.groupby(level=list(range(df.index.nlevels))) (or df.groupby(df.index.names) in the event that all of df's index levels are named).
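To make the requested equivalence concrete, here is a minimal sketch of a possible workaround: it groups on the index object and then rebuilds the MultiIndex by hand so that the result lines up with the level-based grouping. The variable names by_level and by_index are made up for illustration, and the rebuild step is an assumption about how one might paper over the difference, not something pandas does for you.

```python
import pandas as pd

# Same frame as in the code sample above.
df = pd.DataFrame(
    {"a": list(range(10))},
    index=pd.MultiIndex.from_arrays(
        [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]],
        names=["l1", "l2"],
    ),
)

# Reference result: grouping by every index level keeps the MultiIndex.
by_level = df.groupby(level=list(range(df.index.nlevels))).sum()

# Grouping on the index object itself; on pandas 0.23.4 the result index
# holds plain tuples instead of MultiIndex levels.
by_index = df.groupby(df.index).sum()

# Workaround (an assumption, not a pandas feature): rebuild the MultiIndex
# from the group keys and restore the original level names.
by_index.index = pd.MultiIndex.from_tuples(
    list(by_index.index), names=df.index.names
)

# After the rebuild, both results should share the same index.
print(by_index.index.equals(by_level.index))
```

The rebuild is written so that it also works if a given pandas version already returns a MultiIndex here, since iterating a MultiIndex yields tuples as well.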
Output of pd.show_versions()
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None