BUG: df.groupby(sort=False) sorts multi-index-frames #17537

Closed
MaximilianKoestler opened this issue Sep 15, 2017 · 4 comments · Fixed by #17621

MaximilianKoestler commented Sep 15, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame([
        [4, 2, 'x'],
        [3, 1, 'y'],
    ],
    columns=['A','B','C']).set_index(['A', 'B'])

print(df)
# Consider this DataFrame:
#
# >       C
# > A B
# > 4 2  x
# > 3 1  y

# Iterating over the groups preserves the original order
# if both levels of the multi-index are used for grouping.
for idx, group in df.groupby(level=[0, 1], sort=False):
    print(idx)
# > (4, 2)
# > (3, 1)

# However, grouping by only one level
# sorts the groups, even though sort=False.
for idx, group in df.groupby(level=0, sort=False):
    print(idx)
# > 3
# > 4

# If the DataFrame has a single-level index,
# the original order is preserved as expected.
df2 = pd.DataFrame([
        [4, 2, 'x'],
        [3, 1, 'y'],
    ],
    columns=['A','B','C']).set_index(['A'])

print(df2)
# >     B  C
# > A
# > 4  2  x
# > 3  1  y

for idx, group in df2.groupby(level=0, sort=False):
    print(idx)
# > 4
# > 3

Problem description

DataFrame.groupby() has a sort parameter that selects whether the result should be sorted by group key.

However, if the DataFrame has a multi-index and the grouping uses only one of its levels, the result is sorted regardless of the value of sort.
Grouping by more than one level works.
Passing the single level as a list ([0]) does not fix the problem.
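
A possible workaround, sketched here and not taken from the report itself: pass the values of the level as an explicit key array instead of using level=. This assumes that grouping by an array of keys honours sort=False, which is the behaviour seen in the single-index case above.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 2, 'x'], [3, 1, 'y']],
    columns=['A', 'B', 'C'],
).set_index(['A', 'B'])

# Assumption: an explicit key array is not routed through the level-based
# code path, so sort=False keeps the order of first appearance.
keys = df.index.get_level_values('A')

# Collect the group keys in iteration order, converted to plain ints.
order = [int(idx) for idx, _ in df.groupby(keys, sort=False)]
print(order)
```

The resulting groups are keyed by the values of level A only, so this is meant as a stopgap for iteration order, not as a replacement for a fix in groupby itself.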

Expected Output

for idx, group in df.groupby(level=0, sort=False):
    print(idx)

should yield

4
3
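
The expected behaviour can also be written as a small check. This is only a sketch: group_keys is a hypothetical helper introduced for illustration, not part of pandas.

```python
import pandas as pd


def group_keys(frame, **groupby_kwargs):
    """Return the group keys in iteration order (hypothetical helper)."""
    return [int(idx) for idx, _ in frame.groupby(**groupby_kwargs)]


df = pd.DataFrame(
    [[4, 2, 'x'], [3, 1, 'y']],
    columns=['A', 'B', 'C'],
).set_index(['A', 'B'])

# With sort=False the keys should come back in order of first appearance.
observed = group_keys(df, level=0, sort=False)
expected = [4, 3]
print("order preserved:", observed == expected)

# With the default sort=True the keys are expected in ascending order.
observed_sorted = group_keys(df, level=0)
print("sorted by default:", observed_sorted == sorted(expected))
```

On a pandas build affected by this issue, the first line would report False.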

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL:
LANG: en_GB.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.21.0.dev+450.g6eadb87fe
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Contributor

jreback commented Sep 15, 2017

hmm this does look buggy. can you have a look?

jreback added this to the Next Major Release milestone on Sep 15, 2017
@MaximilianKoestler
Author

I have started debugging the problem, but I have not found the error yet.
I will probably have some time to look into it next week, but if someone who actually knows DataFrameGroupBy, BaseGrouper and the rest of core/groupby.py could take a look, it would be much appreciated.

jreback modified the milestones: Next Major Release, 0.21.0 on Oct 1, 2017

ipazc commented Oct 29, 2020

This bug still seems to occur if you call groupby() with the index name. I know there is a level parameter, but it seems that it is also possible to group by an index level just by specifying its name. Based on the same example as the original issue:

>>> import pandas as pd

>>> df = pd.DataFrame([
...         [4, 2, 'x'],
...         [3, 1, 'y'],
...     ],
...     columns=['A','B','C']).set_index(['A', 'B'])

>>> df
     C
A B
4 2  x
3 1  y

>>> df.groupby("A").size()

A
3    1
4    1
dtype: int64

>>> df.groupby("A", sort=False).size()
A
3    1
4    1
dtype: int64

However, if you use the level parameter, it works as expected:

>>> df.groupby(level="A", sort=False).size()
A
4    1
3    1
dtype: int64

Pandas 1.1.3
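
To compare the two spellings side by side on a given pandas version, a sketch along these lines may help. The variable names are illustrative only, and no particular result is claimed for the by-name call, since that is the behaviour in question.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 2, 'x'], [3, 1, 'y']],
    columns=['A', 'B', 'C'],
).set_index(['A', 'B'])

# Grouping by the index level name (the spelling reported as still sorting).
by_name = df.groupby("A", sort=False).size().index.tolist()

# Grouping through the level parameter (reported as working).
by_level = df.groupby(level="A", sort=False).size().index.tolist()

print("pandas", pd.__version__)
print("by name: ", by_name)
print("by level:", by_level)
print("same order:", by_name == by_level)
```

If the last line prints False, the by-name spelling is ignoring sort=False on that version.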

@Bernadette-Mohr

... and the bug still persists with Python 3.9.13 and pandas 1.4.4.
Maybe a paragraph about multi-indexing in the documentation would be enough, since preserving the original order of the groups does work with the level keyword?
