Groupby on a column with tuples creates a multiindex, yielding unexpected behaviour. #21340

Closed
masasin opened this issue Jun 6, 2018 · 7 comments · Fixed by #58630
Labels: Bug, good first issue, Groupby, MultiIndex, Needs Tests (unit test(s) needed to prevent regressions), Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

@masasin

masasin commented Jun 6, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 11, (10, 3)), columns=['num1', 'num2', 'num3'])
df['category_tuple'] = [(0, 1), (0, 1), (0, 1), (2, 3), (2, 3), (2, 3), (2, 3), (0, 4), (5,), (6,)]
df['category_string'] = list('aaabbbbcde')
df = df[['category_tuple', 'category_string', 'num1', 'num2', 'num3']]

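# Grouping by the string column works as expected.
# df.groupby('category_string').apply(lambda x: x.sort_values(by='num1'))
# Grouping by the tuple column raises the ValueError described below.
df.groupby('category_tuple').apply(lambda x: x.sort_values(by='num1'))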

Problem description

I get this error: ValueError: Names should be list-like for a MultiIndex

This behaviour changed somewhere between 0.19.0 and 0.19.1. This can be sidestepped by casting the tuples to a string before doing the groupby, but the behaviour is definitely unexpected.

category_string works, but category_tuple does not. Other operations also break in the same way (e.g., describe()). I'd expect them both to work in the same way, and not create a multiindex unless I request it.
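The string-cast workaround mentioned above could look roughly like the sketch below. The data is a small made-up frame rather than the random one from the report, and the `category_key` name is only an illustration. The key is kept as a separate Series so the original tuple column stays untouched.

```python
import pandas as pd

# Small deterministic frame (values are illustrative, not from the report).
df = pd.DataFrame(
    {
        "category_tuple": [(0, 1), (0, 1), (2, 3), (2, 3), (5,)],
        "num1": [6, 0, 3, 1, 4],
        "num2": [7, 0, 6, 3, 5],
    }
)

# Workaround: build a string key from each tuple so groupby sees plain
# scalar labels instead of tuples. "category_key" is a hypothetical name.
key = df["category_tuple"].map(str).rename("category_key")

# Sort the rows inside each group by num1. The numeric columns are
# selected explicitly so only they are passed to the function.
sorted_within = df.groupby(key)[["num1", "num2"]].apply(
    lambda g: g.sort_values(by="num1")
)

# A plain aggregation works with the same string key.
totals = df.groupby(key)["num1"].sum()

print(sorted_within)
print(totals)
```

The cost of this approach is that the group labels become strings such as "(0, 1)", so they would have to be parsed back if the tuples are needed afterwards.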

Expected Output

                  category_tuple category_string  num1  num2  num3
category_tuple                                                   
(0, 1)          0         (0, 1)               a     0     0    10
                2         (0, 1)               a     0     7     1
                1         (0, 1)               a     6     7     2
(2, 3)          4         (2, 3)               b     1     3     0
                6         (2, 3)               b     3     6     8
                3         (2, 3)               b     6    10     3
                5         (2, 3)               b     6     0     1
(0, 4)          7         (0, 4)               c     7     7     5
(5,)            8           (5,)               d     4     5     7
(6,)            9           (6,)               e     9     1     5

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 10.0.0
setuptools: 28.8.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Jun 6, 2018

IINM, I think we have been moving away from tuples toward MultiIndex (as a general thing when rendering DataFrame). That being said, your point is well-taken, as tuples can be perfectly fine items in a DataFrame. PR to patch is welcome!

@gfyoung gfyoung added the Bug label Jun 6, 2018
@masasin
Author

masasin commented Jun 6, 2018

How are tuples and multiindex normally differentiated? Maybe make it explicitly multiindex?

@gfyoung gfyoung removed the Bug label Jun 6, 2018
@gfyoung
Member

gfyoung commented Jun 6, 2018

> How are tuples and multiindex normally differentiated?

What I was referring to was the fact that we generally convert tuples in an index to MultiIndex when we read and parse data. That being said, that might be intentional on second thought. Let's double check.

cc @jreback
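For reference, the conversion described above can be seen directly in the `pd.Index` constructor, which has a `tupleize_cols` argument controlling whether a list of tuples becomes a MultiIndex. The sketch below only illustrates that constructor behaviour; whether the groupby code path in this issue goes through the same logic is an assumption, not something established in this thread.

```python
import pandas as pd

tuples = [(0, 1), (2, 3), (0, 4)]

# Default (tupleize_cols=True): equal-length tuples are expanded
# into a two-level MultiIndex.
as_multi = pd.Index(tuples)

# tupleize_cols=False keeps each tuple as a single label in a flat Index.
as_flat = pd.Index(tuples, tupleize_cols=False)

print(type(as_multi).__name__, as_multi.nlevels)
print(type(as_flat).__name__, as_flat.nlevels)
```

So an explicit opt-out exists at the `Index` level, which is roughly the distinction the question is asking about.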

@jbrockmendel jbrockmendel added the Nested Data label Sep 22, 2020
@thisisreallife

Has this bug been fixed? I came across the same situation.

@mroeschke mroeschke added the Bug label Jun 20, 2021
@rhshadrach
Member

This appears to be fixed on main. Test needs to be added.

@rhshadrach rhshadrach added the good first issue and Needs Tests labels Apr 16, 2024
@aborrego24 aborrego24 removed their assignment Apr 18, 2024
@jasonmokk
Contributor

take

@jasonmokk jasonmokk removed their assignment Apr 21, 2024
@jasonmokk
Contributor

take

jasonmokk pushed a commit to jasonmokk/pandas that referenced this issue May 8, 2024
mroeschke pushed a commit that referenced this issue May 8, 2024
…ndex (#58630)

* Implement test for GH #21340

* minor fixup

* Lint contribution

* Make spacing consistent

* Lint

* Remove duplicate column construction

* Avoid DeprecationWarning by setting include_groups=False in apply

---------

Co-authored-by: Jason Mok <jasonmok@Jasons-MacBook-Air-4.local>