Groupby on a column with tuples creates a multiindex, yielding unexpected behaviour. #21340

masasin · 2018-06-06T15:25:27Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 11, (10, 3)), columns=['num1', 'num2', 'num3'])
df['category_tuple'] = [(0, 1), (0, 1), (0, 1), (2, 3), (2, 3), (2, 3), (2, 3), (0, 4), (5,), (6,)]
df['category_string'] = list('aaabbbbcde')
df = df[['category_tuple', 'category_string', 'num1', 'num2', 'num3']]

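# Grouping by the string column works as expected:
# df.groupby('category_string').apply(lambda x: x.sort_values(by='num1'))
# Grouping by the tuple column raises the ValueError described below (pandas 0.22.0):
df.groupby('category_tuple').apply(lambda x: x.sort_values(by='num1'))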

Problem description

I get this error: ValueError: Names should be list-like for a MultiIndex

This behaviour changed somewhere between 0.19.0 and 0.19.1. This can be sidestepped by casting the tuples to a string before doing the groupby, but the behaviour is definitely unexpected.

category_string works, but category_tuple does not. Other operations also break in the same way (e.g., describe()). I'd expect them both to work in the same way, and not create a multiindex unless I request it.
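
For reference, the workaround mentioned above (casting the tuples to strings before the groupby) might look roughly like the sketch below. The numbers are copied from the Expected Output table in this report. The `category_key` column name is made up for illustration, and selecting the numeric columns before `apply` is a choice made here so that the grouping column is not passed to the lambda.

```python
import pandas as pd

# Values copied from the "Expected Output" table in this report.
df = pd.DataFrame(
    {
        "num1": [0, 6, 0, 6, 1, 6, 3, 7, 4, 9],
        "num2": [0, 7, 7, 10, 3, 0, 6, 7, 5, 1],
        "num3": [10, 2, 1, 3, 0, 1, 8, 5, 7, 5],
    }
)
df["category_tuple"] = [
    (0, 1), (0, 1), (0, 1), (2, 3), (2, 3), (2, 3), (2, 3), (0, 4), (5,), (6,)
]

# "category_key" is a hypothetical helper column: the string form of each tuple.
# Grouping on strings avoids any tuple-to-MultiIndex interpretation.
df["category_key"] = df["category_tuple"].map(str)

# Sort each group by num1; the group key becomes the outer index level.
result = df.groupby("category_key", group_keys=True)[["num1", "num2", "num3"]].apply(
    lambda g: g.sort_values(by="num1")
)
print(result)
```

The outer index level then holds strings such as "(0, 1)" rather than the tuples themselves, so this only sidesteps the problem and does not give the expected output below.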

Expected Output

                  category_tuple category_string  num1  num2  num3
category_tuple                                                   
(0, 1)          0         (0, 1)               a     0     0    10
                2         (0, 1)               a     0     7     1
                1         (0, 1)               a     6     7     2
(2, 3)          4         (2, 3)               b     1     3     0
                6         (2, 3)               b     3     6     8
                3         (2, 3)               b     6    10     3
                5         (2, 3)               b     6     0     1
(0, 4)          7         (0, 4)               c     7     7     5
(5,)            8           (5,)               d     4     5     7
(6,)            9           (6,)               e     9     1     5

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 10.0.0
setuptools: 28.8.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-06-06T16:30:47Z

IINM, we have been moving away from tuples toward MultiIndex (as a general thing when rendering a DataFrame). That being said, your point is well taken, as tuples can be perfectly fine items in a DataFrame. A PR to patch this is welcome!

masasin · 2018-06-06T18:51:55Z

How are tuples and multiindex normally differentiated? Maybe make it explicitly multiindex?

gfyoung · 2018-06-06T19:57:20Z

> How are tuples and multiindex normally differentiated?

What I was referring to was the fact that we generally convert tuples in an index to MultiIndex when we read and parse data. That being said, that might be intentional on second thought. Let's double check.

cc @jreback
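
As a small illustration of the tuple-to-MultiIndex conversion discussed in this comment, here is a minimal sketch using the `pd.Index` constructor. It only shows how an index built from tuples is treated by default and how `tupleize_cols=False` keeps the tuples as plain items. It is not a claim about what groupby does internally.

```python
import pandas as pd

pairs = [(0, 1), (2, 3), (0, 4)]

# By default, a list of tuples is turned into a MultiIndex.
as_multi = pd.Index(pairs)

# With tupleize_cols=False, the tuples stay as ordinary items of a flat Index.
as_flat = pd.Index(pairs, tupleize_cols=False)

print(type(as_multi).__name__, as_multi.nlevels)
print(type(as_flat).__name__, as_flat.nlevels)
```

So the distinction is made when the index is constructed, which is why a column of tuples can end up being treated as multi-level keys unless the flat form is requested explicitly.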

thisisreallife · 2021-03-10T06:32:52Z

Has this bug been fixed? I came across the same situation.

rhshadrach · 2024-04-16T21:09:44Z

This appears to be fixed on main. Test needs to be added.

jasonmokk · 2024-04-21T17:12:20Z

take

jasonmokk · 2024-05-06T21:30:26Z

take

…ndex (#58630)

* Implement test for GH #21340
* minor fixup
* Lint contribution
* Make spacing consistent
* Lint
* Remove duplicate column construction
* Avoid DeprecationWarning by setting include_groups=False in apply

---------

Co-authored-by: Jason Mok <jasonmok@Jasons-MacBook-Air-4.local>

gfyoung added Groupby MultiIndex labels Jun 6, 2018

gfyoung added the Bug label Jun 6, 2018

gfyoung removed the Bug label Jun 6, 2018

cheelee mentioned this issue Dec 16, 2019

Python wcon pip incompatibility with latest Python 3 pandas openworm/tracker-commons#171

Open

jbrockmendel added the Nested Data label (data where the values are collections: lists, sets, dicts, objects, etc.) Sep 22, 2020

mroeschke added the Bug label Jun 20, 2021

rhshadrach added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels Apr 16, 2024

github-actions bot assigned aborrego24 Apr 17, 2024

aborrego24 removed their assignment Apr 18, 2024

github-actions bot assigned jasonmokk Apr 21, 2024

jasonmokk removed their assignment Apr 21, 2024

github-actions bot assigned jasonmokk May 6, 2024

jasonmokk pushed a commit to jasonmokk/pandas that referenced this issue May 8, 2024

Implement test for GH pandas-dev#21340

88f0108

jasonmokk mentioned this issue May 8, 2024

TST: Add test for regression in groupby with tuples yielding MultiIndex #58630

Merged

3 tasks

mroeschke closed this as completed in #58630 May 8, 2024
