df.index.map with different size fails for Pandas > 0.22 #24800

Open
RutgerK opened this issue Jan 16, 2019 · 6 comments
Labels
Bug, MultiIndex, Regression (functionality that used to work in a prior pandas version)

Comments

@RutgerK

RutgerK commented Jan 16, 2019

Code Sample

import pandas as pd

df = pd.DataFrame({'a': [0,1,2,3],
                   'b': ['a_1_bar', 'a_2_bar', 'b_1_bar', 'b_2_bar'],
                   'c': list('defg')})

df = df.set_index(['b','c'])
df.index.map(lambda x: tuple(x[0].split('_')))

Problem description

The code above works in pandas 0.22 and earlier but fails from 0.23 onward. This seems to be because pandas tries to preserve the level names of the old index.

When the number of levels in the new index differs from the old one, the call fails with a ValueError because the names no longer match the levels:

ValueError: Length of names must match number of levels in MultiIndex.

Expected Output

Older versions of pandas returned a new MultiIndex, without names for the levels.

MultiIndex(levels=[['a', 'b'], ['1', '2'], ['bar']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]])

I'm not sure whether this change was deliberate. If not, a workaround might be to preserve the names only if the number of levels in the new index matches the old one, and otherwise to disregard the names, which would give behavior similar to the older pandas versions.
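The proposed rule can be sketched at user level as follows. This is only an illustration of the idea, not pandas' internal code: the helper name map_keep_names_if_same_depth is hypothetical, and it builds the result with MultiIndex.from_tuples instead of going through Index.map.

```python
import pandas as pd


def map_keep_names_if_same_depth(index, func):
    """Hypothetical helper (not part of pandas).

    Apply func to every tuple of a MultiIndex and keep the old level
    names only when the number of levels is unchanged.
    """
    # Iterating over a MultiIndex yields one tuple per row.
    new_index = pd.MultiIndex.from_tuples([func(t) for t in index])
    if new_index.nlevels == index.nlevels:
        new_index = new_index.set_names(list(index.names))
    return new_index


df = pd.DataFrame({'a': [0, 1, 2, 3],
                   'b': ['a_1_bar', 'a_2_bar', 'b_1_bar', 'b_2_bar'],
                   'c': list('defg')}).set_index(['b', 'c'])

# Three levels out of two: the old names are dropped.
three_levels = map_keep_names_if_same_depth(
    df.index, lambda x: tuple(x[0].split('_')))

# Two levels out of two: the old names are kept.
two_levels = map_keep_names_if_same_depth(
    df.index, lambda x: tuple(x[0].split('_')[:2]))
```

Whether names should be kept even when the level counts match is a separate design question for the maintainers.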

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: 0.11.2
IPython: 7.1.1
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Jan 16, 2019

Hmm I think this should work but cc @toobaz for thoughts

@toobaz
Member

toobaz commented Jan 17, 2019

I'm pretty sure not only that the OP's code should work, but also that even when the number of levels coincides, as in the following (notice the different column "b"):

In [2]: df = pd.DataFrame({'a': [0,1,2,3],
   ...:                    'b': ['a_1', 'a_2', 'b_1', 'b_2'],
   ...:                    'c': list('defg')})
   ...:                    

In [3]: df = df.set_index(['b', 'c'])

In [4]: df.index.map(lambda x : tuple(x[0].split('_')))
Out[4]: 
MultiIndex(levels=[['a', 'b'], ['1', '2']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['b', 'c'])

... it is a mistake to reuse the names, because the output of the lambda does not (in general) have anything to do with the input.

So the question becomes "is there any case in which it makes sense to reuse the names of a MultiIndex in a call to map?" I think the answer is "no", and if I am right, we just need to suppress this behavior here:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py#L4441

@RutgerK changed the title from "df.index.map with difference size fails for Pandas > 0.22" to "df.index.map with different size fails for Pandas > 0.22" Jan 17, 2019
@jorisvandenbossche
Member

So the question becomes "is there any case in which it makes sense to reuse the names of a MultiIndex in a call to map?" I think the answer is "no",

Note that this is more general than MultiIndex.map. We also preserve the name for Index.map, Series.map, Series.apply, ...
So at least from a consistency point of view, trying to preserve the names for MultiIndex.map as well might make sense.
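For reference, a minimal sketch of the name preservation described above for a flat Index and a Series. The variable names and sample values are illustrative only.

```python
import pandas as pd

# A flat Index keeps its single name through map.
idx = pd.Index(['a_1', 'a_2'], name='b')
mapped_idx = idx.map(str.upper)

# A Series keeps its name through map as well.
ser = pd.Series([1, 2], name='a')
mapped_ser = ser.map(lambda v: v * 10)

print(mapped_idx.name, mapped_ser.name)
```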

@toobaz
Member

toobaz commented Jan 18, 2019

Note that this is more general than MultiIndex.map. We also preserve the name for Index.map, Series.map, Series.apply, ...

Sorry, my comment was indeed a bit vague, but I was thinking of the level names (of the original MultiIndex), not just the name attribute. The best analogy I can come up with is:

In [2]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

In [3]: df.apply(lambda x : pd.Series([x[0], -x[0]]), axis=1)
Out[3]: 
   0  1
0  1 -1
1  3 -3

which does not preserve column names.

@gsaurabhr

I am still getting this error when the number of levels in the original and the returned MultiIndex differs. Are there any solutions?
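One way to sidestep the error in the meantime, offered as a sketch rather than an official fix, is to build the new index directly and assign it back, so that Index.map never gets the chance to reuse the old level names. The level names 'group', 'number' and 'suffix' below are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3],
                   'b': ['a_1_bar', 'a_2_bar', 'b_1_bar', 'b_2_bar'],
                   'c': list('defg')}).set_index(['b', 'c'])

# Split the first level of each (b, c) tuple and build the new
# three-level index explicitly instead of calling df.index.map.
new_index = pd.MultiIndex.from_tuples(
    [tuple(b.split('_')) for b, _ in df.index],
    names=['group', 'number', 'suffix'],  # hypothetical level names
)
df.index = new_index
```

Passing names=None instead would give an unnamed index, matching the pandas 0.22 output shown in the issue description.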

@jorisvandenbossche added this to the Contributions Welcome milestone Nov 3, 2020
@jorisvandenbossche
Member

Contributions to fix this are certainly welcome!

@simonjayhawkins added the Regression label (functionality that used to work in a prior pandas version) Jun 8, 2022
@simonjayhawkins modified the milestones: Contributions Welcome, 1.5 Jun 8, 2022
@mroeschke removed this from the 1.5 milestone Aug 15, 2022

7 participants