Clarify that MultiIndex.set_levels() interprets passed values as new components of the .levels attribute #28294

Closed
pepicello opened this issue Sep 5, 2019 · 8 comments · Fixed by #29143

Comments

@pepicello
Contributor

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(3, 3), columns=pd.MultiIndex.from_tuples([(1, 2), (1, 4), (5, 6)]))
df.columns = df.columns.set_levels([1, 1, 3], level=0)

Which raises a ValueError:

ValueError: Level values must be unique: [1, 1, 3] on level 0

Problem description

Although a DataFrame with non-unique MultiIndex level values can be created, such values cannot be set using set_levels(). Is this behaviour expected? I believe non-unique level values were not allowed for a period of time (#18882), but then they were allowed again (#21423), so I am not sure which is the current convention.

Expected Output

          1                   3
          2         4         6
0  0.317669  0.329142  0.056725
1  0.969472  0.340309  0.135204
2  0.242408  0.934748  0.683186

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8

@WillAyd
Member

WillAyd commented Sep 5, 2019

@toobaz

@toobaz
Member

toobaz commented Sep 5, 2019

I believe non-unique level values were not allowed for a period of time (#18882), but then they were allowed again (#21423), so I am not sure which is the current convention.

Those issues were about non-unique level names, yours about non-unique level values. And in this second case the story is simple: yes, they are and always were allowed.

But I think you've misunderstood how set_levels works (possibly because the docs could and should be improved): you're not expected to provide a new value for each row, but rather one for each code describing values in the level. In other words, you are changing the levels attribute, and having duplicates there does not make a lot of sense. That said, you can do it if you really want, by passing verify_integrity=False.
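To make the distinction concrete, here is a minimal sketch using the index from the original report. The variable names are illustrative only; the calls are the public MultiIndex attributes `levels` and `codes` and the `set_levels` method.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 2), (1, 4), (5, 6)])

# Level 0 stores only its distinct values; the codes point into that list.
level0_values = mi.levels[0].tolist()  # [1, 5]
level0_codes = mi.codes[0].tolist()    # [0, 0, 1], one code per column

# set_levels replaces the distinct values, so two values are needed here, not three.
relabeled = mi.set_levels([1, 3], level=0)

# Passing one value per column fails the uniqueness check, as in the report.
try:
    mi.set_levels([1, 1, 3], level=0)
    raised = False
except ValueError:
    raised = True
```

Here `relabeled` has the level-0 labels 1, 1, 3 that the report expected, because the codes are left untouched and only the values they point to change.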

In short, this function is probably not what you're looking for. And unfortunately, there is no obvious alternative for what you want to do.
The simplest way is to recreate the index.
The most efficient way might be to use set_levels in combination with set_codes (you can obtain levels and codes from a simple MultiIndex with 1 level).
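Both suggestions can be sketched roughly as follows. This is only an illustration under stated assumptions: the `wanted` array is hypothetical, and `pd.factorize` is used here as a stand-in for "obtain levels and codes from a one-level index", which is not necessarily what was meant above.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 2), (1, 4), (5, 6)])
wanted = np.array([1, 1, 3])  # hypothetical: desired level-0 label per column

# Option 1: rebuild the index from per-column arrays.
rebuilt = pd.MultiIndex.from_arrays([wanted, mi.get_level_values(1)])

# Option 2: derive distinct values and codes, then set both on level 0.
codes, uniques = pd.factorize(wanted)
combined = mi.set_levels(uniques, level=0).set_codes(codes, level=0)
```

In this example the intermediate index after `set_levels` is valid because the old codes still fit the new level. When the new level has fewer distinct values than the old one, that intermediate step may fail the integrity check, in which case rebuilding the index is the simpler route.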

@toobaz toobaz closed this as completed Sep 5, 2019
@WillAyd
Member

WillAyd commented Sep 5, 2019

Should we at least keep this open to improve the docs then?

@toobaz
Member

toobaz commented Sep 5, 2019

Should we at least keep this open to improve the docs then?

Makes sense!

@toobaz toobaz reopened this Sep 5, 2019
@toobaz toobaz added the Docs label Sep 5, 2019
@toobaz toobaz changed the title MultiIndex.set_levels() requires unique level values Clarify that MultiIndex.set_levels() interpret passed values as new components of the .levels attribute Sep 5, 2019
@toobaz toobaz changed the title Clarify that MultiIndex.set_levels() interpret passed values as new components of the .levels attribute Clarify that MultiIndex.set_levels() interprets passed values as new components of the .levels attribute Sep 5, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Sep 5, 2019
@hweecat
Contributor

hweecat commented Sep 20, 2019

I could try improving the docs for MultiIndex.set_levels(), if that's okay.

@WillAyd
Member

WillAyd commented Sep 20, 2019

@hweecat sure!

hweecat added a commit to hweecat/pandas that referenced this issue Oct 5, 2019
hweecat added a commit to hweecat/pandas that referenced this issue Oct 5, 2019
rectify for failing tests

DOC: added docs for MultiIndex.set_levels (pandas-dev#28294)
hweecat added a commit to hweecat/pandas that referenced this issue Oct 5, 2019
hweecat added a commit to hweecat/pandas that referenced this issue Oct 6, 2019
hweecat added a commit to hweecat/pandas that referenced this issue Oct 21, 2019
datapythonista pushed a commit that referenced this issue Jan 3, 2020
…s() interprets passed values as new components of the .levels attribute (#28294) (#29143)
@astadmistry

I had a similar issue:

import io
import pandas as pd
df = pd.read_csv(io.StringIO('''model,0,0,1,1,2,2
parameters,weights,letter,weights,letter,weights,letter
0,0.4,A,0.7,A,0.25,A
1,0.2,B,0.1,B,0.25,B
2,0.2,C,0.1,C,0.25,C
3,0.2,D,0.1,D,0.25,D

'''), header=[0, 1], index_col=0)

df.columns.get_level_values(0)

output:
Index(['0', '0', '1', '1', '2', '2'], dtype='object', name='model')

In pd.read_csv, you cannot specify the dtype of the headers. Here I want the level 0 header to be of type int, not object.
I tried doing:
df.columns = df.columns.set_levels(df.columns.get_level_values(0).astype(int), level=0)
However, I got this error:
ValueError: Level values must be unique: [0, 0, 1, 1, 2, 2] on level 0

After reading this issue (#28294), I found that pandas interprets the new values as replacements for the unique level values, in the same order, so I tried this:
df.columns = df.columns.set_levels(set(df.columns.get_level_values(0).astype(int)), level=0)
Now when running df.columns.get_level_values(0) I get:
Int64Index([0, 0, 1, 1, 2, 2], dtype='int64', name='model')

@astadmistry

Actually a better way is:
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)
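A small sketch of why converting `levels[0]` directly is the more robust choice. The column labels below are made up for illustration (they are not the CSV above) and are chosen so that string order and numeric order differ.

```python
import pandas as pd

# Hypothetical string headers: "10" sorts before "2" as text.
cols = pd.MultiIndex.from_arrays(
    [["2", "2", "10", "10"], ["weights", "letter", "weights", "letter"]],
    names=["model", "parameters"],
)

# Convert the stored level element by element; positions are kept,
# so the existing codes still point at the right values.
converted = cols.set_levels(cols.levels[0].astype(int), level=0)
labels = converted.get_level_values(0).tolist()
```

Because `astype(int)` keeps each value at its original position in the level, every column keeps its own label. Going through `set()` depends on the set's iteration order happening to match the order of the stored level, which is not guaranteed.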

REF: https://stackoverflow.com/questions/49199530/pandas-multi-index-column-header-change-type-when-reading-a-csv-file/49199682#comment85952840_49199682
