BUG: MultiIndex level not sorted (as desired) after making it a CategoricalIndex #47607

Open
2 of 3 tasks
d-s-dc opened this issue Jul 6, 2022 · 7 comments
Labels
Bug, Categorical (Categorical Data Type), MultiIndex

Comments

@d-s-dc

d-s-dc commented Jul 6, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

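import numpy as np
import pandas as pd

months = ['January','February','March','April','May','June','July','August','September','October','November','December']
# 24 rows indexed by (month, 1 or 2); sort_index() puts the months in alphabetical order
df = pd.DataFrame(
    {'col': np.arange(1, 25)},
    index=pd.MultiIndex.from_product([months, [1, 2]], names=['idx_1', 'idx_2']),
).sort_index()
# Replace level 0 with an ordered CategoricalIndex whose categories follow the calendar
cidx = pd.CategoricalIndex(df.index.get_level_values(0).unique(), months, ordered=True)
df.index = df.index.set_levels(cidx, level=0)
# Calendar order is expected here, but the months stay in alphabetical order
df = df.sort_index(level=0)
display(df)  # IPython/Jupyter helper; use print(df) in a plain script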

Issue Description

I wanted the months in level 0 of the MultiIndex sorted in calendar order, so I built a CategoricalIndex and assigned it to level 0. I expected sort_index to then sort level 0 by the order defined in the CategoricalIndex.
But even after level 0 becomes a CategoricalIndex, the months are not sorted as expected.

Expected Behavior

I want the months sorted according to the category order given in the CategoricalIndex.

The expected behaviour can be produced with the following code:

display(df.reindex(months, level = 0))
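
For comparison, here is a minimal sketch of one way to get calendar order, assuming the index can be built from an ordered pd.Categorical in the first place rather than having the level swapped in afterwards with set_levels. The names month_cat, df_cat and shuffled are illustrative only, and the reversed frame exists just to show that sort_index restores the category order. This is an illustration, not a fix for the reported behaviour.

```python
import numpy as np
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Assumption: the categorical is created before the MultiIndex is built,
# so the level and its integer codes already follow the category order.
month_cat = pd.Categorical(months, categories=months, ordered=True)
index = pd.MultiIndex.from_product([month_cat, [1, 2]], names=['idx_1', 'idx_2'])
df_cat = pd.DataFrame({'col': np.arange(1, 25)}, index=index)

# Reverse the rows, then sort on level 0 again.
shuffled = df_cat.iloc[::-1]
result = shuffled.sort_index(level=0)

# Months in the order they appear after sorting.
sorted_months = list(result.index.get_level_values('idx_1').unique())
print(sorted_months)
```

Here the codes of level 0 are derived from the categories themselves, so sorting by level 0 and sorting by category order coincide.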

image

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.4.2
numpy : 1.21.5
pytz : 2022.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 61.2.0
Cython : 0.29.30
pytest : 7.1.2
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : None
fsspec : 2022.3.0
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.5.1
numba : 0.55.1
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy :
sqlalchemy : 1.4.32
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.20.1
xlrd : 2.0.1
xlwt : None
zstandard : None

d-s-dc added the Bug and Needs Triage labels Jul 6, 2022
phofl added the MultiIndex and Categorical labels and removed the Needs Triage label Jul 6, 2022
simonjayhawkins added this to the Contributions Welcome milestone Jul 11, 2022
@simonjayhawkins
Member

Can confirm that this works as intended for a regular Index and for data columns.

There is another issue where a level of a MultiIndex appears not to be sorted, which may be related. See #21136 (comment).

Contributions and PRs to fix this are welcome.

@GYHHAHA
Contributor

GYHHAHA commented Jul 22, 2022

take

@GYHHAHA
Contributor

GYHHAHA commented Jul 30, 2022

When you use set_levels, pandas replaces the old level with the new one directly, without aligning it to the original data values. Is this the desired result? @d-s-dc
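
To illustrate the point about set_levels: it swaps the labels of a level positionally and leaves the integer codes untouched, so each row picks up whatever label now sits at its old position. A small sketch with made-up labels ('a', 'b', 'x', 'y' and the level names are placeholders, not taken from the report):

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays([['b', 'a', 'b'], [1, 2, 3]], names=['outer', 'inner'])
# The 'outer' level is stored sorted as ['a', 'b']; rows point at it via codes [1, 0, 1].
relabelled = mi.set_levels(['x', 'y'], level='outer')

before = list(mi.get_level_values('outer'))
after = list(relabelled.get_level_values('outer'))
# The codes are not recomputed, only the labels behind them change.
codes_unchanged = list(mi.codes[0]) == list(relabelled.codes[0])
print(before, after, codes_unchanged)
```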

@d-s-dc
Author

d-s-dc commented Aug 5, 2022

Sorry for the late reply.
Yes, it's fine that the old level is replaced directly. My problem is with the sort_index call: since I set the level using a CategoricalIndex, I want the order to be the one defined in that CategoricalIndex.

@d-s-dc
Author

d-s-dc commented Oct 5, 2022

I've found another thing. The problem seems to occur when the MultiIndex is already sorted in lexicographical order. If that ordering is broken somehow, sorting then follows the CategoricalIndex.

Currently,

df.sort_index(level=0)

image

which doesn't work: idx_1 stays lexicographically sorted.

But if we remove the lexicographical sorting somehow, like below

df.sort_index(level=1)

image

and then sort level 0, the sorting works according to categorical indexing.

df.sort_index(level=1).sort_index(level=0)

image

All thanks to this answer on Stack Overflow.
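
A small diagnostic sketch of the state described in this comment, using only public attributes. After set_levels, the integer codes of level 0 are still in ascending order, while the categorical level itself is not in category order. Whether this mismatch is the actual cause of the behaviour is an assumption here, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
df = pd.DataFrame(
    {'col': np.arange(1, 25)},
    index=pd.MultiIndex.from_product([months, [1, 2]], names=['idx_1', 'idx_2']),
).sort_index()
cidx = pd.CategoricalIndex(df.index.get_level_values(0).unique(), months, ordered=True)
df.index = df.index.set_levels(cidx, level=0)

level0 = df.index.levels[0]   # the CategoricalIndex stored as level 0
codes0 = df.index.codes[0]    # per-row integer positions into level0

# Rows are already ascending in terms of codes ...
codes_ascending = bool((np.diff(codes0) >= 0).all())
# ... but the level they point into is not in category (calendar) order.
level_in_category_order = bool(level0.is_monotonic_increasing)
print(codes_ascending, level_in_category_order)
```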

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@NA-Dev

NA-Dev commented Aug 9, 2023

Just wasted a lot of time dealing with this bug. +1 to users who want this fixed.

@tehunter
Contributor

Any progress on this @GYHHAHA? I'm encountering this bug too, and it was very hard to pin down the issue. Fortunately, this workaround worked for me as well:

But if we remove the lexicographical sorting somehow, like below

df.sort_index(level=1)

image

and then sort level 0, the sorting works according to categorical indexing.

df.sort_index(level=1).sort_index(level=0)

image

All thanks to this answer on stackoverflow.
