DataFrame.sort_index by Level Name Incorrect After Unstack/Swaplevel #20994

WillAyd opened this Issue May 9, 2018 · 0 comments



WillAyd commented May 9, 2018

This is a very obscure issue but I think it is responsible for things getting out of whack in #20945

In [1]: mi = pd.MultiIndex.from_product([[0], ['d', 'c']], names=['bar', 'baz'])
In [2]: df = pd.DataFrame([[0, 2], [1, 3]], index=mi, columns=['B', 'A'])
In [3]: df.columns.name = 'foo'

In [4]: df
foo      B  A
bar baz      
0   d    0  2
    c    1  3

In [5]: df.unstack().swaplevel(axis=1)
baz  c  d  c  d
foo  B  B  A  A
0    1  0  3  2

In [6]: df.unstack().swaplevel(axis=1).sort_index(axis=1, level=0)
baz  c     d   
foo  A  B  A  B  # Here subsequent levels get sorted
0    3  1  2  0

In [7]: df.unstack().swaplevel(axis=1).sort_index(axis=1, level='baz')
baz  c     d   
foo  B  A  B  A  # Here subsequent levels aren't getting sorted
0    1  3  0  2

If the DataFrame from step 5 above is constructed directly instead, the sort result is the same whether the level is given by position or by name:

In [1]: mi = pd.MultiIndex.from_tuples([('c', 'B'), ('d', 'B'), ('c', 'A'), ('d', 'A')], names=['baz', 'foo'])
In [2]: df = pd.DataFrame([[1, 0, 3, 2]], columns=mi, index=pd.Index([0], name='bar'))
In [3]: df  # Same as step 5 in above example
baz  c  d  c  d
foo  B  B  A  A
0    1  0  3  2

In [4]: df.sort_index(axis=1, level=0)
baz  c     d   
foo  A  B  A  B
0    3  1  2  0

In [5]: df.sort_index(axis=1, level='baz')
baz  c     d   
foo  A  B  A  B  # Sort is the same as item above, regardless of using label or not
0    3  1  2  0
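The two sessions above can be folded into one script that checks whether sorting by level position and by level name agree. This is only a sketch: `sorts_agree` is a hypothetical helper, not part of pandas. On an affected build the `reshaped` check would be expected to report a mismatch, while the `direct` check should agree.

```python
import pandas as pd


def sorts_agree(frame, position=0):
    """Return True if sorting the columns by level position and by the
    matching level name yields the same column order.

    Hypothetical helper for this issue, not a pandas API.
    """
    name = frame.columns.names[position]
    by_position = frame.sort_index(axis=1, level=position)
    by_name = frame.sort_index(axis=1, level=name)
    return list(by_position.columns) == list(by_name.columns)


# Path 1: build the frame, then unstack and swap the column levels.
mi = pd.MultiIndex.from_product([[0], ['d', 'c']], names=['bar', 'baz'])
df = pd.DataFrame([[0, 2], [1, 3]], index=mi, columns=['B', 'A'])
df.columns.name = 'foo'
reshaped = df.unstack().swaplevel(axis=1)

# Path 2: construct the same columns directly.
cols = pd.MultiIndex.from_tuples(
    [('c', 'B'), ('d', 'B'), ('c', 'A'), ('d', 'A')], names=['baz', 'foo']
)
direct = pd.DataFrame([[1, 0, 3, 2]], columns=cols,
                      index=pd.Index([0], name='bar'))

print("reshaped agrees:", sorts_agree(reshaped))
print("direct agrees:  ", sorts_agree(direct))
```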

Note that this only happens when unstack and swaplevel are used together. My original thought was that swaplevel alone was responsible, but I could not reproduce the issue with it alone, so I'm assuming unstack is mutating some kind of state of the MultiIndex?
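One way to probe that guess is to compare the internal levels and integer codes of the two column indexes, which can differ even when the displayed tuples are identical. A rough sketch, assuming nothing about where the difference actually lies: `describe` is a hypothetical helper, and the codes attribute is `codes` in pandas 0.24 and later but `labels` in earlier versions such as the one reported here.

```python
import pandas as pd


def describe(mi):
    """Summarise a MultiIndex's internal state (hypothetical helper)."""
    # `codes` exists in pandas >= 0.24; older versions call it `labels`.
    codes = getattr(mi, "codes", None)
    if codes is None:
        codes = mi.labels
    return {
        "names": list(mi.names),
        "levels": [list(level) for level in mi.levels],
        "codes": [list(code) for code in codes],
    }


# Columns produced by unstack + swaplevel.
mi = pd.MultiIndex.from_product([[0], ['d', 'c']], names=['bar', 'baz'])
df = pd.DataFrame([[0, 2], [1, 3]], index=mi, columns=['B', 'A'])
df.columns.name = 'foo'
reshaped_cols = df.unstack().swaplevel(axis=1).columns

# Columns constructed directly with the same tuples.
direct_cols = pd.MultiIndex.from_tuples(
    [('c', 'B'), ('d', 'B'), ('c', 'A'), ('d', 'A')], names=['baz', 'foo']
)

# Same displayed tuples, but the stored level order and codes may differ.
print("reshaped:", describe(reshaped_cols))
print("direct:  ", describe(direct_cols))
```

If the stored levels come back in a different order for the two indexes, that would be a reasonable place to start looking, though it does not by itself confirm the cause.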


commit: eff1faf
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None

pandas: 0.23.0rc2+27.geff1faf27
pytest: 3.4.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.0
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd referenced this issue May 15, 2018


Fix Inconsistent MultiIndex Sorting #21043

2 of 3 tasks complete

@jreback jreback added this to the 0.24.0 milestone May 17, 2018

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.24.0, 0.23.1 Jun 5, 2018
