New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data is mismatched with labels after stack with MultiIndex columns #20945

Closed
pauleikis opened this Issue May 3, 2018 · 4 comments

Comments

Projects
None yet
5 participants
@pauleikis

pauleikis commented May 3, 2018

df = pd.DataFrame([
    ['DIM', 'A', 1, 2, 3, 4],
    ['DIM', 'B', 11, 22, 33, 44],
])
df.columns = ["dim1", "dim2", 'c', 'b', 'a', 'd']
df = df.set_index(["dim1", "dim2"])
df.columns.name = 'metrics'

print(df)
metrics     c   b   a   d
dim1 dim2                
DIM  A      1   2   3   4
     B     11  22  33  44

print(df.unstack('dim2').stack('metrics'))

dim2          A   B
dim1 metrics       
DIM  a        1  11
     b        2  22
     c        3  33
     d        4  44

Problem description

After stacking data by one level of MultiIndex columns, the output is unexpected. 'metrics' labels are sorted alphabetically, but data is show as is - without being matched with labels.

Expected Output

Either option shown below would be fine and expected:

dim2          A   B
dim1 metrics       
DIM  a        3  33
     b        2  22
     c        1  11
     d        4  44

dim2          A   B
dim1 metrics       
DIM  c        1  11
     b        2  22
     a        3  33
     d        4  44

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.3
pip: 9.0.1
setuptools: 38.2.3
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: 0.7.11.None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@pauleikis

This comment has been minimized.

pauleikis commented May 3, 2018

I just wanted to add that there is a workaround for this issue, but it looks rather ugly:

df = df.stack('metrics').to_frame().unstack('dim2')
df.columns = df.columns.droplevel()
print(df)

dim2          A   B
dim1 metrics       
DIM  c        1  11
     b        2  22
     a        3  33
     d        4  44

As you can see it produces one of the expected outputs mentioned above.

@WillAyd

This comment has been minimized.

Member

WillAyd commented May 4, 2018

Hmm interesting. Strangely enough if you change the order in which you apply the methods it gets you to the right answer

In [7]: df.stack('metrics').unstack('dim2')
Out[7]: 
dim2          A   B
dim1 metrics       
DIM  c        1  11
     b        2  22
     a        3  33
     d        4  44
@gfyoung

This comment has been minimized.

Member

gfyoung commented May 8, 2018

This is indeed a little weird! PR to patch is welcome!

@WillAyd

This comment has been minimized.

Member

WillAyd commented May 8, 2018

I believe the problem stems from the block below:

if not this.columns.is_lexsorted():

This appears related to #8809 and essentially is causing a disconnect as it sorts the data but the subsequent constructor does not use the sorted labels for the level that gets pushed to the row index.

Not terribly familiar with this code so investigating further but lmk if anyone has insights there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment