df.stack() behaving differently between 0.19.2 and 0.20.1 #16323

Closed
dpsugasa opened this issue May 10, 2017 · 12 comments · Fixed by #16325
Labels
Bug; Regression (functionality that used to work in a prior pandas version); Reshaping (concat, merge/join, stack/unstack, explode)

@dpsugasa

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np


index = pd.DatetimeIndex(['2017-05-04', '2017-05-05', '2017-05-08', '2017-05-09',
               '2017-05-10'],
              dtype='datetime64[ns]',name = 'date', freq='B')
columns = pd.MultiIndex(levels=[['HSBA LN Equity', 'UCG IM Equity', 'ISP IM Equity'], ['LAST PRICE', 'HIGH', 'LOW']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
data = np.array([[ 663.8, 672.5, 661.1, 15.97, 16.02, 15.49, 2.76, 2.768, 2.694],
              [ 658.6, 663.9, 656.0, 16.22, 16.48, 15.77, 2.842, 2.868, 2.77 ],
              [ 660.6, 664.1, 658.9, 16.01, 16.49, 15.94, 2.852, 2.898, 2.826],
              [ 664.9, 669.2, 662.5, 15.90, 16.41, 15.90, 2.848, 2.898, 2.842],
              [ 670.9, 673.4, 663.8, 16.09, 16.15, 15.59, 2.85,  2.888, 2.802]])
df = pd.DataFrame(data, columns=columns, index = index)

Problem description

Since switching to 0.20.1, when using df.stack(0), the output looks like this:

                              HIGH  LAST PRICE      LOW
date                                                   
2017-05-04 HSBA LN Equity  672.500     663.800  661.100
           UCG IM Equity     2.768       2.760    2.694
           ISP IM Equity    16.020      15.970   15.490
2017-05-05 HSBA LN Equity  663.900     658.600  656.000
           UCG IM Equity     2.868       2.842    2.770
           ISP IM Equity    16.480      16.220   15.770
2017-05-08 HSBA LN Equity  664.100     660.600  658.900
           UCG IM Equity     2.898       2.852    2.826
           ISP IM Equity    16.490      16.010   15.940
2017-05-09 HSBA LN Equity  669.200     664.900  662.500
           UCG IM Equity     2.898       2.848    2.842
           ISP IM Equity    16.410      15.900   15.900
2017-05-10 HSBA LN Equity  673.400     670.900  663.800
           UCG IM Equity     2.888       2.850    2.802
           ISP IM Equity    16.150      16.090   15.590

The columns change order and the tickers no longer correspond to the correct prices.

Expected Output

0.19.2 maintains the correct hierarchy.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 35.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.1
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

@dsm054
Contributor

dsm054 commented May 10, 2017

Simpler case:

columns = pd.MultiIndex(levels=[['B', 'A'],
                                ['C', 'D']],
                        labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
df = pd.DataFrame(columns=columns, data=[range(4)])
assert df.loc[0, ("A","C")] == df.stack(0).loc[(0, "A"), "C"]

passes under 0.19.2 for me and fails under 0.20.1.
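For anyone reproducing this on a newer pandas, the same check can be extended to every cell of the frame. This is only a sketch, and it makes one assumption that is not in the report above: it uses `codes=`, the name that replaced the `labels=` argument from pandas 0.24 onward. The `mismatches` name is chosen here purely for illustration.

```python
import pandas as pd

# Same frame as in the comment above. `codes=` is the post-0.24 spelling
# of `labels=` (an assumption about the installed pandas version).
columns = pd.MultiIndex(levels=[["B", "A"], ["C", "D"]],
                        codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
df = pd.DataFrame([[0, 1, 2, 3]], columns=columns)

stacked = df.stack(0)

# Collect every (outer, inner) column whose value no longer sits under
# the same labels after stacking.
mismatches = [
    (outer, inner)
    for outer, inner in df.columns
    if df.loc[0, (outer, inner)] != stacked.loc[(0, outer), inner]
]
print(mismatches)
```

An empty list means every value kept its labels; on an affected version the list would be expected to name the columns that got swapped.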

@sinhrks added the Reshaping, Bug, and Regression labels May 11, 2017
@dsm054
Contributor

dsm054 commented May 11, 2017

In 0.19.2, with the above dataframe, we have

In [8]: df.sortlevel(level=0, axis=1)
Out[8]: 
   B     A   
   C  D  C  D
0  0  1  2  3

In 0.20.1, we have

In [378]: df.sortlevel(level=0, axis=1)
/home/dsm/sys/miniconda3/envs/py36/bin/ipython:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
  #!/home/dsm/sys/miniconda3/envs/py36/bin/python
Out[378]: 
   A     B   
   C  D  C  D
0  2  3  0  1

or, to use the new syntax,

In [380]: df.sort_index(level=0, axis=1)
Out[380]: 
   A     B   
   C  D  C  D
0  2  3  0  1

That is, in 0.20.1, the sort actually affects the frame, and frame.columns and this.columns differ in _stack_multi_columns. As a result, I think this line:

    new_levels.append(frame.columns.levels[level_num])

is no longer justified, because the level order can have changed. I think that just using new_levels.append(level_vals) (taken from this after the sort) should suffice to repair it. Anyone else want to call dibs, or should I take a run at it?
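A small sketch of the mismatch described above, assuming pandas 0.24 or later (where `labels=` is spelled `codes=`): the order stored in `columns.levels` is not necessarily the order the level values appear in once the columns have been sorted, so code that mixes the two can pair values with the wrong labels. The variable names are illustrative, not taken from the pandas source.

```python
import pandas as pd

# Frame whose first column level is stored in non-sorted order: ['B', 'A'].
columns = pd.MultiIndex(levels=[["B", "A"], ["C", "D"]],
                        codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
df = pd.DataFrame([[0, 1, 2, 3]], columns=columns)

# Order as stored in the original frame's levels.
stored_levels = list(df.columns.levels[0])

# Order in which the level values actually appear after sorting the columns.
sorted_frame = df.sort_index(level=0, axis=1)
values_after_sort = list(sorted_frame.columns.get_level_values(0).unique())

print(stored_levels)
print(values_after_sort)
print(sorted_frame.iloc[0].tolist())
```

The two orders disagree, which is why taking the level from the unsorted frame while taking the data from the sorted one can misalign them.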

@sinhrks
Member

sinhrks commented May 11, 2017

Thanks for the report. It looks like a regression. I think @dsm054 's fix looks good.

dsm054 added a commit to dsm054/pandas that referenced this issue May 11, 2017
dsm054 added a commit to dsm054/pandas that referenced this issue May 11, 2017
@jreback jreback added this to the 0.20.2 milestone May 11, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
TomAugspurger pushed a commit that referenced this issue May 30, 2017
@ilmioalias

Hi, I think there is still a problem.

Code Sample, a copy-pastable example if possible

import pandas as pd

# We create a MultiIndex
PAE = ['ITA', 'FRA']
VAR = ['A1', 'A2']
TYP = ['CRT', 'DBT', 'NET']
MI = pd.MultiIndex.from_product([PAE, VAR, TYP], names=['PAE', 'VAR', 'TYP'])

# We create a dataframe with MultiIndex MI
V = [20, 10, 10, 40, 10, 30, 120, 110, 10, 140, 110, 30]
DF = pd.DataFrame(data=V, index=MI, columns=['VALUE'])

# We unstack the dataframe and drop level 0
DF = DF.unstack(['VAR', 'TYP'])
DF.columns = DF.columns.droplevel(0)
DF[('A0', 'NET')] = 9999

# We stack the dataframe
DF0 = DF.stack(['VAR', 'TYP'])
# DF0 is wrong
DF1 = DF.sort_index(axis=1).stack(['VAR', 'TYP'])
# DF1 is right 

Problem description

The data in DF0 does not correspond to the original data from before the unstack and droplevel.

Expected Output

PAE  VAR  TYP
FRA  A0   NET    9999.0
     A1   CRT     120.0
          DBT     110.0
          NET      10.0
     A2   CRT     140.0
          DBT     110.0
          NET      30.0
ITA  A0   NET    9999.0
     A1   CRT      20.0
          DBT      10.0
          NET      10.0
     A2   CRT      40.0
          DBT      10.0
          NET      30.0
dtype: float64
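One way to check whether a given pandas release is affected, sketched under the assumption that comparing label-to-value mappings is an acceptable test: build the expected mapping straight from the wide frame's cells and compare it with what `stack` returns. The names `wide`, `expected` and `actual` are made up for this sketch.

```python
import pandas as pd

# Rebuild the wide frame from the report above.
mi = pd.MultiIndex.from_product(
    [["ITA", "FRA"], ["A1", "A2"], ["CRT", "DBT", "NET"]],
    names=["PAE", "VAR", "TYP"],
)
values = [20, 10, 10, 40, 10, 30, 120, 110, 10, 140, 110, 30]
wide = pd.DataFrame(data=values, index=mi, columns=["VALUE"])
wide = wide.unstack(["VAR", "TYP"])
wide.columns = wide.columns.droplevel(0)
wide[("A0", "NET")] = 9999

# Expected mapping, read cell by cell from the wide frame itself.
expected = {
    (pae, var, typ): value
    for (var, typ), column in wide.items()
    for pae, value in column.items()
}

# Mapping produced by stacking, without sorting the columns first.
actual = wide.stack(["VAR", "TYP"]).to_dict()

print(actual == expected)
```

If the comparison is false, sorting the columns first (as in DF1 above) is the workaround reported in this comment.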

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.18.1
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Jul 14, 2017

@ilmioalias : Can you try installing 0.20.2 and report back if the output matches for you now?

@dsm054
Contributor

dsm054 commented Jul 14, 2017

I can confirm that it looks like @ilmioalias has found a different failure mode not solved by the first fix, and obviously not caught by the original tests, which I thought were pretty deep. :-/

@gfyoung
Member

gfyoung commented Jul 14, 2017

@dsm054 : Are you using 0.20.3 ? If so, could you also try 0.20.2 and see what you get?

@dsm054
Contributor

dsm054 commented Jul 14, 2017

Same under both, more's the pity.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Okay, could you file a separate issue then and cross-reference this one? Afterwards, feel free to patch this since you were the one who spearheaded the initial fix.

@dsm054
Contributor

dsm054 commented Jul 14, 2017

Sure, I'll take it. I'm annoyed that something managed to sneak through. 😒

Thanks for the report, @ilmioalias!!

@gfyoung
Member

gfyoung commented Jul 14, 2017

I'm annoyed that something managed to sneak through.

Well, if we could write tests that ALWAYS covered EVERY use-case, then we wouldn't need to take contributions from anyone because our tests would be perfect. 😉

@ilmioalias

Hi,
I tried pandas 0.20.2 and it doesn't work either. I got the same result as with 0.20.3.

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
xarray: None
IPython: 5.4.1
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
