df.stack() behaving differently between 0.19.2 and 0.20.1 #16323

Closed
dpsugasa opened this Issue May 10, 2017 · 12 comments

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np


index = pd.DatetimeIndex(['2017-05-04', '2017-05-05', '2017-05-08', '2017-05-09',
               '2017-05-10'],
              dtype='datetime64[ns]',name = 'date', freq='B')
columns = pd.MultiIndex(levels=[['HSBA LN Equity', 'UCG IM Equity', 'ISP IM Equity'], ['LAST PRICE', 'HIGH', 'LOW']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
data = np.array([[ 663.8, 672.5, 661.1, 15.97, 16.02, 15.49, 2.76, 2.768, 2.694],
              [ 658.6, 663.9, 656.0, 16.22, 16.48, 15.77, 2.842, 2.868, 2.77 ],
              [ 660.6, 664.1, 658.9, 16.01, 16.49, 15.94, 2.852, 2.898, 2.826],
              [ 664.9, 669.2, 662.5, 15.90, 16.41, 15.90, 2.848, 2.898, 2.842],
              [ 670.9, 673.4, 663.8, 16.09, 16.15, 15.59, 2.85,  2.888, 2.802]])
df = pd.DataFrame(data, columns=columns, index = index)

Problem description

Since switching to 0.20.1, when using df.stack(0), the output looks like this:

                              HIGH  LAST PRICE      LOW
date                                                   
2017-05-04 HSBA LN Equity  672.500     663.800  661.100
           UCG IM Equity     2.768       2.760    2.694
           ISP IM Equity    16.020      15.970   15.490
2017-05-05 HSBA LN Equity  663.900     658.600  656.000
           UCG IM Equity     2.868       2.842    2.770
           ISP IM Equity    16.480      16.220   15.770
2017-05-08 HSBA LN Equity  664.100     660.600  658.900
           UCG IM Equity     2.898       2.852    2.826
           ISP IM Equity    16.490      16.010   15.940
2017-05-09 HSBA LN Equity  669.200     664.900  662.500
           UCG IM Equity     2.898       2.848    2.842
           ISP IM Equity    16.410      15.900   15.900
2017-05-10 HSBA LN Equity  673.400     670.900  663.800
           UCG IM Equity     2.888       2.850    2.802
           ISP IM Equity    16.150      16.090   15.590

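The columns change order and the tickers no longer correspond to the correct prices. For example, on 2017-05-04 UCG IM Equity shows a HIGH of 2.768, which is the ISP IM Equity value; the input has 16.02 for UCG IM Equity.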

Expected Output

0.19.2 maintains the correct hierarchy.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 35.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.1
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

dsm054 May 10, 2017

Contributor

Simpler case:

columns = pd.MultiIndex(levels=[['B', 'A'],
                                ['C', 'D']],
                        labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
df = pd.DataFrame(columns=columns, data=[range(4)])
assert df.loc[0, ("A","C")] == df.stack(0).loc[(0, "A"), "C"]

passes under 0.19.2 for me and fails under 0.20.1.

dsm054 May 11, 2017

Contributor

In 0.19.2, with the above dataframe, we have

In [8]: df.sortlevel(level=0, axis=1)
Out[8]: 
   B     A   
   C  D  C  D
0  0  1  2  3

In 0.20.1, we have

In [378]: df.sortlevel(level=0, axis=1)
/home/dsm/sys/miniconda3/envs/py36/bin/ipython:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
  #!/home/dsm/sys/miniconda3/envs/py36/bin/python
Out[378]: 
   A     B   
   C  D  C  D
0  2  3  0  1

or, to use the new syntax,

In [380]: df.sort_index(level=0, axis=1)
Out[380]: 
   A     B   
   C  D  C  D
0  2  3  0  1

That is, in 0.20.1, the sort actually affects the frame, and frame.columns and this.columns differ in _stack_multi_columns. As a result, I think this line:

    new_levels.append(frame.columns.levels[level_num])

is no longer justified, because the level order can have changed. I think that just using new_levels.append(level_vals) (taken from this after the sort) should suffice to repair it. Anyone else want to call dibs, or should I take a run at it?
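
The mismatch described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not the pandas internals: `levels`, `codes`, `group_in_pairs`, and `first_seen` are illustrative names, and the data mirrors the simpler case (columns B, B, A, A holding 0, 1, 2, 3).

```python
# One column level, modelled as stored names plus per-column codes.
levels = ['B', 'A']          # distinct names, in storage order
codes = [0, 0, 1, 1]         # which name each column uses
values = [0, 1, 2, 3]        # one row of data, in column order

# Decode each column's name, then sort the columns by name (a sort by value).
names = [levels[c] for c in codes]                         # ['B', 'B', 'A', 'A']
order = sorted(range(len(names)), key=lambda i: names[i])  # [2, 3, 0, 1]
sorted_names = [names[i] for i in order]                   # ['A', 'A', 'B', 'B']
sorted_values = [values[i] for i in order]                 # [2, 3, 0, 1]


def group_in_pairs(seq):
    """Split a flat list into consecutive pairs (two sub-columns per name)."""
    return [seq[i:i + 2] for i in range(0, len(seq), 2)]


def first_seen(seq):
    """Distinct items of seq, in order of first appearance."""
    out = []
    for item in seq:
        if item not in out:
            out.append(item)
    return out


groups = group_in_pairs(sorted_values)                     # [[2, 3], [0, 1]]

# Labelling the sorted data with the stored level order swaps the names.
mislabelled = dict(zip(levels, groups))
# Labelling it with the names as they appear after the sort keeps them aligned.
relabelled = dict(zip(first_seen(sorted_names), groups))

print(mislabelled)  # {'B': [2, 3], 'A': [0, 1]}
print(relabelled)   # {'A': [2, 3], 'B': [0, 1]}
```

In this model, taking the names from the stored level order corresponds to using `frame.columns.levels[level_num]`, while taking them from the sorted columns corresponds to the proposed `level_vals`.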

sinhrks May 11, 2017

Member

Thanks for the report. It looks like a regression. I think @dsm054's fix looks good.

dsm054 added a commit to dsm054/pandas that referenced this issue May 11, 2017

dsm054 added a commit to dsm054/pandas that referenced this issue May 11, 2017

@jreback jreback added this to the 0.20.2 milestone May 11, 2017

@jreback jreback closed this in #16325 May 11, 2017

jreback added a commit that referenced this issue May 11, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017

ilmioalias Jul 14, 2017

Hi, I think there is still a problem.

Code Sample, a copy-pastable example if possible

import pandas as pd

# We create a MultiIndex
PAE = ['ITA', 'FRA']
VAR = ['A1', 'A2']
TYP = ['CRT', 'DBT', 'NET']
MI = pd.MultiIndex.from_product([PAE, VAR, TYP], names=['PAE', 'VAR', 'TYP'])

# We create a DataFrame with the MultiIndex MI
V = [20, 10, 10, 40, 10, 30, 120, 110, 10, 140, 110, 30]
DF = pd.DataFrame(data=V, index=MI, columns=['VALUE'])

# We unstack the dataframe and drop level 0
DF = DF.unstack(['VAR', 'TYP'])
DF.columns = DF.columns.droplevel(0)
DF[('A0', 'NET')] = 9999

# We stack the dataframe
DF0 = DF.stack(['VAR', 'TYP'])
# DF0 is wrong
DF1 = DF.sort_index(axis=1).stack(['VAR', 'TYP'])
# DF1 is right 

Problem description

The data in DF0 does not correspond to the original data from before the unstack and droplevel.

Expected Output

PAE  VAR  TYP
FRA  A0   NET    9999.0
     A1   CRT     120.0
          DBT     110.0
          NET      10.0
     A2   CRT     140.0
          DBT     110.0
          NET      30.0
ITA  A0   NET    9999.0
     A1   CRT      20.0
          DBT      10.0
          NET      10.0
     A2   CRT      40.0
          DBT      10.0
          NET      30.0
dtype: float64
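
A check against the expected output above can be sketched as follows. It rebuilds the same data under different names and compares every stacked value with the original long-format value. `mismatches` is a hypothetical helper written for this illustration, not a pandas function, and the sketch starts from a Series so that the droplevel step is not needed.

```python
import pandas as pd

# Rebuild the long-format data from the code sample above.
index = pd.MultiIndex.from_product(
    [['ITA', 'FRA'], ['A1', 'A2'], ['CRT', 'DBT', 'NET']],
    names=['PAE', 'VAR', 'TYP'])
original = pd.Series([20, 10, 10, 40, 10, 30, 120, 110, 10, 140, 110, 30],
                     index=index, name='VALUE')

# Same reshaping as in the report: unstack, add a column, stack again.
wide = original.unstack(['VAR', 'TYP'])
wide[('A0', 'NET')] = 9999
stacked = wide.stack(['VAR', 'TYP'])


def mismatches(stacked, original):
    """Hypothetical helper: keys whose stacked value differs from the original."""
    return [key for key, expected in original.items()
            if key in stacked.index and stacked.loc[key] != expected]


# An empty list means every original value kept its (PAE, VAR, TYP) labels.
print(mismatches(stacked, original))
```

Keys that exist only after the reshaping, such as the added `('A0', 'NET')` column, are skipped because they have no original value to compare against.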

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.18.1
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung Jul 14, 2017

Member

@ilmioalias : Can you try installing 0.20.2 and report back if the output matches for you now?

dsm054 Jul 14, 2017

Contributor

I can confirm that it looks like @ilmioalias has found a different failure mode not solved by the first fix, and obviously not caught by the original tests, which I thought were pretty deep. :-/

gfyoung Jul 14, 2017

Member

@dsm054 : Are you using 0.20.3 ? If so, could you also try 0.20.2 and see what you get?

dsm054 Jul 14, 2017

Contributor

Same under both, more's the pity.

gfyoung Jul 14, 2017

Member

Okay, could you file a separate issue then and cross-reference this one? Afterwards, feel free to patch this since you were the one who spearheaded the initial fix.

dsm054 Jul 14, 2017

Contributor

Sure, I'll take it. I'm annoyed that something managed to sneak through. 😒

Thanks for the report, @ilmioalias!!

gfyoung Jul 14, 2017

Member

> I'm annoyed that something managed to sneak through.

Well, if we could write tests that ALWAYS covered EVERY use-case, then we wouldn't need to take contributions from anyone because our tests would be perfect. 😉

ilmioalias Jul 17, 2017

Hi,
I tried pandas 0.20.2 and it doesn't work either. I get the same result as with 0.20.3.

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
xarray: None
IPython: 5.4.1
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
