DataFrame.reset_index deletes index, does not allow ints as level arg #16263

Closed
m4g005 opened this Issue May 6, 2017 · 7 comments

m4g005 commented May 6, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
pd.__version__

u'0.20.1'

data = pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D'])
data
          A         B         C         D
0 -0.134549  2.352525  0.481132 -1.919506
1  1.980074  0.720437  0.410702 -0.703470
2 -3.063166 -0.781255  0.270469 -0.539081
3 -1.125265  0.308374 -0.166085 -1.253959
data.set_index(['A'], inplace=True)
data
                  B         C         D
A
-0.134549  2.352525  0.481132 -1.919506
 1.980074  0.720437  0.410702 -0.703470
-3.063166 -0.781255  0.270469 -0.539081
-1.125265  0.308374 -0.166085 -1.253959
data.reset_index(level=['A'], inplace=True)
data
          B         C         D
0  2.352525  0.481132 -1.919506
1  0.720437  0.410702 -0.703470
2 -0.781255  0.270469 -0.539081
3  0.308374 -0.166085 -1.253959

Problem description

Between v0.19.2 and v0.20.1, the behavior of DataFrame.reset_index changed.
With a single-level (non-MultiIndex) index:

  • It no longer keeps the index as a column (effectively behaving as if drop=True were always set)
  • level=int no longer works (iterables work)
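
The expected behaviour can be written as a small check. This is a minimal sketch, assuming a pandas release in which this regression is fixed; the frame contents and the names `indexed`, `by_name` and `by_position` are illustrative and not taken from the report.

```python
import pandas as pd

# Small deterministic frame; the values are illustrative only.
data = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [0.1, 0.2, 0.3, 0.4],
        "C": [10, 20, 30, 40],
        "D": [5, 6, 7, 8],
    }
)
indexed = data.set_index(["A"])

# Resetting by level name should move 'A' back into the columns.
by_name = indexed.reset_index(level=["A"])

# Resetting by integer position should do the same for a single-level index.
by_position = indexed.reset_index(level=0)

print(list(by_name.columns))
print(list(by_position.columns))
```

On an affected 0.20.1 install, the report above suggests 'A' would be missing from both column lists.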

Expected Output

v0.19.2 results:

data.reset_index(level=['A'], inplace=True)
data
          A         B         C         D
0  0.100442 -0.620740 -2.018020  1.059871
1 -0.530272  0.402598 -1.453445 -0.729623
2 -1.040126 -0.536687 -1.136123 -0.748891
3 -0.269727  0.182250  0.847344  0.785692

Output of pd.show_versions()

import pandas as pd
import numpy as np
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

Member

toobaz commented May 6, 2017

This is related to passing the level= argument when the index is not a MultiIndex. Previously, the argument was simply discarded:

In [3]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level='not present')
Out[3]: 
   index         A         B         C         D
0      0  0.057457  0.065932  0.276079  0.305390
1      1 -0.562195 -0.385750 -0.228925 -0.426511
2      2  0.377559 -0.837031 -0.384840 -0.305262
3      3 -0.670057 -0.737446  0.561989  0.528754

In [4]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level=['not present'])
Out[4]: 
   index         A         B         C         D
0      0  0.613373 -0.169316 -0.592379  1.050764
1      1  0.069762  0.995308  0.030434 -0.361300
2      2 -0.526487  0.165054  0.015452  0.954447
3      3  0.585677 -1.435712 -0.298280 -0.581473

but a7a0574 changed the behaviour so that now, conversely, even valid level names/indices are not taken into account.

I can provide a PR, the question is what we want to do with non-existent level names: raise or ignore?

Member

toobaz commented May 6, 2017

> I can provide a PR, the question is what we want to do with non-existent level names: raise or ignore?

Sorry, the question is already answered by the behaviour when there is a MultiIndex:

In [5]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    610                                  'level number' % level)
--> 611             level = self.names.index(level)
    612         except ValueError:

ValueError: 'E' is not in list

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-f1686d7d4dfc> in <module>()
----> 1 pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])

/home/nobackup/repo/pandas/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/frame.py in <listcomp>(.0)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    612         except ValueError:
    613             if not isinstance(level, int):
--> 614                 raise KeyError('Level %s not found' % str(level))
    615             elif level < 0:
    616                 level += self.nlevels

KeyError: 'Level E not found'
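
The traceback shows the two steps applied to `level`: a scalar is wrapped in a list, then each entry is resolved to a level number, with a `KeyError` for an unknown name. The sketch below mirrors that logic in plain Python to illustrate the "raise" option. `resolve_levels` is a hypothetical helper, not a pandas function, and the integer bounds check is an assumption that the traceback does not show.

```python
def resolve_levels(names, level):
    """Map level names or integer positions to level numbers.

    `names` is the list of index level names (one entry for a plain Index).
    """
    # A scalar is treated as a one-element list, as in the traceback above.
    if not isinstance(level, (tuple, list)):
        level = [level]

    resolved = []
    for lev in level:
        if lev in names:
            # A matching name wins over an integer position.
            resolved.append(names.index(lev))
        elif isinstance(lev, int):
            # Assumption: negative integers count from the end.
            number = lev + len(names) if lev < 0 else lev
            if not 0 <= number < len(names):
                raise IndexError("Level %s out of range" % lev)
            resolved.append(number)
        else:
            raise KeyError("Level %s not found" % str(lev))
    return resolved


print(resolve_levels(["A"], "A"))
print(resolve_levels(["A", "B"], ["B", 0]))
```

With this sketch, `resolve_levels(["A", "B"], ["A", "E"])` raises `KeyError`, matching the MultiIndex behaviour shown above; applying the same rule to a single-level index would make a valid name or position work and an invalid one fail loudly.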
Member

jorisvandenbossche commented May 6, 2017

@m4g005 Thanks for the report! And thanks @toobaz for the quick analysis.
This is indeed a regression (although it seems it was working more by accident before).

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

@jreback jreback added the Indexing label May 6, 2017

schwab commented May 6, 2017

Perhaps it was working by accident before, but the new behavior of completely dropping the index column when reset_index is called seems problematic. Additionally, the docs for reset_index say "For a standard index, the index name will be used...", which means the behavior is now out of sync with the documented spec. It also raises an important question: if we do want to keep this behavior going forward, what is the new "correct" way to remove a single-column index from a dataframe while keeping its data?

schwab commented May 6, 2017

@m4g005 To get this working the same way in both versions, you can call it without the level argument.

data.reset_index(inplace=True)
data
          A         B         C         D
0   1.11556   1.21351 -0.185124  0.868765
1   1.63402  0.322284  0.299842 -0.174827
2  -1.21852  -0.35271  0.773597   1.62995
3 -0.416348 -0.113201 -0.151533  -1.01033
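
A minimal sketch of this workaround on a deterministic frame, using only the `set_index` and `reset_index` calls already shown in this thread; the data is illustrative. Without `level`, the whole index goes back into the columns, which for a single-level index is what `level=['A']` was meant to do.

```python
import pandas as pd

# Illustrative data, not taken from the report.
data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
data.set_index(["A"], inplace=True)

# Without `level`, the entire index is moved back into the columns.
data.reset_index(inplace=True)

print(list(data.columns))
print(list(data.index))
```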
Member

toobaz commented May 6, 2017

> then what is the new "correct" way to remove a single column index from a dataframe while keeping its data?

This is going to be fixed, no doubt.

Member

jorisvandenbossche commented May 6, 2017

Indeed, @schwab, as I confirmed above, this is a regression: it is supposed to work, and @toobaz has already made a PR to fix it.
