DataFrame.reset_index deletes index, does not allow ints as level arg #16263

Closed
m4g005 opened this Issue May 6, 2017 · 7 comments


m4g005 commented May 6, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
pd.__version__

u'0.20.1'

data = pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']) 
data
A B C D
0 -0.134549 2.352525 0.481132 -1.919506
1 1.980074 0.720437 0.410702 -0.703470
2 -3.063166 -0.781255 0.270469 -0.539081
3 -1.125265 0.308374 -0.166085 -1.253959
data.set_index(['A'], inplace=True)
data
B C D
A
-0.134549 2.352525 0.481132 -1.919506
1.980074 0.720437 0.410702 -0.703470
-3.063166 -0.781255 0.270469 -0.539081
-1.125265 0.308374 -0.166085 -1.253959
data.reset_index(level=['A'], inplace=True)
data
B C D
0 2.352525 0.481132 -1.919506
1 0.720437 0.410702 -0.703470
2 -0.781255 0.270469 -0.539081
3 0.308374 -0.166085 -1.253959

Problem description

Between v0.19.2 and v0.20.1, the behavior of DataFrame.reset_index changed.
With a single column set as the index:

  • It does not attempt to keep the column (essentially making drop=True always on)
  • level=int no longer works (iterables work)
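
For comparison, here is a minimal sketch of the behaviour the report expects, assuming a pandas release in which this regression is fixed (the issue is milestoned for 0.20.2). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
indexed = df.set_index("A")  # flat (single-level) index named "A"

# With a flat index, every valid spelling of the level should move "A"
# back into the columns.
by_name = indexed.reset_index(level="A")
by_list = indexed.reset_index(level=["A"])
by_position = indexed.reset_index(level=0)

# The column should disappear only when drop=True is requested explicitly.
dropped = indexed.reset_index(level="A", drop=True)

print(list(by_name.columns))
print(list(dropped.columns))
```

On an affected 0.20.1 install, the first three calls are the ones that lose column A or reject the integer level.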

Expected Output

v0.19.2 results:

data.reset_index(level=['A'], inplace=True)
data
A B C D
0 0.100442 -0.620740 -2.018020 1.059871
1 -0.530272 0.402598 -1.453445 -0.729623
2 -1.040126 -0.536687 -1.136123 -0.748891
3 -0.269727 0.182250 0.847344 0.785692

Output of pd.show_versions()

import pandas as pd
import numpy as np
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

toobaz commented May 6, 2017

This is related to passing the level= argument when there is a non-MultiIndex. Before, the argument would be just discarded:

In [3]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level='not present')
Out[3]: 
   index         A         B         C         D
0      0  0.057457  0.065932  0.276079  0.305390
1      1 -0.562195 -0.385750 -0.228925 -0.426511
2      2  0.377559 -0.837031 -0.384840 -0.305262
3      3 -0.670057 -0.737446  0.561989  0.528754

In [4]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level=['not present'])
Out[4]: 
   index         A         B         C         D
0      0  0.613373 -0.169316 -0.592379  1.050764
1      1  0.069762  0.995308  0.030434 -0.361300
2      2 -0.526487  0.165054  0.015452  0.954447
3      3  0.585677 -1.435712 -0.298280 -0.581473

but a7a0574 changed the behaviour so that now, vice-versa, even valid level names/indices are not considered.

I can provide a PR, the question is what we want to do with non-existent level names: raise or ignore?
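
To make the "raise or ignore" choice concrete, here is a pure-Python sketch of resolving a level argument against an index's level names. The function `resolve_levels` and its `on_missing` parameter are hypothetical and written only for illustration; this is not pandas code and not the contents of the eventual PR.

```python
def resolve_levels(index_names, level, on_missing="raise"):
    """Translate `level` into positions within `index_names`.

    Hypothetical helper for illustration only; not pandas code.
    `level` may be a name, an integer position, or a list/tuple of either.
    `on_missing` selects the policy for unknown levels: "raise" or "ignore".
    """
    if on_missing not in ("raise", "ignore"):
        raise ValueError("on_missing must be 'raise' or 'ignore'")
    # Accept a scalar as shorthand for a one-element list.
    if not isinstance(level, (tuple, list)):
        level = [level]

    nlevels = len(index_names)
    positions = []
    for lev in level:
        if lev in index_names:
            # Names take precedence over integer positions.
            positions.append(index_names.index(lev))
        elif isinstance(lev, int) and -nlevels <= lev < nlevels:
            # Negative positions count from the end.
            positions.append(lev % nlevels)
        elif on_missing == "raise":
            raise KeyError("Level %s not found" % str(lev))
        # on_missing == "ignore": skip the unknown level silently.
    return positions


print(resolve_levels(["A"], "A"))                                  # flat index, by name
print(resolve_levels(["A"], 0))                                    # flat index, by position
print(resolve_levels(["A"], "not present", on_missing="ignore"))   # unknown level skipped
```

Under this sketch a flat index is simply the one-name case, so names and integers are treated the same way whether or not the index is a MultiIndex.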


toobaz commented May 6, 2017

> I can provide a PR, the question is what we want to do with non-existent level names: raise or ignore?

Sorry, the question is already answered by the behaviour when there is a MultiIndex:

In [5]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    610                                  'level number' % level)
--> 611             level = self.names.index(level)
    612         except ValueError:

ValueError: 'E' is not in list

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-f1686d7d4dfc> in <module>()
----> 1 pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])

/home/nobackup/repo/pandas/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/frame.py in <listcomp>(.0)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    612         except ValueError:
    613             if not isinstance(level, int):
--> 614                 raise KeyError('Level %s not found' % str(level))
    615             elif level < 0:
    616                 level += self.nlevels

KeyError: 'Level E not found'
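
The same check can be reproduced without reading the traceback. This is a small sketch, assuming a pandas version in which the MultiIndex case still raises KeyError as shown above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
multi = df.set_index(["A", "B"])  # two-level MultiIndex

# A level that exists is moved back into the columns; "B" stays in the index.
partial = multi.reset_index(level="A")

# A level that does not exist is rejected rather than silently ignored.
try:
    multi.reset_index(level=["A", "E"])
except KeyError as exc:
    error_message = str(exc)
else:
    error_message = None

print(list(partial.index.names), list(partial.columns))
print(error_message)
```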

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz BUG: support for "level=" when reset_index() is called with a flat Index
closes #16263
d27ae91

jorisvandenbossche added this to the 0.20.2 milestone May 6, 2017

@m4g005 Thanks for the report! And @toobaz for the quick analysis.
This is indeed a regression (although it seems it was working more by accident before).

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz BUG: support for "level=" when reset_index() is called with a flat Index
closes #16263
fd83502

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz BUG: support for "level=" when reset_index() is called with a flat Index
closes #16263
f4c3121

toobaz added a commit to toobaz/pandas that referenced this issue May 6, 2017

toobaz BUG: support for "level=" when reset_index() is called with a flat Index
closes #16263
7cc21c2

jreback added the Indexing label May 6, 2017

schwab commented May 6, 2017 edited

Perhaps it was working by accident before, but the new behavior of completely dropping the index column when reset_index is called seems problematic. Additionally, the docs for reset_index say "For a standard index, the index name will be used...", so the new behavior is also out of sync with the documented spec. It also raises an important question: if we do want to keep this behavior going forward, then what is the new "correct" way to remove a single column index from a dataframe while keeping its data?

schwab commented May 6, 2017 edited

@m4g005 To get this working the same way in both versions, you can call it without the level argument:

data.reset_index(inplace=True)
data

A B C D
0 1.11556 1.21351 -0.185124 0.868765
1 1.63402 0.322284 0.299842 -0.174827
2 -1.21852 -0.35271 0.773597 1.62995
3 -0.416348 -0.113201 -0.151533 -1.01033
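
A non-inplace version of this workaround, as a brief sketch with illustrative variable names. It also shows where the column label comes from: the index name if there is one, otherwise "index", as in the unnamed-index outputs earlier in the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])

# Named flat index: reset_index() with no level argument restores column "A".
restored = df.set_index("A").reset_index()

# Unnamed default index: the new column is labelled "index".
from_unnamed = df.reset_index()

print(list(restored.columns))
print(list(from_unnamed.columns))
```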

toobaz commented May 6, 2017

> then what is the new "correct" way to remove a single column index from a dataframe while keeping its data?

This is going to be fixed, no doubt.

Indeed, @schwab, as I confirmed above, this is a regression: it is supposed to work, and @toobaz has already made a PR to fix it.
