Problems when accessing MultiIndex level with duplicated level names #18872

Closed
toobaz opened this Issue Dec 20, 2017 · 4 comments

@toobaz
Member

toobaz commented Dec 20, 2017

Code Sample, a copy-pastable example if possible

In [2]: mi = pd.MultiIndex.from_product([['i'], ['ii'], ['iii']])

In [3]: mi.rename([1,5,6]).get_level_values(1) # Interpreted as label
Out[3]: Index(['i'], dtype='object', name=1)

In [4]: mi.rename([1,5,1]).get_level_values(1) # Interpreted as index
Out[4]: Index(['ii'], dtype='object', name=5)

In [5]: mi.rename(['a',5,'a']).get_level_values('a') # ValueError is OK, KeyError is not
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    670                 raise ValueError('The name %s occurs multiple times, use a '
--> 671                                  'level number' % level)
    672             level = self.names.index(level)

ValueError: The name a occurs multiple times, use a level number

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-e8a616f9610f> in <module>()
----> 1 mi.rename(['a',5,'a']).get_level_values('a')

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_level_values(self, level)
    975         Index(['d', 'e', 'f'], dtype='object', name='level_2')
    976         """
--> 977         level = self._get_level_number(level)
    978         values = self._get_level_values(level)
    979         return values

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    673         except ValueError:
    674             if not isinstance(level, int):
--> 675                 raise KeyError('Level %s not found' % str(level))
    676             elif level < 0:
    677                 level += self.nlevels

KeyError: 'Level a not found'

In [6]: mi.rename([1, 'a', 'a']).get_level_values(1) # How am I going to access the second level?!
Out[6]: Index(['i'], dtype='object', name=1)

Problem description

There are different problems, but the root cause is (I think) the same:

  1. the first is trivial: the KeyError in In [5]: should not appear
  2. the second is that the interpretation of an integer changes when there is a duplicate name (difference between Out[3] and Out[4])
  3. the third is that in a situation like In [6], I have no way whatsoever to access the second level: its name is duplicated, and its position (1) is also the name of another level. Sure, this is a perverse example, but I suspect it can bite when users use duplicate labels and pandas internal code adds integer labels.
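
To make the common root cause easier to see, here is a minimal sketch of the lookup logic, pieced together from the traceback fragments above. It is a simplified stand-in, not the actual pandas source: `get_level_number` is a hypothetical free function taking a plain list of names, and the final bounds check is an assumption about what follows line 677.

```python
def get_level_number(names, level):
    """Simplified stand-in for MultiIndex._get_level_number (not the pandas source)."""
    nlevels = len(names)
    try:
        if names.count(level) > 1:
            raise ValueError('The name %s occurs multiple times, use a '
                             'level number' % level)
        # list.index raises ValueError when the name is absent.
        return names.index(level)
    except ValueError:
        # Both "duplicated name" and "name not found" land here, so a
        # duplicated non-integer name ends up reported as a KeyError.
        if not isinstance(level, int):
            raise KeyError('Level %s not found' % str(level))
        # An integer that was not resolved as a name is treated as a position.
        if level < 0:
            level += nlevels
        # Assumption: out-of-range positions are rejected here.
        if not 0 <= level < nlevels:
            raise IndexError('Too many levels: index has only %d levels' % nlevels)
        return level


print(get_level_number([1, 5, 6], 1))      # 0: the integer is matched as a name
print(get_level_number([1, 5, 1], 1))      # 1: duplicated name, falls back to position
print(get_level_number([1, 'a', 'a'], 1))  # 0: position 1 cannot be reached
```

Because the `ValueError` for a duplicated name is raised inside the same `try` block that handles a missing name, the meaning of an integer depends on whether that integer happens to be duplicated among the names.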

Expected Output

If we were to design this from scratch, the solution would be simple: prioritize the "index" interpretation of an integer over the "label" interpretation, so that the former is always unambiguous. Is this too strong an API change?

If the answer is "no", I will be happy to implement it, possibly with a temporary warning in those cases where the behaviour will change (that is: requested label is integer and is present in the names).

If the answer is "yes", I would like to at least suppress the KeyError in In [5]: and have In [4]: raise an error rather than return a result inconsistent with In [3]:.
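
For illustration, the "index first" option with a temporary warning could look roughly like the sketch below. This is only a sketch of the proposal under stated assumptions: `get_level_number_index_first` is a hypothetical helper working on a plain list of names, and the choice of `FutureWarning` and the message wording are assumptions, not an existing pandas API.

```python
import warnings


def get_level_number_index_first(names, level):
    """Hypothetical lookup in which an integer always means a position."""
    nlevels = len(names)
    if isinstance(level, int):
        if level in names:
            # The case whose behaviour would change: the integer is also a name.
            warnings.warn('Integer level %d is also a level name; it is '
                          'interpreted as a position' % level, FutureWarning)
        if level < 0:
            level += nlevels
        if not 0 <= level < nlevels:
            raise IndexError('Too many levels: index has only %d levels' % nlevels)
        return level
    count = names.count(level)
    if count > 1:
        # Duplicated name: surfaces as ValueError, never as KeyError.
        raise ValueError('The name %s occurs multiple times, use a '
                         'level number' % level)
    if count == 0:
        raise KeyError('Level %s not found' % str(level))
    return names.index(level)
```

With this ordering, `get_level_number_index_first([1, 'a', 'a'], 1)` would return position 1 (after warning), so the second level in In [6] becomes reachable, and a duplicated string name raises only the `ValueError`.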

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 04db779
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+388.g04db779d4
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz

Member

toobaz commented Dec 20, 2017

Sorry, had missed #10461. Maybe it is a duplicate, depending on how we want to proceed.

@jreback

Contributor

jreback commented Dec 20, 2017

we should disallow duplicate level names

@toobaz

Member

toobaz commented Dec 20, 2017

OK. Considering however that the default level names are (duplicate) None, would you rather

  • disallow non-None duplicate level names (cleaner)
    or
  • disallow integer duplicate level names (less disruptive)
    ?
@jreback

Contributor

jreback commented Dec 20, 2017

I think 1) disallow non-None duplicate level names.
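
As a rough sketch of what option 1 means in practice (not the code that was eventually merged), a name check could skip `None` and reject any other repeated name. `validate_level_names` is a hypothetical helper, and the error message is an assumption.

```python
def validate_level_names(names):
    """Hypothetical check: reject duplicated level names, except None."""
    seen = set()
    for name in names:
        if name is None:
            # Default (unnamed) levels may repeat freely.
            continue
        if name in seen:
            raise ValueError('Duplicated level name: %r' % (name,))
        seen.add(name)
    return list(names)


validate_level_names([None, None, None])  # accepted: the default names
validate_level_names([1, 5, 6])           # accepted: all names distinct
```

Under such a rule, the renames in In [4], In [5] and In [6] would fail up front, so the ambiguous lookups could no longer be constructed.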

toobaz added a commit to toobaz/pandas that referenced this issue Dec 20, 2017

@toobaz toobaz referenced this issue Dec 20, 2017

Merged

API: disallow duplicate level names #18882

4 of 4 tasks complete

toobaz added a commit to toobaz/pandas that referenced this issue Dec 21, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 22, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 22, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 22, 2017

@jreback jreback added this to the 0.23.0 milestone Dec 23, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 23, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 23, 2017

toobaz added a commit to toobaz/pandas that referenced this issue Dec 29, 2017

jreback added a commit that referenced this issue Dec 29, 2017

hexgnu added a commit to hexgnu/pandas that referenced this issue Jan 1, 2018
