ERR: invalid MultiIndex construction #17464

Closed
flipdazed opened this Issue Sep 7, 2017 · 2 comments

flipdazed commented Sep 7, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
idx0 = range(2)
idx1 = np.repeat(range(2), 2)

midx = pd.MultiIndex(
    levels=[idx0, idx1],
    labels=[
        np.repeat(range(len(idx0)), len(idx1)),
        np.tile(range(len(idx1)), len(idx0))
    ],
    names=['idx0', 'idx1']
)

df = pd.DataFrame(
    [
        [i**2/float(j), 'example{}'.format(i), i**3/float(j)]
        for j in range(1, len(idx0) + 1)
        for i in range(1, len(idx1) + 1)
    ],
    columns=['col0', 'col1', 'col2'],
    index=midx
)

example = df.loc[[(0, 1)]]

For display

In [13]: example
Out[13]:
           col0      col1  col2
idx0 idx1
0    1     16.0  example4  64.0

Problem description

Taken from this StackOverflow post:

Given the following dataframe

           col0      col1  col2
idx0 idx1
0    0      1.0  example1   1.0
     0      4.0  example2   8.0
     1      9.0  example3  27.0
     1     16.0  example4  64.0
1    0      0.5  example1   0.5
     0      2.0  example2   4.0
     1      4.5  example3  13.5
     1      8.0  example4  32.0

the .xs operation will select

In [121]: df.xs((0,1), level=[0,1])
Out[121]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

whilst the .loc operation will select

In [125]: df.loc[[(0,1)]]
Out[125]:
           col0      col1  col2
idx0 idx1
0    1     16.0  example4  64.0

This is highlighted even further by the following

In [149]: df.loc[pd.IndexSlice[:, 1], :]
Out[149]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

In [150]: df.loc[pd.IndexSlice[0, 1], :]
Out[150]:
col0          16
col1    example4
col2          64
Name: (0, 1), dtype: object

Expected Output

Note that this only works for this minimal example because there is only one label in the level 0 index axis.

In [8]: df.loc[pd.IndexSlice[:, 1], :]
Out[8]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0

@gfyoung gfyoung added the Indexing label Sep 7, 2017

colin1alexander commented Sep 7, 2017

The midx returned as constructed above is as follows:

MultiIndex(levels=[[0, 1], 
                   [0, 0, 1, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], 
                   [0, 0, 1, 1, 0, 0, 1, 1]],
           names=[u'idx0', u'idx1'])

Whereas typical construction would be:

>>> midx = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=['idx0', 'idx1'])
>>> midx
MultiIndex(levels=[[0, 1], 
                   [0, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], 
                   [0, 0, 1, 1, 0, 0, 1, 1]],
           names=[u'idx0', u'idx1'])

Using the typical construction, indexing via df.loc[[(0, 1)]] appears to behave as expected.

I believe the issue here is how this indexing handles duplicates in one of the levels (i.e. the 1).
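A minimal sketch of the comparison described in this comment, assuming a reasonably recent pandas. It rebuilds the frame from the original report on top of an index made with from_product, then checks that each level holds unique values and that the list-of-tuples lookup returns both matching rows. The names rows, level_values, by_loc and by_mask are illustrative only and do not appear in the report.

```python
import pandas as pd

# Public constructors factorize their input, so each level ends up unique
# even though the second iterable repeats 0 and 1.
midx = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=["idx0", "idx1"])

# Same data as in the original report: 2 outer values x 4 inner values.
rows = [
    [i**2 / float(j), "example{}".format(i), i**3 / float(j)]
    for j in range(1, 3)
    for i in range(1, 5)
]
df = pd.DataFrame(rows, columns=["col0", "col1", "col2"], index=midx)

# Unique values per level; the duplication lives in the integer codes.
level_values = [list(level) for level in midx.levels]

# Lookup with a list of tuples, as in the report.
by_loc = df.loc[[(0, 1)]]

# Cross-check with a boolean mask built from the level values.
mask = (midx.get_level_values("idx0") == 0) & (midx.get_level_values("idx1") == 1)
by_mask = df[mask]

print(level_values)
print(by_loc)
```

The boolean mask is only there as an independent check: it compares the expanded level values directly, so it does not rely on how .loc resolves duplicate entries.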

Contributor

jreback commented Sep 7, 2017

Constructing the MultiIndex directly like this violates a guarantee, namely that the values in each level are unique. We don't explicitly check this, since the public constructors guarantee it. I suppose we could check it when verify_integrity=True.

Want to do a PR for this?

To clarify, you certainly can have a non-unique MultiIndex (though this is generally discouraged, as they are not that performant), but the duplicates would be in the labels, never in the level values.
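A short sketch of the distinction between duplicate labels and duplicate level values. It assumes a later pandas release than the 0.20.3 in the report: there the keyword is labels, while later releases call the same argument codes and reject duplicate level values when verify_integrity=True. The helper name build_with_duplicate_level_values and the variables rejected and valid are hypothetical.

```python
import pandas as pd


def build_with_duplicate_level_values():
    # Level 1 repeats 0 and 1, which breaks the uniqueness guarantee.
    return pd.MultiIndex(
        levels=[[0, 1], [0, 0, 1, 1]],
        codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
        names=["idx0", "idx1"],
        verify_integrity=True,
    )


try:
    build_with_duplicate_level_values()
    rejected = False
except ValueError as exc:
    # Later pandas releases raise here because a level is not unique.
    rejected = True
    print(exc)

# A valid non-unique MultiIndex: every level is unique, and the
# repetition is expressed through the codes instead.
valid = pd.MultiIndex(
    levels=[[0, 1], [0, 1]],
    codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1]],
    names=["idx0", "idx1"],
)

print(valid.is_unique)
print([level.is_unique for level in valid.levels])
```

The second index describes the same eight rows as the one in the report, so the intended data can be expressed without putting duplicates in the levels.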

@jreback jreback added this to the Next Major Release milestone Sep 7, 2017

@jreback jreback changed the title from `iloc` only returns single entry for duplicate indices when a single index is provided to ERR: invalid MultiIndex construction Sep 7, 2017

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Oct 27, 2017

@TomAugspurger TomAugspurger referenced this issue Apr 15, 2018

Merged

Fix nonempty_index for CategoricalIndex #3394

3 of 4 tasks complete