ERR: invalid MultiIndex construction #17464

Closed
flipdazed opened this issue Sep 7, 2017 · 2 comments · Fixed by #17971
Labels
Error Reporting Incorrect or improved errors from pandas good first issue Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

@flipdazed

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
idx0 = range(2)
idx1 = np.repeat(range(2), 2)

midx = pd.MultiIndex(
    levels=[idx0, idx1],
    labels=[
        np.repeat(range(len(idx0)), len(idx1)),
        np.tile(range(len(idx1)), len(idx0))
    ],
    names=['idx0', 'idx1']
)

df = pd.DataFrame(
    [
        [i**2/float(j), 'example{}'.format(i), i**3/float(j)]
        for j in range(1, len(idx0) + 1)
        for i in range(1, len(idx1) + 1)
    ],
    columns=['col0', 'col1', 'col2'],
    index=midx
)

example = df.loc[[(0, 1)]]

For display

In [13]: example
Out[13]:
           col0      col1  col2
idx0 idx1
0    1     16.0  example4  64.0

Problem description

Taken from this StackOverflow post:

Given the following dataframe

           col0      col1  col2
idx0 idx1
0    0      1.0  example1   1.0
     0      4.0  example2   8.0
     1      9.0  example3  27.0
     1     16.0  example4  64.0
1    0      0.5  example1   0.5
     0      2.0  example2   4.0
     1      4.5  example3  13.5
     1      8.0  example4  32.0

the .xs operation will select

In [121]: df.xs((0,1), level=[0,1])
Out[121]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

whilst the .loc operation will select

In [125]: df.loc[[(0,1)]]
Out[125]:
           col0      col1  col2
idx0 idx1
0    1     16.0  example4  64.0

This is highlighted even further by the following

In [149]: df.loc[pd.IndexSlice[:, 1], :]
Out[149]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

In [150]: df.loc[pd.IndexSlice[0, 1], :]
Out[150]:
col0          16
col1    example4
col2          64
Name: (0, 1), dtype: object

Expected Output

Note that this only works for this minimal example because there is only one label in the level 0 index axis.

In [8]: df.loc[pd.IndexSlice[:, 1], :]
Out[8]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0

@gfyoung gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 7, 2017
@colinalexander

The midx returned as constructed above is as follows:

MultiIndex(levels=[[0, 1], 
                   [0, 0, 1, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], 
                   [0, 0, 1, 1, 0, 0, 1, 1]],
           names=[u'idx0', u'idx1'])

Whereas typical construction would be:

>>> midx = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=['idx0', 'idx1'])
>>> midx
MultiIndex(levels=[[0, 1], 
                   [0, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], 
                   [0, 0, 1, 1, 0, 0, 1, 1]],
           names=[u'idx0', u'idx1'])

Using the typical construction, indexing via df.loc[[(0, 1)]] appears to behave as expected.

I believe the issue here is how this indexing handles duplicates in one of the levels (i.e. the 1).
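A short sketch of the difference, assuming a reasonably recent pandas: `from_product` factorizes each input, so repeated values collapse into a single level entry and the repetition shows up only in the index tuples. The variable names below are illustrative only.

```python
import pandas as pd

# Same inputs as the "typical construction" above. The second input
# contains duplicates, but from_product deduplicates them when it
# builds the levels.
midx = pd.MultiIndex.from_product(
    [[0, 1], [0, 0, 1, 1]], names=["idx0", "idx1"]
)

# Each level holds only distinct values.
levels = [level.tolist() for level in midx.levels]
print(levels)

# The index as a whole is still non-unique: the tuple (0, 1) occurs twice.
occurrences = midx.tolist().count((0, 1))
print(occurrences)
print(midx.is_unique)
```

Here the levels come out as `[[0, 1], [0, 1]]` and `(0, 1)` appears twice, so the duplication lives in the row entries rather than in the level values.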

@jreback
Contributor

jreback commented Sep 7, 2017

By constructing the MultiIndex directly, you are violating one of its guarantees, namely that the values in each level are unique. We don't explicitly check this because the public constructors guarantee it. I suppose we could check it when verify_integrity=True.

Want to do a PR for this?

To clarify, you certainly can have a non-unique MultiIndex (though this is generally discouraged, as they are not that performant), but the duplicates would be in the labels, never in the level values.
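As a rough illustration of the kind of check suggested above, the following sketch scans each level for repeated values and raises if it finds any. It is plain Python and is not the check pandas itself implements; the helper names `find_duplicate_level_values` and `check_levels_unique` and the error text are hypothetical.

```python
from collections import Counter


def find_duplicate_level_values(levels):
    """Map each level position to the sorted values that repeat in it."""
    duplicates = {}
    for position, level in enumerate(levels):
        counts = Counter(level)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[position] = repeated
    return duplicates


def check_levels_unique(levels):
    """Raise ValueError if any level contains a repeated value."""
    duplicates = find_duplicate_level_values(levels)
    if duplicates:
        raise ValueError(
            "level values must be unique, found duplicates: {}".format(duplicates)
        )


# Levels from the direct construction in this issue: level 1 repeats 0 and 1.
bad_levels = [[0, 1], [0, 0, 1, 1]]
# Levels as produced by the public constructors.
good_levels = [[0, 1], [0, 1]]

print(find_duplicate_level_values(bad_levels))   # {1: [0, 1]}
print(find_duplicate_level_values(good_levels))  # {}
```

Calling `check_levels_unique(bad_levels)` raises a `ValueError`, while `check_levels_unique(good_levels)` returns silently. A real implementation would presumably run only when `verify_integrity=True`, as proposed in the comment.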

@jreback jreback added Error Reporting Incorrect or improved errors from pandas MultiIndex Difficulty Novice labels Sep 7, 2017
@jreback jreback added this to the Next Major Release milestone Sep 7, 2017
@jreback jreback changed the title iloc only returns single entry for duplicate indices when a single index is provided ERR: invalid MultiIndex construction Sep 7, 2017
@jreback jreback modified the milestones: Next Major Release, 0.22.0 Oct 27, 2017