BUG: MultiIndex.levels can propagate stale values from parent DataFrame #55315

@trianta2

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from itertools import product
import pandas as pd


df = pd.DataFrame(product([1,2,3], ['a', 'b', 'c']), columns=['foo', 'bar'])
df['val'] = range(len(df))

print(df)
#    foo bar  val
# 0    1   a    0
# 1    1   b    1
# 2    1   c    2
# 3    2   a    3
# 4    2   b    4
# 5    2   c    5
# 6    3   a    6
# 7    3   b    7
# 8    3   c    8

df2 = df.set_index(['foo', 'bar'])

print(df2)
#          val
# foo bar
# 1   a      0
#     b      1
#     c      2
# 2   a      3
#     b      4
#     c      5
# 3   a      6
#     b      7
#     c      8

df3 = df2.query('val.between(3, 5)')

print(df3)
#          val
# foo bar
# 2   a      3
#     b      4
#     c      5

print(df3.index.levels[0])
# Index([1, 2, 3], dtype='int64', name='foo')

print(df3.index.get_level_values(0))
# Index([2, 2, 2], dtype='int64', name='foo')

Issue Description

When you slice or query a DataFrame that has a MultiIndex, the MultiIndex.levels attribute of the result can still contain index values that no longer exist in the resulting DataFrame.

In the example above, df3.index.levels[0] includes the foo values 1 and 3, but after the query it should include only the foo value 2. By contrast, df3.index.get_level_values(0) returns the correct result.

I'm not familiar with the pandas code base, but it appears that a caching decorator is involved, which might explain why stale values are propagated to the new DataFrame.

Expected Behavior

df3.index.levels[0] should produce Index([2], dtype='int64', name='foo')
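
For anyone who needs the pruned levels in the meantime, the sketch below shows one possible workaround using the existing MultiIndex.remove_unused_levels() method, which returns a new MultiIndex whose levels contain only the values still in use. It rebuilds the example data and, as an assumption made purely for illustration, filters with a boolean mask instead of query() so that it does not depend on the expression engine. The variable names (stale_levels, pruned, pruned_levels, used) are hypothetical.

```python
from itertools import product

import pandas as pd

# Rebuild the frame from the reproducible example.
df = pd.DataFrame(product([1, 2, 3], ["a", "b", "c"]), columns=["foo", "bar"])
df["val"] = range(len(df))
df2 = df.set_index(["foo", "bar"])

# Boolean-mask filter, equivalent in effect to query('val.between(3, 5)').
df3 = df2[df2["val"].between(3, 5)]

# .levels still lists every value defined on the parent index.
stale_levels = list(df3.index.levels[0])

# remove_unused_levels() returns a new MultiIndex with pruned levels;
# the original index on df3 is left unchanged.
pruned = df3.index.remove_unused_levels()
pruned_levels = list(pruned.levels[0])

# The labels actually present in the filtered frame.
used = list(df3.index.get_level_values(0).unique())

print(stale_levels, pruned_levels, used)
```

To keep the pruned index on the frame, it can be assigned back with df3.index = df3.index.remove_unused_levels(), ideally on a copy of df3.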

Installed Versions

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.5.final.0
python-bits : 64

processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
