PERF: Memory Issue with DataFrame Conditional Filtering: Impact on Index Type with df['x'] > x #56973

@ghost

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Hello, I ran into memory issues when filtering a large DataFrame. I created a test case and found that boolean-mask indexing such as df.loc[df['a'] > 5] replaces the original RangeIndex with a materialized int64 Index, roughly doubling the reported memory usage (1.5 MB, versus 781.3 KB for a comparable iloc slice).

import pandas as pd

df = pd.DataFrame({'a': range(100000)})
print("Original index type:", type(df.index))

# loc operation
df_loc = df.loc[df['a'] > 5]
print("Index type after loc:", type(df_loc.index))
df_loc.info()

# iloc operation
df_iloc = df.iloc[5:]
print("Index type after iloc:", type(df_iloc.index))
df_iloc.info()

Original index type: <class 'pandas.core.indexes.range.RangeIndex'>
Index type after loc: <class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.frame.DataFrame'>
Index: 99994 entries, 6 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       99994 non-null  int64
dtypes: int64(1)
memory usage: 1.5 MB
Index type after iloc: <class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99995 entries, 5 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       99995 non-null  int64
dtypes: int64(1)
memory usage: 781.3 KB
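
To see where the extra memory goes, the index and column sizes can be reported separately. The following is a minimal sketch, not a fix in pandas itself: it uses `DataFrame.memory_usage(index=True)` to break the total down, then shows one possible workaround, `reset_index(drop=True)`, under the assumption that the original row labels are not needed after filtering. The variable names `filtered` and `compact` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': range(100000)})

# On pandas 2.1.4 (the version reported above), a boolean mask yields an
# index that stores one int64 label per surviving row.
filtered = df.loc[df['a'] > 5]

# memory_usage(index=True) returns bytes for the index and for each column.
usage = filtered.memory_usage(index=True)
print(type(filtered.index))
print(usage)

# Assumption: the original row labels are not needed after filtering.
# reset_index(drop=True) discards them and installs a fresh RangeIndex,
# which stores only start/stop/step instead of one label per row.
compact = filtered.reset_index(drop=True)
print(type(compact.index))
print(compact.memory_usage(index=True))
```

With a materialized int64 index, the index costs 8 bytes per row, the same as the single int64 column, which would account for the roughly 2x total shown by `info()`. Note that resetting the index changes the labels (rows are renumbered from 0), so it is only suitable when later code does not rely on the original positions.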

Installed Versions

python : 3.9.18.final.0
python-bits : 64
pandas : 2.1.4
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : 3.0.5
pytest : None
hypothesis : None

Prior Performance

No response

Metadata

Assignees

No one assigned

    Labels

    Closing Candidate (may be closeable, needs more eyeballs)
    Indexing (related to indexing on series/frames, not to indexes themselves)
    Performance (memory or execution speed performance)
