Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: MultiIndex.memory_usage shouldn't trigger the index engine #58385

Merged

Conversation

GianlucaFicarelli
Copy link
Contributor

Ignore the index engine when it isn't already cached.

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Reproducible Example

Calling memory_usage() can be unexpectedly slow when called on a big MultiIndex.

Using the main branch:

In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 247 ms, sys: 170 ms, total: 417 ms
Wall time: 419 ms

In [4]: %time idx.memory_usage()
CPU times: user 2.91 s, sys: 4.09 s, total: 7 s
Wall time: 8.81 s
Out[4]: 500016953

Using this PR branch:

In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 238 ms, sys: 148 ms, total: 386 ms
Wall time: 385 ms

In [4]: %time idx.memory_usage()
CPU times: user 112 µs, sys: 1e+03 ns, total: 113 µs
Wall time: 118 µs
Out[4]: 500016953

Side note: index._engine.sizeof() doesn't consider the content of index._engine.values. If it should, a separate issue can be opened. In the example above, the additional unreported used memory would be:

In [8]: idx._engine.values
Out[8]: 
array([   262402,    262403,    262404, ..., 131331299, 131331300,
       131331301], dtype=uint64)

In [9]: idx._engine.values.shape
Out[9]: (100000000,)

In [15]: idx._engine.values.nbytes
Out[15]: 800000000

@GianlucaFicarelli GianlucaFicarelli force-pushed the memory_usage_ignore_engine branch 3 times, most recently from 7fe8358 to 6d4e407 Compare April 23, 2024 16:46
@@ -142,6 +142,31 @@ def test_memory_usage_components_narrow_series(any_real_numpy_dtype):
assert total_usage == non_index_usage + index_usage


def test_memory_usage_doesnt_trigger_engine(index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this test to pandas/tests/indexes/test_old_base.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mroeschke mroeschke added Performance Memory or execution speed performance Index Related to the Index class or subclasses labels Apr 23, 2024
@mroeschke mroeschke added this to the 3.0 milestone Apr 24, 2024
@mroeschke mroeschke merged commit ea2f857 into pandas-dev:main Apr 24, 2024
45 of 46 checks passed
@mroeschke
Copy link
Member

Thanks @GianlucaFicarelli

@GianlucaFicarelli GianlucaFicarelli deleted the memory_usage_ignore_engine branch April 24, 2024 16:50
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…das-dev#58385)

* PERF: MultiIndex.memory_usage shouldn't trigger the index engine

Ignore the index engine when it isn't already cached.

* Move test, sort whatsnew
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants