ENH: include cache in memory_usage #58529

Open
1 of 3 tasks
GianlucaFicarelli opened this issue May 2, 2024 · 0 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments


GianlucaFicarelli commented May 2, 2024

Feature Type

  • [x] Adding new functionality to pandas
  • [ ] Changing existing functionality in pandas
  • [ ] Removing existing functionality in pandas

Problem Description

The memory_usage() method can be called to get information about the memory used by some pandas objects.
However, in some cases the cached data are not included.

For example, MultiIndex.memory_usage() includes memory used by:

  • levels
  • codes
  • names
  • _engine (if initialised)

but it doesn't consider:

  • _engine.values (it could be included in _engine.sizeof)
  • values (cached in _values)
  • dtypes and a few other negligible cached properties
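As a rough workaround today, the memory held by cached entries can be approximated by walking the private `_cache` dict. This is only a sketch: `_cache` and the `cached_bytes` helper name are internals/assumptions, not a supported API, and a plain `sys.getsizeof` under-counts nested objects, so the result is a lower bound:

```python
import sys

import numpy as np
import pandas as pd


def cached_bytes(obj) -> int:
    """Rough estimate of the memory held in a pandas object's cache.

    NOTE: `_cache` is a private pandas attribute; this is an
    illustrative sketch, not a supported API.
    """
    total = 0
    for value in getattr(obj, "_cache", {}).values():
        if isinstance(value, np.ndarray):
            total += value.nbytes
        else:
            # shallow size only; nested objects are under-counted
            total += sys.getsizeof(value)
    return total


idx = pd.MultiIndex.from_product(
    [np.arange(10), np.arange(10)], names=["a", "b"]
)
before = cached_bytes(idx)  # typically only 'levels' is cached here
idx.values                  # populates the '_values' cache entry
after = cached_bytes(idx)   # now also counts the cached values array
```

Comparing `cached_bytes` before and after accessing `idx.values` shows exactly the kind of growth that memory_usage() currently misses.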

Example session (run on the current main branch, with numpy imported as np, pandas as pd, and getsizeof from the sys module):

In [52]: idx = pd.MultiIndex.from_product([np.arange(100), np.arange(100), np.arange(100)], names=["x0", "x1", "x2"])

In [53]: list(idx._cache)
Out[53]: ['levels']

In [54]: idx.memory_usage(deep=True)
Out[54]: 3002553

In [55]: idx._engine.values.nbytes
Out[55]: 4000000

In [56]: idx._engine.sizeof(deep=True)
Out[56]: 0

In [57]: list(idx._cache)
Out[57]: ['levels', '_engine']

In [58]: idx.memory_usage(deep=True)
Out[58]: 3002553

In [111]: idx.values
Out[111]: 
array([(0, 0, 0), (0, 0, 1), (0, 0, 2), ..., (99, 99, 97), (99, 99, 98),
       (99, 99, 99)], dtype=object)

In [112]: idx.memory_usage(deep=True)
Out[112]: 3002553

In [113]: getsizeof(idx.values[0]) * len(idx.values)
Out[113]: 64000000

In [114]: list(idx._cache)
Out[114]: ['levels', '_engine', '_values', 'nbytes']

In [115]: idx.memory_usage(deep=True)
Out[115]: 3002553

In [117]: idx.get_loc((99, 99, 99))
Out[117]: 999999

In [118]: idx.memory_usage(deep=True)
Out[118]: 3015057

In [133]: idx._engine.get_indexer(idx._engine.values[0:2])
Out[133]: array([0, 1])

In [134]: idx._engine.is_mapping_populated
Out[134]: True

In [135]: idx._engine.sizeof(deep=True)
Out[135]: 25428008

In [136]: idx.memory_usage(deep=True)
Out[136]: 28443065

Feature Description

memory_usage() could accept an optional bool parameter cache with default value False.

    def memory_usage(self, deep: bool = False, cache: bool = False) -> int: ...

If True, it should also include the cached data.
If False, it should keep the existing behaviour (although, once a cache parameter exists, including the engine data in the default result might not be the most intuitive choice).
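One possible shape for the proposal, sketched as a standalone helper rather than a method (the cache-walking logic and the idea of reusing the private `_cache` dict are assumptions about how it could be implemented, not pandas code):

```python
import sys

import numpy as np
import pandas as pd


def memory_usage_with_cache(index: pd.Index, deep: bool = False,
                            cache: bool = False) -> int:
    """Hypothetical illustration of the proposed signature:

        def memory_usage(self, deep: bool = False, cache: bool = False) -> int

    cache=False keeps the existing behaviour; cache=True additionally
    counts whatever is currently held in the private `_cache` dict.
    """
    total = int(index.memory_usage(deep=deep))
    if cache:
        for value in getattr(index, "_cache", {}).values():
            if isinstance(value, np.ndarray):
                total += value.nbytes
            else:
                # shallow size only; an under-estimate for nested objects
                total += sys.getsizeof(value)
    return total


idx = pd.MultiIndex.from_product([np.arange(10), np.arange(10)])
idx.values  # populate the cache so the two results can differ
base = memory_usage_with_cache(idx, deep=True)
with_cache = memory_usage_with_cache(idx, deep=True, cache=True)
```

With cache=False the helper returns exactly what memory_usage(deep=True) returns today, so existing callers would be unaffected.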

Alternative Solutions

Alternatively, the signature of memory_usage() could remain the same, but the result would always include the cached data.

However, it may surprise the user if the result changes depending on which properties have been accessed (though this already happens for the engine, and it can be documented).

Additional Context

If memory_usage is used to inspect the memory usage of pandas objects, it should return a value as close as possible to the memory actually used.

@GianlucaFicarelli GianlucaFicarelli added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2024
@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2024