ENH: include cache in memory_usage #58529

Open
1 of 3 tasks
GianlucaFicarelli opened this issue May 2, 2024 · 0 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments


GianlucaFicarelli commented May 2, 2024

Feature Type

  • [x] Adding new functionality to pandas
  • [ ] Changing existing functionality in pandas
  • [ ] Removing existing functionality in pandas

Problem Description

The memory_usage() method can be called to get information about the memory used by some pandas objects.
However, in some cases the cached data are not included.

For example, MultiIndex.memory_usage() includes memory used by:

  • levels
  • codes
  • names
  • _engine (if initialised)

but it doesn't consider:

  • _engine.values (it could be included in _engine.sizeof)
  • values (cached in _values)
  • dtypes and a few other negligible cached properties
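As a rough workaround today, the memory held by cached entries can be approximated by walking the private `_cache` dict. This is only a sketch: `_cache` and the `cached_bytes` helper name are internals/assumptions, not a supported API, and a plain `sys.getsizeof` under-counts nested objects, so the result is a lower bound:

```python
import sys

import numpy as np
import pandas as pd


def cached_bytes(obj) -> int:
    """Rough estimate of the memory held in a pandas object's cache.

    NOTE: `_cache` is a private pandas attribute; this is an
    illustrative sketch, not a supported API.
    """
    total = 0
    for value in getattr(obj, "_cache", {}).values():
        if isinstance(value, np.ndarray):
            total += value.nbytes
        else:
            # shallow size only; nested objects are under-counted
            total += sys.getsizeof(value)
    return total


idx = pd.MultiIndex.from_product(
    [np.arange(10), np.arange(10)], names=["a", "b"]
)
before = cached_bytes(idx)  # typically only 'levels' is cached here
idx.values                  # populates the '_values' cache entry
after = cached_bytes(idx)   # now also counts the cached values array
```

Comparing `cached_bytes` before and after accessing `idx.values` shows exactly the kind of growth that memory_usage() currently misses.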

Example session (run on the current main branch, with numpy imported as np, pandas as pd, and getsizeof from the sys module):

In [52]: idx = pd.MultiIndex.from_product([np.arange(100), np.arange(100), np.arange(100)], names=["x0", "x1", "x2"])

In [53]: list(idx._cache)
Out[53]: ['levels']

In [54]: idx.memory_usage(deep=True)
Out[54]: 3002553

In [55]: idx._engine.values.nbytes
Out[55]: 4000000

In [56]: idx._engine.sizeof(deep=True)
Out[56]: 0

In [57]: list(idx._cache)
Out[57]: ['levels', '_engine']

In [58]: idx.memory_usage(deep=True)
Out[58]: 3002553

In [111]: idx.values
Out[111]: 
array([(0, 0, 0), (0, 0, 1), (0, 0, 2), ..., (99, 99, 97), (99, 99, 98),
       (99, 99, 99)], dtype=object)

In [112]: idx.memory_usage(deep=True)
Out[112]: 3002553

In [113]: getsizeof(idx.values[0]) * len(idx.values)
Out[113]: 64000000

In [114]: list(idx._cache)
Out[114]: ['levels', '_engine', '_values', 'nbytes']

In [115]: idx.memory_usage(deep=True)
Out[115]: 3002553

In [117]: idx.get_loc((99, 99, 99))
Out[117]: 999999

In [118]: idx.memory_usage(deep=True)
Out[118]: 3015057

In [133]: idx._engine.get_indexer(idx._engine.values[0:2])
Out[133]: array([0, 1])

In [134]: idx._engine.is_mapping_populated
Out[134]: True

In [135]: idx._engine.sizeof(deep=True)
Out[135]: 25428008

In [136]: idx.memory_usage(deep=True)
Out[136]: 28443065

Feature Description

memory_usage() could accept an optional bool parameter cache with default value False.

    def memory_usage(self, deep: bool = False, cache: bool = False) -> int: ...

If True, it should also include the cached data.
If False, it should keep the existing behaviour (although, once a cache parameter exists, including the engine data in the default result might not be the most intuitive choice).
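One possible shape for the proposal, sketched as a standalone helper rather than a method (the cache-walking logic and the idea of reusing the private `_cache` dict are assumptions about how it could be implemented, not pandas code):

```python
import sys

import numpy as np
import pandas as pd


def memory_usage_with_cache(index: pd.Index, deep: bool = False,
                            cache: bool = False) -> int:
    """Hypothetical illustration of the proposed signature:

        def memory_usage(self, deep: bool = False, cache: bool = False) -> int

    cache=False keeps the existing behaviour; cache=True additionally
    counts whatever is currently held in the private `_cache` dict.
    """
    total = int(index.memory_usage(deep=deep))
    if cache:
        for value in getattr(index, "_cache", {}).values():
            if isinstance(value, np.ndarray):
                total += value.nbytes
            else:
                # shallow size only; an under-estimate for nested objects
                total += sys.getsizeof(value)
    return total


idx = pd.MultiIndex.from_product([np.arange(10), np.arange(10)])
idx.values  # populate the cache so the two results can differ
base = memory_usage_with_cache(idx, deep=True)
with_cache = memory_usage_with_cache(idx, deep=True, cache=True)
```

With cache=False the helper returns exactly what memory_usage(deep=True) returns today, so existing callers would be unaffected.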

Alternative Solutions

Alternatively, the signature of memory_usage() could remain the same, but the result would always include the cached data.

However, it may surprise the user if the result changes depending on which properties have been accessed (though this already happens for the engine, and it can be documented).

Additional Context

If memory_usage is used to inspect the memory usage of pandas objects, it should return a value as close as possible to the memory actually used.

@GianlucaFicarelli GianlucaFicarelli added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2024
@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 2, 2024