ENH: add a way for dtypes with embedded references to influence memory usage tracking #23146

Open
ngoldbaum opened this issue Feb 2, 2023 · 0 comments

@ngoldbaum
Member

Proposed new feature or change:

Currently, dtypes that have embedded references can't influence NumPy's or Python's facilities for determining the amount of memory used by an array.

For example, the sizes reported for object dtypes only include the space allocated in the array buffer and do not include the sizes of the objects stored in the array:

import numpy as np
import sys

really_long_string = 'a'*100_000

arr = np.array([really_long_string]*5, dtype=object)

print(sys.getsizeof(arr))
print(arr.nbytes)

This script prints 152 and 40 on my 64-bit Linux machine.

Pandas has a workaround for this for object arrays:

https://github.com/pandas-dev/pandas/blob/dbed1316d10ee699ad6d5c8205a3f4e0b1377e5d/pandas/core/base.py#L1132-L1136
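For illustration, here is a minimal sketch of what a "deep" size calculation for an object array can look like. This is not the pandas implementation; `deep_nbytes` is a hypothetical helper name, and the sketch assumes that `sys.getsizeof` is an acceptable per-object measure (it is shallow, so containers stored in the array would still be undercounted). Objects referenced more than once are counted once.

```python
import sys

import numpy as np


def deep_nbytes(arr):
    """Hypothetical helper: buffer size plus the size of each distinct object."""
    if arr.dtype.kind != "O":
        # No embedded references, so the buffer size is the whole story.
        return arr.nbytes
    seen = set()
    total = arr.nbytes
    for obj in arr.ravel():
        # The array keeps every element alive, so id() is unique here.
        if id(obj) not in seen:
            seen.add(id(obj))
            total += sys.getsizeof(obj)  # shallow size of the referenced object
    return total


really_long_string = 'a' * 100_000
arr = np.array([really_long_string] * 5, dtype=object)

print(arr.nbytes)        # size of the pointer buffer only
print(deep_nbytes(arr))  # buffer plus the one shared string
```

Whether shared objects should be counted once or once per element is a design choice; counting them once reflects actual allocations, while counting per element is simpler and cheaper.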

As long as the object dtype is the only real-world dtype that uses embedded references, this isn't a big deal. With the new dtype API, however, such dtypes will likely become more common, and downstream users of these dtypes will need to learn more complicated workarounds. In particular, I'm writing stringdtype for variable-length strings, which has the same issue as the object dtype example above.

This information could be stored on the array object itself, but that would likely require a large number of code changes inside NumPy: the places where allocations happen (inside cast loops and inside the dtype's setitem implementation, and maybe other places I'm not aware of) don't have access to the array object, so the allocation tracking information would need to be threaded through NumPy's internal API.

Given the way dtypes are currently set up, it would be more straightforward to store allocation tracking metadata on the dtype object as allocations happen, since the routines that do allocations have access to a dtype instance. The array could then read that information from the dtype, if it exists, when calculating the array size. However, that would require making sure arrays don't share dtype instances, or perhaps adding a registry to the dtype that maps array instances to allocation tracking information.
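To make the registry idea concrete, here is a rough pure-Python sketch. Everything in it is hypothetical: `AllocationRegistry`, `record`, and `total_nbytes` are illustrative names, not NumPy API, and a real version would presumably live in C alongside the dtype. It keys entries by `id(arr)` because arrays are not hashable, and uses `weakref.finalize` to drop an entry when its array is garbage collected.

```python
import weakref

import numpy as np


class AllocationRegistry:
    """Hypothetical per-dtype registry of out-of-buffer bytes per array."""

    def __init__(self):
        self._extra_bytes = {}  # id(array) -> bytes allocated outside the buffer

    def record(self, arr, nbytes):
        """Note that `nbytes` more bytes were allocated on behalf of `arr`."""
        key = id(arr)
        if key not in self._extra_bytes:
            self._extra_bytes[key] = 0
            # Remove the entry once the array goes away, so a recycled id
            # cannot pick up stale bookkeeping.
            weakref.finalize(arr, self._extra_bytes.pop, key, None)
        self._extra_bytes[key] += nbytes

    def total_nbytes(self, arr):
        """Buffer size plus any tracked out-of-buffer allocations."""
        return arr.nbytes + self._extra_bytes.get(id(arr), 0)


registry = AllocationRegistry()
tracked = np.empty(5, dtype=object)
registry.record(tracked, 100)  # e.g. a setitem call allocated 100 bytes
registry.record(tracked, 50)   # e.g. a cast loop allocated 50 more
print(registry.total_nbytes(tracked))
```

The sketch sidesteps the hard part described above: the code paths that allocate would still need some way to identify which array they are writing into.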

For now, for my purposes, I can add a helper inside my dtype implementation that calculates the "real" memory usage for the array, so this isn't a blocker for me.
