Proposed new feature or change:
Currently dtypes that have embedded references can't influence numpy or python's facilities for determining the amount of memory used by an array.
For example, the sizes reported for object dtypes only include the space allocated in the array buffer and do not include the sizes of the objects stored in the array:
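A minimal sketch of such a measurement, assuming a five-element 1-D object array. The element values are illustrative and are not taken from the original report; the exact number from sys.getsizeof can vary with the NumPy version and platform.

```python
import sys

import numpy as np

# Illustrative contents (an assumption): five Python strings stored by reference.
arr = np.array(["a", "bb", "ccc", "dddd", "eeeee"], dtype=object)

# Size of the ndarray object plus the buffer it owns.
# The str objects the buffer points to are not counted.
print(sys.getsizeof(arr))

# Size of the buffer alone: one pointer per element.
print(arr.nbytes)
```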
This script prints 152 and 40 on my 64-bit Linux machine.
Pandas has a workaround for this for object arrays:
https://github.com/pandas-dev/pandas/blob/dbed1316d10ee699ad6d5c8205a3f4e0b1377e5d/pandas/core/base.py#L1132-L1136
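The general idea behind this kind of workaround can be sketched as follows. This is not the pandas code; deep_nbytes is a hypothetical helper name, and the result is only an estimate because sys.getsizeof is shallow and an object referenced twice is counted twice.

```python
import sys

import numpy as np


def deep_nbytes(arr):
    """Hypothetical helper: buffer size plus the shallow size of each element."""
    total = arr.nbytes
    if arr.dtype == object:
        # Add the size of every referenced Python object on top of the
        # pointer buffer that nbytes already accounts for.
        total += sum(sys.getsizeof(item) for item in arr.ravel())
    return total


arr = np.array(["a", "bb", "ccc"], dtype=object)
print(arr.nbytes, deep_nbytes(arr))
```

For non-object dtypes the helper simply returns arr.nbytes, so it can be called on any array.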
So long as object dtypes are the only real-world usage of dtypes that use embedded references, this isn't a big deal. With the new dtype API, though, such dtypes will likely become more common, and downstream users of these dtypes will need to learn more complicated workarounds. In particular, I'm writing stringdtype for variable-length strings, which has the same issue as the object dtype example above.
This information could be stored on the array object itself, but that will likely require a large number of code changes inside numpy, since the places where allocations happen (inside cast loops and inside the dtype's setitem implementation, maybe other places I'm not aware of) don't have access to the array object, so the allocation tracking information would need to be threaded through numpy's internal API.
It would be more straightforward with the way dtypes are currently set up to store allocation tracking metadata on the dtype object as allocations happen, since the routines that do allocations will have access to a dtype instance, and then the array can read that information from the dtype if it exists when calculating the array size. However, that will require making sure arrays don't share dtype instances, or perhaps adding a registry to the dtype to map array instances to allocation tracking information.
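The registry variant might look roughly like the pure-Python sketch below. Every name in it (FakeArray, TrackingDType, record_allocation, allocated_bytes) is hypothetical and none of it is NumPy API. How the allocating routine would learn which array it is working for is left open here, as it is in the description above.

```python
import weakref


class FakeArray:
    """Stand-in for an array object; not a NumPy type."""

    def __init__(self, dtype):
        self.dtype = dtype


class TrackingDType:
    """Hypothetical dtype that records out-of-buffer allocations per array."""

    def __init__(self):
        # Weak keys: an entry disappears once its array is garbage collected,
        # so the registry does not keep arrays alive.
        self._allocated = weakref.WeakKeyDictionary()

    def record_allocation(self, array, nbytes):
        # Called wherever an element allocates memory outside the buffer.
        self._allocated[array] = self._allocated.get(array, 0) + nbytes

    def allocated_bytes(self, array):
        # Read by the array when it computes its total size.
        return self._allocated.get(array, 0)


dtype = TrackingDType()
a, b = FakeArray(dtype), FakeArray(dtype)  # two arrays sharing one dtype instance
dtype.record_allocation(a, 64)
dtype.record_allocation(a, 32)
dtype.record_allocation(b, 16)
print(dtype.allocated_bytes(a), dtype.allocated_bytes(b))
```

Keying the registry by array keeps the totals separate even when arrays share a dtype instance, which is the case the simpler one-counter-per-dtype approach cannot handle.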
For now, for my purposes, I can add a helper inside my dtype implementation that calculates the "real" memory usage for the array, so this isn't a blocker for me.