Switch to an arena allocator #76

@ngoldbaum

Arrow's string arrays use Arrow's variable-size binary layout, which stores the string data in a single contiguous buffer per array along with a strided array of integer offsets. This is a really nice design: you only need one heap allocation per array, and the string data are contiguous in memory, so it's much easier to write fast ufuncs and exploit SIMD. The downside is that string arrays either need to be immutable or have pathological cases where enlarging a single array element forces reallocating the storage for the entire array. I don't think forcing people to copy into a new array when they need to mutate elements would be a big problem for many real-world use cases, but it would be a wart in NumPy's API for one data type to have restrictions that others don't.
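To make the layout and the pathological case concrete, here is a minimal pure-Python sketch. It is a toy model, not Arrow's or NumPy's implementation; the helper names `pack_strings`, `get_item`, and `set_item` are made up for illustration.

```python
def pack_strings(strings):
    """Pack strings into one contiguous buffer plus a list of offsets."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this element, start of the next
    return data, offsets


def get_item(data, offsets, i):
    """Element i is the byte range [offsets[i], offsets[i + 1])."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


def set_item(data, offsets, i, new):
    """Overwrite element i; return how many trailing bytes had to shift."""
    encoded = new.encode("utf-8")
    start, stop = offsets[i], offsets[i + 1]
    delta = len(encoded) - (stop - start)
    moved = len(data) - stop  # every byte after element i must move
    data[start:stop] = encoded
    for j in range(i + 1, len(offsets)):
        offsets[j] += delta  # every later offset must be rewritten too
    return moved


data, offsets = pack_strings(["a", "bb", "ccc"])
moved = set_item(data, offsets, 0, "longer")
```

Enlarging the first element shifts all the bytes that follow it and rewrites every later offset, which is the cost that motivates either immutability or a different storage scheme.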

NumPy doesn't have a concept of per-array heap storage outside the array buffer. Even if we added facilities for that, the dtype API doesn't allow access to the array object, so we would also need to pass a pointer to the per-array storage to the ufunc loops in the dtype API. I think a more straightforward approach would be to let the per-array storage buffer live on the descriptor instance and ensure that each array has a unique descriptor object.
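As a rough illustration of storage living on the descriptor, the following sketch models a descriptor that owns an append-only byte arena, with the array elements reduced to `(offset, size)` records. The class `ArenaDescr` and the function `upper_loop` are hypothetical names, and the real dtype API is C, so this only mirrors the shape of the idea.

```python
class ArenaDescr:
    """Hypothetical descriptor that owns the string storage for one array."""

    def __init__(self):
        self.arena = bytearray()

    def store(self, s):
        """Append a string to the arena and return its (offset, size) record."""
        encoded = s.encode("utf-8")
        offset = len(self.arena)
        self.arena += encoded
        return offset, len(encoded)

    def load(self, record):
        """Read a string back from its (offset, size) record."""
        offset, size = record
        return bytes(self.arena[offset:offset + size]).decode("utf-8")


def upper_loop(in_descr, in_records, out_descr):
    """Stand-in for a ufunc inner loop: it sees descriptors, never the array."""
    return [out_descr.store(in_descr.load(r).upper()) for r in in_records]


in_descr = ArenaDescr()
records = [in_descr.store(s) for s in ["numpy", "arrow"]]
out_descr = ArenaDescr()
out_records = upper_loop(in_descr, records, out_descr)
result = [out_descr.load(r) for r in out_records]
```

Because the loop can reach the storage through the descriptors it already receives, nothing extra has to be threaded through the loop signature, provided no two arrays share a descriptor.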

A first step towards figuring out how hard it would be to have unique descriptors per array would be to add a flag to the dtype that indicates the dtype is associated with an array. We would set this flag at array creation time and raise an error if it is already set. I think running the stringdtype tests with this flag turned on would probably catch most of the places in NumPy or the dtype implementation that would need to be updated. Once we're confident a per-array descriptor is possible, we get per-array buffer storage for free, and I think it would be relatively straightforward to switch to an Arrow-like in-memory representation.
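The flag check described above could look roughly like this. It is a Python sketch of the logic only; `Descr`, `Array`, and `bound_to_array` are hypothetical names, and the real check would sit in the C array-creation path.

```python
class Descr:
    """Hypothetical descriptor carrying an 'associated with an array' flag."""

    def __init__(self):
        self.bound_to_array = False


class Array:
    """Hypothetical array that claims its descriptor at creation time."""

    def __init__(self, descr):
        if descr.bound_to_array:
            # A second array is trying to share this descriptor.
            raise RuntimeError("descriptor is already associated with an array")
        descr.bound_to_array = True
        self.descr = descr


shared = Descr()
first = Array(shared)
try:
    Array(shared)
    reused_ok = True
except RuntimeError:
    reused_ok = False
```

Running an existing test suite with a check like this enabled turns every accidental descriptor reuse into a loud failure, which is what makes it useful as a survey tool.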
