Improve performance of series.ext_array.replace_with_mask() #52

@hombit

Description

Currently, Arrow does not support pyarrow.compute.replace_with_mask for struct arrays:
apache/arrow#29558

That's why we have our own implementation, used by NestedExtensionArray.__setitem__(). It has the overhead of creating a len(self)-sized struct array to perform the replacement. This works well when we replace many elements, but when we replace just a few, it produces a large memory footprint and probably takes a while.

An alternative approach would be to copy the original array into an np.ndarray[pa.StructScalar], replace the elements in place, and convert it back:

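import numpy as np
import pyarrow as pa

def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array) -> pa.ChunkedArray:
    """Replace the elements of the array with the value where the mask is True"""
    # Copy into an object ndarray holding one pa.StructScalar per element
    np_array = np.fromiter(array, dtype=object, count=len(array))
    # Index with a NumPy boolean mask (assumes the mask contains no nulls)
    np_mask = mask.to_numpy(zero_copy_only=False)
    np_array[np_mask] = np.fromiter(value, dtype=object, count=len(value))
    new_pa_array = pa.array(np_array)
    return pa.chunked_array([new_pa_array])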

We should create a benchmark to see which approach is faster and which has the smaller memory footprint.
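A minimal sketch of what such a benchmark harness could look like, using only the standard library. The benchmark() helper and the two list-based stand-in functions are hypothetical and only model the two strategies (full-size replacement vs. copy and replace in place); the real comparison would plug in the current implementation and the NumPy-based one. Note that tracemalloc only sees allocations made through Python's allocator, so Arrow buffers would likely need to be measured separately, for example with pa.total_allocated_bytes().

```python
import timeit
import tracemalloc


def benchmark(func, *args, repeat=5):
    """Return (best wall time in seconds, peak traced bytes) for func(*args)."""
    # Best-of-N timing to reduce noise from other processes
    best_time = min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))
    # Separate run under tracemalloc so tracing overhead does not skew the timing
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return best_time, peak


def replace_full_size(values, mask, new_values):
    """Hypothetical stand-in: build a len(values)-sized replacement, then select."""
    it = iter(new_values)
    replacement = [next(it) if m else None for m in mask]
    return [r if m else v for v, m, r in zip(values, mask, replacement)]


def replace_in_place(values, mask, new_values):
    """Hypothetical stand-in: copy once, then overwrite only the masked slots."""
    out = list(values)
    it = iter(new_values)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(it)
    return out


# Sparse replacement: 10 of 1000 elements, the case the issue is worried about
values = list(range(1000))
mask = [i % 100 == 0 for i in range(1000)]
new_values = [-1] * sum(mask)

for func in (replace_full_size, replace_in_place):
    seconds, peak_bytes = benchmark(func, values, mask, new_values)
    print(f"{func.__name__}: {seconds:.6f} s, peak {peak_bytes} bytes")
```

Running this for several mask densities (a few elements, half of them, nearly all) would show where one strategy starts to win over the other.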

Labels: bug (Something isn't working)
