REF/PERF: ArrowExtensionArray.__setitem__ #50632

lukemanley · 2023-01-08T20:51:38Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Simplifies and improves performance of ArrowExtensionArray.__setitem__ via more use of pyarrow compute functions and less back and forth through numpy.

ASVs:

        before           after         ratio
    [059b2c18]       [91049c37]
    <main>           <arrow-setitem>
-         382±2μs          336±1μs     0.88  array.ArrowStringArray.time_setitem_list(False)
-         900±5μs         566±10μs     0.63  array.ArrowStringArray.time_setitem_list(True)
-      14.0±0.3ms      6.12±0.06ms     0.44  array.ArrowStringArray.time_setitem(False)
-         345±6μs        99.6±20μs     0.29  array.ArrowStringArray.time_setitem_slice(False)
-      24.5±0.2ms      6.12±0.05ms     0.25  array.ArrowStringArray.time_setitem(True)
-     5.35±0.06ms          304±3μs     0.06  array.ArrowStringArray.time_setitem_slice(True)

jbrockmendel · 2023-01-09T15:54:35Z

Not for this PR, but because I'm cranky: the fact that we have to implement a kludgy __setitem__ sucks. Ideally pyarrow would implement it; failing that, it would be better to set into each of the elements of self._data.buffers() (which, besides being a PITA, I'm not sure is even possible in mask-less cases).

lukemanley · 2023-01-10T12:47:23Z

> Not for this PR, but because I'm cranky: the fact that we have to implement a kludgy __setitem__ sucks. Ideally pyarrow would implement it; failing that, it would be better to set into each of the elements of self._data.buffers() (which, besides being a PITA, I'm not sure is even possible in mask-less cases).

This was a (failed?) attempt to make it a little less kludgy :). Agreed, it's still kludgy.

Part of the complexity is supporting older versions of pyarrow where the pyarrow.compute functions have bugs we need to work around. We could remove the fast paths if simplicity is preferred.

I'm not familiar with the buffers but could take a look.

jbrockmendel · 2023-01-10T17:25:16Z

> I'm not familiar with the buffers but could take a look.

Definitely don't put time into this on my account. I've got a bug in my bonnet about preserving views (xref #45419) that becomes less important with CoW. Plus, as mentioned, I'm not confident that the buffers() approach is even feasible.

mroeschke · 2023-01-17T19:18:05Z

pandas/tests/arrays/string_/test_string_arrow.py

@@ -172,7 +172,6 @@ def test_setitem(multiple_chunks, key, value, expected):

    result[key] = value
    tm.assert_equal(result, expected)
-    assert result._data.num_chunks == expected._data.num_chunks


So this no longer holds?

lukemanley · 2023-01-17

The existing implementation iterated through the chunks and set values chunk by chunk. This implementation passes the entire ChunkedArray to pyarrow's compute functions. At the moment it looks like pyarrow.compute.if_else combines chunks (but still returns a ChunkedArray, with one chunk), whereas pyarrow.compute.replace_with_mask maintains the chunking layout of the input. I'm not sure whether that behavior holds in all cases or depends on the inputs. Let me know if you think pandas should ensure the chunking layout remains unchanged.

mroeschke · 2023-01-18

Generally I think we should try to maintain the chunking layout as much as possible, but it's more of an implementation detail, and if the pyarrow compute functions don't necessarily maintain the chunking layout, I suppose this is fine.

mroeschke · 2023-01-17T19:19:43Z

pandas/core/arrays/arrow/array.py

+
+        if isinstance(data, pa.Array):
+            data = pa.chunked_array([data])
+        self._data = data


Are we guaranteed here that self._data.type matches self.dtype.pyarrow_dtype?

lukemanley · 2023-01-17

Yes, I believe so. I addressed the TODO in ArrowExtensionArray._maybe_convert_setitem_value so that it now boxes the setitem values and will raise if the replacement values cannot be cast to the original self._data.type.

mroeschke · 2023-01-18T19:18:22Z

Thanks @lukemanley

lukemanley added 2 commits January 8, 2023 15:23

REF/PERF: ArrowExtensionArray.__setitem__

cac4c2a

update asv

91049c3

lukemanley added the Refactor (internal refactoring of code), Performance (memory or execution speed performance), and Arrow (pyarrow functionality) labels Jan 8, 2023

lukemanley added 2 commits January 8, 2023 15:52

whatsnew

e01e6e1

fixes

a205bf7

fix min versions

851704f

lukemanley added 3 commits January 10, 2023 17:48

Merge remote-tracking branch 'upstream/main' into arrow-setitem

1c97e75

fix min versions

833baa6

more min version fixes

3c31f1b

mroeschke reviewed Jan 17, 2023


mroeschke added this to the 2.0 milestone Jan 18, 2023

mroeschke approved these changes Jan 18, 2023


mroeschke merged commit 601f227 into pandas-dev:main Jan 18, 2023

lukemanley deleted the arrow-setitem branch January 19, 2023 04:31
