Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/PERF: ArrowExtensionArray.__setitem__ #50632

Merged
merged 8 commits into from
Jan 18, 2023

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Jan 8, 2023

Simplifies and improves performance of ArrowExtensionArray.__setitem__ via more use of pyarrow compute functions and less back and forth through numpy.

ASVs:

        before           after         ratio
    [059b2c18]       [91049c37]
    <main>           <arrow-setitem>
-         382±2μs          336±1μs     0.88  array.ArrowStringArray.time_setitem_list(False)
-         900±5μs         566±10μs     0.63  array.ArrowStringArray.time_setitem_list(True)
-      14.0±0.3ms      6.12±0.06ms     0.44  array.ArrowStringArray.time_setitem(False)
-         345±6μs        99.6±20μs     0.29  array.ArrowStringArray.time_setitem_slice(False)
-      24.5±0.2ms      6.12±0.05ms     0.25  array.ArrowStringArray.time_setitem(True)
-     5.35±0.06ms          304±3μs     0.06  array.ArrowStringArray.time_setitem_slice(True)

@lukemanley lukemanley added Refactor Internal refactoring of code Performance Memory or execution speed performance Arrow pyarrow functionality labels Jan 8, 2023
@jbrockmendel
Copy link
Member

Not for this PR but because I'm cranky: the fact that we have to implement a kludgy __setitem__ sucks. ideally pyarrow would implement it; failing that better to set into each of the elements of self._data.buffers() (which besides being a PITA im not sure is even possible in mask-less cases)

@lukemanley
Copy link
Member Author

Not for this PR but because I'm cranky: the fact that we have to implement a kludgy __setitem__ sucks. ideally pyarrow would implement it; failing that better to set into each of the elements of self._data.buffers() (which besides being a PITA im not sure is even possible in mask-less cases)

This was a (failed?) attempt to make it a little less kludgy :). Agreed its still kludgy.

Part of the complexity is supporting older versions of pyarrow where the pyarrow.compute functions have bugs we need to work around. We could remove the fast paths if simplicity is preferred.

I'm not familiar with the buffers but could take a look.

@jbrockmendel
Copy link
Member

I'm not familiar with the buffers but could take a look.

Definitely don't put time into this on my account. I've got a bug in my bonnet about preserving views (xref #45419) that becomes less important with CoW. Plus as mentioned, im not confident that the buffers() approach is even feasible.

@@ -172,7 +172,6 @@ def test_setitem(multiple_chunks, key, value, expected):

result[key] = value
tm.assert_equal(result, expected)
assert result._data.num_chunks == expected._data.num_chunks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this no longer holds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing implementation iterated through the chunks and set values chunk by chunk. This implementation passes the entire ChunkedArray to pyarrow's compute functions. At the moment it looks like pyarrow.compute.if_else combines chunks (but still returns a ChunkedArray with one chunk) whereas pyarrow.compute.replace_with_mask maintains the chunking layout of the input. I'm not sure if that behavior applies for all cases or based on inputs. Let me know if you think pandas should ensure chunking layout remains unchanged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think we should try to maintain the chunking layout as much as possible, but it's more of an implementation detail and if the pyarrow compute functions don't necessarily maintain the chunking layout I suppose this is fine


if isinstance(data, pa.Array):
data = pa.chunked_array([data])
self._data = data
Copy link
Member

@mroeschke mroeschke Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we guaranteed here that self._data.type matches self.dtype.pyarrow_dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. I addressed the TODO in ArrowExtensionArray._maybe_convert_setitem_value so that it now boxes the setitem values will raise if the replacement values cannot be cast to the original self._data.type.

@mroeschke mroeschke added this to the 2.0 milestone Jan 18, 2023
@mroeschke mroeschke merged commit 601f227 into pandas-dev:main Jan 18, 2023
@mroeschke
Copy link
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the arrow-setitem branch January 19, 2023 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Performance Memory or execution speed performance Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants