
Conversation

@lukemanley (Member) commented Nov 22, 2022

Faster iteration through pyarrow-backed arrays.

from asv_bench.benchmarks.strings import Iter

dtype = "string[pyarrow]"
bench = Iter()
bench.setup(dtype)

%timeit bench.time_iter(dtype)

# 266 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 57.7 ms ± 679 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR

@lukemanley added the Performance (memory or execution speed) and Arrow (pyarrow functionality) labels Nov 22, 2022
Review comment on the new __iter__ implementation (quoted diff context, truncated):

Iterate over elements of the array.
"""
na_value = self._dtype.na_value
for value in self._data:
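The quoted hunk cuts off at the loop header. As a hedged sketch (not necessarily the exact merged body), the pattern the PR describes converts each pyarrow scalar to Python once and maps Arrow nulls to the dtype's na_value:

def __iter__(self):
    """
    Iterate over elements of the array.
    """
    na_value = self._dtype.na_value
    for value in self._data:
        # Each element of the ChunkedArray is a pyarrow scalar;
        # .as_py() converts it once, and Arrow nulls come back as None.
        val = value.as_py()
        if val is None:
            yield na_value
        else:
            yield val

This avoids the per-element positional indexing that the base ExtensionArray.__iter__ falls back to, which is consistent with the roughly 4.6x speedup in the benchmark above.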
Member:

Is there any benefit to combining the chunks of self._data and then iterating?

lukemanley (Member, Author):

I'm seeing maybe a 5-10% improvement testing locally, but not much. It seems to be 5-10% whether there are 10 chunks or 10,000 chunks. I can add it if you want.

@mroeschke (Member) Nov 22, 2022:

Does the 5-10% hold if there is a low number of chunks (1, 2, or 3)? I'd guess that most of the time there shouldn't be a large number of chunks, since most construction goes through pa.chunked_array([scalars]).

lukemanley (Member, Author):

I don't think I'm seeing a consistent improvement at a low number of chunks.

If you're curious, here is the snippet I'm testing with:

import numpy as np
import pyarrow as pa

def time_iter(arr):
    for v in arr:
        pass

num_chunks = 2
chunk_size = 10**5
data = [np.arange(chunk_size)] * num_chunks
arr = pa.chunked_array(data)

%timeit time_iter(arr)                   # iterate the chunked array directly
%timeit time_iter(arr.combine_chunks())  # iterate after flattening into one chunk

Member:

IIUC, combine_chunks will cause a copy, so any speed improvement would be offset by the memory increase. If so, I think it's not worth it.
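For reference, a small self-contained illustration of that copy (the array and sizes here are just for the example):

import numpy as np
import pyarrow as pa

arr = pa.chunked_array([np.arange(10**5)] * 2)

# combine_chunks copies every chunk into a single new contiguous
# allocation, so peak memory roughly doubles while the original
# chunked array and the combined result are both alive.
combined = arr.combine_chunks()

print(arr.num_chunks)  # 2
print(len(combined))   # 200000 -- same values, one contiguous buffer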

@mroeschke mroeschke added this to the 2.0 milestone Nov 22, 2022
@mroeschke mroeschke merged commit a614b7a into pandas-dev:main Nov 22, 2022
@mroeschke (Member):

Thanks @lukemanley!

mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
* faster ArrowExtensionArray.__iter__

* gh ref
@lukemanley lukemanley deleted the arrow-ea-iter branch December 20, 2022 00:46