
Arrow chunk_size as keyword argument #3084

Merged: 12 commits merged into master from arrow-chunksize on Mar 21, 2024
Conversation

prrao87 (Member) commented Mar 19, 2024

Closes #2998.

As per @ray6080's comment, setting the Arrow record batch size (which we call chunk_size) to 1M rows is a reasonable default, since DuckDB does the same. In the majority of cases a larger record batch size is favourable, and the user can always lower chunk_size if necessary.

I also fixed the pyarrow tests so that they no longer specify the chunk_size argument as low integer values.
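For context, a minimal usage sketch of the change (the database path, schema, and query are made up for illustration; get_as_arrow and the 1M-row default are what this PR introduces):

```python
import kuzu

# hypothetical database and schema, purely for illustration
db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

# chunk_size is now an optional keyword argument; most callers can simply
# omit it and get the large default record batch size
table = conn.execute("MATCH (p:Person) RETURN p.name, p.age").get_as_arrow()

# callers that need smaller record batches can still lower it explicitly
small = conn.execute("MATCH (p:Person) RETURN p.name, p.age").get_as_arrow(chunk_size=2048)
```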

@prrao87 prrao87 requested a review from mewim March 19, 2024 13:53
prrao87 (Member, Author) commented Mar 19, 2024

@mewim what do you think about moving the get_as_pl adaptive chunk-size estimation logic for Polars over to the get_as_arrow method? That way, Polars could benefit from the Arrow logic directly rather than having to specify it independently.
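As a rough sketch of what "adaptive" estimation could mean here (the helper name and cell budget below are illustrative assumptions, not the actual implementation), the idea is to derive the rows-per-batch count from the width of the result:

```python
def adaptive_chunk_size(num_columns: int, target_cells: int = 10_000_000) -> int:
    """Illustrative heuristic: spread a fixed total-cell budget across all
    columns, so wide results get fewer rows per record batch."""
    return max(target_cells // max(num_columns, 1), 1)

print(adaptive_chunk_size(1))   # 10_000_000 rows per batch for a 1-column result
print(adaptive_chunk_size(10))  # 1_000_000 rows per batch for a 10-column result
```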

codecov bot commented Mar 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.71%. Comparing base (f8fe205) to head (638763b).

@@           Coverage Diff           @@
##           master    #3084   +/-   ##
=======================================
  Coverage   92.71%   92.71%           
=======================================
  Files        1162     1162           
  Lines       43140    43140           
=======================================
+ Hits        39997    39999    +2     
+ Misses       3143     3141    -2     


prrao87 (Member, Author) commented Mar 19, 2024

> moving the get_as_pl adaptive chunk size estimation logic for polars over to the get_as_arrow method

I just made those changes and built and tested locally; works well 👌🏽.

alexander-beedie (Contributor) commented Mar 21, 2024

> fixed the pyarrow tests to not specify the chunk_size argument as low integer values.

FYI: I believe they were specified this way in order to validate result acquisition across chunk boundaries.

Also, the meaning of chunk_size changed here: the docs say it refers to the number of rows, but by moving the logic from inside the Polars function into the Arrow one and exposing it on the existing parameter, the actual row count will now only match chunk_size when the result has a single column 😅

As per #2998 (comment), you could instead make the adaptive behaviour opt-in by supporting a chunk_size=None option, otherwise maintaining the existing behaviour (where chunk_size = n_rows).

This would allow adaptive behaviour without changing the meaning of an integer chunk_size (so it isn't a breaking change). Indeed, this could even be the new default: as the parameter currently has to be specified, you could change the default to chunk_size=None (i.e. adaptive by default) without any existing code being impacted 👍
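In signature terms, the suggestion amounts to something like the following sketch (the class is an illustrative stand-in; only the Optional chunk_size with a None default reflects the actual proposal):

```python
from typing import Optional
import pyarrow as pa

class QueryResult:  # illustrative stand-in for the real result class
    def get_as_arrow(self, chunk_size: Optional[int] = None) -> pa.Table:
        """None -> adaptive batching (new default, affects no existing callers);
        integer n -> exactly n rows per record batch, as documented today."""
        ...
```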

prrao87 (Member, Author) commented Mar 21, 2024

Hi @alexander-beedie, great points.

  1. It makes sense to revert to the small chunk size numbers in the tests so as to validate result acquisition across chunk boundaries, as you mentioned.
  2. For the opt-in behaviour for get_as_pl that you mentioned (where chunk_size=None), we would still perform the adaptive logic in the get_as_arrow method, correct?
  3. I'm not sure what you meant by the last comment (the "new default") not being a breaking change. Do you mean breaking w.r.t. the old approach that you committed, or w.r.t. this new way where we do the adaptive logic inside get_as_arrow?

In any case, I think we're converging towards an agreeable and better solution, so thanks again :)

mewim (Member) commented Mar 21, 2024

I would probably keep chunk_size as referring to the number of rows, and use the adaptive chunk size only if chunk_size is set to None / 0 / -1.

prrao87 (Member, Author) commented Mar 21, 2024

@mewim I reworked it according to your latest comment. What do you think?

alexander-beedie (Contributor) commented Mar 21, 2024

> I would probably keep the chunk size as referring to number of rows and use adaptive chunk size only if chunk size is set to None / 0 / -1.

I'd reserve chunk_size = 0/-1 for a new mode that guarantees only a single chunk (aka: no chunks?) is produced. This allows None to be adaptive, an integer to continue doing what it currently does, and, if/when the functionality is added lower down to produce an unchunked Arrow result, we could enable 0/-1 for that mode 🤔

> I'm not sure what you meant by the last comment (the "new default") not being a breaking change.

Just that the way the PR was initially written changed the meaning of chunk_size, so everyone currently using it would start getting back results that were not chunked the way they expected (which would be a breaking change). If the new default were None, all current usage would continue to behave identically (so the change in default would not be breaking); only new usage that omits an integer chunk_size would get the new behaviour. This allows for a seamless, non-breaking transition.
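Collecting the three modes discussed so far into one dispatch sketch (all names are illustrative assumptions; total_rows stands in for however the implementation learns the full result size):

```python
from typing import Optional

def adaptive_chunk_size(num_columns: int, target_cells: int = 10_000_000) -> int:
    # illustrative heuristic from the earlier sketch
    return max(target_cells // max(num_columns, 1), 1)

def resolve_chunk_size(chunk_size: Optional[int], num_columns: int, total_rows: int) -> int:
    """Three proposed modes for chunk_size:
    None     -> adaptive row count (non-breaking new default)
    n > 0    -> exactly n rows per record batch (current behaviour, unchanged)
    0 or -1  -> a single chunk holding the complete result set
    """
    if chunk_size is None:
        return adaptive_chunk_size(num_columns)
    if chunk_size <= 0:
        return max(total_rows, 1)
    return chunk_size
```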

prrao87 (Member, Author) commented Mar 21, 2024

@alexander-beedie how does -1 produce an "unchunked" result? To my understanding, Arrow will always return record batches, i.e., chunks of records. The cases where it's None or a positive integer make sense.

Could you maybe post a snippet of how you'd use it here?

alexander-beedie (Contributor) commented Mar 21, 2024

> @alexander-beedie how does -1 produce an "unchunked" result? To my understanding arrow will always return record batches, i.e., chunks of records?

It doesn't/can't at the moment, as the current low-level code will always produce chunked results (the adaptive mode improves the situation, but isn't a guarantee). The idea would be to always produce a single chunk that represents the complete result set.

We'd use such a mode in Polars, as otherwise we typically rechunk Arrow data that arrives with n_chunks > 1 (it's more optimal for us to operate on contiguous data).
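To make the rechunking point concrete, here is a small self-contained example (pure pyarrow/Polars, independent of the Kùzu API):

```python
import polars as pl
import pyarrow as pa

# build an Arrow table that arrives as two record batches (two chunks)
batches = [
    pa.RecordBatch.from_pydict({"x": [1, 2]}),
    pa.RecordBatch.from_pydict({"x": [3, 4]}),
]
tbl = pa.Table.from_batches(batches)

df = pl.from_arrow(tbl)  # rechunk=True by default
print(df.n_chunks())     # 1: Polars copied the data into contiguous memory
```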

prrao87 (Member, Author) commented Mar 21, 2024

Hmm, so it's a Polars-specific setting, where we'd document that setting chunk_size to -1 is an option. But as per the Polars docs for the from_arrow method, this would require that we specify rechunk=False when the user passes -1 to get_as_pl, correct?

prrao87 (Member, Author) commented Mar 21, 2024

I think c26fc74 addresses your comments, @alexander-beedie. We can use the get_num_tuples() method to return the entire result set as a single chunk when the user specifies 0 or -1.

This will have to be documented carefully, though!
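From the caller's side, the intended behaviour would look roughly like this (continuing the hypothetical setup from the first snippet; per the comment above, get_num_tuples() backs this internally):

```python
# one record batch containing the complete result set
result = conn.execute("MATCH (p:Person) RETURN p.name, p.age")
table = result.get_as_arrow(chunk_size=-1)
assert len(table.to_batches()) <= 1  # a single chunk (or none, for an empty result)
```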

alexander-beedie (Contributor) commented Mar 21, 2024

> as per the polars docs for the from_arrow method, this would require that we specify rechunk=False when the user specifies -1 to get_as_pl, correct?

It's a no-op unless n_chunks > 1, so no need to set explicitly; can leave as-is.

@prrao87 prrao87 requested a review from mewim March 21, 2024 15:29
Review thread on tools/python_api/test/test_arrow.py (resolved).
mewim (Member) commented Mar 21, 2024

@prrao87 Note that when merging to master, the commits need to be squashed into one.

prrao87 (Member, Author) commented Mar 21, 2024

Shall I go ahead and squash-merge?

mewim (Member) commented Mar 21, 2024

Yeah, I think it is ready to merge.

prrao87 merged commit 05359c7 into master on Mar 21, 2024; 17 checks passed.
prrao87 deleted the arrow-chunksize branch on March 21, 2024 at 17:01.
alexander-beedie (Contributor) commented:

> We can use the get_num_tuples() method to return the entire results as a single chunk

@prrao87: Somehow I completely missed this method; great solution ✌️😎
