
Arrow chunk_size as keyword argument #3084

Merged: 12 commits merged into master from arrow-chunksize on Mar 21, 2024
Conversation

prrao87 (Member) commented Mar 19, 2024

Closes #2998.

As per @ray6080's comment, setting the Arrow record batch size (which we call chunk_size) to 1M rows is a reasonable default, since DuckDB does the same. In the majority of cases a larger record batch size is favourable, and the user can always lower chunk_size if necessary.

I also fixed the pyarrow tests so that they no longer specify the chunk_size argument as low integer values.
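For context, a minimal usage sketch of the change (the database path, schema, and query are made up for illustration; get_as_arrow and the 1M-row default are what this PR introduces):

```python
import kuzu

# hypothetical database and schema, purely for illustration
db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

# chunk_size is now an optional keyword argument; most callers can simply
# omit it and get the large default record batch size
table = conn.execute("MATCH (p:Person) RETURN p.name, p.age").get_as_arrow()

# callers that need smaller record batches can still lower it explicitly
small = conn.execute("MATCH (p:Person) RETURN p.name, p.age").get_as_arrow(chunk_size=2048)
```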

@prrao87 prrao87 requested a review from mewim March 19, 2024 13:53
prrao87 (Member, Author) commented Mar 19, 2024

@mewim what do you think about moving the get_as_pl adaptive chunk-size estimation logic for Polars over to the get_as_arrow method? That way, Polars could benefit from the Arrow logic directly rather than having to specify it independently.
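As a rough sketch of what "adaptive" estimation could mean here (the helper name and cell budget below are illustrative assumptions, not the actual implementation), the idea is to derive the rows-per-batch count from the width of the result:

```python
def adaptive_chunk_size(num_columns: int, target_cells: int = 10_000_000) -> int:
    """Illustrative heuristic: spread a fixed total-cell budget across all
    columns, so wide results get fewer rows per record batch."""
    return max(target_cells // max(num_columns, 1), 1)

print(adaptive_chunk_size(1))   # 10_000_000 rows per batch for a 1-column result
print(adaptive_chunk_size(10))  # 1_000_000 rows per batch for a 10-column result
```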

codecov bot commented Mar 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.71%. Comparing base (f8fe205) to head (638763b).

@@           Coverage Diff           @@
##           master    #3084   +/-   ##
=======================================
  Coverage   92.71%   92.71%           
=======================================
  Files        1162     1162           
  Lines       43140    43140           
=======================================
+ Hits        39997    39999    +2     
+ Misses       3143     3141    -2     


prrao87 (Member, Author) commented Mar 19, 2024

> moving the get_as_pl adaptive chunk size estimation logic for polars over to the get_as_arrow method

I just made those changes and built and tested locally; works well 👌🏽.

alexander-beedie (Contributor) commented Mar 21, 2024

> fixed the pyarrow tests to not specify the chunk_size argument as low integer values.

FYI: I believe they were specified this way in order to validate result acquisition across chunk boundaries.

Also, the meaning of chunk_size changed here: the docs say it refers to the number of rows, but by moving the logic from inside the Polars function into the Arrow one and exposing it on the existing parameter, the actual row count will now only match chunk_size when the result has a single column 😅

As per #2998 (comment), you could instead make the adaptive behaviour opt-in by supporting a chunk_size=None option, otherwise maintaining the existing behaviour (where chunk_size = n_rows).

This would allow adaptive behaviour without changing the meaning of an integer chunk_size (so it isn't a breaking change). Indeed, this could even be the new default: as the parameter currently has to be specified, you could change the default to chunk_size=None (i.e. adaptive by default) without any existing code being impacted 👍
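In signature terms, the suggestion amounts to something like the following sketch (the class is an illustrative stand-in; only the Optional chunk_size with a None default reflects the actual proposal):

```python
from typing import Optional
import pyarrow as pa

class QueryResult:  # illustrative stand-in for the real result class
    def get_as_arrow(self, chunk_size: Optional[int] = None) -> pa.Table:
        """None -> adaptive batching (new default, affects no existing callers);
        integer n -> exactly n rows per record batch, as documented today."""
        ...
```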

prrao87 (Member, Author) commented Mar 21, 2024

Hi @alexander-beedie, great points.

  1. It makes sense to revert to the small chunk size numbers in the tests so as to validate result acquisition across chunk boundaries, as you mentioned.
  2. For the opt-in behaviour for get_as_pl that you mentioned (where chunk_size=None), we would still perform the adaptive logic in the get_as_arrow method, correct?
  3. I'm not sure what you meant by the last comment (the "new default") not being a breaking change. Do you mean breaking w.r.t. the old approach that you committed, or w.r.t. this new way where we do the adaptive logic inside get_as_arrow?

In any case, I think we're converging towards an agreeable and better solution, so thanks again :)

mewim (Member) commented Mar 21, 2024

I would probably keep chunk_size as referring to the number of rows, and use the adaptive chunk size only if chunk_size is set to None / 0 / -1.

prrao87 (Member, Author) commented Mar 21, 2024

@mewim I reworked it according to your latest comment. What do you think?

alexander-beedie (Contributor) commented Mar 21, 2024

> I would probably keep the chunk size as referring to number of rows and use adaptive chunk size only if chunk size is set to None / 0 / -1.

I'd reserve chunk_size = 0/-1 for a new mode that guarantees only a single chunk (aka: no chunks?) is produced. This allows None to be adaptive, an integer to continue doing what it currently does, and, if/when the functionality is added lower down to produce an unchunked Arrow result, we could enable 0/-1 for that mode 🤔

> I'm not sure what you meant by the last comment (the "new default") not being a breaking change.

Just that the way the PR was initially written changed the meaning of chunk_size, so everyone currently using it would start getting back results that were not chunked the way they expected (which would be a breaking change). If the new default were None, all current usage would continue to behave identically (so the change in default would not be breaking); only new usage that omits an integer chunk_size would get the new behaviour. This allows for a seamless, non-breaking transition.
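Collecting the three modes discussed so far into one dispatch sketch (all names are illustrative assumptions; total_rows stands in for however the implementation learns the full result size):

```python
from typing import Optional

def adaptive_chunk_size(num_columns: int, target_cells: int = 10_000_000) -> int:
    # illustrative heuristic from the earlier sketch
    return max(target_cells // max(num_columns, 1), 1)

def resolve_chunk_size(chunk_size: Optional[int], num_columns: int, total_rows: int) -> int:
    """Three proposed modes for chunk_size:
    None     -> adaptive row count (non-breaking new default)
    n > 0    -> exactly n rows per record batch (current behaviour, unchanged)
    0 or -1  -> a single chunk holding the complete result set
    """
    if chunk_size is None:
        return adaptive_chunk_size(num_columns)
    if chunk_size <= 0:
        return max(total_rows, 1)
    return chunk_size
```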

prrao87 (Member, Author) commented Mar 21, 2024

@alexander-beedie how does -1 produce an "unchunked" result? To my understanding, Arrow will always return record batches, i.e., chunks of records. The cases where it's None or a positive integer make sense.

Could you maybe post a snippet of how you'd use it here?

alexander-beedie (Contributor) commented Mar 21, 2024

> @alexander-beedie how does -1 produce an "unchunked" result? To my understanding arrow will always return record batches, i.e., chunks of records?

It doesn't/can't at the moment, as the current low-level code will always produce chunked results (the adaptive mode improves the situation, but isn't a guarantee). The idea would be to always produce a single chunk that represents the complete result set.

We'd use such a mode in Polars, as otherwise we typically rechunk Arrow data that arrives with n_chunks > 1 (it's more optimal for us to operate on contiguous data).
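To make the rechunking point concrete, here is a small self-contained example (pure pyarrow/Polars, independent of the Kùzu API):

```python
import polars as pl
import pyarrow as pa

# build an Arrow table that arrives as two record batches (two chunks)
batches = [
    pa.RecordBatch.from_pydict({"x": [1, 2]}),
    pa.RecordBatch.from_pydict({"x": [3, 4]}),
]
tbl = pa.Table.from_batches(batches)

df = pl.from_arrow(tbl)  # rechunk=True by default
print(df.n_chunks())     # 1: Polars copied the data into contiguous memory
```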

prrao87 (Member, Author) commented Mar 21, 2024

Hmm, so it's a Polars-specific setting, where we'd document that setting chunk_size to -1 is an option. But as per the Polars docs for the from_arrow method, this would require that we specify rechunk=False when the user passes -1 to get_as_pl, correct?

prrao87 (Member, Author) commented Mar 21, 2024

I think c26fc74 addresses your comments, @alexander-beedie. We can use the get_num_tuples() method to return the entire result set as a single chunk when the user specifies 0 or -1.

This will have to be documented carefully, though!
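From the caller's side, the intended behaviour would look roughly like this (continuing the hypothetical setup from the first snippet; per the comment above, get_num_tuples() backs this internally):

```python
# one record batch containing the complete result set
result = conn.execute("MATCH (p:Person) RETURN p.name, p.age")
table = result.get_as_arrow(chunk_size=-1)
assert len(table.to_batches()) <= 1  # a single chunk (or none, for an empty result)
```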

alexander-beedie (Contributor) commented Mar 21, 2024

> as per the polars docs for the from_arrow method, this would require that we specify rechunk=False when the user specifies -1 to get_as_pl, correct?

It's a no-op unless n_chunks > 1, so no need to set explicitly; can leave as-is.

@prrao87 prrao87 requested a review from mewim March 21, 2024 15:29
Review thread on tools/python_api/test/test_arrow.py (resolved).
mewim (Member) commented Mar 21, 2024

@prrao87 Note that when merging to master, the commits need to be squashed into one.

prrao87 (Member, Author) commented Mar 21, 2024

Shall I go ahead and squash-merge?

mewim (Member) commented Mar 21, 2024

Yeah, I think it is ready to merge.

prrao87 merged commit 05359c7 into master on Mar 21, 2024; 17 checks passed.
prrao87 deleted the arrow-chunksize branch on March 21, 2024 at 17:01.
alexander-beedie (Contributor) commented:

> We can use the get_num_tuples() method to return the entire results as a single chunk

@prrao87: Somehow I completely missed this method; great solution ✌️😎
