Unexpected batch size of 32,767 instead of 10,000,000 #33
Hello @ddresslerlegalplans, thanks for opening this issue. I updated the parameter description of batch size to be (hopefully) more helpful.
Without knowing which database and driver you use, I have a hypothesis as to where the limit of 32767 comes from: 32767 is the maximum value of a signed 16-bit integer, which suggests the limit is imposed by your ODBC driver or database rather than by this library.
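As a quick sanity check of that number (the NumPy cross-check assumes NumPy is installed; it is only here for illustration):

```python
# 32767 is the largest value a signed 16-bit integer can hold: 2**15 - 1.
print(2**15 - 1)  # 32767

# Cross-check via NumPy's integer type info.
import numpy as np
print(np.iinfo(np.int16).max)  # 32767
```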
Not having to hold everything in memory at once is only the secondary use case of batching. The primary idea is to avoid fetching each row individually, in order to save network IO. I am guessing here, but it seems you want to produce one single large Arrow array holding all the data. Even then, you should fetch in batches and concatenate them afterwards. String data in particular is usually much smaller once it is in the Arrow array, because the buffers ODBC uses for transfer always have to account for the largest possible values, not just the values actually present in your database. So even in that use case, fetching with a batch size of just 100 or 1000 is more reasonable than fetching with a batch size of several million. Cheers, Markus
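A minimal sketch of that fetch-then-concatenate pattern, assuming the Python arrow-odbc bindings and pyarrow (the query, connection string, and batch size below are placeholders, and the exact keyword arguments may differ between versions):

```python
import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

# Fetch in modest batches; the ODBC transfer buffers are sized per batch,
# so a small batch size keeps memory usage low while fetching.
reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",   # placeholder query
    connection_string="DSN=my_dsn",   # placeholder connection string
    batch_size=1000,
)

# Concatenate all batches into a single Arrow table after fetching. The
# columnar Arrow representation (especially for strings) is usually much
# more compact than the fixed-size ODBC transfer buffers.
table = pa.Table.from_batches(list(reader))
print(table.num_rows)
```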
I am closing this issue, as there is nothing actionable apart from improving the documentation, which has already happened. Cheers, Markus
Thanks Markus! After changing the batch size to 32767 it appears to be running correctly. I appreciate the thorough explanation. Have a nice day. Cheers!
Hi @ddresslerlegalplans, happy to hear it works out for you now. This makes me think, though: maybe I should add a helper function which does the batching and concatenation into a single Arrow array itself? Thanks for reporting back. Cheers, Markus
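Such a helper could be a thin wrapper around the pattern sketched above (the function name and signature here are hypothetical, not part of the library):

```python
import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

def read_arrow_table_from_odbc(query: str, connection_string: str,
                               batch_size: int = 1000) -> pa.Table:
    """Fetch the whole result set in batches and return it as a single pyarrow Table."""
    reader = read_arrow_batches_from_odbc(
        query=query,
        connection_string=connection_string,
        batch_size=batch_size,
    )
    return pa.Table.from_batches(list(reader))
```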
I'm running a query with the batch size set to 10,000,000. When I enumerate over the BatchReader, the output shows batches of 32,767 rows (num_rows) instead of the requested 10,000,000. I would have expected num_rows to be 10 million.

If we use a smaller batch size, say 10, it works as expected. If we try the edge case of batch size + 1, the result is missing one expected row.

I'm curious how I can debug this and figure out what is causing this limit. Does this work with large batches, or do you recommend another tool for large batches? Thank you for your time and consideration.

I've also tried changing the query to select just one column, and I still get the same result.

I'm also wondering whether 1.456 GB of RAM usage will be a problem, as that is what I calculated the reader would need if it actually read 10 million records at once.
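For illustration, a minimal reproduction of this kind of check might look as follows, assuming the Python arrow-odbc bindings (the query and connection string are placeholders, not taken from the original report):

```python
from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",   # placeholder query
    connection_string="DSN=my_dsn",   # placeholder connection string
    batch_size=10_000_000,            # requested batch size
)

# Enumerate the batches and check how many rows each one actually contains.
for i, batch in enumerate(reader):
    print(i, batch.num_rows)          # reported behaviour: 32767 rows per batch, not 10,000,000
```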