Unexpected batch size of 32,767 instead of 10,000,000 #33

Closed
ddresslerlegalplans opened this issue Jan 13, 2023 · 4 comments

Comments


ddresslerlegalplans commented Jan 13, 2023

I'm running:

    from arrow_odbc import read_arrow_batches_from_odbc

    batch_size = 10_000_000
    reader = read_arrow_batches_from_odbc(
        query=QUERY,
        connection_string=CONNECTION_STRING,
        batch_size=batch_size,
    )

When I enumerate over the BatchReader:

    for i, batch in enumerate(reader):
        print(f"Batch Size {batch_size}")
        print(f"In loop reading arrow batches, i:{i}")
        # print(f"batch schema:{batch.schema}")
        print(f"NUM ROWS: {batch.num_rows}")  # 32767
        print(f"num_columns: {batch.num_columns}")  # 9
        # print(f"columns: {batch.columns}")  # a list of columns and their data
        print(f"get_total_buffer_size: {batch.get_total_buffer_size()}")  # 4770907
        print(f"nbytes: {batch.nbytes}")  # 4770891
        # print(f"schema: {batch.schema}")

I get the following output:

Batch Size 10000000
In loop reading arrow batches, i:0
NUM ROWS: 32767
num_columns: 9
get_total_buffer_size: 4770907
nbytes: 4770891

I would have expected num_rows to be 10 million.

If we use a smaller batch size, say 10, it works:

Batch Size 10
In loop reading arrow batches, i:4
NUM ROWS: 10
num_columns: 9
get_total_buffer_size: 1468

If we try the edge case of one above the limit (batch size 32768), one expected row is missing:

Batch Size 32768
In loop reading arrow batches, i:2
NUM ROWS: 32767
num_columns: 9
get_total_buffer_size: 4749999
nbytes: 4749983

I'm curious how I can debug this and figure out what's causing this limit. Does this work with large batches, or do you recommend another tool for large batches? Thank you for your time and consideration.

I've also tried changing the query to select just one column, and I still get the same result:

    QUERY = f"SELECT 'A' as the_letter_A FROM tbl"

I'm also wondering whether 1.456 GB of RAM usage will be a problem, as that's what I calculated if the reader were actually reading 10M records.

@pacman82 (Owner)

Hello @ddresslerlegalplans,

thanks for opening this issue. I updated the parameter description of batch size to be (hopefully) more helpful:

The maximum number of rows within each batch. Please note that the actual batch
size is up to the ODBC driver of your database. This parameter primarily influences the size
of the buffers the ODBC driver is supposed to fill with data, yet it is up to the driver how
many values it fills in one go. Also note that the primary use case of batching is to reduce
IO overhead. So even if you fetch millions of rows, a batch size of 100 or 1000 may be
entirely reasonable. This is trading memory for speed, but with diminishing returns.

Without knowing which database and driver you use, I have a hypothesis about where the limit of 32767 comes from. 32767 is 2^15 - 1, the highest number you can represent with a signed 16-bit integer. So my guess is that your driver uses a 16-bit integer to represent the batch size. Actually, it behaves quite nicely; I have seen drivers simply overflow and wreak havoc if the batch size becomes too large. Independent of the driver implementation, in the ODBC standard as I understand it the driver has the final say about the batch size, so you cannot rely on batches being a certain length.
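
For illustration, the arithmetic behind that guess (just a sketch, not code from this library):

    # Largest value a signed 16-bit integer can hold
    I16_MAX = 2**15 - 1
    print(I16_MAX)  # 32767

    # If a driver stores the requested batch size in such an integer,
    # the effective batch size is clamped no matter what you ask for.
    print(min(10_000_000, I16_MAX))  # 32767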

I'm also wondering whether 1.456 GB of RAM usage will be a problem, as that's what I calculated if the reader were actually reading 10M records.

That's the secondary use case of batching: you do not have to hold everything in memory at once. The primary idea is to avoid fetching each row individually, in order to save network IO. I am guessing here, but it seems you want to produce one single large Arrow array to hold all the data. You should still concatenate it after fetching. String data in particular is usually much smaller once it is in the Arrow array; the buffers ODBC uses for transfer always need to account for the largest possible values, not only for the actual values in your database. So even in that use case, fetching with a batch size of just 100 or 1000 is more reasonable than fetching with a batch size of several million.
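
If the goal is one big Arrow table, a pattern along these lines should work (just a sketch using pyarrow's Table.from_batches, not something built into this library; QUERY and CONNECTION_STRING are the same placeholders as above):

    import pyarrow as pa
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=QUERY,
        connection_string=CONNECTION_STRING,
        batch_size=1000,  # modest batch size; the IO savings flatten out quickly
    )

    # Materialize all batches, then concatenate them into a single table.
    # This keeps the whole result in memory, so it only makes sense if the
    # result set actually fits.
    batches = list(reader)
    table = pa.Table.from_batches(batches)

With a batch size of 1000 the transfer buffers stay small, while the final table still contains all the rows.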

Cheers, Markus

@pacman82 (Owner)

I am closing this issue, as there is nothing actionable apart from improving the documentation, which has already happened.

Cheers, Markus

@ddresslerlegalplans (Author)

Thanks, Markus! After changing the batch size to 32767 it appears to run correctly. I appreciate the thorough explanation. Have a nice day.

Cheers!

@pacman82 (Owner)

Hi @ddresslerlegalplans,

happy to hear it works out for you now. This makes me think, though: maybe I should add a helper function which does the batching and the concatenation into a single Arrow array itself?
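
Roughly, I am thinking of a thin wrapper like this (just a sketch to illustrate the idea; the helper name is made up and nothing like it exists in the library yet):

    import pyarrow as pa
    from arrow_odbc import read_arrow_batches_from_odbc

    # Hypothetical helper: fetch in batches, concatenate into one table.
    def read_arrow_table_from_odbc(query, connection_string, batch_size=1000):
        reader = read_arrow_batches_from_odbc(
            query=query,
            connection_string=connection_string,
            batch_size=batch_size,
        )
        # Note: an empty result set would need an explicit schema here.
        return pa.Table.from_batches(list(reader))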

Thanks for reporting back.

Cheers, Markus
