Unexpected batch size of 32,767 instead of 10,000,000 #33

Closed
ddresslerlegalplans opened this issue Jan 13, 2023 · 4 comments

Comments


ddresslerlegalplans commented Jan 13, 2023

I'm running:

    from arrow_odbc import read_arrow_batches_from_odbc

    batch_size = 10_000_000
    reader = read_arrow_batches_from_odbc(
        query=QUERY,
        connection_string=CONNECTION_STRING,
        batch_size=batch_size,
    )

When I enumerate over the BatchReader:

    for i, batch in enumerate(reader):
        print(f"Batch Size {batch_size}")
        print(f"In loop reading arrow batches, i:{i}")
        # print(f"batch schema:{batch.schema}")
        print(f"NUM ROWS: {batch.num_rows}")  # 32767
        print(f"num_columns: {batch.num_columns}")  # 9
        # print(f"columns: {batch.columns}")  # a list of columns and their data
        print(f"get_total_buffer_size: {batch.get_total_buffer_size()}")  # 4770907
        print(f"nbytes: {batch.nbytes}")  # 4770891
        # print(f"schema: {batch.schema}")

I get the following output:

Batch Size 10000000
In loop reading arrow batches, i:0
NUM ROWS: 32767
num_columns: 9
get_total_buffer_size: 4770907
nbytes: 4770891

I would have expected num_rows to be 10 million.

If we use a smaller batch size, say 10, it works:

Batch Size 10
In loop reading arrow batches, i:4
NUM ROWS: 10
num_columns: 9
get_total_buffer_size: 1468

If we try the edge case of one above the limit (batch size 32768), one expected row is missing:

Batch Size 32768
In loop reading arrow batches, i:2
NUM ROWS: 32767
num_columns: 9
get_total_buffer_size: 4749999
nbytes: 4749983

I'm curious how I can debug this and figure out what's causing this limit. Does this work with large batches, or do you recommend another tool for large batches? Thank you for your time and consideration.

I've also tried changing the query to select just one column, and I still get the same result:

    QUERY = f"SELECT 'A' as the_letter_A FROM tbl"

I'm also wondering whether 1.456 GB of RAM usage will be a problem, as that's what I calculated if the reader were actually reading 10M records.

@pacman82 (Owner)

Hello @ddresslerlegalplans,

thanks for opening this issue. I updated the parameter description of batch size to be (hopefully) more helpful:

The maximum number of rows within each batch. Please note that the actual batch
size is up to the ODBC driver of your database. This parameter primarily influences the size
of the buffers the ODBC driver is supposed to fill with data, yet it is up to the driver how
many values it fills in one go. Also note that the primary use case of batching is to reduce
IO overhead. So even if you fetch millions of rows, a batch size of 100 or 1000 may be
entirely reasonable. This is trading memory for speed, but with diminishing returns.

Without knowing which database and driver you use, I have a hypothesis about where the limit of 32767 comes from. 32767 is 2^15 - 1, the highest number you can represent with a signed 16-bit integer. So my guess is that your driver uses a 16-bit integer to represent the batch size. Actually, it behaves quite nicely; I have seen drivers simply overflow and wreak havoc if the batch size becomes too large. Independent of the driver implementation, in the ODBC standard as I understand it the driver has the final say about the batch size, so you cannot rely on batches being a certain length.
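
For illustration, the arithmetic behind that guess (just a sketch, not code from this library):

    # Largest value a signed 16-bit integer can hold
    I16_MAX = 2**15 - 1
    print(I16_MAX)  # 32767

    # If a driver stores the requested batch size in such an integer,
    # the effective batch size is clamped no matter what you ask for.
    print(min(10_000_000, I16_MAX))  # 32767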

I'm also wondering whether 1.456 GB of RAM usage will be a problem, as that's what I calculated if the reader were actually reading 10M records.

That's the secondary use case of batching: you do not have to hold everything in memory at once. The primary idea is to avoid fetching each row individually, in order to save network IO. I am guessing here, but it seems you want to produce one single large Arrow array to hold all the data. You should still concatenate it after fetching. String data in particular is usually much smaller once it is in the Arrow array; the buffers ODBC uses for transfer always need to account for the largest possible values, not only for the actual values in your database. So even in that use case, fetching with a batch size of just 100 or 1000 is more reasonable than fetching with a batch size of several million.
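
If the goal is one big Arrow table, a pattern along these lines should work (just a sketch using pyarrow's Table.from_batches, not something built into this library; QUERY and CONNECTION_STRING are the same placeholders as above):

    import pyarrow as pa
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=QUERY,
        connection_string=CONNECTION_STRING,
        batch_size=1000,  # modest batch size; the IO savings flatten out quickly
    )

    # Materialize all batches, then concatenate them into a single table.
    # This keeps the whole result in memory, so it only makes sense if the
    # result set actually fits.
    batches = list(reader)
    table = pa.Table.from_batches(batches)

With a batch size of 1000 the transfer buffers stay small, while the final table still contains all the rows.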

Cheers, Markus

@pacman82 (Owner)

I am closing this issue, as there is nothing actionable apart from improving the documentation, which has already happened.

Cheers, Markus

@ddresslerlegalplans (Author)

Thanks, Markus! After changing the batch size to 32767 it appears to run correctly. I appreciate the thorough explanation. Have a nice day.

Cheers!

@pacman82 (Owner)

Hi @ddresslerlegalplans,

happy to hear it works out for you now. This makes me think, though: maybe I should add a helper function which does the batching and the concatenation into a single Arrow array itself?
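
Roughly, I am thinking of a thin wrapper like this (just a sketch to illustrate the idea; the helper name is made up and nothing like it exists in the library yet):

    import pyarrow as pa
    from arrow_odbc import read_arrow_batches_from_odbc

    # Hypothetical helper: fetch in batches, concatenate into one table.
    def read_arrow_table_from_odbc(query, connection_string, batch_size=1000):
        reader = read_arrow_batches_from_odbc(
            query=query,
            connection_string=connection_string,
            batch_size=batch_size,
        )
        # Note: an empty result set would need an explicit schema here.
        return pa.Table.from_batches(list(reader))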

Thanks for reporting back.

Cheers, Markus
