Feature Request - File Size Limit for Partitioned Parquet #190

Closed
lukeawyatt opened this issue May 10, 2022 · 5 comments

@lukeawyatt

I've noticed that when producing partitioned files from large datasets, the output file sizes are inconsistent when specifying batch_size_mib and batches_per_file. It would be a very helpful feature if a new parameter could be added to cap the file size while still allowing the batches to self-configure as they do now.

I'd imagine it'd look something like file_size_limit_mib

Thanks for all that you do! Let me know your thoughts or if I missed something.

@pacman82
Owner

Hello @lukeawyatt ,

thanks for your feedback! You are right. When I started writing this tool, I was mainly concerned with the memory usage on the machine executing it. batch_size_mib refers to the size of a buffer for data in transit; it is not a way to specify the size of a batch within the file. Currently each batch becomes one row group, which is likely to be a lot smaller than the transit buffer, both due to compression and because not every string or binary value maxes out its supported length. The factor by which a batch shrinks depends a lot on the shape of the data.

You likely know this already, but I thought I would write it down here anyway, as your issue made me realize that --help does a bad job of explaining this.
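To make the difference concrete with purely made-up numbers: a VARCHAR(4000) column fetched in batches of 100,000 rows reserves roughly 400 MB of transit buffer regardless of what the actual values look like, while the same rows, once encoded and compressed into a row group, might occupy only a few tens of MB on disk.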

the file output sizes are inconsistent when specifying batch_size_mib and batches_per_file

I would be interested to learn what kind of inconsistency you are most concerned with. The inconsistency between different datasets resulting from different queries? Or the inconsistency of row group sizes within each dataset?

I'd imagine it'd look something like file_size_limit_mib

I agree, this feels like a sensible feature. I currently have little idea of how hard or easy it would be to implement. The biggest source of uncertainty is the capabilities of the upstream parquet library in that regard. If it allows writing to any io::Write, or even supports a file size limit directly, I would say this is likely to happen.

In both cases I would expect the limit to be fuzzy, though. I could also imagine it exceeding the file limit, by maybe the length of a footer or something. Would this be fine for your use case?
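(As an illustration of what "fuzzy" could mean, with hypothetical numbers: if the size check can only happen at row group boundaries, a file written against a 1 GiB limit might overshoot it by up to one row group plus the footer, or stop somewhat short of it.)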

Also, if possible, I would like to learn more about the way you are using odbc2parquet and why consistent file sizes matter to you.

Thanks for all that you do!

Well, thank you. Despite the fact that you are just an anonymous person on the internet to me, this still means a lot.

Cheers, Markus

@pacman82
Owner

Hello @lukeawyatt ,

As of now, I feel I could only implement this using hacky workarounds. Let's see if the upstream issue gains some traction.

Cheers, Markus

@lukeawyatt
Author

Hey @pacman82

Thanks for being so prompt with this. I've subscribed to and "thumbed up" the arrow-rs issue to help it gain traction. Regarding my use case and the inconsistencies I'm referencing: within my datasets, some batches seem to hold significantly more data than others, likely due to certain varchar columns. I'd like to pass in an arbitrary query and have the chunked output be fairly consistent in size. More predictability will help prevent file size cap issues when the files are in transit. My end goal can be seen in the scenarios below:

 

Example 1: Default parameters

odbc2parquet query --connection-string '{CS}' /tmp/out.par "{QUERY}"
OUTPUT
out.par - 5.6 GB

 

Example 2: Current Usage, Batch configuration to enable partitioning

odbc2parquet query --batches-per-file 4000 --batch-size-mib 100 --connection-string '{CS}' /tmp/out.par "{QUERY}"
OUTPUT
out_1.par - 745 MB
out_2.par - 1205 MB
out_3.par - 432 MB
out_4.par - 1461 MB
out_5.par - 894 MB
out_6.par - 863 MB

 

Example 3: Desired Output, Configure only a partitioned file limit

odbc2parquet query --file_size_limit_mib 976 --connection-string '{CS}' /tmp/out.par "{QUERY}"
OUTPUT
out_1.par - 1022 MB
out_2.par - 1024 MB
out_3.par - 1019 MB
out_4.par - 1021 MB
out_5.par - 1020 MB
out_6.par - 494 MB

 

Let me know if this helps or if you need further clarification. And thanks again!

~Luke

@pacman82
Owner

pacman82 commented Jun 4, 2022

Hello Luke,

odbc2parquet 0.8.0 has been released. It features the --file-size-threshold option for the query subcommand. Please tell me how it works for you.
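For reference, an invocation in the style of the earlier examples might then look like the line below; the exact value syntax accepted by --file-size-threshold is an assumption here, so please check odbc2parquet query --help for the authoritative form.

odbc2parquet query --file-size-threshold 1GiB --connection-string '{CS}' /tmp/out.par "{QUERY}"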

Cheers, Markus

@lukeawyatt
Author

Hi Markus,

This works flawlessly! Thank you for your efforts on this.

~Luke
