Feature Request - File Size Limit for Partitioned Parquet #190
Comments
Hello @lukeawyatt, thanks for your feedback! You are right. When I started writing this tool, I was concerned with the memory usage on the machine executing this code. You likely know this already, but I think I'll write it down here, as your issue made me realize that batch_size_mib and batches_per_file were designed with memory consumption in mind, not with the size of the output files.
I would be interested to learn what kind of inconsistency you are most concerned with. The inconsistency between different datasets resulting from different queries? Or is it the inconsistency of the size of a row group within each dataset?
I agree, this feels like a sensible feature. I currently have little idea of how hard or easy it will be to implement. The biggest source of uncertainty is the capabilities of the upstream parquet writer (arrow-rs). In both cases I would expect the limit to be fuzzy, though. I could also imagine the output exceeding the file limit, maybe by the length of a footer or something. Would this be fine for your use case? Also, if possible, I would like to learn more about the way you are using this tool.
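As an aside, the "fuzzy limit" behaviour described here can be illustrated with a small, self-contained Rust sketch. This is not odbc2parquet's actual code; it writes plain byte buffers instead of Parquet, and the file names, function name, and threshold are invented for illustration. The point is only that the size check happens between whole batches, so a file can overshoot the cap by a partial batch (or, for Parquet, by a footer).

```rust
// Illustrative sketch only: roll over to a new output file once the bytes
// written so far reach a threshold. Because the check happens between whole
// batches, the limit is approximate rather than exact.
use std::fs::File;
use std::io::Write;

fn write_partitioned(batches: &[Vec<u8>], threshold_bytes: u64) -> std::io::Result<()> {
    let mut file_index = 1;
    let mut current = File::create(format!("out_{file_index:02}.bin"))?;
    let mut written: u64 = 0;

    for batch in batches {
        // Start a new file only once the previous one has crossed the threshold.
        if written >= threshold_bytes {
            file_index += 1;
            current = File::create(format!("out_{file_index:02}.bin"))?;
            written = 0;
        }
        current.write_all(batch)?;
        written += batch.len() as u64;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Three 600-byte batches with a 1000-byte cap: the first file ends up at
    // roughly 1200 bytes, i.e. the cap is exceeded by part of a batch.
    let batches = vec![vec![0u8; 600]; 3];
    write_partitioned(&batches, 1000)
}
```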
Well thank you. Despite the fact that you are just an anonymous person on the internet to me, this still means a lot to me. Cheers, Markus
Hello @lukeawyatt, As of now, I feel I could only implement this using hacky workarounds. Let's see if the upstream issue gains some traction. Cheers, Markus
Hey @pacman82, Thanks for being so prompt with this. I've subscribed and "thumbed up" the arrow-rs issue to aid traction. Regarding my use case and the inconsistencies I'm referencing: within my datasets, it seems some batches hold significantly more data than others, likely due to certain varchar columns. I'd like to agnostically pass in a query and have the chunked output be fairly consistent. Having more predictability will help prevent file cap issues when the files are in transit. My end goal can be seen in the scenarios below:
Let me know if this helps or if you need further clarification. And thanks again! ~Luke
Hello Luke,
Cheers, Markus
Hi Markus, This works flawlessly! Thank you for your efforts on this. ~Luke
I've noticed that when producing partitioned files from large datasets, the file output sizes are inconsistent when specifying batch_size_mib and batches_per_file. It'd be a very helpful feature if a new parameter could be added so the file size can be capped while still allowing the batches to self-configure as they do. I'd imagine it'd look something like file_size_limit_mib.

Thanks for all that you do! Let me know your thoughts or if I missed something.
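To make the reported inconsistency concrete, here is a tiny, purely illustrative Rust sketch (the batch sizes and batches-per-file value are invented; this is not odbc2parquet code): with a fixed number of batches per file, each output file's size simply tracks whatever those batches happen to contain, so files covering varchar-heavy batches come out much larger than the rest.

```rust
fn main() {
    // Hypothetical batch sizes in MiB; varchar-heavy batches are much larger.
    let batch_sizes_mib = [12, 11, 95, 13, 88, 10, 90, 12];
    let batches_per_file = 2;

    // Group a fixed number of batches per file, as batches_per_file does:
    // the per-file totals vary widely with the content of the batches.
    for (i, chunk) in batch_sizes_mib.chunks(batches_per_file).enumerate() {
        let total: u32 = chunk.iter().sum();
        println!("file {:02}: {} MiB", i + 1, total);
    }
    // Prints roughly 23, 108, 98, 102 MiB -- the inconsistency described above.
}
```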