
Generated Parquet files are extremely fragmented #123

Open
jaychia opened this issue May 3, 2024 · 8 comments

jaychia commented May 3, 2024

Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug in the Polars Parquet writer, but it definitely also affects the benchmark results.

For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
[screenshot: Parquet footer metadata showing ~20,000 rowgroups]

Each rowgroup has only about 3,400 rows and a size of 117 kB. For reference, Parquet rowgroups are often suggested to be around 128 MB. Because we have so many rowgroups, the Parquet metadata alone is 27 MB, and reading the file likely involves a ton of extra hops 😅

Writing the data with PyArrow instead (I amended the code in prepare_data.py), we get much better-behaved rowgroups:

[screenshot: rowgroup layout of the PyArrow-written file]

Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
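The amended write amounts to something like this (a sketch; the table and output path here are placeholders):

```python
import pyarrow.parquet as pq

# row_group_size caps the number of rows per rowgroup; PyArrow's default
# (1024**2 = 1_048_576 rows) already yields far fewer, larger rowgroups
# than the ~3,400-row groups the Polars writer produced.
pq.write_table(table, "lineitem.parquet", row_group_size=1024**2)
```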


jaychia commented May 3, 2024

I also did some very rough benchmarks before/after making the rowgroups nicer:

DuckDB:

Before:
Code block 'Run duckdb query 1' took: 5.84075 s
Code block 'Run duckdb query 2' took: 0.54433 s
Code block 'Run duckdb query 3' took: 3.55683 s
Code block 'Run duckdb query 4' took: 3.08790 s
Code block 'Run duckdb query 5' took: 3.28742 s
Code block 'Run duckdb query 6' took: 3.09252 s
Code block 'Run duckdb query 7' took: 4.42866 s
Code block 'Run duckdb query 8' took: 3.93587 s
Code block 'Run duckdb query 9' took: 4.89768 s
Code block 'Run duckdb query 10' took: 3.53358 s

After:
Code block 'Run duckdb query 1' took: 1.30279 s
Code block 'Run duckdb query 2' took: 0.33867 s
Code block 'Run duckdb query 3' took: 0.88817 s
Code block 'Run duckdb query 4' took: 0.67681 s
Code block 'Run duckdb query 5' took: 0.94662 s
Code block 'Run duckdb query 6' took: 0.65088 s
Code block 'Run duckdb query 7' took: 1.56854 s
Code block 'Run duckdb query 8' took: 1.16898 s
Code block 'Run duckdb query 9' took: 1.49360 s
Code block 'Run duckdb query 10' took: 1.45454 s

Polars:

Before:
Code block 'Run polars query 1' took: 2.72320 s
Code block 'Run polars query 2' took: 0.17861 s
Code block 'Run polars query 3' took: 1.20695 s
Code block 'Run polars query 4' took: 1.17034 s
Code block 'Run polars query 5' took: 1.28489 s
Code block 'Run polars query 6' took: 0.91699 s
Code block 'Run polars query 7' took: 1.69375 s
Code block 'Run polars query 8' took: 1.64263 s
Code block 'Run polars query 9' took: 8.58342 s
Code block 'Run polars query 10' took: 1.66261 s

After:
Code block 'Run polars query 1' took: 1.51288 s
Code block 'Run polars query 2' took: 0.14328 s
Code block 'Run polars query 3' took: 0.67987 s
Code block 'Run polars query 4' took: 0.52064 s
Code block 'Run polars query 5' took: 0.69359 s
Code block 'Run polars query 6' took: 0.30325 s
Code block 'Run polars query 7' took: 0.72357 s
Code block 'Run polars query 8' took: 0.93876 s
Code block 'Run polars query 9' took: 4.14785 s
Code block 'Run polars query 10' took: 0.88245 s

Pretty significant!

ritchie46 commented

Right... that is something we should take up upstream in Polars.

ritchie46 commented

I think `sink_parquet` is the culprit here. I expect `collect().write_parquet()` to produce much cleaner row-groups; `sink_parquet` immediately writes morsel results.
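Roughly, the difference between the two paths (a sketch; the source here is a placeholder):

```python
import polars as pl

lf = pl.scan_parquet("raw/*.parquet")  # placeholder source

# Streaming sink: each morsel is flushed as it arrives,
# which can produce many tiny rowgroups.
lf.sink_parquet("out_streaming.parquet")

# Materialize first, then write: the writer sees the whole DataFrame
# and can emit larger, more uniform rowgroups.
lf.collect().write_parquet("out_eager.parquet")
```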


jaychia commented May 3, 2024

I tried `.collect().write_parquet()` on a SCALE_FACTOR=0.2 dataset. It's better, but the rowgroups are still fairly fragmented (about 4MB compressed, 10MB uncompressed), and I'm also still noticing some really tiny ~100KB ones.

Ideally Parquet rowgroups should be much larger (the format spec recommends 512MB to 1GB, but in practice I've seen more like 128MB or so).

Any thoughts on whether I should create an issue in Polars itself to follow up there?
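For anyone reproducing this, a quick sketch of how the rowgroup sizes can be inspected from the footer metadata (the path is a placeholder):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("lineitem.parquet").metadata
print(f"{md.num_row_groups} row groups")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # Compressed size is tracked per column chunk, so sum across columns.
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(f"rg {i}: {rg.num_rows} rows, {compressed / 1024:.0f} kB compressed, "
          f"{rg.total_byte_size / 1024:.0f} kB uncompressed")
```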


ritchie46 commented May 4, 2024

Though that recommendation doesn't necessarily mean better performance, I think. 512MB is very large, and we could do a lot more in parallel if we shrink the sizes.

Currently, our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size. Maybe we could, but I do think we should favor relatively smaller groups. (But I agree, not this small.)


jaychia commented May 5, 2024

> Though that recommendation doesn't necessarily mean better performance, I think. 512MB is very large, and we could do a lot more in parallel if we shrink the sizes.
>
> Currently, our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size. Maybe we could, but I do think we should favor relatively smaller groups. (But I agree, not this small.)

Yup, I agree that 512MB as a blanket statement is probably not the best idea. Those were suggestions based on HDFS, I believe, but for modern workloads we should probably optimize for something like AWS S3/object storage.

> our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size... I do think we should favor relatively smaller groups

Do you have an idea of how small you'd prefer here? For AWS S3, typical sizes for byte-range requests are 8 MB or 16 MB, which could be a good target to build around. Perhaps a good guideline would be to aim for column chunks of about 16 MB (~2M rows for int64 columns), so rowgroup sizes would be roughly 16 MB * N_COLS. This is close to PyArrow's default, which is about 1M rows per rowgroup.

WDYT?
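The back-of-envelope arithmetic behind that ~2M-row figure, as a quick sketch:

```python
# Target ~16 MB per column chunk; int64 values are 8 bytes each.
TARGET_CHUNK_BYTES = 16 * 1024 * 1024
rows_per_rowgroup = TARGET_CHUNK_BYTES // 8  # 2_097_152, i.e. ~2M rows
# The total rowgroup size then scales with column count: ~16 MB * N_COLS.
print(rows_per_rowgroup)
```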

ritchie46 commented

I think I will try to hit a row count rather than a row-group size (defaulting to 512^2). There was an issue in Polars that allowed very small splits (a few rows) to be written. Will fix it.
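In the meantime, the eager writer's row-count knob can also be set explicitly; a sketch (the input path is a placeholder):

```python
import polars as pl

df = pl.read_parquet("lineitem.parquet")  # placeholder input
# 512**2 = 262_144 rows per rowgroup
df.write_parquet("out.parquet", row_group_size=512**2)
```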


jaychia commented May 6, 2024

> I think I will try to hit a row count rather than a row-group size (defaulting to 512^2). There was an issue in Polars that allowed very small splits (a few rows) to be written. Will fix it.

That sounds great and will be a big improvement vs the current behavior of just a few thousand rows! Thanks for the quick responses 👏 👏 👏

Any thoughts on why 512^2 vs something like PyArrow's 1024^2? I'm curious if you've done any benchmarking demonstrating performance benefits of having smaller rowgroups, either with the Polars reader or otherwise (or on local disk vs S3).
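A crude way to run that comparison locally would be something like this (paths and sizes are placeholders, and a fair version should also test against object storage):

```python
import time
import polars as pl

df = pl.read_parquet("lineitem.parquet")  # placeholder input
for rows in (512**2, 1024**2):
    path = f"/tmp/lineitem_rg_{rows}.parquet"
    df.write_parquet(path, row_group_size=rows)
    start = time.perf_counter()
    pl.read_parquet(path)  # full-scan read back
    print(f"{rows} rows/group: read in {time.perf_counter() - start:.3f}s")
```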
