
Generated Parquet files are extremely fragmented #123

Open
jaychia opened this issue May 3, 2024 · 8 comments

jaychia commented May 3, 2024

Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug in the Polars Parquet writer, but it definitely also affects the benchmark results.

For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
[screenshot: Parquet footer metadata showing ~20,000 rowgroups]

Each rowgroup has only about 3,400 rows and a size of 117 kB. For reference, Parquet rowgroups are often suggested to be around 128 MB. Because we have so many rowgroups, the Parquet metadata alone is 27 MB, and reading the file likely involves a ton of extra hops 😅

Writing the data with PyArrow instead (I amended the code in prepare_data.py), we get much better-behaved rowgroups:

[screenshot: rowgroup layout of the PyArrow-written file]

Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
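The amended write amounts to something like this (a sketch; the table and output path here are placeholders):

```python
import pyarrow.parquet as pq

# row_group_size caps the number of rows per rowgroup; PyArrow's default
# (1024**2 = 1_048_576 rows) already yields far fewer, larger rowgroups
# than the ~3,400-row groups the Polars writer produced.
pq.write_table(table, "lineitem.parquet", row_group_size=1024**2)
```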


jaychia commented May 3, 2024

I also did some very rough benchmarks before/after making the rowgroups nicer:

DuckDB:

Before:
Code block 'Run duckdb query 1' took: 5.84075 s
Code block 'Run duckdb query 2' took: 0.54433 s
Code block 'Run duckdb query 3' took: 3.55683 s
Code block 'Run duckdb query 4' took: 3.08790 s
Code block 'Run duckdb query 5' took: 3.28742 s
Code block 'Run duckdb query 6' took: 3.09252 s
Code block 'Run duckdb query 7' took: 4.42866 s
Code block 'Run duckdb query 8' took: 3.93587 s
Code block 'Run duckdb query 9' took: 4.89768 s
Code block 'Run duckdb query 10' took: 3.53358 s

After:
Code block 'Run duckdb query 1' took: 1.30279 s
Code block 'Run duckdb query 2' took: 0.33867 s
Code block 'Run duckdb query 3' took: 0.88817 s
Code block 'Run duckdb query 4' took: 0.67681 s
Code block 'Run duckdb query 5' took: 0.94662 s
Code block 'Run duckdb query 6' took: 0.65088 s
Code block 'Run duckdb query 7' took: 1.56854 s
Code block 'Run duckdb query 8' took: 1.16898 s
Code block 'Run duckdb query 9' took: 1.49360 s
Code block 'Run duckdb query 10' took: 1.45454 s

Polars:

Before:
Code block 'Run polars query 1' took: 2.72320 s
Code block 'Run polars query 2' took: 0.17861 s
Code block 'Run polars query 3' took: 1.20695 s
Code block 'Run polars query 4' took: 1.17034 s
Code block 'Run polars query 5' took: 1.28489 s
Code block 'Run polars query 6' took: 0.91699 s
Code block 'Run polars query 7' took: 1.69375 s
Code block 'Run polars query 8' took: 1.64263 s
Code block 'Run polars query 9' took: 8.58342 s
Code block 'Run polars query 10' took: 1.66261 s

After:
Code block 'Run polars query 1' took: 1.51288 s
Code block 'Run polars query 2' took: 0.14328 s
Code block 'Run polars query 3' took: 0.67987 s
Code block 'Run polars query 4' took: 0.52064 s
Code block 'Run polars query 5' took: 0.69359 s
Code block 'Run polars query 6' took: 0.30325 s
Code block 'Run polars query 7' took: 0.72357 s
Code block 'Run polars query 8' took: 0.93876 s
Code block 'Run polars query 9' took: 4.14785 s
Code block 'Run polars query 10' took: 0.88245 s

Pretty significant!

ritchie46 commented

Right... that is something we should take up upstream in Polars.

ritchie46 commented

I think `sink_parquet` is the culprit here. I expect `collect().write_parquet()` to produce much cleaner row-groups; `sink_parquet` immediately writes morsel results.
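Roughly, the difference between the two paths (a sketch; the source here is a placeholder):

```python
import polars as pl

lf = pl.scan_parquet("raw/*.parquet")  # placeholder source

# Streaming sink: each morsel is flushed as it arrives,
# which can produce many tiny rowgroups.
lf.sink_parquet("out_streaming.parquet")

# Materialize first, then write: the writer sees the whole DataFrame
# and can emit larger, more uniform rowgroups.
lf.collect().write_parquet("out_eager.parquet")
```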


jaychia commented May 3, 2024

I tried `.collect().write_parquet()` on a SCALE_FACTOR=0.2 dataset. It's better, but the rowgroups are still fairly fragmented (about 4MB compressed, 10MB uncompressed), and I'm also still noticing some really tiny ~100KB ones.

Ideally Parquet rowgroups should be much larger (the format spec recommends 512MB to 1GB, but in practice I've seen more like 128MB or so).

Any thoughts on whether I should create an issue in Polars itself to follow up there?
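For anyone reproducing this, a quick sketch of how the rowgroup sizes can be inspected from the footer metadata (the path is a placeholder):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("lineitem.parquet").metadata
print(f"{md.num_row_groups} row groups")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # Compressed size is tracked per column chunk, so sum across columns.
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(f"rg {i}: {rg.num_rows} rows, {compressed / 1024:.0f} kB compressed, "
          f"{rg.total_byte_size / 1024:.0f} kB uncompressed")
```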


ritchie46 commented May 4, 2024

Though that recommendation doesn't necessarily mean better performance, I think. 512MB is very large, and we could do a lot more in parallel if we shrink the sizes.

Currently, our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size. Maybe we could, but I do think we should favor relatively smaller groups. (But I agree, not this small.)


jaychia commented May 5, 2024

> Though that recommendation doesn't necessarily mean better performance, I think. 512MB is very large, and we could do a lot more in parallel if we shrink the sizes.
>
> Currently, our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size. Maybe we could, but I do think we should favor relatively smaller groups. (But I agree, not this small.)

Yup, I agree that 512MB as a blanket statement is probably not the best idea. Those were suggestions based on HDFS, I believe, but for modern workloads we should probably optimize for something like AWS S3/object storage.

> our row-groups are fixed in row-count and we don't dynamically try to hit a certain output size... I do think we should favor relatively smaller groups

Do you have an idea of how small you'd prefer here? For AWS S3, typical sizes for byte-range requests are 8 MB or 16 MB, which could be a good target to build around. Perhaps a good guideline would be to aim for column chunks of about 16 MB (~2M rows for int64 columns), so rowgroup sizes would be roughly 16 MB * N_COLS. This is close to PyArrow's default, which is about 1M rows per rowgroup.

WDYT?
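The back-of-envelope arithmetic behind that ~2M-row figure, as a quick sketch:

```python
# Target ~16 MB per column chunk; int64 values are 8 bytes each.
TARGET_CHUNK_BYTES = 16 * 1024 * 1024
rows_per_rowgroup = TARGET_CHUNK_BYTES // 8  # 2_097_152, i.e. ~2M rows
# The total rowgroup size then scales with column count: ~16 MB * N_COLS.
print(rows_per_rowgroup)
```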

ritchie46 commented

I think I will try to hit a row count rather than a row-group size (defaulting to 512^2). There was an issue in Polars that allowed very small splits (a few rows) to be written. Will fix it.
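In the meantime, the eager writer's row-count knob can also be set explicitly; a sketch (the input path is a placeholder):

```python
import polars as pl

df = pl.read_parquet("lineitem.parquet")  # placeholder input
# 512**2 = 262_144 rows per rowgroup
df.write_parquet("out.parquet", row_group_size=512**2)
```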


jaychia commented May 6, 2024

> I think I will try to hit a row count rather than a row-group size (defaulting to 512^2). There was an issue in Polars that allowed very small splits (a few rows) to be written. Will fix it.

That sounds great and will be a big improvement vs the current behavior of just a few thousand rows! Thanks for the quick responses 👏 👏 👏

Any thoughts on why 512^2 vs something like PyArrow's 1024^2? I'm curious if you've done any benchmarking demonstrating performance benefits of having smaller rowgroups, either with the Polars reader or otherwise (or on local disk vs S3).
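A crude way to run that comparison locally would be something like this (paths and sizes are placeholders, and a fair version should also test against object storage):

```python
import time
import polars as pl

df = pl.read_parquet("lineitem.parquet")  # placeholder input
for rows in (512**2, 1024**2):
    path = f"/tmp/lineitem_rg_{rows}.parquet"
    df.write_parquet(path, row_group_size=rows)
    start = time.perf_counter()
    pl.read_parquet(path)  # full-scan read back
    print(f"{rows} rows/group: read in {time.perf_counter() - start:.3f}s")
```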
