Is ZSTD compression level 22 used for producing the Parquet files? #531
Replies: 4 comments
-
For historical reasons, that code isn't public, which is something we're looking to fix. It's using Spark's defaults for zstd. From a quick search, the zstd compression level used to write out the Parquet might not be configurable in Spark.
-
February's release is 474 GB versus January's 471 GB. I can only see another ~5 GB of US addresses added to February's release, so I suspect everything is being compressed at level 3 and not 22. Are there plans for level 22 compression in March's release?
-
Thanks for the feedback, Mark Litwintschik (@marklit). My original response is still true. We're just using Spark/Sedona to write the dataset out, which (as far as I can tell) does not allow us to adjust the zstd compression level, so we're stuck with the default for now. If I'm wrong about that, let me know. I do think we may want to evaluate the additional compute/time cost of using a higher compression level before just going with the highest.
-
Understood. The finished Parquet could be run through DuckDB to re-compress at level 22. I'm not sure what sort of CPUs you're running everything on, but on my 14900K I can barely see the wall clock time difference between level 3 and 22.
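A rough sketch of what that re-compression pass might look like, assuming a DuckDB build whose Parquet writer accepts the `COMPRESSION_LEVEL` option alongside `COMPRESSION ZSTD`. The helper name `recompress_sql` and the file names are made up for illustration, and this has not been tested against the files discussed in this thread.

```python
def recompress_sql(src: str, dst: str, level: int = 22) -> str:
    """Build a DuckDB COPY statement that rewrites `src` as ZSTD-compressed Parquet.

    `src` and `dst` are hypothetical paths; `level` is the zstd level (1 to 22).
    """
    if not 1 <= level <= 22:
        raise ValueError("zstd level must be between 1 and 22")
    # Double any single quotes so the paths are safe inside SQL string literals.
    src_q = src.replace("'", "''")
    dst_q = dst.replace("'", "''")
    return (
        f"COPY (SELECT * FROM read_parquet('{src_q}')) "
        f"TO '{dst_q}' "
        f"(FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL {level})"
    )


# Print the statement so it can be inspected before running it.
print(recompress_sql("input.parquet", "output.parquet"))
```

The resulting string could be pasted into the DuckDB CLI or passed to `duckdb.sql(...)` from the Python package. Running it once at level 3 and once at level 22 on the same input would give a direct size and wall clock comparison.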
-
I'm not sure if there is a public repo of the workflows used to produce the final parquet files. Are you using level 22 compression on the final Parquet files that are published to S3?
Level 3 is the default in the ZSTD C lib most tools use, but overriding it to level 22 can turn 88 MB of data into 70 MB.
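To put that example in relative terms, here is the arithmetic on the two sizes quoted above. The variable names are just for illustration, and the ratio describes that one sample only; other data may compress differently.

```python
before_mb = 88  # size at the default level 3, from the example above
after_mb = 70   # size at level 22, from the example above

saving_pct = (before_mb - after_mb) / before_mb * 100  # share of bytes saved
ratio = before_mb / after_mb                            # level 3 size relative to level 22

print(f"saving: {saving_pct:.1f}%, level 3 output is {ratio:.2f}x the level 22 output")
```

That works out to roughly a fifth fewer bytes for that sample.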