Is ZSTD compression level 22 used for producing the Parquet files? #531

Mark Litwintschik (marklit) · 2024-12-18T10:14:47Z

I'm not sure if there is a public repo of the workflows used to produce the final parquet files. Are you using level 22 compression on the final Parquet files that are published to S3?

Level 3 is the default in the ZSTD C library that most tools use, but overriding it to level 22 can turn 88 MB of data into 70 MB.
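
For scale, the 88 MB to 70 MB figure above works out to roughly a fifth less storage. A minimal sketch of that arithmetic follows; the function name `zstd_savings` is made up for illustration, and the two sizes are the only numbers taken from the post.

```python
def zstd_savings(size_before_mb: float, size_after_mb: float) -> float:
    """Return the fraction of space saved when a file shrinks from one size to another."""
    if size_before_mb <= 0:
        raise ValueError("size_before_mb must be positive")
    return 1.0 - size_after_mb / size_before_mb


# Figures quoted in the post: 88 MB at level 3 versus 70 MB at level 22.
saving = zstd_savings(88, 70)
print(f"{saving:.1%} smaller")  # prints "20.5% smaller"
```

This is one sample only; the ratio for other files will depend on their contents.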

Jacob Wasserman (jwass) · 2024-12-18T11:58:35Z · Collaborator

For historical reasons, that code isn't public, but that's something we're looking to fix.
The Parquet files are written by Spark with code that looks basically like this:

df.write.format("geoparquet")
        .option("geoparquet.version", "1.1.0")
        .option("geoparquet.crs", "")
        .option("geoparquet.covering.geometry", "bbox")
        .option("compression", "zstd")
        .option("parquet.block.size", 16 * 1024 * 1024)
        .save(path)

So it's using Spark's defaults for zstd. From a quick search, the zstd compression level used to write out the Parquet files might not be configurable in Spark.
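
One avenue that may be worth testing: the parquet-hadoop zstd codec reads a Hadoop property named `parquet.compression.codec.zstd.level`. The sketch below builds the same writer options as the snippet above plus that key. Whether the Spark and Sedona versions in use actually forward this key to the codec is an assumption that would need verifying against a written file; the helper name `geoparquet_write_options` is made up for illustration.

```python
# Hadoop property read by the parquet-hadoop zstd codec.
ZSTD_LEVEL_KEY = "parquet.compression.codec.zstd.level"


def geoparquet_write_options(zstd_level: int = 3) -> dict:
    """Build writer options mirroring the snippet above, plus a zstd level."""
    if not 1 <= zstd_level <= 22:
        raise ValueError("zstd_level is expected to be between 1 and 22")
    return {
        "geoparquet.version": "1.1.0",
        "geoparquet.crs": "",
        "geoparquet.covering.geometry": "bbox",
        "compression": "zstd",
        "parquet.block.size": str(16 * 1024 * 1024),
        # Assumption: the writer passes this option through to the Hadoop conf.
        ZSTD_LEVEL_KEY: str(zstd_level),
    }


options = geoparquet_write_options(22)

# Hypothetical usage with an existing PySpark DataFrame `df` and output `path`:
# df.write.format("geoparquet").options(**options).save(path)
```

If the option is not picked up per write, setting `spark.hadoop.parquet.compression.codec.zstd.level` on the Spark session is another thing to try, again as an untested assumption for this pipeline.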

Mark Litwintschik (marklit) · 2025-02-19T18:26:43Z · Author

February's release is 474 GB versus January's 471 GB. I can only see another ~5 GB of US addresses added to February's release, so I suspect everything is being compressed at level 3 and not 22. Are there plans for level 22 compression in March's release?

Jacob Wasserman (jwass) · 2025-02-20T15:23:27Z · Collaborator

Thanks for the feedback, @marklit. My original response is still true. We're just using Spark/Sedona to write the dataset out, which (as far as I can tell) does not allow us to adjust the zstd compression level, so we're stuck with the default for now. If I'm wrong about that, let me know.

I do think we may want to evaluate the additional compute/time cost of using a higher compression level before just going with the highest.
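
A small harness along these lines could be used for that evaluation. It times a compression callable at several levels and records the output size. Because zstd is not in the Python standard library on most versions, `zlib` at levels 1 and 9 is used here purely as a stand-in; the sample data and the `benchmark` helper are made up for illustration, and the numbers it prints say nothing about zstd itself.

```python
import time
import zlib


def benchmark(compress, data: bytes, levels, repeats: int = 3) -> dict:
    """Return {level: (best_seconds, compressed_size)} for each level."""
    results = {}
    for level in levels:
        best = float("inf")
        size = 0
        for _ in range(repeats):
            start = time.perf_counter()
            out = compress(data, level)
            best = min(best, time.perf_counter() - start)
            size = len(out)
        results[level] = (best, size)
    return results


# Synthetic, highly repetitive sample; real data will behave differently.
sample = b"overture,maps,address,12345 Main St\n" * 50_000
results = benchmark(zlib.compress, sample, levels=(1, 9))
for level, (seconds, size) in results.items():
    print(f"level {level}: {seconds * 1000:.1f} ms, {size} bytes")
```

Swapping in a zstd binding's compress function with levels 3 and 22, and running it on a representative slice of the dataset, would give the comparison in question.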

Mark Litwintschik (marklit) · 2025-02-20T15:32:55Z · Author

Understood. The finished Parquet could be run through DuckDB to re-compress at level 22. I'm not sure what sort of CPUs you're running everything on, but on my 14900K I can barely see the wall clock time difference between level 3 and 22.
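
A minimal sketch of that re-compression step, assuming the third-party `duckdb` Python package is installed. The file names and helper names are hypothetical. DuckDB's Parquet `COPY` accepts `COMPRESSION ZSTD` together with `COMPRESSION_LEVEL`; whether the geometry column and the GeoParquet metadata survive the round trip depends on the DuckDB version and on whether the spatial extension is loaded, so that should be checked on a real file first.

```python
def build_recompress_sql(src: str, dst: str, level: int = 22) -> str:
    """Build a DuckDB COPY statement that rewrites a Parquet file with zstd."""
    if not 1 <= level <= 22:
        raise ValueError("level is expected to be between 1 and 22")
    # Escape single quotes so the paths are safe inside SQL string literals.
    src_q = src.replace("'", "''")
    dst_q = dst.replace("'", "''")
    return (
        f"COPY (SELECT * FROM read_parquet('{src_q}')) "
        f"TO '{dst_q}' (FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL {level})"
    )


def recompress(src: str, dst: str, level: int = 22) -> None:
    """Run the COPY statement; imports duckdb lazily since it is third-party."""
    import duckdb

    con = duckdb.connect()
    try:
        con.execute(build_recompress_sql(src, dst, level))
    finally:
        con.close()


# Hypothetical file names, shown only to illustrate the generated statement.
sql = build_recompress_sql("in.parquet", "out.parquet")
print(sql)
```

Calling `recompress("in.parquet", "out.parquet")` would then write the level 22 copy next to the original, leaving the source file untouched.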

Overture Maps
