Update TPCH Gen To Split Files #63

Merged
mwc360 merged 2 commits into main from tpcgen_split_files on Dec 10, 2025

Conversation

mwc360 (Contributor) commented Dec 10, 2025

Stop writing a single file per table, which is the default in TPCHGenCLI. Instead, calculate the optimal number of parts per table so that file sizes match what is realistically found in both Databricks and Fabric. Small tables end up with smaller files, while larger tables get larger files to minimize file metadata overhead.
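
A minimal sketch of how a per-table part count could be derived from a target file size. The thresholds, target sizes, and the function names `target_file_size_mb` and `parts_for_table` are illustrative assumptions, not the values or code used in this PR:

```python
import math


def target_file_size_mb(table_size_mb: float) -> float:
    """Pick a target file size from the table size.

    Hypothetical tiers (not taken from the PR): smaller tables get
    smaller files, larger tables get larger files so that fewer
    files (and less file metadata) are needed.
    """
    if table_size_mb < 1024:          # under 1 GiB
        return 64.0
    if table_size_mb < 10 * 1024:     # under 10 GiB
        return 128.0
    if table_size_mb < 100 * 1024:    # under 100 GiB
        return 256.0
    return 1024.0


def parts_for_table(table_size_mb: float) -> int:
    """Number of parts needed so each file is at most the target size."""
    target = target_file_size_mb(table_size_mb)
    # Always write at least one file, even for tiny or empty tables.
    return max(1, math.ceil(table_size_mb / target))


if __name__ == "__main__":
    # Example table sizes in MiB; these figures are made up for illustration.
    for name, size_mb in {"nation": 0.01, "orders": 512, "lineitem": 2048}.items():
        print(name, parts_for_table(size_mb))
```

Under these assumed tiers, a tiny table still yields one file, while a table in a higher tier is split against a larger target size, so the part count grows more slowly than the table does.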

Add multithreading to improve the performance of dataset generation.
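
One way the parts could be generated concurrently is with a thread pool, sketched below. `generate_part` is a hypothetical stand-in for the real generator call, and the file naming scheme and `max_workers` default are assumptions, not details from this PR:

```python
from concurrent.futures import ThreadPoolExecutor


def generate_part(table: str, part: int, parts: int) -> str:
    """Hypothetical stand-in for generating one part of one table.

    A real implementation would invoke the data generator here; this
    placeholder only returns the file name it would have written.
    """
    return f"{table}/part-{part:05d}-of-{parts:05d}.parquet"


def generate_dataset(parts_per_table: dict[str, int], max_workers: int = 4) -> list[str]:
    """Generate every (table, part) pair on a pool of worker threads."""
    tasks = [
        (table, part, parts)
        for table, parts in parts_per_table.items()
        for part in range(1, parts + 1)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate_part, *task) for task in tasks]
        # result() re-raises any exception from a worker thread.
        return sorted(future.result() for future in futures)


if __name__ == "__main__":
    for path in generate_dataset({"nation": 1, "lineitem": 3}):
        print(path)
```

Threads are a reasonable fit when each task spends most of its time in an external process or in I/O; for CPU-bound pure-Python work, a process pool would be the usual alternative.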

mwc360 merged commit 7cbedc9 into main on Dec 10, 2025
