
Speed-up Parquet data generation #10

Open
wolfgang-desalvador wants to merge 1 commit into mlcommons:main from wolfgang-desalvador:wdesalvador/improve-parquet-data-generation

Conversation

@wolfgang-desalvador

This pull request optimizes the data generation process in parquet_generator.py by reducing redundant function calls and improving batch processing efficiency. The main change is to pre-generate all column data for the entire file before batching, which reduces overhead and leverages zero-copy slicing for batch creation.

Performance optimizations:

  • All column data for the entire file is now generated upfront using either _generate_batch_columns or _generate_legacy_batch, reducing the number of function calls from (num_batches * num_columns) to just num_columns.
  • During batch processing, each batch is now created by slicing the pre-generated full table (full_table.slice(...)), which is more efficient and avoids repeated data generation.

@wolfgang-desalvador
Author

This approach needs to be validated @russfellows @wvaske @FileSystemGuy, since it requires each process to be able to keep the whole 3+ GiB parquet file in memory.
This may call for guidance in the docs, or for a way to use bigger chunks.

@russfellows

This looks like a good change. I will try it out and see if there are any further optimizations that can be made as well. Thanks, Wolfgang.

russfellows added a commit to russfellows/dlio_benchmark that referenced this pull request Apr 10, 2026
… into row groups

Based on mlcommons#10 (Wolfgang De Salvador).
Generate all column data in one pass before the batch loop, then use
pa.Table.slice() (zero-copy in Arrow) to produce each row-group batch.

Reduces generation call overhead from (num_batches × num_columns) to
just num_columns calls. For a file with 10 batches and 5 columns this
is a 10× reduction in gen_random_tensor calls.

Improvement over upstream PR: added explicit memory trade-off comment
and clarified the zero-copy slice semantics.
