
Speed-up Parquet data generation #10

Open
wolfgang-desalvador wants to merge 1 commit into mlcommons:main from wolfgang-desalvador:wdesalvador/improve-parquet-data-generation

Conversation

@wolfgang-desalvador

This pull request optimizes the data generation process in parquet_generator.py by reducing redundant function calls and improving batch processing efficiency. The main change is to pre-generate all column data for the entire file before batching, which reduces overhead and leverages zero-copy slicing for batch creation.

Performance optimizations:

  • All column data for the entire file is now generated upfront using either _generate_batch_columns or _generate_legacy_batch, reducing the number of function calls from (num_batches * num_columns) to just num_columns.
  • During batch processing, each batch is now created by slicing the pre-generated full table (full_table.slice(...)), which is more efficient and avoids repeated data generation.

@wolfgang-desalvador
Author

This approach needs to be validated @russfellows @wvaske @FileSystemGuy, since it requires each process to be able to keep the whole 3+ GiB parquet file in memory.
This may call for guidance in the docs, or for a way to use bigger chunks.

@russfellows

This looks like a good change. I will try it out and see if there are any further optimizations that can be made as well. Thanks, Wolfgang.

russfellows added a commit to russfellows/dlio_benchmark that referenced this pull request Apr 10, 2026
… into row groups

Based on mlcommons#10 (Wolfgang De Salvador).
Generate all column data in one pass before the batch loop, then use
pa.Table.slice() (zero-copy in Arrow) to produce each row-group batch.

Reduces generation call overhead from (num_batches × num_columns) to
just num_columns calls. For a file with 10 batches and 5 columns this
is a 10× reduction in gen_random_tensor calls.

Improvement over upstream PR: added explicit memory trade-off comment
and clarified the zero-copy slice semantics.
