[Training v3.0 consolidation] DLRM optimization: evaluate Parquet reader and generation parameters to improve throughput #354

## Summary
This issue proposes a focused investigation into the Parquet-based data path used by the DLRM workload, with the goal of improving effective read throughput while keeping Parquet as the on-disk format.

## Motivation
DLRM currently relies on Parquet files for sample storage. Preliminary observations suggest that throughput per accelerator may be limited by reader-side overhead (decoding, batching, threading) and/or by suboptimal file generation parameters (row group size, compression, page size, file count/size distribution), rather than by raw storage bandwidth.
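To make the generation knobs concrete, here is a minimal sketch of how they could be grouped into one config object. The class name `ParquetGenParams`, the `num_files` field, and all default values are hypothetical placeholders, not settings taken from `mlpstorage`. The three keyword names returned by `write_table_kwargs()` are existing arguments of `pyarrow.parquet.write_table`; this assumes the datagen path writes through pyarrow, which the issue does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetGenParams:
    """Hypothetical container for the file generation knobs listed above."""

    row_group_size: int = 1_000_000    # rows per row group (placeholder value)
    compression: str = "SNAPPY"        # codec name (placeholder value)
    data_page_size: int = 1024 * 1024  # target bytes per data page (placeholder)
    num_files: int = 64                # file count; handled by the generator,
                                       # not passed to pyarrow

    def write_table_kwargs(self) -> dict:
        """Return only the fields that map to pyarrow.parquet.write_table."""
        return {
            "row_group_size": self.row_group_size,
            "compression": self.compression,
            "data_page_size": self.data_page_size,
        }


baseline = ParquetGenParams()

# With pyarrow installed, one file could then be written as:
#   import pyarrow.parquet as pq
#   pq.write_table(table, "part-0000.parquet", **baseline.write_table_kwargs())
```

Keeping the parameters in a frozen dataclass means each run's settings can be logged verbatim next to its results.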

Preserving Parquet as the format is desirable for compatibility and ecosystem alignment, so the goal is to quantify how far the current pipeline can be pushed by tuning parameters before considering format-level alternatives.

## Proposed methodology
1. Define a baseline DLRM dataset and a fixed accelerator/host configuration.
2. Sweep one parameter at a time while holding the others at their baseline values, then perform a small combined sweep on the most promising candidates.
3. For each run, record:
   - Achieved throughput (GB/s, samples/s) per accelerator
   - CPU utilization (overall and per worker)
   - Storage-side metrics (read IOPS, average request size)
   - Reader latency distribution
4. Compare against the DLRM target throughput of ≥15 GB/s per accelerator. Reference compute times are 0.00038 s for GB200 and 0.00056 s for MI300X (see #353).
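The sweep in step 2 could be enumerated along these lines. This is only a sketch: the function names, the grid values, and the choice of which two parameters count as "promising" are illustrative assumptions, not results or recommendations.

```python
from itertools import product


def one_at_a_time(baseline: dict, grid: dict) -> list:
    """Baseline run plus runs that change exactly one parameter each."""
    runs = [dict(baseline)]
    for name, values in grid.items():
        for value in values:
            if value != baseline[name]:
                runs.append({**baseline, name: value})
    return runs


def combined(baseline: dict, grid: dict, names: list) -> list:
    """Full factorial over a short list of parameters; others stay at baseline."""
    axes = [grid[name] for name in names]
    return [{**baseline, **dict(zip(names, combo))} for combo in product(*axes)]


# Placeholder values, chosen only to show the shape of the sweep.
baseline = {
    "row_group_size": 1_000_000,
    "compression": "SNAPPY",
    "data_page_size": 1 << 20,
}
grid = {
    "row_group_size": [100_000, 1_000_000, 10_000_000],
    "compression": ["NONE", "SNAPPY", "ZSTD"],
    "data_page_size": [64 << 10, 1 << 20, 8 << 20],
}

ofat_runs = one_at_a_time(baseline, grid)
combo_runs = combined(baseline, grid, ["row_group_size", "compression"])
```

With three values per parameter, the one-at-a-time pass costs seven runs and the two-parameter combined pass costs nine, whereas a full factorial over all three would need 27.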

## Success criteria
- A documented set of Parquet generation and reader parameters that meet or exceed the DLRM target throughput on representative storage backends.
- A short report quantifying the contribution of each parameter to overall throughput and CPU cost.
- Recommendations for default values to ship in `mlpstorage` for DLRM datagen and training.
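One possible shape for the per-parameter report is sketched below. `RunResult`, `report`, and the field names are hypothetical, 1 GB is taken as 10^9 bytes, and the two example runs use made-up numbers purely to exercise the code; they are not measurements.

```python
from dataclasses import dataclass

TARGET_GBPS = 15.0  # per-accelerator target quoted in this issue


@dataclass
class RunResult:
    label: str
    bytes_read: int       # total bytes delivered to the training loop
    wall_seconds: float   # elapsed time of the measured window
    cpu_seconds: float    # CPU time consumed by reader workers
    accelerators: int = 1

    @property
    def gbps_per_accelerator(self) -> float:
        return self.bytes_read / self.wall_seconds / 1e9 / self.accelerators

    @property
    def cpu_seconds_per_gb(self) -> float:
        return self.cpu_seconds / (self.bytes_read / 1e9)


def report(baseline: RunResult, runs: list) -> list:
    """Summarize each run relative to the baseline run."""
    rows = []
    for run in runs:
        rows.append({
            "label": run.label,
            "gbps": run.gbps_per_accelerator,
            "speedup": run.gbps_per_accelerator / baseline.gbps_per_accelerator,
            "cpu_s_per_gb": run.cpu_seconds_per_gb,
            "meets_target": run.gbps_per_accelerator >= TARGET_GBPS,
        })
    return rows


# Made-up inputs for illustration only.
base = RunResult("baseline", 100_000_000_000, 10.0, 400.0)
cand = RunResult("compression=ZSTD", 100_000_000_000, 6.25, 300.0)
rows = report(base, [base, cand])
```

Reporting CPU seconds per GB alongside the speedup keeps the throughput gain and the CPU cost of each parameter visible in the same row.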

## Related
- #333 (Evaluate Arrow IPC format as an alternative reader format)
- #353 (Confirm computation time step for DLRM)
