[Training v3.0 consolidation] DLRM optimization: evaluate Parquet reader and generation parameters to improve throughput #354

@wolfgang-desalvador

Summary

This issue proposes a focused investigation into the Parquet-based data path used by the DLRM workload, with the goal of improving effective read throughput while keeping Parquet as the on-disk format.

Motivation

DLRM currently relies on Parquet files for sample storage. Preliminary observations suggest that throughput per accelerator may be limited by reader-side overhead (decoding, batching, threading) and/or by suboptimal file generation parameters (row group size, compression, page size, file count/size distribution), rather than by raw storage bandwidth.

Preserving Parquet as the format is desirable for compatibility and ecosystem alignment, so the goal is to quantify how far the current pipeline can be pushed by tuning parameters before considering format-level alternatives.
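As a rough illustration of where these knobs live, the sketch below maps the generation and reader parameters named above onto `pyarrow` calls. It assumes `pyarrow` is the library in use, which this issue does not state. The `ParquetKnobs` class, its default values, and `write_then_count_rows` are hypothetical names chosen for the example and are not part of mlpstorage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetKnobs:
    """Hypothetical bundle of tunables; the defaults are placeholders, not recommendations."""
    row_group_size: int = 65_536    # rows per row group
    compression: str = "zstd"       # e.g. "none", "snappy", "zstd"
    data_page_size: int = 1 << 20   # target data page size in bytes
    read_batch_size: int = 8_192    # rows per batch on the reader side
    use_threads: bool = True        # let the reader decode columns in parallel

    def writer_kwargs(self) -> dict:
        """Keyword arguments accepted by pyarrow.parquet.write_table."""
        return {
            "row_group_size": self.row_group_size,
            "compression": self.compression,
            "data_page_size": self.data_page_size,
        }


def write_then_count_rows(path: str, columns: dict, knobs: ParquetKnobs) -> int:
    """Write `columns` to `path` with the given knobs, then stream it back in batches."""
    # Imported lazily so the knob definitions above stay usable without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(columns), path, **knobs.writer_kwargs())
    reader = pq.ParquetFile(path)
    batches = reader.iter_batches(
        batch_size=knobs.read_batch_size, use_threads=knobs.use_threads
    )
    return sum(batch.num_rows for batch in batches)
```

File count and size distribution are not covered here; they would be controlled by how many rows the generator places in each file.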

Proposed methodology

  1. Define a baseline DLRM dataset and a fixed accelerator/host configuration.
  2. Sweep one parameter at a time while holding the others at their baseline values, then perform a small combined sweep on the most promising candidates.
  3. For each run, record:
    • Achieved throughput (GB/s, samples/s) per accelerator
    • CPU utilization (overall and per worker)
    • Storage-side metrics (read IOPS, average request size)
    • Reader latency distribution
  4. Compare against the DLRM target throughput (≥15 GB/s per accelerator; reference compute times 0.00038 s for GB200 and 0.00056 s for MI300X — see [Training v3.0 consolidation] Confirm computation time step for DLRM #353).
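
The methodology above could be sketched as follows: build the one-parameter-at-a-time run list from a baseline, then summarize each run and compare it with the 15 GB/s per-accelerator target. The parameter names and values in `BASELINE` and `CANDIDATES` are illustrative assumptions, not proposed defaults, and GB is taken to mean 10^9 bytes. Both `one_at_a_time` and `summarize` are hypothetical helpers.

```python
from statistics import quantiles

TARGET_GB_PER_S = 15.0  # per-accelerator target quoted in step 4

# Illustrative values only; the real baseline is what step 1 defines.
BASELINE = {
    "row_group_size": 65_536,
    "compression": "snappy",
    "reader_threads": 4,
}
CANDIDATES = {
    "row_group_size": [16_384, 262_144],
    "compression": ["none", "zstd"],
    "reader_threads": [8, 16],
}


def one_at_a_time(baseline: dict, candidates: dict) -> list:
    """Return the baseline plus one config per candidate value, varying a single key."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


def summarize(bytes_read: float, samples_read: int, elapsed_s: float,
              latencies_s: list) -> dict:
    """Reduce one run to the throughput and reader-latency figures listed in step 3."""
    cuts = quantiles(latencies_s, n=100)  # 99 cut points; needs at least 2 samples
    gb_per_s = bytes_read / elapsed_s / 1e9
    return {
        "gb_per_s": gb_per_s,
        "samples_per_s": samples_read / elapsed_s,
        "p50_latency_s": cuts[49],
        "p99_latency_s": cuts[98],
        "meets_target": gb_per_s >= TARGET_GB_PER_S,
    }
```

CPU utilization and storage-side metrics would come from external tooling and are left out of this sketch. The combined sweep in step 2 could reuse `one_at_a_time` with the best single-parameter result as the new baseline.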

Success criteria

  • A documented set of Parquet generation and reader parameters that meet or exceed the DLRM target throughput on representative storage backends.
  • A short report quantifying the contribution of each parameter to overall throughput and CPU cost.
  • Recommendations for default values to ship in mlpstorage for DLRM datagen and training.

Related
