Summary
This issue proposes a focused investigation into the Parquet-based data path used by the DLRM workload, with the goal of improving effective read throughput while keeping Parquet as the on-disk format.
Motivation
DLRM currently relies on Parquet files for sample storage. Preliminary observations suggest that throughput per accelerator may be limited by reader-side overhead (decoding, batching, threading) and/or by suboptimal file generation parameters (row group size, compression, page size, file count/size distribution), rather than by raw storage bandwidth.
Preserving Parquet as the format is desirable for compatibility and ecosystem alignment, so the goal is to quantify how far the current pipeline can be pushed by tuning parameters before considering format-level alternatives.
Proposed methodology
- Define a baseline DLRM dataset and a fixed accelerator/host configuration.
- Sweep one parameter at a time, then perform a small combined sweep on the most promising candidates.
- For each run, record:
  - Achieved throughput (GB/s, samples/s) per accelerator
  - CPU utilization (overall and per worker)
  - Storage-side metrics (read IOPS, average request size)
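The sweep bookkeeping above can be sketched in a few lines of plain Python. The parameter grid, the baseline values, and the `RunResult` record are illustrative placeholders, not part of the existing benchmark tooling:

```python
# Sketch of the one-at-a-time sweep described above: each config varies
# a single parameter from the baseline; measured metrics are collected
# into a RunResult. All names and values here are illustrative.
from dataclasses import dataclass

BASELINE = {"row_group_rows": 4096, "compression": "snappy", "page_kb": 1024}
SWEEP = {
    "row_group_rows": [1024, 4096, 16384],
    "compression": ["none", "snappy", "zstd"],
    "page_kb": [64, 256, 1024],
}

@dataclass
class RunResult:
    params: dict
    gbps: float = 0.0           # achieved read throughput per accelerator
    samples_per_s: float = 0.0
    cpu_util: float = 0.0       # overall host CPU utilization
    read_iops: float = 0.0      # storage-side metric
    avg_request_kb: float = 0.0

def one_at_a_time(baseline, sweep):
    """Yield configs that vary exactly one parameter from the baseline."""
    for name, values in sweep.items():
        for v in values:
            cfg = dict(baseline)
            cfg[name] = v
            yield cfg

configs = list(one_at_a_time(BASELINE, SWEEP))
```

A follow-up combined sweep on the most promising candidates would replace `one_at_a_time` with an `itertools.product` over the two or three parameters that moved the needle.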
Success criteria
`mlpstorage` for DLRM datagen and training.
Related