
mlpstorage training run --model=flux has considerably lower I/O Throughput causing train_au_meet_expectation to fail #330

@ddn-kums

Description

Hi,

The mlpstorage training run --model=flux --accelerator-type b200 .. job (even with a single accelerator) shows very low I/O throughput of 0.8 MB/s, resulting in train_au_meet_expectation: fail.

The same low throughput for mlpstorage training run --model=flux --accelerator-type b200 .. has been observed on two high-performance file systems, both built exclusively from NVMe SSDs.

Performance profiling during the flux training run shows most of the time being spent in PyUnicode_FromFormatV and parquet routines, with all 8 pt_data_worker processes 100% CPU-busy but issuing minimal I/O to the underlying storage systems hosting the training parquet files.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                            
 371212 nodeadm+  20   0 6769292 963864  32760 R 101.6   0.4 250:37.83 pt_data_worker                                                                                                                                                     
 371210 nodeadm+  20   0 6769268 962760  33144 R 101.3   0.4 251:53.42 pt_data_worker                                                                                                                                                     
 371211 nodeadm+  20   0 6834544 995.6m  33928 R 101.3   0.4 251:40.72 pt_data_worker                                                                                                                                                     
 371213 nodeadm+  20   0 6769304 965116  33136 R 101.3   0.4 251:20.06 pt_data_worker                                                                                                                                                     
 371206 nodeadm+  20   0 6990404   1.0g  33144 S 101.0   0.4 251:41.42 pt_data_worker                                                                                                                                                     
 371207 nodeadm+  20   0 6834496   1.0g  33928 R 101.0   0.4 252:42.40 pt_data_worker                                                                                                                                                     
 371208 nodeadm+  20   0 6834508   1.0g  33520 R 101.0   0.4 250:56.91 pt_data_worker                                                                                                                                                     
 371209 nodeadm+  20   0 6802672 991216  33136 R 101.0   0.4 250:17.03 pt_data_worker
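
The CPU-bound behavior can be confirmed with an in-process profiler. Below is a minimal stdlib-only sketch of that kind of check; the workload is a stand-in, not the actual DLIO/pt_data_worker code path (per-sample string formatting is one CPython path that surfaces as PyUnicode_FromFormatV in a native-stack profile):

```python
# cProfile sketch (stdlib only) of the kind of check used to confirm the
# workers are CPU-bound in Python-level decode work rather than in I/O.
# decode_samples() is a STAND-IN workload, not DLIO code.
import cProfile
import io
import pstats

def decode_samples(n: int) -> list[str]:
    # Stand-in for per-sample CPU work in a data worker: per-record
    # string formatting, which keeps the interpreter busy without I/O.
    return ["sample_%d_of_%d" % (i, n) for i in range(n)]

prof = cProfile.Profile()
prof.enable()
samples = decode_samples(200_000)
prof.disable()

buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(3)
report = buf.getvalue()
print(report.splitlines()[0].strip())  # "... function calls in ... seconds"
```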

Details

- Generate the --model=flux dataset
$ mlpstorage training datagen --hosts=srt017-e0 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --num-processes=1 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-10 21:54:34|INFO: Environment validation passed
2026-04-10 21:54:34|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434
2026-04-10 21:54:34|INFO: Creating data directory: /mnt/redfs/mlstorage_dd/flux_b200/flux...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/train...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/valid...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/test...
2026-04-10 21:54:35|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=flux_datagen ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=2126 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/flux_b200/flux --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T21:54:41.987119 Running DLIO [Generating data] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================

- Verify the generated dataset

$ ls -1 *.parquet | wc -l
2126

$ ls -lh *.parquet | tail -5
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2121_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2122_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2123_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2124_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2125_of_2126.parquet

$ du -sh *.parquet | tail -5
17M	img_2121_of_2126.parquet
17M	img_2122_of_2126.parquet
17M	img_2123_of_2126.parquet
17M	img_2124_of_2126.parquet
17M	img_2125_of_2126.parquet
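
A sanity check on the generated dataset (my own arithmetic, not tool output): one epoch of 12756 steps at batch_size 48 over 2126 files means each parquet file holds exactly 288 samples, and 288 × record_length (65536 B) is about 18 MiB of raw payload per file, roughly consistent with the 17M files on disk:

```python
# Dataset sanity arithmetic from the run's own numbers (not tool output).
STEPS, BATCH, FILES = 12756, 48, 2126   # from the run output / command line
RECORD_LENGTH = 65536                   # bytes, from the dumped config

samples_per_epoch = STEPS * BATCH       # 612288 samples in one epoch
samples_per_file = samples_per_epoch // FILES
raw_mib_per_file = samples_per_file * RECORD_LENGTH / 2**20

print(samples_per_file)                 # 288 samples per file, exactly
print(round(raw_mib_per_file, 1))       # 18.0 MiB raw, vs ~17M on disk
```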

- File System 1 - Parallel File System across 72 x NVMe drives - Training I/O Throughput (MB/second): 0.8269

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-10 22:15:30|INFO: Environment validation passed
2026-04-10 22:15:30|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
2026-04-10 22:15:30|INFO: Created benchmark run: training_run_flux_20260410_221529
2026-04-10 22:15:30|STATUS: Verifying benchmark run for training_run_flux_20260410_221529
..
..
[OUTPUT] 2026-04-11T11:07:18.208178 Ending block 1 - 12756 steps completed in 46301.55 s
[OUTPUT] 2026-04-11T11:07:18.219430 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 37.2174
[OUTPUT] 2026-04-11T11:07:18.219543 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.2307
[OUTPUT] 2026-04-11T11:07:18.219621 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-11T11:07:18.224527 Ending epoch 1 - 12756 steps completed in 46301.56 s
[OUTPUT] 2026-04-11T11:07:18.935511 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 37.2174 (0.0000)
[METRIC] Training Throughput (samples/second): 13.2307 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8269 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================
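
The reported AU is internally consistent with the other metrics (my arithmetic, approximate): with a simulated compute time of 1.3501 s per step and 13.2307 samples/s at batch_size 48, the compute fraction of wall time per step comes out at roughly 37.2%:

```python
# Rough AU cross-check from the reported metrics (my arithmetic, approximate).
COMPUTE_S = 1.3501        # simulated computation time per step (set value)
SAMPLES_PER_S = 13.2307   # reported training throughput
BATCH = 48                # batch_size from the dumped config

step_wall_s = BATCH / SAMPLES_PER_S     # wall time per step, ~3.63 s
au_pct = 100 * COMPUTE_S / step_wall_s  # compute fraction of wall time
print(round(au_pct, 1))                 # ~37.2, matching the reported AU
```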

[OUTPUT] 2026-04-11T11:07:18.980509 outputs saved in RANKID_output.json
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/mnt/redfs/mlstorage_dd/flux_b200/flux'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 2126
  record_length  = 65536
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 48
2026-04-11 11:07:26|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529/training_20260410_221529_metadata.json

- File System 2 - Local File System (zfs) across 12 x NVMe drives - Training I/O Throughput (MB/second): 0.8598

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/zfs-fs1/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-11 12:47:38|INFO: Environment validation passed
2026-04-11 12:47:38|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
2026-04-11 12:47:39|INFO: Created benchmark run: training_run_flux_20260411_124738
2026-04-11 12:47:39|STATUS: Verifying benchmark run for training_run_flux_20260411_124738
..
..
[OUTPUT] 2026-04-12T01:09:54.675389 Ending block 1 - 12756 steps completed in 44529.72 s
[OUTPUT] 2026-04-12T01:09:54.683906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 38.6982
[OUTPUT] 2026-04-12T01:09:54.684024 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.7572
[OUTPUT] 2026-04-12T01:09:54.684086 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-12T01:09:54.688416 Ending epoch 1 - 12756 steps completed in 44529.73 s
[OUTPUT] 2026-04-12T01:09:55.396901 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 38.6982 (0.0000)
[METRIC] Training Throughput (samples/second): 13.7572 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8598 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-04-12T01:09:55.440256 outputs saved in RANKID_output.json
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/zfs-fs1/mlstorage_dd/flux_b200/flux'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 2126
  record_length  = 65536
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 48
2026-04-12 01:10:02|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738/training_20260411_124738_metadata.json
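
The key observation (my own arithmetic, not tool output): on both file systems the reported I/O throughput is exactly samples/second × record_length, i.e. the benchmark reads only 64 KiB per sample, so the metric is gated entirely by the CPU-bound sample rate of the data workers, not by the storage:

```python
# I/O throughput implied by the sample rate (my arithmetic, not tool output).
# Note: the tool labels the metric "MB/second" but the values match MiB/s.
RECORD_LENGTH = 65536  # bytes per sample, from the dumped config

def implied_io_mib_s(samples_per_s: float) -> float:
    """I/O throughput (MiB/s) implied by the reported sample rate."""
    return round(samples_per_s * RECORD_LENGTH / 2**20, 4)

print(implied_io_mib_s(13.2307))  # 0.8269 -- matches File System 1 exactly
print(implied_io_mib_s(13.7572))  # 0.8598 -- matches File System 2 exactly
```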

Labels: bug