Hi,
The mlpstorage training run --model=flux --accelerator-type b200 .. job (even using a single accelerator) shows considerably low I/O throughput of ~0.8 MB/s, resulting in train_au_meet_expectation: fail.
This low throughput has been observed on two high-performance file systems, both built exclusively from NVMe SSD drives.
Performance profiling during the flux training run shows most of the time being spent in PyUnicode_FromFormatV and parquet routines, with all 8 pt_data_worker processes 100% CPU-busy but issuing MINIMAL I/O to the underlying storage systems hosting the training parquet files.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
371212 nodeadm+ 20 0 6769292 963864 32760 R 101.6 0.4 250:37.83 pt_data_worker
371210 nodeadm+ 20 0 6769268 962760 33144 R 101.3 0.4 251:53.42 pt_data_worker
371211 nodeadm+ 20 0 6834544 995.6m 33928 R 101.3 0.4 251:40.72 pt_data_worker
371213 nodeadm+ 20 0 6769304 965116 33136 R 101.3 0.4 251:20.06 pt_data_worker
371206 nodeadm+ 20 0 6990404 1.0g 33144 S 101.0 0.4 251:41.42 pt_data_worker
371207 nodeadm+ 20 0 6834496 1.0g 33928 R 101.0 0.4 252:42.40 pt_data_worker
371208 nodeadm+ 20 0 6834508 1.0g 33520 R 101.0 0.4 250:56.91 pt_data_worker
371209 nodeadm+ 20 0 6802672 991216 33136 R 101.0 0.4 250:17.03 pt_data_worker
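A back-of-envelope check (illustrative only, using the throughput and record_length figures reported in the runs below) supports the CPU-bottleneck reading of this profile: 8 fully busy workers delivering ~13.2 samples/s of 64 KiB records means each record costs roughly 0.6 CPU-seconds to produce, far more than any NVMe read latency.

```python
# Illustration only, not part of mlpstorage. Numbers are the reported
# Training Throughput and record_length from the runs below.
workers = 8                 # pt_data_worker processes, all ~100% CPU
samples_per_sec = 13.2307   # reported Training Throughput (samples/second)
record_bytes = 65536        # reported record_length

# Aggregate I/O rate implied by the sample rate.
io_mb_per_sec = samples_per_sec * record_bytes / 2**20
# CPU-seconds burned per sample across the 8 busy workers.
cpu_sec_per_sample = workers / samples_per_sec

print(f"I/O throughput: {io_mb_per_sec:.4f} MB/s")        # ~0.83 MB/s
print(f"CPU cost per sample: {cpu_sec_per_sample:.2f} s")  # ~0.60 s
```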
Details
- Generate the --model=flux dataset
$ mlpstorage training datagen --hosts=srt017-e0 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --num-processes=1 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-10 21:54:34|INFO: Environment validation passed
2026-04-10 21:54:34|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434
2026-04-10 21:54:34|INFO: Creating data directory: /mnt/redfs/mlstorage_dd/flux_b200/flux...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/train...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/valid...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/test...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-04-10 21:54:35|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=flux_datagen ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=2126 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/flux_b200/flux --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T21:54:41.987119 Running DLIO [Generating data] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT] dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
- Verify the generated dataset
$ ls -1 *.parquet | wc -l
2126
$ ls -lh *.parquet | tail -5
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2121_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2122_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2123_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2124_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2125_of_2126.parquet
$ du -sh *.parquet | tail -5
17M img_2121_of_2126.parquet
17M img_2122_of_2126.parquet
17M img_2123_of_2126.parquet
17M img_2124_of_2126.parquet
17M img_2125_of_2126.parquet
- File System 1 - Parallel File System across 72 x NVMe drives - Training I/O Throughput (MB/second): 0.8269
$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-10 22:15:30|INFO: Environment validation passed
2026-04-10 22:15:30|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
2026-04-10 22:15:30|INFO: Created benchmark run: training_run_flux_20260410_221529
2026-04-10 22:15:30|STATUS: Verifying benchmark run for training_run_flux_20260410_221529
..
..
⠙ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:04:33
[OUTPUT] 2026-04-11T11:07:18.208178 Ending block 1 - 12756 steps completed in 46301.55 s
[OUTPUT] 2026-04-11T11:07:18.219430 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 37.2174
[OUTPUT] 2026-04-11T11:07:18.219543 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.2307
[OUTPUT] 2026-04-11T11:07:18.219621 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-11T11:07:18.224527 Ending epoch 1 - 12756 steps completed in 46301.56 s
[OUTPUT] 2026-04-11T11:07:18.935511 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 37.2174 (0.0000)
[METRIC] Training Throughput (samples/second): 13.2307 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8269 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-04-11T11:07:18.980509 outputs saved in RANKID_output.json
storage_type = <StorageType.LOCAL_FS: 'local_fs'>
storage_root = './'
storage_options= None
data_folder = '/mnt/redfs/mlstorage_dd/flux_b200/flux'
framework = <FrameworkType.PYTORCH: 'pytorch'>
num_files_train= 2126
record_length = 65536
generate_data = False
do_train = True
do_checkpoint = False
epochs = 1
batch_size = 48
2026-04-11 11:07:26|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529/training_20260410_221529_metadata.json
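As a sanity check (illustrative only), the File System 1 metrics above are internally consistent: the 0.83 MB/s I/O figure follows directly from the sample throughput and the 64 KiB record_length, and the failing ~37% AU follows from the 1.35 s set compute time versus the ~3.63 s actual wall time per step.

```python
# Cross-check of the reported FS1 metrics, using only values from the log above.
steps, elapsed = 12756, 46301.55   # steps and block wall time (s)
batch_size, record_bytes = 48, 65536
compute_per_step = 1.3501          # measured computation time per step (s)

samples_per_sec = steps * batch_size / elapsed           # ~13.22
au_pct = 100 * compute_per_step / (elapsed / steps)      # ~37.2
io_mb_per_sec = samples_per_sec * record_bytes / 2**20   # ~0.83

print(samples_per_sec, au_pct, io_mb_per_sec)
```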
- File System 2 - Local File System (zfs) across 12 x NVMe drives - Training I/O Throughput (MB/second): 0.8598
$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/zfs-fs1/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-11 12:47:38|INFO: Environment validation passed
2026-04-11 12:47:38|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
2026-04-11 12:47:39|INFO: Created benchmark run: training_run_flux_20260411_124738
2026-04-11 12:47:39|STATUS: Verifying benchmark run for training_run_flux_20260411_124738
..
..
[OUTPUT] 2026-04-12T01:09:54.675389 Ending block 1 - 12756 steps completed in 44529.72 s
[OUTPUT] 2026-04-12T01:09:54.683906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 38.6982
[OUTPUT] 2026-04-12T01:09:54.684024 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.7572
[OUTPUT] 2026-04-12T01:09:54.684086 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-12T01:09:54.688416 Ending epoch 1 - 12756 steps completed in 44529.73 s
[OUTPUT] 2026-04-12T01:09:55.396901 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 38.6982 (0.0000)
[METRIC] Training Throughput (samples/second): 13.7572 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8598 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-04-12T01:09:55.440256 outputs saved in RANKID_output.json
storage_type = <StorageType.LOCAL_FS: 'local_fs'>
storage_root = './'
storage_options= None
data_folder = '/zfs-fs1/mlstorage_dd/flux_b200/flux'
framework = <FrameworkType.PYTORCH: 'pytorch'>
num_files_train= 2126
record_length = 65536
generate_data = False
do_train = True
do_checkpoint = False
epochs = 1
batch_size = 48
2026-04-12 01:10:02|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738/training_20260411_124738_metadata.json
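For context, here is a hypothetical estimate of where the pass bar sits, assuming a 90% AU threshold (the common MLPerf Storage rule; please verify against the rules for this mlpstorage version). Even the pass bar only needs ~2 MB/s at this record_length and batch size, so both file systems are failing well below a very modest I/O target, which again points at the CPU-bound data workers rather than the storage.

```python
# Hypothetical pass-bar estimate; the 0.90 AU threshold is an assumption.
compute_per_step = 1.35   # set value from the run config (s)
batch_size = 48
record_bytes = 65536
au_target = 0.90          # ASSUMED pass threshold

max_step_time = compute_per_step / au_target            # 1.5 s/step
needed_samples_per_sec = batch_size / max_step_time     # 32 samples/s
needed_io_mb = needed_samples_per_sec * record_bytes / 2**20

print(f"{needed_io_mb:.2f} MB/s needed")  # ~2.00 MB/s
```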