
MLPerf Storage RetinaNet CLOSED Division Failure #506

Description

@austingnanaraj

Date: June 24, 2026 03:42 UTC
Test: RetinaNet Training - 4 Client, CLOSED Division

Summary
The MLPerf Storage RetinaNet training benchmark (CLOSED division) failed during the warmup phase with "RuntimeError: dispatch failure" raised in the s3dlio library. The run passed validation and completed file listing (20m 22s), then crashed in under one second when it first attempted to read data during the warmup iteration.
Test Configuration (Redacted)

| Parameter | Value |
|---|---|
| Benchmark | mlpstorage 3.0.16 |
| Division | CLOSED |
| Workload | RetinaNet Training |
| Model | RetinaNet B200 |
| Clients | 4 nodes (HOST1, HOST2, HOST3, HOST4) |
| Processes | 16 (4 per client) |
| Accelerators | 16 B200 |
| Memory/Host | 754 GB |
| Dataset | 50,203,282 files (full dataset) |
| Epochs | 8 |
| Batch Size | 24 |

Storage Configuration:
Library: s3dlio
Endpoints: 4 S3 endpoints (REDACTED)
Region: us-east-1
Load Balancing: round_robin
Concurrency: 64 in-flight per rank
Workers: 8 per rank
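
For scale, the request load implied by these settings can be estimated with simple arithmetic. The sketch below assumes the 64 in-flight limit applies once per rank and that round_robin spreads requests evenly over the 4 endpoints; neither is confirmed by this report. The function name and the pools_per_rank parameter are illustrative only.

```python
def inflight_estimate(ranks: int, inflight_per_rank: int, endpoints: int,
                      pools_per_rank: int = 1) -> dict:
    """Estimate concurrent S3 requests cluster-wide and per endpoint.

    pools_per_rank is a hypothetical knob: set it to the worker count if each
    DataLoader worker turns out to keep its own in-flight limit.
    """
    total = ranks * inflight_per_rank * pools_per_rank
    return {"total": total, "per_endpoint": total / endpoints}


# Values taken from the configuration above: 16 ranks, 64 in-flight, 4 endpoints.
shared = inflight_estimate(ranks=16, inflight_per_rank=64, endpoints=4)

# Upper bound if each of the 8 workers per rank had its own limit (assumption).
per_worker = inflight_estimate(ranks=16, inflight_per_rank=64, endpoints=4,
                               pools_per_rank=8)

print(shared)
print(per_worker)
```

Under the per-rank reading this gives 1,024 concurrent requests in total, or 256 per endpoint. The per-worker reading is eight times higher, which may matter when checking endpoint connection limits.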
❌ Failure Details
Error Message:
RuntimeError: dispatch failure
Error Location:
File: dlio_benchmark/reader/_s3_iterable_mixin.py, line 424
Function: _s3_stream_s3dlio()
Code: while batch := item_iter.collect_batch(collect_n):
Full Stack Trace:
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "torch/utils/data/_utils/worker.py", line 374, in _worker_loop
data = fetcher.fetch(index)
File "torch/utils/data/_utils/fetch.py", line 44, in fetch
data = next(self.dataset_iter)
File "dlio_benchmark/data_loader/torch_data_loader.py", line 403, in __iter__
for _batch in self.reader.next():
File "dlio_benchmark/reader/image_reader_s3_iterable.py", line 83, in next
yield from self._s3_stream_next()
File "dlio_benchmark/reader/_s3_iterable_mixin.py", line 484, in _s3_stream_next
yield from self._s3_stream_s3dlio(obj_keys)
File "dlio_benchmark/reader/_s3_iterable_mixin.py", line 424, in _s3_stream_s3dlio
while batch := item_iter.collect_batch(collect_n):
RuntimeError: dispatch failure
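
Triage note: in the AWS SDK for Rust, "dispatch failure" is the display text of the SdkError::DispatchFailure variant, which is returned when a request could not be sent at all (for example connection, DNS, TLS, or timeout problems). Whether s3dlio is surfacing that variant here is an assumption, not something this report confirms. If so, a plain TCP reachability check from each client host is a cheap first step. The sketch below uses only the Python standard library; the helper names are hypothetical and the endpoints (redacted above) are passed on the command line.

```python
import socket
import sys
from urllib.parse import urlsplit


def endpoint_host_port(endpoint: str) -> tuple[str, int]:
    """Split an endpoint URL into (host, port), defaulting the port by scheme."""
    # Accept bare "host:port" values by assuming http when no scheme is given.
    parts = urlsplit(endpoint if "://" in endpoint else f"http://{endpoint}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def probe(endpoint: str, timeout: float = 3.0) -> tuple[bool, str]:
    """Try a TCP connect to the endpoint and report success or the OS error."""
    host, port = endpoint_host_port(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers DNS failures, refusals, and timeouts
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Usage: python probe_endpoints.py <endpoint> [<endpoint> ...]
    for ep in sys.argv[1:]:
        ok, detail = probe(ep)
        print(f"{ep}: {'OK' if ok else 'FAIL'} ({detail})")
```

A successful connect only rules out basic network and DNS problems. It does not exercise TLS, credentials, or the request concurrency the benchmark uses.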

⏱️ Timeline

| Phase | Time | Duration | Status |
|---|---|---|---|
| Start | 03:42:17 | - | ✅ |
| Environment Validation | 03:42:17-03:42:21 | 4s | ✅ |
| File Listing | 03:42:25-04:02:47 | 20m 22s | ✅ |
| Reshard for Epoch 1 | 04:02:48-04:02:50 | 2.29s | ✅ |
| Warmup Iteration | 04:02:50+ | <1s | ❌ FAILED |
| Total Runtime | 03:42:17-04:03:06 | 20m 49s | ❌ |

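
The durations in the timeline can be cross-checked from the timestamps, and the file listing phase converted into an approximate listing rate. This is a minimal sketch that assumes all timestamps fall on the same UTC day; the helper name is illustrative.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the timeline in this report.
listing = elapsed_seconds("03:42:25", "04:02:47")
total = elapsed_seconds("03:42:17", "04:03:06")

# Files listed per second, using the dataset size from the test configuration.
rate = 50_203_282 / listing

print(f"listing: {listing // 60}m {listing % 60}s")
print(f"total:   {total // 60}m {total % 60}s")
print(f"listing rate: ~{rate:,.0f} files/s")
```

The computed durations agree with the reported 20m 22s and 20m 49s, and the listing rate works out to roughly 41,000 files per second.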
✅ What Worked
✅ Environment validation passed
✅ CLOSED division qualification passed
✅ MPI collection across 4 hosts successful
✅ File listing completed (50.2M files in 20m 22s)
✅ File sharding completed (3,137,705 files per rank)
✅ Epoch reshard completed (2.29s)
✅ DataLoader initialization successful
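
The per-worker counts in the [ITER_SIMPLE] log lines are consistent with an even split in which the remainder goes to the lowest-numbered workers. The sketch below reproduces that arithmetic. It is an assumption about how the split behaves, not the actual DLIO sharding code, and split_evenly is a hypothetical helper.

```python
def split_evenly(total: int, parts: int) -> list[int]:
    """Split total items into parts shares; the first (total % parts) get one extra."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


# 3,137,705 files on a rank, divided across 8 DataLoader workers.
per_worker = split_evenly(3_137_705, 8)

# 50,203,282 files in the dataset, divided across 16 ranks.
per_rank = split_evenly(50_203_282, 16)

print(per_worker)
print(sorted(set(per_rank)))
```

Worker 0 receives 392,214 files and workers 1 through 7 receive 392,213 each, matching the log. At the rank level the dataset does not divide exactly by 16 (remainder 2), so under this assumption two ranks would hold 3,137,706 files rather than the 3,137,705 reported.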

[ITER_SIMPLE] worker=2 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=6 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=5 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=3 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=1 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=0 reader=ImageReaderS3Iterable files_this_worker=392214 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
Error executing job with overrides: ['workload=retinanet_b200', '++workload.storage.storage_type=s3', '++workload.storage.storage_root=retinanet3', '++workload.dataset.skip_listing=true', '++workload.dataset.listing_validation_interval=10000', '++workload.dataset.num_files_train=50203282', '++workload.storage.storage_options.storage_library=s3dlio', '++workload.storage.storage_options.uri_scheme=s3', '++workload.storage.s3_force_path_style=true', '++workload.dataset.data_folder=retinanet_64p/retinanet']
Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 781, in run_benchmark
benchmark.run()
File "/root/storage/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 666, in run
next(warmup_iter)
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 718, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1525, in _next_data
return self._process_data(data, worker_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1563, in _process_data
data.reraise()
File "/root/storage/.venv/lib/python3.12/site-packages/torch/_utils.py", line 774, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 374, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = next(self.dataset_iter)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 403, in __iter__
for _batch in self.reader.next():
^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/image_reader_s3_iterable.py", line 83, in next
yield from self._s3_stream_next()
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/_s3_iterable_mixin.py", line 484, in _s3_stream_next
yield from self._s3_stream_s3dlio(obj_keys)
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/_s3_iterable_mixin.py", line 424, in _s3_stream_s3dlio
while batch := item_iter.collect_batch(collect_n):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dispatch failure

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[ITER_SIMPLE] worker=4 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
Error executing job with overrides: ['workload=retinanet_b200', '++workload.storage.storage_type=s3', '++workload.storage.storage_root=retinanet3', '++workload.dataset.skip_listing=true', '++workload.dataset.listing_validation_interval=10000', '++workload.dataset.num_files_train=50203282', '++workload.storage.storage_options.storage_library=s3dlio', '++workload.storage.storage_options.uri_scheme=s3', '++workload.storage.s3_force_path_style=true', '++workload.dataset.data_folder=retinanet_64p/retinanet']
Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 781, in run_benchmark
benchmark.run()
File "/root/storage/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
