Summary
Setting storage.storage_type=direct_fs does not actually engage O_DIRECT. DLIO's pytorch_checkpointing.py tries to import the proper streaming class from mlpstorage.checkpointing (the old package name). That import raises ModuleNotFoundError, and the code silently falls back to SimpleStreamingCheckpointing, which uses plain open(path, "wb") regardless of the backend='direct_fs' argument it receives.
This is the same module-rename bug class that #359 fixed in 15 test files
("upstream mlpstorage → mlpstorage_py references"). The DLIO copy of the import path was missed.
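For contrast with the buffered open(path, "wb") path, here is a minimal sketch of what a write that actually bypasses the page cache tends to look like on Linux. It is not taken from mlpstorage_py or DLIO: write_direct, round_up, and ALIGN are hypothetical names, and the 4096-byte alignment is an assumption (the real requirement depends on the filesystem and device).

```python
import mmap
import os

ALIGN = 4096  # assumed alignment; the real value depends on filesystem/device


def round_up(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def write_direct(path, payload):
    """Write payload to path with O_DIRECT (Linux only), bypassing the page cache.

    O_DIRECT needs an aligned buffer and an aligned length, so the payload is
    copied into a page-aligned anonymous mmap, written padded, and the file is
    then truncated back to the real length.
    """
    padded = max(round_up(len(payload)), ALIGN)
    buf = mmap.mmap(-1, padded)  # anonymous mappings are page-aligned
    fd = None
    try:
        buf.write(payload)
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT
        fd = os.open(path, flags, 0o644)
        written = os.write(fd, buf)
        if written != padded:
            raise OSError(f"short O_DIRECT write: {written} of {padded} bytes")
        os.ftruncate(fd, len(payload))  # drop the zero padding
    finally:
        if fd is not None:
            os.close(fd)
        buf.close()
```

Filesystems without direct I/O support typically reject the open or the write with EINVAL, which is a useful signal in its own right: a backend that never raises and never needs alignment is probably doing buffered I/O.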
Reproduction
mlpstorage checkpointing run --model llama3-8b --num-processes 8 \
--client-host-memory-in-gb 282 --file --closed \
--checkpoint-folder /mnt/lustre/ckpt_8b \
--results-dir ~/mlperf-runs/results-direct \
--num-checkpoints-write=5 --num-checkpoints-read=0 \
--params storage.storage_type=direct_fs
During the write phase:
$ free -h
        total   used   free   shared   buff/cache   available
Mem:    282Gi   22Gi   71Gi   10Mi     192Gi        260Gi    <-- page cache growing
$ lctl get_param llite.*.max_cached_mb
max_cached_mb: 217218
used_mb: 195323 <-- Lustre client cache filled with write data, not bypassed
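The free -h check above can also be scripted so the cache growth is sampled during the write phase. The following is a sketch only: the helper names are hypothetical, and it assumes the procps convention that buff/cache is the sum of Buffers, Cached, and SReclaimable from /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def buff_cache_kib(text):
    """Approximate the buff/cache column of free, in KiB."""
    info = parse_meminfo(text)
    return sum(info.get(k, 0) for k in ("Buffers", "Cached", "SReclaimable"))


def read_buff_cache_kib(path="/proc/meminfo"):
    """Sample the current buff/cache value on a Linux host."""
    with open(path) as f:
        return buff_cache_kib(f.read())
```

Calling read_buff_cache_kib() before and after a checkpoint write gives a rough delta. With direct I/O actually engaged, that delta would be expected to stay small relative to the checkpoint size.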
Root cause
dlio_benchmark/checkpointing/pytorch_checkpointing.py:122
try:
    from mlpstorage.checkpointing import StreamingCheckpointing as _SC  # <-- wrong
except ImportError:
    from dlio_benchmark.checkpointing.simple_streaming_checkpointing import (
        SimpleStreamingCheckpointing as _SC,  # <-- falls back here
    )
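Because ModuleNotFoundError is a subclass of ImportError, the except clause above swallows the rename error without any trace. A possible hardening, sketched here with a hypothetical helper (import_first_available is not part of DLIO or mlpstorage_py), is to log whenever a fallback module is chosen:

```python
import importlib
import logging

log = logging.getLogger("checkpoint_import_sketch")


def import_first_available(candidates):
    """Return (module, name) for the first importable module in candidates.

    Logs a warning for every failed import and whenever a fallback is used,
    so a stale module path no longer fails silently.
    """
    for index, name in enumerate(candidates):
        try:
            module = importlib.import_module(name)
        except ImportError as exc:  # also catches ModuleNotFoundError
            log.warning("could not import %s: %s", name, exc)
            continue
        if index > 0:
            log.warning("using fallback module %s", name)
        return module, name
    raise ImportError(f"none of {candidates!r} could be imported")
```

In the DLIO case the candidates would presumably be the mlpstorage_py.checkpointing path first and the simple_streaming_checkpointing module second. The sketch only addresses visibility; the one-line import fix is still what restores the direct_fs behavior.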
Proposed fix
One-line change in dlio_benchmark/checkpointing/pytorch_checkpointing.py:
- from mlpstorage.checkpointing import StreamingCheckpointing as _SC
+ from mlpstorage_py.checkpointing import StreamingCheckpointing as _SC
@russfellows Since the storage repo's uv.lock pins your russfellows/dlio_benchmark@feat/parquet-dgen-streaming branch, the fix needs to land there to actually reach users. Happy to send a PR against that branch if you'd like. Thanks!