mlpstorage checkpointing: --params storage.storage_type=direct_fs uses pagecache #371

@crossmeta

Summary

Setting storage.storage_type=direct_fs does not actually engage O_DIRECT. DLIO's pytorch_checkpointing.py tries to import the streaming class from mlpstorage.checkpointing (the old package name). That import raises ModuleNotFoundError, and the code silently falls back to SimpleStreamingCheckpointing, which uses plain open(path, "wb") regardless of the backend='direct_fs' argument it receives.

This is the same class of module-rename bug that #359 fixed in 15 test files ("upstream mlpstorage → mlpstorage_py references"). The DLIO copy of the import path was missed.

Reproduction

mlpstorage checkpointing run --model llama3-8b --num-processes 8 \
    --client-host-memory-in-gb 282 --file --closed \
    --checkpoint-folder /mnt/lustre/ckpt_8b \
    --results-dir ~/mlperf-runs/results-direct \
    --num-checkpoints-write=5 --num-checkpoints-read=0 \
    --params storage.storage_type=direct_fs

During the write phase:

$ free -h
                      total        used        free      shared  buff/cache   available
Mem:                  282Gi        22Gi        71Gi        10Mi       192Gi       260Gi   <-- page cache growing

$ lctl get_param llite.*.max_cached_mb
max_cached_mb: 217218
used_mb: 195323   <-- Lustre client cache filled with write data, not bypassed
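
The same observation can be made programmatically by sampling /proc/meminfo before and during the write phase. The sketch below is illustrative only: the helper names parse_meminfo and cached_growth_kib are hypothetical, and the Cached field is just one component of what free reports as buff/cache. With O_DIRECT actually engaged, the growth should stay small relative to the checkpoint size.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <integer> [unit]" lines; most values are in kB.
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def cached_growth_kib(before_text, after_text):
    """Return how much the 'Cached' field grew between two samples."""
    before = parse_meminfo(before_text)["Cached"]
    after = parse_meminfo(after_text)["Cached"]
    return after - before
```

On a Linux client, each sample would be the contents of /proc/meminfo read as text, one taken before the run and one during the write phase.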

Root cause

dlio_benchmark/checkpointing/pytorch_checkpointing.py:122

    try:
        from mlpstorage.checkpointing import StreamingCheckpointing as _SC   # <-- wrong
    except ImportError:
        from dlio_benchmark.checkpointing.simple_streaming_checkpointing import (
            SimpleStreamingCheckpointing as _SC,                              # <-- falls back here
        )
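
The fallback is silent because ModuleNotFoundError is a subclass of ImportError, so the except clause catches the failed lookup of the old package name without any message. A minimal standalone sketch of that behavior follows; the module name, FallbackWriter, and load_streaming_class are hypothetical stand-ins, not DLIO code.

```python
import importlib


class FallbackWriter:
    """Hypothetical stand-in for the buffered fallback class."""


def load_streaming_class(module_name, fallback):
    """Return (class, used_fallback), mirroring the try/except import pattern."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError lands here too, since it subclasses ImportError.
        return fallback, True
    return module.StreamingCheckpointing, False


# The package name below is assumed not to exist, like the renamed package.
cls, used_fallback = load_streaming_class(
    "hypothetical_old_pkg_name.checkpointing", FallbackWriter
)
print(cls.__name__, used_fallback)
```

Nothing in this pattern reports which branch was taken, which is why the run appears to succeed while still writing through the page cache.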

Proposed fix

One-line change in dlio_benchmark/checkpointing/pytorch_checkpointing.py:

-                from mlpstorage.checkpointing import StreamingCheckpointing as _SC
+                from mlpstorage_py.checkpointing import StreamingCheckpointing as _SC
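
Whether the corrected import path resolves in a given environment can be checked without running a benchmark. This is a rough sketch using the standard library's importlib.util.find_spec; the helper name module_available is hypothetical, and the check only confirms that the module can be located, not that O_DIRECT is in effect.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is missing.
        return False
```

In an environment with the fix applied, module_available("mlpstorage_py.checkpointing") would be expected to return True, while the old name "mlpstorage.checkpointing" would return False.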

@russfellows Since the storage repo's uv.lock pins your russfellows/dlio_benchmark@feat/parquet-dgen-streaming branch, the fix needs to land there to actually reach users. Happy to send a PR against that branch if you'd like. Thanks