
Update python version, DLIO, and uv.lock#298

Merged
FileSystemGuy merged 3 commits into main from fsg-python-version
Apr 4, 2026

Conversation

@FileSystemGuy
Contributor

Constrain Python to be in the 3.12 family only.
Change the DLIO dependencies in pyproject.toml to refer to DLIO_local_changes.
Update the uv.lock file to reflect those changes.

@FileSystemGuy FileSystemGuy requested a review from a team March 31, 2026 20:59
@github-actions

github-actions bot commented Mar 31, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy
Contributor Author

This is intended to resolve #293
Hopefully it will actually do that!

idevasena previously approved these changes Mar 31, 2026
@FileSystemGuy
Contributor Author

This one needs real testing before we merge it in, so please don't merge it until we're sure it hasn't broken anything.

@idevasena
Contributor

idevasena commented Apr 2, 2026

@FileSystemGuy Performed the following tests to validate PR:

  1. Verified pyproject.toml changes and uv.lock consistency
  2. Verified uv.lock reproducibility:
    This confirms the lock file was generated correctly and isn't stale.
smrc@dskbd029:~/Storage_Repo_Tests/storage$ git checkout pr-298
Switched to branch 'pr-298'
smrc@dskbd029:~/Storage_Repo_Tests/storage$ cat .python-version
3.12
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep "requires-python" pyproject.toml
requires-python = ">=3.12, <3.13"
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -n "dlio-benchmark" pyproject.toml
24:    "dlio-benchmark @ git+https://github.com/mlcommons/DLIO_local_changes.git",
35:    "dlio-benchmark @ git+https://github.com/mlcommons/DLIO_local_changes.git",
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -i "wvaske" uv.lock
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -i "argonne-lcf" uv.lock
smrc@dskbd029:~/Storage_Repo_Tests/storage$ grep -i "DLIO_local_changes" uv.lock
source = { git = "https://github.com/mlcommons/DLIO_local_changes.git#4be40e6b9077674513c0defd5283faf3cbae8445" }
    { name = "dlio-benchmark", marker = "extra == 'dlio'", git = "https://github.com/mlcommons/DLIO_local_changes.git" },
    { name = "dlio-benchmark", marker = "extra == 'full'", git = "https://github.com/mlcommons/DLIO_local_changes.git" },
smrc@dskbd029:~/Storage_Repo_Tests/storage$ cp uv.lock uv.lock.from_pr
smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv lock
Using CPython 3.12.3 interpreter at: /usr/bin/python3.12
Resolved 98 packages in 1ms
smrc@dskbd029:~/Storage_Repo_Tests/storage$ diff uv.lock uv.lock.from_pr
smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv venv --python 3.12
Using CPython 3.12.3 interpreter at: /usr/bin/python3.12
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
  3. After installation and dependency resolution:
    Verified installed DLIO source:
(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ cat .venv/lib/python3.12/site-packages/dlio_benchmark-*.dist-info/direct_url.json | python3 -m json.tool
{
    "url": "https://github.com/mlcommons/DLIO_local_changes.git",
    "vcs_info": {
        "vcs": "git",
        "commit_id": "4be40e6b9077674513c0defd5283faf3cbae8445"
    }
}
(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ python3 -c "import mlpstorage; print('import OK')"
import OK
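
The same origin check can be done without globbing site-packages, by reading the installer-recorded direct_url.json (PEP 610) through importlib.metadata. A sketch, assuming the distribution is named dlio_benchmark as in the log above:

```python
# Sketch: read the installer-recorded origin (direct_url.json) via
# importlib.metadata instead of poking into site-packages. The distribution
# name "dlio_benchmark" is taken from the log above; adjust if it differs.
import json
from importlib import metadata

def dist_origin(dist_name="dlio_benchmark"):
    """Return the parsed direct_url.json for a distribution, or None if the
    installer did not record one (e.g. a plain PyPI install)."""
    dist = metadata.distribution(dist_name)  # raises PackageNotFoundError if absent
    raw = dist.read_text("direct_url.json")
    return json.loads(raw) if raw else None
```

For a git install like the one above, the returned dict carries the same url and vcs_info.commit_id fields shown in the JSON output.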
  4. Functional tests

DLIO data generation test:

(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ mlpstorage training datagen --num-processes 2     --data-dir /tmp/mlperf_storage_test  -m unet3d
Hosts is: ['127.0.0.1']
Hosts is: ['127.0.0.1']
⠋ Validating environment... 0:00:002026-04-01 22:54:19|INFO: Environment validation passed
2026-04-01 22:54:19|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/unet3d/datagen/20260401_225419
2026-04-01 22:54:19|INFO: Creating data directory: /tmp/mlperf_storage_test/unet3d...
2026-04-01 22:54:19|INFO: Creating directory: /tmp/mlperf_storage_test/unet3d/train...
2026-04-01 22:54:19|INFO: Creating directory: /tmp/mlperf_storage_test/unet3d/valid...
2026-04-01 22:54:19|INFO: Creating directory: /tmp/mlperf_storage_test/unet3d/test...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-04-01 22:54:19|STATUS: Running benchmark command:: mpirun -n 2 -host 127.0.0.1:2 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=/tmp/mlperf_storage_results/training/unet3d/datagen/20260401_225419 ++hydra.output_subdir=dlio_config ++workload.dataset.data_folder=/tmp/mlperf_storage_test/unet3d --config-dir=/home/smrc/Storage_Repo_Tests/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/tmp/mlperf_storage_test/unet3d'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 168
[OUTPUT] 2026-04-01T22:54:23.876022 Running DLIO [Generating data] with 2 process(es)
  record_length  = 146600628
  generate_data  = True
  do_train       = False
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 1
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-04-01T22:54:23.911240 Starting data generation
[OUTPUT] 2026-04-01T22:54:35.168146 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
  storage_root   = './'
  storage_options= None
  data_folder    = '/tmp/mlperf_storage_test/unet3d'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 168
  record_length  = 146600628
  generate_data  = True
  do_train       = False
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 1
[OUTPUT] ================================================================================
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================

DLIO benchmark run test:

(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ mlpstorage training run     --model unet3d     --num-accelerators 2     --accelerator-type h100     --data-dir /tmp/mlperf_storage_test     --results-dir /tmp/mlperf_storage_results --client-host-memory-in-gb 1000
Setting attr from num_accelerators to 2
Hosts is: ['127.0.0.1']
Hosts is: ['127.0.0.1']
⠋ Validating environment... 0:00:002026-04-01 22:56:10|INFO: Environment validation passed
2026-04-01 22:56:10|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/unet3d/run/20260401_225610
2026-04-01 22:56:11|INFO: Created benchmark run: training_run_unet3d_20260401_225610
2026-04-01 22:56:11|STATUS: Verifying benchmark run for training_run_unet3d_20260401_225610
2026-04-01 22:56:11|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-01 22:56:11|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 36889, Actual: 168)
2026-04-01 22:56:11|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='unet3d', run_datetime='20260401_225610')])
2026-04-01 22:56:11|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:002026-04-01 22:56:11|STATUS: Running benchmark command:: mpirun -n 2 -host 127.0.0.1:2 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/tmp/mlperf_storage_results/training/unet3d/run/20260401_225610 ++hydra.output_subdir=dlio_config ++workload.dataset.data_folder=/tmp/mlperf_storage_test/unet3d --config-dir=/home/smrc/Storage_Repo_Tests/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/tmp/mlperf_storage_test/unet3d'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 168
  record_length  = 146600628
  generate_data  = False
[OUTPUT] 2026-04-01T22:56:14.512806 Running DLIO [Training & Checkpointing] with 2 process(es)
  do_train       = True
  do_checkpoint  = True
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the 
caching effect!!!
  epochs         = 5
  batch_size     = 7
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT] ================================================================================
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT] ================================================================================
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-04-01T22:56:14.553670 Model size: 0.000010 GB 
[OUTPUT] 2026-04-01T22:56:14.553742 Total checkpoint size: 0.000010 GB 
[OUTPUT] 2026-04-01T22:56:14.553813 Max steps per epoch: 12 = 1 * 168 / 7 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-01T22:56:14.632422 Starting epoch 1: 12 steps expected
[OUTPUT] 2026-04-01T22:56:14.632621 Starting block 1
[OUTPUT] 2026-04-01T22:56:19.313152 Ending block 1 - 12 steps completed in 4.68 s
[OUTPUT] 2026-04-01T22:56:19.326037 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 97.1905
[OUTPUT] 2026-04-01T22:56:19.326221 Epoch 1 - Block 1 [Training] Throughput (samples/second): 38.2855
[OUTPUT] 2026-04-01T22:56:19.326338 Epoch 1 - Block 1 [Training] Computation time per step (second): 0.3231+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2026-04-01T22:56:19.331486 Ending epoch 1 - 12 steps completed in 4.70 s
[OUTPUT] 2026-04-01T22:56:19.336531 Starting epoch 2: 12 steps expected
[OUTPUT] 2026-04-01T22:56:19.337212 Starting block 1
[OUTPUT] 2026-04-01T22:56:23.962226 Ending block 1 - 12 steps completed in 4.63 s
[OUTPUT] 2026-04-01T22:56:23.965841 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 97.1230
[OUTPUT] 2026-04-01T22:56:23.965986 Epoch 2 - Block 1 [Training] Throughput (samples/second): 38.2590
[OUTPUT] 2026-04-01T22:56:23.966093 Epoch 2 - Block 1 [Training] Computation time per step (second): 0.3231+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2026-04-01T22:56:23.970881 Ending epoch 2 - 12 steps completed in 4.63 s
[OUTPUT] 2026-04-01T22:56:23.975115 Starting epoch 3: 12 steps expected
[OUTPUT] 2026-04-01T22:56:23.975700 Starting block 1
[OUTPUT] 2026-04-01T22:56:28.606212 Ending block 1 - 12 steps completed in 4.63 s
[OUTPUT] 2026-04-01T22:56:28.610569 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 97.0761
[OUTPUT] 2026-04-01T22:56:28.610729 Epoch 3 - Block 1 [Training] Throughput (samples/second): 38.2404
[OUTPUT] 2026-04-01T22:56:28.610866 Epoch 3 - Block 1 [Training] Computation time per step (second): 0.3231+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2026-04-01T22:56:28.615659 Ending epoch 3 - 12 steps completed in 4.64 s
[OUTPUT] 2026-04-01T22:56:28.620051 Starting epoch 4: 12 steps expected
[OUTPUT] 2026-04-01T22:56:28.620683 Starting block 1
[OUTPUT] 2026-04-01T22:56:33.245046 Ending block 1 - 12 steps completed in 4.62 s
[OUTPUT] 2026-04-01T22:56:33.248665 Epoch 4 - Block 1 [Training] Accelerator Utilization [AU] (%): 97.0818
[OUTPUT] 2026-04-01T22:56:33.248835 Epoch 4 - Block 1 [Training] Throughput (samples/second): 38.2435
[OUTPUT] 2026-04-01T22:56:33.248946 Epoch 4 - Block 1 [Training] Computation time per step (second): 0.3231+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2026-04-01T22:56:33.253622 Ending epoch 4 - 12 steps completed in 4.63 s
[OUTPUT] 2026-04-01T22:56:33.257922 Starting epoch 5: 12 steps expected
[OUTPUT] 2026-04-01T22:56:33.258482 Starting block 1
[OUTPUT] 2026-04-01T22:56:37.884269 Ending block 1 - 12 steps completed in 4.63 s
[OUTPUT] 2026-04-01T22:56:37.887884 Epoch 5 - Block 1 [Training] Accelerator Utilization [AU] (%): 96.9763
[OUTPUT] 2026-04-01T22:56:37.888032 Epoch 5 - Block 1 [Training] Throughput (samples/second): 38.2007
[OUTPUT] 2026-04-01T22:56:37.888164 Epoch 5 - Block 1 [Training] Computation time per step (second): 0.3231+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2026-04-01T22:56:37.888695 Starting saving checkpoint 1 after total step 12 for epoch 5
[OUTPUT] 2026-04-01T22:56:37.939609 Finished saving checkpoint 1 for epoch 5 in 0.0509 s; Throughput: 0.0002 GB/s
[OUTPUT] 2026-04-01T22:56:37.949224 Ending epoch 5 - 12 steps completed in 4.69 s
[OUTPUT] 2026-04-01T22:56:37.954150 Saved outputs in /tmp/mlperf_storage_results/training/unet3d/run/20260401_225610
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 2 
[METRIC] Training Accelerator Utilization [AU] (%): 97.6480 (0.0570)
[METRIC] Training Throughput (samples/second): 38.4658 (0.0227)
[METRIC] Training I/O Throughput (MB/second): 5377.8813 (3.1798)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================

[OUTPUT] 2026-04-01T22:56:37.957874 outputs saved in RANKID_output.json
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/tmp/mlperf_storage_test/unet3d'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 168
  record_length  = 146600628
  generate_data  = False
  do_train       = True
  do_checkpoint  = True
  epochs         = 5
  batch_size     = 7

@idevasena
Copy link
Copy Markdown
Contributor

I ran the existing unit/integration tests too, but saw 16 failed tests and 13 errors. However, those appear to be pre-existing bugs in the main branch test suite.

(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ uv pip install -e ".[test]"
Resolved 15 packages in 428ms
      Built mlpstorage @ file:///home/smrc/Storage_Repo_Tests/storage
Prepared 7 packages in 631ms
Uninstalled 1 package in 0.92ms
Installed 7 packages in 4ms
 + coverage==7.13.5
 + iniconfig==2.3.0
 ~ mlpstorage==2.0.0b1 (from file:///home/smrc/Storage_Repo_Tests/storage)
 + pluggy==1.6.0
 + pytest==9.0.2
 + pytest-cov==7.1.0
 + pytest-mock==3.15.1
(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ pytest
================================================================= short test summary info =================================================================
FAILED tests/integration/test_benchmark_flow.py::TestTrainingBenchmarkFlow::test_training_benchmark_what_if_mode - PermissionError: [Errno 13] Permission denied: '/data'
FAILED tests/integration/test_benchmark_flow.py::TestTrainingBenchmarkFlow::test_training_benchmark_generates_correct_command - PermissionError: [Errno 13] Permission denied: '/data'
FAILED tests/integration/test_benchmark_flow.py::TestBenchmarkWithMockExecutor::test_benchmark_records_executed_commands - PermissionError: [Errno 13] Permission denied: '/data'
FAILED tests/integration/test_benchmark_flow.py::TestBenchmarkWithMockCollector::test_benchmark_uses_mock_cluster_info - PermissionError: [Errno 13] Permission denied: '/data'
FAILED tests/integration/test_benchmark_flow.py::TestMetadataGeneration::test_metadata_file_created_on_run - PermissionError: [Errno 13] Permission denied: '/data'
FAILED tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_shows_spinner - AssertionError: Expected 'progress_context' to have been called once. Called 0 times.
FAILED tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_updates_description_ssh - AssertionError: expected call not found.
FAILED tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_updates_description_mpi - AssertionError: expected call not found.
FAILED tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_end_cluster_collection_shows_spinner - AssertionError: Expected 'progress_context' to have been called once. Called 0 times.
FAILED tests/unit/test_reporting.py::TestReportGeneratorInit::test_accepts_custom_logger - SystemExit: FILE_NOT_FOUND (3)
FAILED tests/unit/test_reporting.py::TestReportGeneratorInit::test_uses_debug_from_args - SystemExit: FILE_NOT_FOUND (3)
FAILED tests/unit/test_reporting.py::TestReportGeneratorAccumulateResults::test_accumulates_from_benchmark_runs - SystemExit: FILE_NOT_FOUND (3)
FAILED tests/unit/test_reporting.py::TestReportGeneratorAccumulateResults::test_groups_by_workload - SystemExit: FILE_NOT_FOUND (3)
FAILED tests/unit/test_reporting.py::TestReportGeneratorIntegration::test_full_workflow_with_fixture_data - SystemExit: FILE_NOT_FOUND (3)
FAILED tests/unit/test_rules_calculations.py::TestGetRunsFiles::test_returns_empty_for_nonexistent_dir - AssertionError: Expected 'warning' to have been called.
FAILED tests/unit/test_rules_calculations.py::TestGetRunsFiles::test_skips_dirs_with_multiple_metadata - assert 1 == 0
ERROR tests/unit/test_reporting.py::TestReportGeneratorWriteJson::test_writes_json_file - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorWriteJson::test_json_has_proper_formatting - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorWriteCsv::test_writes_csv_file - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorWriteCsv::test_flattens_nested_dicts - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorWriteCsv::test_handles_nan_values - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorGenerateReports::test_returns_success - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorGenerateReports::test_creates_json_file - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorGenerateReports::test_creates_csv_file - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorPrintResults::test_prints_closed_results - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorPrintResults::test_prints_issues - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorPrintResults::test_prints_metrics - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorPrintResults::test_prints_metric_lists - SystemExit: FILE_NOT_FOUND (3)
ERROR tests/unit/test_reporting.py::TestReportGeneratorPrintResults::test_prints_workload_results - SystemExit: FILE_NOT_FOUND (3)
================================================== 16 failed, 831 passed, 12 skipped, 13 errors in 6.61s ==================================================

Reasoning for the failures:

Category 1: test_reporting.py failures (13 errors + 5 failures). The test fixtures create a bare results/ directory, but ReportGenerator.__init__ now validates the directory structure and calls sys.exit(EXIT_CODE.FILE_NOT_FOUND) when it doesn't find training/, checkpointing/, vector_database/, or kv_cache/ subdirectories. The tests weren't updated to match the new validation logic. This is a test-vs-code mismatch on main, not a PR #298 regression.
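
A sketch of the kind of fixture fix Category 1 needs: build the subdirectory layout the new validation expects. The subdirectory names are the ones listed above; the helper name is illustrative, not the real fixture.

```python
# Sketch: create a results/ tree with the subdirectories the new
# ReportGenerator validation looks for, so fixtures stop tripping
# sys.exit(EXIT_CODE.FILE_NOT_FOUND). make_results_dir is a stand-in name.
import pathlib
import tempfile

def make_results_dir():
    root = pathlib.Path(tempfile.mkdtemp()) / "results"
    for sub in ("training", "checkpointing", "vector_database", "kv_cache"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```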

Category 2: test_benchmark_flow.py failures (5) — PermissionError: [Errno 13] Permission denied: '/data'. The TrainingBenchmark tries to os.makedirs('/data') which requires root. These tests need to mock the filesystem or use tmp_path. Again, pre-existing on main.
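
For Category 2, the usual pytest fix is to route the hard-coded '/data' through the tmp_path fixture. A sketch, with make_args standing in for however the real tests build benchmark arguments:

```python
# Sketch: use pytest's tmp_path fixture instead of the hard-coded '/data',
# so os.makedirs never needs root. make_args is an illustrative stand-in,
# not the real test helper.
import os

def make_args(data_dir):
    return {"data_dir": str(data_dir)}

def test_training_benchmark_uses_tmp_path(tmp_path):
    args = make_args(tmp_path / "data")
    # The real test would hand args to the benchmark under test; the key
    # point is that the directory lives under tmp_path, not under /.
    os.makedirs(args["data_dir"], exist_ok=True)
    assert os.path.isdir(args["data_dir"])
```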

Category 3: test_benchmarks_base.py failures (4) — Progress spinner mock assertions not matching. The mocked progress_context is never called, suggesting the code path changed but tests weren't updated. Pre-existing.

Category 4: test_rules_calculations.py failures (2) — Logic changes in how nonexistent dirs and multiple metadata files are handled. Pre-existing.

@idevasena
Contributor

Verifying the Submodule Issue (#293) is actually addressed:

(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ cat .gitmodules 2>/dev/null
[submodule "dlio_benchmark"]
path = dlio_benchmark
url = https://github.com/mlcommons/DLIO_local_changes
branch = main
(storage) smrc@dskbd029:~/Storage_Repo_Tests/storage$ git submodule status
-4be40e6b9077674513c0defd5283faf3cbae8445 dlio_benchmark

The .gitmodules file shows the submodule has been updated to point to mlcommons/DLIO_local_changes (matching the pyproject.toml change), but the - prefix in git submodule status means the submodule hasn't been initialized/checked out.

To summarize:

This PR partially addresses issue #293: it aligns pyproject.toml, uv.lock, and .gitmodules to the same repo (mlcommons/DLIO_local_changes).
But the submodule is still declared; it's just not actively used, since the dependency is pulled via pip/uv from the git URL in pyproject.toml.

A follow-up cleanup to remove the submodule entirely (i.e. delete .gitmodules, git rm dlio_benchmark) would fully close issue #293, since the submodule is redundant with the pyproject.toml dependency declaration.
But that's a separate concern from the correctness of this PR.

The PR looks safe to approve from a functional standpoint overall.

@FileSystemGuy
Contributor Author

Devasena, thank you for digging into this!! Both the testing of the pyproject.toml and uv.lock files, and the suggestion that the submodule linkage is not required.

Since using dlio_benchmark as a "submodule" causes trouble for the end-user (they have to add the --recurse-submodules option when cloning the storage repo), if we can depend upon the pyproject.toml file to pull our DLIO into the user's environment, then it sounds like we should go that direction?

IIUC, that means two things:

  1. the "download and run" instructions for the benchmark must include "pip install uv; uv sync" so that the correct DLIO actually gets pulled down onto the local machine
  2. I can delete the dlio_benchmark "submodule" linkage from the storage repo

Does that sound correct, and is it a good idea?

@idevasena
Contributor

Thank you, Curtis! Yes, that's correct and a good idea.

  1. Agree with removing the submodule as a right move:
    The submodule was the original mechanism (v0.5/v1.0), when users ran git clone --recurse-submodules and then pip install -r dlio_benchmark/requirements.txt. The v2.0 installer already moved to pyproject.toml, with pip install -e . pulling DLIO from git. So the submodule has been unused since v2.0; it just creates confusion (as issue #293, "dlio_benchmark submodule is unused?", highlighted) and a clone-time issue for users who forget --recurse-submodules and then wonder why the dlio_benchmark/ directory is empty.

  2. To your point on installation instructions:
    Yes, but with a nuance. The current README already says pip install -e . which pulls DLIO via the pyproject.toml git dependency. With this PR merged, that dependency correctly points to mlcommons/DLIO_local_changes. So the instructions would be either:

pip path: pip install -e ".[dlio]" (or .[full]) — works today, no uv required
uv path: uv sync --extra dlio — faster, deterministic via the lock file

Can we support both in the README, since we are not sure all users will have uv? Something like below?

# Option A: pip (works everywhere)
pip install -e ".[dlio]"

# Option B: uv (faster, reproducible)
pip install uv
uv sync --extra dlio

The key thing is that DLIO gets pulled transitively either way — the user never needs to think about submodules, separate clones, or branch names.

  3. Yes, deleting the submodule is the right move. The cleanup steps can be:
# Remove the submodule from the index and delete its working tree
git rm --cached dlio_benchmark
rm -rf dlio_benchmark
# Delete the .gitmodules file entirely (if dlio_benchmark is the only submodule)
git rm .gitmodules
# Clean up the cached submodule metadata
rm -rf .git/modules/dlio_benchmark

One last thing to check is that no scripts in the repo reference dlio_benchmark/ as a local path. A quick check:

grep -r "dlio_benchmark/" --include="*.py" --include="*.sh" --include="*.yaml" --include="*.md" .

@FileSystemGuy
Contributor Author

To dig into one sub-topic: I'm attracted to the idea that uv will ensure all the correct versions of all the Python libraries we need are installed. My impression is that pip alone will not do that (i.e., uv.lock is not referenced by pip?).

I'm building PRs for the other topics you pointed out.

@idevasena
Contributor

@FileSystemGuy You're right. pip does not read uv.lock. The lock file is purely a uv artifact. So the two paths aren't equivalent in terms of reproducibility.

Shall we then just make uv sync the official install path in the README and submission guidelines, ensuring benchmark reproducibility? We can mention pip install -e ".[dlio]" as "works, but not guaranteed identical" if needed.

So the README would then be updated with the following:

pip install uv
uv sync --extra dlio

@russfellows what are your thoughts on using just uv as opposed to pip fallback or pip with constraints as a middle ground?

@russfellows
Contributor

russfellows commented Apr 3, 2026 via email

Without this change, the user has to add "--extra dlio" to the uv sync command, which they likely will forget to do, so better to download it even if they don't use it than to not get it when they need it.
@FileSystemGuy
Contributor Author

FileSystemGuy commented Apr 3, 2026

Based upon Russ' research I made one change in the PR: I moved the dlio dependency to required rather than optional. Without this change, the user has to add "--extra dlio" to the "uv sync" command, which they likely will forget to do, so better to download it even if they don't use it than to not get it when they need it.

I will make the suggested changes to the README: note that "uv" is the required package manager (with instructions on how to install it), that pip is explicitly not supported for use with the benchmark, and that "uv run" should be prepended to every use of mlpstorage to ensure that the correct versions of all libraries are used. I'll also see if there's a way to force "uv run" into mlpstorage itself, to guarantee it is used on every benchmark execution. Is this too much micromanaging, or is this better ease-of-use and repeatability?

@FileSystemGuy
Contributor Author

I took my own advice and decided to fix this once and for all.

I created PR#308 to prepend "uv run" to every invocation of mlpstorage. Whether they used pip or not, uv will force the creation and use of the correct virtual environment. If they haven't installed uv yet, then the command will fail and they will have to install uv.

How's that for a big hammer? :-)
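
A minimal sketch of that kind of big hammer, assuming an entry-point guard (the function name and the MLPS_UV_WRAPPED variable are illustrative, not the actual PR #308 code):

```python
# Sketch: re-exec the CLI under "uv run" unless we are already inside the
# uv-managed environment. All names here are illustrative, not from PR #308.
import os
import shutil
import sys

def ensure_uv_run():
    if os.environ.get("MLPS_UV_WRAPPED"):
        return  # already re-executed under uv; continue normal startup
    if shutil.which("uv") is None:
        sys.exit("error: uv is required; install it with 'pip install uv'")
    os.environ["MLPS_UV_WRAPPED"] = "1"  # prevent an exec loop
    # Replace this process with "uv run mlpstorage <original args>".
    os.execvp("uv", ["uv", "run", "mlpstorage", *sys.argv[1:]])
```

Calling ensure_uv_run() at the top of the CLI's main() means a user who runs mlpstorage directly still ends up inside the locked uv environment, or gets a clear error telling them to install uv.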

@FileSystemGuy FileSystemGuy added the "DLIO or mlpstorage" label (related to code in mlpstorage or dlio) Apr 3, 2026
@FileSystemGuy FileSystemGuy merged commit 83abf2e into main Apr 4, 2026
2 checks passed
@FileSystemGuy FileSystemGuy deleted the fsg-python-version branch April 4, 2026 19:53
@github-actions github-actions bot locked and limited conversation to collaborators Apr 4, 2026