Skip to content

Submission checker reads combined_params / params_dict, but writer emits parameters / override_parameters (3.0.26) #598

Description

@wolfgang-desalvador

Summary

In mlpstorage 3.0.26 the BaseBenchmark.metadata writer and the
submission_checker readers disagree about the top-level keys under which
the merged DLIO parameters and user-override parameters are stored in
*_metadata.json. The writer was renamed as part of the fix for #365
(combined_paramsparameters, params_dictoverride_parameters)
but several reader sites in submission_checker/checks/ still look up the
old names. As a result, every check that depends on these dicts silently
sees {} and either:

  • emits a spurious VIOLATION ("dataset parameters not found in
    metadata", "record length is 0", "datagen size 0.00GiB < required …"), or
  • silently no-ops a CLOSED/OPEN allow-list check that should have fired,

depending on whether the check fails-open or fails-closed when the dict is
empty.

This blocks mlpstorage validate from passing on any locally-produced
training submission, even when the run itself is correct.

Environment

mlpstorage version 3.0.26
Install pip install mlpstorage into a venv
Affected commands mlpstorage validate, mlpstorage reports reportgen (partial — reportgen also uses parameters, so it works)

Reproduction

  1. Run any training benchmark to completion (e.g. mlpstorage training run --model unet3d ...) so that
    closed/<orgname>/results/<system>/training/unet3d/run/<ts>/training_unet3d_metadata.json is
    written.
  2. Confirm the JSON contains the new key names:
    python -c 'import json,sys; m=json.load(open(sys.argv[1])); print(sorted(k for k in m if "param" in k))' \
      closed/<orgname>/results/<system>/training/unet3d/run/<ts>/training_unet3d_metadata.json
    # → ['override_parameters', 'parameters']
  3. Run mlpstorage validate <results-dir>.
  4. Observe spurious violations:
    • [3.1.1] trainingVerifyDatasizeUsage: dataset parameters not found in metadata
    • [3.1.2] trainingRecalculateDatasetSize: record length is 0, cannot calculate dataset size
    • [3.2.1] trainingDatagenMinimumSize: datagen size 0.00GiB is less than required 0.00GiB

Root cause

The writer at
mlpstorage_py/benchmarks/base.py lines 408–420
deliberately stores the merged config under parameters and the user-overrides under
override_parameters (this rename is referenced in the inline comment that
calls out #365):

# Parameters - YAML defaults with CLI overrides folded in (fixes #365).
# combined_params alone omits overrides added after __init__ (e.g.
# checkpoint.num_checkpoints_*), causing split-phase runs to double-count.
if hasattr(self, 'combined_params'):
    metadata['parameters'] = self._apply_dotted_overrides(
        self.combined_params, getattr(self, 'params_dict', {}))
else:
    metadata['parameters'] = {}

# Override parameters - user-specified overrides only
if hasattr(self, 'params_dict'):
    metadata['override_parameters'] = self.params_dict
else:
    metadata['override_parameters'] = {}

But the reader-side checks under mlpstorage_py/submission_checker/checks/
still call metadata.get("combined_params", {}) / metadata.get("params_dict", {}).
Because the lookups silently default to {}, the downstream code paths
either crash with a record_length == 0 early-exit or pass the wrong
allow-list check.

Affected reader sites

File Line Current lookup Should be
submission_checker/checks/training_checks.py 227 metadata.get("combined_params", {}) metadata.get("parameters", {})
submission_checker/checks/training_checks.py 238 metadata.get("combined_params", {}) metadata.get("parameters", {})
submission_checker/checks/training_checks.py 579 metadata.get("params_dict", {}) metadata.get("override_parameters", {})
submission_checker/checks/training_checks.py 642 metadata.get("params_dict", {}) metadata.get("override_parameters", {})
submission_checker/checks/checkpointing_checks.py 160 metadata.get("combined_params", {}) metadata.get("parameters", {})
submission_checker/checks/checkpointing_checks.py 223 metadata.get("params_dict", {}) metadata.get("override_parameters", {})
submission_checker/checks/checkpointing_checks.py 504 metadata.get("params_dict", {}) metadata.get("override_parameters", {})
submission_checker/checks/vdb_checks.py 880 metadata.get("params_dict", {}) metadata.get("override_parameters", {})
submission_checker/checks/vdb_checks.py 941 metadata.get("params_dict", {}) metadata.get("override_parameters", {})

The tests under tests/test_training_check_tool_injected_params.py and
tests/test_vdb_checks.py construct fixture metadata using the old
key names (params_dict, combined_params), so the existing test suite
does not catch this — the tests test the checker against itself, not
against the actual writer output.

Expected behaviour

mlpstorage validate against a clean, locally-produced submission tree
should pass all "parameter shape" checks (3.1.1, 3.1.2, 3.2.1, 3.6.2,
3.6.3 in the training suite; the corresponding checkpointing/VDB rules)
without manual JSON post-processing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions