
Bug: Submission checker rule 3.1.2 double-counts host memory, doubling the required dataset size #669

Description

@mirajeev

Summary

Rule 3.1.2 recomputes the minimum required dataset size from total cluster host memory. It derives "total host memory" by reading only the first element of the per-host host_memory_GB array and multiplying it by num_hosts:

# training_checks.py
host_memory_gb   = summary.get("host_memory_GB", [0])[0]   # L164  -> index [0] only
...
total_host_memory = num_hosts * host_memory_gb             # L180  -> assumes ALL hosts == host_memory_GB[0]
min_samples_memory = (total_host_memory * HOST_MEMORY_MULTIPLIER * 1024**3 / record_length)  # L181

This assumes the host_memory_GB array is homogeneous (every host equals index [0]). In practice the DLIO-generated summary.json array is not a clean one-value-per-host list — it contains duplicated (2×) and zero entries whose sum equals the true total. When host_memory_GB[0] happens to be an inflated (doubled) entry, num_hosts * host_memory_GB[0] yields exactly double the real cluster memory, and the minimum-dataset-size requirement doubles.

The correct total is sum(host_memory_GB), not num_hosts * host_memory_GB[0].


Environment / evidence (real submission)

  • Cluster: 15 client hosts, 30 accelerators (2 ranks/host), reader.batch_size=7.
  • Per-host RAM (verified independently via /proc/meminfo on every host, and in the tool's own collector-staging/cluster_info.json): MemTotal = 197,223,052 kB = 188.09 GiB on all 15 hosts. True total = 15 × 188.09 ≈ 2,821 GiB.
  • record_length_bytes = 146,600,628, num_samples_per_file = 1, HOST_MEMORY_MULTIPLIER = 5.
  • Dataset generated: num_files_train = 105,000.
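The per-host figure above can be checked with a short conversion from the MemTotal line to GiB. This is a minimal sketch: the helper name memtotal_gib and the sample text are illustrative assumptions, not part of mlpstorage or DLIO. On a real host the text would be read from /proc/meminfo.

```python
import re

KIB_PER_GIB = 1024 ** 2  # /proc/meminfo labels values "kB" but they are KiB


def memtotal_gib(meminfo_text: str) -> float:
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / KIB_PER_GIB


# Sample text carrying the MemTotal value reported in this issue.
sample = "MemTotal:       197223052 kB\nMemFree:         1234567 kB\n"
per_host_gib = memtotal_gib(sample)
cluster_gib = 15 * per_host_gib  # 15 identical client hosts

print(f"per host: {per_host_gib:.2f} GiB, cluster: {cluster_gib:.0f} GiB")
```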

summary.json host_memory_GB (per run):

[376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 187.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Note: 376.18 ≈ 2 × 188.09 and sum(array) = 2,821 GiB = the correct cluster total, but array[0] = 376.18 is a doubled value.


The miscalculation

| Quantity | Checker (buggy) | Correct |
| --- | --- | --- |
| Total host memory | num_hosts × array[0] = 15 × 376.18 = 5,642.7 GiB | sum(array) = 2,821 GiB |
| min_samples_memory = mem × 5 × 1024³ / record_length | 206,641 files | 103,313 files |
| Verdict vs 105,000 generated | FAIL ("actual files 105000 < minimum required 206641") | PASS |

The cluster memory is inflated by a factor of almost exactly 2 (5,642.7 / 2,821 ≈ 2.0), so the required dataset is inflated by the same factor. The submission is genuinely compliant (105,000 ≥ 103,313) but is falsely reported invalid.
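The arithmetic in the table can be reproduced with the sketch below. It uses the host_memory_GB values as printed in this issue (rounded to two decimals), so the results land within a few files of the table rather than matching it digit for digit. The helper name min_samples is illustrative, and no rounding rule is applied because the checker's rounding is not shown here.

```python
HOST_MEMORY_MULTIPLIER = 5
RECORD_LENGTH_BYTES = 146_600_628
NUM_HOSTS = 15
NUM_FILES_TRAIN = 105_000

# host_memory_GB as reported in summary.json (rounded values from this issue).
host_memory_gb = [376.18] * 7 + [187.90] + [0.0] * 7


def min_samples(total_gib: float) -> float:
    """Memory-based minimum sample count for a given total host memory in GiB."""
    return total_gib * HOST_MEMORY_MULTIPLIER * 1024**3 / RECORD_LENGTH_BYTES


buggy_total = NUM_HOSTS * host_memory_gb[0]   # what the checker computes today
correct_total = sum(host_memory_gb)           # aggregate of the reported values

buggy_min = min_samples(buggy_total)
correct_min = min_samples(correct_total)

for label, total, minimum in (
    ("buggy", buggy_total, buggy_min),
    ("correct", correct_total, correct_min),
):
    verdict = "PASS" if NUM_FILES_TRAIN >= minimum else "FAIL"
    print(f"{label:8s} {total:8.1f} GiB -> {minimum:9.0f} files -> {verdict}")
```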


Expected vs. actual behavior

  • Expected: total host memory used by rule 3.1.2 equals the real aggregate cluster RAM (~2,821 GiB), giving a minimum of 103,313 files → PASS.
  • Actual: total host memory is num_hosts × host_memory_GB[0] (~5,643 GiB), giving 206,641 files → FAIL.

Root cause

training_checks.py (rule 3.1.2) treats host_memory_GB as a homogeneous per-host scalar and scales host_memory_GB[0] by num_hosts. The array it receives from DLIO's summary.json is a per-rank/reduced array where entries are not one-per-host: some hosts appear at 2× (rank-doubled), some at 0. Its sum is correct; index [0] × num_hosts is not.

There are effectively two defects; the checker-side one is decisive:

  1. (Primary, checker) training_checks.py L164/L180: should aggregate the whole array (sum(host_memory_GB)), not num_hosts * host_memory_GB[0].
  2. (Secondary, DLIO summary) host_memory_GB in summary.json is emitted as a non-uniform per-rank array (duplicated/zero entries) rather than one clean value per physical host. Even if the checker is fixed, this array is misleading to any consumer that indexes it positionally.
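A quick way to see both defects on a given summary.json is to compare the two totals and count the irregular entries. The following is a hedged diagnostic sketch only: describe_host_memory and its output keys are hypothetical names, and the 1% tolerance is an arbitrary assumption.

```python
import math


def describe_host_memory(host_memory_gb, num_hosts):
    """Summarise how far a host_memory_GB array is from one value per host."""
    values = list(host_memory_gb)
    total = sum(values)
    index0_total = num_hosts * values[0] if values else 0.0
    return {
        "length_matches_num_hosts": len(values) == num_hosts,
        "zero_entries": sum(1 for v in values if v == 0),
        "distinct_nonzero": sorted({v for v in values if v > 0}),
        "sum_gib": total,
        "index0_times_hosts_gib": index0_total,
        # True only when positional scaling and summation give the same total.
        "totals_agree": math.isclose(total, index0_total, rel_tol=0.01),
    }


# Array from this issue: 7 doubled entries, 1 single entry, 7 zeros.
report = describe_host_memory([376.18] * 7 + [187.90] + [0.0] * 7, num_hosts=15)
for key, value in report.items():
    print(f"{key}: {value}")
```

For the array in this issue, the length matches num_hosts but the totals disagree, which is the signature of the positional-indexing problem described above.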

Proposed fix

In mlpstorage_py/submission_checker/checks/training_checks.py, rule 3.1.2, replace the index-[0]-times-num_hosts logic with a sum over the reported per-host values:

# Before
host_memory_gb    = summary.get("host_memory_GB", [0])[0]
...
total_host_memory = num_hosts * host_memory_gb

# After
host_memory_list  = summary.get("host_memory_GB", []) or []
total_host_memory = sum(host_memory_list)          # aggregate real cluster RAM
# (optionally) guard: if not total_host_memory: fall back / warn

sum(host_memory_GB) yields the correct 2,821 GiB regardless of how DLIO distributes the values across array positions.
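The optional guard mentioned in the "After" snippet could look like the sketch below. The function name total_host_memory_gb is hypothetical, and emitting a warning on an empty or all-zero array is an assumed policy; the checker may prefer to fail the rule or fall back to another source instead.

```python
import warnings


def total_host_memory_gb(summary: dict) -> float:
    """Aggregate host memory (GiB) from a DLIO summary dict by summing the array."""
    values = summary.get("host_memory_GB") or []  # tolerate a missing or null field
    total = float(sum(values))
    if total <= 0:
        # Assumed policy: surface the problem rather than silently using 0.
        warnings.warn("host_memory_GB is missing or sums to zero", stacklevel=2)
    return total


summary = {"host_memory_GB": [376.18] * 7 + [187.90] + [0.0] * 7}
print(f"{total_host_memory_gb(summary):.2f} GiB")
```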

Additionally (secondary), DLIO's summary.json should emit host_memory_GB as one value per unique host (length == num_hosts, no zero padding, no rank doubling) so positional consumers are robust.
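One way to produce one value per unique host is to key the per-rank reports by hostname and keep a single entry for each. This is only an illustration of the idea, not DLIO's actual code: per_host_memory and the (hostname, GiB) report format are assumptions, and in DLIO the reports would presumably be gathered across ranks rather than built in a list.

```python
def per_host_memory(rank_reports):
    """Collapse per-rank (hostname, mem_gib) reports into one value per host.

    Hosts are kept in first-seen order; later ranks on the same host are ignored.
    """
    per_host = {}
    for hostname, mem_gib in rank_reports:
        per_host.setdefault(hostname, mem_gib)
    return list(per_host.values())


# Hypothetical input: 15 hosts, 2 ranks each, every rank reporting 188.09 GiB.
rank_reports = [(f"host{i:02d}", 188.09) for i in range(15) for _ in range(2)]
host_memory_gb = per_host_memory(rank_reports)

print(len(host_memory_gb), f"{sum(host_memory_gb):.2f} GiB")
```

With an array of this shape, both sum(host_memory_GB) and num_hosts * host_memory_GB[0] give the same total on a homogeneous cluster.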


Workaround (submitter side)

No clean workaround exists that leaves the CLOSED code hash unchanged. The options are: (a) generate a dataset sized to the inflated requirement (~206,641 files, roughly 2× the storage), which is wasteful and only masks the bug, or (b) obtain a waiver/exception based on the verified true memory and the corrected calculation above.


Reproduction

  1. Run MLPerf Storage v3 training / UNet3D (CLOSED, file backend) on a multi-rank-per-host cluster (e.g., 2 accelerators/host) so DLIO's host_memory_GB array contains doubled/zero entries.
  2. Size the dataset to the correct memory-based minimum (sum(host_memory_GB) × 5 × 1024³ / record_length).
  3. Run mlpstorage validate <results-dir>.
  4. Rule 3.1.2 fails with dataset size mismatch: actual files N < minimum required 2N, where 2N reflects num_hosts × host_memory_GB[0] rather than sum(host_memory_GB).

Suggested references for the tracker

  • File: mlpstorage_py/submission_checker/checks/training_checks.py, rule trainingRecalculateDatasetSize (decorator L128), lines L137 (HOST_MEMORY_MULTIPLIER = 5), L164 (host_memory_GB[0]), L180 (num_hosts * host_memory_gb), L181 (min_samples_memory).
  • DLIO summary.json field host_memory_GB population (per-rank vs per-host).
