Bug: Submission checker rule 3.1.2 double-counts host memory, doubling the required dataset size (#669)


## Summary

Rule 3.1.2 recomputes the minimum required dataset size from total cluster host memory. It derives "total host memory" by reading **only the first element** of the per-host `host_memory_GB` array and multiplying it by `num_hosts`:

```python
# training_checks.py
host_memory_gb   = summary.get("host_memory_GB", [0])[0]   # L164  -> index [0] only
...
total_host_memory = num_hosts * host_memory_gb             # L180  -> assumes ALL hosts == host_memory_GB[0]
min_samples_memory = (total_host_memory * HOST_MEMORY_MULTIPLIER * 1024**3 / record_length)  # L181
```

This assumes the `host_memory_GB` array is homogeneous (every host equals index `[0]`). In practice, the `host_memory_GB` array in the DLIO-generated `summary.json` is **not** a clean one-value-per-host list: it contains doubled (2×) and zero entries, and only their **sum** equals the true total. When `host_memory_GB[0]` is one of the doubled entries, `num_hosts * host_memory_GB[0]` yields **roughly double** the real cluster memory, so the minimum-dataset-size requirement roughly doubles as well.

The correct total is `sum(host_memory_GB)`, not `num_hosts * host_memory_GB[0]`.

---

## Environment / evidence (real submission)

- Cluster: 15 client hosts, 30 accelerators (2 ranks/host), `reader.batch_size=7`.
- Per-host RAM (verified independently via `/proc/meminfo` on every host, and in the tool's own `collector-staging/cluster_info.json`): `MemTotal = 197,223,052 kB = 188.09 GiB` on all 15 hosts. **True total = 15 × 188.09 ≈ 2,821 GiB.**
- `record_length_bytes = 146,600,628`, `num_samples_per_file = 1`, `HOST_MEMORY_MULTIPLIER = 5`.
- Dataset generated: `num_files_train = 105,000`.

`summary.json` `host_memory_GB` (per run):

```
[376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 187.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Note: `376.18 ≈ 2 × 188.09` and `sum(array) = 2,821 GiB` = the correct cluster total, but `array[0] = 376.18` is a **doubled** value.
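
Both observations in the note can be checked directly from the array. The following is a minimal sketch that uses the values exactly as printed above; they are rounded to two decimals, so the results are approximate.

```python
# Values copied from the summary.json excerpt above (rounded to two decimals).
host_memory_gb = [376.18] * 7 + [187.90] + [0.0] * 7
num_hosts = 15

true_total = sum(host_memory_gb)               # aggregate over every entry
checker_total = num_hosts * host_memory_gb[0]  # what rule 3.1.2 computes today

print(f"entries               = {len(host_memory_gb)}")
print(f"sum(array)            = {true_total:,.2f} GiB")
print(f"num_hosts * array[0]  = {checker_total:,.2f} GiB")
print(f"inflation factor      = {checker_total / true_total:.4f}")
```

The inflation factor comes out marginally above 2 because the single non-doubled entry (187.90) is slightly below half of 376.18.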

---

## The miscalculation

| Quantity | Checker (buggy) | Correct |
|---|---|---|
| Total host memory | `num_hosts × array[0]` = 15 × 376.18 = **5,642.7 GiB** | `sum(array)` = **2,821 GiB** |
| `min_samples_memory` = mem × 5 × 1024³ / record_length | **206,641 files** | **103,313 files** |
| Verdict vs 105,000 generated | **FAIL** ("actual files 105000 < minimum required 206641") | **PASS** |

The cluster memory is inflated by almost exactly 2× (5,642.7 / 2,821 ≈ 2.0), so the required dataset is inflated by the same factor. The submission is genuinely compliant (105,000 ≥ 103,313) but is falsely reported as invalid.
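
The file counts in the table follow from the formula in the checker snippet. The sketch below reproduces the arithmetic under two assumptions: the memory totals are rebuilt from the rounded array values printed earlier, and the result is rounded only for display (the source does not show how the checker rounds). The function name is illustrative. Because of the rounded inputs, the computed counts land within a few files of the table's figures rather than matching them digit for digit.

```python
HOST_MEMORY_MULTIPLIER = 5
RECORD_LENGTH_BYTES = 146_600_628
GIB = 1024**3


def min_samples_memory(total_host_memory_gib: float) -> float:
    """Memory-based minimum sample count, using the formula from the checker snippet."""
    return total_host_memory_gib * HOST_MEMORY_MULTIPLIER * GIB / RECORD_LENGTH_BYTES


generated_files = 105_000  # num_samples_per_file = 1, so files == samples

required_buggy = min_samples_memory(15 * 376.18)             # num_hosts * array[0]
required_correct = min_samples_memory(7 * 376.18 + 187.90)   # sum(array)

for label, required in (("checker", required_buggy), ("sum", required_correct)):
    verdict = "PASS" if generated_files >= required else "FAIL"
    print(f"{label:8s} requires about {required:,.0f} files -> {verdict}")
```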

---

## Expected vs. actual behavior

- **Expected:** total host memory used by rule 3.1.2 equals the real aggregate cluster RAM (~2,821 GiB), giving a minimum of 103,313 files → PASS.
- **Actual:** total host memory is `num_hosts × host_memory_GB[0]` (~5,643 GiB), giving 206,641 files → FAIL.

---

## Root cause

`training_checks.py` (rule 3.1.2) treats `host_memory_GB` as a homogeneous per-host scalar and scales `host_memory_GB[0]` by `num_hosts`. The array it receives from DLIO's `summary.json` is a per-rank/reduced array where entries are not one-per-host: some hosts appear at 2× (rank-doubled), some at 0. Its **sum** is correct; **index [0] × num_hosts** is not.

There are effectively two defects; the checker-side one is decisive:

1. **(Primary, checker) `training_checks.py` L164/L180:** should aggregate the whole array (`sum(host_memory_GB)`), not `num_hosts * host_memory_GB[0]`.
2. **(Secondary, DLIO summary) `host_memory_GB`** in `summary.json` is emitted as a non-uniform per-rank array (duplicated/zero entries) rather than one clean value per physical host. Even if the checker is fixed, this array is misleading to any consumer that indexes it positionally.
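
To make the secondary defect concrete, here is a rough sketch of a check that flags an array that is not "one clean value per physical host". The function name and the 1% tolerance are assumptions made for illustration; nothing like this exists in the checker or in DLIO as far as this report knows.

```python
def host_memory_anomalies(values, num_hosts, rel_tol=0.01):
    """Return human-readable reasons why `values` is not one value per host."""
    problems = []
    if len(values) != num_hosts:
        problems.append(f"length {len(values)} != num_hosts {num_hosts}")
    zeros = sum(1 for v in values if v == 0)
    if zeros:
        problems.append(f"{zeros} zero entries")
    nonzero = [v for v in values if v > 0]
    # Flag a spread larger than rel_tol between the smallest and largest entry.
    if nonzero and max(nonzero) > min(nonzero) * (1 + rel_tol):
        problems.append(f"non-uniform values: min {min(nonzero)}, max {max(nonzero)}")
    return problems


observed = [376.18] * 7 + [187.90] + [0.0] * 7
for problem in host_memory_anomalies(observed, num_hosts=15):
    print(problem)
```

Note that the uniformity test would also fire on a genuinely heterogeneous cluster, so it is a hint that positional indexing is unsafe rather than proof of a DLIO bug.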

---

## Proposed fix

In `mlpstorage_py/submission_checker/checks/training_checks.py`, rule 3.1.2, replace the index-`[0]`-times-`num_hosts` logic with a sum over the reported per-host values:

```python
# Before
host_memory_gb    = summary.get("host_memory_GB", [0])[0]
...
total_host_memory = num_hosts * host_memory_gb

# After
host_memory_list  = summary.get("host_memory_GB", []) or []
total_host_memory = sum(host_memory_list)          # aggregate real cluster RAM
# (optionally) guard: if not total_host_memory: fall back / warn
```

`sum(host_memory_GB)` yields the correct 2,821 GiB regardless of how DLIO distributes the values across array positions.

Additionally (secondary), DLIO's `summary.json` should emit `host_memory_GB` as one value per unique host (length == `num_hosts`, no zero padding, no rank doubling) so positional consumers are robust.
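
For reference, the optional guard mentioned in the snippet could look roughly like the helper below. This is a sketch only: the helper name is hypothetical, and it assumes `summary` is the parsed `summary.json` dictionary with `host_memory_GB` as a top-level key, as in the checker excerpt. How the checker should react to a missing or zero total (warn, skip the rule, or fail) is left open here.

```python
import warnings


def total_host_memory_gib(summary: dict) -> float:
    """Sum every reported entry instead of scaling entry [0] by num_hosts."""
    values = summary.get("host_memory_GB") or []
    total = float(sum(values))
    if total <= 0:
        # Hypothetical handling: surface the problem instead of silently using 0.
        warnings.warn(
            "host_memory_GB is missing or sums to zero; "
            "the memory-based minimum cannot be computed"
        )
    return total


example = {"host_memory_GB": [376.18] * 7 + [187.90] + [0.0] * 7}
print(f"{total_host_memory_gib(example):,.2f} GiB")
```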

---

## Workaround (submitter side)

There is no clean workaround that avoids altering the CLOSED code hash. The options are: (a) generate a dataset sized to the inflated requirement (~206,641 files, roughly 2× the storage), which is wasteful and only masks the bug; or (b) obtain a waiver/exception on the basis of the verified true memory and the corrected calculation above.

---

## Reproduction

1. Run MLPerf Storage v3 training / UNet3D (CLOSED, `file` backend) on a multi-rank-per-host cluster (e.g., 2 accelerators/host) so DLIO's `host_memory_GB` array contains doubled/zero entries.
2. Size the dataset to the **correct** memory-based minimum (`sum(host_memory_GB) × 5 × 1024³ / record_length`).
3. Run `mlpstorage validate <results-dir>`.
4. Rule 3.1.2 fails with `dataset size mismatch: actual files N < minimum required M`, where `M` is roughly `2N` because it reflects `num_hosts × host_memory_GB[0]` rather than `sum(host_memory_GB)`.

---

## Suggested references for the tracker

- File: `mlpstorage_py/submission_checker/checks/training_checks.py`, rule `trainingRecalculateDatasetSize` (decorator L128), lines L137 (`HOST_MEMORY_MULTIPLIER = 5`), **L164** (`host_memory_GB[0]`), **L180** (`num_hosts * host_memory_gb`), L181 (`min_samples_memory`).
- DLIO `summary.json` field `host_memory_GB` population (per-rank vs per-host).
