@wolfgang-desalvador asked how the training scores are aggregated within and across invocations of DLIO. Here is what Claude tells me:
❯ Excellent research, thank you. So, the per-DLIO-invocation aggregation is the mean() of the scores from the 5
epochs within that invocation. The mlpstorage tool doesn't currently do any aggregation of the scores returned
by each of the 5 (I'm excluding the warmup run) invocations of DLIO, but it needs to. Is that an accurate
summary?
● Yes, that's an accurate summary, with one small clarification worth calling out on the per-invocation side.
Per-DLIO-invocation aggregation (single summary.json) — accurate:
- The headline train_au_mean_percentage is np.mean(train_au) across the 5 per-epoch AUs.
- The headline B/W (train_throughput_mean_samples_per_second, train_io_mean_MB_per_second) is
np.mean(train_throughput) across the 5 per-epoch throughputs (converted to MiB/s via record_size).
- Small clarification: the per-epoch AU/throughput values that DLIO averages are themselves step-weighted
means of the block-level values within the epoch (statscounter.py:end_train, au = np.sum(au*steps)/np.sum(steps)).
So the invocation-level score is a mean of step-weighted per-epoch means. You can accurately call it "the
mean of the scores from the 5 epochs" for user-facing purposes; the only caveat is that the epoch-level number
isn't a naive average of blocks.
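The two-level reduction described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not DLIO's actual code: the function names, the (block values, block steps) data layout, and the sample numbers are all assumptions made for the example. Only the step-weighted formula and the plain mean across epochs come from the discussion.

```python
from statistics import fmean


def epoch_au(block_au, block_steps):
    """Step-weighted mean of block-level AU values within one epoch.

    Mirrors the formula quoted above: sum(au * steps) / sum(steps).
    """
    total_steps = sum(block_steps)
    if total_steps == 0:
        raise ValueError("epoch has no steps")
    return sum(a * s for a, s in zip(block_au, block_steps)) / total_steps


def invocation_au(epochs):
    """Plain (unweighted) mean of the per-epoch AU values."""
    return fmean(epoch_au(au, steps) for au, steps in epochs)


# Hypothetical data: each epoch is (block AU values, steps per block).
epochs = [
    ([90.0, 100.0], [1, 3]),  # step-weighted: (90*1 + 100*3) / 4 = 97.5
    ([80.0, 100.0], [1, 1]),  # step-weighted: (80 + 100) / 2 = 90.0
]

print(invocation_au(epochs))  # mean of 97.5 and 90.0
```

In the first hypothetical epoch a naive average of the two blocks would give 95.0 rather than 97.5, which is the distinction the clarification is drawing.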
Cross-invocation aggregation (across the 5 measured invocations) — accurate:
- Neither DLIO nor mlpstorage currently reduces the five per-invocation values into a single per-submission AU% or
B/W.
- DLIO never sees more than one invocation, so this is properly mlpstorage's responsibility.
- mlpstorage has two stub sites where it belongs:
  - mlpstorage_py/report_generator.py:423, where the workload-group Result is built with
    metrics={} # TODO: Add function to aggregate metrics
  - mlpstorage_py/submission_checker/results.py:56–69, where ResultExporter.add_result() emits empty strings
    for the 3D-Unet - Read B/W (GiB/s) column and its peers
- And the aggregation method itself is undefined in Rules.md — §2.1.17 only says the warmup "will not be included
in the reported performance," but doesn't specify whether the remaining 5 are combined by mean, min, geometric
mean, or something else.
So "mlpstorage needs to aggregate the 5 measured invocations" is correct, and there are actually two open
questions: (1) implement the code, and (2) decide/document in Rules.md which reduction to use — the current
codebase, the current spec, and the current advisor folklore ("it's the min") all disagree, so a written rule is a
prerequisite to the implementation.
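Because Rules.md does not yet define the reduction, the following is only a sketch of what a cross-invocation helper could look like once a rule is chosen. `aggregate_runs`, the `REDUCTIONS` table, and the sample AU values are hypothetical and are not part of mlpstorage. The sketch mainly shows that the candidate reductions give different answers for the same five measured invocations.

```python
from statistics import fmean, geometric_mean

# Candidate reductions mentioned in the discussion. Which one is correct
# is an open question until Rules.md specifies it.
REDUCTIONS = {
    "mean": fmean,
    "min": min,
    "geomean": geometric_mean,
}


def aggregate_runs(values, method):
    """Reduce per-invocation scores (warmup already excluded) to one number."""
    if method not in REDUCTIONS:
        raise ValueError(f"unknown reduction: {method!r}")
    if not values:
        raise ValueError("no measured invocations to aggregate")
    return REDUCTIONS[method](values)


# Hypothetical AU% from the 5 measured invocations.
run_au = [92.0, 95.0, 91.0, 96.0, 94.0]

for name in REDUCTIONS:
    print(name, aggregate_runs(run_au, name))
```

Keeping the reduction as a named parameter means the choice made in Rules.md can be changed in one place without touching the code that collects the per-invocation values.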