mlpstorage multi-run score aggregation

@wolfgang-desalvador asked how the scores are aggregated across and within invocations of DLIO for training. Here is what Claude tells me:

❯ Excellent research, thank you. So, the per-DLIO-invocation aggregation is the mean() of the scores from the 5
  epochs within that invocation.  The mlpstorage tool doesn't currently do any aggregation of the scores returned
  by each of the 5 (I'm excluding the warmup run) invocations of DLIO, but it needs to.  Is that an accurate
  summary?

● Yes, that's an accurate summary, with one small clarification worth calling out on the per-invocation side.

  Per-DLIO-invocation aggregation (single summary.json) — accurate:
  - The headline train_au_mean_percentage is np.mean(train_au) across the 5 per-epoch AUs.
  - The headline B/W (train_throughput_mean_samples_per_second, train_io_mean_MB_per_second) is
  np.mean(train_throughput) across the 5 per-epoch throughputs (converted to MiB/s via record_size).
  - Small clarification: the per-epoch AU/throughput values that DLIO averages are themselves step-weighted
  means of the block-level values within the epoch (statscounter.py:end_train, au = np.sum(au*steps)/np.sum(steps)).
  So the invocation-level score is a mean of step-weighted per-epoch means. You can accurately call it "the
  mean of the scores from the 5 epochs" for user-facing purposes; the epoch-level number just isn't a naive
  average of blocks.
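
The two-level reduction described above can be sketched as follows. This is a minimal illustration, not DLIO's code: the function names `epoch_score` and `invocation_score` and the sample numbers are made up, and plain Python stands in for the NumPy calls quoted above (`np.sum(au*steps)/np.sum(steps)`, then `np.mean`).

```python
from statistics import fmean


def epoch_score(block_values, block_steps):
    """Step-weighted mean of the block-level values within one epoch.

    Plain-Python equivalent of np.sum(au * steps) / np.sum(steps).
    """
    weighted_sum = sum(v * s for v, s in zip(block_values, block_steps))
    return weighted_sum / sum(block_steps)


def invocation_score(epochs):
    """Unweighted mean of the per-epoch scores (the headline value)."""
    return fmean(epoch_score(values, steps) for values, steps in epochs)


# Hypothetical data: two epochs, each with two blocks of (AU %, step count).
epochs = [
    ([90.0, 100.0], [10, 30]),  # step-weighted: (900 + 3000) / 40 = 97.5
    ([80.0, 90.0], [20, 20]),   # step-weighted: (1600 + 1800) / 40 = 85.0
]

per_epoch = [epoch_score(values, steps) for values, steps in epochs]
score = invocation_score(epochs)
print(per_epoch, score)
```

In the first made-up epoch the step-weighted value is 97.5, whereas a naive average of the two blocks would give 95.0; that gap is the distinction the note above is drawing.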

  Cross-invocation aggregation (across the 5 measured invocations) — accurate:
  - Neither DLIO nor mlpstorage currently reduces the five per-invocation values into a single per-submission AU% or
   B/W.
  - DLIO never sees more than one invocation, so this is properly mlpstorage's responsibility.
  - mlpstorage has two stub sites where it belongs:
    - mlpstorage_py/report_generator.py:423 — workload-group Result built with metrics={}  # TODO: Add function to
  aggregate metrics
    - mlpstorage_py/submission_checker/results.py:56–69 — ResultExporter.add_result() emits empty strings for the
  3D-Unet - Read B/W (GiB/s) column and its peers
  - And the aggregation method itself is undefined in Rules.md — §2.1.17 only says the warmup "will not be included
  in the reported performance," but doesn't specify whether the remaining 5 are combined by mean, min, geometric
  mean, or something else.

  So "mlpstorage needs to aggregate the 5 measured invocations" is correct, and there are actually two open
  items: (1) implement the code, and (2) decide and document in Rules.md which reduction to use. The current
  codebase, the current spec, and the current advisor folklore ("it's the min") all disagree, so a written rule is a
  prerequisite to the implementation.
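
To make the open choice concrete, here is a rough sketch of what a cross-invocation reducer could look like once Rules.md picks a method. Everything in it is an assumption for illustration: `aggregate_invocations`, the `REDUCTIONS` table, and the sample AU values do not exist in mlpstorage, and the three candidate reductions are simply the ones named above (mean, min, geometric mean).

```python
from statistics import fmean, geometric_mean

# Candidate reductions named in the discussion; which one applies is
# undecided, so the caller must choose explicitly.
REDUCTIONS = {
    "mean": fmean,
    "min": min,
    "geomean": geometric_mean,
}


def aggregate_invocations(scores, reduction):
    """Reduce per-invocation scores (warmup already excluded) to one value."""
    if not scores:
        raise ValueError("no measured invocations to aggregate")
    if reduction not in REDUCTIONS:
        raise ValueError(f"unknown reduction: {reduction!r}")
    return float(REDUCTIONS[reduction](scores))


# Hypothetical train_au_mean_percentage values from 5 measured invocations.
measured_au = [92.0, 95.0, 91.0, 94.0, 93.0]

results = {name: aggregate_invocations(measured_au, name) for name in REDUCTIONS}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Even on this tidy made-up data the three reductions give different answers (the min is 91.0 while the mean is 93.0), which is why the reduction needs to be written down before either stub site is filled in.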

mlpstorage multi-run score aggregation #645
