feat(ai): tiny-AI 3-arch LOSO evaluation harness + Research-0023 #176

Merged
lusoris merged 1 commit into master from feat/loso-3arch-eval
Apr 28, 2026

Conversation

@lusoris lusoris commented Apr 28, 2026

Summary

Extends PR #165's eval_loso_mlp_small.py to score all three ADR-0203 architectures (mlp_small, mlp_medium, linear) on their respective LOSO folds in one pass. New ai/scripts/eval_loso_3arch.py reuses the _load_session external-data workaround + _load_clip per-clip cache loader from PR #165 — no helper duplication.
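
For readers unfamiliar with the protocol: LOSO is read here as leave-one-source-out, given 9 folds over a 9-source corpus. Each fold holds out every clip derived from one source and trains on the rest. The sketch below is only an illustration of that split, not the code in eval_loso_3arch.py; the `loso_folds` name and the `source` key on each clip record are assumptions made for the example.

```python
from collections import defaultdict


def loso_folds(clips):
    """Yield (held_out_source, train_clips, test_clips), one fold per source.

    `clips` is any iterable of dicts carrying a "source" key
    (hypothetical record layout, chosen for this sketch).
    """
    by_source = defaultdict(list)
    for clip in clips:
        by_source[clip["source"]].append(clip)

    # Sorted so the fold order is deterministic across runs.
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = [
            clip
            for source, group in by_source.items()
            if source != held_out
            for clip in group
        ]
        yield held_out, train, test
```

Because the split is by source rather than by clip, no distorted version of a held-out source can leak into training, which is what makes the per-fold scores a fair generalisation estimate.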

Headline results (Netflix Public corpus, 9 folds × 30 epochs each)

| arch | params | mean PLCC | mean SROCC | mean RMSE |
|---|---|---|---|---|
| mlp_small | 257 | 0.9808 ± 0.0214 | 0.9848 ± 0.0176 | 14.907 ± 2.218 |
| mlp_medium | 2 561 | 0.9727 ± 0.0202 | 0.9794 ± 0.0156 | 10.848 ± 2.302 |
| linear | 7 | 0.3679 ± 0.0773 | 0.4861 ± 0.0975 | 57.868 ± 5.867 |
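
The three columns are standard agreement metrics summarised as mean ± std across the 9 folds. A minimal, dependency-free sketch of how such numbers can be computed follows. It is not the harness's implementation: the function names are made up for illustration, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the PR does not say whether sample or population std is reported.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(mean((p - t) ** 2 for p, t in zip(pred, target)))


def summarize(per_fold_values):
    """Collapse one metric's per-fold values into (mean, sample std)."""
    return mean(per_fold_values), stdev(per_fold_values)
```

PLCC and SROCC are scale-invariant while RMSE is not, which is why one model can lead on correlation and another on absolute error.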

Confirms ADR-0203's earlier single-split finding under proper LOSO:

  • mlp_small wins on ranking — default vmaf_tiny_v1.onnx stays.
  • mlp_medium wins on absolute fit (~27 % RMSE reduction) — alternate vmaf_tiny_v1_medium.onnx stays for absolute-VMAF-agreement users.
  • linear is a clear sanity floor (PLCC 0.37 → 6 features carry strong non-linear signal). Does not ship.
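
The "~27 %" figure follows directly from the mean RMSE column of the headline table. As a quick check, using only the two values reported above (variable names are illustrative):

```python
# Mean RMSE values copied from the headline table.
rmse_small = 14.907
rmse_medium = 10.848

# Relative RMSE reduction of mlp_medium versus mlp_small.
reduction = (rmse_small - rmse_medium) / rmse_small
print(f"{reduction:.1%}")  # prints 27.2%
```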

FoxBird is the per-fold outlier on both MLPs (lowest PLCC ≈ 0.93 on both architectures). Because the same fold is the outlier for both, this points to a content-distribution mismatch within the existing 9-source Netflix Public corpus rather than arch-specific overfitting. The corpus at .workingdir2/netflix/ is already the full Netflix Public Dataset, so the unblocker for that variance is a different or larger training corpus (KoNViD-1k, BVI-DVC, AOM-CTC source sets), not "more Netflix Public".
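
The outlier argument can be expressed as a small check: find the lowest-PLCC fold for each architecture and see whether all architectures agree on it. This is a hedged sketch; the helper names and the `{arch: {source: plcc}}` input layout are assumptions, and the per-fold values themselves live in Research-0023 rather than here.

```python
def worst_fold(per_fold_plcc):
    """Return (source, plcc) for the fold with the lowest PLCC."""
    source = min(per_fold_plcc, key=per_fold_plcc.get)
    return source, per_fold_plcc[source]


def shared_outlier(results_by_arch):
    """Return the source that is the worst fold for every arch, else None.

    `results_by_arch` maps an arch name to a {source: plcc} dict
    (hypothetical layout for this sketch).
    """
    worst_sources = {worst_fold(folds)[0] for folds in results_by_arch.values()}
    return worst_sources.pop() if len(worst_sources) == 1 else None
```

A non-`None` result means the weakness tracks the held-out content rather than any one model, which is the reasoning used above.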

What changed

| File | Change |
|---|---|
| ai/scripts/eval_loso_3arch.py | new harness; reuses PR #165 helpers |
| docs/research/0023-loso-3arch-results.md | new research digest with per-fold tables + reproducer |
| CHANGELOG.md | Unreleased § Added |
| docs/rebase-notes.md | new entry 0072 |

Deep-dive deliverables (ADR-0108)

  • Research digest: docs/research/0023-loso-3arch-results.md
  • Decision matrix: no alternatives needed, because this extends PR #165's harness; the only design decision (reuse vs fork the helpers) is settled inline (reuse).
  • AGENTS.md invariant note: no rebase-sensitive AGENTS invariants needed; rebase note 0072 captures the helper-reuse invariant.
  • Reproducer / smoke-test command: see "How to reproduce" in Research-0023 §6
  • CHANGELOG.md: Unreleased § Added
  • Rebase note: docs/rebase-notes.md entry 0072

Test plan

  • pre-commit run --files on touched files — passed
  • python ai/scripts/eval_loso_3arch.py — reproduces the headline numbers (PLCC 0.9808 / 0.9727 / 0.3679 across the three arch)
  • All 27 fold ONNX models (3 arch × 9 folds) load cleanly via PR #165's _load_session helper

🤖 Generated with Claude Code

@lusoris lusoris force-pushed the feat/loso-3arch-eval branch 2 times, most recently from cc95505 to 1f1f03a Compare April 28, 2026 19:42
Extends PR #165's `eval_loso_mlp_small.py` to score `mlp_small`,
`mlp_medium`, and `linear` regressors on their respective LOSO folds
in one pass. New `ai/scripts/eval_loso_3arch.py` reuses the
`_load_session` external-data workaround + `_load_clip` per-clip
cache loader from the PR #165 helper.

Headline results on the Netflix corpus (9 folds × 30 epochs each,
mean ± std across folds):

  arch          params  mean PLCC          mean SROCC         mean RMSE
  mlp_small     257     0.9808 ± 0.0214    0.9848 ± 0.0176    14.907 ± 2.218
  mlp_medium    2 561   0.9727 ± 0.0202    0.9794 ± 0.0156    10.848 ± 2.302
  linear        7       0.3679 ± 0.0773    0.4861 ± 0.0975    57.868 ± 5.867

Confirms ADR-0203's earlier single-split finding under proper LOSO:

* mlp_small wins on ranking (default `vmaf_tiny_v1.onnx` stays).
* mlp_medium wins on absolute fit (~27 % RMSE reduction; alternate
  `vmaf_tiny_v1_medium.onnx` stays for absolute-VMAF-agreement
  users).
* linear is a clear sanity floor (PLCC 0.37 → 6 features carry
  strong non-linear signal that linear can't capture). Does not
  ship.

FoxBird is the per-fold outlier on both MLPs (lowest PLCC ≈ 0.93 on
both arch). Same outlier on both arch rules out arch-specific
overfitting; it's a corpus-distribution issue. T6-1a (Netflix
Public Dataset access) is the natural unblocker.

Per-fold tables + cross-arch observations + reproducer in
docs/research/0023-loso-3arch-results.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris force-pushed the feat/loso-3arch-eval branch from 1f1f03a to 1db3e4b Compare April 28, 2026 19:53
@lusoris lusoris merged commit 0e483a3 into master Apr 28, 2026
49 checks passed
@lusoris lusoris deleted the feat/loso-3arch-eval branch April 28, 2026 20:18
@github-actions github-actions Bot mentioned this pull request Apr 28, 2026