feat(ai): tiny-AI 3-arch LOSO evaluation harness + Research-0023 by lusoris · Pull Request #176 · lusoris/vmaf

lusoris · 2026-04-28T19:06:09Z

Summary

Extends PR #165's `eval_loso_mlp_small.py` to score all three ADR-0203 architectures (`mlp_small`, `mlp_medium`, `linear`) on their respective LOSO folds in one pass. The new `ai/scripts/eval_loso_3arch.py` reuses the `_load_session` external-data workaround and the `_load_clip` per-clip cache loader from PR #165, so no helpers are duplicated.
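For readers unfamiliar with the fold layout, here is a minimal sketch of how LOSO folds can be built. It assumes LOSO means leave-one-source-out (one fold per source, which would match the 9 folds over the 9-source corpus). The function name `loso_folds`, the `(source, clip_id)` tuple layout, and the toy source names are hypothetical and are not taken from `eval_loso_3arch.py`.

```python
from collections import defaultdict


def loso_folds(clips):
    """Yield (held_out_source, train_clips, test_clips), one fold per source.

    `clips` is a list of (source_name, clip_id) pairs. This layout is an
    assumption for illustration, not the harness's actual data structure.
    """
    by_source = defaultdict(list)
    for source, clip_id in clips:
        by_source[source].append(clip_id)

    for held_out in sorted(by_source):
        # Every clip of the held-out source goes to the test side ...
        test = list(by_source[held_out])
        # ... and every clip of every other source goes to the train side.
        train = [
            clip_id
            for source in sorted(by_source)
            if source != held_out
            for clip_id in by_source[source]
        ]
        yield held_out, train, test


# Toy corpus: three sources with two distorted clips each (illustrative only).
toy_clips = [
    ("SourceA", "a_0"), ("SourceA", "a_1"),
    ("SourceB", "b_0"), ("SourceB", "b_1"),
    ("SourceC", "c_0"), ("SourceC", "c_1"),
]
folds = list(loso_folds(toy_clips))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Splitting by source rather than by clip keeps every distorted version of a given source out of training when that source is being scored, which is what makes the per-fold numbers a test of generalisation to unseen content.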

Headline results (Netflix Public corpus, 9 folds × 30 epochs each)

| arch | params | mean PLCC | mean SROCC | mean RMSE |
| --- | --- | --- | --- | --- |
| `mlp_small` | 257 | 0.9808 ± 0.0214 | 0.9848 ± 0.0176 | 14.907 ± 2.218 |
| `mlp_medium` | 2 561 | 0.9727 ± 0.0202 | 0.9794 ± 0.0156 | 10.848 ± 2.302 |
| `linear` | 7 | 0.3679 ± 0.0773 | 0.4861 ± 0.0975 | 57.868 ± 5.867 |
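The three reported metrics and the mean ± std aggregation can be sketched as follows. This is an illustration using NumPy only, not the harness's code: the helper names are hypothetical, and the use of the population standard deviation (`np.std` default, `ddof=0`) is an assumption, since the PR does not say which estimator produced the ± values.

```python
import numpy as np


def _average_ranks(x):
    """Rank values from 1..n, assigning tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks


def fold_metrics(pred, target):
    """PLCC, SROCC and RMSE for one held-out fold."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # PLCC: Pearson correlation of the raw scores.
    plcc = np.corrcoef(pred, target)[0, 1]
    # SROCC: Pearson correlation of the ranks (Spearman's definition).
    srocc = np.corrcoef(_average_ranks(pred), _average_ranks(target))[0, 1]
    # RMSE: root of the mean squared prediction error.
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    return {"plcc": float(plcc), "srocc": float(srocc), "rmse": float(rmse)}


def summarize(per_fold):
    """Mean and std of each metric across folds (ddof=0 is an assumption)."""
    return {
        key: (float(np.mean([f[key] for f in per_fold])),
              float(np.std([f[key] for f in per_fold])))
        for key in per_fold[0]
    }


# Toy example: a monotone but non-linear relation gives SROCC = 1 and PLCC < 1.
toy = fold_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
print(toy)
```

The toy case shows why the table reports both correlations: SROCC only rewards correct ordering, while PLCC and RMSE also penalise deviations from a linear, correctly scaled fit.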

Confirms ADR-0203's earlier single-split finding under proper LOSO:

* `mlp_small` wins on ranking (highest mean PLCC and SROCC), so the default `vmaf_tiny_v1.onnx` stays.
* `mlp_medium` wins on absolute fit (~27 % lower mean RMSE than `mlp_small`), so the alternate `vmaf_tiny_v1_medium.onnx` stays for absolute-VMAF-agreement users.
* `linear` is a clear sanity floor (PLCC 0.37): the 6 features carry strong non-linear signal that a linear model can't capture. Does not ship.
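The ~27 % figure follows directly from the mean RMSE column of the headline table; a quick check, using only the two values reported above:

```python
# Mean RMSE values copied from the headline results table.
rmse_small = 14.907
rmse_medium = 10.848

# Relative reduction of mlp_medium's mean RMSE versus mlp_small's.
reduction = (rmse_small - rmse_medium) / rmse_small
print(f"{reduction:.1%}")  # prints 27.2%
```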

FoxBird is the per-fold outlier on both MLPs (lowest PLCC ≈ 0.93 on both architectures). Because the same fold is the outlier for both, this points to a content-distribution mismatch within the existing 9-source Netflix Public corpus rather than arch-specific overfitting. The corpus at .workingdir2/netflix/ is already the full Netflix Public Dataset, so the unblocker for that variance is a different or larger training corpus (KoNViD-1k, BVI-DVC, AOM-CTC source sets), not "more Netflix Public".

What changed

| File | Change |
| --- | --- |
| `ai/scripts/eval_loso_3arch.py` | new harness; reuses PR #165 helpers |
| `docs/research/0023-loso-3arch-results.md` | new research digest with per-fold tables + reproducer |
| `CHANGELOG.md` | Unreleased § Added |
| `docs/rebase-notes.md` | new entry 0072 |

Deep-dive deliverables (ADR-0108)

* Research digest: `docs/research/0023-loso-3arch-results.md`
* Decision matrix: no alternatives needed. This PR extends PR #165's harness; the only design decision (reuse vs fork the helpers) is settled inline (reuse).
* AGENTS.md invariant note: no rebase-sensitive AGENTS invariants needed; rebase note 0072 captures the helper-reuse invariant.
* Reproducer / smoke-test command: see "How to reproduce" in Research-0023 §6
* CHANGELOG.md: Unreleased § Added
* Rebase note: `docs/rebase-notes.md` entry 0072

Test plan

* `pre-commit run --files` on touched files: passed
* `python ai/scripts/eval_loso_3arch.py`: reproduces the headline numbers (PLCC 0.9808 / 0.9727 / 0.3679 across the three architectures)
* All 27 fold ONNX files (3 architectures × 9 folds) load cleanly via the PR #165 `_load_session` helper

🤖 Generated with Claude Code

Extends PR #165's `eval_loso_mlp_small.py` to score `mlp_small`, `mlp_medium`, and `linear` regressors on their respective LOSO folds in one pass. New `ai/scripts/eval_loso_3arch.py` reuses the `_load_session` external-data workaround + `_load_clip` per-clip cache loader from the PR #165 helper.

Headline results on the Netflix corpus (9 folds × 30 epochs each, mean ± std across folds):

| arch | params | mean PLCC | mean SROCC | mean RMSE |
| --- | --- | --- | --- | --- |
| mlp_small | 257 | 0.9808 ± 0.0214 | 0.9848 ± 0.0176 | 14.907 ± 2.218 |
| mlp_medium | 2 561 | 0.9727 ± 0.0202 | 0.9794 ± 0.0156 | 10.848 ± 2.302 |
| linear | 7 | 0.3679 ± 0.0773 | 0.4861 ± 0.0975 | 57.868 ± 5.867 |

Confirms ADR-0203's earlier single-split finding under proper LOSO:

* mlp_small wins on ranking (default `vmaf_tiny_v1.onnx` stays).
* mlp_medium wins on absolute fit (~27 % RMSE reduction; alternate `vmaf_tiny_v1_medium.onnx` stays for absolute-VMAF-agreement users).
* linear is a clear sanity floor (PLCC 0.37 → 6 features carry strong non-linear signal that linear can't capture). Does not ship.

FoxBird is the per-fold outlier on both MLPs (lowest PLCC ≈ 0.93 on both arch). Same outlier on both arch rules out arch-specific overfitting; it's a corpus-distribution issue. T6-1a (Netflix Public Dataset access) is the natural unblocker.

Per-fold tables + cross-arch observations + reproducer in docs/research/0023-loso-3arch-results.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lusoris force-pushed the feat/loso-3arch-eval branch 2 times, most recently from cc95505 to 1f1f03a · April 28, 2026 19:42

lusoris mentioned this pull request Apr 28, 2026

feat(ai): combined Netflix + KoNViD-1k tiny-AI trainer driver #180

Merged

8 tasks

lusoris force-pushed the feat/loso-3arch-eval branch from 1f1f03a to 1db3e4b · April 28, 2026 19:53

lusoris merged commit 0e483a3 into master Apr 28, 2026
49 checks passed

lusoris deleted the feat/loso-3arch-eval branch April 28, 2026 20:18

github-actions Bot mentioned this pull request Apr 28, 2026

chore: release master #1

Open
