feat(psnr_hvs): NEON aarch64 sister port — 8×8 integer DCT vectorized (T3-5-neon, ADR-0160) by lusoris · Pull Request #97 · lusoris/vmaf

lusoris · 2026-04-24T17:17:34Z

Summary

Sister port to ADR-0159 (AVX2): adds libvmaf/src/feature/arm64/psnr_hvs_neon.c under
the same byte-for-byte Netflix-golden bit-exactness contract.
NEON's int32x4_t is 4 lanes wide, so each 8-column row splits into
lo (cols 0-3) + hi (cols 4-7); the 30-butterfly network runs twice
per DCT pass (once per half), and the 8×8 transpose decomposes into four 4×4
transposes built from vtrn1q_s32 / vtrn2q_s32 / vtrn1q_s64 / vtrn2q_s64,
plus a top-right ↔ bottom-left block swap.
Float accumulators stay scalar per the ADR-0139/0159 rule;
accumulate_error() threads the outer ret by pointer (the
summation-order lesson inherited from ADR-0159: a local float
accumulator would drift the Netflix golden by ~5.5e-5).
With this port the ISA-parity matrix for psnr_hvs closes: scalar + AVX2 + NEON.
Closes backlog T3-5-neon.
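
The transpose decomposition described above can be sketched in portable scalar C. This is an illustration only, not the code in psnr_hvs_neon.c: the helper names (`transpose4x4_block`, `swap_off_diagonal_blocks`, `transpose8x8_blocked`) are hypothetical, and plain loops stand in for the vtrn1q/vtrn2q intrinsics. It shows why four independent 4×4 transposes followed by one block swap give a full 8×8 transpose.

```c
#include <assert.h>
#include <stdint.h>

#define N 8 /* block size */
#define H 4 /* half width: the lane count of one int32x4_t */

/* Transpose, in place, the 4x4 sub-block whose top-left corner is (r0, c0). */
static void transpose4x4_block(int32_t m[N][N], int r0, int c0)
{
    assert(r0 == 0 || r0 == H);
    assert(c0 == 0 || c0 == H);
    for (int i = 0; i < H; i++) {
        for (int j = i + 1; j < H; j++) {
            int32_t tmp = m[r0 + i][c0 + j];
            m[r0 + i][c0 + j] = m[r0 + j][c0 + i];
            m[r0 + j][c0 + i] = tmp;
        }
    }
}

/* Swap the top-right block (rows 0-3, cols 4-7) with the
 * bottom-left block (rows 4-7, cols 0-3). */
static void swap_off_diagonal_blocks(int32_t m[N][N])
{
    for (int i = 0; i < H; i++) {
        for (int j = 0; j < H; j++) {
            int32_t tmp = m[i][H + j];
            m[i][H + j] = m[H + i][j];
            m[H + i][j] = tmp;
        }
    }
}

/* [[A, B], [C, D]]^T == [[A^T, C^T], [B^T, D^T]]:
 * transpose each quadrant, then exchange the off-diagonal quadrants. */
static void transpose8x8_blocked(int32_t m[N][N])
{
    transpose4x4_block(m, 0, 0);
    transpose4x4_block(m, 0, H);
    transpose4x4_block(m, H, 0);
    transpose4x4_block(m, H, H);
    swap_off_diagonal_blocks(m);
}

/* Returns 1 when the blocked transpose matches the textbook definition. */
static int transpose8x8_selfcheck(void)
{
    int32_t m[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i * N + j;
    transpose8x8_blocked(m);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (m[i][j] != j * N + i)
                return 0;
    return 1;
}
```

Calling `transpose8x8_selfcheck()` from a test harness confirms the identity on a block with 64 distinct values. In a NEON version each quadrant would presumably live in four int32x4_t registers, so the final swap can amount to renaming registers rather than moving data.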

Test plan

- test_psnr_hvs_neon under qemu-aarch64-static: 5/5 DCT subtests pass (3 xorshift seeds + delta + constant)
- Netflix golden 576×324 8-bit pair scalar-vs-NEON CLI diff under QEMU: byte-identical psnr_hvs_{y,cb,cr} scores
- meson test -C build x86: 36/36 (no AVX2 regression)
- clang-tidy clean on all touched files (build + build-aarch64)
- assertion-density PASS, copyright PASS, pre-commit PASS
- CI native-aarch64 job covers 1080p 10-bit pairs (QEMU segfaults on heavy 10-bit threadpool, a known emulator limit)
- Netflix CPU Golden Tests gate
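
For readers unfamiliar with the three input families named in the first item (xorshift seeds, delta, constant), here is a minimal sketch of how such 8×8 test blocks could be generated and compared bit-for-bit. Everything in it is an assumption for illustration: the function names, the 8-bit sample range, the position of the delta sample, and the use of Marsaglia's xorshift32 variant are not taken from test_psnr_hvs_neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK_SAMPLES 64 /* one 8x8 block, stored row-major */

/* Marsaglia's xorshift32; the state must never be zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Pseudo-random block: same seed, same block, on every platform. */
static void fill_random(int32_t *blk, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = 0; i < BLK_SAMPLES; i++)
        blk[i] = (int32_t)(xorshift32(&state) & 0xFF); /* assumed 8-bit range */
}

/* Delta block: a single non-zero sample (placed at index 0 here). */
static void fill_delta(int32_t *blk, int32_t amplitude)
{
    memset(blk, 0, sizeof(int32_t) * BLK_SAMPLES);
    blk[0] = amplitude;
}

/* Constant block: every sample equal, so only the DC term is excited. */
static void fill_constant(int32_t *blk, int32_t value)
{
    for (size_t i = 0; i < BLK_SAMPLES; i++)
        blk[i] = value;
}

/* Bit-exactness check: outputs must match byte for byte, with no tolerance. */
static int blocks_bit_identical(const int32_t *a, const int32_t *b)
{
    return memcmp(a, b, sizeof(int32_t) * BLK_SAMPLES) == 0;
}
```

A harness would feed each block to both the scalar and the SIMD DCT and require `blocks_bit_identical()` on the outputs; a tolerance-based comparison would not satisfy a byte-for-byte contract.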

Deep-dive deliverables (ADR-0108)

Research digest: docs/research/0014-psnr-hvs-neon.md
Decision matrix: ADR-0160 ## Alternatives considered (4 alternatives scored)
AGENTS.md invariant: libvmaf/src/feature/AGENTS.md (new NEON section)
Reproducer: ninja -C build-aarch64 && qemu-aarch64-static -L /usr/aarch64-linux-gnu/ build-aarch64/test/test_psnr_hvs_neon
CHANGELOG entry: fork-unreleased § Added
Rebase-notes: §0052 now covers both SIMD sister TUs

🤖 Generated with Claude Code

… (T3-5-neon, ADR-0160)

Sister port to ADR-0159 (AVX2). aarch64 users now get the same byte-identical Xiph/Daala 8×8 integer DCT vectorization as x86 via `int32x4_t` NEON intrinsics.

Vectorization strategy (half-wide split):
- NEON's 4-wide `int32x4_t` means each 8-column row splits into `r_k_lo` (cols 0-3) + `r_k_hi` (cols 4-7).
- The 30-butterfly network runs twice per DCT pass (once per half), mirroring the AVX2 TU line-for-line with `int32x4_t` substituted for `__m256i`.
- 8×8 transpose decomposes into four `transpose4x4_s32` stages (via aarch64 `vtrn1q_s32` / `vtrn2q_s32` / `vtrn1q_s64` / `vtrn2q_s64`; armv7 `vtrnq_s64` doesn't exist on aarch64) plus a top-right ↔ bottom-left block swap.
- Float accumulators (means/variances/mask/error) stay scalar per ADR-0139/0159. `accumulate_error()` threads the outer `ret` by pointer (ADR-0159 summation-order lesson carried through).

Runtime dispatch: `psnr_hvs.c init()` gains an ARCH_AARCH64 branch picking `calc_psnrhvs_neon` when `VMAF_ARM_CPU_FLAG_NEON` is set.

Verification:
- `test_psnr_hvs_neon` under qemu-aarch64-static: 5/5 DCT subtests pass (3 xorshift seeds + delta + constant).
- Netflix golden 576×324 8-bit pair scalar-vs-NEON CLI diff under QEMU: byte-identical `psnr_hvs_{y,cb,cr}` scores, only `<fyi fps>` timing header differs.
- 1080p 10-bit pairs covered by native-aarch64 CI + Netflix CPU Golden Tests gate (QEMU segfaults on heavy 10-bit threadpool, a known emulator limit).
- `meson test -C build` x86: 36/36 (no regression in AVX2 path).
- clang-tidy clean on all touched files (build + build-aarch64); assertion-density PASS (65 asserts / 35 funcs, avg 1.86).

ISA-parity matrix for psnr_hvs now closes: scalar + AVX2 + NEON. AVX-512 and SVE2 remain unscheduled.

Ships ADR-0108 six deep-dive deliverables: ADR-0160, research digest 0014, rebase-notes §0052 extension (NEON sister added to Touches list), AGENTS.md invariant note, CHANGELOG entry, MD-lint cleanup of §0037/§0038 pre-existing warnings in rebase-notes.md (per feedback_fix_md_warnings rule).

Closes backlog T3-5-neon.
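
The summation-order point is easy to reproduce in isolation, because float addition is not associative. The sketch below is a toy, not the project's `accumulate_error()`: the function names and input values are hypothetical, chosen so that the effect is visible in IEEE-754 single precision (it assumes float arithmetic is evaluated in single precision, as on x86-64 and aarch64, and that no fast-math flags are in use). It does not reproduce the ~5.5e-5 figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Variant A: add every term straight into the caller's running total,
 * which keeps the addition order of a single scalar loop. */
static void accumulate_by_pointer(const float *terms, size_t n, float *ret)
{
    assert(terms != NULL && ret != NULL);
    for (size_t i = 0; i < n; i++)
        *ret += terms[i];
}

/* Variant B: sum into a local first, then fold the partial sum into the
 * total. Mathematically equal to A, but rounded at different points. */
static void accumulate_via_local(const float *terms, size_t n, float *ret)
{
    assert(terms != NULL && ret != NULL);
    float local = 0.0f;
    for (size_t i = 0; i < n; i++)
        local += terms[i];
    *ret += local;
}

/* Returns 1 when the two variants produce different totals. */
static int summation_order_matters(void)
{
    float terms[8];
    for (size_t i = 0; i < 8; i++)
        terms[i] = 1.0f;

    /* Near 1e8 adjacent floats are 8 apart, so adding 1.0f is rounded
     * away, while adding the pre-summed 8.0f is not. */
    float a = 1.0e8f;
    float b = 1.0e8f;
    accumulate_by_pointer(terms, 8, &a); /* total stays at 1e8 */
    accumulate_via_local(terms, 8, &b);  /* total becomes 1e8 + 8 */
    return a != b;
}
```

Under a byte-identical contract the SIMD path has to reproduce the scalar rounding sequence exactly, so the accumulation order is part of the interface rather than an implementation detail. Passing the outer accumulator by pointer is one way to keep that order unchanged.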

…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md` plus the audit-flagged docs gaps that surfaced during scope-checking.

T7-1: Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
Closes Issue #20. New tracked file `docs/state.md` (Open / Recently closed / Confirmed not-affected / Deferred) is the canonical in-tree bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR update on every bug close / open / rule-out. PR template carries a checkbox; opt-out `no state delta: REASON` for PRs without bug-status impact. ADRs cover decisions, this file covers bug status.

T7-2: MCP server release artifact channel (ADR-0166)
Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release attachment with Sigstore keyless signing + PEP 740 attestations + SLSA L3 provenance. Wired as new `mcp-build` / `mcp-sign` / `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher binding required (operational note in the ADR).

T7-3: Self-hosted GPU runner enrollment guide
New docs/development/self-hosted-runner.md pins the registration steps so an operator can stand a runner up in ~10 minutes. Per popup 2026-04-25 the user's local dev box (CUDA + Intel) will be the first runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`, `avx512`) reserved for future job targeting.

ADR-0167: Path-mapped doc-drift enforcement
Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing checks because the workflow was advisory + accepted ADR additions as "docs were touched"). Two layers:
- Layer 1 (in-session): new project hook `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an informational `NOTICE` when a user-discoverable surface is touched but no matching `docs/<topic>/` file is touched. Mirrors the `auto-snapshot-warn.sh` pattern: informational stderr, no block.
- Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check` promoted from advisory (`continue-on-error: true`) to blocking + rewritten with a path-mapped surface→docs check. ADR additions no longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):
- docs/api/gpu.md: vmaf_cuda_state_free() public API documented (was: missing entirely, despite the symbol shipping in PR #94).
- docs/api/index.md: -EAGAIN error code added to the error semantics list (PR #91 / ADR-0154).
- docs/api/index.md: vmaf_read_pictures monotonic-index requirement documented (PR #88 / ADR-0152).
- docs/metrics/features.md: SSIMULACRA 2 backends matrix updated (was: "scalar only. SIMD / GPU paths are follow-up workstreams" 36 hours after PRs #98/#99/#100 landed all three SIMD ports).
- docs/metrics/features.md: PSNR-HVS backends updated for AVX2 (PR #96) + NEON (PR #97).
- docs/metrics/features.md: float_ms_ssim <176×176 minimum documented (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lusoris merged commit 98d359f into master Apr 24, 2026
46 checks passed

lusoris deleted the port/psnr-hvs-neon-t3-5 branch April 24, 2026 17:36

github-actions Bot mentioned this pull request Apr 24, 2026

chore: release master #1

Open

lusoris merged 1 commit into master from port/psnr-hvs-neon-t3-5
