… (T3-5-neon, ADR-0160)
Sister port to ADR-0159 (AVX2). aarch64 users now get the same
byte-identical Xiph/Daala 8×8 integer DCT vectorization as x86 via
`int32x4_t` NEON intrinsics.
Vectorization strategy (half-wide split):
- NEON's 4-wide `int32x4_t` means each 8-column row splits into
`r_k_lo` (cols 0-3) + `r_k_hi` (cols 4-7).
- The 30-butterfly network runs twice per DCT pass (once per half),
mirroring the AVX2 TU line-for-line with `int32x4_t` substituted
for `__m256i`.
- 8×8 transpose decomposes into four `transpose4x4_s32` stages
(via aarch64 `vtrn1q_s32` / `vtrn2q_s32` / `vtrn1q_s64` /
`vtrn2q_s64` — armv7 `vtrnq_s64` doesn't exist on aarch64) plus
a top-right ↔ bottom-left block swap.
- Float accumulators (means/variances/mask/error) stay scalar per
ADR-0139/0159. `accumulate_error()` threads the outer `ret` by
pointer (ADR-0159 summation-order lesson carried through).
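The transpose decomposition described in the list above can be illustrated with a scalar model that runs on any host. This is only a sketch of the block structure, not the NEON code itself: the function names are hypothetical, and the per-block transpose is written with plain loops where the real translation unit is described as using `vtrn1q_s32` / `vtrn2q_s32` / `vtrn1q_s64` / `vtrn2q_s64` on `int32x4_t` registers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one 4x4 in-register transpose. The block starts at
 * (row0, col0) inside a row-major 8x8 matrix. Hypothetical helper name. */
static void transpose4x4_block(int32_t m[8][8], int row0, int col0)
{
    assert(row0 == 0 || row0 == 4);
    assert(col0 == 0 || col0 == 4);
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int32_t tmp = m[row0 + i][col0 + j];
            m[row0 + i][col0 + j] = m[row0 + j][col0 + i];
            m[row0 + j][col0 + i] = tmp;
        }
    }
}

/* Swap the top-right and bottom-left 4x4 quadrants. In the half-wide
 * layout this exchanges the "hi" halves of rows 0-3 with the "lo"
 * halves of rows 4-7. */
static void swap_offdiag_blocks(int32_t m[8][8])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            int32_t tmp = m[i][4 + j];
            m[i][4 + j] = m[4 + i][j];
            m[4 + i][j] = tmp;
        }
    }
}

/* Full 8x8 transpose: four independent 4x4 transposes, then one swap. */
static void transpose8x8_blocked(int32_t m[8][8])
{
    transpose4x4_block(m, 0, 0);
    transpose4x4_block(m, 0, 4);
    transpose4x4_block(m, 4, 0);
    transpose4x4_block(m, 4, 4);
    swap_offdiag_blocks(m);
}
```

The identity behind it: for a matrix split into quadrants [[A, B], [C, D]], the transpose is [[Aᵀ, Cᵀ], [Bᵀ, Dᵀ]], so transposing each quadrant in place and then exchanging the two off-diagonal quadrants yields the full transpose.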
Runtime dispatch: `init()` in `psnr_hvs.c` gains an ARCH_AARCH64 branch
picking `calc_psnrhvs_neon` when `VMAF_ARM_CPU_FLAG_NEON` is set.
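The dispatch pattern described here might look roughly like the sketch below. Every identifier is hypothetical (the `demo_` names, the flag value, and the stub signatures are assumptions, not the actual libvmaf declarations), and the compile-time architecture guard is modelled with a macro that defaults to 1 so the NEON branch can be exercised on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; the real VMAF_ARM_CPU_FLAG_NEON value is not
 * given in the source. */
#define DEMO_ARM_CPU_FLAG_NEON (1u << 0)

/* Stand-in for the ARCH_AARCH64 build-time guard (assumption). */
#ifndef DEMO_ARCH_AARCH64
#define DEMO_ARCH_AARCH64 1
#endif

typedef enum { DEMO_BACKEND_SCALAR, DEMO_BACKEND_NEON } demo_backend;
typedef demo_backend (*demo_calc_fn)(void);

/* Stubs that only report which backend was chosen. */
static demo_backend demo_calc_scalar(void) { return DEMO_BACKEND_SCALAR; }
static demo_backend demo_calc_neon(void) { return DEMO_BACKEND_NEON; }

/* Start from the scalar path and upgrade only when both the build
 * target and the runtime CPU flags allow it. */
static demo_calc_fn demo_select_calc(unsigned cpu_flags)
{
    demo_calc_fn fn = demo_calc_scalar;
#if DEMO_ARCH_AARCH64
    if (cpu_flags & DEMO_ARM_CPU_FLAG_NEON)
        fn = demo_calc_neon;
#endif
    assert(fn != NULL);
    return fn;
}
```

Defaulting to the scalar function keeps the selection safe when the flag is absent, which is the usual reason for structuring dispatch as "scalar first, then override".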
Verification:
- `test_psnr_hvs_neon` under qemu-aarch64-static: 5/5 DCT subtests
pass (3 xorshift seeds + delta + constant).
- Netflix golden 576×324 8-bit pair scalar-vs-NEON CLI diff under
QEMU: byte-identical `psnr_hvs_{y,cb,cr}` scores, only `<fyi fps>`
timing header differs.
- 1080p 10-bit pairs covered by native-aarch64 CI + Netflix CPU
Golden Tests gate (QEMU segfaults on heavy 10-bit threadpool —
known emulator limit).
- `meson test -C build` x86: 36/36 (no regression in AVX2 path).
- clang-tidy clean on all touched files (build + build-aarch64);
assertion-density PASS (65 asserts / 35 funcs, avg 1.86).
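The three kinds of DCT test input named in the first verification item (xorshift seeds, delta, constant) could be generated along the lines of this sketch. The helper names, the 8-bit value range, and the position of the delta sample are assumptions for illustration; the source does not show how `test_psnr_hvs_neon` builds its blocks.

```c
#include <assert.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift32. The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    assert(state != 0 && *state != 0);
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Pseudo-random 8x8 block, reduced to 0..255 (assumed sample range). */
static void fill_random(int32_t blk[64], uint32_t seed)
{
    uint32_t s = seed;
    for (int i = 0; i < 64; i++)
        blk[i] = (int32_t)(xorshift32(&s) & 0xFFu);
}

/* Delta block: a single nonzero sample (placed at index 0 here). */
static void fill_delta(int32_t blk[64], int32_t amplitude)
{
    for (int i = 0; i < 64; i++)
        blk[i] = 0;
    blk[0] = amplitude;
}

/* Constant block: every sample holds the same value. */
static void fill_constant(int32_t blk[64], int32_t value)
{
    for (int i = 0; i < 64; i++)
        blk[i] = value;
}
```

A fixed seed makes the random blocks reproducible, so a scalar-vs-SIMD comparison fails on the same input every time it fails at all.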
ISA-parity matrix for psnr_hvs now closes: scalar + AVX2 + NEON.
AVX-512 and SVE2 remain unscheduled.
Ships ADR-0108 six deep-dive deliverables: ADR-0160, research
digest 0014, rebase-notes §0052 extension (NEON sister added to
Touches list), AGENTS.md invariant note, CHANGELOG entry, MD-lint
cleanup of §0037/§0038 pre-existing warnings in rebase-notes.md
(per feedback_fix_md_warnings rule).
Closes backlog T3-5-neon.
lusoris added a commit that referenced this pull request Apr 24, 2026
…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from
`.workingdir2/BACKLOG.md` plus the audit-flagged docs gaps that
surfaced during scope-checking.

T7-1: Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
Closes Issue #20. New tracked file `docs/state.md` (Open / Recently
closed / Confirmed not-affected / Deferred) is the canonical in-tree
bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR
update on every bug close / open / rule-out. PR template carries a
checkbox; opt-out `no state delta: REASON` for PRs without bug-status
impact. ADRs cover decisions, this file covers bug status.

T7-2: MCP server release artifact channel (ADR-0166)
Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release
attachment with Sigstore keyless signing + PEP 740 attestations +
SLSA L3 provenance. Wired as new `mcp-build` / `mcp-sign` /
`mcp-publish-pypi` jobs in the existing supply-chain.yml. After this
lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher
binding required (operational note in the ADR).

T7-3: Self-hosted GPU runner enrollment guide
New docs/development/self-hosted-runner.md pins the registration
steps so an operator can stand a runner up in ~10 minutes. Per popup
2026-04-25 the user's local dev box (CUDA + Intel) will be the first
runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`, `avx512`)
reserved for future job targeting.

ADR-0167: Path-mapped doc-drift enforcement
Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed
in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing
checks because the workflow was advisory + accepted ADR additions as
"docs were touched"). Two layers:
- Layer 1 (in-session): new project hook
  `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an
  informational `NOTICE` when a user-discoverable surface is touched
  but no matching `docs/<topic>/` file is touched. Mirrors the
  `auto-snapshot-warn.sh` pattern: informational stderr, no block.
- Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check`
  promoted from advisory (`continue-on-error: true`) to blocking +
  rewritten with a path-mapped surface→docs check. ADR additions no
  longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine
  internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):
- docs/api/gpu.md: vmaf_cuda_state_free() public API documented (was:
  missing entirely, despite the symbol shipping in PR #94).
- docs/api/index.md: -EAGAIN error code added to the error semantics
  list (PR #91 / ADR-0154).
- docs/api/index.md: vmaf_read_pictures monotonic-index requirement
  documented (PR #88 / ADR-0152).
- docs/metrics/features.md: SSIMULACRA 2 backends matrix updated (was:
  "scalar only. SIMD / GPU paths are follow-up workstreams" 36 hours
  after PRs #98/#99/#100 landed all three SIMD ports).
- docs/metrics/features.md: PSNR-HVS backends updated for AVX2
  (PR #96) + NEON (PR #97).
- docs/metrics/features.md: float_ms_ssim <176×176 minimum documented
  (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167.
Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- … (AVX2): new `libvmaf/src/feature/arm64/psnr_hvs_neon.c` under the same byte-for-byte Netflix-golden bit-exactness contract.
- `int32x4_t` → each 8-column row splits into `lo` (cols 0-3) + `hi` (cols 4-7); the 30-butterfly runs twice per DCT pass, and the 8×8 transpose decomposes into four 4×4 `vtrn1q_s32` / `vtrn2q_s32` / `vtrn1q_s64` / `vtrn2q_s64` stages plus a top-right ↔ bottom-left block swap.
- `accumulate_error()` threads the outer `ret` by pointer (the ADR-0159 summation-order lesson inherited: a local float accumulator would drift the Netflix golden by ~5.5e-5).
- … `psnr_hvs` closes: scalar + AVX2 + NEON.

Test plan
- `test_psnr_hvs_neon` under `qemu-aarch64-static`: 5/5 DCT subtests pass (3 xorshift seeds + delta + constant)
- … under QEMU: byte-identical `psnr_hvs_{y,cb,cr}` scores
- `meson test -C build` x86: 36/36 (no AVX2 regression)
- … segfaults on heavy 10-bit threadpool, known emulator limit)

Deep-dive deliverables (ADR-0108)
- … `## Alternatives considered` (4 alternatives scored)
- … `ninja -C build-aarch64 && qemu-aarch64-static -L /usr/aarch64-linux-gnu/ build-aarch64/test/test_psnr_hvs_neon`

🤖 Generated with Claude Code
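The summation-order point (threading the outer `ret` by pointer instead of summing into a local float accumulator) comes down to float addition not being associative. The sketch below shows the effect with two hypothetical helpers; it is not the actual `accumulate_error()` code, and the values are chosen only to make the rounding difference visible on IEEE-754 single precision.

```c
#include <assert.h>
#include <stddef.h>

/* Adds each term straight into the caller's running total, so the
 * rounding sequence matches a scalar loop that does the same. */
static void accumulate_by_pointer(const float *terms, size_t n, float *ret)
{
    assert(terms != NULL && ret != NULL);
    for (size_t i = 0; i < n; i++)
        *ret += terms[i];
}

/* Sums into a local first and folds it into the total once. Same
 * mathematical sum, different rounding sequence. */
static void accumulate_via_local(const float *terms, size_t n, float *ret)
{
    assert(terms != NULL && ret != NULL);
    float local = 0.0f;
    for (size_t i = 0; i < n; i++)
        local += terms[i];
    *ret += local;
}
```

With a running total of 16777216.0f (2^24) and four terms of 1.0f, the by-pointer version leaves the total unchanged, because each 2^24 + 1 rounds back to 2^24, while the local-accumulator version adds 4.0f in one step and reaches 16777220.0f. A bit-exactness contract against a scalar reference therefore requires the same order of additions, not just the same set of terms.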