feat(psnr_hvs): NEON aarch64 sister port — 8×8 integer DCT vectorized (T3-5-neon, ADR-0160)#97

Merged: lusoris merged 1 commit into master from port/psnr-hvs-neon-t3-5 on Apr 24, 2026

@lusoris lusoris commented Apr 24, 2026

Summary

  • Sister port to ADR-0159 (AVX2): new
    libvmaf/src/feature/arm64/psnr_hvs_neon.c under the same
    byte-for-byte Netflix-golden bit-exactness contract.
  • NEON's int32x4_t is 4 lanes wide, so each 8-column row splits
    into lo (cols 0-3) + hi (cols 4-7); the 30-butterfly network runs
    twice per DCT pass (once per half), and the 8×8 transpose
    decomposes into four 4×4 transposes built from
    vtrn1q_s32 / vtrn2q_s32 / vtrn1q_s64 / vtrn2q_s64
    stages plus a top-right ↔ bottom-left block swap.
  • Float accumulators stay scalar per the ADR-0139/0159 rule;
    accumulate_error() threads the outer ret by pointer. This
    inherits the ADR-0159 summation-order lesson: a local float
    accumulator would drift the Netflix golden by ~5.5e-5.
  • ISA-parity matrix for psnr_hvs closes: scalar + AVX2 + NEON.
  • Closes backlog T3-5-neon.
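
The summation-order point can be illustrated with a minimal, self-contained sketch. The helper names `accumulate_into`, `accumulate_local`, and `summation_order_gap` are hypothetical and not taken from psnr_hvs_neon.c, and the input values are contrived to make the rounding visible. It only shows why folding terms into a local float first can produce a different total than adding each term straight into the caller's running sum.

```c
#include <assert.h>
#include <stddef.h>

/* Adds each term directly into the caller's running total (by pointer). */
static void accumulate_into(float *ret, const float *terms, size_t n)
{
    assert(ret != NULL);
    assert(terms != NULL);
    for (size_t i = 0; i < n; i++)
        *ret += terms[i];
}

/* Sums into a local subtotal first, then folds it into the total once. */
static void accumulate_local(float *ret, const float *terms, size_t n)
{
    assert(ret != NULL);
    assert(terms != NULL);
    float local = 0.0f;
    for (size_t i = 0; i < n; i++)
        local += terms[i];
    *ret += local;
}

/* Returns (local-subtotal result) - (by-pointer result) for a contrived
 * input: a large running total plus sixteen small terms. */
static float summation_order_gap(void)
{
    float terms[16];
    for (size_t i = 0; i < 16; i++)
        terms[i] = 1.0f;

    float by_pointer = 1.0e8f;
    float via_local = 1.0e8f;
    accumulate_into(&by_pointer, terms, 16);
    accumulate_local(&via_local, terms, 16);
    return via_local - by_pointer;
}
```

Near 1e8 the spacing between adjacent single-precision floats is 8, so each `+ 1.0f` added directly is rounded away, while the local subtotal of 16 survives the final add. The drift quoted above presumably stems from the same kind of reordering at a much smaller scale; keeping the scalar code's exact order of additions avoids it.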

Test plan

  • test_psnr_hvs_neon under qemu-aarch64-static: 5/5 DCT
    subtests pass (3 xorshift seeds + delta + constant)
  • Netflix golden 576×324 8-bit pair scalar-vs-NEON CLI diff
    under QEMU: byte-identical psnr_hvs_{y,cb,cr} scores
  • meson test -C build x86: 36/36 (no AVX2 regression)
  • clang-tidy clean on all touched files (build + build-aarch64)
  • assertion-density PASS, copyright PASS, pre-commit PASS
  • CI native-aarch64 job covers 1080p 10-bit pairs (QEMU
    segfaults on heavy 10-bit threadpool — known emulator limit)
  • Netflix CPU Golden Tests gate

Deep-dive deliverables (ADR-0108)

  1. Research digest: docs/research/0014-psnr-hvs-neon.md
  2. Decision matrix: ADR-0160, section "Alternatives considered" (4 alternatives scored)
  3. AGENTS.md invariant: libvmaf/src/feature/AGENTS.md (new NEON section)
  4. Reproducer: ninja -C build-aarch64 && qemu-aarch64-static -L /usr/aarch64-linux-gnu/ build-aarch64/test/test_psnr_hvs_neon
  5. CHANGELOG entry: fork-unreleased § Added
  6. Rebase-notes: §0052 now covers both SIMD sister TUs

🤖 Generated with Claude Code

… (T3-5-neon, ADR-0160)

Sister port to ADR-0159 (AVX2). aarch64 users now get the same
byte-identical Xiph/Daala 8×8 integer DCT vectorization as x86 via
`int32x4_t` NEON intrinsics.

Vectorization strategy (half-wide split):
- NEON's 4-wide `int32x4_t` means each 8-column row splits into
  `r_k_lo` (cols 0-3) + `r_k_hi` (cols 4-7).
- The 30-butterfly network runs twice per DCT pass (once per half),
  mirroring the AVX2 TU line-for-line with `int32x4_t` substituted
  for `__m256i`.
- 8×8 transpose decomposes into four `transpose4x4_s32` stages
  (via aarch64 `vtrn1q_s32` / `vtrn2q_s32` / `vtrn1q_s64` /
  `vtrn2q_s64`; there is no `vtrnq_s64` intrinsic to fall back on)
  plus a top-right ↔ bottom-left block swap.
- Float accumulators (means/variances/mask/error) stay scalar per
  ADR-0139/0159. `accumulate_error()` threads the outer `ret` by
  pointer (ADR-0159 summation-order lesson carried through).
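
The transpose decomposition above can be sketched portably. This is not the code from `psnr_hvs_neon.c`: the `v4` struct is a scalar stand-in for `int32x4_t`, and the `trn*` helpers are hypothetical models of the lane behaviour of the corresponding intrinsics, so the block compiles on any host. The names `transpose4x4`, `transpose8x8`, and `transpose8x8_rows` are likewise assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for int32x4_t (4 lanes of int32). */
typedef struct { int32_t v[4]; } v4;

/* Model of vtrn1q_s32: even lanes of a and b, interleaved. */
static v4 trn1_s32(v4 a, v4 b) { v4 r = {{a.v[0], b.v[0], a.v[2], b.v[2]}}; return r; }
/* Model of vtrn2q_s32: odd lanes of a and b, interleaved. */
static v4 trn2_s32(v4 a, v4 b) { v4 r = {{a.v[1], b.v[1], a.v[3], b.v[3]}}; return r; }
/* Model of vtrn1q_s64: low 64-bit half of a, then low half of b. */
static v4 trn1_s64(v4 a, v4 b) { v4 r = {{a.v[0], a.v[1], b.v[0], b.v[1]}}; return r; }
/* Model of vtrn2q_s64: high 64-bit half of a, then high half of b. */
static v4 trn2_s64(v4 a, v4 b) { v4 r = {{a.v[2], a.v[3], b.v[2], b.v[3]}}; return r; }

/* In-place 4x4 transpose of four row vectors. */
static void transpose4x4(v4 r[4])
{
    assert(r != NULL);
    v4 t0 = trn1_s32(r[0], r[1]); /* a0 b0 a2 b2 */
    v4 t1 = trn2_s32(r[0], r[1]); /* a1 b1 a3 b3 */
    v4 t2 = trn1_s32(r[2], r[3]); /* c0 d0 c2 d2 */
    v4 t3 = trn2_s32(r[2], r[3]); /* c1 d1 c3 d3 */
    r[0] = trn1_s64(t0, t2);      /* a0 b0 c0 d0 */
    r[1] = trn1_s64(t1, t3);      /* a1 b1 c1 d1 */
    r[2] = trn2_s64(t0, t2);      /* a2 b2 c2 d2 */
    r[3] = trn2_s64(t1, t3);      /* a3 b3 c3 d3 */
}

/* lo[k] holds cols 0-3 of row k, hi[k] holds cols 4-7 of row k. */
static void transpose8x8(v4 lo[8], v4 hi[8])
{
    assert(lo != NULL);
    assert(hi != NULL);
    transpose4x4(&lo[0]); /* top-left block     */
    transpose4x4(&hi[0]); /* top-right block    */
    transpose4x4(&lo[4]); /* bottom-left block  */
    transpose4x4(&hi[4]); /* bottom-right block */
    for (int k = 0; k < 4; k++) { /* swap top-right with bottom-left */
        v4 tmp = hi[k];
        hi[k] = lo[k + 4];
        lo[k + 4] = tmp;
    }
}

/* Convenience wrapper: split rows into halves, transpose, store back. */
static void transpose8x8_rows(int32_t m[8][8])
{
    assert(m != NULL);
    v4 lo[8], hi[8];
    for (int k = 0; k < 8; k++) {
        memcpy(lo[k].v, &m[k][0], sizeof lo[k].v);
        memcpy(hi[k].v, &m[k][4], sizeof hi[k].v);
    }
    transpose8x8(lo, hi);
    for (int k = 0; k < 8; k++) {
        memcpy(&m[k][0], lo[k].v, sizeof lo[k].v);
        memcpy(&m[k][4], hi[k].v, sizeof hi[k].v);
    }
}
```

On aarch64 the four `trn*` helpers would map onto the real intrinsics, with `vreinterpretq_s64_s32` / `vreinterpretq_s32_s64` casts around the 64-bit stages. The diagonal blocks stay in place after their own transpose; only the two off-diagonal blocks trade positions.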

Runtime dispatch: `init()` in `psnr_hvs.c` gains an `ARCH_AARCH64`
branch that picks `calc_psnrhvs_neon` when `VMAF_ARM_CPU_FLAG_NEON`
is set.
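
A rough sketch of this kind of flag-driven dispatch follows. Everything in it is hypothetical: the flag value `SKETCH_CPU_FLAG_NEON`, the kernel signature, and the function names are placeholders, not the definitions used by libvmaf. In the real TU the NEON branch would additionally sit behind an `ARCH_AARCH64` preprocessor guard, which is omitted here so the sketch builds on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for VMAF_ARM_CPU_FLAG_NEON; the real value is
 * defined inside libvmaf. */
#define SKETCH_CPU_FLAG_NEON (1u << 0)

/* Hypothetical kernel signature, chosen only for this sketch. */
typedef const char *(*psnrhvs_kernel_fn)(void);

static const char *kernel_scalar(void) { return "scalar"; }
static const char *kernel_neon(void) { return "neon"; }

/* Starts from the scalar kernel and upgrades when the CPU flag allows. */
static psnrhvs_kernel_fn select_kernel(unsigned cpu_flags)
{
    psnrhvs_kernel_fn fn = kernel_scalar;
    if (cpu_flags & SKETCH_CPU_FLAG_NEON)
        fn = kernel_neon;
    assert(fn != NULL);
    return fn;
}
```

Defaulting to the scalar kernel and overriding it keeps the fallback path valid on every build, including those where the SIMD TU is not compiled in.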

Verification:
- `test_psnr_hvs_neon` under qemu-aarch64-static: 5/5 DCT subtests
  pass (3 xorshift seeds + delta + constant).
- Netflix golden 576×324 8-bit pair scalar-vs-NEON CLI diff under
  QEMU: byte-identical `psnr_hvs_{y,cb,cr}` scores, only `<fyi fps>`
  timing header differs.
- 1080p 10-bit pairs covered by native-aarch64 CI + Netflix CPU
  Golden Tests gate (QEMU segfaults on heavy 10-bit threadpool —
  known emulator limit).
- `meson test -C build` x86: 36/36 (no regression in AVX2 path).
- clang-tidy clean on all touched files (build + build-aarch64);
  assertion-density PASS (65 asserts / 35 funcs, avg 1.86).
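
The three input families named for the DCT subtests (xorshift-seeded, delta, constant) could be generated along the following lines. This is a sketch under assumptions: the source does not say which xorshift variant, value range, or delta position `test_psnr_hvs_neon` uses, so the 32-bit Marsaglia variant, the 8-bit sample range, and all function names here are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One step of Marsaglia's xorshift32; the state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    assert(state != NULL);
    assert(*state != 0u);
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills an 8x8 block with pseudo-random 8-bit samples (assumed range). */
static void fill_random(int32_t blk[64], uint32_t seed)
{
    assert(blk != NULL);
    uint32_t s = seed;
    for (size_t i = 0; i < 64; i++)
        blk[i] = (int32_t)(xorshift32(&s) & 0xFFu);
}

/* Single nonzero sample at the origin, zeros elsewhere (assumed layout). */
static void fill_delta(int32_t blk[64])
{
    assert(blk != NULL);
    memset(blk, 0, 64 * sizeof blk[0]);
    blk[0] = 255;
}

/* Every sample set to the same value. */
static void fill_constant(int32_t blk[64], int32_t value)
{
    assert(blk != NULL);
    for (size_t i = 0; i < 64; i++)
        blk[i] = value;
}

/* Bit-exact comparison, the criterion a scalar-vs-SIMD test would use. */
static int blocks_identical(const int32_t a[64], const int32_t b[64])
{
    assert(a != NULL);
    assert(b != NULL);
    return memcmp(a, b, 64 * sizeof a[0]) == 0;
}
```

A parity test would run the scalar DCT and the SIMD DCT on copies of each block and require `blocks_identical` to hold on the outputs, with no tolerance.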

ISA-parity matrix for psnr_hvs now closes: scalar + AVX2 + NEON.
AVX-512 and SVE2 remain unscheduled.

Ships the six ADR-0108 deep-dive deliverables: ADR-0160, research
digest 0014, rebase-notes §0052 extension (NEON sister added to the
Touches list), AGENTS.md invariant note, CHANGELOG entry, and MD-lint
cleanup of pre-existing §0037/§0038 warnings in rebase-notes.md
(per feedback_fix_md_warnings rule).

Closes backlog T3-5-neon.

@lusoris merged commit 98d359f into master Apr 24, 2026
46 checks passed
@lusoris deleted the port/psnr-hvs-neon-t3-5 branch April 24, 2026 17:36
@github-actions (bot) mentioned this pull request Apr 24, 2026

lusoris added a commit that referenced this pull request Apr 24, 2026
…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md`
plus the audit-flagged docs gaps that surfaced during scope-checking.

T7-1 — Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
  Closes Issue #20. New tracked file `docs/state.md` (Open / Recently
  closed / Confirmed not-affected / Deferred) is the canonical in-tree
  bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR
  update on every bug close / open / rule-out. PR template carries a
  checkbox; opt-out `no state delta: REASON` for PRs without bug-status
  impact. ADRs cover decisions, this file covers bug status.

T7-2 — MCP server release artifact channel (ADR-0166)
  Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release
  attachment with Sigstore keyless signing + PEP 740 attestations + SLSA
  L3 provenance. Wired as new `mcp-build` / `mcp-sign` /
  `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this
  lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher
  binding required (operational note in the ADR).

T7-3 — Self-hosted GPU runner enrollment guide
  New docs/development/self-hosted-runner.md pins the registration
  steps so an operator can stand a runner up in ~10 minutes. Per popup
  2026-04-25 the user's local dev box (CUDA + Intel) will be the first
  runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`,
  `avx512`) reserved for future job targeting.

ADR-0167 — Path-mapped doc-drift enforcement
  Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed
  in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing
  checks because the workflow was advisory + accepted ADR additions
  as "docs were touched"). Two layers:

  Layer 1 (in-session): new project hook
  `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an
  informational `NOTICE` when a user-discoverable surface is touched
  but no matching `docs/<topic>/` file is touched. Mirrors the
  `auto-snapshot-warn.sh` pattern — informational stderr, no block.

  Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check`
  promoted from advisory (`continue-on-error: true`) to blocking and
  rewritten as a path-mapped surface→docs check. ADR additions alone
  no longer satisfy it. Per-PR opt-out `no docs needed: REASON` for
  genuine internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):
  - docs/api/gpu.md — vmaf_cuda_state_free() public API documented
    (was: missing entirely, despite the symbol shipping in PR #94).
  - docs/api/index.md — -EAGAIN error code added to the error
    semantics list (PR #91 / ADR-0154).
  - docs/api/index.md — vmaf_read_pictures monotonic-index requirement
    documented (PR #88 / ADR-0152).
  - docs/metrics/features.md — SSIMULACRA 2 backends matrix updated
    (was: "scalar only. SIMD / GPU paths are follow-up workstreams"
    36 hours after PRs #98/#99/#100 landed all three SIMD ports).
  - docs/metrics/features.md — PSNR-HVS backends updated for AVX2
    (PR #96) + NEON (PR #97).
  - docs/metrics/features.md — float_ms_ssim <176×176 minimum
    documented (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>