[feat] eval: add audio metrics #1352

Merged
SolitaryThinker merged 6 commits into main from shao/eval-audio
May 16, 2026

@shaoxiongduan
Collaborator

Purpose

Add seven audio.* metrics to fastvideo.eval, partially ported from
hkchengrex/av-benchmark (the V2A literature's de-facto eval harness used
by MMAudio, FoleyCrafter, V2A-Mapper). Closes the audio gap in the eval
suite so video-with-audio models like LTX2 can be scored end-to-end through
the same registry-driven API as common.*, vbench.*, and physics_iq.*.

Fixes #

Changes

  • New metrics under fastvideo/eval/metrics/audio/:
    • audio.clap_score (HF ClapModel, laion/clap-htsat-fused)
    • audio.audiobox_aesthetics (Meta audiobox_aesthetics, PQ as score)
    • audio.kl_divergence (PaSST AudioSet-527 logits, KL(gt || pred))
    • audio.frechet_distance (PaSST 768-d embeds, corpus-vs-corpus FAD)
    • audio.wer (Whisper-base default; GLM-ASR and SenseVoice optional)
    • audio.desync (Synchformer, av-benchmark DeSync)
    • audio.imagebind_score (ImageBind huge, audio/video cosine)
  • New [eval-audio] extra in pyproject.toml; ImageBind is git-sourced via
    [tool.uv.sources] (CC BY-NC-SA 4.0 so kept out of the FastVideo tree).
  • Vendored upstream under fastvideo/third_party/eval/ alongside the
    existing VBench submodule:
    • synchformer/ (MIT) for audio.desync
    • glmasr/ (Apache-2.0) for audio.wer (glm_asr) on transformers 4.57
  • Thread-safety guards for the multi-GPU pool path:
    • decord.bridge.set_bridge("torch") re-set on the worker thread in
      imagebind_score.compute() (decord keeps its bridge in a threading.local
      and upstream is missing a global statement, so a bridge set on the
      main thread is lost in worker threads).
    • threading.Lock around pytorchvideo's decord-backed video decode.
    • contextlib.redirect_stdout(devnull) removed from FAD and KL setup
      paths (races across workers and closes sys.stdout).
  • Numerical robustness in audio.frechet_distance: filter non-finite PaSST
    rows before np.cov, surface drop counts in details, always-on
    eps * I regularization on both covariances.
  • Three equivalent ways to supply the FAD reference set: per-sample paired
    (reference_audio on each sample, matches V2A manifests), role-tagged
    samples (role="reference"), or a cached .pt features file pointed to
    by FASTVIDEO_FAD_REF_FEATURES.
  • Eval README and docs/contributing/eval-metrics.md updated:
    install matrix, per-metric input contracts, vendored-tree disclosure,
    upstream-citation table, three-pattern upstream-wrapping guide (submodule
    / vendored / git-source-via-uv).
  • LTX2 audio example: examples/inference/eval/basic_ltx2_audio_eval.py.
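
As a rough illustration of the numerical-robustness bullet above, here is a minimal NumPy sketch of a Fréchet distance that filters non-finite rows, reports drop counts, and adds eps * I to both covariances. It is not the code in this PR: the function name, the eps value, and the `n_ref_dropped_nonfinite` key are assumptions (only `n_gen_dropped_nonfinite` appears in this PR), and the trace of the matrix square root is taken from eigenvalues rather than from whatever routine the metric actually uses.

```python
import numpy as np


def frechet_distance(gen, ref, eps=1e-6):
    """Frechet distance between two feature sets of shape (N, D)."""
    gen = np.asarray(gen, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)

    # Drop rows containing NaN/Inf before estimating any statistics.
    gen_ok = gen[np.isfinite(gen).all(axis=1)]
    ref_ok = ref[np.isfinite(ref).all(axis=1)]
    details = {
        "n_gen_dropped_nonfinite": len(gen) - len(gen_ok),
        "n_ref_dropped_nonfinite": len(ref) - len(ref_ok),  # hypothetical key
    }

    dim = gen_ok.shape[1]
    mu_g, mu_r = gen_ok.mean(axis=0), ref_ok.mean(axis=0)
    # Always-on eps * I regularization on both covariances.
    cov_g = np.cov(gen_ok, rowvar=False) + eps * np.eye(dim)
    cov_r = np.cov(ref_ok, rowvar=False) + eps * np.eye(dim)

    # Tr(sqrt(cov_g @ cov_r)) is the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_g @ cov_r)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_g - mu_r
    fd = diff @ diff + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt
    return float(fd), details
```

With identical sets the distance is numerically zero, and a pure mean shift contributes only the squared norm of the shift, which makes for a quick sanity check on synthetic features.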

Test Plan

# Lint
pre-commit run --from-ref origin/main --to-ref HEAD

# 7 metrics x 9 corruption variants of one LTX2 clip on 4 H200s
srun --gpus=4 --gres=gpu:4 python \
  outputs_video/ltx2_audio_eval/_score_ltx2_corruptions.py --num-gpus 4

# Single-GPU sanity on real movie clip + LTX2 video
srun --gpus=1 --gres=gpu:1 python \
  outputs_video/ltx2_audio_eval/_score_all_audio.py

Variants exercised by the corruption script: original LTX2 audio, silence,
white noise, low volume (-20 dB), reversed audio, pitch-shifted +6
semitones, audio delayed +1 s, audio advanced -1 s, audio replaced with an
unrelated Inception clip.

Test Results

Pre-commit: yapf, ruff, codespell, pymarkdown, mypy, filename, suggestion
all pass on the 5-commit diff.

End-to-end 4-GPU eval on 9 corruption variants of the LTX2 clip
(~45 s wall time for evaluate(), plus ~104 s of model loading in the
constructor; ~5 s per sample averaged across 4 workers):

| Variant | clap | audiobox PQ | KL | WER | desync (s) | imagebind |
|---|---|---|---|---|---|---|
| original | 0.187 | 7.56 | 0.00 | 0.00 | 0.20 | 0.380 |
| silence | NaN | 6.88 | (skip) | 1.00 | 2.00 | 0.006 |
| white_noise | 0.071 | 4.91 | 5.01 | 1.00 | 1.40 | 0.080 |
| low_volume | 0.139 | 7.37 | 0.12 | 0.00 | 0.20 | 0.366 |
| reversed | 0.277 | 6.93 | 0.25 | 1.00 | 0.80 | 0.370 |
| pitch_up | 0.184 | 6.55 | 0.45 | 0.78 | 0.20 | 0.311 |
| desync_delay_1s | 0.128 | 8.11 | 0.39 | 0.28 | 1.20 | 0.344 |
| desync_advance_1s | 0.101 | 8.04 | 0.17 | 0.22 | 0.80 | 0.371 |
| unrelated_audio | -0.098 | 6.59 | 3.97 | 1.00 | 2.00 | 0.070 |

Corpus FAD across the 9 variants paired against the original: 739.389
(silence drops 1 non-finite PaSST embed, surfaced in
details.n_gen_dropped_nonfinite).

How to read the table

clap rewards timbre/category match with the prompt and ignores temporal
order, so reversed speech can score higher than the original (0.277 vs
0.187). audiobox (PQ) measures audio quality in isolation and is nearly
amplitude-invariant (7.56 vs 7.37 at -20 dB). KL is a divergence from the
reference audio, so it is 0.00 for the original and rises sharply on white
noise (5.01) and unrelated audio (3.97); it is skipped on silence.
imagebind drops toward zero on noise/silence/unrelated audio. desync lands
at the Synchformer ±2 s grid's nearest match to the ground-truth shift
(0.2 s for original, 1.2/0.8 s for the 1 s shifts, 2.0 s for silence and
unrelated audio). wer reaches 1.0 on non-speech and reversed audio, and
~0.78 on pitch-shifted speech where Whisper recovers some words.
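
For readers unfamiliar with the wer column, the following is a minimal sketch of word error rate as a word-level edit distance divided by the reference length. It only illustrates the definition: the lowercase-and-split normalization is an assumption, and the PR's metric (Whisper, GLM-ASR, or SenseVoice transcripts, with CJK character-level handling) may normalize or clip differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Assumed convention: an empty reference scores 0 only if the
        # hypothesis is empty too.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)
```

An empty transcript against a non-empty reference gives exactly 1.0, which is consistent with the 1.00 entries for silence and noise in the table. This sketch does not clip, so heavy insertion can push it above 1.0.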

Checklist

  • I ran pre-commit run --all-files and fixed all issues
  • I added or updated tests for my changes
  • I updated documentation if needed
  • I considered GPU memory impact of my changes

For model/pipeline changes, also check:

  • I verified SSIM regression tests pass (not a model/pipeline change)
  • I updated the support matrix if adding a new model (not adding a model)

@mergify mergify Bot added labels on May 15, 2026: type: feat (New feature or capability), scope: inference (Inference pipeline, serving, CLI), scope: docs (Documentation)
@mergify
Contributor

mergify Bot commented May 15, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive audio evaluation suite for video generations, adding metrics such as CLAP, AudioBox Aesthetics, KL Divergence, FAD, WER, DeSync, and ImageBind. The implementation includes new metric classes, documentation updates, and the vendoring of necessary third-party components like Synchformer and GLM-ASR. Key feedback identifies potential runtime errors in the Fréchet Audio Distance calculation due to incorrect empty array initialization and highlights maintainability concerns regarding unused files and non-standard import practices in the vendored code.

        self._ref_buf.extend(other._ref_buf)

    def finalize(self) -> MetricResult:
        gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))

high

Initializing an empty array with np.empty((0,)) creates a 1D array. When self._gen_buf is empty, gen_all becomes this 1D array, and the subsequent call to np.isfinite(gen_all).all(axis=1) on line 182 would raise an AxisError because axis 1 is out of bounds. To prevent this, initialize the empty array with the correct number of dimensions.

Suggested change
- gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))
+ gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, PASST_EMBED_DIM))
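
The failure mode described in this comment can be reproduced in isolation. The sketch below is a stand-alone illustration, not the metric's code: `finite_row_mask` is a hypothetical helper, and `PASST_EMBED_DIM = 768` is taken from the 768-d PaSST embedding size stated in the PR description.

```python
import numpy as np

PASST_EMBED_DIM = 768  # PaSST embedding width stated in the PR description


def finite_row_mask(buf, empty_shape):
    """Stack buffered embeddings (or build an empty array) and flag finite rows."""
    arr = np.stack(buf) if buf else np.empty(empty_shape)
    return np.isfinite(arr).all(axis=1)


# 1-D empty fallback: axis=1 does not exist, so NumPy raises AxisError.
try:
    finite_row_mask([], (0, ))
    raised = False
except (ValueError, IndexError):  # AxisError subclasses both
    raised = True

# 2-D empty fallback: the reduction succeeds and yields an empty mask.
mask = finite_row_mask([], (0, PASST_EMBED_DIM))
```

Catching `ValueError`/`IndexError` rather than naming `AxisError` directly keeps the snippet independent of where a given NumPy version exposes that exception.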

            n_ref_dropped = 0
            ref_source = "cached"
        else:
            ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))

high

Similar to the gen_all array, ref_all should be initialized as a 2D array to avoid a potential AxisError on the next line when self._ref_buf is empty.

Suggested change
- ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))
+ ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, PASST_EMBED_DIM))

vsegs = _segment_video(frames).unsqueeze(0) # (B=1, S, T_seg, C, H, W)
# Synchformer's extract_vfeats wants (B, S, T_seg, C, H, W); fold (B, S) → (B*S, 1, ...).
b, s = vsegs.shape[:2]
from einops import rearrange

medium

For better code organization and to avoid re-importing on every call, it's best practice to move this import statement to the top of the file with the other imports.

Comment on lines +83 to +85
import decord
from imagebind import data as ib_data
from imagebind.models.imagebind_model import ModalityType

medium

For better code organization and to avoid re-importing on every call, these imports should be moved to the top of the file.

import torch
from torchvision.models.resnet import BasicBlock, Bottleneck, ResNet

sys.path.append('.') # nopep8

medium

This file appears to be unused in the project and contains a sys.path modification which is generally discouraged as it can lead to unpredictable import behavior. The imports on lines 11-13 also seem to rely on a project structure that doesn't exist here. If this file is not needed, it would be best to remove it to improve maintainability. If it is needed, the imports should be refactored to be relative to the fastvideo package.

shaoxiongduan and others added 5 commits May 15, 2026 22:10
Seven `audio.*` metrics ported 1:1 from `hkchengrex/av-benchmark`, the
V2A literature's de-facto eval harness:

- audio.clap_score          (HF ClapModel, laion/clap-htsat-fused)
- audio.audiobox_aesthetics (Meta audiobox_aesthetics, PQ as score)
- audio.kl_divergence       (PaSST AudioSet-527 logits, KL(gt||pred))
- audio.frechet_distance    (PaSST 768-d embeds, corpus-vs-corpus FAD)
- audio.wer                 (Whisper-base default; GLM-ASR / SenseVoice
                             backends; MagiHuman-style CJK char-level)
- audio.desync              (Synchformer, av-benchmark DeSync; vendored)
- audio.imagebind_score     (ImageBind huge, audio<->video cosine)

`audio.frechet_distance` is set-vs-set; the other six are per-sample.
FAD supports three equivalent ref-supply modes — paired samples (V2A
manifest convention), role-tagged samples, or a cached features file
via `FASTVIDEO_FAD_REF_FEATURES` — all producing identical math.

Multi-GPU thread safety: `decord.bridge.set_bridge("torch")` is set
on the worker thread inside ImageBind's compute() to work around
decord's threading.local-with-missing-`global` bug. A module-level
threading.Lock around the pytorchvideo decode call serializes that
step across workers. PaSST setup no longer uses
`contextlib.redirect_stdout` (it races across workers and closes
sys.stdout). FAD's finalize() filters non-finite PaSST embeds (silent
audio drives the softmax into log(0)) and KL skips with a clear
message in the same situation.
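
To illustrate why a bridge set on the main thread does not reach pool workers, here is a small stdlib-only sketch of state held in a `threading.local`. The `set_bridge` / `get_bridge` functions below are hypothetical stand-ins rather than decord's API, and the `"native"` default is an assumption made for the demo.

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_state = threading.local()


def set_bridge(name):
    _state.bridge = name


def get_bridge():
    # Threads that never called set_bridge fall back to the default.
    return getattr(_state, "bridge", "native")


set_bridge("torch")  # configured once on the main thread
seen = {}


def worker_without_reset():
    seen["without_reset"] = get_bridge()


def worker_with_reset():
    set_bridge("torch")  # re-set inside the worker, as compute() does
    seen["with_reset"] = get_bridge()


for target in (worker_without_reset, worker_with_reset):
    t = threading.Thread(target=target)
    t.start()
    t.join()
```

The worker that skips the re-set sees the default, while the one that re-sets sees `"torch"`, which is the behavior the worker-thread setter works around.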

New `[eval-audio]` extra in pyproject covers everything; ImageBind is
git-sourced via `[tool.uv.sources]` (CC BY-NC-SA 4.0, not vendored).
Synchformer is vendored under `_synchformer/` (MIT) and a
transformers-4.57-compatible build of GLM-ASR under `wer/_glmasr/`
(Apache-2.0). Both vendored trees keep their upstream LICENSE files.

eval README updated with the install matrix, vendored-tree
disclosure, per-metric input-contract table, and upstream-citation
table. Includes a runnable LTX2 example at
`examples/inference/eval/basic_ltx2_audio_eval.py`.

Co-Authored-By: klhhhhh <1412841649@qq.com>
`fastvideo/eval/metrics/audio/_synchformer/` (Synchformer, MIT) and
`fastvideo/eval/metrics/audio/wer/_glmasr/` (GLM-ASR, Apache-2.0) are
byte-for-byte upstream sources. yapf/codespell/mypy were lighting up
the diff with style fixes the vendoring contract forbids touching.
Add both paths to the top-level pre-commit `exclude` so every hook
(yapf, mypy, codespell, pymarkdown) skips them — mirrors the existing
treatment of `fastvideo/third_party/` for the VBench submodule. Ruff
already has the same exclude in pyproject.toml via `extend-exclude`.
Move the audio metrics' two vendored upstream packages alongside the
existing VBench submodule under ``fastvideo/third_party/eval/``:

- ``fastvideo/eval/metrics/audio/_synchformer/`` →
  ``fastvideo/third_party/eval/synchformer/``
- ``fastvideo/eval/metrics/audio/wer/_glmasr/`` →
  ``fastvideo/third_party/eval/glmasr/``

This consolidates all upstream-provenance code under one tree and lets
the existing top-level ``fastvideo/third_party/.*`` exclude in
``.pre-commit-config.yaml`` cover both new trees automatically. The
audio-specific entries previously added to ``.pre-commit-config.yaml``
and ``pyproject.toml [tool.ruff] extend-exclude`` are reverted.

Also trims module docstrings and comments across the audio metrics to
match FastVideo's house style (1–8 lines per module, comments only
where the WHY is non-obvious). Removed:

- ``redirect_stdout``-history commentary in FAD/KL setup paths
- FAD's three-mode decision-tree docstring in ``accumulate``
- ImageBind's per-line bug history in ``_IB_DECODE_LOCK`` and
  ``compute``'s ``decord.bridge`` setter
- DeSync's "Clip-length assumption" sub-section
- WER's per-backend description block

FAD on the corruption fixtures returns 739.389 (bit-identical to the
pre-refactor value); all 8 per-sample scores in the corruption suite
match.
…ents

README updates:
- audio metric list in the layout tree was missing desync + imagebind_score
- third_party/eval/ tree now lists synchformer + glmasr alongside vbench
- BaseMetric.compute signature: was list[MetricResult], actually MetricResult
- cache layout: dropped stale AMT mention, added Synchformer + ImageBind
- vbench install row corrected to "11 of 16 by default; +4 with detectron2"
- upstream-wrapping section now describes all three patterns coexisting
  in the suite (submodule / vendored / git-source-via-uv)

Contributing guide updates:
- example group list: vlm → videoscore2
- file-layout tree: dropped fictional vlm/, added audio/ and videoscore2/
- TL;DR now lists five recipes covering submodule / vendor / git-source

pyproject.toml: trim the 27-line eval-audio comment block down to one
line and revert the cosmetic backtick churn elsewhere. Net change vs
origin/main is now ~6 lines of additions (the eval-audio extra, the
imagebind source entry, codespell "passt", eval-full pulling in audio).
Net diff vs origin/main is now 4 lines of pure config (imagebind source
entry, eval-audio extra, eval-full pulls in audio, codespell ignores
"passt"). No new comments.
Collaborator

@SolitaryThinker SolitaryThinker left a comment

Review findings from the audio metric pass.


# Video preprocessing → (S, T_seg, C, 224, 224)
frames = video.float().to(self.device)
src_fps = self._src_fps if self._src_fps is not None else _SYNC_FPS

This ignores the per-sample fps that callers can pass through Evaluator.evaluate(...) and falls back to 25 fps for all pool-decoded clips. For 24/30/8 fps inputs, _resample_video computes the wrong clip duration and segment positions while the audio waveform stays at native duration, so the video/audio windows no longer line up. Please prefer sample.get("fps") when present, or skip when neither the sample nor constructor provides the source FPS.
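
The requested precedence can be sketched as a small helper. This is an illustration of the ordering asked for here (sample fps, then constructor override, then skip), not the metric's actual code; `resolve_src_fps` and its return-`None`-to-skip convention are hypothetical.

```python
from typing import Optional


def resolve_src_fps(sample: dict, ctor_src_fps: Optional[float]) -> Optional[float]:
    """Pick the source FPS; None means the caller should skip the sample."""
    fps = sample.get("fps")
    if fps is not None:
        return float(fps)           # per-sample value wins
    if ctor_src_fps is not None:
        return float(ctor_src_fps)  # constructor override as fallback
    return None                     # neither is known: skip rather than guess


# Why guessing 25 fps matters: a 240-frame clip shot at 24 fps lasts 10 s,
# but is treated as 9.6 s if 25 fps is assumed.
n_frames = 240
true_duration = n_frames / 24.0
assumed_duration = n_frames / 25.0
```

The 0.4 s discrepancy in this example is larger than the 0.2 s spacing between the desync values reported in the PR's test results, so a wrong fps assumption can plausibly move the reported offset.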

# |grid|-value → average across the two directions.
sync_grid = self._grid.to(self.device) if self._grid is not None else None
assert sync_grid is not None
s_used = min(_NUM_SEG_PER_DIRECTION, vfeats.shape[1], afeats.shape[1])

For clips shorter than the expected Synchformer window, s_used can be less than 14, but compare_v_a() still feeds that shorter token sequence into the vendored transformer. That transformer has a fixed positional embedding for the 14-segment shape and adds it without slicing/padding, so short-but-decodable clips can shape-mismatch here instead of returning a skipped result. Please guard for the required segment count before calling compare_v_a() or pad to the expected token length.
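
A guard along the lines requested might look like the sketch below. The helper name, the result dictionary layout, and the `details` keys are assumptions for illustration; only the 14-segment requirement comes from this thread.

```python
from typing import Optional

REQUIRED_SEGMENTS = 14  # segment count the fixed positional embedding expects


def check_segments(n_video_segments: int, n_audio_segments: int,
                   required: int = REQUIRED_SEGMENTS) -> Optional[dict]:
    """Return a skip record when either stream is too short, else None."""
    s_used = min(required, n_video_segments, n_audio_segments)
    if s_used < required:
        return {
            "skipped": True,
            "details": {  # hypothetical key names
                "required_segments": required,
                "available_segments": s_used,
            },
        }
    return None  # enough segments: safe to run the comparison
```

Returning a skip record before the transformer call turns a shape mismatch into a reportable result, at the cost of not scoring short clips at all; padding to the expected token length is the alternative raised in the comment.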

        self.old_stft = torch.stft

    def __enter__(self) -> None:
        torch.stft = partial(torch.stft, return_complex=False)  # type: ignore[assignment]

This context manager mutates process-global torch.stft around each PaSST forward. Multi-GPU eval runs worker threads in the same process, so KL/FAD calls can overlap: one thread can restore torch.stft while another PaSST forward still expects the patched callable, or restore another thread's partial wrapper. Current hear21passt already passes return_complex=False upstream, so please remove this monkeypatch, or protect it with a shared lock if it is truly still required.

        if self._predictor is not None:
            return
        from audiobox_aesthetics.infer import initialize_predictor
        self._predictor = initialize_predictor()

initialize_predictor() chooses its own device; upstream defaults to torch.device("cuda") when CUDA is available. In num_gpus > 1, each EvalWorker calls m.to("cuda:i") before setup, but this line ignores self.device, so all AudioBox predictors will load/run on default GPU 0. That removes the intended parallelism and can OOM GPU 0. Please construct or move the predictor onto the worker's assigned device.

…ort clips, stft race)

Four bugs flagged by SolitaryThinker on PR #1352, all reproduced and fixed:

- audio.audiobox_aesthetics: `initialize_predictor()` upstream pins to
  cuda:0 regardless of the worker's device, so every EvalWorker piled
  onto GPU 0. Re-pin `predictor.model` and `predictor.device` onto
  `self.device` in both `setup()` and `to()`. Verified across 4 GPUs.

- audio.desync (fps): the metric ignored per-sample `fps` and silently
  used 25 fps for every clip, mis-aligning audio/video windows for 24
  / 30 / 8 fps inputs. Now reads `sample["fps"]`, falls back to the
  `src_fps` constructor override, and skips with a clear message when
  neither is set.

- audio.desync (short clips): Synchformer's transformer carries a
  fixed 198-token positional embedding (~14 segments × (tv+ta) + 2
  special tokens). Clips with fewer than 14 video/audio segments
  passed `_segment_video` but crashed inside `compare_v_a` on the
  pos_emb add. Guard `s_used < 14` and skip with the required-segment
  count in `details`. Verified on a 2-s clip → 5 segments → clean
  skip.

- audio.kl_divergence and audio.frechet_distance: removed
  `_patch_passt_stft`. Upstream `hear21passt 0.0.26` already passes
  `return_complex=False` explicitly in `preprocess.py`, so the
  monkeypatch is dead code. As a context manager, it was also
  thread-unsafe: overlapping multi-worker PaSST forwards could restore
  `torch.stft` to another thread's `partial` wrapper, or restore the
  original while another forward still expected the patch, depending
  on interleaving. Verified by running KL and FAD concurrently on 4
  GPUs without races.
@SolitaryThinker SolitaryThinker merged commit 6b2c731 into main May 16, 2026
16 of 23 checks passed
@SolitaryThinker SolitaryThinker deleted the shao/eval-audio branch May 16, 2026 21:47
Labels

scope: docs (Documentation), scope: inference (Inference pipeline, serving, CLI), type: feat (New feature or capability)

2 participants