[feat] eval: add audio metrics by shaoxiongduan · Pull Request #1352 · hao-ai-lab/FastVideo

shaoxiongduan · 2026-05-15T06:49:03Z

Purpose

Add seven audio.* metrics to fastvideo.eval, partially ported from
hkchengrex/av-benchmark (the V2A literature's de-facto eval harness used
by MMAudio, FoleyCrafter, V2A-Mapper). Closes the audio gap in the eval
suite so video-with-audio models like LTX2 can be scored end-to-end through
the same registry-driven API as common.*, vbench.*, and physics_iq.*.

Fixes #

Changes

New metrics under fastvideo/eval/metrics/audio/:
- audio.clap_score (HF ClapModel, laion/clap-htsat-fused)
- audio.audiobox_aesthetics (Meta audiobox_aesthetics, PQ as score)
- audio.kl_divergence (PaSST AudioSet-527 logits, KL(gt || pred))
- audio.frechet_distance (PaSST 768-d embeds, corpus-vs-corpus FAD)
- audio.wer (Whisper-base default; GLM-ASR and SenseVoice optional)
- audio.desync (Synchformer, av-benchmark DeSync)
- audio.imagebind_score (ImageBind huge, audio/video cosine)
New [eval-audio] extra in pyproject.toml; ImageBind is git-sourced via
[tool.uv.sources] (CC BY-NC-SA 4.0 so kept out of the FastVideo tree).
Vendored upstream under fastvideo/third_party/eval/ alongside the
existing VBench submodule:
- synchformer/ (MIT) for audio.desync
- glmasr/ (Apache-2.0) for audio.wer (glm_asr) on transformers 4.57
Thread-safety guards for the multi-GPU pool path:
- decord.bridge.set_bridge("torch") re-set on the worker thread in
  imagebind_score.compute() (decord's bridge is threading.local with
  a missing global upstream; setting on the main thread is lost).
- threading.Lock around pytorchvideo's decord-backed video decode.
- contextlib.redirect_stdout(devnull) removed from FAD and KL setup
  paths (races across workers and closes sys.stdout).
Numerical robustness in audio.frechet_distance: filter non-finite PaSST
rows before np.cov, surface drop counts in details, always-on
eps * I regularization on both covariances.
Three equivalent ways to supply the FAD reference set: per-sample paired
(reference_audio on each sample, matches V2A manifests), role-tagged
samples (role="reference"), or a cached .pt features file pointed to
by FASTVIDEO_FAD_REF_FEATURES.
Eval README and docs/contributing/eval-metrics.md updated:
install matrix, per-metric input contracts, vendored-tree disclosure,
upstream-citation table, three-pattern upstream-wrapping guide (submodule
/ vendored / git-source-via-uv).
LTX2 audio example: examples/inference/eval/basic_ltx2_audio_eval.py.
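
The numerical-robustness steps listed above for audio.frechet_distance can be sketched as follows. This is a minimal NumPy illustration, not the code in this PR: the function names, the eps value of 1e-6, the eigenvalue-based trace of the matrix square root, and the details key n_ref_dropped_nonfinite are assumptions made for the example (only n_gen_dropped_nonfinite is named in the PR text).

```python
import numpy as np


def drop_nonfinite_rows(feats):
    """Return (rows containing only finite values, number of rows dropped)."""
    feats = np.asarray(feats, dtype=np.float64)
    keep = np.isfinite(feats).all(axis=1)
    return feats[keep], int((~keep).sum())


def frechet_distance(gen, ref, eps=1e-6):
    """Frechet distance between two embedding sets (eps is an assumed value)."""
    gen, n_gen_dropped = drop_nonfinite_rows(gen)
    ref, n_ref_dropped = drop_nonfinite_rows(ref)
    dim = gen.shape[1]
    mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
    # Always-on eps * I regularization on both covariances.
    cov_g = np.cov(gen, rowvar=False) + eps * np.eye(dim)
    cov_r = np.cov(ref, rowvar=False) + eps * np.eye(dim)
    # tr(sqrt(cov_g @ cov_r)) is the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_g @ cov_r).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    fd = ((mu_g - mu_r) ** 2).sum() + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt
    details = {
        "n_gen_dropped_nonfinite": n_gen_dropped,
        "n_ref_dropped_nonfinite": n_ref_dropped,  # hypothetical key name
    }
    return float(fd), details


rng = np.random.default_rng(0)
ref_feats = rng.normal(size=(64, 4))
# One all-NaN row stands in for a silent clip's non-finite embedding.
gen_feats = np.vstack([ref_feats, np.full((1, 4), np.nan)])
fd_same, details_same = frechet_distance(gen_feats, ref_feats)
fd_shifted, _ = frechet_distance(ref_feats + 3.0, ref_feats)
```

Filtering before np.cov matters because a single NaN row would otherwise propagate into every covariance entry; reporting the drop count keeps that filtering visible to the caller.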

Test Plan

# Lint
pre-commit run --from-ref origin/main --to-ref HEAD

# 7 metrics x 9 corruption variants of one LTX2 clip on 4 H200s
srun --gpus=4 --gres=gpu:4 python \
  outputs_video/ltx2_audio_eval/_score_ltx2_corruptions.py --num-gpus 4

# Single-GPU sanity on real movie clip + LTX2 video
srun --gpus=1 --gres=gpu:1 python \
  outputs_video/ltx2_audio_eval/_score_all_audio.py

Variants exercised by the corruption script: original LTX2 audio, silence,
white noise, low volume (-20 dB), reversed audio, pitch-shifted +6
semitones, audio delayed +1 s, audio advanced -1 s, audio replaced with an
unrelated Inception clip.

Test Results

Pre-commit: yapf, ruff, codespell, pymarkdown, mypy, filename, suggestion
all pass on the 5-commit diff.

End-to-end 4-GPU eval on 9 corruption variants of the LTX2 clip
(eval wall time ~45 s for evaluate() plus ~104 s of model loading at
ctor; per-sample ~5 s averaged across 4 workers):

Variant	clap	audiobox PQ	KL	WER	desync s	imagebind
original	0.187	7.56	0.00	0.00	0.20	0.380
silence	NaN	6.88	(skip)	1.00	2.00	0.006
white_noise	0.071	4.91	5.01	1.00	1.40	0.080
low_volume	0.139	7.37	0.12	0.00	0.20	0.366
reversed	0.277	6.93	0.25	1.00	0.80	0.370
pitch_up	0.184	6.55	0.45	0.78	0.20	0.311
desync_delay_1s	0.128	8.11	0.39	0.28	1.20	0.344
desync_advance_1s	0.101	8.04	0.17	0.22	0.80	0.371
unrelated_audio	-0.098	6.59	3.97	1.00	2.00	0.070

Corpus FAD across the 9 variants paired against the original: 739.389
(silence drops 1 non-finite PaSST embed, surfaced in
details.n_gen_dropped_nonfinite).

How to read the table

clap rewards timbre/category match with the prompt and ignores temporal
order, so reversed speech can score higher than the original. audiobox
(PQ) measures audio quality in isolation and is largely insensitive to
amplitude (7.37 at -20 dB vs 7.56 for the original). imagebind correctly
drops toward zero on noise/silence/unrelated audio, while KL rises sharply
on noise and unrelated audio and is skipped on silence. desync lands at the
Synchformer ±2 s grid's nearest match to the ground-truth shift (0.2 s for
original, 1.2/0.8 s for the 1 s shifts, 2.0 s for silence and unrelated
audio). wer reaches 1.0 on non-speech and reversed speech, and ~0.78 on
pitch-shifted speech where Whisper recovers some words.
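
For reading the KL column, the direction KL(gt || pred) can be illustrated with a small sketch. It assumes a softmax over class logits (the PR mentions PaSST AudioSet-527 logits); the function names and the three-class toy logits are hypothetical and not taken from the PR.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D vector of class logits."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def kl_gt_pred(gt_logits, pred_logits):
    """KL(gt || pred) = sum_c p_gt(c) * (log p_gt(c) - log p_pred(c))."""
    log_p = log_softmax(gt_logits)
    log_q = log_softmax(pred_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum())


gt = [2.0, 0.0, -1.0]         # toy logits for the reference audio
kl_identical = kl_gt_pred(gt, gt)
kl_mismatch = kl_gt_pred(gt, [-1.0, 0.0, 2.0])
```

Identical distributions give a KL of zero, which is consistent with the 0.00 in the original row, and the value grows as the predicted class distribution moves away from the reference.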

Checklist

I ran pre-commit run --all-files and fixed all issues
I added or updated tests for my changes
I updated documentation if needed
I considered GPU memory impact of my changes

For model/pipeline changes, also check:

I verified SSIM regression tests pass (not a model/pipeline change)
I updated the support matrix if adding a new model (not adding a model)

mergify · 2026-05-15T06:49:53Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

#approved-reviews-by>=1
check-success=fastcheck-passed
check-success=full-suite-passed

This rule is failing.

#approved-reviews-by>=1
check-success=fastcheck-passed
check-success=full-suite-passed
check-success~=pre-commit
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

gemini-code-assist

Code Review

This pull request introduces a comprehensive audio evaluation suite for video generations, adding metrics such as CLAP, AudioBox Aesthetics, KL Divergence, FAD, WER, DeSync, and ImageBind. The implementation includes new metric classes, documentation updates, and the vendoring of necessary third-party components like Synchformer and GLM-ASR. Key feedback identifies potential runtime errors in the Fréchet Audio Distance calculation due to incorrect empty array initialization and highlights maintainability concerns regarding unused files and non-standard import practices in the vendored code.

gemini-code-assist · 2026-05-15T06:51:58Z

+        self._ref_buf.extend(other._ref_buf)
+
+    def finalize(self) -> MetricResult:
+        gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))


Initializing an empty array with np.empty((0,)) creates a 1D array. When self._gen_buf is empty, gen_all becomes this 1D array, and the subsequent call to np.isfinite(gen_all).all(axis=1) on line 182 would raise an AxisError because axis 1 is out of bounds. To prevent this, initialize the empty array with the correct number of dimensions.

Suggested change

gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))

gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, PASST_EMBED_DIM))
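
The failure mode described in this comment can be reproduced in isolation. The sketch below is a standalone illustration with a hypothetical helper, finite_row_mask; PASST_EMBED_DIM is set to 768 here because the PR description quotes 768-d PaSST embeddings.

```python
import numpy as np

PASST_EMBED_DIM = 768  # assumed from the "PaSST 768-d embeds" note in the PR


def finite_row_mask(buf, empty_shape):
    """Stack the buffer (or build an empty array) and flag all-finite rows."""
    arr = np.stack(buf) if buf else np.empty(empty_shape)
    return np.isfinite(arr).all(axis=1)


# 1-D empty array: axis=1 does not exist, so NumPy raises AxisError,
# which subclasses both ValueError and IndexError.
try:
    finite_row_mask([], (0, ))
    raised_axis_error = False
except (ValueError, IndexError):
    raised_axis_error = True

# 2-D empty array: the same reduction succeeds and yields an empty mask.
empty_mask = finite_row_mask([], (0, PASST_EMBED_DIM))
```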

gemini-code-assist · 2026-05-15T06:51:59Z

+            n_ref_dropped = 0
+            ref_source = "cached"
+        else:
+            ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))


Similar to the gen_all array, ref_all should be initialized as a 2D array to avoid a potential AxisError on the next line when self._ref_buf is empty.

Suggested change

ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))

ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, PASST_EMBED_DIM))

gemini-code-assist · 2026-05-15T06:51:59Z

+        vsegs = _segment_video(frames).unsqueeze(0)  # (B=1, S, T_seg, C, H, W)
+        # Synchformer's extract_vfeats wants (B, S, T_seg, C, H, W); fold (B, S) → (B*S, 1, ...).
+        b, s = vsegs.shape[:2]
+        from einops import rearrange


For better code organization and to avoid re-importing on every call, it's best practice to move this import statement to the top of the file with the other imports.

gemini-code-assist · 2026-05-15T06:51:59Z

+        import decord
+        from imagebind import data as ib_data
+        from imagebind.models.imagebind_model import ModalityType


For better code organization and to avoid re-importing on every call, these imports should be moved to the top of the file.

gemini-code-assist · 2026-05-15T06:51:59Z

+import torch
+from torchvision.models.resnet import BasicBlock, Bottleneck, ResNet
+
+sys.path.append('.')  # nopep8


This file appears to be unused in the project and contains a sys.path modification which is generally discouraged as it can lead to unpredictable import behavior. The imports on lines 11-13 also seem to rely on a project structure that doesn't exist here. If this file is not needed, it would be best to remove it to improve maintainability. If it is needed, the imports should be refactored to be relative to the fastvideo package.

Seven `audio.*` metrics ported 1:1 from `hkchengrex/av-benchmark`, the V2A literature's de-facto eval harness:
- audio.clap_score (HF ClapModel, laion/clap-htsat-fused)
- audio.audiobox_aesthetics (Meta audiobox_aesthetics, PQ as score)
- audio.kl_divergence (PaSST AudioSet-527 logits, KL(gt||pred))
- audio.frechet_distance (PaSST 768-d embeds, corpus-vs-corpus FAD)
- audio.wer (Whisper-base default; GLM-ASR / SenseVoice backends; MagiHuman-style CJK char-level)
- audio.desync (Synchformer, av-benchmark DeSync; vendored)
- audio.imagebind_score (ImageBind huge, audio<->video cosine)

`audio.frechet_distance` is set-vs-set; the other six are per-sample. FAD supports three equivalent ref-supply modes: paired samples (V2A manifest convention), role-tagged samples, or a cached features file via `FASTVIDEO_FAD_REF_FEATURES`. All three produce identical math.

Multi-GPU thread safety: `decord.bridge.set_bridge("torch")` is set on the worker thread inside ImageBind's compute() to work around decord's threading.local-with-missing-`global` bug. A module-level threading.Lock around the pytorchvideo decode call serializes that step across workers. PaSST setup no longer uses `contextlib.redirect_stdout` (it races across workers and closes sys.stdout). FAD's finalize() filters non-finite PaSST embeds (silent audio drives the softmax into log(0)) and KL skips with a clear message in the same situation.

New `[eval-audio]` extra in pyproject covers everything; ImageBind is git-sourced via `[tool.uv.sources]` (CC BY-NC-SA 4.0, not vendored). Synchformer is vendored under `_synchformer/` (MIT) and a transformers-4.57-compatible build of GLM-ASR under `wer/_glmasr/` (Apache-2.0). Both vendored trees keep their upstream LICENSE files.

eval README updated with the install matrix, vendored-tree disclosure, per-metric input-contract table, and upstream-citation table. Includes a runnable LTX2 example at `examples/inference/eval/basic_ltx2_audio_eval.py`.
Co-Authored-By: klhhhhh <1412841649@qq.com>

`fastvideo/eval/metrics/audio/_synchformer/` (Synchformer, MIT) and `fastvideo/eval/metrics/audio/wer/_glmasr/` (GLM-ASR, Apache-2.0) are byte-for-byte upstream sources. yapf/codespell/mypy were lighting up the diff with style fixes the vendoring contract forbids touching. Add both paths to the top-level pre-commit `exclude` so every hook (yapf, mypy, codespell, pymarkdown) skips them — mirrors the existing treatment of `fastvideo/third_party/` for the VBench submodule. Ruff already has the same exclude in pyproject.toml via `extend-exclude`.

Move the audio metrics' two vendored upstream packages alongside the existing VBench submodule under ``fastvideo/third_party/eval/``:
- ``fastvideo/eval/metrics/audio/_synchformer/`` → ``fastvideo/third_party/eval/synchformer/``
- ``fastvideo/eval/metrics/audio/wer/_glmasr/`` → ``fastvideo/third_party/eval/glmasr/``

This consolidates all upstream-provenance code under one tree and lets the existing top-level ``fastvideo/third_party/.*`` exclude in ``.pre-commit-config.yaml`` cover both new trees automatically. The audio-specific entries previously added to ``.pre-commit-config.yaml`` and ``pyproject.toml [tool.ruff] extend-exclude`` are reverted.

Also trims module docstrings and comments across the audio metrics to match FastVideo's house style (1–8 lines per module, comments only where the WHY is non-obvious). Removed:
- ``redirect_stdout``-history commentary in FAD/KL setup paths
- FAD's three-mode decision-tree docstring in ``accumulate``
- ImageBind's per-line bug history in ``_IB_DECODE_LOCK`` and ``compute``'s ``decord.bridge`` setter
- DeSync's "Clip-length assumption" sub-section
- WER's per-backend description block

FAD on the corruption fixtures returns 739.389 (bit-identical to the pre-refactor value); all 8 per-sample scores in the corruption suite match.

…ents

README updates:
- audio metric list in the layout tree was missing desync + imagebind_score
- third_party/eval/ tree now lists synchformer + glmasr alongside vbench
- BaseMetric.compute signature: was list[MetricResult], actually MetricResult
- cache layout: dropped stale AMT mention, added Synchformer + ImageBind
- vbench install row corrected to "11 of 16 by default; +4 with detectron2"
- upstream-wrapping section now describes all three patterns coexisting in the suite (submodule / vendored / git-source-via-uv)

Contributing guide updates:
- example group list: vlm → videoscore2
- file-layout tree: dropped fictional vlm/, added audio/ and videoscore2/
- TL;DR now lists five recipes covering submodule / vendor / git-source

pyproject.toml: trim the 27-line eval-audio comment block down to one line and revert the cosmetic backtick churn elsewhere. Net change vs origin/main is now ~6 lines of additions (the eval-audio extra, the imagebind source entry, codespell "passt", eval-full pulling in audio).

Net diff vs origin/main is now 4 lines of pure config (imagebind source entry, eval-audio extra, eval-full pulls in audio, codespell ignores "passt"). No new comments.

SolitaryThinker

Review findings from the audio metric pass.

SolitaryThinker · 2026-05-16T01:10:27Z

+
+        # Video preprocessing → (S, T_seg, C, 224, 224)
+        frames = video.float().to(self.device)
+        src_fps = self._src_fps if self._src_fps is not None else _SYNC_FPS


This ignores the per-sample fps that callers can pass through Evaluator.evaluate(...) and falls back to 25 fps for all pool-decoded clips. For 24/30/8 fps inputs, _resample_video computes the wrong clip duration and segment positions while the audio waveform stays at native duration, so the video/audio windows no longer line up. Please prefer sample.get("fps") when present, or skip when neither the sample nor constructor provides the source FPS.
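
One way to express the precedence requested here (per-sample fps first, then the constructor value, otherwise skip) is sketched below. The helper names resolve_src_fps and clip_duration_s are hypothetical and do not come from the PR; the 48-frame clip is a made-up example.

```python
from typing import Any, Mapping, Optional


def resolve_src_fps(sample: Mapping[str, Any], ctor_fps: Optional[float]) -> Optional[float]:
    """Pick the source FPS: the sample's value wins, then the constructor's.

    Returns None when neither is usable, which the caller would treat as
    "skip this sample" rather than silently assuming a default.
    """
    fps = sample.get("fps")
    if fps is None:
        fps = ctor_fps
    if fps is None or fps <= 0:
        return None
    return float(fps)


def clip_duration_s(n_frames: int, fps: float) -> float:
    """Clip duration implied by a frame count at a given frame rate."""
    return n_frames / fps


# A 48-frame clip recorded at 24 fps lasts 2.0 s; assuming 25 fps instead
# shortens the computed duration, so video windows drift against the audio.
true_duration = clip_duration_s(48, resolve_src_fps({"fps": 24}, 25.0))
assumed_duration = clip_duration_s(48, 25.0)
```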

SolitaryThinker · 2026-05-16T01:10:27Z

+        # |grid|-value → average across the two directions.
+        sync_grid = self._grid.to(self.device) if self._grid is not None else None
+        assert sync_grid is not None
+        s_used = min(_NUM_SEG_PER_DIRECTION, vfeats.shape[1], afeats.shape[1])


For clips shorter than the expected Synchformer window, s_used can be less than 14, but compare_v_a() still feeds that shorter token sequence into the vendored transformer. That transformer has a fixed positional embedding for the 14-segment shape and adds it without slicing/padding, so short-but-decodable clips can shape-mismatch here instead of returning a skipped result. Please guard for the required segment count before calling compare_v_a() or pad to the expected token length.
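
A guard of the kind suggested here could look like the sketch below. It is illustrative only: the constant name, the helper, and the shape of the returned dictionary are hypothetical, and the value 14 is the segment count named in this comment.

```python
from typing import Optional

REQUIRED_SEGMENTS = 14  # segment count named in the review comment


def short_clip_skip(n_video_segs: int, n_audio_segs: int) -> Optional[dict]:
    """Return a skip record when either stream has too few segments.

    Returning None means the clip is long enough and the comparison can run.
    """
    s_used = min(REQUIRED_SEGMENTS, n_video_segs, n_audio_segs)
    if s_used < REQUIRED_SEGMENTS:
        return {
            "skipped": True,
            "required_segments": REQUIRED_SEGMENTS,
            "available_segments": s_used,
        }
    return None


short = short_clip_skip(5, 5)      # too short: skip record
enough = short_clip_skip(14, 20)   # long enough: None
```

Checking before the transformer call turns a shape-mismatch crash into an explicit skipped result that downstream aggregation can count.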

SolitaryThinker · 2026-05-16T01:10:27Z

+        self.old_stft = torch.stft
+
+    def __enter__(self) -> None:
+        torch.stft = partial(torch.stft, return_complex=False)  # type: ignore[assignment]


This context manager mutates process-global torch.stft around each PaSST forward. Multi-GPU eval runs worker threads in the same process, so KL/FAD calls can overlap: one thread can restore torch.stft while another PaSST forward still expects the patched callable, or restore another thread's partial wrapper. Current hear21passt already passes return_complex=False upstream, so please remove this monkeypatch, or protect it with a shared lock if it is truly still required.

SolitaryThinker · 2026-05-16T01:10:27Z

+        if self._predictor is not None:
+            return
+        from audiobox_aesthetics.infer import initialize_predictor
+        self._predictor = initialize_predictor()


initialize_predictor() chooses its own device; upstream defaults to torch.device("cuda") when CUDA is available. In num_gpus > 1, each EvalWorker calls m.to("cuda:i") before setup, but this line ignores self.device, so all AudioBox predictors will load/run on default GPU 0. That removes the intended parallelism and can OOM GPU 0. Please construct or move the predictor onto the worker's assigned device.

…ort clips, stft race)

Four bugs flagged by SolitaryThinker on PR #1352, all reproduced and fixed:
- audio.audiobox_aesthetics: `initialize_predictor()` upstream pins to cuda:0 regardless of the worker's device, so every EvalWorker piled onto GPU 0. Re-pin `predictor.model` and `predictor.device` onto `self.device` in both `setup()` and `to()`. Verified across 4 GPUs.
- audio.desync (fps): the metric ignored per-sample `fps` and silently used 25 fps for every clip, mis-aligning audio/video windows for 24 / 30 / 8 fps inputs. Now reads `sample["fps"]`, falls back to the `src_fps` constructor override, and skips with a clear message when neither is set.
- audio.desync (short clips): Synchformer's transformer carries a fixed 198-token positional embedding (~14 segments × (tv+ta) + 2 special tokens). Clips with fewer than 14 video/audio segments passed `_segment_video` but crashed inside `compare_v_a` on the pos_emb add. Guard `s_used < 14` and skip with the required-segment count in `details`. Verified on a 2-s clip → 5 segments → clean skip.
- audio.kl_divergence and audio.frechet_distance: removed `_patch_passt_stft`. Upstream `hear21passt 0.0.26` already passes `return_complex=False` explicitly in `preprocess.py`, so the monkeypatch is dead code. As a context manager, it was also thread-unsafe: multi-worker PaSST forwards could restore another thread's partially-patched `torch.stft` or its original, depending on interleaving. Verified by running KL and FAD concurrently on 4 GPUs without races.

mergify Bot added the labels type: feat (New feature or capability), scope: inference (Inference pipeline, serving, CLI), and scope: docs (Documentation) on May 15, 2026

shaoxiongduan force-pushed the shao/eval-audio branch from 03778b7 to c940ada on May 15, 2026 06:51

gemini-code-assist Bot reviewed May 15, 2026

shaoxiongduan requested a review from SolitaryThinker May 15, 2026 06:58

shaoxiongduan and others added 5 commits May 15, 2026 22:10

[docs] eval: drop the comments we added to pyproject

8a26e5d

shaoxiongduan force-pushed the shao/eval-audio branch from c940ada to 8a26e5d on May 15, 2026 22:10

SolitaryThinker reviewed May 16, 2026

SolitaryThinker approved these changes May 16, 2026

SolitaryThinker merged commit 6b2c731 into main May 16, 2026
16 of 23 checks passed

SolitaryThinker deleted the shao/eval-audio branch May 16, 2026 21:47

SolitaryThinker merged 6 commits into main from shao/eval-audio

2 participants
