[feat] eval: add audio metrics #1352
03778b7 to c940ada
Code Review
This pull request introduces a comprehensive audio evaluation suite for video generations, adding metrics such as CLAP, AudioBox Aesthetics, KL Divergence, FAD, WER, DeSync, and ImageBind. The implementation includes new metric classes, documentation updates, and the vendoring of necessary third-party components like Synchformer and GLM-ASR. Key feedback identifies potential runtime errors in the Fréchet Audio Distance calculation due to incorrect empty array initialization and highlights maintainability concerns regarding unused files and non-standard import practices in the vendored code.
        self._ref_buf.extend(other._ref_buf)

    def finalize(self) -> MetricResult:
        gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))
Initializing an empty array with np.empty((0,)) creates a 1D array. When self._gen_buf is empty, gen_all becomes this 1D array, and the subsequent call to np.isfinite(gen_all).all(axis=1) on line 182 would raise an AxisError because axis 1 is out of bounds. To prevent this, initialize the empty array with the correct number of dimensions.
    - gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, ))
    + gen_all = np.stack(self._gen_buf) if self._gen_buf else np.empty((0, PASST_EMBED_DIM))
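The failure mode and the suggested fix can be reproduced with plain NumPy. This is a minimal sketch: `PASST_EMBED_DIM = 768` is assumed from the 768-d PaSST embeddings named in the PR description, and `stack_or_empty` and `finite_row_mask` are hypothetical helper names, not functions from the PR.

```python
import numpy as np

PASST_EMBED_DIM = 768  # assumption: matches the 768-d PaSST embeddings in the PR description


def stack_or_empty(buf, dim=PASST_EMBED_DIM):
    """Stack per-sample embeddings; keep a 2D (0, dim) shape when buf is empty."""
    return np.stack(buf) if buf else np.empty((0, dim))


def finite_row_mask(arr):
    """True for rows whose entries are all finite (requires a 2D array)."""
    return np.isfinite(arr).all(axis=1)


# The 1D empty array from the original code has no axis 1.
try:
    finite_row_mask(np.empty((0, )))
    one_d_ok = True
except ValueError:  # NumPy's AxisError subclasses ValueError and IndexError
    one_d_ok = False

# The 2D empty array yields an empty mask instead of raising.
empty_mask = finite_row_mask(stack_or_empty([]))
```

With the 2D shape, downstream code still has to decide what an empty corpus means, but it reaches that decision instead of failing on the axis lookup.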
        n_ref_dropped = 0
        ref_source = "cached"
    else:
        ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))
Similar to the gen_all array, ref_all should be initialized as a 2D array to avoid a potential AxisError on the next line when self._ref_buf is empty.
    - ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, ))
    + ref_all = np.stack(self._ref_buf) if self._ref_buf else np.empty((0, PASST_EMBED_DIM))
    vsegs = _segment_video(frames).unsqueeze(0)  # (B=1, S, T_seg, C, H, W)
    # Synchformer's extract_vfeats wants (B, S, T_seg, C, H, W); fold (B, S) → (B*S, 1, ...).
    b, s = vsegs.shape[:2]
    from einops import rearrange

    import decord
    from imagebind import data as ib_data
    from imagebind.models.imagebind_model import ModalityType

    import torch
    from torchvision.models.resnet import BasicBlock, Bottleneck, ResNet

    sys.path.append('.')  # nopep8
This file appears to be unused in the project and contains a sys.path modification which is generally discouraged as it can lead to unpredictable import behavior. The imports on lines 11-13 also seem to rely on a project structure that doesn't exist here. If this file is not needed, it would be best to remove it to improve maintainability. If it is needed, the imports should be refactored to be relative to the fastvideo package.
Seven `audio.*` metrics ported 1:1 from `hkchengrex/av-benchmark`, the
V2A literature's de-facto eval harness:
- audio.clap_score (HF ClapModel, laion/clap-htsat-fused)
- audio.audiobox_aesthetics (Meta audiobox_aesthetics, PQ as score)
- audio.kl_divergence (PaSST AudioSet-527 logits, KL(gt||pred))
- audio.frechet_distance (PaSST 768-d embeds, corpus-vs-corpus FAD)
- audio.wer (Whisper-base default; GLM-ASR / SenseVoice
backends; MagiHuman-style CJK char-level)
- audio.desync (Synchformer, av-benchmark DeSync; vendored)
- audio.imagebind_score (ImageBind huge, audio<->video cosine)
`audio.frechet_distance` is set-vs-set; the other six are per-sample.
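As an illustration of one per-sample metric, the following is a minimal NumPy sketch of KL(gt || pred) computed from two logit vectors, returning `None` when either input is non-finite so the caller can report a skip. It assumes a softmax over the class logits; the function names are hypothetical and this is not the PR's implementation.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def kl_gt_pred(gt_logits, pred_logits):
    """Return KL(gt || pred) in nats, or None when either input is non-finite."""
    gt_logits = np.asarray(gt_logits, dtype=np.float64)
    pred_logits = np.asarray(pred_logits, dtype=np.float64)
    if not (np.isfinite(gt_logits).all() and np.isfinite(pred_logits).all()):
        return None  # the caller would turn this into a skipped result
    log_p = log_softmax(gt_logits)
    log_q = log_softmax(pred_logits)
    # sum_i p_i * (log p_i - log q_i)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))
```

Identical logits give 0, and the value grows as the predicted distribution moves away from the ground-truth one.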
FAD supports three equivalent ref-supply modes — paired samples (V2A
manifest convention), role-tagged samples, or a cached features file
via `FASTVIDEO_FAD_REF_FEATURES` — all producing identical math.
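Whichever mode supplies the reference set, the quantity being computed is the Fréchet distance between two Gaussians fitted to the embedding sets. The sketch below is a NumPy-only version under stated assumptions: the `eps` value is illustrative, the function name is hypothetical, and the trace of the matrix square root is taken from the eigenvalues of the covariance product rather than from a dedicated `sqrtm` routine.

```python
import numpy as np


def frechet_distance(gen, ref, eps=1e-6):
    """Fréchet distance between Gaussians fitted to two (N, D) embedding sets."""
    gen = np.asarray(gen, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    # Drop rows containing NaN/inf before estimating statistics.
    gen = gen[np.isfinite(gen).all(axis=1)]
    ref = ref[np.isfinite(ref).all(axis=1)]
    dim = gen.shape[1]
    mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
    # eps * I regularization on both covariances (eps is an assumed value).
    cov_g = np.cov(gen, rowvar=False) + eps * np.eye(dim)
    cov_r = np.cov(ref, rowvar=False) + eps * np.eye(dim)
    # tr(sqrt(cov_g @ cov_r)) equals the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_g @ cov_r)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt)
```

Two identical sets give a distance of about 0, and shifting one set by a constant vector `c` adds `||c||^2` because the covariances are unchanged.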
Multi-GPU thread safety: `decord.bridge.set_bridge("torch")` is set
on the worker thread inside ImageBind's compute() to work around
decord's threading.local-with-missing-`global` bug. A module-level
threading.Lock around the pytorchvideo decode call serializes that
step across workers. PaSST setup no longer uses
`contextlib.redirect_stdout` (it races across workers and closes
sys.stdout). FAD's finalize() filters non-finite PaSST embeds (silent
audio drives the softmax into log(0)) and KL skips with a clear
message in the same situation.
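The two threading points above (per-thread state that must be set on the worker itself, and a module-level lock around a decode step) can be sketched with the standard library alone. Everything here is a stand-in: `_bridge_state`, `set_bridge`, and `current_bridge` only imitate a library setting stored in `threading.local`, and no decord or pytorchvideo API is called.

```python
import threading

_bridge_state = threading.local()  # stand-in for a per-thread library setting
_DECODE_LOCK = threading.Lock()    # module-level lock shared by all workers


def set_bridge(name):
    _bridge_state.name = name


def current_bridge():
    # A thread that never called set_bridge sees the default.
    return getattr(_bridge_state, "name", "native")


def bridge_seen_by_new_thread():
    """Report which bridge a freshly started thread observes."""
    seen = []
    t = threading.Thread(target=lambda: seen.append(current_bridge()))
    t.start()
    t.join()
    return seen[0]


def worker_compute(results, idx):
    set_bridge("torch")   # re-set on the worker thread itself
    with _DECODE_LOCK:    # serialize the step that is not thread-safe
        results[idx] = current_bridge()


def run_workers(n):
    results = [None] * n
    threads = [threading.Thread(target=worker_compute, args=(results, i)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `set_bridge("torch")` on the main thread does not change what `bridge_seen_by_new_thread()` reports, which is why the setter has to run inside the worker's own `compute()` path.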
New `[eval-audio]` extra in pyproject covers everything; ImageBind is
git-sourced via `[tool.uv.sources]` (CC BY-NC-SA 4.0, not vendored).
Synchformer is vendored under `_synchformer/` (MIT) and a
transformers-4.57-compatible build of GLM-ASR under `wer/_glmasr/`
(Apache-2.0). Both vendored trees keep their upstream LICENSE files.
eval README updated with the install matrix, vendored-tree
disclosure, per-metric input-contract table, and upstream-citation
table. Includes a runnable LTX2 example at
`examples/inference/eval/basic_ltx2_audio_eval.py`.
Co-Authored-By: klhhhhh <1412841649@qq.com>
`fastvideo/eval/metrics/audio/_synchformer/` (Synchformer, MIT) and `fastvideo/eval/metrics/audio/wer/_glmasr/` (GLM-ASR, Apache-2.0) are byte-for-byte upstream sources. yapf/codespell/mypy were lighting up the diff with style fixes the vendoring contract forbids touching. Add both paths to the top-level pre-commit `exclude` so every hook (yapf, mypy, codespell, pymarkdown) skips them — mirrors the existing treatment of `fastvideo/third_party/` for the VBench submodule. Ruff already has the same exclude in pyproject.toml via `extend-exclude`.
Move the audio metrics' two vendored upstream packages alongside the existing VBench submodule under ``fastvideo/third_party/eval/``:
- ``fastvideo/eval/metrics/audio/_synchformer/`` → ``fastvideo/third_party/eval/synchformer/``
- ``fastvideo/eval/metrics/audio/wer/_glmasr/`` → ``fastvideo/third_party/eval/glmasr/``
This consolidates all upstream-provenance code under one tree and lets the existing top-level ``fastvideo/third_party/.*`` exclude in ``.pre-commit-config.yaml`` cover both new trees automatically. The audio-specific entries previously added to ``.pre-commit-config.yaml`` and ``pyproject.toml [tool.ruff] extend-exclude`` are reverted.
Also trims module docstrings and comments across the audio metrics to match FastVideo's house style (1–8 lines per module, comments only where the WHY is non-obvious). Removed:
- ``redirect_stdout``-history commentary in FAD/KL setup paths
- FAD's three-mode decision-tree docstring in ``accumulate``
- ImageBind's per-line bug history in ``_IB_DECODE_LOCK`` and ``compute``'s ``decord.bridge`` setter
- DeSync's "Clip-length assumption" sub-section
- WER's per-backend description block
FAD on the corruption fixtures returns 739.389 (bit-identical to the pre-refactor value); all 8 per-sample scores in the corruption suite match.
…ents
README updates:
- audio metric list in the layout tree was missing desync + imagebind_score
- third_party/eval/ tree now lists synchformer + glmasr alongside vbench
- BaseMetric.compute signature: was list[MetricResult], actually MetricResult
- cache layout: dropped stale AMT mention, added Synchformer + ImageBind
- vbench install row corrected to "11 of 16 by default; +4 with detectron2"
- upstream-wrapping section now describes all three patterns coexisting in the suite (submodule / vendored / git-source-via-uv)
Contributing guide updates:
- example group list: vlm → videoscore2
- file-layout tree: dropped fictional vlm/, added audio/ and videoscore2/
- TL;DR now lists five recipes covering submodule / vendor / git-source
pyproject.toml: trim the 27-line eval-audio comment block down to one line and revert the cosmetic backtick churn elsewhere. Net change vs origin/main is now ~6 lines of additions (the eval-audio extra, the imagebind source entry, codespell "passt", eval-full pulling in audio).
Net diff vs origin/main is now 4 lines of pure config (imagebind source entry, eval-audio extra, eval-full pulls in audio, codespell ignores "passt"). No new comments.
c940ada to 8a26e5d
SolitaryThinker left a comment
Review findings from the audio metric pass.
    # Video preprocessing → (S, T_seg, C, 224, 224)
    frames = video.float().to(self.device)
    src_fps = self._src_fps if self._src_fps is not None else _SYNC_FPS
This ignores the per-sample fps that callers can pass through Evaluator.evaluate(...) and falls back to 25 fps for all pool-decoded clips. For 24/30/8 fps inputs, _resample_video computes the wrong clip duration and segment positions while the audio waveform stays at native duration, so the video/audio windows no longer line up. Please prefer sample.get("fps") when present, or skip when neither the sample nor constructor provides the source FPS.
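The requested precedence (per-sample `fps`, then the constructor value, otherwise skip) can be sketched as below. `resolve_src_fps` and `clip_duration_s` are hypothetical helpers written for illustration, not code from the PR.

```python
def resolve_src_fps(sample, ctor_src_fps=None):
    """Return the source FPS for one sample, or None if it is unknown."""
    fps = sample.get("fps")      # per-sample value wins when present
    if fps is None:
        fps = ctor_src_fps       # fall back to the constructor override
    if fps is None or fps <= 0:
        return None              # caller should skip rather than guess 25 fps
    return float(fps)


def clip_duration_s(num_frames, fps):
    """Duration implied by a frame count at a given frame rate."""
    return num_frames / fps
```

The mismatch is easy to see on a 120-frame clip: it lasts 5.0 s at its true 24 fps but would be treated as 4.8 s under an assumed 25 fps, while the audio waveform keeps its native 5.0 s length.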
    # |grid|-value → average across the two directions.
    sync_grid = self._grid.to(self.device) if self._grid is not None else None
    assert sync_grid is not None
    s_used = min(_NUM_SEG_PER_DIRECTION, vfeats.shape[1], afeats.shape[1])
For clips shorter than the expected Synchformer window, s_used can be less than 14, but compare_v_a() still feeds that shorter token sequence into the vendored transformer. That transformer has a fixed positional embedding for the 14-segment shape and adds it without slicing/padding, so short-but-decodable clips can shape-mismatch here instead of returning a skipped result. Please guard for the required segment count before calling compare_v_a() or pad to the expected token length.
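A guard of the kind requested here might look like the sketch below. The required count of 14 comes from the comment above; the sliding-window parameters (`seg_len=16`, `stride=8`) are assumptions chosen so that a 2 s clip at 25 fps (50 frames) yields the 5 segments reported in the follow-up commit, and both function names are hypothetical.

```python
REQUIRED_SEGMENTS = 14  # segment count named in the review comment


def count_segments(num_frames, seg_len=16, stride=8):
    """Sliding-window segment count; seg_len and stride are assumed values."""
    if num_frames < seg_len:
        return 0
    return (num_frames - seg_len) // stride + 1


def check_segments(n_video_segs, n_audio_segs):
    """Return (ok, details) so the caller can skip instead of crashing."""
    s_used = min(REQUIRED_SEGMENTS, n_video_segs, n_audio_segs)
    if s_used < REQUIRED_SEGMENTS:
        details = {
            "skipped": True,
            "segments_found": s_used,
            "segments_required": REQUIRED_SEGMENTS,
        }
        return False, details
    return True, {"skipped": False, "segments_found": s_used}
```

Checking before the transformer call turns a shape mismatch deep inside the vendored model into an explicit skipped result with the required count attached.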
        self.old_stft = torch.stft

    def __enter__(self) -> None:
        torch.stft = partial(torch.stft, return_complex=False)  # type: ignore[assignment]
This context manager mutates process-global torch.stft around each PaSST forward. Multi-GPU eval runs worker threads in the same process, so KL/FAD calls can overlap: one thread can restore torch.stft while another PaSST forward still expects the patched callable, or restore another thread's partial wrapper. Current hear21passt already passes return_complex=False upstream, so please remove this monkeypatch, or protect it with a shared lock if it is truly still required.
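If the patch had to stay, the shared-lock option mentioned above could be sketched as follows. `fake_lib` is a stand-in namespace rather than `torch`, and `patched_attr` is a hypothetical helper; the point is only that patch and restore happen under one lock so they cannot interleave across threads.

```python
import threading
from contextlib import contextmanager
from types import SimpleNamespace

_PATCH_LOCK = threading.Lock()  # shared by every thread that applies the patch


@contextmanager
def patched_attr(module, name, replacement):
    """Temporarily replace module.<name>; the lock keeps patch and restore paired."""
    with _PATCH_LOCK:
        original = getattr(module, name)
        setattr(module, name, replacement)
        try:
            yield
        finally:
            setattr(module, name, original)


fake_lib = SimpleNamespace(stft=lambda: "original")  # stand-in for a library module


def forward(results, idx):
    # Each worker sees the patched callable for the whole duration of its call.
    with patched_attr(fake_lib, "stft", lambda: "patched"):
        results[idx] = fake_lib.stft()


def run_forwards(n):
    results = [None] * n
    threads = [threading.Thread(target=forward, args=(results, i)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The cost is that the lock is held for the whole patched call, so the forwards run one at a time; removing the patch entirely, as the comment suggests, avoids that serialization.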
    if self._predictor is not None:
        return
    from audiobox_aesthetics.infer import initialize_predictor
    self._predictor = initialize_predictor()
initialize_predictor() chooses its own device; upstream defaults to torch.device("cuda") when CUDA is available. In num_gpus > 1, each EvalWorker calls m.to("cuda:i") before setup, but this line ignores self.device, so all AudioBox predictors will load/run on default GPU 0. That removes the intended parallelism and can OOM GPU 0. Please construct or move the predictor onto the worker's assigned device.
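One way to move an already constructed predictor onto the worker's device is sketched below. It assumes the predictor exposes `model` and `device` attributes, as the follow-up commit message describes; `repin_predictor` is a hypothetical helper, and the two `Fake*` classes are stand-ins so the sketch runs without a GPU or the upstream package.

```python
def repin_predictor(predictor, device):
    """Move predictor.model to `device` and record the device on the predictor."""
    predictor.model = predictor.model.to(device)
    predictor.device = device
    return predictor


class FakeModel:
    """Stand-in with a torch-like .to() that returns the moved module."""

    def __init__(self, device):
        self.device = device

    def to(self, device):
        self.device = device
        return self


class FakePredictor:
    """Stand-in for a predictor that defaulted to cuda:0 at construction."""

    def __init__(self):
        self.device = "cuda:0"
        self.model = FakeModel(self.device)


# Each worker re-pins its own predictor onto its assigned device.
workers = [repin_predictor(FakePredictor(), f"cuda:{i}") for i in range(4)]
worker_devices = [(p.device, p.model.device) for p in workers]
```

Applying the same re-pin in both `setup()` and `to()` keeps the predictor aligned with the metric's device regardless of which is called first.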
…ort clips, stft race)
Four bugs flagged by SolitaryThinker on PR #1352, all reproduced and fixed:
- audio.audiobox_aesthetics: `initialize_predictor()` upstream pins to cuda:0 regardless of the worker's device, so every EvalWorker piled onto GPU 0. Re-pin `predictor.model` and `predictor.device` onto `self.device` in both `setup()` and `to()`. Verified across 4 GPUs.
- audio.desync (fps): the metric ignored per-sample `fps` and silently used 25 fps for every clip, mis-aligning audio/video windows for 24 / 30 / 8 fps inputs. Now reads `sample["fps"]`, falls back to the `src_fps` constructor override, and skips with a clear message when neither is set.
- audio.desync (short clips): Synchformer's transformer carries a fixed 198-token positional embedding (~14 segments × (tv+ta) + 2 special tokens). Clips with fewer than 14 video/audio segments passed `_segment_video` but crashed inside `compare_v_a` on the pos_emb add. Guard `s_used < 14` and skip with the required-segment count in `details`. Verified on a 2-s clip → 5 segments → clean skip.
- audio.kl_divergence and audio.frechet_distance: removed `_patch_passt_stft`. Upstream `hear21passt 0.0.26` already passes `return_complex=False` explicitly in `preprocess.py`, so the monkeypatch is dead code. As a context manager, it was also thread-unsafe: multi-worker PaSST forwards could restore another thread's partially-patched `torch.stft` or its original, depending on interleaving. Verified by running KL and FAD concurrently on 4 GPUs without races.
Purpose
Add seven `audio.*` metrics to `fastvideo.eval`, partially ported from `hkchengrex/av-benchmark` (the V2A literature's de-facto eval harness used by MMAudio, FoleyCrafter, V2A-Mapper). Closes the audio gap in the eval suite so video-with-audio models like LTX2 can be scored end-to-end through the same registry-driven API as `common.*`, `vbench.*`, and `physics_iq.*`.
Fixes #
Changes
- `fastvideo/eval/metrics/audio/`:
  - `audio.clap_score` (HF `ClapModel`, `laion/clap-htsat-fused`)
  - `audio.audiobox_aesthetics` (Meta `audiobox_aesthetics`, PQ as score)
  - `audio.kl_divergence` (PaSST AudioSet-527 logits, `KL(gt || pred)`)
  - `audio.frechet_distance` (PaSST 768-d embeds, corpus-vs-corpus FAD)
  - `audio.wer` (Whisper-base default; GLM-ASR and SenseVoice optional)
  - `audio.desync` (Synchformer, av-benchmark `DeSync`)
  - `audio.imagebind_score` (ImageBind huge, audio/video cosine)
- `[eval-audio]` extra in `pyproject.toml`; ImageBind is git-sourced via `[tool.uv.sources]` (CC BY-NC-SA 4.0 so kept out of the FastVideo tree).
- `fastvideo/third_party/eval/` alongside the existing VBench submodule:
  - `synchformer/` (MIT) for `audio.desync`
  - `glmasr/` (Apache-2.0) for `audio.wer (glm_asr)` on transformers 4.57
- `decord.bridge.set_bridge("torch")` re-set on the worker thread in `imagebind_score.compute()` (decord's bridge is `threading.local` with a missing `global` upstream; setting on the main thread is lost).
- `threading.Lock` around pytorchvideo's decord-backed video decode.
- `contextlib.redirect_stdout(devnull)` removed from FAD and KL setup paths (races across workers and closes `sys.stdout`).
- `audio.frechet_distance`: filter non-finite PaSST rows before `np.cov`, surface drop counts in `details`, always-on `eps * I` regularization on both covariances.
- FAD ref-supply modes: paired samples (`reference_audio` on each sample, matches V2A manifests), role-tagged samples (`role="reference"`), or a cached `.pt` features file pointed to by `FASTVIDEO_FAD_REF_FEATURES`.
- `docs/contributing/eval-metrics.md` updated: install matrix, per-metric input contracts, vendored-tree disclosure, upstream-citation table, three-pattern upstream-wrapping guide (submodule / vendored / git-source-via-uv).
- `examples/inference/eval/basic_ltx2_audio_eval.py`.
Test Plan
Variants exercised by the corruption script: original LTX2 audio, silence,
white noise, low volume (-20 dB), reversed audio, pitch-shifted +6
semitones, audio delayed +1 s, audio advanced -1 s, audio replaced with an
unrelated Inception clip.
Test Results
Pre-commit: yapf, ruff, codespell, pymarkdown, mypy, filename, suggestion
all pass on the 5-commit diff.
End-to-end 4-GPU eval on 9 corruption variants of the LTX2 clip (eval wall time ~45 s for `evaluate()` plus ~104 s of model loading at ctor; per-sample ~5 s averaged across 4 workers):
Corpus FAD across the 9 variants paired against the original: 739.389 (silence drops 1 non-finite PaSST embed, surfaced in `details.n_gen_dropped_nonfinite`).
How to read the table
- `clap` rewards timbre/category match with the prompt and ignores temporal order, so reversed speech can score higher than the original.
- `audiobox` (PQ) measures audio quality in isolation and is amplitude-invariant.
- `KL` and `imagebind` correctly drop toward zero on noise/silence/unrelated audio.
- `desync` lands at the Synchformer ±2 s grid's nearest match to the ground-truth shift (0.2 s for original, 1.2/0.8 s for the 1 s shifts, 2.0 s for silence and unrelated audio).
- `wer` reaches 1.0 on non-speech and ~0.78 on pitch-shifted speech where Whisper recovers some words.
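For readers unfamiliar with how a WER value such as 1.0 arises, the sketch below computes word error rate as token-level edit distance divided by the reference length, with an optional character-level mode of the kind used for CJK text. It is a generic illustration with hypothetical function names and no text normalization, not the metric's actual backend.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            ))
        prev = cur
    return prev[-1]


def error_rate(reference, hypothesis, char_level=False):
    """WER by default; character-level when char_level=True (e.g. CJK text)."""
    if char_level:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

An empty transcript against a three-word reference is three deletions out of three reference words, which gives exactly 1.0; partial recovery of the words gives a value between 0 and 1.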
Checklist
`pre-commit run --all-files` and fixed all issues
For model/pipeline changes, also check: