
Conversation


@NicolasHug NicolasHug commented Mar 5, 2025

This PR adds metadata support for audio streams. It does not add any audio decoding capabilities.

I'm not writing a design doc, because the proposed design is a natural generalization of what we already have for videos, and I hope the implementation is obvious enough. In broad strokes:

  • We add AudioStreamMetadata, the audio counterpart of existing VideoStreamMetadata. They share a bunch of fields, so I created a private base class StreamMetadata where the common fields are defined.
  • AudioStreamMetadata does not expose any field that relates to the concept of frame (like num_frames, average_fps, etc.). This is in spite of "frame" being a valid audio concept within FFmpeg. Users care about audio samples, not frames.
  • The VideoMetadata class becomes ContainerMetadata. This is because we want to expose audio metadata on files that are pure audio (like mp3 files), not just video files.
  • Functions and parameters that previously used the term "video" for metadata now use "container" instead. This is a tiny bit BC-breaking for core APIs, which were private anyway.
  • Publicly, the audio metadata is exposed via AudioDecoder(...).metadata, just like for videos. The AudioDecoder class is currently private and does not support any decoding capabilities.
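The class layout described above can be sketched roughly as follows. This is a minimal illustration, not torchcodec's actual definitions: the field names beyond those mentioned in this description (duration, codec, sample rate, channels) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamMetadata:
    # Private base class: fields shared by all stream types
    # (illustrative subset).
    duration_seconds: Optional[float] = None
    codec: Optional[str] = None
    stream_index: Optional[int] = None


@dataclass
class VideoStreamMetadata(StreamMetadata):
    # Frame-oriented fields only make sense for video.
    num_frames: Optional[int] = None
    average_fps: Optional[float] = None


@dataclass
class AudioStreamMetadata(StreamMetadata):
    # No frame-related fields: audio users care about samples, not frames.
    sample_rate: Optional[int] = None
    num_channels: Optional[int] = None


@dataclass
class ContainerMetadata:
    # Formerly "VideoMetadata": renamed because a container may be
    # pure audio (e.g. an mp3 file), not just a video file.
    streams: List[StreamMetadata] = field(default_factory=list)
```

Both stream subclasses slot into `ContainerMetadata.streams`, so a single container object can describe video files, audio files, or a mix.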

There are a bunch of TODOs marked as TODO-AUDIO which are items we'll need to address before the release, but that can be done in follow-up PRs.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 5, 2025
@NicolasHug NicolasHug marked this pull request as ready for review March 5, 2025 14:35
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
# TODO Put back normal flags
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall ${TORCH_CXX_FLAGS}")
Contributor Author


Temporarily loosening our compilation warnings. I will make sure to put these back to normal before merging, so please assume this was left unchanged for the sake of the review. I mainly need this to show that the CI is green until I find more accurate bounds for LIBAVFILTER_VERSION_MAJOR, which is currently emitting warnings on some FFmpeg versions (just below).

# This is neither a video nor audio stream. Could be e.g. subtitles.
# We still need to add a dummy entry so that len(streams_metadata)
# is consistent with the number of streams.
streams_metadata.append(StreamMetadata(**common_meta))
Contributor Author


Note: before these changes, streams_metadata consisted only of VideoStreamMetadata objects (including for audio / subtitle streams).
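The dispatch around the snippet above can be sketched like this. The class stand-ins and the `(media_type, common_meta)` input shape are assumptions made for illustration; they are not the actual output of torchcodec's core ops.

```python
# Minimal stand-ins for the real metadata classes, just to make the
# dispatch logic below runnable.
class StreamMetadata:
    def __init__(self, **common_meta):
        self.__dict__.update(common_meta)


class VideoStreamMetadata(StreamMetadata):
    pass


class AudioStreamMetadata(StreamMetadata):
    pass


def build_streams_metadata(raw_streams):
    # raw_streams: (media_type, common_meta) pairs -- an assumed shape.
    streams_metadata = []
    for media_type, common_meta in raw_streams:
        if media_type == "video":
            streams_metadata.append(VideoStreamMetadata(**common_meta))
        elif media_type == "audio":
            streams_metadata.append(AudioStreamMetadata(**common_meta))
        else:
            # Neither video nor audio (e.g. subtitles): append a dummy
            # base entry so len(streams_metadata) still matches the
            # container's stream count.
            streams_metadata.append(StreamMetadata(**common_meta))
    return streams_metadata
```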

Contributor Author


Utils in this file are common to both AudioDecoder and VideoDecoder. I considered creating a _BaseDecoder ABC, but decided against it after giving it a try, mainly because the ABC's __init__() ended up alternating between a lot of media-specific and media-generic logic.

void VideoDecoder::addAudioStream(int streamIndex) {
TORCH_CHECK(
seekMode_ == SeekMode::approximate,
"seek_mode must be 'approximate' for audio streams.");
Contributor Author


This is slightly overlapping with the follow-up audio decoding PR. But I decided to enforce this here. The reason we only support approximate mode for audio (for now) is because we'll only support get_samples_played_in_range() while decoding frames/samples from the beginning of the stream.

  • There is no "accurate vs non-accurate" tradeoff for our future audio decoding implementation.
  • There is zero reason to do a scan (hence why we enforce approximate). The scan doesn't bring more accurate metadata, and it just makes the decoder creation slower.

Note that this seek_mode parameter name is slightly confusing for audio streams, because what we're effectively doing with audio streams is "an accurate seek strategy, without a scan". But it's all private, so it's OK.

I hope that's clear enough, happy to clarify and / or follow-up on the decoding PR.
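A Python-side sketch of the constraint the C++ TORCH_CHECK above enforces. This is a hypothetical helper written for illustration, not torchcodec's API: a scan never improves audio metadata, so anything other than "approximate" is rejected up front.

```python
def add_audio_stream(seek_mode: str) -> None:
    # A scan doesn't yield more accurate metadata for audio streams;
    # it only slows down decoder creation, so reject "exact" mode.
    if seek_mode != "approximate":
        raise ValueError("seek_mode must be 'approximate' for audio streams.")
```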

Contributor


Makes sense to me. I'm also open to revisiting the name we chose for the API, "seek mode", to get the behavior we want for video. If there's a name that is more consistent across audio and video, it may be worth making the API change now rather than living with the incongruity forever.

auto& streamInfo = streamInfos_[activeStreamIndex_];
streamInfo.videoStreamOptions = videoStreamOptions;
streamInfo.codecContext->thread_count =
videoStreamOptions.ffmpegThreadCount.value_or(0);
Contributor Author


TODO need to check before merging whether this should be put back within addStream(), before the call to initializeContextOnCuda().

Contributor Author


Going back to this: initializeContextOnCuda doesn't seem to rely on threadCount at all, but avcodec_open2 definitely does.
So I'll have to put this back, which is going to be ugly :/

throw std::invalid_argument("No valid stream found in input file.");
throw std::invalid_argument(
"No valid stream found in input file. Is " +
std::to_string(streamIndex) + " of the desired media type?");
Contributor

@scotts scotts Mar 5, 2025


Minor, but since we're changing this, we could make it a TORCH_CHECK. We have a mix of TORCH_CHECK and throwing exceptions; I think we should make them all TORCH_CHECKs. I haven't prioritized doing it everywhere, but might as well as we change code. Heads-up that it might require changing the C++ test case; it might expect std::invalid_argument.

Contributor Author

@NicolasHug NicolasHug Mar 6, 2025


Agreed. Let me open #537 for this; it would be a good first contribution.

"Seek mode is approximate, but stream " +
std::to_string(activeStreamIndex_) +
" does not have an average fps in its metadata.");
}
Contributor


Ditto on TORCH_CHECK.


void VideoDecoder::validateActiveStream() {
void VideoDecoder::validateActiveStream(
std::optional<AVMediaType> avMediaType) {
Contributor


Is there ever a time when we will want to generically check the valid active stream but not know if we're audio or video? Is there any place we're doing it currently? If the answer to the second question is no, I'd prefer to make this a required field now. We can revisit when we encounter that scenario.

Contributor Author

@NicolasHug NicolasHug Mar 5, 2025


Yes I should have commented on this: I made this optional because on the decoding PR (still WIP #532), there are calls to this function where we don't pass a parameter so as to accept both audio and video streams (namely in getFramesPlayedInRange()).

I don't mind too much so let me know what you prefer, but if we make this a mandatory parameter here, we would have to revert to optional right after.
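The behavior in question can be sketched in Python as follows. This is a hypothetical mirror of the C++ signature, written for illustration only.

```python
from typing import Optional


def validate_active_stream(
    active_media_type: str, expected: Optional[str] = None
) -> None:
    # When expected is None, any active stream passes -- the case
    # getFramesPlayedInRange() needs when it serves both audio and
    # video streams.
    if expected is not None and active_media_type != expected:
        raise RuntimeError(
            f"Expected a {expected} stream, got {active_media_type}."
        )
```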


from torchcodec.decoders import _core, VideoDecoder
from torchcodec.decoders import _core, VideoDecoder, VideoStreamMetadata
from torchcodec.decoders._audio_decoder import AudioDecoder
Contributor


Since we're testing both video and audio in the same file, we should probably change this file name to test_decoders.py or something similar.

# LICENSE file in the root directory of this source tree.

import os
from functools import partial
Contributor


Ditto for the ops tests - although we've been in this situation for a long time because we've had the audio metadata test.

@scotts
Contributor

scotts commented Mar 5, 2025

Previously, I had advocated against a ContainerMetadata class generic to both audio and video, in favor of top-level VideoMetadata and AudioMetadata classes, one for each kind of file.

I still slightly prefer that approach, as I think we can model the difference between video and audio files in a way that will resonate more with our users. The current modeling matches FFmpeg, which is convenient for the implementation: one generic class that both kinds of media can slot into. With separate top-level classes, an AudioMetadata could only contain audio streams, while a VideoMetadata could contain audio, video and subtitle streams. I agree that the incongruity is minor. I think audio files can have multiple streams, although it's rare and I can't get clarity on that.

With that said, I'm good with this approach, as users will mostly just see the values, not the types or even the class names.
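For comparison, the separate-classes alternative could look roughly like this. These are purely hypothetical names and shapes sketched for illustration, not torchcodec's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioMetadata:
    # An audio-only file: the type itself models that it can only
    # hold audio streams.
    audio_streams: List[object] = field(default_factory=list)


@dataclass
class VideoMetadata:
    # A video container: may mix video, audio and subtitle streams.
    video_streams: List[object] = field(default_factory=list)
    audio_streams: List[object] = field(default_factory=list)
    subtitle_streams: List[object] = field(default_factory=list)
```

The upside is that impossible states (a video stream inside an mp3) are unrepresentable; the downside is two top-level types where FFmpeg's container model only needs one.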

@NicolasHug
Contributor Author

Thanks for the review! I remember now that we had a few discussions about a single "container" metadata class vs. two separate video and audio classes. I don't feel too strongly about either at the moment, and now that we're introducing audio seems like the right time to figure that out for good. Let me merge this as-is to unblock the rest, but we'll make sure to explore alternative designs before making all this public.

@NicolasHug NicolasHug merged commit 79954f9 into meta-pytorch:main Mar 6, 2025
46 checks passed