
Conversation


@NicolasHug NicolasHug commented Mar 5, 2025

This PR adds metadata support for audio streams. It does not add any audio decoding capabilities.

I'm not writing a design doc, because the proposed design is a natural generalization of what we already have for videos, and I hope the implementation is obvious enough. In broad strokes:

  • We add AudioStreamMetadata, the audio counterpart of existing VideoStreamMetadata. They share a bunch of fields, so I created a private base class StreamMetadata where the common fields are defined.
  • AudioStreamMetadata does not expose any field that relates to the concept of frame (like num_frames, average_fps, etc.). This is in spite of "frame" being a valid audio concept within FFmpeg. Users care about audio samples, not frames.
  • The VideoMetadata class becomes ContainerMetadata. This is because we want to expose audio metadata on files that are pure audio (like mp3 files), not just video files.
  • Functions and parameters that previously used the term "video" for metadata now use "container" instead. This is a tiny bit BC-breaking for core APIs, which were private anyway.
  • Publicly, the audio metadata is exposed via AudioDecoder(...).metadata, just like for videos. The AudioDecoder class is currently private and does not support any decoding capabilities.
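The class layout described above can be sketched roughly as follows. This is a minimal illustration, not torchcodec's actual definitions: the field names beyond those mentioned in this description (duration, codec, sample rate, channels) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamMetadata:
    # Private base class: fields shared by all stream types
    # (illustrative subset).
    duration_seconds: Optional[float] = None
    codec: Optional[str] = None
    stream_index: Optional[int] = None


@dataclass
class VideoStreamMetadata(StreamMetadata):
    # Frame-oriented fields only make sense for video.
    num_frames: Optional[int] = None
    average_fps: Optional[float] = None


@dataclass
class AudioStreamMetadata(StreamMetadata):
    # No frame-related fields: audio users care about samples, not frames.
    sample_rate: Optional[int] = None
    num_channels: Optional[int] = None


@dataclass
class ContainerMetadata:
    # Formerly "VideoMetadata": renamed because a container may be
    # pure audio (e.g. an mp3 file), not just a video file.
    streams: List[StreamMetadata] = field(default_factory=list)
```

Both stream subclasses slot into `ContainerMetadata.streams`, so a single container object can describe video files, audio files, or a mix.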

There are a bunch of TODOs marked as TODO-AUDIO which are items we'll need to address before the release, but that can be done in follow-up PRs.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 5, 2025
@NicolasHug NicolasHug marked this pull request as ready for review March 5, 2025 14:35
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
# TODO Put back normal flags
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall ${TORCH_CXX_FLAGS}")
Contributor Author


Temporarily loosening our compilation warnings. I will make sure to put these back to normal before merging, so please assume this was left unchanged for the sake of the review. I mainly need this to show that the CI is green until I find more accurate bounds for LIBAVFILTER_VERSION_MAJOR, which is currently emitting warnings on some FFmpeg versions (just below).

# This is neither a video nor audio stream. Could be e.g. subtitles.
# We still need to add a dummy entry so that len(streams_metadata)
# is consistent with the number of streams.
streams_metadata.append(StreamMetadata(**common_meta))
Contributor Author


Note: before these changes, streams_metadata consisted only of VideoStreamMetadata objects (including for audio / subtitle streams).
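The dispatch around the snippet above can be sketched like this. The class stand-ins and the `(media_type, common_meta)` input shape are assumptions made for illustration; they are not the actual output of torchcodec's core ops.

```python
# Minimal stand-ins for the real metadata classes, just to make the
# dispatch logic below runnable.
class StreamMetadata:
    def __init__(self, **common_meta):
        self.__dict__.update(common_meta)


class VideoStreamMetadata(StreamMetadata):
    pass


class AudioStreamMetadata(StreamMetadata):
    pass


def build_streams_metadata(raw_streams):
    # raw_streams: (media_type, common_meta) pairs -- an assumed shape.
    streams_metadata = []
    for media_type, common_meta in raw_streams:
        if media_type == "video":
            streams_metadata.append(VideoStreamMetadata(**common_meta))
        elif media_type == "audio":
            streams_metadata.append(AudioStreamMetadata(**common_meta))
        else:
            # Neither video nor audio (e.g. subtitles): append a dummy
            # base entry so len(streams_metadata) still matches the
            # container's stream count.
            streams_metadata.append(StreamMetadata(**common_meta))
    return streams_metadata
```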

Contributor Author


Utils in this file are common to both AudioDecoder and VideoDecoder. I considered creating a _BaseDecoder ABC, but decided against it after giving it a try, mainly because the ABC's __init__() ended up alternating between a lot of media-specific and media-generic logic.

void VideoDecoder::addAudioStream(int streamIndex) {
TORCH_CHECK(
seekMode_ == SeekMode::approximate,
"seek_mode must be 'approximate' for audio streams.");
Contributor Author


This is slightly overlapping with the follow-up audio decoding PR. But I decided to enforce this here. The reason we only support approximate mode for audio (for now) is because we'll only support get_samples_played_in_range() while decoding frames/samples from the beginning of the stream.

  • There is no "accurate vs non-accurate" tradeoff for our future audio decoding implementation.
  • There is zero reason to do a scan (hence why we enforce approximate). The scan doesn't bring more accurate metadata, and it just makes the decoder creation slower.

Note that this seek_mode parameter name is slightly confusing for audio streams, because what we're effectively doing with audio streams is "an accurate seek strategy, without a scan". But it's all private, so it's OK.

I hope that's clear enough, happy to clarify and / or follow-up on the decoding PR.
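A Python-side sketch of the constraint the C++ TORCH_CHECK above enforces. This is a hypothetical helper written for illustration, not torchcodec's API: a scan never improves audio metadata, so anything other than "approximate" is rejected up front.

```python
def add_audio_stream(seek_mode: str) -> None:
    # A scan doesn't yield more accurate metadata for audio streams;
    # it only slows down decoder creation, so reject "exact" mode.
    if seek_mode != "approximate":
        raise ValueError("seek_mode must be 'approximate' for audio streams.")
```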

Contributor


Makes sense to me. I'm also open to revisiting the name we chose for the API, "seek mode", to get the behavior we want for video. If there's a name that is more consistent across audio and video, it may be worth making the API change now rather than living with the incongruity forever.

auto& streamInfo = streamInfos_[activeStreamIndex_];
streamInfo.videoStreamOptions = videoStreamOptions;
streamInfo.codecContext->thread_count =
videoStreamOptions.ffmpegThreadCount.value_or(0);
Contributor Author


TODO need to check before merging whether this should be put back within addStream(), before the call to initializeContextOnCuda().

Contributor Author


Going back to this: initializeContextOnCuda doesn't seem to rely on threadCount at all, but avcodec_open2 definitely does.
So I'll have to put this back, which is going to be ugly :/

throw std::invalid_argument("No valid stream found in input file.");
throw std::invalid_argument(
"No valid stream found in input file. Is " +
std::to_string(streamIndex) + " of the desired media type?");
Contributor

@scotts scotts Mar 5, 2025


Minor, but since we're changing this, we could make it a TORCH_CHECK. We have a mix of TORCH_CHECK and throwing exceptions; I think we should make them all TORCH_CHECKs. I haven't prioritized doing it everywhere, but might as well as we change code. Heads-up that it might require changing the C++ test case; it might expect std::invalid_argument.

Contributor Author

@NicolasHug NicolasHug Mar 6, 2025


Agreed. Let me open #537 for this; it would be a good first contribution.

"Seek mode is approximate, but stream " +
std::to_string(activeStreamIndex_) +
" does not have an average fps in its metadata.");
}
Contributor


Ditto on TORCH_CHECK.


void VideoDecoder::validateActiveStream() {
void VideoDecoder::validateActiveStream(
std::optional<AVMediaType> avMediaType) {
Contributor


Is there ever a time when we will want to generically check the valid active stream but not know if we're audio or video? Is there any place we're doing it currently? If the answer to the second question is no, I'd prefer to make this a required field now. We can revisit when we encounter that scenario.

Contributor Author

@NicolasHug NicolasHug Mar 5, 2025


Yes I should have commented on this: I made this optional because on the decoding PR (still WIP #532), there are calls to this function where we don't pass a parameter so as to accept both audio and video streams (namely in getFramesPlayedInRange()).

I don't mind too much so let me know what you prefer, but if we make this a mandatory parameter here, we would have to revert to optional right after.
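The behavior in question can be sketched in Python as follows. This is a hypothetical mirror of the C++ signature, written for illustration only.

```python
from typing import Optional


def validate_active_stream(
    active_media_type: str, expected: Optional[str] = None
) -> None:
    # When expected is None, any active stream passes -- the case
    # getFramesPlayedInRange() needs when it serves both audio and
    # video streams.
    if expected is not None and active_media_type != expected:
        raise RuntimeError(
            f"Expected a {expected} stream, got {active_media_type}."
        )
```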


from torchcodec.decoders import _core, VideoDecoder
from torchcodec.decoders import _core, VideoDecoder, VideoStreamMetadata
from torchcodec.decoders._audio_decoder import AudioDecoder
Contributor


Since we're testing both video and audio in the same file, we should probably change this file name to test_decoders.py or something similar.

# LICENSE file in the root directory of this source tree.

import os
from functools import partial
Contributor


Ditto for the ops tests - although we've been in this situation for a long time because we've had the audio metadata test.

@scotts
Contributor

scotts commented Mar 5, 2025

Previously, I had advocated against a ContainerMetadata class generic to both audio and video, in favor of top-level VideoMetadata and AudioMetadata classes, one for each kind of file.

I still slightly prefer that approach, as I think we can model the difference between video and audio files in a way that will resonate more with our users. The current modeling matches FFmpeg, which is convenient for the implementation: one generic class that both kinds of media can slot into. With separate top-level classes, an AudioMetadata could only contain audio streams, while a VideoMetadata could contain audio, video and subtitle streams. I agree that the incongruity is minor. I think audio files can have multiple streams, although it's rare and I can't get clarity on that.

With that said, I'm good with this approach, as users will mostly just see the values, not the types or even the class names.
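For comparison, the separate-classes alternative could look roughly like this. These are purely hypothetical names and shapes sketched for illustration, not torchcodec's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioMetadata:
    # An audio-only file: the type itself models that it can only
    # hold audio streams.
    audio_streams: List[object] = field(default_factory=list)


@dataclass
class VideoMetadata:
    # A video container: may mix video, audio and subtitle streams.
    video_streams: List[object] = field(default_factory=list)
    audio_streams: List[object] = field(default_factory=list)
    subtitle_streams: List[object] = field(default_factory=list)
```

The upside is that impossible states (a video stream inside an mp3) are unrepresentable; the downside is two top-level types where FFmpeg's container model only needs one.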

@NicolasHug
Contributor Author

Thanks for the review! I remember now that we had a few discussions about a single "container" metadata class vs. two separate video and audio classes. I don't feel too strongly about either at the moment, and now that we're introducing audio seems like the right time to figure that out for good. Let me merge this as-is to unblock the rest, but we'll make sure to explore alternative designs before making all this public.

@NicolasHug NicolasHug merged commit 79954f9 into meta-pytorch:main Mar 6, 2025
46 checks passed