Audio metadata support #535
Conversation
```diff
-set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
+# TODO Put back normal flags
+# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -pedantic -Werror ${TORCH_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall ${TORCH_CXX_FLAGS}")
```
Temporarily loosening our compilation warnings. I will make sure to put those back to normal before merging, so please assume this was left unchanged for the sake of the review. I mainly need this to show that the CI is green until I find more accurate bounds for LIBAVFILTER_VERSION_MAJOR, which is currently emitting warnings on some FFmpeg versions (just below).
```python
# This is neither a video nor audio stream. Could be e.g. subtitles.
# We still need to add a dummy entry so that len(streams_metadata)
# is consistent with the number of streams.
streams_metadata.append(StreamMetadata(**common_meta))
```
Note: before these changes, `streams_metadata` consisted only of `VideoStreamMetadata` objects (including for audio / subtitle streams).
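To make the dummy-entry bookkeeping concrete, here is a minimal sketch of the loop being discussed. All class and field names are simplified stand-ins, not torchcodec's actual definitions:

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, simplified stand-ins for the metadata classes in this PR.
@dataclass
class StreamMetadata:
    stream_index: int
    duration_seconds: Optional[float] = None

class VideoStreamMetadata(StreamMetadata):
    pass

class AudioStreamMetadata(StreamMetadata):
    pass

def build_streams_metadata(raw_streams: List[dict]) -> List[StreamMetadata]:
    """One entry per stream, so indices line up with FFmpeg's stream indices."""
    streams_metadata: List[StreamMetadata] = []
    for raw in raw_streams:
        common_meta = {
            "stream_index": raw["index"],
            "duration_seconds": raw.get("duration"),
        }
        if raw["codec_type"] == "video":
            streams_metadata.append(VideoStreamMetadata(**common_meta))
        elif raw["codec_type"] == "audio":
            streams_metadata.append(AudioStreamMetadata(**common_meta))
        else:
            # Neither video nor audio (e.g. subtitles): append a dummy entry
            # so len(streams_metadata) matches the number of streams.
            streams_metadata.append(StreamMetadata(**common_meta))
    return streams_metadata
```

The invariant this protects is that `streams_metadata[i]` always describes stream `i`, even when stream `i` is a media type we don't model.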
Utils in this file are common to both `AudioDecoder` and `VideoDecoder`. I considered creating a `_BaseDecoder` ABC, but decided against it after giving it a try, mainly because there was a lot of alternation between media-specific and media-generic logic within the ABC's `__init__()`.
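A rough sketch of the chosen approach: media-generic logic lives in module-level helpers shared by both decoders instead of in a `_BaseDecoder` ABC whose `__init__()` would have to interleave generic and media-specific steps. Names below are illustrative, not torchcodec's:

```python
def _validate_stream(streams, stream_index, media_type):
    """Media-generic helper usable by both AudioDecoder and VideoDecoder."""
    if not 0 <= stream_index < len(streams):
        raise ValueError(f"Invalid stream index {stream_index}.")
    stream = streams[stream_index]
    if stream["media_type"] != media_type:
        raise ValueError(f"Stream {stream_index} is not a {media_type} stream.")
    return stream

class VideoDecoder:
    def __init__(self, streams, stream_index):
        # Generic part is delegated to the shared helper;
        # video-specific setup (frame index, seek mode, ...) stays local.
        self.metadata = _validate_stream(streams, stream_index, "video")

class AudioDecoder:
    def __init__(self, streams, stream_index):
        self.metadata = _validate_stream(streams, stream_index, "audio")
```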
```cpp
void VideoDecoder::addAudioStream(int streamIndex) {
  TORCH_CHECK(
      seekMode_ == SeekMode::approximate,
      "seek_mode must be 'approximate' for audio streams.");
```
This is slightly overlapping with the follow-up audio decoding PR, but I decided to enforce this here. The reason we only support approximate mode for audio (for now) is that we'll only support `get_samples_played_in_range()` while decoding frames/samples from the beginning of the stream.

- There is no "accurate vs non-accurate" tradeoff for our future audio decoding implementation.
- There is zero reason to do a scan (hence why we enforce approximate). The scan doesn't bring more accurate metadata, and it just makes decoder creation slower.

Note that the `seek_mode` parameter name is slightly confusing for audio streams, because what we're effectively doing with audio streams really is "an accurate seek strategy, without a scan". But it's all private, so it's OK.

I hope that's clear enough; happy to clarify and / or follow up on the decoding PR.
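A minimal Python-side sketch of the check being enforced, assuming a simplified `AudioDecoder` whose constructor mirrors the PR's behavior (the real check lives in C++ via `TORCH_CHECK`):

```python
class AudioDecoder:
    def __init__(self, source, *, seek_mode: str = "approximate"):
        # For audio, a scan would not make the metadata more accurate and
        # would only slow down decoder creation, so only "approximate"
        # is accepted (mirroring the TORCH_CHECK above).
        if seek_mode != "approximate":
            raise ValueError("seek_mode must be 'approximate' for audio streams.")
        self._source = source
        self._seek_mode = seek_mode
```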
Makes sense to me. I'm also open to revisiting the name we chose for the API, "seek mode", to get the behavior we want for video. If there's a name that is more consistent across audio and video, it may be worth making the API change now rather than living with the incongruity forever.
```cpp
auto& streamInfo = streamInfos_[activeStreamIndex_];
streamInfo.videoStreamOptions = videoStreamOptions;
streamInfo.codecContext->thread_count =
    videoStreamOptions.ffmpegThreadCount.value_or(0);
```
TODO: need to check before merging whether this should be put back within `addStream()`, before the call to `initializeContextOnCuda()`.
Going back to this: `initializeContextOnCuda` doesn't seem to rely on `threadCount` at all, but `avcodec_open2` definitely does. So I'll have to put this back, which is going to be ugly :/
```diff
-throw std::invalid_argument("No valid stream found in input file.");
+throw std::invalid_argument(
+    "No valid stream found in input file. Is " +
+    std::to_string(streamIndex) + " of the desired media type?");
```
Minor, but since we're changing this, we could make it a `TORCH_CHECK`. We have a mix of `TORCH_CHECK` and throwing exceptions; I think we should make them all `TORCH_CHECK`s. I haven't prioritized doing it everywhere, but we might as well as we change code. Heads-up that it might require changing the C++ test case; it might expect `std::invalid_argument`.
Agreed. Let me open #537 for this; it would be a good first contribution.
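For context on why the exception type matters to callers: through typical pybind11-style bindings, `std::invalid_argument` surfaces in Python as `ValueError`, while a failed `TORCH_CHECK` surfaces as `RuntimeError`. The sketch below uses a hypothetical `torch_check` helper as a pure-Python stand-in for `TORCH_CHECK`, just to illustrate that behavioral difference:

```python
def torch_check(cond: bool, msg: str) -> None:
    # Minimal stand-in for TORCH_CHECK's behavior as seen from Python:
    # a failed check raises RuntimeError rather than ValueError.
    if not cond:
        raise RuntimeError(msg)

def find_stream(stream_media_types: dict, stream_index: int) -> str:
    torch_check(
        stream_index in stream_media_types,
        "No valid stream found in input file. Is "
        f"{stream_index} of the desired media type?",
    )
    return stream_media_types[stream_index]
```

Tests asserting on the exception type (Python or C++) would need updating accordingly when the switch happens.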
```cpp
    "Seek mode is approximate, but stream " +
    std::to_string(activeStreamIndex_) +
    " does not have an average fps in its metadata.");
}
```
Ditto on `TORCH_CHECK`.
```diff
-void VideoDecoder::validateActiveStream() {
+void VideoDecoder::validateActiveStream(
+    std::optional<AVMediaType> avMediaType) {
```
Is there ever a time when we will want to generically validate the active stream without knowing whether we're audio or video? Is there any place where we're doing that currently? If the answer to the second question is no, I'd prefer to make this a required parameter now. We can revisit when we encounter that scenario.
Yes, I should have commented on this: I made it optional because on the decoding PR (still WIP, #532), there are calls to this function where we don't pass a parameter, so as to accept both audio and video streams (namely in `getFramesPlayedInRange()`).

I don't mind too much, so let me know what you prefer, but if we make this a mandatory parameter here, we would have to revert to optional right after.
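A small sketch of the optional-parameter behavior under discussion, with illustrative names: passing `None` skips the media-type check, which is what lets a single call site accept both audio and video streams.

```python
from typing import Optional

def validate_active_stream(active_stream: Optional[dict],
                           media_type: Optional[str] = None) -> None:
    """Mirror of validateActiveStream: media_type=None accepts any stream."""
    if active_stream is None:
        raise RuntimeError("No active stream.")
    if media_type is not None and active_stream["media_type"] != media_type:
        raise RuntimeError(f"Active stream is not a {media_type} stream.")
```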
```diff
-from torchcodec.decoders import _core, VideoDecoder
+from torchcodec.decoders import _core, VideoDecoder, VideoStreamMetadata
+from torchcodec.decoders._audio_decoder import AudioDecoder
```
Since we're testing both video and audio in the same file, we should probably rename this file to `test_decoders.py` or something similar.
```python
# LICENSE file in the root directory of this source tree.

import os
from functools import partial
```
Ditto for the ops tests - although we've been in this situation for a long time because we've had the audio metadata test.
Previously, I had advocated for not having a single generic metadata class, and I still slightly prefer that approach, as I think we can model the difference between video and audio files in a way that will resonate more with our users. That is, the current way of modeling the metadata matches FFmpeg, which is convenient for the implementation: one generic class which both kinds of media can slot into. I agree that the incongruity of having a separate top-level class is minor. With that said, I'm good with this approach, as users will mostly just see the values, not the types or even the class names.
Thanks for the review! I remember now that we had a few discussions about a single "container" metadata class vs. two separate video and audio classes for this. I don't feel too strongly about either at the moment, and now that we're introducing audio seems like the moment to figure that out for good. Let me merge this as-is so as to unblock the rest, but we'll make sure to explore alternative designs before making all this public.
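To make the two designs under discussion concrete, here is a hypothetical sketch. Option A mirrors FFmpeg: one generic container class that both media kinds slot into. Option B uses separate top-level per-media classes. All names and fields are illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class VideoStreamMetadata:
    average_fps: Optional[float] = None

@dataclass
class AudioStreamMetadata:
    sample_rate: Optional[int] = None

# Option A (what this PR does): one generic ContainerMetadata for any file.
@dataclass
class ContainerMetadata:
    streams: List[Union[VideoStreamMetadata, AudioStreamMetadata]] = field(default_factory=list)

# Option B: separate top-level classes for video files and audio files.
@dataclass
class VideoFileMetadata:
    streams: List[VideoStreamMetadata] = field(default_factory=list)

@dataclass
class AudioFileMetadata:
    streams: List[AudioStreamMetadata] = field(default_factory=list)
```

Option A is convenient for the implementation; Option B makes the video/audio distinction visible in the top-level type, at the cost of some duplication.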
This PR adds metadata support for audio streams. It does not add any audio decoding capabilities.

I'm not writing a design doc, because the proposed design is a natural generalization of what we already have for videos, and I hope the implementation is obvious enough. In the main lines:

- This PR adds `AudioStreamMetadata`, the audio counterpart of the existing `VideoStreamMetadata`. They share a bunch of fields, so I created a private base class `StreamMetadata` where the common fields are defined.
- `AudioStreamMetadata` does not expose any field that relates to the concept of frame (like `num_frames`, `average_fps`, etc.). This is in spite of "frame" being a valid audio concept within FFmpeg. Users care about audio samples, not frames.
- The `VideoMetadata` class becomes `ContainerMetadata`. This is because we want to expose audio metadata on files that are pure audio (like mp3 files), not just video files.
- Audio metadata is exposed via `AudioDecoder(...).metadata`, just like for videos. The `AudioDecoder` class is currently private and does not support any decoding capabilities.

There are a bunch of TODOs marked as `TODO-AUDIO`, which are items we'll need to address before the release, but that can be done in follow-up PRs.
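A simplified sketch of the class layout this description implies: a private `StreamMetadata` base holds the shared fields, the video subclass keeps the frame-related fields, and the audio subclass omits them entirely. Field names here are illustrative, not necessarily torchcodec's exact ones:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamMetadata:                      # private base class, common fields
    duration_seconds: Optional[float] = None
    codec: Optional[str] = None
    stream_index: Optional[int] = None

@dataclass
class VideoStreamMetadata(StreamMetadata):
    num_frames: Optional[int] = None       # frame-related fields are video-only
    average_fps: Optional[float] = None

@dataclass
class AudioStreamMetadata(StreamMetadata):
    sample_rate: Optional[int] = None      # no frame-related fields for audio
```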