Initial C++ implementation of transforms #902
Conversation
wow just in time for me to need it haha

When do you think this might land :)?
Thanks for the PR @scotts ! Made a first pass (phew!!)
```cpp
UniqueAVFrame& avFrame = (avFilteredFrame) ? avFilteredFrame : avInputFrame;
// All of our CUDA decoding assumes NV12 format. We handle non-NV12 formats by
// converting them to NV12.
avFrame = maybeConvertAVFrameToNV12(avFrame);
```
I think what we now have is a net improvement over the logic in `main`, but I wonder if we could clarify the logic a bit further: IIUC `maybeConvertAVFrameToNV12` doesn't only convert to NV12, it will actually convert to RGB24 (on the CPU!) in some cases. I don't claim I know how to address this because I can't fit all the code branches in my head at this time, but perhaps there's an opportunity to split the logic further into separate methods?
The challenge is that we're branching on both the content of the frame and the version of FFmpeg. The current version is as clean as I could get it. What I think is promising is if we could find another way to do a CUDA format conversion on FFmpeg 4. I tried a number of approaches, but I think that should be its own effort.
Agreed this isn't trivial, and I don't know how to simplify this ATM. Should we at least rename `maybeConvertAVFrameToNV12` into `maybeConvertAVFrameToNV12OrRGB24`? When I read "maybeConvertAVFrameToNV12", I assume the behavior is that it will either:

- convert the frame to NV12
- not do any conversion

But this function actually does a second type of conversion: to RGB24, on CPU. I think reflecting that in the name can help a bit.
@MadcowD, thanks for your interest! As you can see, this PR is still very much under review. This is also only the C++ side of the implementation for this feature. We'll have a few more follow-on PRs that will bridge the Python and C++ as well as expose the APIs at the Python level. All of that will take several weeks before the feature lands on main.
@NicolasHug, thank you for catching the bug regarding
```cpp
deviceInterface_ = createDeviceInterface(device);
TORCH_CHECK(
    deviceInterface_ != nullptr,
    "Failed to create device interface. This should never happen, please report.");
```
Nit: not from this PR, but do we need to create an interface for audio streams?
No, but I personally do find it easier to maintain the invariant that `deviceInterface_` is non-null after adding a stream. This is also aligned with our eventual goal of moving audio decoding into a device interface.
```cpp
    const AVRational& timeBase) {
  timeBase_ = timeBase;
}
```
General comment on my understanding of `CpuDeviceInterface::initialize` and `CpuDeviceInterface::initializeVideo`:

- `initialize` (previously `initializeContext`) now does 2 seemingly orthogonal things:
  - it initializes the codecContext
  - it initializes the interface's `timeBase_` field
- `initializeVideo` (new) initializes other fields of the interface.

Was there a particular reason to set `timeBase_` within `initialize`? If possible I think it would help to separate concerns here. For example we could have one method to initialize the `codecContext` itself (and nothing else), and one separate method to initialize the interface and its fields (`timeBase_` and the rest). In https://github.com/meta-pytorch/torchcodec/pull/910/files#diff-e4fe2b8c629954b1dac78a122bd2b33664844f022aa0bf3abc72b5938a14f4e2R465 I did something very similar to your changes, where:

- `initializeContext` is unchanged and only initializes the codecContext
- `initialize` is a new method that initializes fields on the interface itself, including the `timeBase_` which I needed too. It's analogous to your `initializeVideo()`.

Here you're calling your `initializeVideo()` later than where I call my `initialize()`, but I think it would still work fine with your initialization order.
(I also noticed you moved the cuda context creation to the constructor, which I think is a good thing!)
@NicolasHug, my rationale here is:

- `SpecificDeviceInterface::SpecificDeviceInterface()`, as in, the actual implementation's default constructor, should initialize as much as it can given that it will be initialized without any input.
- `DeviceInterface::initialize()` is for things that are generic to decoding in general. Video and audio decoding will need whatever goes here. That's what `codecContext_` and `timeBase_` have in common.
- `DeviceInterface::initializeVideo()` is obviously just for things specific to video. At the moment, our transforms are very video specific, although I do think there's a point where we'll want to generalize them to audio and then transforms will move to `initialize()`.
- `DeviceInterface::initializeAudio()` is something we'll eventually implement when we move audio decoding into a device interface.
For posterity, @NicolasHug's point is about direction of state change. We want to differentiate between the device interface being initialized and the device interface changing the state of something else.
Thanks for the great PR @scotts, made a follow-up comment above about the interface initialization methods, but I'll approve to unblock. There's also a minor renaming suggestion in #902 (comment) which I'm flagging here as it may be missed, since there are already a lot of comments.
Initial implementation of C++ transforms. This PR makes no changes to any Python APIs.

There is a major refactoring of all things related to frame dimensions and user-provided requests. It was necessary to rationalize that with the concept of user-provided transforms. Throughout, we apply the principle: figure something out as soon as possible, store that decision in a variable, and then only reference that variable to know what to do. We apply this principle to the output dimensions for a frame and to what color conversion library to use.

Suggested way to review this PR:

1. `Transform.h` and `Transform.cpp`. This introduces the concept of a transform on the C++ side. We will eventually have a whole family of `Transform` subclasses. Right now, we only have one: `ResizeTransform`. Resizing is special in two respects: it can change the output resolution, and it's the only transform that swscale can handle. There will be other transforms that change the output resolution, such as crop. But we actually need to specially handle `ResizeTransform` so that we can use swscale if it's the only transform requested.
2. `SingleStreamDecoder.h`. New fields: `resizedOutputDims_`, which is the dimensions of a resize if it exists, and `metadataDims_`, which are the dimensions from the metadata. This is a major refactor. We used to pass around the video stream options and constantly refer to them to figure out what the output frame's dimensions should be. The video stream options no longer specify width and height; that exclusively comes from the presence of a `ResizeTransform`. We set both `resizedOutputDims_` and `metadataDims_` in `SingleStreamDecoder::addVideoStream()`. This means we no longer need to pass around the video stream options. We still give priority to the resized values, if they exist. We also move some fields from `StreamInfo` to the decoder itself. It was a lingering TODO, which I chose to do here because I was adding more decoder state. It helps clarify that the new state I'm adding belongs in the decoder and not in `StreamInfo`.
3. `SingleStreamDecoder.cpp`. Most of the changes here are a result of all of the above. Focus on `SingleStreamDecoder::addVideoStream()`. Note that when we want to know the output dimensions, we say `resizedOutputDims_.value_or(metadataDims_)`. This replaces the functions we had before.
4. `DeviceInterface.h`. The big change here is that `initializeContext()` just became `initialize()` with a long list of parameters. There's a lot of stuff we were passing to `DeviceInterface::convertAVFrameToFrameOutput()` that we now pass to `initialize()`. Note that we pass `resizedOutputDims_` if they exist.
5. `CpuDeviceInterface.h` and `CpuDeviceInterface.cpp`. Values that we used to pass into `convertAVFrameToFrameOutput()` are now passed to `initialize()`, and we just store them as state. However, because we need to be resilient to variable resolution streams, we still need to defer knowing the output frame's resolution and the color conversion library until we get a raw decoded frame. The difference from before is that we pre-compute what we can in `initialize()`. And instead of having the video stream options, we have `resizedOutputDims_` as an optional. We use that or the raw decoded frame's resolution.
6. `CudaDeviceInterface.h` and `CudaDeviceInterface.cpp`. This is both a consequence of all of the above refactors, as well as a refactoring of "Use cuda filters to support 10-bit videos" #899.
7. `custom_ops.cpp`. Right now, we have some ugly bridge code that turns `width` and `height` parameters into a `ResizeTransform`. That will eventually go away and we will accept transforms as tensors. How is beyond the scope of this PR.