@scotts (Contributor) commented on Sep 19, 2025

Initial implementation of C++ transforms. This PR makes no changes to any Python APIs.

There is a major refactoring of everything related to frame dimensions and user-provided requests; it was necessary to reconcile those with the concept of user-provided transforms. Throughout, we apply the principle: figure something out as soon as possible, store that decision in a variable, and then only reference that variable to know what to do. We apply this principle to a frame's output dimensions and to which color conversion library to use.

Suggested way to review this PR:

  1. First read Transform.h and Transform.cpp. This introduces the concept of a transform on the C++ side. We will eventually have a whole family of Transform subclasses. Right now, we only have one: ResizeTransform. Resizing is special in two respects: it can change the output resolution, and it's the only transform that swscale can handle. There will be other transforms that change the output resolution, such as crop. But we need to specially handle ResizeTransform so that we can use swscale when it's the only transform requested. (A rough sketch of the transform hierarchy follows this list.)
  2. Review SingleStreamDecoder.h.
    • We track new state in the decoder. Notably, we maintain a vector of transforms, an optional resizedOutputDims_ holding the output dimensions of a resize if one exists, and metadataDims_, the dimensions from the stream metadata. This is a major refactor. We used to pass around the video stream options and constantly refer to them to figure out what the output frame's dimensions should be. The video stream options no longer specify width and height; that now comes exclusively from the presence of a ResizeTransform. We set both resizedOutputDims_ and metadataDims_ in SingleStreamDecoder::addVideoStream(), so we no longer need to pass around the video stream options. We still give priority to the resized values, if they exist.
    • There's also what looks like other new state tracked in the decoder. That's a refactor of moving fields from StreamInfo to the decoder itself. It was a lingering TODO, which I chose to do here because I was adding more decoder state. It helps clarify that the new state I'm adding belongs in the decoder and not in StreamInfo.
  3. Review SingleStreamDecoder.cpp. Most of the changes here are a result of all of the above. Focus on SingleStreamDecoder::addVideoStream(). Note that when we want to know the output dimensions, we say resizedOutputDims_.value_or(metadataDims_). This replaces the functions we had before. (A minimal sketch of this bookkeeping follows this list.)
  4. Review DeviceInterface.h. The big change here is that initializeContext() just became initialize() with a long list of parameters. There's a lot of stuff we were passing to DeviceInterface::convertAVFrameToFrameOutput() that we now pass to initialize(). Note that we pass resizedOutputDims_ if they exist.
  5. Review CpuDeviceInterface.h and CpuDeviceInterface.cpp. Values that we used to pass into convertAVFrameToFrameOutput() are now passed to initialize(), and we just store them as state. However, because we need to be resilient to variable-resolution streams, we still defer deciding the output frame's resolution and the color conversion library until we get a raw decoded frame. The difference from before is that we pre-compute what we can in initialize(). And instead of having the video stream options, we have resizedOutputDims_ as an optional; we use that or the raw decoded frame's resolution. (This pre-compute-then-defer shape is sketched after this list.)
  6. Review CudaDeviceInterface.h and CudaDeviceInterface.cpp. This is both a consequence of all of the above refactors and a refactoring of #899 (Use CUDA filters to support 10-bit videos).
  7. Review custom_ops.cpp. Right now, we have some ugly bridge code that turns width and height parameters into a ResizeTransform (the last sketch after this list shows the idea). That will eventually go away and we will accept transforms as tensors; how we will do that is beyond the scope of this PR.
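
To make item 1 concrete, here is a rough sketch of what such a transform hierarchy could look like. The class shapes and member names are illustrative only, not the PR's actual declarations:

```cpp
#include <optional>

// Illustrative sketch only; the real Transform.h may differ.
struct FrameDims {
  int height = 0;
  int width = 0;
};

class Transform {
 public:
  virtual ~Transform() = default;

  // Most transforms keep the input resolution; resize-like transforms
  // override this to report the resolution they produce.
  virtual std::optional<FrameDims> getOutputFrameDims() const {
    return std::nullopt;
  }

  // Resizing is the only transform swscale can handle; everything else
  // requires filtergraph.
  virtual bool isResize() const {
    return false;
  }
};

class ResizeTransform : public Transform {
 public:
  explicit ResizeTransform(const FrameDims& dims) : outputDims_(dims) {}

  std::optional<FrameDims> getOutputFrameDims() const override {
    return outputDims_;
  }

  bool isResize() const override {
    return true;
  }

 private:
  FrameDims outputDims_;
};
```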
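
For items 2 and 3, the dimension bookkeeping boils down to deciding the output size once, as soon as the transforms are known, and reading that decision everywhere else. A minimal sketch, reusing the hypothetical FrameDims and Transform types above:

```cpp
#include <memory>
#include <optional>
#include <vector>

// Hypothetical helper mirroring the decision made in addVideoStream().
std::optional<FrameDims> computeResizedOutputDims(
    const std::vector<std::unique_ptr<Transform>>& transforms) {
  std::optional<FrameDims> resizedOutputDims;
  for (const auto& transform : transforms) {
    if (auto dims = transform->getOutputFrameDims()) {
      resizedOutputDims = dims;  // a resolution-changing transform fixes the size
    }
  }
  return resizedOutputDims;
}

// Then, wherever the output size is needed:
//   FrameDims outputDims = resizedOutputDims_.value_or(metadataDims_);
```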
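
For items 4 and 5, the pre-compute-then-defer logic on the CPU path might look roughly like this. Again, all names are made up and the real interface differs; the point is storing state at initialization and deferring only the per-frame decisions:

```cpp
#include <memory>
#include <optional>
#include <vector>

// Sketch of the CPU path, reusing FrameDims and Transform from the first
// sketch: store what we can at initialization time, defer only what depends
// on the actual decoded frame (variable-resolution streams).
class CpuDeviceInterfaceSketch {
 public:
  void initializeVideo(
      std::optional<FrameDims> resizedOutputDims,
      std::vector<std::unique_ptr<Transform>> transforms) {
    resizedOutputDims_ = resizedOutputDims;
    transforms_ = std::move(transforms);
    // Pre-compute what we can: swscale is only viable when the sole
    // requested transform is a resize (or there are no transforms at all).
    swScaleIsViable_ = transforms_.empty() ||
        (transforms_.size() == 1 && transforms_[0]->isResize());
  }

  void convertFrame(const FrameDims& decodedFrameDims) {
    // The final decisions wait for the raw decoded frame.
    FrameDims outputDims = resizedOutputDims_.value_or(decodedFrameDims);
    bool useSwScale = swScaleIsViable_;  // plus per-frame conditions in reality
    (void)outputDims;
    (void)useSwScale;
    // ... run swscale or filtergraph accordingly ...
  }

 private:
  std::optional<FrameDims> resizedOutputDims_;
  std::vector<std::unique_ptr<Transform>> transforms_;
  bool swScaleIsViable_ = false;
};
```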
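
And the temporary bridge from item 7 is conceptually just this (a hypothetical helper; the real op signature is not shown here):

```cpp
#include <memory>
#include <optional>
#include <vector>

// Hypothetical bridge: translate the existing width/height parameters into a
// ResizeTransform until transforms are accepted directly.
std::vector<std::unique_ptr<Transform>> makeTransformsFromDims(
    std::optional<int> width,
    std::optional<int> height) {
  std::vector<std::unique_ptr<Transform>> transforms;
  if (width.has_value() && height.has_value()) {
    transforms.push_back(
        std::make_unique<ResizeTransform>(FrameDims{*height, *width}));
  }
  return transforms;
}
```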

@meta-cla bot added the CLA Signed label on Sep 19, 2025
@scotts marked this pull request as ready for review on September 25, 2025
@scotts changed the title from "[WIP] Initial C++ implementation of transforms" to "Initial C++ implementation of transforms" on Sep 26, 2025
@MadcowD commented on Sep 26, 2025

wow, just in time for me to need it haha

@MadcowD commented on Sep 27, 2025

When do you think this might land :)?

@NicolasHug (Contributor) left a comment:

Thanks for the PR @scotts! Made a first pass (phew!!)

UniqueAVFrame& avFrame = (avFilteredFrame) ? avFilteredFrame : avInputFrame;
// All of our CUDA decoding assumes NV12 format. We handle non-NV12 formats by
// converting them to NV12.
avFrame = maybeConvertAVFrameToNV12(avFrame);
@NicolasHug (Contributor) commented:

I think what we now have is a net improvement over the logic in main, but I wonder if we could clarify the logic a bit further: IIUC maybeConvertAVFrameToNV12 doesn't only convert to NV12, it will actually convert to RGB24 (on the CPU!) in some cases. I don't claim I know how to address this because I can't fit all the code branches in my head at this time, but perhaps there's an opportunity to split the logic further into separate methods?

@scotts (Contributor, Author) replied:

The challenge is that we're branching on the content of the frame and the version of FFmpeg. The current version is as clean as I could get it. What I think is promising is finding another way to do a CUDA format conversion on FFmpeg 4. I made several attempts, but I think that should be its own effort.

@NicolasHug (Contributor) commented on Oct 2, 2025:

Agreed this isn't trivial, and I don't know how to simplify this ATM. Should we at least rename maybeConvertAVFrameToNV12 to maybeConvertAVFrameToNV12OrRGB24? When I read "maybeConvertAVFrameToNV12", I assume the behavior is that it will either:

  • convert the frame to NV12
  • not do any conversion

But this function actually does a second type of conversion: to RGB24, on the CPU. I think reflecting that in the name can help a bit.
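
For concreteness, the renamed function's contract would read roughly like the pseudologic below. This is purely illustrative, with assumed helper names, and is not the actual implementation; the real branching depends on the frame's pixel format and the FFmpeg version, as discussed above:

```cpp
// Illustrative only; the helper functions here are assumed, not real code.
UniqueAVFrame maybeConvertAVFrameToNV12OrRGB24(UniqueAVFrame avFrame) {
  if (isAlreadyNV12(avFrame)) {             // assumed helper
    return avFrame;                         // no conversion needed
  }
  if (canConvertToNV12OnDevice(avFrame)) {  // assumed helper
    return convertToNV12OnDevice(std::move(avFrame));  // stay on the GPU
  }
  // Otherwise fall back to a CPU-side conversion, producing RGB24.
  return convertToRGB24OnCpu(std::move(avFrame));
}
```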

@scotts (Contributor, Author) commented on Sep 29, 2025

@MadcowD, thanks for your interest! As you can see, this PR is still very much under review. This is also only the C++ side of the implementation for this feature. We'll have a few more follow-on PRs that bridge the Python and C++ layers and expose the APIs at the Python level. All of that will take several weeks before the feature lands on main.

@scotts (Contributor, Author) commented on Oct 1, 2025

@NicolasHug, thank you for catching the bug regarding avcodec_open2() and setting the AVCodecContext.hw_device_ctx! In order to solve it, I had to change DeviceInterface: https://github.com/meta-pytorch/torchcodec/pull/902/files#diff-5502c9dc9edfe6912c95c36ccb012f50930fd275d2c7feec96cddf9e0cdba085R33-R44.

deviceInterface_ = createDeviceInterface(device);
TORCH_CHECK(
    deviceInterface_ != nullptr,
    "Failed to create device interface. This should never happen, please report.");
@NicolasHug (Contributor) commented:

Nit: not from this PR, but do we need to create an interface for audio streams?

@scotts (Contributor, Author) replied:

No, but I personally do find it easier to maintain the invariant that deviceInterface_ is non-null after adding a stream. This is also aligned with our eventual goal of moving audio decoding into a device interface.

const AVRational& timeBase) {
timeBase_ = timeBase;
}

@NicolasHug (Contributor) commented:

General comment on my understanding of CpuDeviceInterface::initialize and CpuDeviceInterface::initializeVideo:

  • initialize (previously initializeContext) now does 2 seemingly orthogonal things:
    • it initializes the codecContext
    • it initializes the interface timeBase_ field
  • initializeVideo (new) initializes other fields of the interface.

Was there a particular reason to set timeBase_ within initialize? If possible I think it would help to separate concerns here. For example we could have one method to initialize the codecContext itself (and nothing else), and one separate method to initialize all the interface and its fields (timeBase_ and the rest). In https://github.com/meta-pytorch/torchcodec/pull/910/files#diff-e4fe2b8c629954b1dac78a122bd2b33664844f022aa0bf3abc72b5938a14f4e2R465 I did something very similar to your changes, where:

  • initializeContext is unchanged and only initializes the codecContext
  • initialize is a new method that initializes fields on the interface itself, including the timeBase_ which I needed too. It's analogous to your initializeVideo().

Here you're calling your initializeVideo() later than where I call my initialize(), but I think it would still work fine with your initialization order.

(I also noticed you moved the cuda context creation to the constructor, which I think is a good thing!)

@scotts (Contributor, Author) replied:

@NicolasHug, my rationale here is:

  • SpecificDeviceInterface::SpecificDeviceInterface(), as in the actual implementation's default constructor, should initialize as much as it can, given that it's constructed without any input.
  • DeviceInterface::initialize() is for things that are generic to decoding in general. Video and audio decoding will need whatever goes here. That's what codecContext_ and timeBase_ have in common.
  • DeviceInterface::initializeVideo() is obviously just for things specific to video. At the moment, our transforms are very video specific, although I do think there's a point where we'll want to generalize them to audio and then transforms will move to initialize().
  • DeviceInterface::initializeAudio() is something we'll eventually implement when we move audio decoding into a device interface.
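
Sketching that split with hypothetical signatures (not the PR's actual declarations), and reusing the illustrative FrameDims and Transform types from the sketches in the PR description above:

```cpp
extern "C" {
#include <libavcodec/avcodec.h>
}
#include <memory>
#include <optional>
#include <vector>

// Hypothetical shape of the interface methods discussed above.
class DeviceInterfaceSketch {
 public:
  virtual ~DeviceInterfaceSketch() = default;

  // Generic to all decoding: both video and audio need the codec context and
  // the stream's time base.
  virtual void initialize(AVCodecContext* codecContext, AVRational timeBase) {
    codecContext_ = codecContext;
    timeBase_ = timeBase;
  }

  // Video-only concerns: transforms and the optional resized output size.
  virtual void initializeVideo(
      std::optional<FrameDims> resizedOutputDims,
      std::vector<std::unique_ptr<Transform>> transforms) = 0;

  // Eventually: initializeAudio(), once audio decoding moves into a device
  // interface.

 protected:
  AVCodecContext* codecContext_ = nullptr;
  AVRational timeBase_{};
};
```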

@scotts (Contributor, Author) added:

For posterity, @NicolasHug's point is about direction of state change. We want to differentiate between the device interface being initialized and the device interface changing the state of something else.

@NicolasHug (Contributor) commented:

Thanks for the great PR @scotts! I made a follow-up comment above about the interface initialization methods, but I'll approve to unblock. There's also a minor renaming suggestion in #902 (comment) which I'm flagging here since it may be missed among the many comments.

@scotts merged commit 401901e into meta-pytorch:main on Oct 3, 2025
50 checks passed
@scotts deleted the transform_core branch on October 3, 2025