@scotts (Contributor) commented on Sep 19, 2025

Initial implementation of C++ transforms. This PR makes no changes to any Python APIs.

There is a major refactoring of everything related to frame dimensions and user-provided requests; it was necessary to reconcile those with the concept of user-provided transforms. Throughout, we apply the principle: figure something out as soon as possible, store that decision in a variable, and then only reference that variable to know what to do. We apply this principle to a frame's output dimensions and to which color conversion library to use.

Suggested way to review this PR:

  1. First read Transform.h and Transform.cpp. This introduces the concept of a transform on the C++ side. We will eventually have a whole family of Transform subclasses. Right now, we only have one: ResizeTransform. Resizing is special in two respects: it can change the output resolution, and it's the only transform that swscale can handle. There will be other transforms that change the output resolution, such as crop. But we need to specially handle ResizeTransform so that we can use swscale when it's the only transform requested. (A rough sketch of the transform hierarchy follows this list.)
  2. Review SingleStreamDecoder.h.
    • We track new state in the decoder. Notably, we maintain a vector of transforms, an optional resizedOutputDims_ holding the output dimensions of a resize if one exists, and metadataDims_, the dimensions from the stream metadata. This is a major refactor. We used to pass around the video stream options and constantly refer to them to figure out what the output frame's dimensions should be. The video stream options no longer specify width and height; that now comes exclusively from the presence of a ResizeTransform. We set both resizedOutputDims_ and metadataDims_ in SingleStreamDecoder::addVideoStream(), so we no longer need to pass around the video stream options. We still give priority to the resized values, if they exist.
    • There's also what looks like other new state tracked in the decoder. That's a refactor of moving fields from StreamInfo to the decoder itself. It was a lingering TODO, which I chose to do here because I was adding more decoder state. It helps clarify that the new state I'm adding belongs in the decoder and not in StreamInfo.
  3. Review SingleStreamDecoder.cpp. Most of the changes here are a result of all of the above. Focus on SingleStreamDecoder::addVideoStream(). Note that when we want to know the output dimensions, we say resizedOutputDims_.value_or(metadataDims_). This replaces the functions we had before. (A minimal sketch of this bookkeeping follows this list.)
  4. Review DeviceInterface.h. The big change here is that initializeContext() just became initialize() with a long list of parameters. There's a lot of stuff we were passing to DeviceInterface::convertAVFrameToFrameOutput() that we now pass to initialize(). Note that we pass resizedOutputDims_ if they exist.
  5. Review CpuDeviceInterface.h and CpuDeviceInterface.cpp. Values that we used to pass into convertAVFrameToFrameOutput() are now passed to initialize(), and we just store them as state. However, because we need to be resilient to variable-resolution streams, we still defer deciding the output frame's resolution and the color conversion library until we get a raw decoded frame. The difference from before is that we pre-compute what we can in initialize(). And instead of having the video stream options, we have resizedOutputDims_ as an optional; we use that or the raw decoded frame's resolution. (This pre-compute-then-defer shape is sketched after this list.)
  6. Review CudaDeviceInterface.h and CudaDeviceInterface.cpp. This is both a consequence of all of the above refactors and a refactoring of #899 (Use CUDA filters to support 10-bit videos).
  7. Review custom_ops.cpp. Right now, we have some ugly bridge code that turns width and height parameters into a ResizeTransform (the last sketch after this list shows the idea). That will eventually go away and we will accept transforms as tensors; how we will do that is beyond the scope of this PR.
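
To make item 1 concrete, here is a rough sketch of what such a transform hierarchy could look like. The class shapes and member names are illustrative only, not the PR's actual declarations:

```cpp
#include <optional>

// Illustrative sketch only; the real Transform.h may differ.
struct FrameDims {
  int height = 0;
  int width = 0;
};

class Transform {
 public:
  virtual ~Transform() = default;

  // Most transforms keep the input resolution; resize-like transforms
  // override this to report the resolution they produce.
  virtual std::optional<FrameDims> getOutputFrameDims() const {
    return std::nullopt;
  }

  // Resizing is the only transform swscale can handle; everything else
  // requires filtergraph.
  virtual bool isResize() const {
    return false;
  }
};

class ResizeTransform : public Transform {
 public:
  explicit ResizeTransform(const FrameDims& dims) : outputDims_(dims) {}

  std::optional<FrameDims> getOutputFrameDims() const override {
    return outputDims_;
  }

  bool isResize() const override {
    return true;
  }

 private:
  FrameDims outputDims_;
};
```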
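
For items 2 and 3, the dimension bookkeeping boils down to deciding the output size once, as soon as the transforms are known, and reading that decision everywhere else. A minimal sketch, reusing the hypothetical FrameDims and Transform types above:

```cpp
#include <memory>
#include <optional>
#include <vector>

// Hypothetical helper mirroring the decision made in addVideoStream().
std::optional<FrameDims> computeResizedOutputDims(
    const std::vector<std::unique_ptr<Transform>>& transforms) {
  std::optional<FrameDims> resizedOutputDims;
  for (const auto& transform : transforms) {
    if (auto dims = transform->getOutputFrameDims()) {
      resizedOutputDims = dims;  // a resolution-changing transform fixes the size
    }
  }
  return resizedOutputDims;
}

// Then, wherever the output size is needed:
//   FrameDims outputDims = resizedOutputDims_.value_or(metadataDims_);
```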
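
For items 4 and 5, the pre-compute-then-defer logic on the CPU path might look roughly like this. Again, all names are made up and the real interface differs; the point is storing state at initialization and deferring only the per-frame decisions:

```cpp
#include <memory>
#include <optional>
#include <vector>

// Sketch of the CPU path, reusing FrameDims and Transform from the first
// sketch: store what we can at initialization time, defer only what depends
// on the actual decoded frame (variable-resolution streams).
class CpuDeviceInterfaceSketch {
 public:
  void initializeVideo(
      std::optional<FrameDims> resizedOutputDims,
      std::vector<std::unique_ptr<Transform>> transforms) {
    resizedOutputDims_ = resizedOutputDims;
    transforms_ = std::move(transforms);
    // Pre-compute what we can: swscale is only viable when the sole
    // requested transform is a resize (or there are no transforms at all).
    swScaleIsViable_ = transforms_.empty() ||
        (transforms_.size() == 1 && transforms_[0]->isResize());
  }

  void convertFrame(const FrameDims& decodedFrameDims) {
    // The final decisions wait for the raw decoded frame.
    FrameDims outputDims = resizedOutputDims_.value_or(decodedFrameDims);
    bool useSwScale = swScaleIsViable_;  // plus per-frame conditions in reality
    (void)outputDims;
    (void)useSwScale;
    // ... run swscale or filtergraph accordingly ...
  }

 private:
  std::optional<FrameDims> resizedOutputDims_;
  std::vector<std::unique_ptr<Transform>> transforms_;
  bool swScaleIsViable_ = false;
};
```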
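
And the temporary bridge from item 7 is conceptually just this (a hypothetical helper; the real op signature is not shown here):

```cpp
#include <memory>
#include <optional>
#include <vector>

// Hypothetical bridge: translate the existing width/height parameters into a
// ResizeTransform until transforms are accepted directly.
std::vector<std::unique_ptr<Transform>> makeTransformsFromDims(
    std::optional<int> width,
    std::optional<int> height) {
  std::vector<std::unique_ptr<Transform>> transforms;
  if (width.has_value() && height.has_value()) {
    transforms.push_back(
        std::make_unique<ResizeTransform>(FrameDims{*height, *width}));
  }
  return transforms;
}
```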

@meta-cla bot added the CLA Signed label on Sep 19, 2025
@scotts marked this pull request as ready for review on September 25, 2025
@scotts changed the title from "[WIP] Initial C++ implementation of transforms" to "Initial C++ implementation of transforms" on Sep 26, 2025
@MadcowD commented on Sep 26, 2025

wow, just in time for me to need it haha

@MadcowD commented on Sep 27, 2025

When do you think this might land :)?

@NicolasHug (Contributor) left a comment:

Thanks for the PR @scotts! Made a first pass (phew!!)

UniqueAVFrame& avFrame = (avFilteredFrame) ? avFilteredFrame : avInputFrame;
// All of our CUDA decoding assumes NV12 format. We handle non-NV12 formats by
// converting them to NV12.
avFrame = maybeConvertAVFrameToNV12(avFrame);
@NicolasHug (Contributor) commented:

I think what we now have is a net improvement over the logic in main, but I wonder if we could clarify the logic a bit further: IIUC maybeConvertAVFrameToNV12 doesn't only convert to NV12, it will actually convert to RGB24 (on the CPU!) in some cases. I don't claim I know how to address this because I can't fit all the code branches in my head at this time, but perhaps there's an opportunity to split the logic further into separate methods?

@scotts (Contributor, Author) replied:

The challenge is that we're branching on the content of the frame and the version of FFmpeg. The current version is as clean as I could get it. What I think is promising is finding another way to do a CUDA format conversion on FFmpeg 4. I made several attempts, but I think that should be its own effort.

@NicolasHug (Contributor) commented on Oct 2, 2025:

Agreed this isn't trivial, and I don't know how to simplify this ATM. Should we at least rename maybeConvertAVFrameToNV12 to maybeConvertAVFrameToNV12OrRGB24? When I read "maybeConvertAVFrameToNV12", I assume the behavior is that it will either:

  • convert the frame to NV12
  • not do any conversion

But this function actually does a second type of conversion: to RGB24, on the CPU. I think reflecting that in the name can help a bit.
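
For concreteness, the renamed function's contract would read roughly like the pseudologic below. This is purely illustrative, with assumed helper names, and is not the actual implementation; the real branching depends on the frame's pixel format and the FFmpeg version, as discussed above:

```cpp
// Illustrative only; the helper functions here are assumed, not real code.
UniqueAVFrame maybeConvertAVFrameToNV12OrRGB24(UniqueAVFrame avFrame) {
  if (isAlreadyNV12(avFrame)) {             // assumed helper
    return avFrame;                         // no conversion needed
  }
  if (canConvertToNV12OnDevice(avFrame)) {  // assumed helper
    return convertToNV12OnDevice(std::move(avFrame));  // stay on the GPU
  }
  // Otherwise fall back to a CPU-side conversion, producing RGB24.
  return convertToRGB24OnCpu(std::move(avFrame));
}
```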

@scotts (Contributor, Author) commented on Sep 29, 2025

@MadcowD, thanks for your interest! As you can see, this PR is still very much under review. This is also only the C++ side of the implementation for this feature. We'll have a few more follow-on PRs that bridge the Python and C++ layers and expose the APIs at the Python level. All of that will take several weeks before the feature lands on main.

@scotts (Contributor, Author) commented on Oct 1, 2025

@NicolasHug, thank you for catching the bug regarding avcodec_open2() and setting the AVCodecContext.hw_device_ctx! In order to solve it, I had to change DeviceInterface: https://github.com/meta-pytorch/torchcodec/pull/902/files#diff-5502c9dc9edfe6912c95c36ccb012f50930fd275d2c7feec96cddf9e0cdba085R33-R44.

deviceInterface_ = createDeviceInterface(device);
TORCH_CHECK(
    deviceInterface_ != nullptr,
    "Failed to create device interface. This should never happen, please report.");
@NicolasHug (Contributor) commented:

Nit: not from this PR, but do we need to create an interface for audio streams?

@scotts (Contributor, Author) replied:

No, but I personally do find it easier to maintain the invariant that deviceInterface_ is non-null after adding a stream. This is also aligned with our eventual goal of moving audio decoding into a device interface.

const AVRational& timeBase) {
timeBase_ = timeBase;
}

@NicolasHug (Contributor) commented:

General comment on my understanding of CpuDeviceInterface::initialize and CpuDeviceInterface::initializeVideo:

  • initialize (previously initializeContext) now does 2 seemingly orthogonal things:
    • it initializes the codecContext
    • it initializes the interface timeBase_ field
  • initializeVideo (new) initializes other fields of the interface.

Was there a particular reason to set timeBase_ within initialize? If possible I think it would help to separate concerns here. For example we could have one method to initialize the codecContext itself (and nothing else), and one separate method to initialize all the interface and its fields (timeBase_ and the rest). In https://github.com/meta-pytorch/torchcodec/pull/910/files#diff-e4fe2b8c629954b1dac78a122bd2b33664844f022aa0bf3abc72b5938a14f4e2R465 I did something very similar to your changes, where:

  • initializeContext is unchanged and only initializes the codecContext
  • initialize is a new method that initializes fields on the interface itself, including the timeBase_ which I needed too. It's analogous to your initializeVideo().

Here you're calling your initializeVideo() later than where I call my initialize(), but I think it would still work fine with your initialization order.

(I also noticed you moved the cuda context creation to the constructor, which I think is a good thing!)

@scotts (Contributor, Author) replied:

@NicolasHug, my rationale here is:

  • SpecificDeviceInterface::SpecificDeviceInterface(), as in the actual implementation's default constructor, should initialize as much as it can, given that it's constructed without any input.
  • DeviceInterface::initialize() is for things that are generic to decoding in general. Video and audio decoding will need whatever goes here. That's what codecContext_ and timeBase_ have in common.
  • DeviceInterface::initializeVideo() is obviously just for things specific to video. At the moment, our transforms are very video specific, although I do think there's a point where we'll want to generalize them to audio and then transforms will move to initialize().
  • DeviceInterface::initializeAudio() is something we'll eventually implement when we move audio decoding into a device interface.
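
Sketching that split with hypothetical signatures (not the PR's actual declarations), and reusing the illustrative FrameDims and Transform types from the sketches in the PR description above:

```cpp
extern "C" {
#include <libavcodec/avcodec.h>
}
#include <memory>
#include <optional>
#include <vector>

// Hypothetical shape of the interface methods discussed above.
class DeviceInterfaceSketch {
 public:
  virtual ~DeviceInterfaceSketch() = default;

  // Generic to all decoding: both video and audio need the codec context and
  // the stream's time base.
  virtual void initialize(AVCodecContext* codecContext, AVRational timeBase) {
    codecContext_ = codecContext;
    timeBase_ = timeBase;
  }

  // Video-only concerns: transforms and the optional resized output size.
  virtual void initializeVideo(
      std::optional<FrameDims> resizedOutputDims,
      std::vector<std::unique_ptr<Transform>> transforms) = 0;

  // Eventually: initializeAudio(), once audio decoding moves into a device
  // interface.

 protected:
  AVCodecContext* codecContext_ = nullptr;
  AVRational timeBase_{};
};
```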

@scotts (Contributor, Author) added:

For posterity, @NicolasHug's point is about direction of state change. We want to differentiate between the device interface being initialized and the device interface changing the state of something else.

@NicolasHug (Contributor) commented:

Thanks for the great PR @scotts! I made a follow-up comment above about the interface initialization methods, but I'll approve to unblock. There's also a minor renaming suggestion in #902 (comment) which I'm flagging here since it may be missed among the many comments.

@scotts merged commit 401901e into meta-pytorch:main on Oct 3, 2025
50 checks passed
@scotts deleted the transform_core branch on October 3, 2025