From 567bdc5cbff56516a68989bfb9f806c1353390ea Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:10:57 -0700 Subject: [PATCH 01/13] Initial design draft --- decoder_native_transforms.md | 99 ++++++++++++++++++++++++++++++++++++ 1 file changed, 99 insertions(+) create mode 100644 decoder_native_transforms.md diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md new file mode 100644 index 000000000..161553cfd --- /dev/null +++ b/decoder_native_transforms.md @@ -0,0 +1,99 @@ +We want to support this user-facing API: + + decoder = VideoDecoder( + "vid.mp4", + transforms=[ + torchcodec.transforms.FPS( + fps=30, + ), + torchvision.transforms.v2.Resize( + width=640, + height=480, + ), + torchvision.transforms.v2.RandomCrop( + width=32, + height=32, + ), + ] + ) + +What the user is asking for, in English: + + 1. I want to decode frames from the file "vid.mp4". + 2. For each decoded frame, I want each frame to pass through the following + transforms: + a. Add or remove frames as necessary to ensure a constant 30 frames + per second. + b. Resize the frame to 640x480. Use the algorithm that is + TorchVision's default. + c. Inside the resized frame, crop the image to 32x32. The x and y + coordinates are chosen randomly upon the creation of the Python + VideoDecoder object. All decoded frames use the same values for x + and y. + +These three transforms are instructive, as they force us to consider: + + 1. How "easy" TorchVision transforms will be handled, where all values are + static. Resize is such an example. + 2. Transforms that involve randomness. The main question we need to resolve + is when the random value is resolved. I think this comes down to: once + upon Python VideoDecoder creation, or different for each frame decoded? + I made the call above that it should be once upon Python VideoDecoder + creation, but we need to make sure that lines up with what we think + users will want. + 3. Transforms that are supported by FFmpeg but not supported by + TorchVision. In particular, FPS is something that multiple users have + asked for. + +First let's consider implementing the "easy" case of Resize. + + 1. We add an optional `transforms` parameter to the initialization of + VideoDecoder. It is a sequence of TorchVision Transforms. + 2. During VideoDecoder object creation, we walk the list, capturing two + pieces of information: + a. The transform name that the C++ layer will understand. (We will + have to decide if we want to just use the FFmpeg filter name + here, the fully resolved Transform name, or introduce a new + naming layer.) + b. The parameters in a format that the C++ layer will understand. We + obtain them by calling `make_params()` on the Transform object. + 3. We add an optional transforms parameter to core.add_video_stream(). This + parameter will be a vector, but whether the vector contains strings, + tensors, or some combination of them is TBD. + 4. The custom_ops.cpp and pybind_ops.cpp layer is responsible for turning + the values passed from the Python layer into transform objects that the + C++ layer knows about. We will have one class per transform we support. + Each class will have: + a. A name which matches the FFmpeg filter name. + b. One member for each supported parameter. + c. A virtual member function that knows how to produce a string that + can be passed to FFmpeg's filtergraph. + 5. We add a vector of such transforms to + SingleStreamDecoder::addVideoStream. We store the vector as a field in + SingleStreamDecoder. + 6. 
We need to reconcile FilterGraph, FiltersContext and this vector of
+      transforms. They are all related, but it's not clear to me what the
+      exact relationship should be.
+   7. The actual string we pass to FFmpeg's filtergraph comes from calling
+      the virtual member function on each transform object.
+
+For the transforms that do not exist in TorchVision, we can build on the above:
+
+   1. We define a new module, torchcodec.decoders.transforms.
+   2. All transforms we define in there inherit from
+      torchvision.transforms.v2.Transform.
+   3. We implement the minimum needed to hook the new transforms into the
+      machinery defined above.
+
+Open questions:
+
+   1. Is torchcodec.transforms the right namespace?
+   2. For random transforms, when should the value be fixed?
+   3. Transforms such as Resize don't actually implement a make_params()
+      method. How does TorchVision get their parameters? How will TorchCodec?
+   4. How do we communicate the transformation names and parameters to the C++
+      layer? We need to support transforms with an arbitrary number of parameters.
+   5. How does this generalize to AudioDecoder? Ideally we would be able to
+      support TorchAudio's transforms in a similar way.
+   6. What is the relationship between the C++ transform objects, FilterGraph
+      and FiltersContext?

From a9e818296d053aecab2e3b0cfd0fc0f594a10163 Mon Sep 17 00:00:00 2001
From: Scott Schneider
Date: Mon, 8 Sep 2025 12:14:54 -0700
Subject: [PATCH 02/13] Formatting

---
 decoder_native_transforms.md | 38 +++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md
index 161553cfd..ac8675c32 100644
--- a/decoder_native_transforms.md
+++ b/decoder_native_transforms.md
@@ -1,5 +1,6 @@
 We want to support this user-facing API:
-
+
+```python
     decoder = VideoDecoder(
         "vid.mp4",
         transforms=[
             torchcodec.transforms.FPS(
                 fps=30,
             ),
             torchvision.transforms.v2.Resize(
                 width=640,
-                height=480, 
+                height=480,
             ),
             torchvision.transforms.v2.RandomCrop(
                 width=32,
                 height=32,
             ),
         ]
     )
+```

 What the user is asking for, in English:

- 1. I want to decode frames from the file "vid.mp4".
+ 1. I want to decode frames from the file `"vid.mp4".`
  2. For each decoded frame, I want each frame to pass through the following
     transforms:
      a. Add or remove frames as necessary to ensure a constant 30 frames
         per second.
      b. Resize the frame to 640x480. Use the algorithm that is
         TorchVision's default.
      c. Inside the resized frame, crop the image to 32x32. The x and y
         coordinates are chosen randomly upon the creation of the Python
         VideoDecoder object. All decoded frames use the same values for x
         and y.

 These three transforms are instructive, as they force us to consider:

  1. How "easy" TorchVision transforms will be handled, where all values are
     static. Resize is such an example.
  2. Transforms that involve randomness. The main question we need to resolve
     is when the random value is resolved. I think this comes down to: once
     upon Python VideoDecoder creation, or different for each frame decoded?
-    I made the call above that it should be once upon Python VideoDecoder
+    I made the call above that it should be once upon Python `VideoDecoder`
     creation, but we need to make sure that lines up with what we think
     users will want.
  3. Transforms that are supported by FFmpeg but not supported by
     TorchVision. In particular, FPS is something that multiple users have
     asked for.

 First let's consider implementing the "easy" case of Resize.

  1. We add an optional `transforms` parameter to the initialization of
-    VideoDecoder. It is a sequence of TorchVision Transforms.
+    `VideoDecoder`. It is a sequence of TorchVision Transforms.
  2. During VideoDecoder object creation, we walk the list, capturing two
     pieces of information:
      a. The transform name that the C++ layer will understand. (We will
         have to decide if we want to just use the FFmpeg filter name
         here, the fully resolved Transform name, or introduce a new
         naming layer.)
      b. The parameters in a format that the C++ layer will understand. We
         obtain them by calling `make_params()` on the Transform object.
- 3. We add an optional transforms parameter to core.add_video_stream(). This
+ 3. We add an optional transforms parameter to `core.add_video_stream()`. This
     parameter will be a vector, but whether the vector contains strings,
     tensors, or some combination of them is TBD.
- 4. The custom_ops.cpp and pybind_ops.cpp layer is responsible for turning
+ 4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for turning
     the values passed from the Python layer into transform objects that the
     C++ layer knows about. We will have one class per transform we support.
     Each class will have:
      a. A name which matches the FFmpeg filter name.
      b. One member for each supported parameter.
      c. A virtual member function that knows how to produce a string that
         can be passed to FFmpeg's filtergraph.
  5. We add a vector of such transforms to
-    SingleStreamDecoder::addVideoStream. We store the vector as a field in
-    SingleStreamDecoder.
- 6. We need to reconcile FilterGraph, FiltersContext and this vector of
+    `SingleStreamDecoder::addVideoStream`. We store the vector as a field in
+    `SingleStreamDecoder`.
+ 6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
     transforms. They are all related, but it's not clear to me what the
     exact relationship should be.
  7. The actual string we pass to FFmpeg's filtergraph comes from calling
     the virtual member function on each transform object.

 For the transforms that do not exist in TorchVision, we can build on the above:

- 1. We define a new module, torchcodec.decoders.transforms.
+ 1. We define a new module, `torchcodec.decoders.transforms`.
  2. All transforms we define in there inherit from
-    torchvision.transforms.v2.Transform.
+    `torchvision.transforms.v2.Transform`.
  3. We implement the minimum needed to hook the new transforms into the
     machinery defined above.

 Open questions:

- 1. Is torchcodec.transforms the right namespace?
+ 1. Is `torchcodec.transforms` the right namespace?
  2. For random transforms, when should the value be fixed?
- 3. Transforms such as Resize don't actually implement a make_params()
+ 3. Transforms such as Resize don't actually implement a `make_params()`
     method. How does TorchVision get their parameters? How will TorchCodec?
  4. How do we communicate the transformation names and parameters to the C++
     layer? We need to support transforms with an arbitrary number of parameters.
- 5. How does this generalize to AudioDecoder? Ideally we would be able to
+ 5. How does this generalize to `AudioDecoder`? Ideally we would be able to
     support TorchAudio's transforms in a similar way.
- 6. What is the relationship between the C++ transform objects, FilterGraph
-    and FiltersContext?
+ 6. What is the relationship between the C++ transform objects, `FilterGraph`
+    and `FiltersContext`?
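To make step 2 of the sketch above concrete, here is a rough illustration of the Python-side walk. It is a sketch only: the helper name `_convert_transforms`, the `(name, params)` pair format, and the use of FFmpeg filter names as the bridge vocabulary are assumptions, not settled API.

```python
from typing import Any, Sequence

import torchvision.transforms.v2 as v2


def _convert_transforms(
    transforms: Sequence[v2.Transform],
) -> list[tuple[str, dict[str, Any]]]:
    # Sketch: map each TorchVision transform to a (name, params) pair
    # that core.add_video_stream() could forward to the C++ layer.
    converted = []
    for transform in transforms:
        if isinstance(transform, v2.Resize):
            # TorchVision sizes are (height, width); assumes size is a
            # 2-sequence rather than a bare int.
            height, width = transform.size
            converted.append(("scale", {"width": width, "height": height}))
        elif isinstance(transform, v2.RandomCrop):
            # Only the output size comes from the transform itself; the
            # random x/y would be chosen here, once, at decoder creation.
            height, width = transform.size
            converted.append(("crop", {"width": width, "height": height}))
        else:
            raise ValueError(f"Unsupported transform: {transform}")
    return converted
```

Note that this sidesteps `make_params()` entirely for these two transforms, since their static parameters live directly on the objects; whether that holds up in general is exactly open question 3.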
From 449f5008336d7ffe36ed16d813132111477ce3b7 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:18:51 -0700 Subject: [PATCH 03/13] List formatting --- decoder_native_transforms.md | 132 +++++++++++++++++------------------ 1 file changed, 66 insertions(+), 66 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index ac8675c32..55f3aa28f 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -21,81 +21,81 @@ We want to support this user-facing API: What the user is asking for, in English: - 1. I want to decode frames from the file `"vid.mp4".` - 2. For each decoded frame, I want each frame to pass through the following - transforms: - a. Add or remove frames as necessary to ensure a constant 30 frames - per second. - b. Resize the frame to 640x480. Use the algorithm that is - TorchVision's default. - c. Inside the resized frame, crop the image to 32x32. The x and y - coordinates are chosen randomly upon the creation of the Python - VideoDecoder object. All decoded frames use the same values for x - and y. +1. I want to decode frames from the file `"vid.mp4".` +2. For each decoded frame, I want each frame to pass through the following + transforms: + a. Add or remove frames as necessary to ensure a constant 30 frames + per second. + b. Resize the frame to 640x480. Use the algorithm that is + TorchVision's default. + c. Inside the resized frame, crop the image to 32x32. The x and y + coordinates are chosen randomly upon the creation of the Python + VideoDecoder object. All decoded frames use the same values for x + and y. These three transforms are instructive, as they force us to consider: - 1. How "easy" TorchVision transforms will be handled, where all values are - static. Resize is such an example. - 2. Transforms that involve randomness. The main question we need to resolve - is when the random value is resolved. I think this comes down to: once - upon Python VideoDecoder creation, or different for each frame decoded? - I made the call above that it should be once upon Python `VideoDecoder` - creation, but we need to make sure that lines up with what we think - users will want. - 3. Transforms that are supported by FFmpeg but not supported by - TorchVision. In particular, FPS is something that multiple users have - asked for. +1. How "easy" TorchVision transforms will be handled, where all values are + static. Resize is such an example. +2. Transforms that involve randomness. The main question we need to resolve + is when the random value is resolved. I think this comes down to: once + upon Python VideoDecoder creation, or different for each frame decoded? + I made the call above that it should be once upon Python `VideoDecoder` + creation, but we need to make sure that lines up with what we think + users will want. +3. Transforms that are supported by FFmpeg but not supported by + TorchVision. In particular, FPS is something that multiple users have + asked for. First let's consider implementing the "easy" case of Resize. - 1. We add an optional `transforms` parameter to the initialization of - `VideoDecoder`. It is a sequence of TorchVision Transforms. - 2. During VideoDecoder object creation, we walk the list, capturing two - pieces of information: - a. The transform name that the C++ layer will understand. (We will - have to decide if we want to just use the FFmpeg filter name - here, the fully resolved Transform name, or introduce a new - naming layer.) - b. 
The parameters in a format that the C++ layer will understand. We
      obtain them by calling `make_params()` on the Transform object.
3. We add an optional transforms parameter to `core.add_video_stream()`. This
   parameter will be a vector, but whether the vector contains strings,
   tensors, or some combination of them is TBD.
4. The `custom_ops.cpp` and `pybind_ops.cpp` layer is responsible for turning
   the values passed from the Python layer into transform objects that the
   C++ layer knows about. We will have one class per transform we support.
   Each class will have:
   a. A name which matches the FFmpeg filter name.
   b. One member for each supported parameter.
   c. A virtual member function that knows how to produce a string that
      can be passed to FFmpeg's filtergraph.
5. We add a vector of such transforms to
   `SingleStreamDecoder::addVideoStream`. We store the vector as a field in
   `SingleStreamDecoder`.
6. We need to reconcile `FilterGraph`, `FiltersContext` and this vector of
   transforms. They are all related, but it's not clear to me what the
   exact relationship should be.
7. The actual string we pass to FFmpeg's filtergraph comes from calling
   the virtual member function on each transform object.

For the transforms that do not exist in TorchVision, we can build on the above:

1. We define a new module, `torchcodec.decoders.transforms`.
2. All transforms we define in there inherit from
   `torchvision.transforms.v2.Transform`.
3. We implement the minimum needed to hook the new transforms into the
   machinery defined above.

Open questions:

1. 
Is `torchcodec.transforms` the right namespace? - 2. For random transforms, when should the value be fixed? - 3. Transforms such as Resize don't actually implement a `make_params()` - method. How does TorchVision get their parameters? How will TorchCodec? - 4. How do we communicate the transformation names and parameters to the C++ - layer? We need to support transforms with an arbitrary number of parameters. - 5. How does this generalize to `AudioDecoder`? Ideally we would be able to - support TorchAudio's transforms in a similar way. - 6. What is the relationship between the C++ transform objects, `FilterGraph` - and `FiltersContext`? +1. Is `torchcodec.transforms` the right namespace? +2. For random transforms, when should the value be fixed? +3. Transforms such as Resize don't actually implement a `make_params()` + method. How does TorchVision get their parameters? How will TorchCodec? +4. How do we communicate the transformation names and parameters to the C++ + layer? We need to support transforms with an arbitrary number of parameters. +5. How does this generalize to `AudioDecoder`? Ideally we would be able to + support TorchAudio's transforms in a similar way. +6. What is the relationship between the C++ transform objects, `FilterGraph` + and `FiltersContext`? From 6aea38d68803c7f9c155243cbdc6b9eeb7f7ff8f Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:22:14 -0700 Subject: [PATCH 04/13] More list formatting --- decoder_native_transforms.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 55f3aa28f..efb82705e 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -22,16 +22,10 @@ We want to support this user-facing API: What the user is asking for, in English: 1. I want to decode frames from the file `"vid.mp4".` -2. For each decoded frame, I want each frame to pass through the following - transforms: - a. Add or remove frames as necessary to ensure a constant 30 frames - per second. - b. Resize the frame to 640x480. Use the algorithm that is - TorchVision's default. - c. Inside the resized frame, crop the image to 32x32. The x and y - coordinates are chosen randomly upon the creation of the Python - VideoDecoder object. All decoded frames use the same values for x - and y. +2. For each decoded frame, I want each frame to pass through the following transforms: + a. Add or remove frames as necessary to ensure a constant 30 frames per second. + b. Resize the frame to 640x480. Use the algorithm that is TorchVision's default. + c. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python VideoDecoder object. All decoded frames use the same values for x and y. These three transforms are instructive, as they force us to consider: From 65e258e64fe17326544debafe24caa7540a78fcc Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:23:30 -0700 Subject: [PATCH 05/13] More more list formatting --- decoder_native_transforms.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index efb82705e..11739d2c5 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -23,9 +23,9 @@ What the user is asking for, in English: 1. I want to decode frames from the file `"vid.mp4".` 2. For each decoded frame, I want each frame to pass through the following transforms: - a. 
Add or remove frames as necessary to ensure a constant 30 frames per second. - b. Resize the frame to 640x480. Use the algorithm that is TorchVision's default. - c. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python VideoDecoder object. All decoded frames use the same values for x and y. + 1. Add or remove frames as necessary to ensure a constant 30 frames per second. + 2. Resize the frame to 640x480. Use the algorithm that is TorchVision's default. + 3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python VideoDecoder object. All decoded frames use the same values for x and y. These three transforms are instructive, as they force us to consider: From 20898407b1dba98eb35a8c259cffbd6799be54f5 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:24:57 -0700 Subject: [PATCH 06/13] Almost there --- decoder_native_transforms.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 11739d2c5..6c43bd8d7 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -47,12 +47,12 @@ First let's consider implementing the "easy" case of Resize. `VideoDecoder`. It is a sequence of TorchVision Transforms. 2. During VideoDecoder object creation, we walk the list, capturing two pieces of information: - a. The transform name that the C++ layer will understand. (We will - have to decide if we want to just use the FFmpeg filter name - here, the fully resolved Transform name, or introduce a new - naming layer.) - b. The parameters in a format that the C++ layer will understand. We - obtain them by calling `make_params()` on the Transform object. + 1. The transform name that the C++ layer will understand. (We will + have to decide if we want to just use the FFmpeg filter name + here, the fully resolved Transform name, or introduce a new + naming layer.) + 2. The parameters in a format that the C++ layer will understand. We + obtain them by calling `make_params()` on the Transform object. 3. We add an optional transforms parameter to `core.add_video_stream()`. This parameter will be a vector, but whether the vector contains strings, tensors, or some combination of them is TBD. @@ -60,10 +60,10 @@ First let's consider implementing the "easy" case of Resize. the values passed from the Python layer into transform objects that the C++ layer knows about. We will have one class per transform we support. Each class will have: - a. A name which matches the FFmpeg filter name. - b. One member for each supported parameter. - c. A virtual member function that knows how to produce a string that - can be passed to FFmpeg's filtergraph. + 1. A name which matches the FFmpeg filter name. + 2. One member for each supported parameter. + 3. A virtual member function that knows how to produce a string that + can be passed to FFmpeg's filtergraph. 5. We add a vector of such transforms to `SingleStreamDecoder::addVideoStream`. We store the vector as a field in `SingleStreamDecoder`. 
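To make items 4 and 7 of the implementation sketch concrete: for the example pipeline at the top of the document, the per-transform strings would need to collapse into a single filtergraph description. A sketch of what that could look like, with assumptions flagged inline:

```python
# Sketch: the string ultimately handed to FFmpeg's filtergraph. Each
# element would come from one C++ transform object's virtual
# string-producing member function.
filters = [
    "fps=30",                        # torchcodec FPS(fps=30)
    "scale=640:480:flags=bilinear",  # Resize: TorchVision's default is
                                     # bilinear, but FFmpeg's scale
                                     # defaults to bicubic, so the flag
                                     # must be explicit
    "crop=32:32:48:112",             # RandomCrop as crop=w:h:x:y; the
                                     # x/y values 48, 112 are made-up
                                     # stand-ins for the frozen random
                                     # coordinates
]
filtergraph = ",".join(filters)
# "fps=30,scale=640:480:flags=bilinear,crop=32:32:48:112"
```

The bilinear-versus-bicubic mismatch is worth noting: "use TorchVision's default" cannot mean "use FFmpeg's default", so each transform object needs to spell out its filter options explicitly.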
From 130b7e037eeea7e89518406c25211ef2cb746dd1 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:27:26 -0700 Subject: [PATCH 07/13] Formatting --- decoder_native_transforms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 6c43bd8d7..7439946de 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -33,7 +33,7 @@ These three transforms are instructive, as they force us to consider: static. Resize is such an example. 2. Transforms that involve randomness. The main question we need to resolve is when the random value is resolved. I think this comes down to: once - upon Python VideoDecoder creation, or different for each frame decoded? + upon Python `VideoDecoder` creation, or different for each frame decoded? I made the call above that it should be once upon Python `VideoDecoder` creation, but we need to make sure that lines up with what we think users will want. From 07f1d730c255ebd39df0ad25679fbd441126589b Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:29:36 -0700 Subject: [PATCH 08/13] Formatting --- decoder_native_transforms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 7439946de..ab7121d9a 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -25,7 +25,7 @@ What the user is asking for, in English: 2. For each decoded frame, I want each frame to pass through the following transforms: 1. Add or remove frames as necessary to ensure a constant 30 frames per second. 2. Resize the frame to 640x480. Use the algorithm that is TorchVision's default. - 3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python VideoDecoder object. All decoded frames use the same values for x and y. + 3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python `VideoDecoder` object. All decoded frames use the same values for x and y. These three transforms are instructive, as they force us to consider: From 90f41ed497cb84b200e62f930812d78df3be4f64 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:34:24 -0700 Subject: [PATCH 09/13] Another open question --- decoder_native_transforms.md | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index ab7121d9a..87016eb21 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -25,12 +25,14 @@ What the user is asking for, in English: 2. For each decoded frame, I want each frame to pass through the following transforms: 1. Add or remove frames as necessary to ensure a constant 30 frames per second. 2. Resize the frame to 640x480. Use the algorithm that is TorchVision's default. - 3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are chosen randomly upon the creation of the Python `VideoDecoder` object. All decoded frames use the same values for x and y. + 3. Inside the resized frame, crop the image to 32x32. The x and y coordinates are + chosen randomly upon the creation of the Python `VideoDecoder` object. All decoded + frames use the same values for x and y. These three transforms are instructive, as they force us to consider: 1. 
How "easy" TorchVision transforms will be handled, where all values are - static. Resize is such an example. + static. `Resize` is such an example. 2. Transforms that involve randomness. The main question we need to resolve is when the random value is resolved. I think this comes down to: once upon Python `VideoDecoder` creation, or different for each frame decoded? @@ -41,11 +43,11 @@ These three transforms are instructive, as they force us to consider: TorchVision. In particular, FPS is something that multiple users have asked for. -First let's consider implementing the "easy" case of Resize. +First let's consider implementing the "easy" case of `Resize`. 1. We add an optional `transforms` parameter to the initialization of `VideoDecoder`. It is a sequence of TorchVision Transforms. -2. During VideoDecoder object creation, we walk the list, capturing two +2. During `VideoDecoder` object creation, we walk the list, capturing two pieces of information: 1. The transform name that the C++ layer will understand. (We will have to decide if we want to just use the FFmpeg filter name @@ -87,9 +89,10 @@ Open questions: 2. For random transforms, when should the value be fixed? 3. Transforms such as Resize don't actually implement a `make_params()` method. How does TorchVision get their parameters? How will TorchCodec? -4. How do we communicate the transformation names and parameters to the C++ +4. Should the name at the bridge layer between Python and C++ just be the FFmpeg filter name? +5. How do we communicate the transformation names and parameters to the C++ layer? We need to support transforms with an arbitrary number of parameters. -5. How does this generalize to `AudioDecoder`? Ideally we would be able to +6. How does this generalize to `AudioDecoder`? Ideally we would be able to support TorchAudio's transforms in a similar way. -6. What is the relationship between the C++ transform objects, `FilterGraph` +7. What is the relationship between the C++ transform objects, `FilterGraph` and `FiltersContext`? From d452d661ccc40bc31db5104a549a48a8c7a84040 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:37:34 -0700 Subject: [PATCH 10/13] Headers --- decoder_native_transforms.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 87016eb21..93a6aecde 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -1,3 +1,6 @@ +# Decoder Native Transforms + +## API We want to support this user-facing API: ```python @@ -29,6 +32,7 @@ What the user is asking for, in English: chosen randomly upon the creation of the Python `VideoDecoder` object. All decoded frames use the same values for x and y. +## Design Considerations These three transforms are instructive, as they force us to consider: 1. How "easy" TorchVision transforms will be handled, where all values are @@ -43,6 +47,7 @@ These three transforms are instructive, as they force us to consider: TorchVision. In particular, FPS is something that multiple users have asked for. +## Implementation Sketch First let's consider implementing the "easy" case of `Resize`. 1. We add an optional `transforms` parameter to the initialization of @@ -83,7 +88,7 @@ For the transforms that do not exist in TorchVision, we can build on the above: 3. We implement the mimimum needed to hook the new transforms into the machinery defined above. -Open questions: +## Open questions: 1. Is `torchcodec.transforms` the right namespace? 2. 
For random transforms, when should the value be fixed? From d03d0e36eb546942ce9cc33ee4166647f912c528 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Mon, 8 Sep 2025 12:38:54 -0700 Subject: [PATCH 11/13] Questions --- decoder_native_transforms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 93a6aecde..fba326540 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -88,7 +88,7 @@ For the transforms that do not exist in TorchVision, we can build on the above: 3. We implement the mimimum needed to hook the new transforms into the machinery defined above. -## Open questions: +## Open Questions: 1. Is `torchcodec.transforms` the right namespace? 2. For random transforms, when should the value be fixed? From d70f4a1601a74ee572688d6912d89de77fc3ad18 Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Tue, 9 Sep 2025 06:03:29 -0700 Subject: [PATCH 12/13] Use size param --- decoder_native_transforms.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index fba326540..9611c5356 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -11,12 +11,10 @@ We want to support this user-facing API: fps=30, ), torchvision.transforms.v2.Resize( - width=640, - height=480, + size=(640, 480) ), torchvision.transforms.v2.RandomCrop( - width=32, - height=32, + size=(32, 32) ), ] ) From 1e0da614207bcd10b8834246723bbbd25dc41d7e Mon Sep 17 00:00:00 2001 From: Scott Schneider Date: Tue, 9 Sep 2025 06:18:51 -0700 Subject: [PATCH 13/13] Period --- decoder_native_transforms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/decoder_native_transforms.md b/decoder_native_transforms.md index 9611c5356..64c18ebbc 100644 --- a/decoder_native_transforms.md +++ b/decoder_native_transforms.md @@ -22,7 +22,7 @@ We want to support this user-facing API: What the user is asking for, in English: -1. I want to decode frames from the file `"vid.mp4".` +1. I want to decode frames from the file `"vid.mp4"`. 2. For each decoded frame, I want each frame to pass through the following transforms: 1. Add or remove frames as necessary to ensure a constant 30 frames per second. 2. Resize the frame to 640x480. Use the algorithm that is TorchVision's default.
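To ground the "transforms that do not exist in TorchVision" section, here is a minimal sketch of what the `FPS` transform could look like. It is illustrative only: the module it lives in is open question 1, the hook names follow TorchVision's documented subclassing pattern (`make_params()` / `transform()`), and the pass-through behavior is an assumption rather than a decision.

```python
from typing import Any

from torchvision.transforms.v2 import Transform


class FPS(Transform):
    # Sketch of a decoder-native transform: the decoder would recognize
    # this object and translate it into FFmpeg's fps filter. It is not
    # meant to be applied to frames that have already been decoded.

    def __init__(self, fps: float):
        super().__init__()
        self.fps = fps

    def make_params(self, flat_inputs: list[Any]) -> dict[str, Any]:
        # No randomness here; the only parameter is the target rate.
        return {"fps": self.fps}

    def transform(self, inpt: Any, params: dict[str, Any]) -> Any:
        # Frame-rate conversion only makes sense inside the decoder,
        # where frames can be added or dropped. Whether applying it
        # eagerly should pass through (as here) or raise is unsettled.
        return inpt
```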