Potential perf improvements to the BETA CUDA interface #944

Opening this mostly for my own sake, for future reference.


#### Cache improvements

I haven't put much thought into our existing cache; I mostly tried to make it safe. Maybe we can make it smarter to increase cache hits. For example, right now we require an exact match on stream resolution - maybe a decoder could still be re-used for a stream whose resolution is strictly smaller?
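A minimal sketch of what a more permissive lookup could look like, in Python for brevity (the real cache lives in the C++ layer). Everything here is hypothetical: `StreamFormat`, `DecoderCache` and the choice of key fields are illustrative names, not the existing torchcodec cache. It assumes a decoder can only be shared between streams with the same codec, chroma format and bit depth, and it picks the smallest cached decoder that is at least as large as the requested stream (so an exact match still wins when one exists).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StreamFormat:
    # Hypothetical description of a video stream; field names are illustrative.
    codec: str
    chroma_format: str
    bit_depth: int
    width: int
    height: int


@dataclass
class DecoderCache:
    # Each entry pairs the format a decoder was created for with an opaque handle.
    entries: list = field(default_factory=list)

    def put(self, fmt, decoder):
        self.entries.append((fmt, decoder))

    def take(self, wanted):
        """Remove and return the smallest cached decoder that fits `wanted`, or None."""
        best = None  # (area, index) of the best candidate seen so far
        for i, (fmt, _) in enumerate(self.entries):
            same_kind = (fmt.codec, fmt.chroma_format, fmt.bit_depth) == (
                wanted.codec, wanted.chroma_format, wanted.bit_depth)
            fits = fmt.width >= wanted.width and fmt.height >= wanted.height
            if same_kind and fits:
                area = fmt.width * fmt.height
                if best is None or area < best[0]:
                    best = (area, i)
        if best is None:
            return None
        return self.entries.pop(best[1])[1]
```

Preferring the smallest fitting decoder keeps the larger ones available for streams that actually need them. Whether a decoder taken this way can be used as-is, or needs to be re-configured for the smaller resolution first, is an open question.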

I think this is slightly related to decoder re-configuration:

#### Decoder re-configuration

The [NVDEC docs](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/index.html#reconfiguring-the-decoder) mention we could re-configure an existing decoder in some cases, typically when a "sequence change" occurs (stream resolution change).

DALI has a code path for that too.

We cache decoders, whereas the docs seem to assume a new decoder would otherwise be instantiated from scratch, so maybe re-configuration isn't needed in our case.
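To make the trade-off concrete, here is a rough sketch of the decision on a sequence change, in Python for brevity. The `Action` enum and `on_sequence_change` are hypothetical names. The conditions are assumptions that should be checked against the NVDEC docs linked above: that re-configuration (via `cuvidReconfigureDecoder`) is only possible when codec, chroma format and bit depth are unchanged, and when the new resolution fits within the maximum width and height the decoder was created with.

```python
from enum import Enum


class Action(Enum):
    REUSE = "reuse"              # format unchanged, keep decoding
    RECONFIGURE = "reconfigure"  # same kind of stream, new resolution fits
    RECREATE = "recreate"        # anything else: create (or fetch) another decoder


def on_sequence_change(current, new, max_width, max_height):
    """Pick an action when the stream format changes mid-decode.

    `current` and `new` are (codec, chroma_format, bit_depth, width, height)
    tuples; `max_width` and `max_height` are the limits the decoder was
    created with. The rules below are assumptions, not a documented contract.
    """
    if current == new:
        return Action.REUSE
    same_kind = current[:3] == new[:3]
    fits = new[3] <= max_width and new[4] <= max_height
    return Action.RECONFIGURE if same_kind and fits else Action.RECREATE
```

With a cache in place, the `RECREATE` branch could first try a cache lookup, which is where the two ideas overlap.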

#### CUDA Streams

We currently hard-code both the [NVDEC stream](https://github.com/meta-pytorch/torchcodec/blob/9c5da208f2ddc2c039c5af3541e3d18ca779af96/src/torchcodec/_core/BetaCudaDeviceInterface.cpp#L490-L491) and the [NPP stream](https://github.com/meta-pytorch/torchcodec/blob/9c5da208f2ddc2c039c5af3541e3d18ca779af96/src/torchcodec/_core/CUDACommon.cpp#L170-L171) (for color-conversion) to be the current stream (as, e.g., specified by a context manager). Maybe... those could be different?

#### Threaded implementation
 
- [NVDEC docs](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/index.html#using-nvidia-video-decoder-nvdecode-api) mention the synchronous “mapping” stage could be done in a separate thread.

- [In this section](https://docs.nvidia.com/video-technologies/video-codec-sdk/13.0/nvdec-video-decoder-api-prog-guide/index.html#writing-an-efficient-decode-application) they mention there could actually be 3 threads: 1 for demuxing (FFmpeg), 1 for decoding, and 1 for mapping.

This shouldn’t be too hard to implement, but it's out of scope for now, and we should factor in the maintenance cost of having to manage thread pools. It's also unclear to me whether this would make any difference when the user is already spawning N threads, each with its own `VideoDecoder()`. It's possible we're already maxing out NVDEC in that case, and if so the benefits of a threaded implementation would be minimal.
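For reference, the 3-thread layout could be structured roughly like this. It is a Python sketch of the pipeline shape only: `run_pipeline`, `decode` and `map_frame` are hypothetical placeholders, and none of this calls FFmpeg or NVDEC. Each stage runs on its own thread and hands results to the next stage through a bounded queue, so a slow stage applies back-pressure on the ones before it.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def _stage(work, inbox, outbox):
    # Apply `work` to each item from `inbox` and forward the result to `outbox`.
    while (item := inbox.get()) is not _DONE:
        outbox.put(work(item))
    outbox.put(_DONE)


def run_pipeline(packets, decode, map_frame, depth=4):
    """Demux, decode and map on three threads connected by bounded queues.

    No error handling: an exception inside a stage would stall the pipeline.
    """
    to_decode = queue.Queue(maxsize=depth)
    to_map = queue.Queue(maxsize=depth)
    mapped = queue.Queue(maxsize=depth)

    def demux():
        # Stand-in for the demuxing thread: feed packets to the decoder.
        for packet in packets:
            to_decode.put(packet)
        to_decode.put(_DONE)

    threads = [
        threading.Thread(target=demux),
        threading.Thread(target=_stage, args=(decode, to_decode, to_map)),
        threading.Thread(target=_stage, args=(map_frame, to_map, mapped)),
    ]
    for t in threads:
        t.start()

    # The calling thread drains mapped frames while the stages keep running.
    frames = []
    while (frame := mapped.get()) is not _DONE:
        frames.append(frame)
    for t in threads:
        t.join()
    return frames
```

Because each stage is a single thread reading from a FIFO queue, frames come out in the same order the packets went in. Even this small version needs shutdown signalling and would need error propagation, which gives a sense of the maintenance cost.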
