
Conversation

@NicolasHug (Contributor) commented on Oct 14, 2025:

This PR addresses a bunch of TODOs related to the BETA CUDA interface. More will come later. See comments below for details.

}

if (videoParser_) {
// TODONVDEC P2: consider caching this? Does DALI do that?

@NicolasHug (author):
Answer: No, DALI doesn't cache the parser.


UniqueNppContext getNppStreamContext(const torch::Device& device) {
torch::DeviceIndex nonNegativeDeviceIndex = getNonNegativeDeviceIndex(device);
int deviceIndex = getDeviceIndex(device);

@NicolasHug (author):
Going from torch::DeviceIndex to int is a fix: cudaGetDeviceProperties and nppCtx->nCudaDeviceId below both expect an int, not a torch::DeviceIndex, which is an int8_t.
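For context, a minimal sketch of that pattern (assumed shape only; the real getNppStreamContext in torchcodec may differ): the device is converted to a plain int once, and that int is handed to both the CUDA runtime and the NPP context.

// Sketch, assuming UniqueNppContext is a unique_ptr-like wrapper around NPP's
// NppStreamContext (needs <cuda_runtime.h>, <nppdefs.h>, <memory>, torch headers).
UniqueNppContext getNppStreamContext(const torch::Device& device) {
  int deviceIndex = getDeviceIndex(device); // plain int, not int8_t

  cudaDeviceProp prop{};
  TORCH_CHECK(
      cudaGetDeviceProperties(&prop, deviceIndex) == cudaSuccess,
      "Failed to get CUDA device properties.");

  auto nppCtx = std::make_unique<NppStreamContext>();
  nppCtx->nCudaDeviceId = deviceIndex;                      // expects int
  nppCtx->nMultiProcessorCount = prop.multiProcessorCount;  // also an int
  return nppCtx;
}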

cudaGetDevice(&deviceIndex) == cudaSuccess,
"Failed to get current CUDA device.");
}
return deviceIndex;

@NicolasHug (author):

Another fix: getNonNegativeDeviceIndex() used to map -1 to 0, which is incorrect. -1 should be mapped to the "current device", which can be arbitrary and may have been set by a context manager from Python.
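For concreteness, a hedged sketch of how such a helper could look (the actual getDeviceIndex in CUDACommon.cpp may differ in details):

// Sketch: map a torch::Device to a plain int index; -1 means "whatever device
// is currently active", which may have been selected from Python via a device
// context manager.
int getDeviceIndex(const torch::Device& device) {
  int deviceIndex = static_cast<int>(device.index()); // int8_t -> int
  if (deviceIndex == -1) {
    TORCH_CHECK(
        cudaGetDevice(&deviceIndex) == cudaSuccess,
        "Failed to get current CUDA device.");
  }
  return deviceIndex;
}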

// but:
// - we should consider using std::unordered_map
// - we should consider a more sophisticated and potentially less strict
// cache key comparison logic

@NicolasHug (author):

I realized that using a std::unordered_map would force us to define our own hash function, so it's really not worth it. The O(log n) lookup complexity of std::map isn't relevant anyway, considering the very small sizes we deal with.

On the potentially less strict cache key comparison logic: I looked in more detail at DALI's implementation and they do the same thing, modulo decoder re-configuration, which is tracked in another TODO. So we can remove this one.
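To illustrate the trade-off with a hypothetical key type (not the actual cache key): std::map only needs an ordering on the key, while std::unordered_map would additionally require a user-provided hash.

#include <map>
#include <tuple>
#include <unordered_map>

// Hypothetical cache key, for illustration only.
struct CacheKey {
  int width;
  int height;
  int codecId;
  bool operator<(const CacheKey& other) const {
    return std::tie(width, height, codecId) <
        std::tie(other.width, other.height, other.codecId);
  }
};

// Works out of the box: std::map only needs operator<.
std::map<CacheKey, int> orderedCache;

// Would additionally require a custom hash functor, e.g.:
//   struct CacheKeyHash { size_t operator()(const CacheKey&) const; };
//   std::unordered_map<CacheKey, int, CacheKeyHash> hashedCache;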

# - before calling add_video_stream(device=device, device_variant=device_variant)
#
# TODONVDEC P2: Find a less clunky way to test the BETA CUDA interface. It
# will ultimately depend on how we want to publicly expose it.

@NicolasHug (author):

This was addressed by #959 and there is another TODO to remove the legacy use of cuda:0:beta, which will transitively resolve this one.

# a duration. CPU decoder fails too!
# Those are missing the `duration` field so they fail in approximate mode (on all devices).
# TODO: we should address this, see
# https://github.com/meta-pytorch/torchcodec/issues/945

@NicolasHug (author):

This isn't an NVDEC-specific TODO; it's a general one.

NVDECCache::getCache(device_.index())
.returnDecoder(&videoFormat_, std::move(decoder_));
NVDECCache::getCache(device_).returnDecoder(
&videoFormat_, std::move(decoder_));

@NicolasHug (author):

Above and everywhere else in the PR, anything related to "device index" comes from the introduction of getDeviceIndex(), which now lives in CUDACommon.cpp and replaces the old (buggy!) getNonNegativeDeviceIndex.

// Forward declaration of getDeviceIndex which exists in CUDACommon.h
// This avoids circular dependency between Cache.h and CUDACommon.cpp which also
// needs to include Cache.h
int getDeviceIndex(const torch::Device& device);

@NicolasHug (author) commented on Oct 15, 2025:

Not thrilled by this; happy to consider alternatives.

We could:

  • Include Cache.h within CUDACommon.cpp and not bother about the potential circular issue.
  • Have a forward declaration of PerGpuCache within CUDACommon.cpp instead (not sure whether that would work, since it's a template class).


Reviewer (Contributor) commented:

Forward declarations of templates are possible, but awkward. But if Cache.h needs to know CUDA stuff, then it's not really generic. The fact we have a circular dependency hints to me that we're not quite modeling things correctly. I can think of two alternatives:

  1. We add the ability to convert a torch::Device into a device-specific index to the DeviceInterface. That's still a little awkward, though, as we'd want to tell PerGpuCache the DeviceInterface so it would know where to get the index.
  2. PerGpuCache accepts the device index itself, instead of a torch::Device. That then assumes the device logic does the conversion itself, which I think is reasonable.

I prefer 2, but that might be a bigger refactor than this PR. So I'm good with this for now, and we can file a task for option 2, if you also agree it makes sense.


@NicolasHug (author) replied:

I think option 2 is reasonable; we could align both caches to just take an index as input. Let me follow up on that.
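For illustration, option 2 could look roughly like this (hedged sketch with illustrative names; the real PerGpuCache API is more involved): the cache is addressed by a plain int index, and the device-specific code performs the torch::Device-to-int conversion before touching it.

#include <cstddef>
#include <vector>

// Illustrative only: a per-GPU cache addressed by a plain int device index,
// so this header never needs to know about torch::Device or CUDA.
template <typename T>
class PerGpuCache {
 public:
  explicit PerGpuCache(std::size_t numDevices) : slots_(numDevices) {}

  T& get(int deviceIndex) {
    return slots_.at(static_cast<std::size_t>(deviceIndex));
  }

 private:
  std::vector<T> slots_;
};

// The device-specific code does the conversion itself, e.g.:
//   cache.get(getDeviceIndex(device));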

// that case we set the device index to 0. That's used in per-gpu cache
// implementation and during initialization of CUDA and FFmpeg contexts
// which require non negative indices.
deviceIndex = std::max<at::DeviceIndex>(deviceIndex, 0);

@NicolasHug (author):

Note: this was wrong! -1 should be mapped to the current device, not to 0.

// We set the device because we may be called from a different thread than
// the one that initialized the cuda context.
cudaSetDevice(nonNegativeDeviceIndex);
cudaSetDevice(deviceIndex);

@NicolasHug (author):

Here as well: cudaSetDevice expects an int, not an implicitly-cast torch::DeviceIndex.
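As a tiny illustration (sketch, not the exact call site): the conversion to int is made explicit before the runtime call.

// Sketch: obtain a plain int index and pass it to cudaSetDevice(int device),
// instead of relying on an implicit int8_t promotion.
int deviceIndex = getDeviceIndex(device_);
cudaSetDevice(deviceIndex);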

@NicolasHug NicolasHug marked this pull request as ready for review October 15, 2025 12:28
@NicolasHug NicolasHug merged commit 25e42e1 into meta-pytorch:main Oct 15, 2025
23 checks passed
NicolasHug added a commit to NicolasHug/torchcodec that referenced this pull request Oct 27, 2025
