Address some TODONVDEC #960
Conversation
}

if (videoParser_) {
  // TODONVDEC P2: consider caching this? Does DALI do that?
Answer: No, DALI doesn't cache the parser.
UniqueNppContext getNppStreamContext(const torch::Device& device) {
-  torch::DeviceIndex nonNegativeDeviceIndex = getNonNegativeDeviceIndex(device);
+  int deviceIndex = getDeviceIndex(device);
Going from `torch::DeviceIndex` to `int` is a fix: `cudaGetDeviceProperties` and `nppCtx->nCudaDeviceId` below both expect `int`, not `torch::DeviceIndex`, which is `int8_t`.
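For context, a minimal sketch of the call sites in question (the surrounding code is paraphrased, not the PR's exact code): both consumers take a plain `int`, so converting once up front avoids implicit narrowing casts at each call site.

```cpp
#include <cuda_runtime.h>
#include <nppdefs.h>
#include <torch/torch.h>

// Paraphrased sketch. The point: both consumers below take a plain int,
// so the device index should be an int, not the narrower torch::DeviceIndex.
void configureNppContext(const torch::Device& device, NppStreamContext& nppCtx) {
  // In the PR this conversion is done by getDeviceIndex(); the cast is shown
  // here only to keep the sketch self-contained.
  int deviceIndex = static_cast<int>(device.index());

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, deviceIndex);  // expects an int device id

  nppCtx.nCudaDeviceId = deviceIndex;  // nCudaDeviceId is declared as int
}
```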
      cudaGetDevice(&deviceIndex) == cudaSuccess,
      "Failed to get current CUDA device.");
  }
  return deviceIndex;
Another fix: `getNonNegativeDeviceIndex()` used to map -1 to 0, which is incorrect. -1 should be mapped to the "current device", which can be arbitrary and set by a context manager from Python.
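A rough sketch of the intended behavior, consistent with the diff shown above (the helper name matches the PR, but the body here is paraphrased): an index of -1 means "the current device", so we query CUDA rather than defaulting to 0.

```cpp
#include <cuda_runtime.h>
#include <torch/torch.h>

int getDeviceIndex(const torch::Device& device) {
  int deviceIndex = static_cast<int>(device.index());
  if (deviceIndex == -1) {
    // -1 means "current device": it may have been set from Python by a device
    // context manager, so ask CUDA instead of assuming device 0.
    TORCH_CHECK(
        cudaGetDevice(&deviceIndex) == cudaSuccess,
        "Failed to get current CUDA device.");
  }
  return deviceIndex;
}
```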
  // but:
  // - we should consider using std::unordered_map
  // - we should consider a more sophisticated and potentially less strict
  //   cache key comparison logic
I realized that using an `unordered_map` would force us to define our own hash function, so that's really not worth it. The O(log n) lookup complexity of `std::map` isn't relevant anyway considering the very small sizes we deal with.

On the potentially less strict cache logic: I looked at DALI's implementation in more detail and they do the same thing, modulo decoder re-configuration, which is tracked in another TODO. So we can remove this one.
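To illustrate the hash-function point (the key fields below are made up, not the actual NVDEC cache key): `std::map` only needs an ordering, while `std::unordered_map` additionally requires a hash functor for a user-defined key.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <unordered_map>

struct CacheKey {  // hypothetical key, for illustration only
  int codecType;
  int width;
  int height;

  bool operator<(const CacheKey& other) const {
    return std::tie(codecType, width, height) <
        std::tie(other.codecType, other.width, other.height);
  }
};

// std::map works with just operator<:
std::map<CacheKey, int> orderedCache;

// std::unordered_map also needs a hash we would have to write ourselves:
struct CacheKeyHash {
  std::size_t operator()(const CacheKey& key) const {
    std::size_t seed = std::hash<int>{}(key.codecType);
    seed ^= std::hash<int>{}(key.width) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    seed ^= std::hash<int>{}(key.height) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
  }
};
std::unordered_map<CacheKey, int, CacheKeyHash> unorderedCache;
```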
# - before calling add_video_stream(device=device, device_variant=device_variant)
#
# TODONVDEC P2: Find a less clunky way to test the BETA CUDA interface. It
# will ultimately depend on how we want to publicly expose it.
This was addressed by #959 and there is another TODO to remove the legacy use of cuda:0:beta, which will transitively resolve this one.
# a duration. CPU decoder fails too!
# Those are missing the `duration` field so they fail in approximate mode (on all devices).
# TODO: we should address this, see
# https://github.com/meta-pytorch/torchcodec/issues/945
This isn't an NVDEC TODO, it's a general TODO.
-  NVDECCache::getCache(device_.index())
-      .returnDecoder(&videoFormat_, std::move(decoder_));
+  NVDECCache::getCache(device_).returnDecoder(
+      &videoFormat_, std::move(decoder_));
Above and everywhere else in the PR, anything related to "device index" relates to the new `getDeviceIndex()`, which now lives in CUDACommon.cpp and replaces the old (buggy!) `getNonNegativeDeviceIndex()`.
// Forward declaration of getDeviceIndex which exists in CUDACommon.h
// This avoids circular dependency between Cache.h and CUDACommon.cpp which also
// needs to include Cache.h
int getDeviceIndex(const torch::Device& device);
Not thrilled by this, happy to consider alternatives.

We could:
- include `Cache.h` within `CUDACommon.cpp` and not bother about the potential circular issue
- have a forward declaration of `PerGpuCache` within `CUDACommon.cpp` instead (not sure if that would work since it's a template class; see the sketch below)
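For reference, a generic illustration of what forward-declaring a class template looks like (the parameter list below is made up; it would have to match `PerGpuCache`'s real signature exactly, which is part of what makes this option brittle):

```cpp
// The parameter list of a forward declaration must match the real template.
template <typename Value>
class PerGpuCache;

// Only pointers/references to the incomplete type can be used until the full
// definition is visible; no members can be called.
void warmUp(PerGpuCache<int>* cache);
```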
Forward declarations of templates are possible, but awkward. But if Cache.h needs to know CUDA stuff, then it's not really generic. The fact that we have a circular dependency hints to me that we're not quite modeling things correctly. I can think of two alternatives:

1. We add the ability to convert a `torch::Device` into a device-specific index to the `DeviceInterface`. That's still a little awkward, though, as we'd want to tell `PerGpuCache` the `DeviceInterface` so it would know where to get the index.
2. `PerGpuCache` accepts the device index itself, instead of a `torch::Device`. That then assumes the device logic does the conversion itself, which I think is reasonable.

I prefer 2, but that might be a bigger refactor than this PR. So I'm good with this for now, and we can file a task for option 2, if you also agree it makes sense.
I think 2 is reasonable; we could align both caches to just take an index as input. Let me follow up on that.
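A rough sketch of what option 2 could look like (all names, the template signature, and the locking below are assumptions for illustration, not the actual torchcodec code): the cache is addressed by a plain `int` index, so it never needs to know about `torch::Device` or include any CUDA headers.

```cpp
#include <memory>
#include <mutex>
#include <vector>

template <typename Resource>
class PerGpuCacheSketch {
 public:
  explicit PerGpuCacheSketch(size_t numDevices) : perDevice_(numDevices) {}

  // Callers (the device interface) are responsible for converting a
  // torch::Device into a non-negative int index, e.g. via getDeviceIndex(),
  // before reaching the cache.
  std::unique_ptr<Resource> take(int deviceIndex) {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::move(perDevice_.at(static_cast<size_t>(deviceIndex)));
  }

  void put(int deviceIndex, std::unique_ptr<Resource> resource) {
    std::lock_guard<std::mutex> lock(mutex_);
    perDevice_.at(static_cast<size_t>(deviceIndex)) = std::move(resource);
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<Resource>> perDevice_;
};
```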
  // that case we set the device index to 0. That's used in per-gpu cache
  // implementation and during initialization of CUDA and FFmpeg contexts
  // which require non negative indices.
  deviceIndex = std::max<at::DeviceIndex>(deviceIndex, 0);
Note: this was wrong! -1 should be mapped to the current device, not to 0.
  // We set the device because we may be called from a different thread than
  // the one that initialized the cuda context.
-  cudaSetDevice(nonNegativeDeviceIndex);
+  cudaSetDevice(deviceIndex);
Here as well: `cudaSetDevice` expects an `int`, not an implicitly-cast `torch::DeviceIndex`.
This PR addresses a bunch of TODOs related to the BETA CUDA interface. More will come later. See comments below for details.