[CORE][NVEP]: add support for Vulkan interop #27456
Conversation
@nieubank and @skottmckay could you review the Vulkan interop? The enum changes in particular are required to use our next EP ABI drop.
This looks like what I would expect. No concerns from me; @skottmckay?
dd46601 to b028406 (force-pushed)
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Below is an AI analysis. Some points might be wrong (e.g. the assumption that this will be compiled for OSes other than Windows/Linux):

Summary

This PR adds Vulkan interop support for the NvTensorRtRtx execution provider, enabling graphics interop on Vulkan (and on Linux). It mirrors the existing D3D12 interop approach:
Critical Issues (Must Fix)

1. BUG: Missing comparison operator in
e3af9b4 to 0ccb087 (force-pushed)
I addressed most of the AI comments and renamed the handles from . This PR does not consider ae609bc yet; I might add support for ae609bc in a separate commit if the required changes are small.

I saw the Linux CI was failing for this PR, but I updated the PR before checking the exact output. It might be that the Linux CI runs on a server GPU, VM, or container that does not have graphics enabled, and that we need to skip the test if instance or device creation fails. I'm skipping the test in that case now. With that, we might not actually exercise the Linux path, though, and we should check the reason why.
Support for InitGraphicsInterop would probably mean ce2a871. The API doesn't seem to be well suited for Vulkan: e.g. a VkQueue is not very useful if one does not also have the VkInstance/VkDevice. For the case of CUDA, one could require passing . Please let me know if I should include it in this PR! @tianleiwu could you re-trigger the CI?
Can the VkInstance/VkDevice not be obtained via the info in the OrtEpDevice? If we need to adjust what is passed in we can, as this is all new in the current release. We did want to avoid potential inconsistencies if possible; e.g. if VkDevice can be obtained from the OrtEpDevice that's ideal, as there's no chance of a mismatch between a manually provided VkDevice and the device specified by the OrtEpDevice.
Not really: the user can create an arbitrary number of Vulkan instances, and Vulkan devices derived from those instances (they are all different). VkDevice is a logical device. The corresponding VkPhysicalDevice could maybe be derived from the info in the EpDevice, although even that is challenging at the moment:
Also, the EP cannot really make Vulkan calls without vkGetDeviceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetDeviceProcAddr.html) and vkGetInstanceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetInstanceProcAddr.html), and of course VkDevice, VkInstance and VkPhysicalDevice. It would be interesting to know what other EPs would require for Vulkan interop. For CUDA, interop works without any special preparation and is more efficient with the data returned from vkGetExternalComputeQueueDataNV (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetExternalComputeQueueDataNV.html), and that data is all that is needed.
Is my understanding correct that we would need to pass these in for VK interop in general, but you don't need them as using
If the EP implementer is making Vulkan calls I'm assuming they would have a dependency on the Vulkan library and can call vkGetInstanceProcAddr and vkGetDeviceProcAddr using the provided vkInstance and vkDevice respectively. Is that assumption reasonable? Is the VkPhysicalDevice required for the interop? I could see it maybe being used to optionally validate that the vkDevice matches, but it's not clear to me whether it's needed otherwise, as I would have assumed using the vkDevice is preferred. If there's additional metadata that is meaningful to add to the OrtHardwareDevice on non-Windows platforms, like the UUID, we could add that during device discovery.
Yes, that's correct. It's difficult to say what is needed in general; I'm only aware of the CUDA path.

Vulkan can be consumed via linking, but also via dynamic loading, and Vulkan calls can be layered via Vulkan layers that get resolved by the Vulkan loader. A portable approach for Vulkan middleware is function pointers to vkGetInstanceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetInstanceProcAddr.html) and vkGetDeviceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetDeviceProcAddr.html). This way you can get the Vulkan calls the way the application prefers (this PR uses dynamic loading via Vulkan-HPP). I'm not sure whether we want to perform Vulkan calls ourselves (e.g. perform the external compute queue calls on behalf of the user) or whether it is better to leave all the control to the application.

I assume that VkInstance/VkDevice could potentially be useful to prepare for Vulkan, but I can't provide concrete examples; I'm not an expert on other vendors. AMD ROCm/Vulkan interop seems like it would work similarly to the current CUDA interop, which happens without the InitForGraphics call. Looking at the QNN documentation, it seems as if no special preparation for a specific Vulkan device is necessary. For OpenVINO, I could only find documented preparation steps for OpenCL interop, where they would need either the cl_context or cl_queue for preparation (https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device/remote-tensor-api-gpu-plugin.html#examples), which would fit into the current API design.
Yes, a UUID would indeed be useful. However, the UUID can only be obtained from GPU APIs like CUDA, Vulkan, OpenCL or NVML, so it is likely something that the providers could add to the EPs where possible. LUID also does not work for CUDA<->dxcore matching on WSL (#27286). For Linux, we currently use the PCI bus id to match sysfs files with CUDA. The Vulkan extension VK_EXT_pci_bus_info (https://docs.vulkan.org/refpages/latest/refpages/source/VK_EXT_pci_bus_info.html) seems to be implemented mostly by AMD devices only.

If the user is willing to use CUDA in their application, matching Vulkan and EpDevice is possible via the detour Vulkan <- UUID -> CUDA <- CUDA device index -> ort_api.EpDevice_EpOptions(ep_device). One could add the UUID here https://github.com/theHamsta/onnxruntime/blob/ce2a871438e448b6151eff8e7529d98546a7b439/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc#L1200 additionally to
Small correction, Nvidia seems to support

```cpp
for (const auto& d : ep_devices) {
  if (d.Device().VendorId() == props.properties.vendorID && d.Device().DeviceId() == props.properties.deviceID) {
#if defined(_WIN32)
    // Verify the real device with the LUID; on Linux we only have the UUID, which we don't know for the EpDevice.
    auto luid = d.Device().Metadata().GetValue("LUID");
    if (id_props.deviceLUIDValid && luid) {
      LUID vk_luid;
      std::memcpy(&vk_luid, dev.id_props.deviceLUID, sizeof(LUID));
      uint64_t ep_luid = std::stoull(luid);
      uint64_t vk = (uint64_t(vk_luid.HighPart) << 32) | uint64_t(vk_luid.LowPart);
      if (ep_luid != vk) {
        continue;
      }
    }
#else
    auto pci_bus_id = d.Device().Metadata().GetValue("pci_bus_id");
    if (pci_bus_id) {
      std::cmatch matches;
      if (std::regex_match(pci_bus_id, matches, pci_bus_id_pattern)) {
        auto domain = std::stoull(matches[1].str(), nullptr, 16);
        auto bus = std::stoull(matches[2].str(), nullptr, 16);
        auto device = std::stoull(matches[3].str(), nullptr, 16);
        auto function = std::stoull(matches[4].str(), nullptr, 16);
        if (domain != pci_props.pciDomain || bus != pci_props.pciBus || device != pci_props.pciDevice || function != pci_props.pciFunction) {
          continue;
        }
      }
    }
#endif
    dev.ep_device_candidates.push_back(d);
  }
}
```
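The Linux branch above relies on a `pci_bus_id_pattern` defined elsewhere. As a standalone sketch (the exact sysfs-style `dddd:bb:dd.f` hex format is an assumption, and `ParsePciBusId` is a hypothetical helper, not code from this PR), the parsing could look like:

```cpp
#include <cstdint>
#include <regex>
#include <string>

// Parsed extended PCI address: domain:bus:device.function, all hex fields.
struct PciBusId {
  uint32_t domain = 0, bus = 0, device = 0, function = 0;
  bool valid = false;
};

// Parse a sysfs-style PCI bus id such as "0000:65:00.0".
inline PciBusId ParsePciBusId(const std::string& s) {
  static const std::regex pattern(
      "([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\\.([0-9a-fA-F])");
  PciBusId out;
  std::smatch m;
  if (std::regex_match(s, m, pattern)) {
    // Fields are hexadecimal, matching VK_EXT_pci_bus_info's integer fields.
    out.domain = static_cast<uint32_t>(std::stoul(m[1].str(), nullptr, 16));
    out.bus = static_cast<uint32_t>(std::stoul(m[2].str(), nullptr, 16));
    out.device = static_cast<uint32_t>(std::stoul(m[3].str(), nullptr, 16));
    out.function = static_cast<uint32_t>(std::stoul(m[4].str(), nullptr, 16));
    out.valid = true;
  }
  return out;
}
```

The parsed fields can then be compared against `pciDomain`/`pciBus`/`pciDevice`/`pciFunction` from `VkPhysicalDevicePCIBusInfoPropertiesEXT` as in the loop above.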
Latest AI analysis:

1. Vulkan Test Resource Lifetime (
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | Vulkan Test Resource Lifetime | VkResources destructor can read/destroy uninitialized Vulkan handles after early skip/failure paths. |
| 2 | High | Linux Opaque-FD Ownership | The Linux test closes imported FDs after CUDA ownership has already transferred. |
| 3 | High | Non-CiG Coverage | VkCigDisabled still requires the CiG extension stack and external queue creation. |
| 4 | Suggestion | Vulkan Version Requirement | The test requires Vulkan 1.4 even though it only uses older capabilities. |
Verdict
REQUEST CHANGES — the core importer wiring looks reasonable, but the new Vulkan test has a few correctness and portability issues that should be fixed before it can be relied on for this feature.
0ccb087 to 48eb157 (force-pushed)
I made the following changes, with testing on more platforms.

From AI review:

Other changes:
8355192 to 866d223 (force-pushed)
The Vulkan graphics interop works in a similar way to the D3D12 interop. The shared handles from Vulkan work for CUDA the same way as D3D12 handles; for Linux, we can use opaque file descriptors. As a sync primitive we use Vulkan timeline semaphores. They are widely supported since Vulkan 1.2 and work in a similar way to the existing `ID3D12Fence`s.
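Timeline semaphores are essentially a monotonically increasing 64-bit counter: signalers move it forward and waiters block until it reaches a target value, much like `ID3D12Fence::SetEventOnCompletion`. A minimal host-side analogue of those semantics (an illustration only, not the Vulkan API):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Host-side analogue of a Vulkan timeline semaphore: a 64-bit counter
// that only moves forward; waiters wake once it reaches their target.
class TimelineSemaphore {
 public:
  // Like vkSignalSemaphore: raise the counter (values never regress).
  void Signal(uint64_t value) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (value > value_) {
      value_ = value;
      cv_.notify_all();
    }
  }

  // Like vkWaitSemaphores: block until the counter reaches `value`.
  void Wait(uint64_t value) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return value_ >= value; });
  }

  // Like vkGetSemaphoreCounterValue: read the current counter.
  uint64_t Value() {
    std::lock_guard<std::mutex> lock(mutex_);
    return value_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  uint64_t value_ = 0;
};
```

In the real interop, the semaphore is exported from Vulkan (as a Win32 handle or opaque FD) and imported on the CUDA side, so both APIs observe the same counter.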
866d223 to 516384a (force-pushed)
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
```cpp
props.pNext = &id_props;
resources.loader.vkGetPhysicalDeviceProperties2(p, &props);
if (props.properties.vendorID == 0x10DE) {
  std::memcpy(&dev.props, &props, sizeof(props));
```
memcpy of VkPhysicalDeviceProperties2 into VkPhysicalDeviceProperties:

```cpp
std::memcpy(&dev.props, &props, sizeof(props));
```

Here dev.props is VkPhysicalDeviceProperties (no 2 suffix) but props is VkPhysicalDeviceProperties2. The sizeof(props) is the size of the 2 variant, which is larger. This overflows dev.props and corrupts adjacent memory (dev.id_props). Should be:

```cpp
dev.props = props.properties;
```

Similarly:

```cpp
std::memcpy(&dev.id_props, &id_props, sizeof(id_props));
```

This is fine since both sides are VkPhysicalDeviceVulkan11Properties, but the struct contains a pNext pointer: copying it means dev.id_props.pNext still points to the stack-local pci_props. This pointer becomes dangling once the loop iteration ends. If id_props.pNext is never dereferenced after the copy, this is benign, but it's fragile. Consider clearing pNext after the copy.
The fix can be in another PR.
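The shape of the bug, sketched with stand-in structs rather than the real Vulkan types: the chained "2" variant is strictly larger than the plain struct, so a memcpy of `sizeof` the "2" variant overruns the plain target, while member assignment copies exactly the right number of bytes.

```cpp
#include <cstddef>

// Stand-ins for VkPhysicalDeviceProperties / VkPhysicalDeviceProperties2.
// These are NOT the real Vulkan structs, just the same layout idea.
struct Properties {
  unsigned vendorID;
  unsigned deviceID;
};

struct Properties2 {
  void* pNext;            // extension chain, like the real "2" struct
  Properties properties;  // the embedded plain struct
};

// Correct extraction: member assignment copies exactly sizeof(Properties)
// bytes. A memcpy with sizeof(Properties2) into a Properties target would
// write past its end, which is the overflow flagged in the review.
inline Properties ExtractProperties(const Properties2& p2) {
  return p2.properties;
}

// The size mismatch that makes the memcpy an overflow.
constexpr bool kTwoVariantIsLarger = sizeof(Properties2) > sizeof(Properties);
```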
```cpp
  LOGS_DEFAULT(WARNING) << "[NvTensorRTRTX EP] InitGraphicsInterop: Can't enable CUDA in Graphics (CiG) for Vulkan without onnxruntime::nv::provider_option_names::kExternalComputeQueueDataParamNV_data";
  return nullptr;
}
uint64_t nv_blob_ptr = std::stoull(nv_blob_ptr_str);
```
std::stoull can throw on malformed input (High-Priority): std::stoull(nv_blob_ptr_str) will throw std::invalid_argument or std::out_of_range if the string is not a valid unsigned integer. This is inside a noexcept-equivalent C API implementation; an uncaught exception will call std::terminate. Wrap it in a try/catch or use a safe parser.

```cpp
uint64_t nv_blob_ptr = 0;
try {
  nv_blob_ptr = std::stoull(nv_blob_ptr_str);
} catch (...) {
  return onnxruntime::CreateStatus(ORT_INVALID_ARGUMENT,
                                   "[NvTensorRTRTX EP] Invalid value for kExternalComputeQueueDataParamNV_data");
}
```
The fix can be in another PR.
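If exceptions are to be avoided entirely, std::from_chars offers a non-throwing alternative to the try/catch above. A sketch, not the code in this PR (`ParseU64` is a hypothetical helper name):

```cpp
#include <charconv>
#include <cstdint>
#include <string>

// Exception-free unsigned parse: returns false on malformed, partial,
// or out-of-range input instead of throwing, so it is safe to call
// behind a noexcept C API boundary.
inline bool ParseU64(const std::string& s, uint64_t& out) {
  const char* first = s.data();
  const char* last = first + s.size();
  auto result = std::from_chars(first, last, out);
  // Require both a successful conversion and full consumption of the input.
  return result.ec == std::errc() && result.ptr == last;
}
```

std::from_chars never throws and never touches locale state, which also makes it cheaper than std::stoull on hot paths.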
- nv_vulkan_test.cc: replace the memcpy of VkPhysicalDeviceProperties2 into VkPhysicalDeviceProperties (buffer overflow) with the direct field assignment dev.props = props.properties; also clear id_props.pNext after the copy to avoid a dangling pointer to the stack-local pci_props
- nv_provider_factory.cc: wrap std::stoull() in a try/catch to prevent std::invalid_argument/std::out_of_range from propagating through the noexcept C API boundary and calling std::terminate on malformed input
Description
The Vulkan interop works in a similar way to the D3D12 interop.
The shared handles from Vulkan work for CUDA the same way as D3D12 handles. For Linux, we can use file descriptors.
As a sync primitive we use Vulkan timeline semaphores. They are widely supported since Vulkan 1.2 and work in a similar way to the existing `ID3D12Fence`s.

Motivation and Context
This change allows graphics interop to be used on Vulkan and on Linux as well. It addresses a TODO in the external memory API.