
[CORE][NVEP]: add support for Vulkan interop #27456

Merged
tianleiwu merged 1 commit into microsoft:main from theHamsta:sseitz/vulkan-interop on Apr 7, 2026

Conversation

@theHamsta
Contributor

Description

The Vulkan interop works similarly to the existing D3D12 interop.

The shared handles from Vulkan work for CUDA the same way as D3D12 handles. For Linux, we can use file descriptors.

As a synchronization primitive we use Vulkan timeline semaphores. They have been widely supported since Vulkan 1.2 and work similarly to the existing ID3D12Fence objects.

Motivation and Context

This change enables graphics interop with Vulkan, including on Linux, and addresses a TODO in the external memory API.

@gedoensmax
Contributor

@nieubank and @skottmckay, could you review the Vulkan interop? The enum changes in particular are required to use our next EP ABI drop.

@nieubank
Contributor

@nieubank and @skottmckay, could you review the Vulkan interop? The enum changes in particular are required to use our next EP ABI drop.

This looks like what I would expect. No concerns from me. @skottmckay?

@theHamsta theHamsta force-pushed the sseitz/vulkan-interop branch from dd46601 to b028406 on February 26, 2026 08:09
@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu
Contributor

tianleiwu commented Mar 5, 2026

Below is an AI analysis. Some points might be wrong (e.g., the assumption that the code will be compiled for operating systems other than Windows/Linux):


Summary

This PR adds Vulkan interop support for the NvTensorRtRtx execution provider, enabling graphics interop on Vulkan (and Linux). It mirrors the existing D3D12 interop approach:

  • Adds new enum values for Vulkan memory handles (VK_BUFFER_WIN32, VK_BUFFER_OPAQUE_FD) and Vulkan timeline semaphores (VK_TIMELINE_SEMAPHORE_WIN32, VK_TIMELINE_SEMAPHORE_OPAQUE_FD) in the public C API.
  • Extends nv_provider_factory.cc to handle Vulkan handle types for memory and semaphore import on both Windows and Linux (removing the Windows-only #if defined(_WIN32) guard).
  • Adds a Vulkan-Headers dependency in cmake/deps.txt.
  • Adds a comprehensive Vulkan interop test (nv_vulkan_test.cc, ~837 lines).

Critical Issues (Must Fix)

1. BUG: Missing comparison operator in CanImportSemaphoreImpl (Windows path)

File: onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc

// Current (buggy):
return type == ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE || ORT_EXTERNAL_SEMAPHORE_VK_TIMELINE_SEMAPHORE_WIN32;

// Correct:
return type == ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE || type == ORT_EXTERNAL_SEMAPHORE_VK_TIMELINE_SEMAPHORE_WIN32;

ORT_EXTERNAL_SEMAPHORE_VK_TIMELINE_SEMAPHORE_WIN32 is enum value 1, which is truthy. This means CanImportSemaphoreImpl will always return true on Windows regardless of type, bypassing the validation check entirely. Any invalid semaphore type would be accepted, potentially causing silent failures or crashes downstream.


Major Issues

2. Inconsistent preprocessor guard style: #if _WIN32 vs #if defined(_WIN32)

In the production code (nv_provider_factory.cc), the new code uses bare #if _WIN32 and #elif __linux__ in several places (lines around CanImportMemoryImpl, CanImportSemaphoreImpl, ImportMemoryImpl, ImportSemaphoreImpl), while the test file and existing code consistently use #if defined(_WIN32).

  • #if _WIN32 will trigger -Wundef warnings when _WIN32 is not defined (e.g., on Linux/macOS). The standard safe pattern is #if defined(_WIN32).
  • Similarly #elif __linux__ should be #elif defined(__linux__).

Affected locations in nv_provider_factory.cc (new code):

  • CanImportMemoryImpl: #if _WIN32 / #elif __linux__
  • ImportMemoryImpl: #if _WIN32 / #else
  • CanImportSemaphoreImpl: #if _WIN32 / #elif __linux__
  • ImportSemaphoreImpl: #if _WIN32 / #else

Recommendation: Use #if defined(_WIN32) and #elif defined(__linux__) consistently to match existing code style and avoid compiler warnings.

3. No default case in semaphore type switch in ImportSemaphoreImpl

switch (desc->type) {
  case ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE:
    ext_sem_desc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE;
    break;
  case ORT_EXTERNAL_SEMAPHORE_VK_TIMELINE_SEMAPHORE_WIN32:
    ext_sem_desc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_WIN32;
    break;
  case ORT_EXTERNAL_SEMAPHORE_VK_TIMELINE_SEMAPHORE_OPAQUE_FD:
    ext_sem_desc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_FD;
    break;
  // Missing: default case
}

The equivalent switch in ImportMemoryImpl correctly has a default: case that returns an error. The semaphore switch should be consistent, especially since CanImportSemaphoreImpl is buggy (issue #1) and cannot be relied upon to filter invalid types.

4. Missing return value on non-Win32/non-Linux platforms in CanImportMemoryImpl and CanImportSemaphoreImpl

Both CanImportMemoryImpl and CanImportSemaphoreImpl use:

#if _WIN32
    return ...;
#elif __linux__
    return ...;
#endif

If compiled on a platform that is neither Windows nor Linux (e.g., macOS), neither branch is taken and the function has no return statement — this is undefined behavior. Add an #else return false; fallback.


Minor Issues / Suggestions

5. File descriptor cast chain

ext_sem_desc.handle.fd = (int)(size_t)desc->native_handle;
ext_mem_desc.handle.fd = (int)(size_t)(desc->native_handle);

This double cast (void* → size_t → int) works in practice but is somewhat fragile. Consider using static_cast<int>(reinterpret_cast<intptr_t>(desc->native_handle)) for clarity — which is actually the pattern used in the test code for the reverse direction.

6. Comment updates

The comment // CUDA supports D3D12 resource and heap handles in CanImportMemoryImpl is not updated to mention Vulkan buffer handles. Similarly, the comment above CanImportSemaphoreImpl still says // CUDA supports D3D12 timeline fences but should now mention Vulkan timeline semaphores.

7. is_dedicated = false assumption for Vulkan

case ORT_EXTERNAL_MEMORY_HANDLE_TYPE_VK_BUFFER_WIN32:
    cu_handle_type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32;
    is_dedicated = false;  // assuming no for non-images
    break;

The comment "assuming no for non-images" is technically inaccurate — dedicated allocations are about whether VkMemoryDedicatedAllocateInfo was used during vkAllocateMemory, not about images vs buffers. With Vulkan, buffers can also have dedicated allocations. Consider either:

  • Documenting this limitation more clearly, or
  • Adding a field to OrtExternalMemoryDescriptor to specify dedicated vs. non-dedicated (future work).

8. Test file: VK_EXT_EXTERNAL_MEMORY_DMA_BUF_EXTENSION_NAME requested but not used

The test requests VK_EXT_EXTERNAL_MEMORY_DMA_BUF_EXTENSION_NAME on Linux but then asserts EXPECT_FALSE(test_params.use_dmabuf) — the extension is loaded but never exercised. This is not harmful but may confuse readers. Consider adding a comment explaining this is for future use, or removing it until DMA-BUF support is implemented.

9. Test: CiG test only runs on Windows

#if defined(_WIN32)
TEST(NvExecutionProviderVulkanTest, VkForceCig) {

The CiG (CUDA in Graphics) test is Windows-only. Consider adding a Linux CiG test as well if CiG is supported on Linux, or adding a comment explaining why it's Windows-only.

10. Public C API: remaining TODO for DMA-BUF

The remaining TODO in onnxruntime_c_api.h:

 * \todo Add Linux DMA-BUF file descriptor for embedded GPU memory sharing

is appropriate to keep as it's not addressed by this PR.


API Design Observations

  • The enum extension pattern (adding new values at the end) is correct and ABI-compatible.
  • The use of Vulkan timeline semaphores is a good match for the existing ID3D12Fence pattern since both are timeline-based synchronization primitives.
  • The Linux path using opaque file descriptors is the standard approach for CUDA ↔ Vulkan interop.

Verdict

The overall approach is sound and the test coverage is thorough. However, Issue #1 (the missing type == comparison) is a critical logic bug that must be fixed before merge. Issues #2 through #4 should also be addressed, as they affect code correctness and portability.

@theHamsta theHamsta force-pushed the sseitz/vulkan-interop branch 2 times, most recently from e3af9b4 to 0ccb087 on March 9, 2026 10:03
@theHamsta
Contributor Author

I addressed most of the AI comments and renamed the handles from `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_VK_BUFFER_WIN32`/`ORT_EXTERNAL_MEMORY_HANDLE_TYPE_VK_BUFFER_OPAQUE` to `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_VK_MEMORY_WIN32`/`ORT_EXTERNAL_MEMORY_HANDLE_TYPE_VK_MEMORY_OPAQUE`.

This PR does not consider ae609bc yet. I might add support for ae609bc as a separate commit if the required changes are small.

I saw the Linux CI was failing for this PR, but I updated the PR before checking the exact output. The Linux CI may be running on a server GPU, VM, or container without graphics enabled, in which case we need to skip the test when instance or device creation fails. I'm skipping the test in that case now. However, that means we might not actually exercise the Linux path, and we should investigate the reason.

@theHamsta
Contributor Author

theHamsta commented Mar 9, 2026

Support for InitGraphicsInterop would probably mean ce2a871. The API doesn't seem to be well suited for Vulkan: e.g. VkQueue is not very useful if one does not also have VkInstance/VkDevice. For the case of CUDA, one could require passing onnxruntime::nv::provider_option_names::kExternalComputeQueueDataParamNV_data.

Please let me know if I should include it in this PR!

@tianleiwu could you re-trigger the CI?

@skottmckay
Contributor

Support for InitGraphicsInterop would probably mean ce2a871 . The API doesn't seem to be well suited for Vulkan: e.g. VkQueue is not very useful if one does not also have VkInstance/VkDevice.

Please let me know if I should include it in this PR!

Can the VkInstance/VkDevice not be obtained via the info in the OrtEpDevice that is passed to InitGraphicsInteropForEpDevice?

If we need to adjust what is passed in, we can, as this is all new in the current release. We did want to avoid potential inconsistencies if possible, e.g. if VkDevice can be obtained from the OrtEpDevice that's ideal, as there's no chance of a mismatch between a manually provided VkDevice and the device specified by the OrtEpDevice.

@theHamsta
Contributor Author

theHamsta commented Mar 11, 2026

Can the VkInstance/VkDevice not be obtained via the info in the OrtEpDevice that is passed to InitGraphicsInteropForEpDevice?

If we need to adjust what is passed in, we can, as this is all new in the current release. We did want to avoid potential inconsistencies if possible, e.g. if VkDevice can be obtained from the OrtEpDevice that's ideal, as there's no chance of a mismatch between a manually provided VkDevice and the device specified by the OrtEpDevice.

Not really: the user can create an arbitrary number of Vulkan instances and Vulkan devices derived from those instances (they are all different). VkDevice is a logical device. The corresponding VkPhysicalDevice could perhaps be derived from the info in EpDevice, although even that is challenging at the moment:

  • LUID can be used for matching on Windows, which is ok
  • non-Windows platforms use a UUID to match devices, which is currently not exposed for EpDevice.

Also, the EP cannot really make Vulkan calls without vkGetDeviceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetDeviceProcAddr.html) and vkGetInstanceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetInstanceProcAddr.html), and of course VkDevice, VkInstance, and VkPhysicalDevice.

It would be interesting to know what other EPs would require for Vulkan interop. For CUDA, interop works without any special preparation and is more efficient with the data returned from vkGetExternalComputeQueueDataNV (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetExternalComputeQueueDataNV.html); that data is all that is needed.

@skottmckay
Contributor

Not really: the user can create an arbitrary number of Vulkan instances and Vulkan devices derived from those instances (they are all different). VkDevice is a logical device. The corresponding VkPhysicalDevice could perhaps be derived from the info in EpDevice, although even that is challenging at the moment:

  • LUID can be used for matching on Windows, which is ok
  • non-Windows platforms use a UUID to match devices, which is currently not exposed for EpDevice.

Also, the EP cannot really make Vulkan calls without vkGetDeviceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetDeviceProcAddr.html) and vkGetInstanceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetInstanceProcAddr.html), and of course VkDevice, VkInstance, and VkPhysicalDevice.

It would be interesting to know what other EPs would require for Vulkan interop. For CUDA, interop works without any special preparation and is more efficient with the data returned from vkGetExternalComputeQueueDataNV (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetExternalComputeQueueDataNV.html); that data is all that is needed.

Is my understanding correct that we would need to pass these in for VK interop in general, but you don't need them since using vkGetExternalComputeQueueDataNV via EP options is the preferred approach?

  • vkDevice - logical device, should match physical device represented by OrtEpDevice but not 1:1
  • vkInstance - user could have multiple
  • vkQueue

If the EP implementer is making Vulkan calls I'm assuming they would have a dependency on the Vulkan library and can call vkGetInstanceProcAddr and vkGetDeviceProcAddr using the provided vkInstance and vkDevice respectively. Is that assumption reasonable?

Is the VkPhysicalDevice required for the interop? I could see it maybe being used to optionally validate that the vkDevice matches, but it's not clear to me if it's needed otherwise, as I would have assumed using the vkDevice is preferred.

If there's additional metadata that is meaningful to add to the OrtHardwareDevice on non-Windows platforms like the UUID we could add that during device discovery.

@theHamsta
Contributor Author

theHamsta commented Mar 12, 2026

Is my understanding correct that we would need to pass in these for VK interop in general, but you don't need them as using vkGetExternalComputeQueueDataNV via EP options is a preferred approach?

Yes, that's correct. It's difficult to say what is needed in general; I'm only aware of the CUDA path.

If the EP implementer is making Vulkan calls I'm assuming they would have a dependency on the Vulkan library and can call vkGetInstanceProcAddr and vkGetDeviceProcAddr using the provided vkInstance and vkDevice respectively. Is that assumption reasonable?

Vulkan can be consumed via linking but also via dynamic loading, and Vulkan calls can be layered via Vulkan layers that get resolved by the Vulkan loader. A portable approach for Vulkan middleware is to take function pointers to vkGetInstanceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetInstanceProcAddr.html) and vkGetDeviceProcAddr (https://docs.vulkan.org/refpages/latest/refpages/source/vkGetDeviceProcAddr.html). This way you can get the Vulkan calls the way the application prefers (this PR uses dynamic loading via Vulkan-HPP). I'm not sure whether we want to perform Vulkan calls ourselves (e.g. perform the external compute queue calls on behalf of the user) or whether it is better to leave all control to the application.

I assume that VkInstance/VkDevice could potentially be useful to prepare for Vulkan, but I can't provide concrete examples; I'm not an expert on other vendors. AMD ROCm/Vulkan interop seems like it would work similarly to the current CUDA interop, which happens without the InitForGraphics call. Looking at QNN documentation, it seems no special preparation for a specific Vulkan device is necessary. For OpenVINO, the only documented preparation steps I found are for OpenCL interop, where either the cl_context or cl_queue is needed (https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device/remote-tensor-api-gpu-plugin.html#examples), which would fit into the current API design.

If there's additional metadata that is meaningful to add to the OrtHardwareDevice on non-Windows platforms like the UUID we could add that during device discovery.

Yes, UUID would indeed be useful. However, UUID can only be inferred from GPU APIs like CUDA, Vulkan, OpenCL, or NVML, so it is likely something that the providers could add to the EPs where possible. LUID also does not work for CUDA <-> DXCore matching on WSL (#27286).

For Linux, we currently use the PCI bus id to match sysfs files with CUDA. The Vulkan extension VK_EXT_pci_bus_info (https://docs.vulkan.org/refpages/latest/refpages/source/VK_EXT_pci_bus_info.html) seems to be implemented mostly by AMD devices.

If the user is willing to use CUDA in their application, matching Vulkan and EpDevice is possible via the detour

Vulkan <- UUID -> CUDA <- CUDA device index -> ort_api.EpDevice_EpOptions(ep_device)

One could add a UUID here (https://github.com/theHamsta/onnxruntime/blob/ce2a871438e448b6151eff8e7529d98546a7b439/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc#L1200) in addition to device_id, but it would be difficult to discover that the UUID is in the EpOptions rather than the general metadata.

@theHamsta
Contributor Author

theHamsta commented Mar 12, 2026


Small correction: Nvidia does seem to support VK_EXT_pci_bus_info, so it could be used for Linux device matching. The EpDevices already have the PCI bus id in their metadata.

      for (const auto& d : ep_devices) {
        if (d.Device().VendorId() == props.properties.vendorID && d.Device().DeviceId() == props.properties.deviceID) {
#if defined(_WIN32)
          // verify the real device with LUID, but on Linux we only have UUID which we don't know for EpDevice
          auto luid = d.Device().Metadata().GetValue("LUID");
          if (id_props.deviceLUIDValid && luid) {
            LUID vk_luid;
            std::memcpy(&vk_luid, dev.id_props.deviceLUID, sizeof(LUID));
            uint64_t ep_luid = std::stoull(luid);
            uint64_t vk = (uint64_t(vk_luid.HighPart) << 32) | uint64_t(vk_luid.LowPart);
            if (ep_luid != vk) {
              continue;
            }
          }
#else
          auto pci_bus_id = d.Device().Metadata().GetValue("pci_bus_id");
          if (pci_bus_id) {
            std::cmatch matches;
            if (std::regex_match(pci_bus_id, matches, pci_bus_id_pattern)) {
              auto domain = std::stoull(matches[1].str(), nullptr, 16);
              auto bus = std::stoull(matches[2].str(), nullptr, 16);
              auto device = std::stoull(matches[3].str(), nullptr, 16);
              auto function = std::stoull(matches[4].str(), nullptr, 16);
              if (domain != pci_props.pciDomain || bus != pci_props.pciBus || device != pci_props.pciDevice || function != pci_props.pciFunction) {
                continue;
              }
            }
          }
#endif
          dev.ep_device_candidates.push_back(d);
        }

@tianleiwu
Contributor

@theHamsta,

Latest AI analysis:


1. Vulkan Test Resource Lifetime (onnxruntime/test/providers/nv_tensorrt_rtx/nv_vulkan_test.cc)

Positive:

  • Consolidating the Vulkan objects, imported ORT handles, and cleanup into a single VkResources RAII holder is the right structure for a large interop test. It keeps the teardown ordering explicit and makes the test much easier to follow.

Concern:

  • ⚠️ Uninitialized handles in skip/failure paths: VkResources leaves instance and device uninitialized, but the destructor branches on them in the common cleanup path. If init_vulkan_interop() hits an early GTEST_SKIP() or a failed Vulkan call before both handles are assigned, the destructor will read indeterminate values and may try to destroy garbage handles. These members should be value-initialized to VK_NULL_HANDLE up front.

2. Linux Opaque-FD Ownership (onnxruntime/test/providers/nv_tensorrt_rtx/nv_vulkan_test.cc)

Positive:

  • The test covers the same handle types that the importer exposes in the public C API, which is exactly the right level of integration coverage for this feature.

Concern:

  • ⚠️ The test closes FDs after CUDA already owns them: on Linux, the test imports Vulkan memory/semaphore FDs into CUDA in create_timeline_semaphore and allocate_buffer, then later closes those same FDs in VkResources::~VkResources. CUDA's OPAQUE_FD import takes ownership on success, so touching the FD afterward is undefined behavior. At best this becomes a spurious EBADF; at worst it closes an unrelated resource if the descriptor number has been reused.

3. Non-CiG Coverage (onnxruntime/test/providers/nv_tensorrt_rtx/nv_vulkan_test.cc)

Positive:

  • Splitting the test matrix into VkCigDisabled and VkForceCig is a good design. It makes the intended capability split visible instead of burying it behind runtime branching.

Concern:

  • ⚠️ VkCigDisabled still hard-requires CiG plumbing: init_vulkan_interop() always enables VK_NV_external_compute_queue, reserves an external queue, creates VkExternalComputeQueueNV, and later test_vulkan_interop() always fetches its queue data (device setup, queue-data fetch). That means the supposedly non-CiG test cannot run unless the CiG extension stack is also present. Given the comment that Vulkan CiG is currently Windows-only, this blocks the new Linux path from being exercised on the configurations it is meant to cover.

4. Vulkan Version Requirement (onnxruntime/test/providers/nv_tensorrt_rtx/nv_vulkan_test.cc)

Positive:

  • Using timeline semaphores and the Vulkan 1.1/1.2 property/feature structs is consistent with the feature set the test actually needs.

Concern:

  • ⚠️ The test asks for Vulkan 1.4 even though the exercised feature set is older: app_info.apiVersion = VK_API_VERSION_1_4 unnecessarily narrows the set of drivers that can run this test. The test only uses Vulkan 1.1/1.2-era capabilities, so this should either clamp to the loader's advertised version or request the minimum version the test actually requires.

Summary of Concerns

| # | Severity | Component | Issue |
|---|----------|-----------|-------|
| 1 | High | Vulkan Test Resource Lifetime | VkResources destructor can read/destroy uninitialized Vulkan handles after early skip/failure paths. |
| 2 | High | Linux Opaque-FD Ownership | The Linux test closes imported FDs after CUDA ownership has already transferred. |
| 3 | High | Non-CiG Coverage | VkCigDisabled still requires the CiG extension stack and external queue creation. |
| 4 | Suggestion | Vulkan Version Requirement | The test requires Vulkan 1.4 even though it only uses older capabilities. |

Verdict

REQUEST CHANGES — the core importer wiring looks reasonable, but the new Vulkan test has a few correctness and portability issues that should be fixed before it can be relied on for this feature.

@theHamsta theHamsta force-pushed the sseitz/vulkan-interop branch from 0ccb087 to 48eb157 on March 24, 2026 16:29
@theHamsta
Contributor Author

theHamsta commented Mar 24, 2026

I made the following changes, with testing on more platforms.

From AI review

  • initialized the Vulkan handles to zero and added some more checks around destruction
  • removed closing handles/FDs; I confirmed that the Vulkan driver closes them on vkFreeMemory (ownership is not transferred to CUDA)
  • changed the CiG testing; the three cases are now tested on all platforms (I now check for the presence of the CiG extension):
    • force CiG off (should always work)
    • use InitForGraphics (auto-behavior: the implementation can use CiG, but it also works on platforms where CiG does not)
    • force CiG on; this test is skipped when CUDA says that Vulkan CiG is not supported
  • on Vulkan 1.4: I did not change that; I'm assuming a recent Nvidia baseline in this test

Other changes:

  • added InitForGraphics support
  • guarded Vulkan::Headers: when the WebGPU EP and Dawn with Vulkan are enabled, Dawn brings its own Vulkan headers that are not controlled by ORT; we then use whatever Dawn uses instead of introducing a new conflicting version
  • removed the DEVICE_LOCAL flag for host allocations: it is not required

@theHamsta theHamsta force-pushed the sseitz/vulkan-interop branch 2 times, most recently from 8355192 to 866d223 on March 24, 2026 16:36
The Vulkan graphics interop works similarly to the D3D12 interop.

The shared handles from Vulkan work for CUDA the same way as D3D12 handles. For Linux, we can use opaque file descriptors.

As a sync primitive we use Vulkan timeline semaphores. They are widely supported since Vulkan 1.2 and work similarly to the existing `ID3D12Fence`s.
@theHamsta theHamsta force-pushed the sseitz/vulkan-interop branch from 866d223 to 516384a on March 25, 2026 13:01
@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

props.pNext = &id_props;
resources.loader.vkGetPhysicalDeviceProperties2(p, &props);
if (props.properties.vendorID == 0x10DE) {
std::memcpy(&dev.props, &props, sizeof(props));
Contributor


⚠️ memcpy of VkPhysicalDeviceProperties2 into VkPhysicalDeviceProperties:

std::memcpy(&dev.props, &props, sizeof(props));

Here dev.props is VkPhysicalDeviceProperties (no 2 suffix) but props is VkPhysicalDeviceProperties2. The sizeof(props) is the size of the 2 variant, which is larger. This overflows dev.props and corrupts adjacent memory (dev.id_props). Should be:

dev.props = props.properties;

Similarly:

std::memcpy(&dev.id_props, &id_props, sizeof(id_props));

This is fine since both sides are VkPhysicalDeviceVulkan11Properties, but the struct contains a pNext pointer — copying it means dev.id_props.pNext still points to the stack-local pci_props. This pointer becomes dangling once the loop iteration ends. If id_props.pNext is never dereferenced after the copy, this is benign, but it's fragile. Consider clearing pNext after copy.

Contributor


The fix can be in another PR.

LOGS_DEFAULT(WARNING) << "[NvTensorRTRTX EP] InitGraphicsInterop: Can't enable CUDA in Graphics (CiG) for Vulkan without onnxruntime::nv::provider_option_names::kExternalComputeQueueDataParamNV_data";
return nullptr;
}
uint64_t nv_blob_ptr = std::stoull(nv_blob_ptr_str);
Contributor


⚠️ std::stoull can throw on malformed input (High-Priority): std::stoull(nv_blob_ptr_str) will throw std::invalid_argument or std::out_of_range if the string is not a valid unsigned integer. This is inside a noexcept-equivalent C API implementation — an uncaught exception will call std::terminate. Wrap in a try-catch or use a safe parser.

uint64_t nv_blob_ptr = 0;
try {
  nv_blob_ptr = std::stoull(nv_blob_ptr_str);
} catch (...) {
  return onnxruntime::CreateStatus(ORT_INVALID_ARGUMENT,
      "[NvTensorRTRTX EP] Invalid value for kExternalComputeQueueDataParamNV_data");
}

Contributor


The fix can be in another PR.

@tianleiwu tianleiwu enabled auto-merge (squash) April 7, 2026 00:12
@tianleiwu tianleiwu merged commit c37c8b8 into microsoft:main Apr 7, 2026
110 of 115 checks passed
tianleiwu added a commit that referenced this pull request Apr 7, 2026
- nv_vulkan_test.cc: replace memcpy of VkPhysicalDeviceProperties2 into
  VkPhysicalDeviceProperties (buffer overflow) with direct field assignment
  dev.props = props.properties; also clear id_props.pNext after copy to
  avoid a dangling pointer to stack-local pci_props

- nv_provider_factory.cc: wrap std::stoull() in a try/catch to prevent
  std::invalid_argument/std::out_of_range from propagating through the
  noexcept C API boundary and calling std::terminate on malformed input

6 participants