
Draft: [TensorRT EP] Enable compiling TRT EP without CUDA EP#18731

Closed: gedoensmax wants to merge 1 commit into microsoft:main from gedoensmax:trt_only_compile_main


@gedoensmax
Contributor

Description

Addresses #18542.
This would drastically reduce binary size and enable shipping TRT in a much leaner way, without the additional dependency on cuDNN and cuBLAS. The PR is very much a work in progress and has not cleanly dropped cuDNN and cuBLAS for TRT yet.
@jywu-msft is there someone who could help take a look at this? I am able to compile, but I am still facing issues when loading the TRT EP at runtime.

C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1195 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\admin\CLionProjects\onnxruntime\cmake\cmake-build-debug-92-only-trt\onnxruntime_providers_cuda.dll"
Stacktrace:
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc(1345): onnxruntime!onnxruntime::CudaProviderFactoryCreator::Create+0x7C
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc(1765): onnxruntime!OrtApis::SessionOptionsAppendExecutionProvider_CUDA+0x5E
C:\Users\admin\CLionProjects\onnxruntime\include\onnxruntime\core\session\onnxruntime_cxx_inline.h(770): onnxruntime_perf_test!Ort::detail::SessionOptionsImpl<OrtSessionOptions>::AppendExecutionProvider_CUDA+0x3B   
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\ort_test_session.cc(214): onnxruntime_perf_test!onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession+0xBAB
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include\memory(3477): onnxruntime_perf_test!std::make_unique<onnxruntime::perftest::OnnxRuntimeTestSession,Ort::Env &,std::random_device &,onnxruntime::perftest::PerformanceTestConfig const &,TestModelInfo const &,0>+0x6D
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\performance_runner.cc(284): onnxruntime_perf_test!onnxruntime::perftest::CreateSession+0x82
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\performance_runner.cc(298): onnxruntime_perf_test!onnxruntime::perftest::PerformanceRunner::PerformanceRunner+0x137
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\main.cc(45): onnxruntime_perf_test!real_main+0x1BD
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\main.cc(64): onnxruntime_perf_test!wmain+0x54
D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl(91): onnxruntime_perf_test!

This only happens if I run the perf test without the CUDA EP compiled. Maybe someone with more experience with dynamic loading in ORT can easily spot the error.
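
Reading the trace, the failing call is `AppendExecutionProvider_CUDA` from `ort_test_session.cc`, so the perf test appears to append the CUDA EP and therefore asks for `onnxruntime_providers_cuda.dll`, which a TRT-only build does not produce. Below is a minimal sketch of how a caller could guard against that, assuming it knows which providers the build ships. `SelectProviders` is a hypothetical helper, not an ORT API; the provider name strings are the usual ORT identifiers.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of ONNX Runtime): keep only the requested
// execution providers that this build actually ships, preserving the
// caller's priority order.
std::vector<std::string> SelectProviders(
    const std::vector<std::string>& requested,
    const std::vector<std::string>& available) {
  std::vector<std::string> selected;
  for (const auto& name : requested) {
    if (std::find(available.begin(), available.end(), name) != available.end()) {
      selected.push_back(name);
    }
  }
  return selected;
}
```

In a real application the `available` list could come from `Ort::GetAvailableProviders()`; here it is passed in so the sketch stays self-contained. With a request of TensorRT followed by CUDA and a build that only ships TensorRT and CPU, only the TensorRT entry survives, so the CUDA provider library is never requested.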

@jywu-msft
Member

@RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph, and stream support.
This PR pulls those components out into separate common files and includes them in a TRT-EP-only build.
Is this approach viable?

@RyanUnderhill
Contributor

> @RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph, and stream support. This PR pulls those components out into separate common files and includes them in a TRT-EP-only build. Is this approach viable?

It could work, but is this the right approach? Is the goal just to build a minimal CUDA provider that will allow TRT to work? It seems like it might be a simple change to not register any of the kernels and have the linker drop all of the associated code to get a minimal CUDA provider. I noticed that is also suggested in the #18542 issue.
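
A rough sketch of the "minimal CUDA provider" idea, under the assumption that kernel registration can be switched off at build time. `KernelRegistry`, `BuildRegistry`, `ReluKernel`, and the `MINIMAL_CUDA_PROVIDER` macro are all hypothetical stand-ins, not ORT names. In this sketch the kernel is removed by the preprocessor; in a real build the point would be that kernels nobody registers are unreferenced, so the linker could discard them together with their cuDNN and cuBLAS calls.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a provider's kernel registry.
class KernelRegistry {
 public:
  void Register(const std::string& op, std::function<int(int)> fn) {
    kernels_[op] = std::move(fn);
  }
  bool Has(const std::string& op) const { return kernels_.count(op) != 0; }
  std::size_t Size() const { return kernels_.size(); }

 private:
  std::map<std::string, std::function<int(int)>> kernels_;
};

// Hypothetical build switch; 1 means "provider shell only, no kernels".
#ifndef MINIMAL_CUDA_PROVIDER
#define MINIMAL_CUDA_PROVIDER 1
#endif

#if !MINIMAL_CUDA_PROVIDER
// Example kernel; only compiled when the full provider is requested.
int ReluKernel(int x) { return x > 0 ? x : 0; }
#endif

// The provider always exists, but in a minimal build its registry stays empty.
KernelRegistry BuildRegistry() {
  KernelRegistry registry;
#if !MINIMAL_CUDA_PROVIDER
  registry.Register("Relu", ReluKernel);
#endif
  return registry;
}
```

With the switch at its default of 1, `BuildRegistry()` returns an empty registry, while the allocator, stream, and graph support that TRT needs could stay in the provider.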

This change is rather large, is there a better description of how it's trying to do what it says?

@gedoensmax
Contributor Author

gedoensmax commented Dec 11, 2023

@RyanUnderhill Thanks for your input. My reasoning for the approach that I took was to make the separation cleaner. It does not make a whole lot of sense to register the CUDA EP without any ops in my opinion.

My approach is to move CUDA graph, stream, and memory allocation into an external cuda_common lib that can be loaded by both the TRT and CUDA EPs. Then the TRT and CUDA EPs are just kernel libraries that rely on the same base library for all code that interfaces with the CUDA device. Most of the code changes come from moving a few headers, which results in a lot of files being touched.
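
A minimal sketch of that layering, assuming a single shared device layer that both providers borrow. `DeviceContext` and `Provider` are hypothetical names, and host `malloc`/`free` stand in for the CUDA allocation calls so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical "cuda_common" layer: the only place that would talk to the
// device. Host malloc/free stand in for device allocation in this sketch.
class DeviceContext {
 public:
  void* Allocate(std::size_t bytes) {
    ++live_;
    return std::malloc(bytes);
  }
  void Free(void* ptr) {
    std::free(ptr);
    --live_;
  }
  int LiveAllocations() const { return live_; }

 private:
  int live_ = 0;  // number of allocations not yet freed
};

// Hypothetical provider: a kernel library that borrows the shared context
// instead of owning its own allocator or stream handling.
struct Provider {
  std::string name;
  std::shared_ptr<DeviceContext> device;

  void* AllocateScratch(std::size_t bytes) { return device->Allocate(bytes); }
};
```

Both a TRT and a CUDA `Provider` can be constructed from the same `std::shared_ptr<DeviceContext>`, so neither has to link the other just to reach the allocator.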
The reason I separated CUDA and TRT that way is that the CUDA EP relies on cuDNN, cuBLAS, and I think also cuFFT for some of its kernels. This dependency is not only present in the kernels: cuDNN handles etc. are also part of the CUDA EP's class definition. It would be great to split that from the CUDA-only code that interfaces with the device. Here is a drawing of what I imagine:

cuDNN (infer) 8.9 = 1.1 GB (can be split into more libs, but unsure what is needed)
cuBLAS = 100 MB
cuBLASLt = 530 MB
[Image: drawing of the proposed library split]
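
To put the quoted sizes together, here is a small worked sum. It assumes 1.1 GB is read as 1100 MB and that all three libraries could be dropped entirely, which the thread itself treats as a future possibility rather than the current state.

```cpp
// Library sizes quoted above, in megabytes (1.1 GB is taken as 1100 MB).
constexpr int kCudnnInferMb = 1100;
constexpr int kCublasMb = 100;
constexpr int kCublasLtMb = 530;

// Upper bound on what a TRT-only package could drop if none of the three
// libraries were needed at runtime.
constexpr int TotalDroppedMb() {
  return kCudnnInferMb + kCublasMb + kCublasLtMb;
}

static_assert(TotalDroppedMb() == 1730, "1100 + 100 + 530");
```

That is roughly 1.7 GB of dependencies on top of the provider libraries themselves.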

Disclaimer: I know that as of now TRT still uses some of the cuDNN and cuBLAS libraries, but let's assume that this will change in the future and that we support the nvinfer lean runtime interface here.

@jywu-msft
Member

> > @RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph, and stream support. This PR pulls those components out into separate common files and includes them in a TRT-EP-only build. Is this approach viable?
>
> It could work, but is this the right approach? Is the goal just to build a minimal CUDA provider that will allow TRT to work? It seems like it might be a simple change to not register any of the kernels and have the linker drop all of the associated code to get a minimal CUDA provider. I noticed that is also suggested in the #18542 issue.
>
> This change is rather large, is there a better description of how it's trying to do what it says?

The reason for the large number of changes is just the switch to including headers from a shared location,
e.g. from "core/providers/cuda/cuda_common.h" to "core/providers/cuda/common/cuda_common.h".
Maybe we can find a way to avoid this change.
Any idea about the fatal error he is encountering in provider_bridge_ort when trying to test out his POC?
"C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1195 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\admin\CLionProjects\onnxruntime\cmake\cmake-build-debug-92-only-trt\onnxruntime_providers_cuda.dll"
Stacktrace:"

@RyanUnderhill
Contributor

The quickest way to figure out your DLL loading issue is to download this tool: https://en.wikipedia.org/wiki/Dependency_Walker (sadly, Windows doesn't say which dependency is missing unless you run special logging tools).
If you open the DLL that won't load in it, you'll see what other DLLs it depends on. One common problem in the past was not building onnxruntime_providers_shared.dll in the right cases.

You say it works if you build the cuda common, cuda provider and the tensorrt provider, just not if you build the cuda common and tensorrt provider?

@RyanUnderhill
Contributor

@jywu-msft Since this adds a new DLL dependency when building CUDA, would this break anyone's existing packaging workflows? They would now need to copy an extra DLL for it to work.

Another random thought: if this CUDA core is small and has no state, we could statically link it into each provider that uses CUDA.

@gedoensmax
Contributor Author

gedoensmax commented Dec 12, 2023

Yes, right. To be honest, I'll just move to my Linux box to debug this; I'll try to do that tomorrow.
I mostly opened this draft to see whether it would be acceptable to merge and to start splitting the CUDA EP into a core, which is really CUDA only, and the actual kernel library, which links to cuDNN etc. This would also avoid adding another build flag, which again widens the QA effort.

If this sounds interesting, I am happy to put more work in and try to arrive at a reasonable library split. Static linking should not be a problem!

@gedoensmax gedoensmax force-pushed the trt_only_compile_main branch from 43f74b9 to baa31e6 Compare January 8, 2024 13:21
@gedoensmax
Contributor Author

As you suspected, I ran into more and more problems disentangling the CUDA and TRT libraries. I still believe the "nicer" option would be a separate CUDA management library that is shared between TRT and CUDA, but I fear that I will not get the time to make this a reality in the near future. I still see this as an important step toward being able to ship ORT TRT without a dependency on cuDNN and cuBLAS, which are very large libraries.

@gedoensmax gedoensmax closed this Jan 8, 2024
tianleiwu pushed a commit that referenced this pull request Jan 17, 2024
…19052)

Addresses #18542.
I followed the advice given by @RyanUnderhill
[here](#18731 (comment))
and went with a minimal CUDA EP for now.
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Aug 19, 2025
…19052)

Addresses microsoft/onnxruntime#18542.
I followed the advice given by @RyanUnderhill
[here](microsoft/onnxruntime#18731 (comment))
and went with a minimal CUDA EP for now.
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Sep 15, 2025
…19052)

Addresses microsoft/onnxruntime#18542.
I followed the advice given by @RyanUnderhill
[here](microsoft/onnxruntime#18731 (comment))
and went with a minimal CUDA EP for now.