Draft: [TensorRT EP] Enable compiling TRT EP without CUDA EP by gedoensmax · Pull Request #18731 · microsoft/onnxruntime

gedoensmax · 2023-12-06T21:16:47Z

Description

Addresses #18542.
This would drastically reduce binary size and enable shipping TRT in a much leaner way, without the additional dependency on cuDNN and cuBLAS. The PR is very much a work in progress and has not cleanly dropped cuDNN and cuBLAS for TRT yet.
@jywu-msft is there someone who could help take a look at this? I am able to compile, but I am still facing issues when loading the TRT EP at runtime.

C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1195 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\admin\CLionProjects\onnxruntime\cmake\cmake-build-debug-92-only-trt\onnxruntime_providers_cuda.dll"
Stacktrace:
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc(1345): onnxruntime!onnxruntime::CudaProviderFactoryCreator::Create+0x7C
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc(1765): onnxruntime!OrtApis::SessionOptionsAppendExecutionProvider_CUDA+0x5E
C:\Users\admin\CLionProjects\onnxruntime\include\onnxruntime\core\session\onnxruntime_cxx_inline.h(770): onnxruntime_perf_test!Ort::detail::SessionOptionsImpl<OrtSessionOptions>::AppendExecutionProvider_CUDA+0x3B   
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\ort_test_session.cc(214): onnxruntime_perf_test!onnxruntime::perftest::OnnxRuntimeTestSession::OnnxRuntimeTestSession+0xBAB
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.37.32822\include\memory(3477): onnxruntime_perf_test!std::make_unique<onnxruntime::perftest::OnnxRuntimeTestSession,Ort::Env &,std::random_device &,onnxruntime::perftest::PerformanceTestConfig const &,TestModelInfo const &,0>+0x6D
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\performance_runner.cc(284): onnxruntime_perf_test!onnxruntime::perftest::CreateSession+0x82
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\performance_runner.cc(298): onnxruntime_perf_test!onnxruntime::perftest::PerformanceRunner::PerformanceRunner+0x137
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\main.cc(45): onnxruntime_perf_test!real_main+0x1BD
C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\test\perftest\main.cc(64): onnxruntime_perf_test!wmain+0x54
D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl(91): onnxruntime_perf_test!

This only happens if I run the perf test without the CUDA EP compiled. Maybe someone with much more experience with the dynamic loading in ORT can easily spot the error.
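For readers triaging a similar report, here is a minimal sketch of pulling the error code and DLL name out of the failure message above. Windows error 126 is `ERROR_MOD_NOT_FOUND` ("The specified module could not be found"), which is also returned when a dependency of the requested DLL is missing. The function name and the lookup table are assumptions for illustration only and are not part of ONNX Runtime.

```python
import ntpath
import re

# Only the code seen in this thread is listed; extend as needed.
WINDOWS_ERRORS = {
    126: "ERROR_MOD_NOT_FOUND: the specified module could not be found",
}

# Matches the message format shown in the error above.
_PATTERN = re.compile(
    r'LoadLibrary failed with error (\d+) "[^"]*" when trying to load "([^"]+)"'
)


def parse_load_failure(message):
    """Hypothetical helper: return error code, meaning and DLL name, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return {
        "code": code,
        "meaning": WINDOWS_ERRORS.get(code, "unknown"),
        # ntpath handles backslash-separated paths on any platform.
        "dll": ntpath.basename(match.group(2)),
    }


message = (
    r'[ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" '
    r'when trying to load "C:\Users\admin\CLionProjects\onnxruntime\cmake'
    r'\cmake-build-debug-92-only-trt\onnxruntime_providers_cuda.dll"'
)
info = parse_load_failure(message)
print(info)
```

The trace above also shows the load being requested through `SessionOptionsAppendExecutionProvider_CUDA`, so it may be worth checking which execution provider the perf test was asked to use.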

jywu-msft · 2023-12-07T22:43:01Z

@RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph and stream support.
This PR pulls those components out into separate common files and includes them in a TRT EP only build.
Is this approach viable?

RyanUnderhill · 2023-12-09T06:22:26Z

> @RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph and stream support. This PR pulls those components out into separate common files and includes them in a TRT EP only build. Is this approach viable?

It could work, but is this the right approach? Is the goal just to build a minimal CUDA provider that will allow TRT to work? It seems like it might be a simple change to not register any of the kernels and have the linker drop all of the associated code to get a minimal CUDA provider. I noticed that is also suggested in the #18542 issue.

This change is rather large, is there a better description of how it's trying to do what it says?

gedoensmax · 2023-12-11T10:13:49Z

@RyanUnderhill Thanks for your input. My reasoning for the approach that I took was to make the separation cleaner. It does not make a whole lot of sense to register the CUDA EP without any ops in my opinion.

My approach is to move CUDA graph, stream and memory allocation into an external cuda_common lib that can be loaded by both the TRT and CUDA EPs. Then the TRT and CUDA EPs are each just a kernel lib, but rely on the same base library for all code that interfaces with the CUDA device. Most of the code changes come from moving a few headers, which results in a lot of files being touched.
The reason I separated CUDA and TRT that way is that the CUDA EP relies on cuDNN, cuBLAS and, I think, also cuFFT for some of its kernels. This dependency is not only present in the kernels: cuDNN handles etc. are also part of the CUDA EP's class definition. It would be great to split that from the CUDA-only code that interfaces with the device. Here is a drawing of what I imagine:

cuDNN (infer) 8.9 = 1.1 GB (can be split into more libs, but unsure what is needed)
cuBLAS = 100 MB
cuBLASLt = 530 MB

Disclaimer: I know that as of now TRT still uses some of the cuDNN and cuBLAS libraries, but let's live under the assumption that this will change in the future and we support the nvinfer lean runtime interface here.
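To make the proposed split concrete, here is a rough sketch of the library layout described above as a dependency graph, using the sizes from the list in this comment. The library names in the graph are illustrative labels rather than real target names, 1 GB is treated as 1000 MB, and the graph follows the disclaimer's assumption that TRT no longer needs cuDNN or cuBLAS.

```python
# Sizes in MB, taken from the list above (1.1 GB treated as 1100 MB).
SIZES_MB = {"cudnn_infer": 1100, "cublas": 100, "cublaslt": 530}

# Hypothetical dependency graph for the proposed split.
DEPENDS_ON = {
    "cuda_common": [],  # CUDA graph, stream and allocator code only
    "cuda_ep": ["cuda_common", "cudnn_infer", "cublas", "cublaslt"],
    "trt_ep": ["cuda_common"],  # assumes TRT itself no longer needs cuDNN/cuBLAS
}


def transitive_deps(lib, graph):
    """Collect every library reachable from `lib` in the dependency graph."""
    seen = set()
    stack = list(graph.get(lib, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def heavy_payload_mb(lib):
    """Sum the sizes of the large NVIDIA libraries pulled in by `lib`."""
    return sum(SIZES_MB.get(dep, 0) for dep in transitive_deps(lib, DEPENDS_ON))


cuda_ep_mb = heavy_payload_mb("cuda_ep")
trt_ep_mb = heavy_payload_mb("trt_ep")
print(cuda_ep_mb)  # 1730
print(trt_ep_mb)   # 0
```

Under these assumptions, a TRT-only package would avoid roughly 1.7 GB of libraries that a package including the CUDA EP has to ship.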

jywu-msft · 2023-12-11T16:48:57Z

> > @RyanUnderhill, can you advise? This is a prototype evaluating whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph and stream support. This PR pulls those components out into separate common files and includes them in a TRT EP only build. Is this approach viable?
>
> It could work, but is this the right approach? Is the goal just to build a minimal CUDA provider that will allow TRT to work? It seems like it might be a simple change to not register any of the kernels and have the linker drop all of the associated code to get a minimal CUDA provider. I noticed that is also suggested in the #18542 issue.
>
> This change is rather large, is there a better description of how it's trying to do what it says?

The reason for the large number of changes is just a change to include headers from a shared location,
e.g. from "core/providers/cuda/cuda_common.h" to "core/providers/cuda/common/cuda_common.h".
Maybe we can find a way to avoid this change.
Any idea about the fatal error he is encountering in provider_bridge_ort when trying to test out his POC?
"C:\Users\admin\CLionProjects\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1195 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\admin\CLionProjects\onnxruntime\cmake\cmake-build-debug-92-only-trt\onnxruntime_providers_cuda.dll"
Stacktrace:"

RyanUnderhill · 2023-12-12T00:03:01Z

The quickest way to figure out your DLL loading issue is to download this tool: https://en.wikipedia.org/wiki/Dependency_Walker (sadly, Windows doesn't say which dependency is missing without running special logging tools).
If you open the DLL that won't load in it, you'll see what other DLLs it depends on. One common problem in the past was not building onnxruntime_providers_shared.dll in the right cases.

You say it works if you build the cuda common, cuda provider and the tensorrt provider, just not if you build the cuda common and tensorrt provider?
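Once a tool such as Dependency Walker has listed the DLLs a provider imports, the check for missing ones can be sketched as below. This is only an approximation: it compares file names against the directories you pass in, whereas the real Windows DLL search order also covers system directories and `PATH`. The function name and the imported-DLL list are hypothetical.

```python
import os
import tempfile


def find_missing(dependencies, search_dirs):
    """Return the dependency names not present in any of the search directories."""
    present = set()
    for directory in search_dirs:
        if os.path.isdir(directory):
            # Windows file names are case-insensitive, so compare lowercased.
            present.update(entry.lower() for entry in os.listdir(directory))
    return [name for name in dependencies if name.lower() not in present]


# Demo with a throwaway build directory that lacks the shared provider DLL.
with tempfile.TemporaryDirectory() as build_dir:
    for name in ("onnxruntime.dll", "onnxruntime_providers_tensorrt.dll"):
        open(os.path.join(build_dir, name), "wb").close()

    # Hypothetical import list, as a dependency viewer might report it.
    reported_imports = ["onnxruntime_providers_shared.dll", "ONNXRUNTIME.dll"]
    missing = find_missing(reported_imports, [build_dir])

print(missing)
```

In this demo only `onnxruntime_providers_shared.dll` is reported as missing, which mirrors the common problem mentioned above.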

RyanUnderhill · 2023-12-12T00:06:38Z

@jywu-msft Since this adds a new DLL dependency when building CUDA, would this break anyone's existing packaging workflows? They would now need to copy an extra DLL for it to work.

Another random thought: if this CUDA core is small and has no state, we could statically link it into each provider that uses CUDA.

gedoensmax · 2023-12-12T00:09:48Z

Yes, right. To be honest, I'll just move to my Linux box to debug this. I'll try to do that tomorrow.
I mostly opened this draft to see whether this would be acceptable to merge, and to start splitting the CUDA EP into a core which is really CUDA only and the actual kernel library which links to cuDNN etc. This would also avoid adding another build flag, which would again widen the QA effort.

If this sounds interesting, I am happy to put more work in and try to arrive at a reasonable library split. Static linking should not be a problem!

gedoensmax · 2024-01-08T23:06:25Z

As you suspected, I ran into more and more problems disentangling the CUDA and TRT libraries. I still believe it would be the "nicer" option to have a separate CUDA management library that is shared between TRT and CUDA, but I fear that I will not get the time to make this a reality in the near future. I see this as an important step toward being able to ship ORT TRT without a dependency on cuDNN and cuBLAS, which are very large libraries.

gedoensmax force-pushed the trt_only_compile_main branch from 43f74b9 to baa31e6 Compare January 8, 2024 13:21

extract common cuda functions to separate lib

d97af5b

gedoensmax force-pushed the trt_only_compile_main branch from baa31e6 to d97af5b Compare January 8, 2024 20:27

gedoensmax mentioned this pull request Jan 8, 2024

[TensorRT EP] Enable a minimal CUDA EP compilation without kernels #19052

Merged

gedoensmax closed this Jan 8, 2024

tianleiwu pushed a commit that referenced this pull request Jan 17, 2024

[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#…

bc219ed

…19052) Addresses #18542. I followed the advice given by @RyanUnderhill [here](#18731 (comment)) and went with a minimal CUDA EP for now.
