Draft: [TensorRT EP] Enable compiling TRT EP without CUDA EP#18731
gedoensmax wants to merge 1 commit into
|
@RyanUnderhill, can you advise? This is a prototype that evaluates whether we can optionally build the TRT EP without the CUDA EP. As you know, the TRT EP currently depends on the CUDA EP's allocator, CUDA graph and stream support. |
It could work, but is this the right approach? Is the goal just to build a minimal CUDA provider that will allow TRT to work? It seems like it might be a simple change to not register any of the kernels and have the linker drop all of the associated code to get a minimal CUDA provider. I noticed that this is also suggested in issue #18542. This change is rather large. Is there a better description of how it is trying to do what it says? |
|
@RyanUnderhill Thanks for your input. My reasoning for the approach I took was to make the separation cleaner. In my opinion it does not make a whole lot of sense to register the CUDA EP without any ops. My approach is to move CUDA graph, stream and memory allocation into an external cuda_common lib that can be loaded by both the TRT and CUDA EPs. The TRT and CUDA EPs are then just kernel libs that rely on the same base library for all code interfacing with the CUDA device. Most of the code changes come from moving a few headers, which results in a lot of files being touched. For reference: cuDNN (infer) 8.9 = 1.1 GB (it can be split into more libs, but I am unsure what is needed). Disclaimer: I know that as of now TRT still uses some of the cuDNN and cuBLAS libraries, but let's live under the assumption that this will change in the future and that we support the nvinfer lean runtime interface here. |
The reason for the large number of changes is just that headers are now included from a shared location. |
|
The quickest way to figure out your DLL loading issue is to download this tool: https://en.wikipedia.org/wiki/Dependency_Walker (sadly, Windows doesn't say which dependency is missing without running special logging tools). So it works if you build the cuda common, the cuda provider and the tensorrt provider, but not if you build only the cuda common and the tensorrt provider? |
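As a quick first check before reaching for a dependency tool, the load can be reproduced outside of ORT. This is only a minimal sketch, assuming Python is available on the machine: `try_load` is a hypothetical helper, and the default file name below is an assumption, so pass the path of the provider library you actually built. On Linux the loader's error text usually names the shared object that could not be found, while on Windows it typically only reports that the module "or one of its dependencies" is missing, which is why a dependency tool is still needed there.

```python
import ctypes
import sys


def try_load(library_path):
    """Try to load a shared library; return None on success, else the loader's error text."""
    try:
        ctypes.CDLL(library_path)
    except OSError as exc:
        # Raised when the file itself or one of its dependencies cannot be loaded.
        return str(exc)
    return None


if __name__ == "__main__":
    # Assumed default name for illustration only; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "onnxruntime_providers_tensorrt.dll"
    error = try_load(path)
    if error is None:
        print(f"loaded {path}")
    else:
        print(f"failed to load {path}: {error}")
```

Running it once against the TRT provider library from the build without the CUDA EP, and once against the build with it, should show whether the failure is a plain missing dependency or something specific to how ORT loads the provider.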
|
@jywu-msft Since this adds a new DLL dependency when building CUDA, would this break anyone's existing packaging workflows? They would now need to copy an extra DLL for things to work. Another random thought: if this CUDA core is small and has no state, we could statically link it into each provider that uses CUDA. |
|
Yes, right. To be honest, I'll just move to my Linux box to debug this. I'll try to do that tomorrow. If this sounds interesting, I am happy to put more work in and try to arrive at a reasonable library split. Static linking should not be a problem! |
43f74b9 to baa31e6
baa31e6 to d97af5b
|
As you suspected before, I ran into more and more problems disentangling the CUDA and TRT libraries. I still believe it would be the "nicer" option to have a separate CUDA management library that is shared between TRT and CUDA, but I fear that I will not get the time to make this a reality in the near future, and I see this as an important step towards being able to ship ORT TRT without a dependency on cuDNN and cuBLAS, which are very large libraries. |
…19052) Addresses #18542. I followed the advice given by @RyanUnderhill [here](#18731 (comment)) and went with a minimal CUDA EP for now.

Description
Addresses #18542.
This would drastically reduce binary size and enable shipping TRT in a much leaner way, without the additional dependency on cuDNN and cuBLAS. The PR is very much a work in progress and has not cleanly dropped cuDNN and cuBLAS for TRT yet.
@jywu-msft is there someone who could help take a look at this? I am able to compile, but I am still facing issues when loading the TRT EP at runtime.
This only happens if I run the perf test without the CUDA EP compiled. Maybe someone with much more experience with dynamic loading in ORT can easily spot the error.