[Performance] VRAM usage difference between TRT-EP and native TRT #20457
Labels
- ep:CUDA (issues related to the CUDA execution provider)
- ep:TensorRT (issues related to the TensorRT execution provider)
- performance (issues related to performance regressions)
Describe the issue
I run a simple CNN model with ONNX Runtime + TRT EP and with native TensorRT (trtexec), respectively.
Both runs are correct and have comparable latency. The difference is in VRAM usage. Both approaches use the same amount of VRAM (approx. 420-440 MB) up to a point, which I assume is the engine build phase. Native TensorRT then appears to perform a clean-up after the engine build completes, reducing VRAM usage to 130-140 MB; if the engine cache is reused afterwards, usage never reaches the 420-440 MB band again. The problem is that ONNX Runtime + TRT EP does not release this memory in the same process: it stays at around 420-440 MB for the rest of the execution.
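For reference, a minimal sketch of how the ONNX Runtime + TRT EP side is set up, with engine caching enabled so that subsequent runs reuse the built engine (mirroring the trtexec cache behavior described above). The model path and cache directory are placeholders; the provider option keys are the documented TRT EP options.

```python
# Hypothetical reproduction sketch: ONNX Runtime with the TensorRT EP and an
# engine cache. "model.onnx" and "./trt_cache" are placeholder paths.
trt_provider_options = {
    "trt_engine_cache_enable": True,     # reuse the built engine across runs
    "trt_engine_cache_path": "./trt_cache",
    "trt_fp16_enable": False,
}

def make_session(model_path: str):
    """Create an InferenceSession preferring the TensorRT EP,
    falling back to CUDA and then CPU if it is unavailable."""
    import onnxruntime as ort  # imported lazily so the sketch stays importable
    return ort.InferenceSession(
        model_path,
        providers=[
            ("TensorrtExecutionProvider", trt_provider_options),
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )
```

Even with the engine cache enabled this way, the observed VRAM usage stays in the 420-440 MB band throughout execution.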
To reproduce
I run the ONNX model file with ONNX Runtime + TRT EP via the Python API, and the same model with native TensorRT via trtexec.
Urgency
I don't know whether this is a bug or expected behavior. If you can help me with this issue, I will be able to move my work forward.
Platform
Linux
OS Version
Ubuntu 24.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 12.2
Model File
No response
Is this a quantized model?
No