
[Performance] VRAM usage difference between TRT-EP and native TRT #20457

Open
omerferhatt opened this issue Apr 25, 2024 · 1 comment
Assignees
Labels
ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider performance issues related to performance regressions

Comments


omerferhatt commented Apr 25, 2024

Describe the issue

I run a simple CNN model with:

  • ONNXRuntime + TRT EP, and
  • native TensorRT.

There is no problem in terms of correctness or latency. When I examine the VRAM usage of the two approaches, both use the same amount of VRAM (approx. 420-440 MB) up to a point, which I assume is the engine build phase. Native TensorRT (most likely) performs a clean-up after the engine build completes and reduces VRAM usage to 130-140 MB; if the engine cache is used afterwards, it never reaches the 420-440 MB band again. The problem starts here: ONNXRuntime + TRT EP does not reduce its VRAM usage in the same way, and stays at around 420-440 MB for the whole execution.
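For context, here is roughly how VRAM can be sampled over time while the model runs; a minimal sketch, assuming the nvidia-ml-py (pynvml) package and GPU index 0:

```python
# Sketch: poll GPU memory usage once per second (assumes nvidia-ml-py / pynvml is installed).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 is an assumption

for _ in range(60):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"used: {info.used / 1024**2:.0f} MiB")  # info.used is reported in bytes
    time.sleep(1)

pynvml.nvmlShutdown()
```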

To reproduce

I run the ONNX model file with ONNXRuntime + TRT EP via the Python API, and the same model with native TensorRT via trtexec, as sketched below.
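A minimal sketch of the two runs (the model path, input shape, and CUDA fallback provider are placeholders/assumptions):

```python
# Sketch: run the model with ONNXRuntime + TRT EP ("model.onnx" is a placeholder path).
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
out = sess.run(None, {sess.get_inputs()[0].name: x})

# Native TensorRT comparison, run from a shell:
#   trtexec --onnx=model.onnx
```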

Urgency

I don't know if this is a bug or expected behavior. If you can at least help me understand this, I can move forward with my work.

Platform

Linux

OS Version

Ubuntu 24.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 12.2

Model File

No response

Is this a quantized model?

No

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Apr 25, 2024
@jywu-msft jywu-msft removed the ep:CUDA issues related to the CUDA execution provider label Apr 25, 2024
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Apr 25, 2024
@omerferhatt omerferhatt changed the title [Performance] GPU VRAM usage difference between TRT-EP and native TRT [Performance] VRAM usage difference between TRT-EP and native TRT Apr 25, 2024
@sophies927 sophies927 added performance issues related to performance regressions and removed ep:CUDA issues related to the CUDA execution provider labels Apr 25, 2024
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Apr 26, 2024
chilo-ms (Contributor) commented Apr 30, 2024

Hi, sorry for the late reply.
What you observed about TRT EP's memory footprint is pretty much expected.

I used a ResNet50 model to run inference with TRT EP and with trtexec to compare.
The peak memory usage happens during the engine build, when the serialized engine is generated and kept in memory.

trtexec:
Once the build is finished, trtexec calls bEnv.reset() to reset the parser, network, and builder instances; most of the ~80MB reclaimed here comes from the parser instance.
trtexec also calls iEnv.engine.releaseBlob() to release the serialized blob, freeing another ~97MB.

TRT EP:
During TRT EP setup - it requires ~90MB of memory for the parser and releases that memory once parsing is finished.
During TRT EP execution - the engine is built at this point because the model has a dynamic-shape input. The serialized engine is kept alive across TRT EP's compute function, which is why you saw memory usage stay almost the same during execution. Once TRT EP execution finishes, it releases the ~100MB serialized engine.
(I think TRT EP could release the serialized engine as soon as it deserializes the engine, to free that memory earlier.)
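One practical mitigation on the user side is to enable TRT EP's engine cache, so the build (and its peak memory) is only paid the first time a model/profile is seen; a sketch, using ONNX Runtime's documented TensorRT provider options (the cache path is a placeholder):

```python
# Sketch: enable TRT EP engine caching so subsequent sessions skip the build phase.
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,         # persist the built engine to disk
        "trt_engine_cache_path": "./trt_cache",  # placeholder cache directory
    }),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # "model.onnx" is a placeholder
```

Note that this does not change the in-process retention described above; it only avoids repeating the build peak across runs, similar to the engine-cache behavior the reporter observed with native TensorRT.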
