
[Performance] VRAM usage difference between TRT-EP and native TRT #20457

Open
omerferhatt opened this issue Apr 25, 2024 · 1 comment
Assignees
Labels
ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider performance issues related to performance regressions

Comments


omerferhatt commented Apr 25, 2024

Describe the issue

I run a simple CNN model with:

  • ONNXRuntime + TRT EP, and
  • native TensorRT.

There is no problem in terms of correctness or latency. When I examine the VRAM usage of the two approaches, both use the same amount of VRAM (approx. 420-440 MB) up to a point, which I assume is the engine build phase. Native TensorRT (most likely) performs a clean-up after the engine build completes and reduces VRAM usage to 130-140 MB; if the engine cache is used afterwards, it never reaches the 420-440 MB band again. The problem starts here: ONNXRuntime + TRT EP does not reduce its VRAM usage in the same way, and stays at around 420-440 MB for the whole execution.
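For context, here is roughly how VRAM can be sampled over time while the model runs; a minimal sketch, assuming the nvidia-ml-py (pynvml) package and GPU index 0:

```python
# Sketch: poll GPU memory usage once per second (assumes nvidia-ml-py / pynvml is installed).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 is an assumption

for _ in range(60):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"used: {info.used / 1024**2:.0f} MiB")  # info.used is reported in bytes
    time.sleep(1)

pynvml.nvmlShutdown()
```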

To reproduce

I run the ONNX model file with ONNXRuntime + TRT EP via the Python API, and the same model with native TensorRT via trtexec, as sketched below.
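A minimal sketch of the two runs (the model path, input shape, and CUDA fallback provider are placeholders/assumptions):

```python
# Sketch: run the model with ONNXRuntime + TRT EP ("model.onnx" is a placeholder path).
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
out = sess.run(None, {sess.get_inputs()[0].name: x})

# Native TensorRT comparison, run from a shell:
#   trtexec --onnx=model.onnx
```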

Urgency

I don't know if this is a bug or expected behavior. If you can at least help me understand this, I can move forward with my work.

Platform

Linux

OS Version

Ubuntu 24.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 12.2

Model File

No response

Is this a quantized model?

No

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Apr 25, 2024
@jywu-msft jywu-msft removed the ep:CUDA issues related to the CUDA execution provider label Apr 25, 2024
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Apr 25, 2024
@omerferhatt omerferhatt changed the title [Performance] GPU VRAM usage difference between TRT-EP and native TRT [Performance] VRAM usage difference between TRT-EP and native TRT Apr 25, 2024
@sophies927 sophies927 added performance issues related to performance regressions and removed ep:CUDA issues related to the CUDA execution provider labels Apr 25, 2024
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Apr 26, 2024
chilo-ms (Contributor) commented Apr 30, 2024

Hi, sorry for the late reply.
What you observed about TRT EP's memory footprint is pretty much expected.

I used a ResNet50 model to run inference with TRT EP and with trtexec to compare.
The peak memory usage happens during the engine build, when the serialized engine is generated and kept in memory.

trtexec:
Once the build is finished, trtexec calls bEnv.reset() to reset the parser, network, and builder instances; most of the ~80MB reclaimed here comes from the parser instance.
trtexec also calls iEnv.engine.releaseBlob() to release the serialized blob, freeing another ~97MB.

TRT EP:
During TRT EP setup - it requires ~90MB of memory for the parser and releases that memory once parsing is finished.
During TRT EP execution - the engine is built at this point because the model has a dynamic-shape input. The serialized engine is kept alive across TRT EP's compute function, which is why you saw memory usage stay almost the same during execution. Once TRT EP execution finishes, it releases the ~100MB serialized engine.
(I think TRT EP could release the serialized engine as soon as it deserializes the engine, to free that memory earlier.)
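One practical mitigation on the user side is to enable TRT EP's engine cache, so the build (and its peak memory) is only paid the first time a model/profile is seen; a sketch, using ONNX Runtime's documented TensorRT provider options (the cache path is a placeholder):

```python
# Sketch: enable TRT EP engine caching so subsequent sessions skip the build phase.
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,         # persist the built engine to disk
        "trt_engine_cache_path": "./trt_cache",  # placeholder cache directory
    }),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # "model.onnx" is a placeholder
```

Note that this does not change the in-process retention described above; it only avoids repeating the build peak across runs, similar to the engine-cache behavior the reporter observed with native TensorRT.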
