[Performance] GPU Accelerated Image normalization for DirectML #20155
Labels
ep:CUDA (CUDA execution provider) · ep:DML (DirectML execution provider) · platform:windows (Windows platform) · stale (not addressed in a while; categorized by a bot)
Describe the issue
As described in the "To reproduce" section, performance is heavily bottlenecked by having to normalize each frame with NumPy on the CPU.
With ONNX Runtime CUDA + PyTorch CUDA, I can easily move the normalization to the GPU and then run inference directly on PyTorch tensors (see the sketch below).
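The original snippet is not preserved here; a minimal sketch of the kind of pipeline being described, assuming a simple 0–1 scaling and ONNX Runtime's IOBinding API (the model file name and frame shape are placeholders, not taken from the issue), might look like this:

```python
import numpy as np
import onnxruntime as ort
import torch

# Placeholder model path and frame shape; adjust to the actual pipeline.
session = ort.InferenceSession(
    "2x_AnimeJaNai_HD_V3_Compact_583k-fp16.onnx",
    providers=["CUDAExecutionProvider"],
)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # placeholder decoded frame (HWC, uint8)

# Upload once, then normalize on the GPU with PyTorch instead of NumPy on the CPU.
x = torch.from_numpy(frame).cuda()
x = x.permute(2, 0, 1).unsqueeze(0).half() / 255.0  # HWC -> NCHW, scale to 0..1

# Bind the CUDA tensor directly so no host round trip is needed.
binding = session.io_binding()
binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type="cuda",
    device_id=0,
    element_type=np.float16,
    shape=tuple(x.shape),
    buffer_ptr=x.data_ptr(),
)
binding.bind_output(session.get_outputs()[0].name, device_type="cuda", device_id=0)
session.run_with_iobinding(binding)
out = binding.get_outputs()[0]  # output stays on the GPU until explicitly copied
```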
Would there be a workaround that allows the normalization to be moved, in one form or another, to the GPU for faster inference?
For whatever it's worth, I have some performance benchmarks here:
"""
frame= 153 fps=3.0 q=-0.0 Lsize=N/A time=00:00:06.38 bitrate=N/A speed=0.126x
Compact no video encoding, 1080p, onnxruntime directml, fp16, with clamp 0-255
"""
To reproduce
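The original reproduction code is not preserved here. A minimal sketch of the kind of DirectML loop being described, where each frame is normalized with NumPy on the CPU before inference (model file name and frame shape are placeholders), would be:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "2x_AnimeJaNai_HD_V3_Compact_583k-fp16.onnx",
    providers=["DmlExecutionProvider"],
)
input_name = session.get_inputs()[0].name

def upscale(frame_u8: np.ndarray) -> np.ndarray:
    # CPU-side normalization: this per-frame NumPy work is the bottleneck.
    x = frame_u8.astype(np.float16) / 255.0          # uint8 HWC -> float16 0..1
    x = np.transpose(x, (2, 0, 1))[np.newaxis, ...]  # HWC -> NCHW
    y = session.run(None, {input_name: x})[0]
    # De-normalize back to uint8 for the encoder (clamp to 0-255).
    return np.clip(y[0].transpose(1, 2, 0) * 255.0, 0, 255).astype(np.uint8)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)    # placeholder 1080p frame
out = upscale(frame)
```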
Urgency
No; this is used for benchmarking purposes only, to compare against NCNN inference performance.
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response
Model File
https://github.com/NevermindNilas/TAS-Modes-Host/releases/download/main/2x_AnimeJaNai_HD_V3_Compact_583k-fp16.onnx
Is this a quantized model?
Unknown