I am using ONNX Runtime (onnxruntime) to run a float16 Stable Diffusion ONNX model. I am getting about 1.5 seconds per cycle with DirectML and about 0.7 seconds per cycle when I switch to the CUDA runtime.
I have the equivalent of an Nvidia GTX 1080 (it is actually a Quadro P5000).
Is this as expected? I am using the latest DirectML.dll (1.10) from the NuGet package.
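For reference, here is a minimal sketch of how a per-cycle comparison like this could be timed from the Python API. It is not my exact setup: the model path `unet.onnx` and the `feeds` dictionary are placeholders, and the helper names `time_per_cycle` and `make_session_runner` are made up for illustration. It assumes an onnxruntime build that includes the execution provider being tested. A few warm-up runs are discarded so that one-time initialization does not skew the numbers.

```python
import statistics
import time


def time_per_cycle(run_once, warmup=3, iters=10):
    """Return the median wall-clock seconds per call of run_once()."""
    # Warm-up calls are not timed; the first runs often include one-time setup.
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def make_session_runner(model_path, provider, feeds):
    """Build a zero-argument callable that runs the model once on `provider`.

    `model_path` and `feeds` are placeholders supplied by the caller.
    """
    # Imported here so the timing helper above works without onnxruntime.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=[provider])
    # session.run returns the outputs as host arrays, so the call blocks
    # until the inference has finished.
    return lambda: session.run(None, feeds)


# Hypothetical usage (model path and input feeds are assumptions):
# for provider in ("DmlExecutionProvider", "CUDAExecutionProvider"):
#     runner = make_session_runner("unet.onnx", provider, feeds)
#     print(provider, time_per_cycle(runner))
```

Using the median rather than the mean keeps a single slow run from dominating the result.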