DirectML 1.10 is 2x slower than CUDA for onnxruntime on NVIDIA Quadro P5000

I am using onnxruntime to run a float16 Stable Diffusion ONNX model. I get about 1.5 seconds per cycle with DirectML and about 0.7 seconds per cycle when I switch to CUDA.

My GPU is an NVIDIA Quadro P5000, which is roughly equivalent to a GTX 1080.

Is this expected? I am using the latest DirectML.dll (1.10) from the NuGet package.
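
For anyone who wants to reproduce the comparison, here is a minimal Python sketch of how per-cycle timings like the ones above could be measured. It is not the exact code behind my numbers. The helper names `seconds_per_cycle` and `make_session`, the model path, and the `feeds` dictionary are placeholders for illustration; the provider names `DmlExecutionProvider` and `CUDAExecutionProvider` are the ones onnxruntime uses for DirectML and CUDA.

```python
import time


def seconds_per_cycle(run_once, warmup=2, iters=5):
    """Return the average wall-clock seconds per call of run_once.

    A few warmup calls run first so that one-time costs (graph
    optimization, memory allocation) do not skew the average.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters


def make_session(model_path, provider):
    """Create an onnxruntime session pinned to a single execution provider."""
    # Imported lazily so the timing helper works without onnxruntime installed.
    import onnxruntime as ort

    # provider is e.g. "DmlExecutionProvider" or "CUDAExecutionProvider".
    return ort.InferenceSession(model_path, providers=[provider])


# Hypothetical usage (model path and feeds are placeholders):
# sess = make_session("unet/model.onnx", "DmlExecutionProvider")
# print(seconds_per_cycle(lambda: sess.run(None, feeds)))
```

Running the same model and inputs once with each provider keeps everything else equal, so the difference in the averages should come from the execution provider.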