ValueError: DmlExecutionProvider does not contain a subset of available providers ['CPUExecutionProvider', 'DmlExecutionProvider'] #5617
Any update/feedback on this is greatly appreciated.
What happens if you pass in a list instead of a single value, like this: `sess.set_providers([providers[-1]])`?
Thanks for the prompt response.
If I remove the ...
They are timing labels I set to evaluate the performance of the CPU and GPU backends, to see if DML is indeed working or not; what you see is simply the output of these:

```python
with Benchmark_Block('CPU_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)

print(f'available providers : {providers}')  # ['CPUExecutionProvider', 'DmlExecutionProvider']
sess.set_providers([providers[-1]])

# try and see if the GPU is indeed in the works!
with Benchmark_Block('GPU_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)
```
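`Benchmark_Block` is not shown in the thread; a minimal sketch of such a timing context manager (assuming it just prints the elapsed wall-clock time for the block, which is what the outputs discussed below suggest) might look like:

```python
import time

class Benchmark_Block:
    """Print the wall-clock time spent inside a `with` block, under a given label."""

    def __init__(self, label):
        self.label = label
        self.elapsed_ms = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        print(f'{self.label} took {self.elapsed_ms:.2f} ms')
        return False  # never swallow exceptions from the block
```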
CPU_ONNX: when you don't set any providers, it will use DML + CPU (CPU for ops not supported on DML) by default, as both are included in the build. GPU_ONNX: when you set the provider to DML explicitly, it will still be DML + CPU, because CPU is included by default (not all ops are supported on all backends). So the perf being the same is expected: both runs are DML + CPU. What perf do you see with just CPU? (Do `sess.set_providers(['CPUExecutionProvider'])` — that will be truly CPU-only.)
If I do that, I get around 27-30 ms, which is much faster than the default!

```python
sess.set_providers(['CPUExecutionProvider'])
with Benchmark_Block('CPU_ONNX') as blk:
    ort_outs = sess.run(None, ort_inputs)
```

results in: ...
I would say your model was using the DML EP (GPU) just fine when it ran at ~80 ms; when it ran just on CPU, it took ~30 ms. The DML EP is not comprehensive in its opset support, so it falls back to CPU in cases where some ops can't be executed on DML (GPU). This switching between heterogeneous backends causes data copies (CPU <-> GPU), which introduces overhead that shows up in cheap models (~30 ms on CPU means this model is not very heavy). You can turn on verbose logging (see examples in the docs and in the tests, onnxruntime_python_tests.py) to see how many nodes fall back to CPU during the DML + CPU run. With better DML opset support, the fallback can be avoided. Hope this helps.
Thanks, I'll look into it.
OnnxRuntime can only work off what the DML library can execute. Let me give one toy example. Say there is a simple graph: nodeA -> nodeB -> nodeC, and DML tells OnnxRuntime that it can only execute nodeB. OnnxRuntime will assign nodeA and nodeC to CPU, so the final graph will look like this: nodeA -> copy to GPU -> nodeB -> copy to CPU -> nodeC. Two GPU <-> CPU copy nodes are now added; they may take away all the hardware-acceleration gains of nodeB as well as add more overhead. So in theory, this "hardware accelerated" graph could perform worse than nodeA -> nodeB -> nodeC all running on CPU with no device data copies involved. Keep in mind that the CPU EP is a pretty optimal backend by itself: it is multi-threaded and uses vectorized instructions to provide as much perf on CPU as possible.
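The partitioning described above can be illustrated with a small sketch (pure Python, nothing ONNX-specific; the function and names are made up for illustration): given a linear graph and the set of nodes the GPU backend claims, insert a copy node at every CPU <-> GPU boundary.

```python
def place_nodes(graph, gpu_supported):
    """Assign each node to 'GPU' if the backend supports it, else 'CPU',
    inserting an explicit copy node wherever execution crosses devices."""
    plan = []
    prev_device = 'CPU'  # graph inputs start out in host (CPU) memory
    for node in graph:
        device = 'GPU' if node in gpu_supported else 'CPU'
        if device != prev_device:
            plan.append(f'copy to {device}')
        plan.append(f'{node} on {device}')
        prev_device = device
    if prev_device != 'CPU':
        plan.append('copy to CPU')  # outputs are consumed back on the CPU
    return plan

# The toy example from the comment above: only nodeB runs on the GPU.
plan = place_nodes(['nodeA', 'nodeB', 'nodeC'], gpu_supported={'nodeB'})
print(plan)
# -> ['nodeA on CPU', 'copy to GPU', 'nodeB on GPU', 'copy to CPU', 'nodeC on CPU']
```

Two copy steps appear around the single GPU node; each one is real data movement that the all-CPU plan simply doesn't pay for.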
Here are the logs. ONNX_CPU (`sess.set_providers(['CPUExecutionProvider'])`):

```json
[
  {"cat" : "Session","pid" :8800,"tid" :10988,"dur" :121570,"ts" :7,"ph" : "X","name" :"model_loading_uri","args" : {}},
  {"cat" : "Session","pid" :8800,"tid" :10988,"dur" :759232,"ts" :145953,"ph" : "X","name" :"session_initialization","args" : {}}
]
```

When tried on GPU (DML) ...
There are only a handful of memcopies; are these indeed the culprits?
I tested another, much simpler model (VGG-like) and there are no data copies as far as the logs are concerned, yet the CPU performance is nearly 2x better than the GPU one! Here is the model: simpnet_onnx_10312020.zip, and here is the GPU log: ...
@Coderx7 let us look into this issue a bit more. DirectML operator coverage in onnxruntime is pretty extensive. It supports up to ONNX opset-12 in the v1.5 distribution, and definitely should not be slower than the CPU on most DNN models. |
Any luck on finding out the culprit? |
Update posted on the related Issue #50 in the DirectML repo. |
When trying to set the provider to DmlExecutionProvider, it fails with the following error:
System information
To Reproduce
Run: ...
ONNX model: https://gofile.io/d/GdkIeR
Expected behavior
It should be possible to execute this on GPU using DirectML.