Is your feature request related to a problem? Please describe.
CPU/Wasm inference is slow on larger models. If a device has a GPU and the model is large, the CPU is not a good choice. The WebGL backend has three problems (see the fallback sketch after this list):
- It is much harder to get it working due to op and data-type support issues. Examples: 1, 2
- It is likely to be significantly slower than WebGPU.
- In my testing it is often unstable: it can throw an error that requires refreshing the tab, break WebGL across all tabs (requiring a browser restart), or even freeze my whole laptop. That said, this may be specific to Ubuntu, or even to my particular OS/hardware combination; I'm not sure. My guess is that in general it comes down to high memory usage.
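For reference, this is the fallback dance the WebGL issues currently force on me. A minimal sketch using the public onnxruntime-web API; the helper name and error handling are placeholders:

```typescript
import * as ort from 'onnxruntime-web';

// Hypothetical helper: try the WebGL backend first, then fall back to Wasm
// when session creation throws (which, for my models, it often does due to
// op/data-type support gaps).
async function createSession(modelUrl: string): Promise<ort.InferenceSession> {
  try {
    return await ort.InferenceSession.create(modelUrl, {
      executionProviders: ['webgl'],
    });
  } catch (e) {
    console.warn('WebGL backend failed, falling back to Wasm:', e);
    return ort.InferenceSession.create(modelUrl, {
      executionProviders: ['wasm'],
    });
  }
}
```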
Describe the solution you'd like
By far the most important requirement is very strong op support, on par with the Wasm backend. Broad op support is what sets ORT Web's Wasm backend apart from tfjs and other web runtimes.
I still bump into ORT Web Wasm op support problems sometimes, but far less often than with other runtimes. I'm guessing this is because the ops can be compiled from their existing C/C++ code rather than having to be rewritten specifically for the web. I hope the WebGPU backend will be able to do something similar, because otherwise I fear that a lack of op support would make it far less useful. I am rarely able to get the WebGL backend working with my models because of op and data-type support issues.
Might it be possible to set up some sort of conversion pipeline to WGSL (perhaps via SPIR-V, using Naga and similar tools)? This is above my pay grade, so I'm just guessing here.
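Naga itself is a Rust library, so I can't sketch the conversion step here, but to illustrate the target: once an op kernel exists as WGSL (the ReLU shader below is hand-written as a stand-in for translator output, and the helper is hypothetical), dispatching it from the browser looks roughly like this under the current WebGPU draft API (TypeScript, assuming @webgpu/types):

```typescript
// A hand-written WGSL ReLU kernel standing in for translator output.
const reluWgsl = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  if (gid.x < arrayLength(&data)) {
    data[gid.x] = max(data[gid.x], 0.0);
  }
}
`;

// Hypothetical helper: upload a tensor, run the kernel in place, read it back.
async function runRelu(input: Float32Array): Promise<Float32Array> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('WebGPU is not available');
  const device = await adapter.requestDevice();

  // Storage buffer for the tensor; it is both read and written by the shader.
  const buffer = device.createBuffer({
    size: input.byteLength,
    usage:
      GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buffer, 0, input);

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: device.createShaderModule({ code: reluWgsl }),
      entryPoint: 'main',
    },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer } }],
  });

  // Separate mappable buffer for reading the result back on the CPU side.
  const readback = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(input.length / 64));
  pass.end();
  encoder.copyBufferToBuffer(buffer, 0, readback, 0, input.byteLength);
  device.queue.submit([encoder.finish()]);

  await readback.mapAsync(GPUMapMode.READ);
  const out = new Float32Array(readback.getMappedRange().slice(0));
  readback.unmap();
  return out;
}
```

The point being: every op needs this kind of hand-written plumbing if the kernels are rewritten for the web, which is exactly the per-op effort a conversion pipeline might avoid.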
Describe alternatives you've considered
It seems there are already people working on this: https://github.com/webonnx/wonnx. Unfortunately, it is not yet stable or op/feature-complete.
Additional context
WebGPU is behind a flag in Chrome and is expected to ship by default this year or early next year. It might be useful to start working on this now, so that the ORT Web team can provide feedback on the WebGPU spec while there is still an opportunity to change it. The earlier the feedback, the easier it is to make spec changes, and those changes could be critical for getting maximum ML performance out of WebGPU.
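For anyone who wants to experiment behind the flag today, detection is cheap. A minimal sketch against the current draft API (the helper name is mine):

```typescript
// Hypothetical detection helper: WebGPU is exposed as navigator.gpu, and
// requestAdapter() resolves to null when no suitable GPU is available.
// In Chrome today this still requires the "Unsafe WebGPU" flag to be enabled.
async function webgpuAvailable(): Promise<boolean> {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}
```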