SparseML producing sparse int8-quantized models slower than originals on AVX512-VNNI CPU #733
Comments
Hi @clementpoiret, thanks for reaching out and sharing your model to help us debug. The most pertinent issue here is that the deepsparse engine doesn't have optimized support for ConvTranspose operations, which seem to take the majority of the time in these models (e.g. half the time for the dense fp32 model, roughly on par with all of the regular Conv operations combined). These are used for the upsampling operations.

We are currently working on optimized sparsity support for ConvTranspose based on previous models we've tested for segmentation and super resolution, like UNet and ESRGAN. It seems that SparseML by default isn't pruning those operations' weights, so that will need to be addressed once the engine has support. We haven't been able to find a quantized version of ConvTranspose, so it might also be difficult to quantize.

Because the engine doesn't have great support for all operations in your model, it is not performing as we'd like, and the quantized graph just magnifies this issue unfortunately. You could try using just FP32 sparsity to accelerate your model. If you could share an example input/output to help us evaluate what we could help with now, that would be great; so far I've only been running random data through it.
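A quick way to see how much of the graph consists of ConvTranspose nodes is to walk the exported ONNX model. A minimal sketch using the `onnx` package (the file name is a placeholder for either model linked below):

```python
from collections import Counter

import onnx

# Load the exported model (path is a placeholder).
model = onnx.load("arunet_3.0.0_85sparse_qat_single.onnx")

# Count how often each operator type appears in the graph.
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))
print("ConvTranspose nodes:", op_counts["ConvTranspose"])
```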
Dear @mgoin, thanks for your feedback. I'm glad it helped uncover some issues, and also that the slowdown doesn't come from a problem in my implementation ahah :) I've attached an example made with the library; I hope it'll help you!
@clementpoiret thanks for sharing that example; we are using it in our tests to verify these issues won't happen again. For your performance concerns, I was able to see a small benefit from sparsity on the Conv operations for the FP32 model, so I would recommend that route if you'd like to do something right now. Wishing you the best of luck, and feel free to reach out with further questions.
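To compare the dense and sparse FP32 models on your own hardware, a minimal timing sketch with the deepsparse Python API; the file names match the models linked below, and the input spatial size is an assumption to substitute with your real volume shape:

```python
import time

import numpy as np
from deepsparse import compile_model

# Spatial dimensions are placeholders; use your actual volume shape.
data = [np.random.rand(1, 1, 160, 160, 160).astype(np.float32)]

for path in ["arunet_3.0.0_single.onnx", "arunet_3.0.0_85sparse_qat_single.onnx"]:
    engine = compile_model(path, batch_size=1)
    engine.run(data)  # warm-up run to exclude compilation effects
    start = time.perf_counter()
    for _ in range(10):
        engine.run(data)
    print(f"{path}: {(time.perf_counter() - start) / 10:.3f} s/iter")
```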
Describe the bug
I am developing a tool using models that are sparse (85% sparsity) and quantized with QAT. The produced models are 1.5x to 2x slower than the original non-sparse float32 models.
Sparse QAT model: https://zenodo.org/record/6489202/files/arunet_3.0.0_85sparse_qat_single.onnx?download=1
Original model: https://zenodo.org/record/6457484/files/arunet_3.0.0_single.onnx?download=1
The model takes as input a tensor of shape [batch, 1, x, y, z].
Expected behavior
I am using a CPU that supports AVX512-VNNI instructions, so the sparse quantized models should run faster than the originals.
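As a sanity check on Linux, the flag can be confirmed from the kernel's CPU info (a generic OS-level check, not a deepsparse API):

```python
# Check /proc/cpuinfo for the AVX512-VNNI flag (Linux only).
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()
print("avx512_vnni supported:", "avx512_vnni" in cpuinfo)
```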
Environment
Include all relevant environment information:
Installed from commit f7245c8
deepsparse version: 0.12
To Reproduce
Exact steps to reproduce the behavior:
Load the model, then pass a volume through it to obtain a segmentation (see the sketch below).
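A minimal reproduction sketch using the deepsparse engine; the spatial size of the random volume is an assumption, so substitute your actual input dimensions:

```python
import numpy as np
from deepsparse import compile_model

# Model from the link above; spatial size is a placeholder.
engine = compile_model("arunet_3.0.0_85sparse_qat_single.onnx", batch_size=1)

volume = np.random.rand(1, 1, 160, 160, 160).astype(np.float32)
segmentation = engine.run([volume])[0]
print(segmentation.shape)
```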
Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.
Additional context
Add any other context about the problem here. Also include any relevant files.