SparseML producing sparse int8-quantized models slower than originals on AVX512-VNNI CPU #733

Closed
clementpoiret opened this issue Apr 27, 2022 · 3 comments
@clementpoiret

Describe the bug

I am developing a tool using models that are sparse (85% sparsity) and quantized with QAT. The models produced are 1.5 to 2x slower than the original non-sparse float32 models.

Sparse QAT model: https://zenodo.org/record/6489202/files/arunet_3.0.0_85sparse_qat_single.onnx?download=1
Original model: https://zenodo.org/record/6457484/files/arunet_3.0.0_single.onnx?download=1

The model takes as input a tensor of shape [batch, 1, x, y, z].

Expected behavior

I use a CPU supporting AVX512-VNNI instructions, so the sparse quantized models should run faster than the original ones.

Environment

  1. OS: Ubuntu 18.04
  2. Python version: 3.8
  3. SparseML version or commit hash: 0.12
  4. ML framework version(s): torch 1.9.1
  5. Other Python package versions: deepsparse 0.12
  6. Other relevant environment information: Kernel Linux 5.4.0-105-generic; CPU Intel(R) Xeon(R) Gold 6226R @ 2.90GHz

From deepsparse:

GenuineIntel CPU detected with 32 cores. (2 sockets with 16 cores each)
DeepSparse FP32 model performance supported: True.
DeepSparse INT8 (quantized) model performance supported: True.

Additional CPU info: {'vendor': 'GenuineIntel', 'isa': 'avx512', 'vnni': True, 'num_sockets': 2, 'available_sockets': 2, 'cores_per_socket': 16, 'available_cores_per_socket': 16, 'threads_per_core': 2, 'available_threads_per_core': 2, 'L1_instruction_cache_size': 32768, 'L1_data_cache_size': 32768, 'L2_cache_size': 1048576, 'L3_cache_size': 23068672}

To Reproduce
Exact steps to reproduce the behavior:

Load the model and pass a volume through it to obtain a segmentation (a minimal sketch follows).
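
A minimal sketch of that step using the DeepSparse Python API (the file name and the 32x32x32 input size are placeholders for illustration; real volumes are cropped T2w images):

```python
# Minimal sketch: compile the sparse INT8 model with DeepSparse and run a
# random volume through it (file name and 32x32x32 size are placeholders).
import numpy as np
from deepsparse import compile_model

engine = compile_model("arunet_3.0.0_85sparse_qat_single.onnx", batch_size=1)

# Input of shape [batch, 1, x, y, z]
volume = np.random.rand(1, 1, 32, 32, 32).astype(np.float32)
segmentation = engine.run([volume])[0]
print(segmentation.shape)
```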


@clementpoiret clementpoiret added the bug Something isn't working label Apr 27, 2022
@mgoin (Member) commented Apr 29, 2022

Hi @clementpoiret , thanks for reaching out and sharing your model to help us debug.

The most pertinent issue here is that the DeepSparse engine doesn't have optimized support for ConvTranspose operations, which seem to be taking the majority of the time in these models (e.g. roughly half the time in the dense FP32 model, about on par with all the regular Conv operations combined). These are used for the upsampling operations. We are currently working on optimized sparsity support for ConvTranspose based on previous models we've tested for segmentation and super resolution, like UNet and ESRGAN.
This model also uncovered a bug in our Reduce operation, so we will have a fix for that in next week's nightly.

It seems that SparseML by default isn't pruning those operations' weights, so that will need to be addressed once the engine has support. We haven't been able to find a quantized version of ConvTranspose, so it might be difficult to quantize.

Because the engine doesn't have great support for all operations in your model, it is not performing as we'd like, and unfortunately the quantized graph just magnifies this issue. You could try using FP32 sparsity alone to accelerate your model.

If you could share an example input/output to help us evaluate what we could help with now, that would be great. I've been running random data through it with shape [batch, 1, 32, 32, 32]. Otherwise, please bear with us while we work on properly supporting ConvTranspose and other operations. Thank you!
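
For reference, timing both models on random data could look roughly like this (a sketch only; the file names, iteration count, and 32x32x32 shape are assumptions):

```python
# Rough timing comparison of the dense FP32 and sparse INT8 models on random
# data of shape [1, 1, 32, 32, 32] (file names, iteration count, and input
# shape are assumptions for illustration).
import time

import numpy as np
from deepsparse import compile_model

volume = np.random.rand(1, 1, 32, 32, 32).astype(np.float32)

for path in ("arunet_3.0.0_single.onnx", "arunet_3.0.0_85sparse_qat_single.onnx"):
    engine = compile_model(path, batch_size=1)
    engine.run([volume])  # warmup
    start = time.perf_counter()
    for _ in range(20):
        engine.run([volume])
    avg_ms = (time.perf_counter() - start) / 20 * 1000
    print(f"{path}: {avg_ms:.1f} ms per inference")
```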

@clementpoiret (Author)

Dear @mgoin,

Thanks for your feedback. I'm happy it helped uncover some issues, and also that the slowdown doesn't come from a problem in my implementation, haha :)
Please find attached an example of what is passed through the network, alongside its expected output.
T2w_hires_right_hippocampus_seg_crop.nii.gz
T2w_hires_right_hippocampus.nii.gz

I use the torchio library to load the data, as in the attached toy example: https://gist.github.com/clementpoiret/b9e00327931af9e6d9b30938b57a334c
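
As a rough sketch, loading the attached volume and passing it to the engine looks something like this (the actual preprocessing lives in the gist and may differ):

```python
# Rough sketch of the loading step with torchio (the full preprocessing is in
# the linked gist and may differ from this).
import numpy as np
import torchio as tio
from deepsparse import compile_model

image = tio.ScalarImage("T2w_hires_right_hippocampus.nii.gz")
volume = image.data.numpy().astype(np.float32)   # shape [1, x, y, z]
volume = volume[np.newaxis, ...]                 # add batch dim -> [1, 1, x, y, z]

engine = compile_model("arunet_3.0.0_85sparse_qat_single.onnx", batch_size=1)
segmentation = engine.run([volume])[0]
```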

I hope it'll help you!

@mgoin (Member) commented May 4, 2022

@clementpoiret thanks for sharing that example; we are using it in a test to verify these issues won't happen again.

For your performance concerns, I was able to see a small benefit from sparsity on the Conv operations for the FP32 model, so I would recommend that route if you'd like to do something right now.
As mentioned before, I don't believe there will be a substantial improvement until we properly support pruning transposed convolutions and the ConvTranspose ONNX operation. This is something we will support, but not immediately, since we want to do a good job; it will take at least a few months.

Wishing you the best of luck, and feel free to reach out with further questions.
