SparseML producing sparse int8-quantized models slower than originals on AVX512-VNNI CPU #733

Closed
clementpoiret opened this issue Apr 27, 2022 · 3 comments
@clementpoiret

Describe the bug

I am developing a tool using models that are sparse (85% sparsity) and quantized with QAT. The models produced are 1.5 to 2x slower than the original non-sparse float32 models.

Sparse QAT model: https://zenodo.org/record/6489202/files/arunet_3.0.0_85sparse_qat_single.onnx?download=1
Original model: https://zenodo.org/record/6457484/files/arunet_3.0.0_single.onnx?download=1

The model takes as input a tensor of shape [batch, 1, x, y, z].

Expected behavior

I use a CPU supporting AVX512-VNNI instructions, so the sparse quantized models should run faster than the original ones.

Environment

  1. OS: Ubuntu 18.04
  2. Python version: 3.8
  3. SparseML version or commit hash: 0.12
  4. ML framework version(s): torch 1.9.1
  5. Other Python package versions: deepsparse 0.12
  6. Other relevant environment information: Kernel Linux 5.4.0-105-generic; CPU Intel(R) Xeon(R) Gold 6226R @ 2.90GHz

From deepsparse:

GenuineIntel CPU detected with 32 cores. (2 sockets with 16 cores each)
DeepSparse FP32 model performance supported: True.
DeepSparse INT8 (quantized) model performance supported: True.

Additional CPU info: {'vendor': 'GenuineIntel', 'isa': 'avx512', 'vnni': True, 'num_sockets': 2, 'available_sockets': 2, 'cores_per_socket': 16, 'available_cores_per_socket': 16, 'threads_per_core': 2, 'available_threads_per_core': 2, 'L1_instruction_cache_size': 32768, 'L1_data_cache_size': 32768, 'L2_cache_size': 1048576, 'L3_cache_size': 23068672}

To Reproduce
Exact steps to reproduce the behavior:

Load the model and pass a volume through it to obtain a segmentation (a minimal sketch follows).
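
A minimal sketch of that step using the DeepSparse Python API (the file name and the 32x32x32 input size are placeholders for illustration; real volumes are cropped T2w images):

```python
# Minimal sketch: compile the sparse INT8 model with DeepSparse and run a
# random volume through it (file name and 32x32x32 size are placeholders).
import numpy as np
from deepsparse import compile_model

engine = compile_model("arunet_3.0.0_85sparse_qat_single.onnx", batch_size=1)

# Input of shape [batch, 1, x, y, z]
volume = np.random.rand(1, 1, 32, 32, 32).astype(np.float32)
segmentation = engine.run([volume])[0]
print(segmentation.shape)
```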


@clementpoiret clementpoiret added the bug Something isn't working label Apr 27, 2022
@mgoin (Member) commented Apr 29, 2022

Hi @clementpoiret , thanks for reaching out and sharing your model to help us debug.

The most pertinent issue here is that the DeepSparse engine doesn't have optimized support for ConvTranspose operations, which seem to be taking the majority of the time in these models (e.g. roughly half the time in the dense FP32 model, about on par with all the regular Conv operations combined). These are used for the upsampling operations. We are currently working on optimized sparsity support for ConvTranspose based on previous models we've tested for segmentation and super resolution, like UNet and ESRGAN.
This model also uncovered a bug in our Reduce operation, so we will have a fix for that in next week's nightly.

It seems that SparseML by default isn't pruning those operations' weights, so that will need to be addressed once the engine has support. We haven't been able to find a quantized version of ConvTranspose, so it might be difficult to quantize.

Because the engine doesn't have great support for all operations in your model, it is not performing as we'd like, and unfortunately the quantized graph just magnifies this issue. You could try using FP32 sparsity alone to accelerate your model.

If you could share an example input/output to help us evaluate what we could help with now, that would be great. I've been running random data through it with shape [batch, 1, 32, 32, 32]. Otherwise, please bear with us while we work on properly supporting ConvTranspose and other operations. Thank you!
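
For reference, timing both models on random data could look roughly like this (a sketch only; the file names, iteration count, and 32x32x32 shape are assumptions):

```python
# Rough timing comparison of the dense FP32 and sparse INT8 models on random
# data of shape [1, 1, 32, 32, 32] (file names, iteration count, and input
# shape are assumptions for illustration).
import time

import numpy as np
from deepsparse import compile_model

volume = np.random.rand(1, 1, 32, 32, 32).astype(np.float32)

for path in ("arunet_3.0.0_single.onnx", "arunet_3.0.0_85sparse_qat_single.onnx"):
    engine = compile_model(path, batch_size=1)
    engine.run([volume])  # warmup
    start = time.perf_counter()
    for _ in range(20):
        engine.run([volume])
    avg_ms = (time.perf_counter() - start) / 20 * 1000
    print(f"{path}: {avg_ms:.1f} ms per inference")
```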

@clementpoiret (Author)

Dear @mgoin,

Thanks for your feedback. I'm happy it helped uncover some issues, and also that the slowdown doesn't come from a problem in my implementation, haha :)
Please find attached an example of what is passed through the network, alongside its expected output.
T2w_hires_right_hippocampus_seg_crop.nii.gz
T2w_hires_right_hippocampus.nii.gz

I use the torchio library to load the data, as in the attached toy example: https://gist.github.com/clementpoiret/b9e00327931af9e6d9b30938b57a334c
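
As a rough sketch, loading the attached volume and passing it to the engine looks something like this (the actual preprocessing lives in the gist and may differ):

```python
# Rough sketch of the loading step with torchio (the full preprocessing is in
# the linked gist and may differ from this).
import numpy as np
import torchio as tio
from deepsparse import compile_model

image = tio.ScalarImage("T2w_hires_right_hippocampus.nii.gz")
volume = image.data.numpy().astype(np.float32)   # shape [1, x, y, z]
volume = volume[np.newaxis, ...]                 # add batch dim -> [1, 1, x, y, z]

engine = compile_model("arunet_3.0.0_85sparse_qat_single.onnx", batch_size=1)
segmentation = engine.run([volume])[0]
```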

I hope it'll help you!

@mgoin (Member) commented May 4, 2022

@clementpoiret thanks for sharing that example; we are using it in a test to verify these issues won't happen again.

For your performance concerns, I was able to see a small benefit from sparsity on the Conv operations for the FP32 model, so I would recommend that route if you'd like to do something right now.
As mentioned before, I don't believe there will be a substantial improvement until we properly support pruning transposed convolutions and the ConvTranspose ONNX operation. This is something we will support, but not immediately, since we want to do a good job; it will take at least a few months.

Wishing you the best of luck, and feel free to reach out with further questions.
