INT8 Performance without Pruning #277

Closed
ajithAI opened this issue Jun 10, 2021 · 4 comments


ajithAI commented Jun 10, 2021

It would be really nice to have a comparison of INT8 performance without pruning vs. with pruning.
Could that be included on the blog page? (i.e., zero pruning, but with INT8 quantization)
It would also be good to mention how much pruning has been applied (say, at 50% pruning, 99% of FP32 accuracy is achieved).

Can we specify a constraint on the pruning ratio (say, pruning at 50% or 65%, etc.)?
This is essential when accuracy within certain limits doesn't matter much, but throughput does.

@ajithAI ajithAI added the enhancement New feature or request label Jun 10, 2021
@markurtz markurtz self-assigned this Jun 17, 2021
markurtz (Member) commented

Hi @ajithAI, thank you for the feedback! We are actively working on adding more comparison info across all of the optimization categories and will begin publishing it over the coming weeks. Right now, as a general rule, pruning gives roughly 2X more performance on top of quantization; at least that's what we aim for internally when creating the models and the engine. As for sparsity levels, they're all around 80% for the high-performance ResNet-50 models. We'll be sure to make this information more accessible in future blogs and tutorials!

Additionally, we'd love to hear any feedback on the new model pages we're rolling out, as we'll be doing one for ResNet-50 soon. We just launched the YOLOv3 one here, so please let us know what additional information would be important for you on that page.

We have a new UI coming out for the SparseZoo in the next few weeks that will make all of these comparisons easier and list the level of pruning for each model. Let us know if you'd like to be an alpha tester on that, as we'll be making an announcement in our Slack and Discord communities before pushing it publicly!

Can you explain a bit more what you mean by specifying the constraint on the pruning ratio?


ajithAI commented Jun 23, 2021

Hi @markurtz, thanks for your explanation. So an un-pruned ResNet-50 model gives over 1,000 FPS throughput (which is great when compared with the NVIDIA T4's throughput of 5,563 FPS), and, taking advantage of pruning, achieving 2,090 FPS on CPU is pretty amazing. Thanks for the information!

Regarding the YOLOv3 example, I will try to go through the entire flow and will let you know my experience.

I can have a look at the SparseZoo UI, but I am not sure how deep I can go into it because of my current bandwidth.

Pruning ratio constraint: NVIDIA can sparsify only 50% of a model with their latest Ampere family; they can't sparsify less, and they can't sparsify more. There are constraints on how much we can sparsify depending on the accelerator. In addition, in some use cases even a 10% drop in accuracy is bearable. In cases like that, where throughput is the real interest, can we prune a model beyond the usual limits? Say I need a model with 90% pruning where any accuracy loss is fine; in that case I have a constraint on the pruning ratio. In the Neural Magic application, can we specify (min, max) pruning ratios, or target a desired throughput, e.g. I need 5,000 FPS and any extent of pruning is fine? Something like that.

Also, is there a paper I can read about Neural Magic's pruning method? I'm thinking broadly about how Neural Magic's pruning methodologies could be applied to FPGA accelerators, where we can program at the hardware level.

markurtz (Member) commented

Ah, that makes a lot of sense, thanks @ajithAI!

For the pruning ratio, yes, you're free to specify even more sparsity by editing the recipes we have or creating one from scratch. All of the recipes are set up with sectioned sparsity variables at the top; increasing these will give the result you desire. The DeepSparse Engine generally sees an exponential relationship between sparsity and performance, provided everything is compute-bound. If layers are memory-bound, such as with depthwise convolutions, then sparsity won't give much speedup (see the rough sketch below). This is some of the core technology we're working on improving, though -- executing more of the network depthwise to make the model more compute-bound.
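
For intuition on the compute-bound vs. memory-bound point, here is a rough back-of-the-envelope sketch (not Neural Magic's actual cost model; the layer shapes and the FP32/4-bytes-per-element assumption are purely illustrative) comparing the arithmetic intensity of a standard 3x3 convolution with a depthwise one. The far lower FLOPs-per-byte of the depthwise layer is why removing FLOPs via sparsity helps it much less:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte) for conv layers.
# Illustrative only: FP32 tensors (4 bytes/element), no cache reuse modeled.

def conv_intensity(c_in, c_out, k, h, w, depthwise=False):
    """Return (GFLOPs, MB moved, FLOPs/byte) for one convolution layer."""
    groups = c_in if depthwise else 1
    flops = 2 * c_out * (c_in // groups) * k * k * h * w          # multiply-adds
    weights = c_out * (c_in // groups) * k * k
    activations = c_in * h * w + c_out * h * w                    # input + output
    bytes_moved = 4 * (weights + activations)
    return flops / 1e9, bytes_moved / 1e6, flops / bytes_moved

# A standard 3x3 conv vs. a depthwise 3x3 conv at the same spatial size.
for name, dw in [("standard 3x3", False), ("depthwise 3x3", True)]:
    gflops, mb, intensity = conv_intensity(256, 256, 3, 28, 28, depthwise=dw)
    print(f"{name:14s} {gflops:6.2f} GFLOPs  {mb:6.2f} MB  {intensity:7.1f} FLOPs/byte")

# The depthwise layer does ~256x fewer FLOPs over the same activations, so it is
# dominated by memory traffic; pruning away FLOPs there buys little speedup.
```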

Our pruning methodologies follow gradual magnitude pruning, as we have found it to be the most consistent and to give the best results. The one caveat is that it takes more training time compared to other methods. Song Han's 2015 paper ("Learning both Weights and Connections for Efficient Neural Networks") is probably the best one to go through for this: https://arxiv.org/abs/1506.02626
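
For reference, here is a minimal PyTorch sketch of what gradual magnitude pruning looks like at the framework level (this is not SparseML's implementation; the cubic ramp schedule, the layer selection, and the function names are illustrative assumptions): the sparsity target ramps from an initial to a final value over training, and at each update the smallest-magnitude weights are zeroed out while training continues so the remaining weights can recover accuracy.

```python
# Minimal gradual magnitude pruning sketch in PyTorch -- illustrative only,
# not SparseML's implementation. Schedule and layer selection are assumptions.
import torch
import torch.nn as nn

def sparsity_at(epoch, start=1, end=30, init_sparsity=0.05, final_sparsity=0.85):
    """Ramp the sparsity target from init to final between start and end epochs (cubic schedule)."""
    if epoch <= start:
        return init_sparsity
    if epoch >= end:
        return final_sparsity
    progress = (epoch - start) / (end - start)
    return final_sparsity + (init_sparsity - final_sparsity) * (1 - progress) ** 3

def apply_magnitude_pruning(model, sparsity):
    """Zero out the lowest-magnitude weights in each conv/linear layer; return the masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            weight = module.weight.data
            k = int(sparsity * weight.numel())
            if k == 0:
                continue
            threshold = weight.abs().flatten().kthvalue(k).values
            mask = (weight.abs() > threshold).to(weight.dtype)
            weight.mul_(mask)   # pruned weights become (and should stay) zero
            masks[name] = mask  # reapply after each optimizer step to keep them zero
    return masks

# Typical use inside a normal training loop: raise the target each epoch, re-prune,
# then keep fine-tuning so the surviving weights recover the lost accuracy, e.g.:
#   for epoch in range(num_epochs):
#       masks = apply_magnitude_pruning(model, sparsity_at(epoch))
#       train_one_epoch(model, masks, ...)  # multiply each weight by its mask after steps
```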

jeanniefinks (Member) commented

Hello @ajithAI
As there has been no further commentary, I am going to go ahead and close this thread out. But if you have more comments, please re-open and we'd love to chat. Lastly, if you have not starred our sparseml repo already, and you feel inclined, please do! Thank you in advance for your support! https://github.com/neuralmagic/sparseml/

Best, Jeannie / Neural Magic
