INT8 Performance without Pruning #277

Closed
ajithAI opened this issue Jun 10, 2021 · 4 comments


ajithAI commented Jun 10, 2021

It would be really nice to have a comparison of INT8 performance without pruning vs. with pruning.
Could that be included on the blog page? (i.e., zero pruning, but with INT8 quantization)
It would also be good to mention how much pruning has been applied (say, at 50% pruning, 99% of FP32 accuracy is achieved).

Can we specify a constraint on the pruning ratio (say, pruning at 50% or 65%, etc.)?
This is essential when accuracy within certain limits doesn't matter much, but throughput does.

@ajithAI ajithAI added the enhancement New feature or request label Jun 10, 2021
@markurtz markurtz self-assigned this Jun 17, 2021
markurtz (Member) commented

Hi @ajithAI, thank you for the feedback! We are actively working on adding more comparison info across all of the optimization categories and will begin publishing it over the coming weeks. Right now, as a general rule, pruning gives roughly 2X more performance on top of quantization; at least that's what we aim for internally when creating the models and the engine. As for sparsity levels, they're all around 80% for the high-performance ResNet-50 models. We'll be sure to make this information more accessible in future blogs and tutorials!

Additionally, we'd love to hear any feedback on the new model pages we're rolling out, as we'll be doing one for ResNet-50 soon. We just launched the YOLOv3 one here, so please let us know what additional information would be important for you on that page.

We have a new UI coming out for the SparseZoo in the next few weeks that will make all of these comparisons easier and list the level of pruning for each model. Let us know if you'd like to be an alpha tester on that, as we'll be making an announcement in our Slack and Discord communities before pushing it publicly!

Can you explain a bit more what you mean by specifying the constraint on the pruning ratio?


ajithAI commented Jun 23, 2021

Hi @markurtz, thanks for your explanation. So an un-pruned ResNet-50 model gives over 1,000 FPS throughput (which is great when compared with the NVIDIA T4's throughput of 5,563 FPS), and, taking advantage of pruning, achieving 2,090 FPS on CPU is pretty amazing. Thanks for the information!

Regarding the YOLOv3 example, I will try to go through the entire flow and will let you know my experience.

I can have a look at the SparseZoo UI, but I am not sure how deep I can go into it because of my current bandwidth.

Pruning ratio constraint: NVIDIA can sparsify only 50% of a model with their latest Ampere family; they can't sparsify less, and they can't sparsify more. There are constraints on how much we can sparsify depending on the accelerator. In addition, in some use cases even a 10% drop in accuracy is bearable. In cases like that, where throughput is the real interest, can we prune a model beyond the usual limits? Say I need a model with 90% pruning where any accuracy loss is fine; in that case I have a constraint on the pruning ratio. In the Neural Magic application, can we specify (min, max) pruning ratios, or target a desired throughput, e.g. I need 5,000 FPS and any extent of pruning is fine? Something like that.

Also, is there a paper I can read about Neural Magic's pruning method? I'm thinking broadly about how Neural Magic's pruning methodologies could be applied to FPGA accelerators, where we can program at the hardware level.

markurtz (Member) commented

Ah, that makes a lot of sense, thanks @ajithAI!

For the pruning ratio, yes, you're free to specify even more sparsity by editing the recipes we have or creating one from scratch. All of the recipes are set up with sectioned sparsity variables at the top; increasing these will give the result you desire. The DeepSparse Engine generally sees an exponential relationship between sparsity and performance, provided everything is compute-bound. If layers are memory-bound, such as with depthwise convolutions, then sparsity won't give much speedup (see the rough sketch below). This is some of the core technology we're working on improving, though -- executing more of the network depthwise to make the model more compute-bound.
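
For intuition on the compute-bound vs. memory-bound point, here is a rough back-of-the-envelope sketch (not Neural Magic's actual cost model; the layer shapes and the FP32/4-bytes-per-element assumption are purely illustrative) comparing the arithmetic intensity of a standard 3x3 convolution with a depthwise one. The far lower FLOPs-per-byte of the depthwise layer is why removing FLOPs via sparsity helps it much less:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte) for conv layers.
# Illustrative only: FP32 tensors (4 bytes/element), no cache reuse modeled.

def conv_intensity(c_in, c_out, k, h, w, depthwise=False):
    """Return (GFLOPs, MB moved, FLOPs/byte) for one convolution layer."""
    groups = c_in if depthwise else 1
    flops = 2 * c_out * (c_in // groups) * k * k * h * w          # multiply-adds
    weights = c_out * (c_in // groups) * k * k
    activations = c_in * h * w + c_out * h * w                    # input + output
    bytes_moved = 4 * (weights + activations)
    return flops / 1e9, bytes_moved / 1e6, flops / bytes_moved

# A standard 3x3 conv vs. a depthwise 3x3 conv at the same spatial size.
for name, dw in [("standard 3x3", False), ("depthwise 3x3", True)]:
    gflops, mb, intensity = conv_intensity(256, 256, 3, 28, 28, depthwise=dw)
    print(f"{name:14s} {gflops:6.2f} GFLOPs  {mb:6.2f} MB  {intensity:7.1f} FLOPs/byte")

# The depthwise layer does ~256x fewer FLOPs over the same activations, so it is
# dominated by memory traffic; pruning away FLOPs there buys little speedup.
```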

Our pruning methodologies follow gradual magnitude pruning, as we have found it to be the most consistent and to give the best results. The one caveat is that it takes more training time compared to other methods. Song Han's 2015 paper ("Learning both Weights and Connections for Efficient Neural Networks") is probably the best one to go through for this: https://arxiv.org/abs/1506.02626
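
For reference, here is a minimal PyTorch sketch of what gradual magnitude pruning looks like at the framework level (this is not SparseML's implementation; the cubic ramp schedule, the layer selection, and the function names are illustrative assumptions): the sparsity target ramps from an initial to a final value over training, and at each update the smallest-magnitude weights are zeroed out while training continues so the remaining weights can recover accuracy.

```python
# Minimal gradual magnitude pruning sketch in PyTorch -- illustrative only,
# not SparseML's implementation. Schedule and layer selection are assumptions.
import torch
import torch.nn as nn

def sparsity_at(epoch, start=1, end=30, init_sparsity=0.05, final_sparsity=0.85):
    """Ramp the sparsity target from init to final between start and end epochs (cubic schedule)."""
    if epoch <= start:
        return init_sparsity
    if epoch >= end:
        return final_sparsity
    progress = (epoch - start) / (end - start)
    return final_sparsity + (init_sparsity - final_sparsity) * (1 - progress) ** 3

def apply_magnitude_pruning(model, sparsity):
    """Zero out the lowest-magnitude weights in each conv/linear layer; return the masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            weight = module.weight.data
            k = int(sparsity * weight.numel())
            if k == 0:
                continue
            threshold = weight.abs().flatten().kthvalue(k).values
            mask = (weight.abs() > threshold).to(weight.dtype)
            weight.mul_(mask)   # pruned weights become (and should stay) zero
            masks[name] = mask  # reapply after each optimizer step to keep them zero
    return masks

# Typical use inside a normal training loop: raise the target each epoch, re-prune,
# then keep fine-tuning so the surviving weights recover the lost accuracy, e.g.:
#   for epoch in range(num_epochs):
#       masks = apply_magnitude_pruning(model, sparsity_at(epoch))
#       train_one_epoch(model, masks, ...)  # multiply each weight by its mask after steps
```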

jeanniefinks (Member) commented

Hello @ajithAI
As there has been no further commentary, I am going to go ahead and close this thread out. But if you have more comments, please re-open and we'd love to chat. Lastly, if you have not starred our sparseml repo already, and you feel inclined, please do! Thank you in advance for your support! https://github.com/neuralmagic/sparseml/

Best, Jeannie / Neural Magic
