From 9444edf113928a417a3aaaee1f3be92bbf45ac53 Mon Sep 17 00:00:00 2001
From: Mark Kurtz
Date: Wed, 24 Feb 2021 14:19:39 -0500
Subject: [PATCH 01/13] Sparsification update

- update sparsification descriptions and move to preferred verbiage
- update classification examples to resnet50

---
 README.md                      | 88 ++++++++++++++++++++++------------
 docs/source/index.rst          | 34 ++++++-------
 docs/source/quicktour.md       | 77 ++++++++++++++++++-----------
 notebooks/classification.ipynb | 10 ++--
 4 files changed, 130 insertions(+), 79 deletions(-)

diff --git a/README.md b/README.md
index 307a07cd5c..aa525e0ebb 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ limitations under the License.

 # ![icon for DeepSparse](https://raw.githubusercontent.com/neuralmagic/deepsparse/main/docs/source/icon-deepsparse.png) DeepSparse Engine

-### CPU inference engine that delivers unprecedented performance for sparse models
+### Neural network inference engine that delivers unprecedented performance for sparsified models on CPUs

GitHub @@ -50,16 +50,19 @@ The DeepSparse Engine is a CPU runtime that delivers unprecedented performance b This repository includes package APIs along with examples to quickly get started learning about and actually running sparse models. -### Related Products +## Sparsification -- [SparseZoo](https://github.com/neuralmagic/sparsezoo): - Neural network model repository for highly sparse models and optimization recipes -- [SparseML](https://github.com/neuralmagic/sparseml): - Libraries for state-of-the-art deep neural network optimization algorithms, - enabling simple pipelines integration with a few lines of code -- [Sparsify](https://github.com/neuralmagic/sparsify): - Easy-to-use autoML interface to optimize deep neural networks for - better inference performance and a smaller footprint +Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. +Techniques for sparsification are all encompassing including everything from inducing sparsity using [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061) to enabling naturally occurring sparsity using [activation sparsity](http://proceedings.mlr.press/v119/kurtz20a.html) or [winograd/FFT](https://arxiv.org/abs/1509.09308). +When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. +For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline. + +The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. +Recipes encode the directions for how to sparsify a model into a simple, easily editable format. +Download a sparsification recipe/sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo) or create one using [Sparsify](https://github.com/neuralmagic/sparsify), apply it using [SparseML](https://github.com/neuralmagic/sparseml) with only a few lines of code, and deploy with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse) for unprecedented performance on CPUs. +Visualization of the full product flow: + + ## Compatibility @@ -67,21 +70,22 @@ The DeepSparse Engine ingests models in the [ONNX](https://onnx.ai/) format, all ## Quick Tour -To expedite inference and benchmarking on real models, we include the `sparsezoo` package. [SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference optimized models, trained on repeatable optimization recipes using state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml). +To expedite inference and benchmarking on real models, we include the `sparsezoo` package. [SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml). ### Quickstart with SparseZoo ONNX Models -**MobileNetV1 Dense** +**ResNet-50 Dense** -Here is how to quickly perform inference with DeepSparse Engine on a pre-trained dense MobileNetV1 from SparseZoo. +Here is how to quickly perform inference with DeepSparse Engine on a pre-trained dense ResNet-50 from SparseZoo. 
```python from deepsparse import compile_model from sparsezoo.models import classification + batch_size = 64 # Download model and compile as optimized executable for your machine -model = classification.mobilenet_v1() +model = classification.resnet_50() engine = compile_model(model, batch_size=batch_size) # Fetch sample input and predict output using engine @@ -89,41 +93,65 @@ inputs = model.data_inputs.sample_batch(batch_size=batch_size) outputs, inference_time = engine.timed_run(inputs) ``` -**MobileNetV1 Optimized** +**ResNet-50 Sparsified** When exploring available optimized models, you can use the `Zoo.search_optimized_models` utility to find models that share a base. -Let us try this on the dense MobileNetV1 to see what is available. +Let us try this on the dense ResNet-50 to see what is available. ```python from sparsezoo import Zoo from sparsezoo.models import classification -print(Zoo.search_optimized_models(classification.mobilenet_v1())) + +model = classification.resnet_50() +print(Zoo.search_optimized_models(model)) ``` Output: ```shell -[Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/base-none), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-conservative), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned_quant-moderate)] +[ + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet-augmented/pruned_quant-aggressive) +] ``` -Great. We can see there are two pruned versions targeting FP32, `conservative` at 100% and `moderate` at >= 99% of baseline accuracy. There is also a `pruned_quant` variant targetting INT8. +We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. +The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. -Let's say you want to evaluate best performance on FP32 and are okay with a small drop in accuracy, so we can choose `pruned-moderate` over `pruned-conservative`. +Let's say that we want something that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. +This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). +For hardware compatibility, see the Hardware Support section. 
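As a quick sanity check before the full side-by-side benchmark below, the objects from the dense quickstart above already give a throughput number. A minimal sketch, reusing `engine`, `inputs`, and `batch_size` from that example and assuming `timed_run` reports wall-clock seconds:

```python
# Rough throughput estimate from a single timed run (hedged sketch; reuses the
# `engine`, `inputs`, and `batch_size` objects from the dense example above).
outputs, seconds = engine.timed_run(inputs)
print(f"{batch_size} images in {seconds:.4f}s -> ~{batch_size / seconds:.1f} images/sec")
```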
```python from deepsparse import compile_model -from sparsezoo.models import classification -batch_size = 64 +import numpy -model = classification.mobilenet_v1(optim_name="pruned", optim_category="moderate") -engine = compile_model(model, batch_size=batch_size) - -inputs = model.data_inputs.sample_batch(batch_size=batch_size) -outputs, inference_time = engine.timed_run(inputs) +batch_size = 64 +sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)] + +# run baseline benchmarking +engine_base = compile_model( + model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none", + batch_size=batch_size, +) +benchmarks_base = engine_base.benchmark(sample_inputs) +print(benchmarks_base) + +# run sparse benchmarking +engine_sparse = compile_model( + model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate", + batch_size=batch_size, +) +if not engine_sparse.cpu_vnni: + print("WARNING: VNNI instructions not detected, quantization speedup not well supported") +benchmarks_sparse = engine_sparse.benchmark(sample_inputs) +print(benchmarks_sparse) + +print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x") ``` ### Quickstart with custom ONNX models diff --git a/docs/source/index.rst b/docs/source/index.rst index 27f7253e5e..c5590ebff6 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -17,7 +17,7 @@ DeepSparse |version| ==================== -CPU inference engine that delivers unprecedented performance for sparse models. +Neural network inference engine that delivers unprecedented performance for sparsified models on CPUs .. raw:: html @@ -51,14 +51,26 @@ CPU inference engine that delivers unprecedented performance for sparse models. Overview ======== -The DeepSparse Engine is a CPU runtime that delivers unprecedented performance by taking advantage of -natural sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. -It is focused on model deployment and scaling machine learning pipelines, -fitting seamlessly into your existing deployments as an inference backend. +The DeepSparse Engine is a CPU runtime that delivers unprecedented performance by taking advantage of natural sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend. `This repository `_ includes package APIs along with examples to quickly get started learning about and actually running sparse models. +Sparsification +============== + +Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. +Techniques for sparsification are all encompassing including everything from inducing sparsity using `pruning `_ and `quantization `_ to enabling naturally occurring sparsity using `activation sparsity `_ or `winograd/FFT `_. +When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. +For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline. 
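To make the pruning idea concrete, a toy numpy illustration (an editorial sketch, not DeepSparse or SparseML API): magnitude pruning keeps the largest-magnitude weights and zeroes the rest.

.. code-block:: python

    import numpy

    # Toy magnitude pruning: zero out the 90% smallest-magnitude weights.
    weights = numpy.random.randn(64, 64)
    threshold = numpy.percentile(numpy.abs(weights), 90.0)
    pruned = numpy.where(numpy.abs(weights) >= threshold, weights, 0.0)
    print(f"sparsity: {(pruned == 0).mean():.2%}")  # ~90% zeros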
+ +The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. +Recipes encode the directions for how to sparsify a model into a simple, easily editable format. +Download a sparsification recipe/sparsified model from the `SparseZoo `_ or create one using `Sparsify `_, apply it using `SparseML `_ with only a few lines of code, and deploy with the `DeepSparse Engine `_ for unprecedented performance on CPUs. +Visualization of the full product flow: + + + Compatibility ============= @@ -68,18 +80,6 @@ allowing for compatibility with `PyTorch `_ that support it. This reduces the extra work of preparing your trained model for inference to just one step of exporting. -Related Products -================ - -- `SparseZoo `_: - Neural network model repository for highly sparse models and optimization recipes -- `SparseML `_: - Libraries for state-of-the-art deep neural network optimization algorithms, - enabling simple pipelines integration with a few lines of code -- `Sparsify `_: - Easy-to-use autoML interface to optimize deep neural networks for - better inference performance and a smaller footprint - Resources and Learning More =========================== diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md index 432c267f70..e14283c93e 100644 --- a/docs/source/quicktour.md +++ b/docs/source/quicktour.md @@ -17,23 +17,24 @@ limitations under the License. ## Quick Tour To expedite inference and benchmarking on real models, we include the `sparsezoo` package. -[SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference optimized models, -trained on repeatable optimization recipes using state-of-the-art techniques from +[SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, +trained on repeatable sparsification recipes using state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml). ### Quickstart with SparseZoo ONNX Models -**MobileNetV1 Dense** +**ResNet-50 Dense** -Here is how to quickly perform inference with DeepSparse Engine on a pre-trained dense MobileNetV1 from SparseZoo. +Here is how to quickly perform inference with DeepSparse Engine on a pre-trained dense ResNet-50 from SparseZoo. ```python from deepsparse import compile_model from sparsezoo.models import classification + batch_size = 64 # Download model and compile as optimized executable for your machine -model = classification.mobilenet_v1() +model = classification.resnet_50() engine = compile_model(model, batch_size=batch_size) # Fetch sample input and predict output using engine @@ -41,43 +42,65 @@ inputs = model.data_inputs.sample_batch(batch_size=batch_size) outputs, inference_time = engine.timed_run(inputs) ``` -**MobileNetV1 Optimized** +**ResNet-50 Sparsified** -When exploring available optimized models, you can use the `Zoo.search_optimized_models` -utility to find models that share a base. +When exploring available optimized models, you can use the `Zoo.search_optimized_models` utility to find models that share a base. -Let us try this on the dense MobileNetV1 to see what is available. +Let us try this on the dense ResNet-50 to see what is available. 
```python from sparsezoo import Zoo from sparsezoo.models import classification -print(Zoo.search_optimized_models(classification.mobilenet_v1())) + +model = classification.resnet_50() +print(Zoo.search_optimized_models(model)) ``` + Output: -``` -[Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/base-none), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-conservative), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate), - Model(stub=cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned_quant-moderate)] + +```shell +[ + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-conservative), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned-moderate), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate), + Model(stub=cv/classification/resnet_v1-50/pytorch/sparseml/imagenet-augmented/pruned_quant-aggressive) +] ``` -Great. We can see there are two pruned versions targeting FP32, -`conservative` at 100% and `moderate` at >= 99% of baseline accuracy. -There is also a `pruned_quant` variant targeting INT8. +We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. +The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. -Let's say you want to evaluate best performance on FP32 and are okay with a small drop in accuracy, -so we can choose `pruned-moderate` over `pruned-conservative`. +Let's say that we want something that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. +This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). +For hardware compatibility, see the Hardware Support section. 
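To pull the INT8 variants out of that search result programmatically, a small sketch (assuming the `stub` attribute visible in the printed `Model(stub=...)` repr above; the attribute name may differ by sparsezoo version):

```python
# Filter the search results down to the pruned + quantized (INT8) variants.
# Reuses `model` from the search example above; `stub` is inferred from the
# printed repr and is an assumption, not a documented guarantee.
int8_variants = [
    m for m in Zoo.search_optimized_models(model) if "pruned_quant" in str(m.stub)
]
print(int8_variants)
```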
```python
from deepsparse import compile_model
-from sparsezoo.models import classification
-batch_size = 64
-
-model = classification.mobilenet_v1(optim_name="pruned", optim_category="moderate")
-engine = compile_model(model, batch_size=batch_size)
+import numpy

-inputs = model.data_inputs.sample_batch(batch_size=batch_size)
-outputs, inference_time = engine.timed_run(inputs)
+batch_size = 64
+sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]
+
+# run baseline benchmarking
+engine_base = compile_model(
+    model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none",
+    batch_size=batch_size,
+)
+benchmarks_base = engine_base.benchmark(sample_inputs)
+print(benchmarks_base)
+
+# run sparse benchmarking
+engine_sparse = compile_model(
+    model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned_quant-moderate",
+    batch_size=batch_size,
+)
+if not engine_sparse.cpu_vnni:
+    print("WARNING: VNNI instructions not detected, quantization speedup not well supported")
+benchmarks_sparse = engine_sparse.benchmark(sample_inputs)
+print(benchmarks_sparse)
+
+print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x")
 ```

 ### Quickstart with custom ONNX models
diff --git a/notebooks/classification.ipynb b/notebooks/classification.ipynb
index 7b090b4926..fce1d9324e 100644
--- a/notebooks/classification.ipynb
+++ b/notebooks/classification.ipynb
@@ -63,11 +63,11 @@
    "source": [
     "## Gathering the Model and Data\n",
     "\n",
-    "By default, you will download a MobileNetV1 model trained on the ImageNet dataset.\n",
+    "By default, you will download a sparsified ResNet-50 model trained on the ImageNet dataset.\n",
     "The model's pretrained weights and exported ONNX file are downloaded from the SparseZoo model repo.\n",
     "The sample batch of data is downloaded from SparseZoo as well.\n",
     "\n",
-    "If you want to try different architectures replace `mobilenet_v1()` with your choice, for example: `resnet50()` or `efficientnet_b0()`.\n",
+    "If you want to try different architectures, replace `resnet_50()` with your choice, for example: `mobilenet_v1()` or `efficientnet_b0()`.\n",
     "\n",
     "You may also want to try different batch sizes to evaluate accuracy and performance for your task."
] @@ -95,7 +95,7 @@ "# Define your model below\n", "# =====================================================\n", "print(\"Downloading model...\")\n", - "model = classification.mobilenet_v1()\n", + "model = classification.resnet_50(optim_name=\"pruned_quant\", optim_category=\"moderate\")\n", "\n", "# Gather sample batch of data for inference and visualization\n", "batch = model.sample_batch(batch_size=batch_size)\n", @@ -276,9 +276,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.9" + "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file From 83e40fab48ed17b61c048041250c636b31fabd5a Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 17:15:14 -0500 Subject: [PATCH 02/13] update from comments --- README.md | 4 ++-- docs/source/index.rst | 2 +- docs/source/quicktour.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index aa525e0ebb..effa94e5d1 100644 --- a/README.md +++ b/README.md @@ -55,7 +55,7 @@ This repository includes package APIs along with examples to quickly get started Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. Techniques for sparsification are all encompassing including everything from inducing sparsity using [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061) to enabling naturally occurring sparsity using [activation sparsity](http://proceedings.mlr.press/v119/kurtz20a.html) or [winograd/FFT](https://arxiv.org/abs/1509.09308). When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. -For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline. +For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. @@ -122,7 +122,7 @@ Output: We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. -Let's say that we want something that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. +Let's say that we want a version of ResNet-50 that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. 
diff --git a/docs/source/index.rst b/docs/source/index.rst index c5590ebff6..7c6ef7b5e6 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -62,7 +62,7 @@ Sparsification Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. Techniques for sparsification are all encompassing including everything from inducing sparsity using `pruning `_ and `quantization `_ to enabling naturally occurring sparsity using `activation sparsity `_ or `winograd/FFT `_. When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. -For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline. +For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md index e14283c93e..994350e82c 100644 --- a/docs/source/quicktour.md +++ b/docs/source/quicktour.md @@ -71,7 +71,7 @@ Output: We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. -Let's say that we want something that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. +Let's say that we want a version of ResNet-50 that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. From b0ed9847fc0244b9fd69d6aeedd6102127fa0cd1 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:17 -0500 Subject: [PATCH 03/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index effa94e5d1..6c76655dc8 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,7 @@ This repository includes package APIs along with examples to quickly get started ## Sparsification -Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. +Sparsification is the process of taking a trained deep learning model and removing redundant information from the overprecise and over-parameterized network resulting in a faster and smaller model. 
Techniques for sparsification are all encompassing including everything from inducing sparsity using [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061) to enabling naturally occurring sparsity using [activation sparsity](http://proceedings.mlr.press/v119/kurtz20a.html) or [winograd/FFT](https://arxiv.org/abs/1509.09308). When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. From 0ee87dc2e90ceccbd06708bf4721a35de247662f Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:23 -0500 Subject: [PATCH 04/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6c76655dc8..01818815b9 100644 --- a/README.md +++ b/README.md @@ -57,7 +57,7 @@ Techniques for sparsification are all encompassing including everything from ind When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. -The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. +The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. Download a sparsification recipe/sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo) or create one using [Sparsify](https://github.com/neuralmagic/sparsify), apply it using [SparseML](https://github.com/neuralmagic/sparseml) with only a few lines of code, and deploy with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse) for unprecedented performance on CPUs. Visualization of the full product flow: From 4235010f69479bbb3be560f1f0835d966ea84190 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:29 -0500 Subject: [PATCH 05/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 01818815b9..1cd3498bd4 100644 --- a/README.md +++ b/README.md @@ -59,7 +59,12 @@ For example, pruning plus quantization can give over [7x improvements in perform The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. 
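To make the recipe idea concrete, a hedged sketch of applying one in a PyTorch training loop (assuming SparseML's `ScheduledModifierManager` interface; exact signatures vary by SparseML version, and `recipe.yaml` is a placeholder path):

```python
# Sketch: recipe-driven sparsification with SparseML's PyTorch API (assumed
# interface; check the SparseML docs for the exact signatures in your version).
import torch
from sparseml.pytorch.optim import ScheduledModifierManager

model = torch.nn.Linear(10, 2)  # stand-in for your real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

manager = ScheduledModifierManager.from_yaml("recipe.yaml")  # placeholder path
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)

# ... run your normal training loop; the recipe schedules pruning/quantization ...

manager.finalize(model)
```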
-Download a sparsification recipe/sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo) or create one using [Sparsify](https://github.com/neuralmagic/sparsify), apply it using [SparseML](https://github.com/neuralmagic/sparseml) with only a few lines of code, and deploy with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse) for unprecedented performance on CPUs. +- Download a sparsification recipe and sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo). +- Alternatively, create a recipe for your model using [Sparsify](https://github.com/neuralmagic/sparsify). +- Apply your recipe with only a few lines of code using [SparseML](https://github.com/neuralmagic/sparseml). +- Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse). + + Visualization of the full product flow: From 59a2d87b36f3c69cee473c8164e347e62096fa9d Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:34 -0500 Subject: [PATCH 06/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1cd3498bd4..f1c636735e 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ Recipes encode the directions for how to sparsify a model into a simple, easily - Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse). -Visualization of the full product flow: +**Full Deep Sparse product flow:** From f94ca3b44d03e51071bac0d289fd71af62108df1 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:40 -0500 Subject: [PATCH 07/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f1c636735e..e7c24e49f8 100644 --- a/README.md +++ b/README.md @@ -128,7 +128,7 @@ We can see there are two pruned versions targeting FP32 and two pruned, quantize The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. Let's say that we want a version of ResNet-50 that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. -This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). +This model will run [nearly 7x faster](linktoresnet50example) than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. ```python From 6f088b1f9a68b9d36c870dc25dc1d7d62b485557 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:45 -0500 Subject: [PATCH 08/13] Update README.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e7c24e49f8..5439874455 100644 --- a/README.md +++ b/README.md @@ -127,7 +127,7 @@ Output: We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. 
-Let's say that we want a version of ResNet-50 that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. +For a version of ResNet-50 that recovers close to the baseline and is very performant, choose the pruned_quant-moderate model. This model will run [nearly 7x faster](linktoresnet50example) than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. From ba018dd6b0f523b5cac197b5a110c970ff52e6d8 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:03:57 -0500 Subject: [PATCH 09/13] Update docs/source/quicktour.md Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com> --- docs/source/quicktour.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md index 994350e82c..312b8c1a23 100644 --- a/docs/source/quicktour.md +++ b/docs/source/quicktour.md @@ -103,7 +103,7 @@ print(benchmarks_sparse) print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x") ``` -### Quickstart with custom ONNX models +### Quickstart with Custom ONNX Models We accept ONNX files for custom models, too. Simply plug in your model to compare performance with other solutions. From 6d760abc514eeedfd0d64b710523a50831a8e779 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:06:07 -0500 Subject: [PATCH 10/13] update for changes found in sparsezoo --- docs/source/index.rst | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index 7c6ef7b5e6..266c7ff677 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -59,15 +59,20 @@ actually running sparse models. Sparsification ============== -Sparsification is the process of taking a trained deep learning model and removing redundant information from the over precise and over parameterized network resulting in a faster and smaller model. +Sparsification is the process of taking a trained deep learning model and removing redundant information from the overprecise and over-parameterized network resulting in a faster and smaller model. Techniques for sparsification are all encompassing including everything from inducing sparsity using `pruning `_ and `quantization `_ to enabling naturally occurring sparsity using `activation sparsity `_ or `winograd/FFT `_. When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. -The DeepSparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets/models using recipe driven approaches. +The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. -Download a sparsification recipe/sparsified model from the `SparseZoo `_ or create one using `Sparsify `_, apply it using `SparseML `_ with only a few lines of code, and deploy with the `DeepSparse Engine `_ for unprecedented performance on CPUs. 
-Visualization of the full product flow: +- Download a sparsification recipe and sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo). +- Alternatively, create a recipe for your model using [Sparsify](https://github.com/neuralmagic/sparsify). +- Apply your recipe with only a few lines of code using [SparseML](https://github.com/neuralmagic/sparseml). +- Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse). + + +**Full Deep Sparse product flow:** From 0a16da859250da63d2533656a8c301f746b79ffb Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Wed, 24 Feb 2021 22:10:04 -0500 Subject: [PATCH 11/13] fix links in index.rst for reviewd content --- docs/source/index.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index 266c7ff677..397b94002f 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -66,10 +66,10 @@ For example, pruning plus quantization can give over [7x improvements in perform The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. -- Download a sparsification recipe and sparsified model from the [SparseZoo](https://github.com/neuralmagic/sparsezoo). -- Alternatively, create a recipe for your model using [Sparsify](https://github.com/neuralmagic/sparsify). -- Apply your recipe with only a few lines of code using [SparseML](https://github.com/neuralmagic/sparseml). -- Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the [DeepSparse Engine](https://github.com/neuralmagic/deepsparse). +- Download a sparsification recipe and sparsified model from the `SparseZoo `_. +- Alternatively, create a recipe for your model using `Sparsify `_. +- Apply your recipe with only a few lines of code using `SparseML `_. +- Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the `DeepSparse Engine `_. **Full Deep Sparse product flow:** From 4609179142f4a44bf0c9d2352539d58ddd575182 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Thu, 25 Feb 2021 09:34:07 -0500 Subject: [PATCH 12/13] update component overview and tagline from doc --- README.md | 9 +++++---- docs/source/index.rst | 36 ++++++++++++++++++------------------ setup.py | 5 ++++- src/deepsparse/engine.py | 2 +- 4 files changed, 28 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index 5439874455..6dd7bdad14 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ limitations under the License. # ![icon for DeepSparse](https://raw.githubusercontent.com/neuralmagic/deepsparse/main/docs/source/icon-deepsparse.png) DeepSparse Engine -### Neural network inference engine that delivers unprecedented performance for sparsified models on CPUs +### Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs

GitHub @@ -46,16 +46,17 @@ limitations under the License. ## Overview -The DeepSparse Engine is a CPU runtime that delivers unprecedented performance by taking advantage of natural sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend. +The DeepSparse Engine is a CPU runtime that delivers GPU-class performance by taking advantage of sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. +It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend. -This repository includes package APIs along with examples to quickly get started learning about and actually running sparse models. +This repository includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models. ## Sparsification Sparsification is the process of taking a trained deep learning model and removing redundant information from the overprecise and over-parameterized network resulting in a faster and smaller model. Techniques for sparsification are all encompassing including everything from inducing sparsity using [pruning](https://neuralmagic.com/blog/pruning-overview/) and [quantization](https://arxiv.org/abs/1609.07061) to enabling naturally occurring sparsity using [activation sparsity](http://proceedings.mlr.press/v119/kurtz20a.html) or [winograd/FFT](https://arxiv.org/abs/1509.09308). When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. -For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. +For example, pruning plus quantization can give over [7x improvements in performance](https://neuralmagic.com/blog/benchmark-resnet50-with-deepsparse) while recovering to nearly the same baseline accuracy. The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. diff --git a/docs/source/index.rst b/docs/source/index.rst index 397b94002f..0a2344b29b 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -17,7 +17,7 @@ DeepSparse |version| ==================== -Neural network inference engine that delivers unprecedented performance for sparsified models on CPUs +Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs .. raw:: html @@ -51,10 +51,10 @@ Neural network inference engine that delivers unprecedented performance for spar Overview ======== -The DeepSparse Engine is a CPU runtime that delivers unprecedented performance by taking advantage of natural sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend. +The DeepSparse Engine is a CPU runtime that delivers GPU-class performance by taking advantage of sparsity within neural networks to reduce compute required as well as accelerate memory bound workloads. 
+It is focused on model deployment and scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend. -`This repository `_ includes package APIs along with examples to quickly get started learning about and -actually running sparse models. +`This repository `_ includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models. Sparsification ============== @@ -62,7 +62,7 @@ Sparsification Sparsification is the process of taking a trained deep learning model and removing redundant information from the overprecise and over-parameterized network resulting in a faster and smaller model. Techniques for sparsification are all encompassing including everything from inducing sparsity using `pruning `_ and `quantization `_ to enabling naturally occurring sparsity using `activation sparsity `_ or `winograd/FFT `_. When implemented correctly, these techniques result in significantly more performant and smaller models with limited to no effect on the baseline metrics. -For example, pruning plus quantization can give over [7x improvements in performance](resnet50link) while recovering to nearly the same baseline accuracy. +For example, pruning plus quantization can give over `7x improvements in performance `_ while recovering to nearly the same baseline accuracy. The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format. @@ -79,31 +79,31 @@ Recipes encode the directions for how to sparsify a model into a simple, easily Compatibility ============= -The DeepSparse Engine ingests models in the `ONNX `_ format, -allowing for compatibility with `PyTorch `_, -`TensorFlow `_, `Keras `_, -and `many other frameworks `_ that support it. +The DeepSparse Engine ingests models in the `ONNX `_ format, +allowing for compatibility with `PyTorch `_, +`TensorFlow `_, `Keras `_, +and `many other frameworks `_ that support it. This reduces the extra work of preparing your trained model for inference to just one step of exporting. Resources and Learning More =========================== -- `SparseZoo Documentation `_ -- `SparseML Documentation `_ -- `Sparsify Documentation `_ -- `Neural Magic Blog `_, - `Resources `_, - `Website `_ +- `SparseZoo Documentation `_ +- `SparseML Documentation `_ +- `Sparsify Documentation `_ +- `Neural Magic Blog `_, + `Resources `_, + `Website `_ Release History =============== Official builds are hosted on PyPi -- stable: `deepsparse `_ -- nightly (dev): `deepsparse-nightly `_ +- stable: `deepsparse `_ +- nightly (dev): `deepsparse-nightly `_ Additionally, more information can be found via -`GitHub Releases `_. +`GitHub Releases `_. .. 
toctree:: :maxdepth: 3 diff --git a/setup.py b/setup.py index 349a45d214..a8a1c88849 100644 --- a/setup.py +++ b/setup.py @@ -114,7 +114,10 @@ def _setup_long_description() -> Tuple[str, str]: version=_VERSION, author="Neuralmagic, Inc.", author_email="support@neuralmagic.com", - description="CPU runtime that delivers unprecedented performance for sparse models", + description=( + "Neural network inference engine that delivers GPU-class performance " + "for sparsified models on CPUs" + ), long_description=_setup_long_description()[0], long_description_content_type=_setup_long_description()[1], keywords=( diff --git a/src/deepsparse/engine.py b/src/deepsparse/engine.py index dbe909fdf0..763526120f 100644 --- a/src/deepsparse/engine.py +++ b/src/deepsparse/engine.py @@ -21,9 +21,9 @@ from typing import Dict, Iterable, List, Optional, Tuple, Union import numpy +from tqdm.auto import tqdm from deepsparse.benchmark import BenchmarkResults -from tqdm.auto import tqdm try: From a38d27fbbe68d6a6678432b424e4ac76cebc1d09 Mon Sep 17 00:00:00 2001 From: Mark Kurtz Date: Thu, 25 Feb 2021 09:38:46 -0500 Subject: [PATCH 13/13] update from comments --- README.md | 6 +++--- docs/source/quicktour.md | 11 ++++------- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 6dd7bdad14..c0800485a5 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,7 @@ outputs, inference_time = engine.timed_run(inputs) When exploring available optimized models, you can use the `Zoo.search_optimized_models` utility to find models that share a base. -Let us try this on the dense ResNet-50 to see what is available. +Try this on the dense ResNet-50 to see what is available: ```python from sparsezoo import Zoo @@ -129,7 +129,7 @@ We can see there are two pruned versions targeting FP32 and two pruned, quantize The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. For a version of ResNet-50 that recovers close to the baseline and is very performant, choose the pruned_quant-moderate model. -This model will run [nearly 7x faster](linktoresnet50example) than the baseline model on a compatible CPU (with the VNNI instruction set enabled). +This model will run [nearly 7x faster](https://neuralmagic.com/blog/benchmark-resnet50-with-deepsparse) than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. ```python @@ -160,7 +160,7 @@ print(benchmarks_sparse) print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x") ``` -### Quickstart with custom ONNX models +### Quickstart with Custom ONNX Models We accept ONNX files for custom models, too. Simply plug in your model to compare performance with other solutions. diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md index 312b8c1a23..7bb3c94a1a 100644 --- a/docs/source/quicktour.md +++ b/docs/source/quicktour.md @@ -16,10 +16,7 @@ limitations under the License. ## Quick Tour -To expedite inference and benchmarking on real models, we include the `sparsezoo` package. -[SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, -trained on repeatable sparsification recipes using state-of-the-art techniques from -[SparseML](https://github.com/neuralmagic/sparseml). +To expedite inference and benchmarking on real models, we include the `sparsezoo` package. 
[SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml). ### Quickstart with SparseZoo ONNX Models @@ -46,7 +43,7 @@ outputs, inference_time = engine.timed_run(inputs) When exploring available optimized models, you can use the `Zoo.search_optimized_models` utility to find models that share a base. -Let us try this on the dense ResNet-50 to see what is available. +Try this on the dense ResNet-50 to see what is available: ```python from sparsezoo import Zoo @@ -71,8 +68,8 @@ Output: We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The `conservative`, `moderate`, and `aggressive` tags recover to 100%, >=99%, and <99% of baseline accuracy respectively. -Let's say that we want a version of ResNet-50 that recovers close to the baseline and is very performant, we can choose the pruned_quant-moderate model. -This model will run [nearly 7 times faster](linktoresnet50example) than the baseline model on a compatible CPU (VNNI instruction set enabled). +For a version of ResNet-50 that recovers close to the baseline and is very performant, choose the pruned_quant-moderate model. +This model will run [nearly 7x faster](https://neuralmagic.com/blog/benchmark-resnet50-with-deepsparse) than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section. ```python
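# Hedged sketch for the "Quickstart with Custom ONNX Models" section mentioned
# above: the same compile-and-run pattern also accepts a local ONNX file.
# Assumptions: compile_model takes a plain file path and engine.run exists as
# in the repo's examples; "model.onnx" and the 3x224x224 shape are hypothetical.
from deepsparse import compile_model
import numpy

onnx_filepath = "model.onnx"  # hypothetical path to your exported ONNX model
batch_size = 16

# Compile for the local machine and run a random, correctly shaped input.
engine = compile_model(onnx_filepath, batch_size=batch_size)
inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]
outputs = engine.run(inputs)
```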