From 7e861e03f4112443376f47e13ffea0321215cfd7 Mon Sep 17 00:00:00 2001
From: Ricky Costa
Date: Fri, 1 Apr 2022 12:39:40 -0400
Subject: [PATCH 1/4] altered emoji and title font sizes to match other readmes

---
 src/deepsparse/benchmark_model/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/deepsparse/benchmark_model/README.md b/src/deepsparse/benchmark_model/README.md
index 789a23255e..f721be7180 100644
--- a/src/deepsparse/benchmark_model/README.md
+++ b/src/deepsparse/benchmark_model/README.md
@@ -14,11 +14,11 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# Benchmarking ONNX Models 📜
+## 📜 Benchmarking ONNX Models

`deepsparse.benchmark` is a command-line (CLI) tool for benchmarking the DeepSparse Engine with ONNX models. The tool will parse the arguments, download/compile the network into the engine, generate input tensors, and execute the model depending on the chosen scenario. By default, it will choose a multi-stream or asynchronous mode to optimize for throughput.

-## Quickstart
+### Quickstart

After `pip install deepsparse`, the benchmark tool is available on your CLI. For example, to benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset where the model path is the minimum input required to get started, run:

```bash
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none
```
__ __
-## Usage
+### Usage

In most cases, good performance will be found in the default options so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model. However, if you prefer to customize benchmarking for your personal use case, you can run `deepsparse.benchmark -h` or with `--help` to view your usage options:

@@ -100,7 +100,7 @@ Output of the JSON file:

![alt text](./img/json_output.png)

-## Sample CLI Argument Configurations
+### Sample CLI Argument Configurations

To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:

From 487493523594992c918aec1383b872f792984523 Mon Sep 17 00:00:00 2001
From: Ricky Costa
Date: Fri, 1 Apr 2022 12:41:47 -0400
Subject: [PATCH 2/4] altered emoji and title font sizes to match other readmes

---
 src/deepsparse/benchmark_model/README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/deepsparse/benchmark_model/README.md b/src/deepsparse/benchmark_model/README.md
index f721be7180..c67133744e 100644
--- a/src/deepsparse/benchmark_model/README.md
+++ b/src/deepsparse/benchmark_model/README.md
@@ -100,7 +100,7 @@ Output of the JSON file:

![alt text](./img/json_output.png)

-### Sample CLI Argument Configurations
+#### Sample CLI Argument Configurations

To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:

@@ -114,21 +114,21 @@ To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency:

```bash
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant_6layers-aggressive_96 --batch_size 1 --scenario sync
```
__ __
-## Inference Scenarios ⚡⚡
+### ⚡ Inference Scenarios

-### Synchronous (Single-stream) Scenario
+#### Synchronous (Single-stream) Scenario

Set by the `--scenario sync` argument, the goal metric is latency per batch (ms/batch).
This scenario submits a single inference request at a time to the engine, recording the time taken for a request to return an output. This mimics an edge deployment scenario. The latency value reported is the mean of all latencies recorded during the execution period for the given batch size.

-### Asynchronous (Multi-stream) Scenario
+#### Asynchronous (Multi-stream) Scenario

Set by the `--scenario async` argument, the goal metric is throughput in items per second (i/s). This scenario submits `--num_streams` concurrent inference requests to the engine, recording the time taken for each request to return an output. This mimics a model server or bulk batch deployment scenario. The throughput value reported comes from measuring the number of finished inferences within the execution time and the batch size.

-### Example Benchmarking Output of Synchronous vs. Asynchronous
+#### Example Benchmarking Output of Synchronous vs. Asynchronous

**BERT 3-layer FP32 Sparse Throughput**

From 988f768c6c60156107ae471d3eee62873dc1fd86 Mon Sep 17 00:00:00 2001
From: Ricky Costa
Date: Mon, 4 Apr 2022 13:33:45 -0400
Subject: [PATCH 3/4] fix yaml code block indentation

---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index b719fd6094..91696a09cf 100644
--- a/README.md
+++ b/README.md
@@ -97,15 +97,17 @@ To look up arguments run: `deepsparse.server --help`.

**⭐ Multiple Models ⭐**

To serve multiple models in your deployment you can easily build a `config.yaml`. In the example below, we define two BERT models in our configuration for the question answering task:
-    models:
+```yaml
+models:
    - task: question_answering
    model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
    batch_size: 1
-    alias: question_answering/dense
+    alias: question_answering/base
    - task: question_answering
    model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
    batch_size: 1
-    alias: question_answering/sparse_quantized
+    alias: question_answering/pruned_quant
+```

Finally, after your `config.yaml` file is built, run the server with the config file path as an argument:

From 15cebcacaa66a49b0aa7e88d25f3b037c67c18ed Mon Sep 17 00:00:00 2001
From: Ricky Costa
Date: Mon, 4 Apr 2022 13:50:59 -0400
Subject: [PATCH 4/4] aligned indentation 2nd time

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 91696a09cf..54e8bfa51f 100644
--- a/README.md
+++ b/README.md
@@ -100,13 +100,13 @@ To serve multiple models in your deployment you can easily build a `config.yaml`
```yaml
models:
    - task: question_answering
-    model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
-    batch_size: 1
-    alias: question_answering/base
+      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
+      batch_size: 1
+      alias: question_answering/base
    - task: question_answering
-    model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
-    batch_size: 1
-    alias: question_answering/pruned_quant
+      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
+      batch_size: 1
+      alias: question_answering/pruned_quant
```

Finally, after your `config.yaml` file is built, run the server with the config file path as an argument:
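For readers who want to see what the synchronous scenario in these benchmark README changes actually measures, below is a minimal sketch, separate from the patch set above, of the single-stream loop that `deepsparse.benchmark --scenario sync` reports as ms/batch. It assumes the documented `deepsparse.compile_model` / `Engine.run` Python API; the local `./model.onnx` path and the `[batch, 3, 224, 224]` input shape are hypothetical placeholders, not values taken from the patches.

```python
# Sketch of the synchronous (single-stream) scenario: one request at a time,
# latency recorded per request, mean latency reported as ms/batch.
# Assumptions: a local ONNX model at ./model.onnx with one float32 input of
# shape [batch, 3, 224, 224] -- both are hypothetical placeholders.
import time

import numpy as np
from deepsparse import compile_model

batch_size = 1
engine = compile_model("./model.onnx", batch_size=batch_size)

# One random input tensor matching the assumed model signature.
inputs = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]

# Warm up the engine before timing.
for _ in range(10):
    engine.run(inputs)

# Submit single requests for ~10 seconds and record each latency.
latencies_ms = []
deadline = time.perf_counter() + 10.0
while time.perf_counter() < deadline:
    start = time.perf_counter()
    engine.run(inputs)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

print(f"mean latency: {np.mean(latencies_ms):.2f} ms/batch")
print(f"throughput:   {len(latencies_ms) * batch_size / 10.0:.2f} items/sec")
```

For real measurements, the `deepsparse.benchmark` CLI shown in the patches remains the supported entry point; the loop above only makes the reported latency metric concrete.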