diff --git a/README.md b/README.md
index 112c1f81dd..91e4d21b22 100644
--- a/README.md
+++ b/README.md
@@ -76,25 +76,25 @@ pip install deepsparse
 
 ## 🔌 DeepSparse Server
 
-The DeepSparse Server allows you to serve models and pipelines in deployment in CLI. The server runs on top of the popular FastAPI web framework and Uvicorn web server. Install the server using the following command:
+The DeepSparse Server allows you to serve models and pipelines from the terminal. The server runs on top of the popular FastAPI web framework and Uvicorn web server. Install the server using the following command:
 
 ```bash
 pip install deepsparse[server]
 ```
 
-**⭐ Single Model ⭐**
+### Single Model
 
 Once installed, the following example CLI command is available for running inference with a single BERT model:
 
 ```bash
 deepsparse.server \
     --task question_answering \
-    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none"
+    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
 ```
 
 To look up arguments run: `deepsparse.server --help`.
 
-**⭐ Multiple Models ⭐**
+### Multiple Models
 
 To serve multiple models in your deployment you can easily build a `config.yaml`. In the example below, we define two BERT models in our configuration for the question answering task:
 
 ```yaml
@@ -104,7 +104,7 @@ models:
       batch_size: 1
       alias: question_answering/base
     - task: question_answering
-      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant-aggressive_95
+      model_path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
       batch_size: 1
      alias: question_answering/pruned_quant
 ```
@@ -113,6 +113,9 @@ Finally, after your `config.yaml` file is built, run the server with the config
 ```bash
 deepsparse.server --config_file config.yaml
 ```
+
+See [Getting Started with the DeepSparse Server](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server) for more info.
+
 ## 📜 DeepSparse Benchmark
 
 The benchmark tool is available on your CLI to run expressive model benchmarks on the DeepSparse Engine with minimal parameters.
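+
+As a quick sketch, the same sparse BERT stub used in the server examples above can be benchmarked directly; the batch size and timing values below are illustrative only, and the usage listing that follows (or `deepsparse.benchmark --help`) shows every available option:
+
+```bash
+deepsparse.benchmark \
+    "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni" \
+    -b 1 -t 30
+```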
@@ -124,27 +127,26 @@ deepsparse.benchmark [-h] [-b BATCH_SIZE] [-shapes INPUT_SHAPES]
                      [-ncores NUM_CORES] [-s {async,sync}] [-t TIME]
                      [-nstreams NUM_STREAMS] [-pin {none,core,numa}]
                      [-q] [-x EXPORT_PATH]
-                      model_path
+                     model_path
 ```
 
 [Getting Started with CLI Benchmarking](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark_model) includes examples of select inference scenarios:
 - Synchronous (Single-stream) Scenario
 - Asynchronous (Multi-stream) Scenario
 
-__ __
-## 👩💻 NLP Inference | Question Answering
+
+## 👩💻 NLP Inference Example
 
 ```python
 from deepsparse.transformers import pipeline
 
 # SparseZoo model stub or path to ONNX file
-onnx_filepath="zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned-aggressive_98"
+model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
 
 qa_pipeline = pipeline(
     task="question-answering",
-    model_path=onnx_filepath,
-    num_cores=None, # uses all available CPU cores by default
+    model_path=model_path,
 )
 
 my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")
@@ -154,20 +156,19 @@ NLP Tutorials:
 - [Getting Started with Hugging Face Transformers 🤗](https://github.com/neuralmagic/deepsparse/tree/main/examples/huggingface-transformers)
 
 Tasks Supported:
-- Text Classification (Sentiment Analysis)
-- Question Answering
-- Masked Language Modeling (MLM)
-
-__ __
+- [Token Classification: Named Entity Recognition](https://neuralmagic.com/use-cases/sparse-named-entity-recognition/)
+- [Text Classification: Multi-Class](https://neuralmagic.com/use-cases/sparse-multi-class-text-classification/)
+- [Text Classification: Binary](https://neuralmagic.com/use-cases/sparse-binary-text-classification/)
+- [Text Classification: Sentiment Analysis](https://neuralmagic.com/use-cases/sparse-sentiment-analysis/)
+- [Question Answering](https://neuralmagic.com/use-cases/sparse-question-answering/)
 
 ## 🦉 SparseZoo ONNX vs. Custom ONNX Models
 
 DeepSparse can accept ONNX models from two sources:
 
-1. `SparseZoo ONNX`: our open-source collection of sparse models available for download. [SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from [SparseML.](https://github.com/neuralmagic/sparseml)
-
-2. `Custom ONNX`: Your own ONNX model, can be dense or sparse. Plug in your model to compare performance with other solutions.
+- **SparseZoo ONNX**: our open-source collection of sparse models available for download. [SparseZoo](https://github.com/neuralmagic/sparsezoo) hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml).
+- **Custom ONNX**: your own ONNX model, which can be dense or sparse. Plug in your model to compare performance with other solutions.
 
 ```bash
 > wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx
@@ -188,15 +189,13 @@ inputs = generate_random_inputs(onnx_filepath, batch_size)
 engine = compile_model(onnx_filepath, batch_size)
 outputs = engine.run(inputs)
 ```
-Compatibility/Support Notes
+Compatibility/Support Notes:
 - ONNX version 1.5-1.7
 - ONNX opset version 11+
 - ONNX IR version has not been tested at this time
 
 The [GitHub repository](https://github.com/neuralmagic/deepsparse) includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models.
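+
+As a sketch, the benchmark CLI described above should also accept the downloaded file via its `model_path` argument, which makes it easy to compare a dense custom model against a sparse one on your hardware (the batch size here is arbitrary):
+
+```bash
+deepsparse.benchmark mobilenetv2-7.onnx -b 16
+```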
 
-__ __
-
 ## Scheduling Single-Stream, Multi-Stream, and Elastic Inference
 
 The DeepSparse Engine offers up to three types of inferences based on your use case. Read more details here: [Inference Types](https://github.com/neuralmagic/deepsparse/blob/main/docs/source/scheduler.md).
 
@@ -216,7 +215,6 @@ PRO TIP: The most common use cases for the multi-stream scheduler are where para
 3 ⚡ Elastic scheduling: requests execute in parallel, but not multiplexed on individual NUMA nodes.
 
 Use Case: A workload that might benefit from the elastic scheduler is one in which multiple requests need to be handled simultaneously, but where performance is hindered when those requests have to share an L3 cache.
 
-__ __
 
 ## 🧰 CPU Hardware Support
 
@@ -233,34 +231,29 @@ Here is a table detailing specific support for some algorithms over different mi
 
 ## Resources
 
-| Documentation | Versions | Info |
-|---|---|---|
-|
-[DeepSparse](https://docs.neuralmagic.com/deepsparse/)
-
-[SparseML](https://docs.neuralmagic.com/sparseml/)
-[SparseZoo](https://docs.neuralmagic.com/sparsezoo/)
-
-[Sparsify](https://docs.neuralmagic.com/sparsify/)
- |
-stable : : [DeepSparse](https://pypi.org/project/deepsparse)
-nightly (dev) : : [DeepSparse-Nightly](https://pypi.org/project/deepsparse-nightly/)
-releases : : [GitHub](https://github.com/neuralmagic/deepsparse/releases)
- |
-[Blog](https://www.neuralmagic.com/blog/)
-[Resources](https://www.neuralmagic.com/resources/)
- |
+#### Libraries
+- [DeepSparse](https://docs.neuralmagic.com/deepsparse/)
+- [SparseML](https://docs.neuralmagic.com/sparseml/)
+- [SparseZoo](https://docs.neuralmagic.com/sparsezoo/)
+- [Sparsify](https://docs.neuralmagic.com/sparsify/)
+
+#### Versions
+- [DeepSparse](https://pypi.org/project/deepsparse) | stable
+- [DeepSparse-Nightly](https://pypi.org/project/deepsparse-nightly/) | nightly (dev)
+- [GitHub](https://github.com/neuralmagic/deepsparse/releases) | releases
+
+#### Info