# Optimizing LLMs with One-Shot Pruning and Quantization

This guide shows how to optimize large language models (LLMs) for efficient text generation using neural network compression techniques such as sparsification and quantization.
You'll learn how to:

- <b>Sparsify Models:</b> Apply pruning techniques to eliminate redundant parameters from an LLM, reducing its size and computational requirements.
- <b>Quantize Models:</b> Lower the numerical precision of model weights and activations for faster inference with minimal impact on accuracy.
- <b>Evaluate Performance:</b> Measure the impact of sparsification and quantization on model accuracy.
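
To make the two techniques concrete, here is a minimal, dependency-free sketch of magnitude pruning and symmetric int8 quantization on a handful of weights. It is an illustration only: the function names and sample values are hypothetical, and the SparseGPT-based flow used later in this guide chooses which weights to remove in a more sophisticated way than plain magnitude.

```python
# Toy illustration: magnitude pruning and symmetric int8 quantization.
# This is NOT how SparseML implements SparseGPT; it only shows the two ideas.

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # guard against all zeros
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.82, -0.05, 0.31, -0.67, 0.02, 0.44, -0.09, 0.15]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

sparsity = pruned.count(0.0) / len(pruned)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print(f"sparsity={sparsity:.2f}, scale={scale:.5f}, max_error={max_error:.5f}")
```

Pruned weights stay exactly zero after quantization, and the rounding error of each remaining weight is at most half the scale.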

## Prerequisites

- <b>Training Environment:</b> A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](https://docs.neuralmagic.com/get-started/install/#prerequisites).

In [None]:
!pip install "sparseml[transformers]==1.7"



## Sparsifying a Llama Model

We'll use a pre-trained, unoptimized [TinyLlama 1.1B chat model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) from the Hugging Face Hub.
The model is referenced by the following stub:
```text
TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

For additional models that work with SparseML, consider the following options:
- Explore pre-sparsified [Generative AI models in the SparseZoo](https://sparsezoo.neuralmagic.com/?modelSet=generative_ai).
- Try out popular LLMs from the [Hugging Face Model Hub](https://huggingface.co/models?pipeline_tag=causal-lm).

### Data Preparation

SparseML requires a dataset to be used for calibration during the sparsification process.
For this example, we'll use the Open Platypus dataset, which is available in the Hugging Face dataset hub and can be loaded as follows:

In [None]:
from datasets import load_dataset

dataset = load_dataset("garage-bAInd/Open-Platypus")
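
The compression step later in this guide merges each record's `instruction` and `output` fields into a single `text` field. As a small sketch of that mapping, the record below is a hypothetical example shaped like a dataset row, not an actual entry from Open Platypus.

```python
# Hypothetical record with the two fields that `format_data` reads
record = {
    "instruction": "What is 2 + 2?",
    "output": "2 + 2 equals 4.",
}


def format_data(data):
    # Concatenate the prompt and the answer into one "text" field
    return {"text": data["instruction"] + data["output"]}


formatted = format_data(record)
print(formatted["text"])
```

Note that the two fields are joined with no separator, exactly as in the function used in the compression cell.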


### One-Shot Compression

Pruning and quantization can be applied to an LLM without fine-tuning by combining recipes, the SparseGPT algorithm, and the `compress` command in SparseML.
This combination is a quick way to sparsify a model: it yields medium compression levels with minimal accuracy loss and enables efficient inference.

The code below applies one-shot sparsification to the TinyLlama chat model using a recipe.
The recipe uses the `SparseGPTModifier` to apply 50% sparsity and quantization (int8 weights and activations) to the targeted layers within the model.

In [None]:
from sparseml.transformers import (
    SparseAutoModelForCausalLM, SparseAutoTokenizer, compress
)

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = SparseAutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = SparseAutoTokenizer.from_pretrained(model_id)

def format_data(data):
    return {
        "text": data["instruction"] + data["output"]
    }

dataset = dataset.map(format_data)

# Raw string so the regex escape `\d` in `targets` is passed through unchanged
recipe = r"""
compression_stage:
    run_type: oneshot
    oneshot_modifiers:
        QuantizationModifier:
            ignore: [LlamaRotaryEmbedding, LlamaRMSNorm, SiLUActivation, QuantizableMatMul]
            post_oneshot_calibration: true
            scheme_overrides:
                Linear:
                    weights:
                        num_bits: 8
                        symmetric: true
                        strategy: channel
                Embedding:
                    input_activations: null
                    weights:
                        num_bits: 8
                        symmetric: false
        SparseGPTModifier:
            sparsity: 0.5
            quantize: True
            targets: ['re:model.layers.\d*$']
"""

compress(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    recipe=recipe,
    output_dir="./one-shot-example",
)

2024-04-09 20:33:32 sparseml.transformers.utils.helpers INFO     model_path is a huggingface model id. Attempting to download recipe from https://huggingface.co/
2024-04-09 20:33:32 sparseml.transformers.utils.helpers INFO     Found recipe: recipe.yaml for model id: TinyLlama/TinyLlama-1.1B-Chat-v1.0. Downloading...
2024-04-09 20:33:32 sparseml.transformers.utils.helpers INFO     Unable to to find recipe recipe.yaml for model id: TinyLlama/TinyLlama-1.1B-Chat-v1.0: 404 Client Error. (Request ID: Root=1-6615a61c-65654cad30dc82d40293b87a;9622e80a-d4fb-4a6e-bd44-2d2c2dd7456b)

Entry Not Found for url: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/recipe.yaml.. Skipping recipe resolution.

{'train': ['input_ids', 'attention_mask', 'labels']}


2024-04-09 20:33:55 sparseml.modifiers.quantization.pytorch INFO     Running QuantizationModifier calibration with 512 samples...
100%|██████████| 512/512 [04:42<00:00,  1.81it/s]
2024-04-09 20:38:38 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.0 for compression
2024-04-09 20:38:38 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.1 for compression
2024-04-09 20:38:38 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.2 for compression
2024-04-09 20:38:38 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.3 for compression

After running the above code, the model is pruned to 50% sparsity and quantized, resulting in a smaller model ready for efficient inference.
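
The `targets` entry in the recipe selects modules with the pattern `re:model.layers.\d*$`. The sketch below shows how such a pattern could pick out whole decoder layers, which is consistent with the `Preparing model.layers.N for compression` log lines. It assumes the `re:` prefix marks a regular expression that is matched against module names from the start of the name. The module names are hypothetical examples in the style of a Llama model, and SparseML's actual matching logic may differ.

```python
import re

# Assumption: "re:<pattern>" means "<pattern>" is a regex applied to module names
target = r"re:model.layers.\d*$"
pattern = re.compile(target[len("re:"):])

# Hypothetical module names in the style of a Llama model
module_names = [
    "model.embed_tokens",
    "model.layers.0",
    "model.layers.0.self_attn.q_proj",
    "model.layers.21",
    "model.norm",
    "lm_head",
]

# The trailing `$` excludes submodules such as `...self_attn.q_proj`
matched = [name for name in module_names if pattern.match(name)]
print(matched)
```

Because the dots in the pattern are unescaped, they match any character. That is harmless for these names, but worth keeping in mind when writing stricter patterns.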

### Inference

To test the model's generation capabilities, use the following code to generate text with the compressed model in PyTorch:


In [None]:
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer
from sparseml.core.utils import session_context_manager

model_path = "./one-shot-example/stage_compression"

with session_context_manager():
    model = SparseAutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")
tokenizer = SparseAutoTokenizer.from_pretrained(model_path)

chat = [
    {"role": "user", "content": "Tell me about large language models"}
]

inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

2024-04-09 21:04:58 sparseml.transformers.utils.helpers INFO     Found recipe in the model_path: ./one-shot-example/stage_compression/recipe.yaml
2024-04-09 21:04:58 sparseml.core.recipe.recipe INFO     Loading recipe from file ./one-shot-example/stage_compression/recipe.yaml
manager stage: Model structure initialized
2024-04-09 21:04:58 sparseml.pytorch.model_load.helpers INFO     Applied an unstaged recipe to the model at ./one-shot-example/stage_compression
2024-04-09 21:05:03 sparseml.pytorch.model_load.helpers INFO     Reloaded 3302 model params for SparseML Recipe from ./one-shot-example/stage_compression

['<|user|>\nTell me about large language models \n<|assistant|>\nLarge language models are computer programs that learn to recognize and generate large sets of words or phrases from a large corpus of text. These models can be trained on large corpora of text using various techniques such as supervised learning, unsupervised learning, and reinforcing learning. The goal of large language models is to learn to recognize and generate large sets of words or phrases from a large corpus of text. These models can be trained on large corpora of text using various techniques such as supervised learning, unsupervised learning, and reinforing learning. Large language models are used in various ways to learn to recognize and generate large sets of words or phrases from a large corpus of text. These models can be trained on large corpora of text using various techniques such as supervised learning, unsupervised learning, and reinforing learning. Large language models are used in various ways to lear

### Evaluating Accuracy

Evaluating the model's accuracy is important to ensure it meets the desired performance requirements.
To do so, we can use the following code to evaluate the model's perplexity on a sample dataset:

In [None]:
from sparseml import evaluate

# Named `eval_result` to avoid shadowing Python's built-in `eval`
eval_result = evaluate(
    "./one-shot-example/stage_compression",
    datasets="openai_humaneval",
    integration="perplexity",
    text_column_name=["prompt", "canonical_solution"]
)
print(eval_result)

2024-04-09 21:13:19 sparseml.evaluation.registry INFO     Auto collected perplexity integration for eval
2024-04-09 21:13:30 sparseml.transformers.utils.helpers INFO     Found recipe in the model_path: ./one-shot-example/stage_compression/recipe.yaml
2024-04-09 21:13:30 sparseml.core.recipe.recipe INFO     Loading recipe from file ./one-shot-example/stage_compression/recipe.yaml
manager stage: Model structure initialized
2024-04-09 21:13:30 sparseml.pytorch.model_load.helpers INFO     Applied an unstaged recipe to the model at ./one-shot-example/stage_compression


100%|██████████| 164/164 [32:02<00:00, 11.72s/it]


formatted=[Evaluation(task='text-generation', dataset=Dataset(type='text-generation', name='openai_humaneval', config=None, split=None), metrics=[Metric(name='perplexity', value=5.673565062080941)], samples=None)] raw={'mean_perplexity': 5.673565062080941}
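
For intuition about the metric reported above: perplexity is the exponential of the mean negative log-likelihood the model assigns to the tokens, so lower is better. The following is a minimal sketch of that definition using hypothetical per-token probabilities. It is not SparseML's implementation, which may aggregate over samples differently (the output key is `mean_perplexity`).

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to three observed tokens
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(log_probs))

# A model that is uniformly unsure among 10 tokens has perplexity of about 10
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))
```

A perplexity of about 5.67 can therefore be read, loosely, as the model being as uncertain as if it were choosing uniformly among five or six tokens at each step.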


The PyTorch code above, however, does not take advantage of the model's sparsity to speed up inference.
To do that, export the model to ONNX so that it can run efficiently on CPUs with DeepSparse.
SparseML provides an `export` function for this:

In [None]:
from sparseml import export

export(
    "./one-shot-example/stage_compression",
    task="text-generation",
    sequence_length=1024,
    target_path="./exported"
)

2024-04-09 21:46:02 sparseml.export.export INFO     Starting export for transformers model...
2024-04-09 21:46:02 sparseml.export.export INFO     Creating model for the export...
2024-04-09 21:46:15 sparseml.transformers.utils.helpers INFO     Found recipe in the model_path: /content/one-shot-example/stage_compression/recipe.yaml
2024-04-09 21:46:15 sparseml.core.recipe.recipe INFO     Loading recipe from file /content/one-shot-example/stage_compression/recipe.yaml
manager stage: Model structure initialized
2024-04-09 21:46:15 sparseml.pytorch.model_load.helpers INFO     Applied an unstaged recipe to the model at /cont

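The exported model, located at `./exported`, can now be used for efficient inference with DeepSparse.
Optionally, log in to the Hugging Face Hub and upload the deployment directory so the model can later be referenced by its repository name: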

In [None]:
!huggingface-cli login
!huggingface-cli upload mgoin/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds exported/deployment/



## Deploy Sparse LLMs with DeepSparse

[DeepSparse](https://github.com/neuralmagic/deepsparse) is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference.

DeepSparse achieves fast LLM inference through:
* sparse kernels for speedups and memory savings from unstructured sparse weights.
* 8-bit weight and activation quantization support.
* efficient usage of cached attention keys and values for minimal memory movement.
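
To get a feel for the memory savings mentioned in the list above, here is a back-of-the-envelope sketch. It counts only the bytes needed for nonzero weight values and ignores the index and metadata overhead that any real sparse storage format adds, so treat the numbers as rough lower bounds rather than DeepSparse measurements. The function name is hypothetical and the parameter count is approximate.

```python
def estimated_weight_bytes(num_params, bits_per_weight, sparsity=0.0):
    """Rough size of the nonzero weight values only (no index overhead)."""
    nonzero = num_params * (1.0 - sparsity)
    return nonzero * bits_per_weight / 8


NUM_PARAMS = 1_100_000_000  # TinyLlama 1.1B, approximate parameter count

dense_fp32 = estimated_weight_bytes(NUM_PARAMS, 32)
dense_int8 = estimated_weight_bytes(NUM_PARAMS, 8)
sparse_int8 = estimated_weight_bytes(NUM_PARAMS, 8, sparsity=0.5)

for label, size in [
    ("dense fp32", dense_fp32),
    ("dense int8", dense_int8),
    ("50% sparse int8", sparse_int8),
]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Under these assumptions, 8-bit quantization cuts the weight footprint by 4x relative to 32-bit floats, and 50% sparsity halves it again.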

In this section, we run the sparse, quantized TinyLlama model created above on a summarization task.

First, we need to install DeepSparse with LLM dependencies:

In [None]:
!pip install "deepsparse[transformers]"


Next, point to the exported model:

In [None]:
model_path = "exported/deployment/"

The task is to summarize a short text about climate change. The prompt below consists of an instruction followed by the text to summarize:

In [None]:
text_to_summarize = "Climate change is a global problem that is affecting the planet in numerous ways. Rising temperatures are causing glaciers to melt, sea levels to rise, and weather patterns to become more extreme. These changes are having a significant impact on ecosystems, agriculture, and human health. In order to mitigate the effects of climate change, it is essential to reduce greenhouse gas emissions by transitioning to renewable energy sources, implementing energy-efficient technologies, and encouraging sustainable practices in various sectors such as transportation and agriculture. Additionally, adapting to the inevitable consequences of climate change is crucial, which involves developing resilient infrastructure, improving disaster preparedness, and supporting vulnerable communities. Addressing climate change requires a coordinated global effort from governments, businesses, and individuals to ensure a sustainable future for the planet and its inhabitants."

prompt = f"""
Please summarize the following text, focusing on the key points and main ideas. Keep the summary concise, around 3-5 sentences.

Text:
{text_to_summarize}
"""

print(prompt)



Next, format the prompt with the chat template that the model was fine-tuned with. The output of this cell shows the final input to the model before tokenization.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "user", "content": prompt}
]
formatted_prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
print(formatted_prompt)

<|user|>

Please summarize the following text, focusing on the key points and main ideas. Keep the summary concise, around 3-5 sentences.

Text:
Climate change is a global problem that is affecting the planet in numerous ways. Rising temperatures are causing glaciers to melt, sea levels to rise, and weather patterns to become more extreme. These changes are having a significant impact on ecosystems, agriculture, and human health. In order to mitigate the effects of climate change, it is essential to reduce greenhouse gas emissions by transitioning to renewable energy sources, implementing energy-efficient technologies, and encouraging sustainable practices in various sectors such as transportation and agriculture. Additionally, adapting to the inevitable consequences of climate change is crucial, which involves developing resilient infrastructure, improving disaster preparedness, and supporting vulnerable communities. Addressing climate change requires a coordinated global effort from 
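
The `<|user|>` marker in the output above is added by the tokenizer's chat template. As a rough illustration of what such a template does, the hypothetical helper below rebuilds a similar string by hand. The role markers are inferred from the outputs printed in this guide; the end-of-sequence token and the exact whitespace are assumptions. In practice, always rely on `tokenizer.apply_chat_template` rather than a hand-written formatter.

```python
def format_chat(messages, eos_token="</s>"):
    """Hypothetical stand-in for a tokenizer chat template (assumed layout)."""
    parts = []
    for message in messages:
        # Each turn: role marker, newline, content, end-of-sequence token
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    # Generation prompt: cue the model to respond as the assistant
    parts.append("<|assistant|>\n")
    return "".join(parts)


example_chat = [{"role": "user", "content": "Tell me about large language models"}]
example_prompt = format_chat(example_chat)
print(example_prompt)
```

Ending the string with the assistant marker is what `add_generation_prompt=True` achieves: the model continues from the point where the assistant's reply should begin.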

### Pipeline

Now let's plug the model and text into DeepSparse. DeepSparse Pipelines are designed to mirror the Hugging Face Transformers API closely, ensuring a familiar experience if you've worked with Transformers before.
The following code demonstrates how to create a pipeline for text generation using the sparsified LLM you just made:

In [None]:
from deepsparse import TextGeneration

pipeline = TextGeneration(model_path)
result = pipeline(formatted_prompt)

print(result.generations[0].text)

Climate change is a global problem that affects the planet in numerous ways. Rising temperatures are causing glaciers to melt, sea levels are rising, and weather patterns are becoming more extreme. These changes are having a significant impact on ecosystems, agriculture, and human health. Climat change is essential to mitigate the effects of climate change, which is crucial to mitigate the effects of climate change, which is crucial to mitigate the


The resulting output printed to the console will be the generated text from the model based on the input prompt.

### Server

To make your LLM accessible as a web service, you'll wrap it in a DeepSparse Server.
The Server lets you interact with the model using HTTP requests, making integrating with web applications, microservices, or other systems easy.
DeepSparse Server provides an [OpenAI-compatible integration](https://platform.openai.com/docs/api-reference/completions), so requests and responses follow the OpenAI API format.

First we need to install the server dependencies with DeepSparse:


In [None]:
!pip install "deepsparse[server]" -qqqqq

The following command starts a DeepSparse Server with the sparsified LLM:

In [None]:
!deepsparse.server --integration openai "hf:mgoin/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds"

With the server running, you can send an HTTP request that conforms to the OpenAI spec to generate text. You can go to http://localhost:5543/docs to learn more about the available endpoints.

Below are examples of sending a request to the server with Python and with `curl`:

In [None]:
import requests
import json

url = "http://localhost:5543/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "hf:mgoin/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds",
    "messages": "Large language models are",
    "stream": True
}

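# stream=True makes `requests` yield the body incrementally instead of buffering it
response = requests.post(url, headers=headers, data=json.dumps(data), stream=True)

if response.status_code == 200:
    for chunk in response.iter_content(chunk_size=128):
        # Decode and print each data chunk; tolerate characters split across chunks
        print(chunk.decode('utf-8', errors='replace'))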
else:
    print("Request failed with status code:", response.status_code)

In [None]:
!curl http://localhost:5543/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "hf:mgoin/TinyLlama-1.1B-Chat-v1.0-pruned50-quant-ds", "messages": "Say this is a test", "stream": true}'

The resulting output will be the generated text from the model based on the input prompt.
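
When `stream` is enabled, OpenAI-style servers typically send the response as server-sent events: lines of the form `data: {json}`, ending with `data: [DONE]`. The sketch below reassembles the generated text from such lines. It assumes the DeepSparse Server emits chunks in the same shape as OpenAI chat-completion streaming, which is not verified here. The sample lines are hypothetical.

```python
import json


def extract_stream_text(lines):
    """Collect text deltas from OpenAI-style server-sent event lines."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            delta = choice.get("delta", {})
            pieces.append(delta.get("content") or "")
    return "".join(pieces)


# Hypothetical stream, shaped like OpenAI chat-completion chunks
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Large language"}}]}',
    'data: {"choices": [{"delta": {"content": " models are"}}]}',
    "data: [DONE]",
]
print(extract_stream_text(sample_lines))
```

With `requests`, the same function could be fed `response.iter_lines(decode_unicode=True)` from a request made with `stream=True`.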
