# Apply SparseGPT to LLMs and deploy with nm-vllm

This notebook walks through how to sparsify a pretrained LLM with SparseGPT and then deploy it with nm-vllm. Pruning sets a fraction of the model's weights to zero (50% in this example), which sparse inference kernels can exploit. The main benefits are lower latency and lower memory usage at inference time.

This notebook requires an NVIDIA GPU with compute capability >= 8.0 (Ampere or newer) because of Marlin kernel restrictions. It will not run on a T4 or V100.


In [None]:
!pip install "sparseml-nightly[llm]==1.7.0.20240304"


# Apply SparseGPT

After installing SparseML, you are ready to prune weights from a model.
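To make "pruning weights" concrete, the sketch below zeroes out half of the entries in a small weight matrix. It uses plain magnitude pruning as a stand-in: SparseGPT itself relies on calibration data to decide which weights to remove and how to adjust the rest, which this toy example does not attempt. The function names and sample values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a rectangular 2-D list.

    Illustration only: this is magnitude pruning, not SparseGPT.
    """
    flat = [abs(w) for row in weights for w in row]
    num_to_prune = int(len(flat) * sparsity)
    # Flat indices sorted from smallest to largest magnitude
    order = sorted(range(len(flat)), key=lambda i: flat[i])
    pruned = set(order[:num_to_prune])
    ncols = len(weights[0])
    return [
        [0.0 if (r * ncols + c) in pruned else w for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


def measured_sparsity(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


toy_weights = [
    [0.9, -0.1, 0.4, -0.05],
    [-0.7, 0.2, -0.03, 0.6],
]
pruned_weights = magnitude_prune(toy_weights, sparsity=0.5)
print(pruned_weights)
print(measured_sparsity(pruned_weights))
```

The pruned matrix keeps its shape; only the values change. This is the sense in which a sparsity of 0.5 is used in the rest of this notebook.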

Below is an example of how to prune [llama2.c-stories110M](https://huggingface.co/Xenova/llama2.c-stories110M) using SparseGPT. This is a model that was finetuned on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset for generating simple short stories.

We will be using 512 samples from the [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) dataset to calibrate the post-training compression.
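The recipe in the next cell selects what to prune with the target `re:model.layers.\d*$`. A rough way to see what that pattern covers is to run it with Python's `re` module against a few module names. The names below are hypothetical examples in the usual Hugging Face Llama naming style, and stripping the `re:` prefix and applying `re.match` is an assumption about how SparseML interprets the target, not a description of its implementation.

```python
import re

# Target string as written in the recipe; the "re:" prefix marks it as a regex.
target = r"re:model.layers.\d*$"
pattern = re.compile(target[len("re:"):])

# Hypothetical module names, for illustration only.
candidate_names = [
    "model.embed_tokens",
    "model.layers.0",
    "model.layers.11",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]

matched = [name for name in candidate_names if pattern.match(name)]
print(matched)
```

Under that assumption, the trailing `$` means the pattern selects whole decoder blocks such as `model.layers.0` rather than their individual submodules, and it leaves the embeddings and `lm_head` alone.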

In [None]:
import sparseml.transformers

original_model_name = "Xenova/llama2.c-stories110M"
calibration_dataset = "open_platypus"
output_directory = "output/"

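# The recipe prunes 50% of the weights (sparsity: 0.5) in every module matched by `targets`.
# A raw string keeps the regex escape `\d` intact and avoids Python's invalid-escape warning.
recipe = r"""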
test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      targets: ['re:model.layers.\d*$']
"""

# Apply SparseGPT to the model
sparseml.transformers.oneshot(
    model=original_model_name,
    dataset=calibration_dataset,
    recipe=recipe,
    output_dir=output_directory,
)



2024-03-04 23:49:38 sparseml.transformers.utils.helpers INFO     model_path is a huggingface model id. Attempting to download recipe from https://huggingface.co/
2024-03-04 23:49:38 sparseml.transformers.utils.helpers INFO     Found recipe: recipe.yaml for model id: Xenova/llama2.c-stories110M. Downloading...
2024-03-04 23:49:38 sparseml.transformers.utils.helpers INFO     Unable to to find recipe recipe.yaml for model id: Xenova/llama2.c-stories110M: 404 Client Error. (Request ID: Root=1-65e65e12-7daae3201ed25ff46fa6eb03;fdd9c794-ecda-41e3-9849-b2927998cd60)

Entry Not Found for url: https://huggingface.co/Xenova/llama2.c-stories110M/resolve/main/recipe.yaml.. Skipping recipe resolution.
2024-03-04 23:49:40 sparseml.core.logger.logger INFO     Logging all SparseML modifier-level logs to sparse_logs/04-03-2024_23.49.40.log
2024-03-04 23:50:55 sparseml.transformers.finetune.runner INFO     *** One Shot ***


{'train': ['input_ids', 'attention_mask', 'labels']}


test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      targets: ['re:model.layers.\d*$']

2024-03-04 23:50:57 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.0 for compression
2024-03-04 23:50:57 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.1 for compression
2024-03-04 23:50:57 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.2 for compression
2024-03-04 23:50:57 sparseml.modifiers.pruning.wanda.pytorch INFO     Preparing model.layers.3 for compression

In [None]:
!ls -alh output

total 421M
drwxr-xr-x 2 root root 4.0K Mar  4 23:55 .
drwxr-xr-x 1 root root 4.0K Mar  4 23:50 ..
-rw-r--r-- 1 root root  664 Mar  4 23:55 config.json
-rw-r--r-- 1 root root  119 Mar  4 23:55 generation_config.json
-rw-r--r-- 1 root root 418M Mar  4 23:55 pytorch_model.bin
-rw-r--r-- 1 root root  143 Mar  4 23:55 recipe.yaml
-rw-r--r-- 1 root root  434 Mar  4 23:55 special_tokens_map.json
-rw-r--r-- 1 root root  827 Mar  4 23:55 tokenizer_config.json
-rw-r--r-- 1 root root 1.8M Mar  4 23:55 tokenizer.json
-rw-r--r-- 1 root root 489K Mar  4 23:55 tokenizer.model
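The listing shows `pytorch_model.bin` at 418M, which is about what a dense checkpoint of this model would occupy. A back-of-the-envelope check suggests why: pruned weights are set to zero but are still written out as ordinary values. The sketch below assumes roughly 110 million parameters (taken from the model name) stored as 4-byte FP32 values; both figures are assumptions rather than values read from the checkpoint.

```python
NUM_PARAMS = 110_000_000  # assumption: parameter count implied by "stories110M"
BYTES_PER_PARAM = 4       # assumption: weights saved as FP32
SPARSITY = 0.5            # from the recipe

# Dense storage cost, in the MiB units that `ls -h` reports
dense_mib = NUM_PARAMS * BYTES_PER_PARAM / 2**20

# Zeroed weights still take up space in a dense checkpoint
zero_params = int(NUM_PARAMS * SPARSITY)

print(f"dense checkpoint size: ~{dense_mib:.0f} MiB")
print(f"weights set to zero but still stored: {zero_params:,}")
```

So the on-disk size is not where the savings show up in this workflow; the benefit comes from the sparse kernels used at inference time.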


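**Optional**: You can upload your compressed model directly to the Hugging Face Hub, which makes it easy to keep track of your compressed model variations and to deploy them with nm-vllm on new systems. This requires being logged in with a token that has write access, and the target repository must already exist.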

In [None]:
# Upload the output model to Hugging Face Hub
from huggingface_hub import HfApi

final_model_name = "nm-testing/llama2.c-stories110M-pruned50"

HfApi().upload_folder(
    folder_path=output_directory,
    repo_id=final_model_name,
)

'https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50/tree/main/'

# Deploying Sparse LLMs with nm-vllm

The [nm-vllm](https://github.com/neuralmagic/nm-vllm) package is a high-throughput and memory-efficient inference and serving engine for LLMs. nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and acceleration for sparse models.
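To get a feel for where a memory reduction can come from, the sketch below compares dense 16-bit storage with a hypothetical compressed layout that keeps only the nonzero 16-bit values plus a one-bit-per-weight mask. This layout is an assumption made purely for illustration; the format nm-vllm actually uses may differ, and the parameter count is the rough figure implied by the model name.

```python
NUM_PARAMS = 110_000_000  # assumption: parameter count implied by "stories110M"
SPARSITY = 0.5            # from the recipe
BYTES_PER_VALUE = 2       # 16-bit weights

# Dense layout: every weight is stored, zero or not
dense_bytes = NUM_PARAMS * BYTES_PER_VALUE

# Hypothetical compressed layout: nonzero values plus a 1-bit mask per weight
nonzero_bytes = NUM_PARAMS * (1 - SPARSITY) * BYTES_PER_VALUE
mask_bytes = NUM_PARAMS / 8
compressed_bytes = nonzero_bytes + mask_bytes

ratio = compressed_bytes / dense_bytes
print(f"compressed / dense = {ratio:.4f}")
```

Under these assumptions, the mask overhead keeps the saving a little below the 50% that the sparsity level alone would suggest.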

First we need to install the package:

In [None]:
!pip install nm-vllm[sparse]

Next, clean up the transformers installation. SparseML brought in `nm-transformers-nightly`, so we uninstall it and install the upstream `transformers` package instead.

In [None]:
!pip uninstall -y nm-transformers-nightly transformers -qqq
!pip install transformers "tokenizers<0.15" -qqq

Finally we can run the model we just pruned with nm-vllm. All that is required to enable the compressed kernel is specifying `sparsity="sparse_w16a16"` as an argument.

In [None]:
from vllm import LLM, SamplingParams

# Create a sparse LLM
llm = LLM(
    "nm-testing/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",
)

prompts = [
    "Once upon a time, there was a little car named Beep.",
    "One day, a little fish named Fin",
    "Once upon a time, in a big lake,",
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=200)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"\nGenerated text: {prompt}{generated_text}\n")

# Cleanup
del llm
import gc
gc.collect()

INFO 03-05 00:14:22 llm_engine.py:81] Initializing an LLM engine with config: model='nm-testing/llama2.c-stories110M-pruned50', tokenizer='nm-testing/llama2.c-stories110M-pruned50', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, sparsity=sparse_w16a16, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-05 00:14:22 weight_utils.py:177] Using model weights format ['*.bin']
INFO 03-05 00:14:25 llm_engine.py:340] # GPU blocks: 23363, # CPU blocks: 7281
INFO 03-05 00:14:25 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-05 00:14:25 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory 

Processed prompts: 100%|██████████| 3/3 [00:00<00:00,  3.16it/s]


Generated text: Once upon a time, there was a little car named Beep. Beep loved to drive around and play with his friends. One day, Beep was very hungry and needed fuel to go. He asked his friend, a big truck named Truck, to help him find some fuel.
Truck and Beep went to the fuel place. They looked and looked, but they could not find any fuel. Beep was sad and still hungry. Truck said, "Don't worry, Beep. We will find fuel soon."
Just then, they saw a big truck named Toot. Toot was carrying fuel for Beep. Toot said, "I found fuel for Beep. Let's all eat together." Beep and Truck were happy and ate the fuel. They were not hungry anymore. They all played together and had a fun day. Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day


Generated text: One day, a little fish named Fin was swimming in the sea. He saw a big shark named Shark. Fin was scared of Shark because he was big and had sharp teeth. But Shark was not mean, he just 




0
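The generation cell above passes `temperature=0.0` to `SamplingParams`, which requests greedy decoding: at every step the highest-scoring token is chosen, so repeated runs on the same prompt should give the same text. The toy sketch below shows how temperature reshapes a token distribution. The function name and logits are hypothetical, and this is not vLLM's sampler.

```python
import math


def token_probabilities(logits, temperature):
    """Turn logits into sampling probabilities at a given temperature.

    A temperature of 0 is treated as greedy decoding: all probability
    mass goes to the highest-scoring token.
    """
    if temperature == 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
greedy = token_probabilities(toy_logits, temperature=0.0)
soft = token_probabilities(toy_logits, temperature=1.0)
print(greedy)
print([round(p, 3) for p in soft])
```

With a nonzero temperature the lower-scoring tokens keep some probability, which is why sampled generations vary between runs while greedy ones do not.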

For more details on how to deploy, go to the [nm-vllm GitHub](https://github.com/neuralmagic/nm-vllm). For more details on compression, go to the [SparseML GitHub](https://github.com/neuralmagic/sparseml).

For further support and discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).