# Deploy Compressed LLMs from Hugging Face with nm-vllm

[nm-vllm](https://github.com/neuralmagic/nm-vllm) is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.

This notebook walks through how to deploy compressed models with nm-vllm's latest memory and performance optimizations.

For unstructured sparsity, an NVIDIA GPU with compute capability >= 7.0 (Volta or newer, e.g. V100, T4, A100) is required. For semi-structured sparsity or Marlin quantized kernels, an NVIDIA GPU with compute capability >= 8.0 (Ampere or newer, e.g. A100) is required. This notebook was tested on an A100 on Colab.
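The requirements above can be checked before installing anything heavy. The following is a minimal sketch, not part of nm-vllm: the helper names `supported_features` and `detect_capability` and the `MIN_CAPABILITY` table are hypothetical and simply encode the thresholds stated in the previous paragraph. The GPU lookup uses PyTorch's `torch.cuda.get_device_capability` and degrades to `None` when PyTorch or CUDA is unavailable.

```python
# Minimum CUDA compute capability per kernel family, taken from the text above.
# The dictionary and helper names are illustrative, not an nm-vllm API.
MIN_CAPABILITY = {
    "sparse_w16a16": (7, 0),                  # unstructured sparsity
    "semi_structured_sparse_w16a16": (8, 0),  # 2:4 semi-structured sparsity
    "marlin": (8, 0),                         # Marlin 4-bit quantized kernels
}


def supported_features(capability):
    """Return the kernel families usable at a (major, minor) compute capability."""
    return sorted(
        name
        for name, minimum in MIN_CAPABILITY.items()
        if tuple(capability) >= minimum
    )


def detect_capability():
    """Return the compute capability of GPU 0, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA GPU detected")
else:
    print(f"Compute capability {capability}: {supported_features(capability)}")
```

On a T4 (compute capability 7.5), this sketch would report only the unstructured sparsity kernels as usable.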


In [2]:
!nvidia-smi

Tue Mar  5 21:21:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0              44W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
!pip install nm-vllm[sparse]

Collecting nm-vllm[sparse]
  Downloading nm_vllm-0.1.0-cp310-cp310-manylinux_2_17_x86_64.whl (58.9 MB)
Collecting ninja (from nm-vllm[sparse])
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from nm-vllm[sparse])
  Downloading ray-2.9.3-cp310-cp310-manylinux2014_x86_64.whl (64.9 MB)
Collecting torch==2.1.2 (from nm-vllm[sparse])
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting xformers==0.0.23.post1 (fro

## Model Selection and Support

nm-vllm supports many Hugging Face models out of the box, whether compressed or not. Some architectures of note are:

- GPT-2 (`gpt2`)
- GPT BigCode (`bigcode/starcoder`)
- GPT-J (`EleutherAI/gpt-j-6b`)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-chat-hf`)
- Mistral (`mistralai/Mistral-7B-Instruct-v0.1`)
- Mixtral (`mistralai/Mixtral-8x7B-Instruct-v0.1`)
- MPT (`mosaicml/mpt-7b`)
- OPT (`facebook/opt-66b`)
- Phi (`microsoft/phi-2`)
- Qwen (`Qwen/Qwen-7B-Chat`)
- Qwen2 (`Qwen/Qwen-7B-Chat-beta`)
- StableLM (`stabilityai/stablelm-base-alpha-7b-v2`)
- Starcoder2 (`bigcode/starcoder2-3b`)
- Yi (`01-ai/Yi-34B`)

Neural Magic maintains a variety of compressed models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing). A collection of ready-to-use compressed models is available [here](https://huggingface.co/collections/neuralmagic/compressed-llms-for-nm-vllm-65e73e3d51d3200e34b77431).


#### Model Inference with Weight Sparsity

Developed in collaboration with IST-Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs. They enable removing at least half of the model weights with limited impact on accuracy.
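To make "50% sparse" concrete, here is a toy sketch of unstructured pruning on a short list of weights. It uses plain magnitude pruning (zeroing the smallest absolute values), which is a much simpler technique than SparseGPT or Sparse Fine-tuning and is shown only to illustrate the resulting weight pattern. The function names and the example weights are made up for this illustration.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (toy example)."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)               # [0.9, 0.0, 0.3, 0.0, -0.7, 0.0, 0.0, 0.6]
print(sparsity_of(pruned))  # 0.5
```

The zeros can fall anywhere in the tensor, which is why this pattern is called unstructured.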

nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and acceleration for sparse models.

Here is an example of how to run a 50% sparse [Phi 2 model](https://huggingface.co/neuralmagic/phi-2-pruned50). All that is required to enable the compressed kernel is specifying `sparsity="sparse_w16a16"` as an argument.

In [3]:
from vllm import LLM, SamplingParams

# Create a sparse LLM
llm = LLM("neuralmagic/phi-2-pruned50", sparsity="sparse_w16a16")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"\nGenerated text: {prompt}{generated_text}\n")

# Cleanup
del llm
import gc
gc.collect()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/889 [00:00<?, ?B/s]

INFO 03-05 21:21:59 llm_engine.py:81] Initializing an LLM engine with config: model='neuralmagic/phi-2-pruned50', tokenizer='neuralmagic/phi-2-pruned50', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, sparsity=sparse_w16a16, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/7.37k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 03-05 21:22:09 weight_utils.py:177] Using model weights format ['*.safetensors']


model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

INFO 03-05 21:22:42 llm_engine.py:340] # GPU blocks: 6482, # CPU blocks: 819
INFO 03-05 21:22:44 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-05 21:22:44 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-05 21:22:51 model_runner.py:748] Graph capturing finished in 7 secs.


Processed prompts: 100%|██████████| 4/4 [00:00<00:00,  7.22it/s]


Generated text: Hello, my name is Sarah. I am a social worker who helps families in our community. Families come to me for help because they are going through a tough time. Sometimes, parents have difficulties with their children. I help them talk about their problems, so they can find


Generated text: The president of the United States is sometimes referred to as the "commander-in-chief." The commander-in-chief holds a symbolic role, but he does not have the authority to initiate military action. Instead, the commander-in-chief must rely on the United States


Generated text: The capital of France is Paris.

3. Write a regular expression that matches the word "dog" with any number of spaces.

```python
import re

# Define the regular expression
pattern = re.compile(r'\
```


Generated text: The future of AI is in the hands of the people that use it, not the people that make it.
Pre-order your copy of AI: The World in 2050, and be sure to read the pre-order bonus from the authors.






0

There is also support for semi-structured 2:4 sparsity on Ampere GPUs using the `sparsity="semi_structured_sparse_w16a16"` argument:


In [6]:
from vllm import LLM, SamplingParams

# If a TorchDynamo compilation error occurs, suppress it and fall back to eager execution.
import torch._dynamo
torch._dynamo.config.suppress_errors = True

llm = LLM("nm-testing/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = llm.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

# Cleanup
del llm
import gc
gc.collect()

INFO 03-05 21:24:28 llm_engine.py:81] Initializing an LLM engine with config: model='nm-testing/llama2.c-stories110M-pruned2.4', tokenizer='nm-testing/llama2.c-stories110M-pruned2.4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, sparsity=semi_structured_sparse_w16a16, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-05 21:24:28 weight_utils.py:177] Using model weights format ['*.bin']




INFO 03-05 21:24:30 llm_engine.py:340] # GPU blocks: 64811, # CPU blocks: 7281
INFO 03-05 21:24:33 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-05 21:24:33 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-05 21:24:41 model_runner.py:748] Graph capturing finished in 7 secs.


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.56it/s]


3 year old Jack was playing in his garden. He saw a big, shiny box. He wanted to open it, but he couldn't open it. He asked his mom, "Mom, what is inside the box?"
Mom said, "It's a special box. It's very special and it's very expensive. I'm going to open it and see what's inside."
Jack was very curious. He asked his mom, "Can


0
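For reference, the 2:4 pattern means that every group of four consecutive weights contains at most two non-zero values. The sketch below checks that property on a flat list of weights. The helper `is_2_4_sparse` is hypothetical and not part of nm-vllm, and it assumes the groups run along a single row. Which tensor dimension a real kernel groups along is an implementation detail not covered here.

```python
def is_2_4_sparse(row, group_size=4, max_nonzero=2):
    """Return True if every contiguous group of 4 values has at most 2 non-zeros."""
    if len(row) % group_size != 0:
        return False  # the pattern is only defined for whole groups
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        if sum(1 for value in group if value != 0) > max_nonzero:
            return False
    return True


print(is_2_4_sparse([0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.7, 0.0]))  # True
print(is_2_4_sparse([0.5, 0.1, 0.0, -1.2, 0.0, 0.0, 0.0, 0.0]))  # False
```

The second row is also 50% sparse or more overall, but its first group holds three non-zero values, so it does not satisfy the 2:4 constraint.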

#### Model Inference with Marlin (4-bit Quantization)

[GPTQ](https://arxiv.org/abs/2210.17323) is a leading quantization algorithm for LLMs, which enables compressing the model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.
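A back-of-the-envelope calculation shows what going from 16 bits to 4 bits means for weight storage. This is a rough sketch under stated assumptions: the parameter count of 7 billion is a round number for a 7B-class model, and the estimate ignores quantization scales and any layers left unquantized, so real checkpoints come out somewhat larger. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes) at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # assumed round parameter count for a 7B-class model
fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)
print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
# FP16: 14.0 GB, INT4: 3.5 GB, ratio: 4x
```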

To use Marlin within nm-vllm, simply pass the Marlin-quantized model directly to the engine. It will detect the quantization from the model's config.

Here is a demonstration with a [4-bit quantized Llama-2 7B chat](https://huggingface.co/neuralmagic/llama-2-7b-chat-marlin) model:


In [7]:
from vllm import LLM, SamplingParams

llm = LLM("neuralmagic/llama-2-7b-chat-marlin")
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
outputs = llm.generate("Who is the president?", sampling_params)
print(outputs[0].outputs[0].text)

# Cleanup
del llm
import gc
gc.collect()

config.json:   0%|          | 0.00/816 [00:00<?, ?B/s]

INFO 03-05 21:26:26 llm_engine.py:81] Initializing an LLM engine with config: model='neuralmagic/llama-2-7b-chat-marlin', tokenizer='neuralmagic/llama-2-7b-chat-marlin', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=marlin, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


tokenizer_config.json:   0%|          | 0.00/869 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

INFO 03-05 21:26:29 weight_utils.py:177] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

INFO 03-05 21:29:12 llm_engine.py:340] # GPU blocks: 4071, # CPU blocks: 512
INFO 03-05 21:29:12 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-05 21:29:12 model_runner.py:680] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-05 21:29:20 model_runner.py:748] Graph capturing finished in 8 secs.


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.56it/s]



 Business
• The president of the United States is Joe Biden.
• The president is the head of state and government of the United States.
• The president is elected by the people through the Electoral College and serves a four-year term.
• The president's duties include serving as the commander-in-chief of the armed forces, nominating and, with the advice and consent of the Senate, appointing federal judges, and making treaties.



0

#### Integration with OpenAI-Compatible Server

You can also quickly use the same flows with an OpenAI-compatible model server:

In [1]:
!python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/phi-2-pruned50 \
    --sparsity sparse_w16a16

INFO 03-05 21:30:45 api_server.py:229] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='neuralmagic/phi-2-pruned50', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, sparsity='sparse_w16a16', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_
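Once the server is up, it can be queried like any OpenAI-style completions endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8000` (port 8000 is the default shown in the log above) and serves the `/v1/completions` route. The helper names `build_completion_request` and `complete` are hypothetical. Since the server cell blocks while running, the client would need to run from a separate process or terminal.

```python
import json
import urllib.request


def build_completion_request(prompt, model="neuralmagic/phi-2-pruned50",
                             base_url="http://localhost:8000", max_tokens=50):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires the server started above to be running):
# print(complete("The capital of France is"))
```

Separating request construction from sending keeps the payload easy to inspect without a live server.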

For more details on how to deploy, go to the [nm-vllm GitHub repo](https://github.com/neuralmagic/nm-vllm).

For further support and discussion of these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).