# Performantly Quantize LLMs to 4-bits with Marlin and nm-vllm

This notebook walks through how to compress a pretrained LLM and deploy it with `nm-vllm`. To create a new 4-bit quantized model, we use AutoGPTQ. Quantizing the weights from FP16 to INT4 reduces the checkpoint size by roughly 70%. The main benefits are lower latency and lower memory usage.
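
As a rough sanity check on the ~70% figure, the sketch below compares the two `model.safetensors` sizes that appear in this notebook's progress-bar output (2.20G for the original checkpoint, 763M for the Marlin checkpoint). The helper names, and the assumption of one FP16 scale shared by each group of 128 weights, are illustrative and not taken from AutoGPTQ's internals.

```python
# Sizes as reported by the progress bars in this notebook's output.
FP16_SAFETENSORS_BYTES = 2.20e9    # original TinyLlama checkpoint (download)
MARLIN_SAFETENSORS_BYTES = 763e6   # Marlin checkpoint (upload)


def size_reduction(original_bytes, compressed_bytes):
    """Fraction of the original size that was saved."""
    return 1.0 - compressed_bytes / original_bytes


def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Storage per quantized weight, assuming one shared scale per group."""
    return bits + scale_bits / group_size


saved = size_reduction(FP16_SAFETENSORS_BYTES, MARLIN_SAFETENSORS_BYTES)
print(f"Measured size reduction: {saved:.1%}")                    # 65.3%
print(f"Bits per quantized weight: {bits_per_weight():.3f}")      # 4.125
```

The measured saving for this small model lands a little under 70% because some tensors, such as `lm_head` (see the log output later in this notebook), are not quantized.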

Developed in collaboration with IST-Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs. It compresses the model weights from 16 bits to 4 bits with limited impact on accuracy. [nm-vllm](https://github.com/neuralmagic/nm-vllm) includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.

This notebook requires an NVIDIA GPU with compute capability >= 8.0 (Ampere or newer) because of Marlin kernel restrictions. It will not currently run on a T4 or V100. It was tested on an A100 on Colab.
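
A minimal sketch of how the compute-capability requirement could be checked before running the rest of the notebook. The helper name `supports_marlin` is hypothetical; the `torch.cuda` calls are standard PyTorch, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
def supports_marlin(capability):
    """Return True if a (major, minor) compute capability is >= 8.0."""
    return tuple(capability) >= (8, 0)


try:
    import torch
except ImportError:  # keep the sketch usable without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    status = "supported" if supports_marlin(cap) else "not supported"
    print(f"{name}: compute capability {cap[0]}.{cap[1]} ({status})")
else:
    print("No CUDA device visible; the Marlin kernels need an NVIDIA GPU.")
```

For reference, the T4 reports compute capability 7.5 and the V100 reports 7.0, which is why both fail this check, while the A100 reports 8.0.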


In [None]:
!nvidia-smi

Tue Mar  5 20:22:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              49W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Install AutoGPTQ

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).



In [None]:
!pip install auto-gptq==0.7.1 torch==2.2.1

Collecting auto-gptq==0.7.1
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting torch==2.2.1
  Downloading torch-2.2.1-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
Collecting accelerate>=0.26.0 (from auto-gptq==0.7.1)
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting datasets (from auto-gptq==0.7.1)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting rouge (from auto-gptq==0.7.1)
  Downloading roug

## Quantizing an LLM

After installing AutoGPTQ, you are ready to quantize a model.

Below is an example of how to quantize [TinyLlama 1.1B Chat v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) using GPTQ.

GPTQ needs calibration data, so we first load and tokenize 512 samples from the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.

In [None]:
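from transformers import AutoTokenizer
from datasets import load_dataset

MAX_SEQ_LEN = 512    # Truncate each calibration sample to this many tokens
NUM_EXAMPLES = 512   # Number of calibration samples

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
DATASET = "HuggingFaceH4/ultrachat_200k"

def preprocess(example):
    # Render the chat messages into a single string with the model's chat template.
    # `tokenizer` is defined below, before this function is first called.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

# Take a random subset of the SFT training split as calibration data
dataset = load_dataset(DATASET, split="train_sft")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
ds = dataset.shuffle().select(range(NUM_EXAMPLES))
ds = ds.map(preprocess)

# Tokenize without padding; each entry holds input_ids and attention_mask
examples = [
    tokenizer(
        example["text"], padding=False, max_length=MAX_SEQ_LEN, truncation=True,
    ) for example in ds
]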

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generating train_sft split:   0%|          | 0/207865 [00:00<?, ? examples/s]

Generating test_sft split:   0%|          | 0/23110 [00:00<?, ? examples/s]

Generating train_gen split:   0%|          | 0/256032 [00:00<?, ? examples/s]

Generating test_gen split:   0%|          | 0/28304 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Map:   0%|          | 0/512 [00:00<?, ? examples/s]
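
Before spending time on quantization, it can help to sanity-check the calibration set. The sketch below assumes each entry is dict-like with `input_ids` and `attention_mask` lists, which is what the tokenizer call above returns; `summarize_examples` is a hypothetical helper, and the toy data stands in for the real `examples` list.

```python
def summarize_examples(examples, max_seq_len):
    """Report basic statistics for a list of tokenized calibration samples."""
    lengths = []
    for i, ex in enumerate(examples):
        if len(ex["input_ids"]) != len(ex["attention_mask"]):
            raise ValueError(f"example {i}: input_ids and attention_mask differ in length")
        lengths.append(len(ex["input_ids"]))
    return {
        "count": len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        # Samples sitting exactly at the limit were most likely truncated
        "at_max_len": sum(n == max_seq_len for n in lengths),
    }


# Toy stand-in for the tokenized `examples` list built above
toy_examples = [
    {"input_ids": [1, 5, 9], "attention_mask": [1, 1, 1]},
    {"input_ids": [1, 7], "attention_mask": [1, 1]},
]
print(summarize_examples(toy_examples, max_seq_len=3))
```

In the notebook itself, the equivalent call would be `summarize_examples(examples, MAX_SEQ_LEN)`.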

## Apply GPTQ

Next, we apply GPTQ with the samples we've processed from the dataset.

In [None]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,                         # Marlin only supports 4-bit weights
    group_size=128,                 # Set to 128 or -1 (channelwise)
    desc_act=False,                 # Marlin does not support desc_act=True (act_order)
    model_file_base_name="model",   # Weights are saved as model.safetensors by save_pretrained
)

model = AutoGPTQForCausalLM.from_pretrained(
    MODEL_ID,
    quantize_config,
    device_map="auto")
model.quantize(examples)

gptq_save_dir = f"{MODEL_ID.split('/')[-1]}-gptq"
print(f"Saving gptq model to {gptq_save_dir}")
model.save_pretrained(gptq_save_dir)
tokenizer.save_pretrained(gptq_save_dir)

# Drop the reference to the model and collect garbage before reloading in Marlin format
import gc
del model
gc.collect()

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

INFO - Start quantizing layer 1/22
INFO:auto_gptq.modeling._base:Start quantizing layer 1/22
INFO - Quantizing self_attn.k_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing self_attn.k_proj in layer 1/22...
INFO - Quantizing self_attn.v_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing self_attn.v_proj in layer 1/22...
INFO - Quantizing self_attn.q_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing self_attn.q_proj in layer 1/22...
INFO - Quantizing self_attn.o_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing self_attn.o_proj in layer 1/22...
INFO - Quantizing mlp.up_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing mlp.up_proj in layer 1/22...
INFO - Quantizing mlp.gate_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing mlp.gate_proj in layer 1/22...
INFO - Quantizing mlp.down_proj in layer 1/22...
INFO:auto_gptq.modeling._base:Quantizing mlp.down_proj in layer 1/22...
INFO - Start quantizing layer 2/22
INFO:

Saving gptq model to TinyLlama-1.1B-Chat-v1.0-gptq


2719
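
To make the `bits` and `group_size` settings concrete, here is a simplified sketch of group-wise 4-bit quantization on a single weight row. It uses plain symmetric round-to-nearest, which is deliberately simpler than GPTQ (GPTQ additionally adjusts the remaining weights to compensate for rounding error using the calibration data). All function names are hypothetical and nothing here reflects AutoGPTQ's actual packing format.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1       # 7 for 4-bit
    qmin = -(2 ** (bits - 1))        # -8 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid a zero scale
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_groupwise(weights, group_size=128, bits=4):
    """Split a row into groups; each group gets its own scale."""
    if group_size == -1:             # channelwise: one scale for the whole row
        group_size = len(weights)
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]


def dequantize(quantized):
    """Map the integer codes back to approximate floating-point weights."""
    return [q * scale for codes, scale in quantized for q in codes]


row = [0.02, -0.5, 0.31, 0.07, 3.0, -2.2, 0.9, 1.4]
for label, g in (("group_size=4", 4), ("channelwise", -1)):
    restored = dequantize(quantize_groupwise(row, group_size=g))
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{label}: max abs error = {max_err:.4f}")
```

Smaller groups give each scale fewer weights to cover, so small-magnitude weights are not forced onto a grid sized for a distant outlier; the cost is storing more scales.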

## Convert the GPTQ model to Marlin format

Next, we convert the GPTQ-formatted model into the optimized Marlin format. This is as simple as calling `AutoGPTQForCausalLM.from_quantized` with the `use_marlin=True` argument, then saving the result back to disk.

In [None]:
print("Reloading in marlin format")
marlin_model = AutoGPTQForCausalLM.from_quantized(
    gptq_save_dir,
    use_marlin=True,
    device_map="auto")

marlin_save_dir = f"{MODEL_ID.split('/')[-1]}-marlin"
print(f"Saving model in marlin format to {marlin_save_dir}")
marlin_model.save_pretrained(marlin_save_dir)
tokenizer.save_pretrained(marlin_save_dir)

INFO - The layer lm_head is not quantized.
INFO:auto_gptq.modeling._base:The layer lm_head is not quantized.


Reloading in marlin format


Repacking weights to be compatible with Marlin kernel...: 100%|██████████| 314/314 [00:26<00:00, 11.97it/s]


Saving model in marlin format to TinyLlama-1.1B-Chat-v1.0-marlin


('TinyLlama-1.1B-Chat-v1.0-marlin/tokenizer_config.json',
 'TinyLlama-1.1B-Chat-v1.0-marlin/special_tokens_map.json',
 'TinyLlama-1.1B-Chat-v1.0-marlin/tokenizer.model',
 'TinyLlama-1.1B-Chat-v1.0-marlin/added_tokens.json',
 'TinyLlama-1.1B-Chat-v1.0-marlin/tokenizer.json')
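
One simple way to confirm that both checkpoints were written is to compare their on-disk sizes. This is a small sketch using only the standard library; `dir_size_mb` is a hypothetical helper, and the directory names are the ones printed in the cell output above.

```python
import os


def dir_size_mb(path):
    """Total size of all files under `path`, in megabytes (10**6 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


for save_dir in ("TinyLlama-1.1B-Chat-v1.0-gptq", "TinyLlama-1.1B-Chat-v1.0-marlin"):
    if os.path.isdir(save_dir):
        print(f"{save_dir}: {dir_size_mb(save_dir):.1f} MB")
    else:
        print(f"{save_dir}: not found (run the cells above first)")
```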

**Optional**: You can upload your compressed 4-bit model directly to Hugging Face, which makes it easy to keep track of all your compressed model variations and to deploy them with `nm-vllm` on new systems.

In [None]:
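# Upload the output model to Hugging Face Hub
from huggingface_hub import HfApi

# Replace with a repo under a namespace you have write access to
final_model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin"

api = HfApi()
# upload_folder expects the repo to exist, so create it if needed
api.create_repo(repo_id=final_model_name, exist_ok=True)
api.upload_folder(
    folder_path=marlin_save_dir,
    repo_id=final_model_name,
)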

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/763M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin/commit/5059cf5d5e58b61d8941baa9e52638eaea2e0335', commit_message='Upload folder using huggingface_hub', commit_description='', oid='5059cf5d5e58b61d8941baa9e52638eaea2e0335', pr_url=None, pr_revision=None, pr_num=None)

## Optimized deployment with nm-vllm

The [nm-vllm](https://github.com/neuralmagic/nm-vllm) package is a high-throughput and memory-efficient inference and serving engine for LLMs. It includes the latest LLM optimizations, such as the highly performant 4-bit Marlin CUDA kernels.

To run a Marlin-optimized model with `nm-vllm`, simply pass the model in and the engine will automatically detect the quantization from the `config.json`.

In [None]:
!pip install nm-vllm

Collecting nm-vllm
  Downloading nm_vllm-0.1.0-cp310-cp310-manylinux_2_17_x86_64.whl (58.9 MB)
Collecting ninja (from nm-vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting ray>=2.9 (from nm-vllm)
  Downloading ray-2.9.3-cp310-cp310-manylinux2014_x86_64.whl (64.9 MB)
Collecting torch==2.1.2 (from nm-vllm)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Collecting xformers==0.0.23.post1 (from nm-vllm)
  Downloading xformer

Now we can simply pass in the quantized Marlin model we just made to use directly within the engine.
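
The cell below loads the checkpoint from the Hugging Face Hub. If the optional upload step was skipped, the locally saved Marlin directory can be used instead, since the `LLM` constructor accepts either a Hub ID or a local path. The sketch picks whichever is available; `resolve_model_source` is a hypothetical helper, and the two names are the ones used earlier in this notebook.

```python
import os


def resolve_model_source(local_dir, hub_id):
    """Prefer a local checkpoint directory, falling back to the Hub ID."""
    return local_dir if os.path.isdir(local_dir) else hub_id


model_source = resolve_model_source(
    "TinyLlama-1.1B-Chat-v1.0-marlin",              # marlin_save_dir from above
    "nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin",   # uploaded copy on the Hub
)
print(f"Loading model from: {model_source}")
# In the notebook, this value would then be passed as LLM(model_source).
```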

In [None]:
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Create an LLM.
llm = LLM("nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"\nGenerated text: {prompt}{generated_text}\n")



INFO 03-05 19:29:18 llm_engine.py:81] Initializing an LLM engine with config: model='nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin', tokenizer='nm-testing/TinyLlama-1.1B-Chat-v1.0-marlin', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=marlin, sparsity=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-05 19:29:22 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 03-05 19:29:23 llm_engine.py:340] # GPU blocks: 102209, # CPU blocks: 11915
INFO 03-05 19:29:26 model_runner.py:676] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-05 19:29:26 model_runner.py:680] CUDA graphs can take additional 1~3 GiB 

Processed prompts: 100%|██████████| 4/4 [00:00<00:00,  6.51it/s]


Generated text: Hello, my name is John Smith. And I’m a senior at your high school. I'm interested in learning more about your school and how it compares to other schools in our area. I've heard good things about your school, so I would appreciate it if you could give me some information on the facilities, the teachers, and the overall student body. Please provide me with specific examples of the academic programs and extracurricular activities that you offer, as well as any notable sports teams or clubs


Generated text: The president of the United States is a member of the presidential Cabinet.

2. You may also be interested in learning about the role of the vice president, as well as their responsibilities and duties.

3. To learn more about the office of the president, you may want to explore our collection of articles, podcasts, and videos.

4. If you're looking for a guide to the US presidential elections, I would recommend checking out The Guardian's guide, which


Generated te




For more details on how to deploy, go to the [nm-vllm GitHub repo](https://github.com/neuralmagic/nm-vllm).

For further support, and discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).