# Quantization

**Quantization** techniques focus on representing data with less information while also trying to not lose too much accuracy. This often means converting a data type to represent the same information with fewer bits.
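As a rough illustration of "fewer bits with little accuracy loss", the sketch below maps a few floats to int8 codes with a single shared scale (absmax quantization) and maps them back. The function names and sample values are made up for this example; this is not how any particular library implements it.

```python
def quantize_absmax_int8(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.27, -0.94]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)      # integer codes, each storable in 8 bits
print(max_error)  # rounding error, bounded by half the scale
```

Each value now needs 8 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale.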

# bitsandbytes

**bitsandbytes** is the easiest option for quantizing a model to 8-bit and 4-bit.

**8-bit quantization** multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16.
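The decomposition described above can be sketched in plain Python. This is a toy version under stated assumptions: feature dimensions whose magnitude reaches a threshold are treated as outliers and multiplied in floating point, everything else goes through int8 codes, and the two partial results are added. The helper names and the tiny matrices are hypothetical; the real library does this in CUDA kernels with fp16.

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` to 127."""
    return max(abs(v) for v in values) / 127 or 1.0


def mixed_int8_matmul(X, W, threshold=6.0):
    """Toy X @ W: outlier feature columns in float, the rest in int8."""
    n_features, n_out = len(W), len(W[0])
    # A feature dimension is an outlier if any token reaches the threshold there.
    is_outlier = [any(abs(row[k]) >= threshold for row in X) for k in range(n_features)]
    regular = [k for k in range(n_features) if not is_outlier[k]]
    outliers = [k for k in range(n_features) if is_outlier[k]]

    # One scale per row of X and per column of W, computed on regular features only.
    x_scales = [absmax_scale([row[k] for k in regular]) if regular else 1.0 for row in X]
    w_scales = [absmax_scale([W[k][j] for k in regular]) if regular else 1.0 for j in range(n_out)]

    result = []
    for i, row in enumerate(X):
        out_row = []
        for j in range(n_out):
            # Integer accumulation over regular features, then rescale to float.
            int_acc = sum(round(row[k] / x_scales[i]) * round(W[k][j] / w_scales[j]) for k in regular)
            # Outlier features stay in floating point.
            float_acc = sum(row[k] * W[k][j] for k in outliers)
            out_row.append(int_acc * x_scales[i] * w_scales[j] + float_acc)
        result.append(out_row)
    return result, is_outlier


X = [[0.5, -1.0, 40.0], [1.5, 0.25, -35.0]]  # 2 tokens, 3 features (last one is an outlier)
W = [[0.2, -0.4], [0.1, 0.3], [-0.05, 0.6]]  # 3 features, 2 outputs
approx, is_outlier = mixed_int8_matmul(X, W)
exact = [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(2)] for row in X]

print(is_outlier)
print(approx)
print(exact)
```

Keeping the large-magnitude column out of the int8 path means it does not inflate the scale used for the remaining, much smaller values.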

In [None]:
!pip install -qU transformers accelerate bitsandbytes

Quantizing a model in 8-bit halves the memory usage. For large models, set `device_map='auto'` to use the available GPUs efficiently:

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-1b7',
    quantization_config=quantization_config,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. We can change the data type of these modules with the `torch_dtype` parameter if we want. Setting `torch_dtype='auto'` loads the model in the data type defined in a model's `config.json` file.

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    'facebook/opt-350m',
    quantization_config=quantization_config,
    device_map='auto',
    torch_dtype='auto'
)

In [None]:
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype

We can push the quantized model to the Hub:

In [None]:
model_8bit.push_to_hub('bloom-350m-8bit')

We can check our memory footprint:

In [None]:
model_8bit.get_memory_footprint()

Quantized models can be loaded from the `from_pretrained()` method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('bloom-350m-8bit')
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

Quantizing a model in 4-bit works the same way and reduces our memory usage by roughly 4x.

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-1b7',
    quantization_config=quantization_config,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    'facebook/opt-350m',
    quantization_config=quantization_config,
    device_map='auto',
    torch_dtype='auto'
)

In [None]:
model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype

In [None]:
model_4bit.get_memory_footprint()
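As a sanity check on the numbers returned by `get_memory_footprint()`, we can estimate weight memory from the parameter count and the bits per parameter. The helper below is a back-of-the-envelope sketch; the parameter count is an assumed round figure for a model of about 1.7B parameters.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1.7e9  # assumed round figure for a ~1.7B-parameter model

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(n_params, bits):.2f} GB")
```

The measured footprint is usually somewhat higher than this estimate, because some modules (for example layer norms) are not quantized and the quantization constants also take space.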

## 8-bit - LLM.int8() algorithm

### Offloading

8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and are not converted to 8-bit.

For example, to enable offloading for the `bigscience/bloom-1b7` model, we start by creating a `BitsAndBytesConfig`:

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_enable_fp32_cpu_offload=True)

Next, we need to design a custom device map to fit everything on our GPU except for the `lm_head`, which we will dispatch to the CPU:

In [None]:
device_map = {
    'transformer.word_embeddings': 0,
    'transformer.word_embeddings_layernorm': 0,
    'lm_head': 'cpu',
    'transformer.h': 0,
    'transformer.ln_f': 0,
}

Now we load our model with the custom `device_map` and `quantization_config`:

In [None]:
model_8bit = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-1b7',
    torch_dtype='auto',
    device_map=device_map,
    quantization_config=quantization_config
)

### Outlier threshold

An **"outlier"** is a hidden state value whose magnitude exceeds a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

We can experiment with `llm_int8_threshold` to find the best threshold for our model:

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'bigscience/bloom-1b7'

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=10.0,
    llm_int8_enable_fp32_cpu_offload=True
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map=device_map,
    quantization_config=quantization_config
)

### Skip module conversion

For some models, like `Jukebox`, we do not need to quantize every module to 8-bit, and doing so can actually cause instability. With `Jukebox`, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'bigscience/bloom-1b7'

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=['lm_head']
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map=device_map,
    quantization_config=quantization_config
)

## 4-bit (QLoRA algorithm)

### Compute data type

To speed up computation, we can change the compute data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter:

In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the **QLoRA** paper, adapted for weights initialized from a normal distribution. We should use NF4 for training 4-bit base models.

In [None]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

model_id = 'bigscience/bloom-1b7'

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4'
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    quantization_config=nf4_config
)
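To build some intuition for why a 4-bit type can be "adapted" to normally distributed weights, the sketch below places 16 levels at evenly spaced quantiles of a standard normal distribution, so that levels are denser near zero where most weights lie. This is a simplified illustration, not the actual NF4 codebook used by bitsandbytes (which is constructed differently and includes an exact zero); all function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    raw = [nd.inv_cdf(p) for p in probs]
    top = max(abs(r) for r in raw)
    return [r / top for r in raw]


def quantize_block(block, levels):
    """Normalize a block by its absmax, then pick the nearest level for each value."""
    absmax = max(abs(v) for v in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(v / absmax - levels[i]))
        for v in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look up each 4-bit code and undo the absmax normalization."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.4, 0.9, -1.2]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

print(codes)     # 4-bit indices into the level table
print(restored)  # approximate reconstruction of the block
```

Each weight is stored as a 4-bit index, plus one absmax constant per block.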

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, we should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
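The "0.4 bits/parameter" figure can be reproduced with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per block, and 256 first-level constants per second-level block); these values are assumptions for the example, not settings read from the library.

```python
block_size = 64          # assumed: weights sharing one quantization constant
second_block_size = 256  # assumed: constants sharing one second-level constant

# Without nesting: one fp32 constant per block of weights.
plain_overhead = 32 / block_size

# With nesting: constants stored in 8 bits, plus one fp32 constant
# for every `second_block_size` first-level constants.
nested_overhead = 8 / block_size + 32 / (block_size * second_block_size)

saved = plain_overhead - nested_overhead
print(f"plain:  {plain_overhead:.3f} bits/param")
print(f"nested: {nested_overhead:.3f} bits/param")
print(f"saved:  {saved:.3f} bits/param")
```

Under these assumptions the saving is about 0.37 bits per parameter, which rounds to the 0.4 quoted above.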

With nested quantization, we can finetune a `llama-13b` model on a 16GB T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.

In [None]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

model_id = 'meta-llama/Llama-2-13b-chat-hf'

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True
)

model_double_quant = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    quantization_config=double_quant_config
)

## Dequantizing bitsandbytes models

Once quantized, we can dequantize the model back to its original precision, but this may result in a small loss of model quality.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'facebook/opt-125m'

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='cuda:0'
)

model.dequantize()

In [None]:
text = tokenizer("Hello, my name is", return_tensors='pt').to('cuda:0')

out = model.generate(**text)
tokenizer.decode(out[0])

# GPTQ

Both the **GPTQModel** and **AutoGPTQ** libraries implement the GPTQ algorithm, *a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes error*. These weights are quantized to int4, stored as int32 (int4 x 8) and dequantized (restored) to fp16 on the fly during inference.

This can reduce memory usage by almost 4x because the int4 weights are dequantized in a fused kernel rather than in the GPU's global memory.
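The "int4 x 8 stored as int32" layout can be illustrated with a few bit operations. The sketch below packs eight 4-bit codes into one 32-bit integer and unpacks them again; the bit order and the function names are assumptions for illustration and may differ from what the GPTQ kernels actually use.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit codes (0..15) into one 32-bit integer."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)  # code i occupies bits [4*i, 4*i + 3]
    return packed


def unpack_int4(packed):
    """Recover the eight 4-bit codes from a packed 32-bit integer."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to floats with a scale and zero point."""
    return [(c - zero_point) * scale for c in codes]


codes = [1, 15, 0, 7, 8, 3, 12, 5]
packed = pack_int4(codes)

print(hex(packed))
print(unpack_int4(packed))
print(dequantize(unpack_int4(packed), scale=0.5, zero_point=8))
```

Eight weights fit in the space of a single 32-bit value, which is where the roughly 4x saving over fp16 comes from.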



In [None]:
!pip install -qU accelerate optimum transformers gptqmodel auto-gptq

To quantize a model, we need to create a `GPTQConfig` object and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = 'facebook/opt-125m'

tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset='c4',
    tokenizer=tokenizer
)

We could also pass our own dataset as a list of strings, but it is highly recommended to use the same dataset as the GPTQ paper.

In [None]:
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
gptq_config = GPTQConfig(
    bits=4,
    dataset=dataset,
    tokenizer=tokenizer
)

We can load a model to quantize and pass the `gptq_config` to the `from_pretrained()` method.

In [None]:
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=gptq_config
)

Disk offloading is not supported. If we run out of memory because the dataset is too large, we can try passing `max_memory` to control how much memory is used on each device:

In [None]:
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=gptq_config,
    max_memory={
        0: "12GiB",
        1: "16GiB",
        'cpu': "32GiB",
        # assume we have 2 gpus and a cpu
    }
)

Once our model is quantized, we can push the model and tokenizer to the Hub:

In [None]:
quantized_model.push_to_hub('opt-125m-gptq')
tokenizer.push_to_hub('opt-125m-gptq')

We can also save our quantized model locally:

In [None]:
# save the quantized model and tokenizer
quantized_model.save_pretrained('opt-125m-gptq')
tokenizer.save_pretrained('opt-125m-gptq')

# if the model was quantized with `device_map` set,
# move the whole model to the CPU before saving it
quantized_model.to('cpu')
quantized_model.save_pretrained('opt-125m-gptq')
tokenizer.save_pretrained('opt-125m-gptq')

We can reload a quantized model and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed:

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'opt-125m-gptq',
    device_map='auto'
)

## Marlin

**Marlin** is a 4-bit only CUDA GPTQ kernel, highly optimized for the NVIDIA A100 GPU (Ampere) architecture.

Marlin is only available for quantized inference and does not support model quantization.

In [None]:
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    backend='marlin'
)

model = AutoModelForCausalLM.from_pretrained(
    'opt-125m-gptq',
    device_map='auto',
    quantization_config=gptq_config
)

## ExLlama

**ExLlama** is a CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights.

To boost inference speed even further, we can use the **ExLlamaV2** kernel by configuring the `exllama_config`:

In [None]:
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    exllama_config={'version': 2}
)

model = AutoModelForCausalLM.from_pretrained(
    'opt-125m-gptq',
    device_map='auto',
    quantization_config=gptq_config
)

Only 4-bit models are supported. If we are finetuning a quantized model with PEFT, we should deactivate the ExLlama kernels.

The ExLlama kernels are only supported when the entire model is on the GPU. If we are doing inference on a CPU with AutoGPTQ or GPTQModel, then we will need to disable the ExLlama kernel.

In [None]:
import torch
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(
    bits=4,
    use_exllama=False
)

model = AutoModelForCausalLM.from_pretrained(
    'opt-125m-gptq',
    device_map='auto',
    quantization_config=gptq_config
)

# AWQ

**Activation-aware Weight Quantization (AWQ)** does not quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that we can run models in 4-bit precision without experiencing any performance degradation.
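One way to protect important weights without storing them in higher precision is to scale them up before rounding and scale the matching activations down by the same factor, so the rounding error on those channels shrinks. The toy sketch below illustrates this on a single output neuron. The per-channel scales are hand-picked and hypothetical; a real AWQ implementation derives them from calibration activations.

```python
def quantize_dequantize_int4(weights):
    """Round-to-nearest symmetric 4-bit quantization with one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [max(-7, min(7, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.35, 1.0, -0.7, 0.45]
activations = [10.0, 1.0, 1.0, 1.0]     # channel 0 carries large activations
channel_scales = [2.0, 1.0, 1.0, 1.0]   # hypothetical: protect channel 0 only

exact = dot(activations, weights)

# Plain quantization: the salient weight is rounded like any other.
plain = dot(activations, quantize_dequantize_int4(weights))

# Activation-aware: scale weights up, activations down, then quantize.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]
aware = dot(scaled_activations, quantize_dequantize_int4(scaled_weights))

print(f"exact={exact:.4f} plain={plain:.4f} aware={aware:.4f}")
```

In this example the output error drops from roughly 0.68 to roughly 0.04, because the rounding error on the salient weight is divided by its scale when the activation is scaled down.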

There are several libraries for quantizing models with the AWQ algorithm, such as `llm-awq`, `autoawq`, or `optimum-intel`.


In [None]:
!pip install -qU autoawq transformers

AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's `config.json`:
```json
{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  ...
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
```

A quantized model is loaded with the `from_pretrained()` method. If the model was loaded on the CPU, make sure to move it to a GPU device before using it.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'TheBloke/zephyr-7B-alpha-AWQ'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto'
)

Loading an AWQ-quantized model automatically sets the other weights to fp16 by default for performance reasons. If we want to load these other weights in a different format, we need to use the `torch_dtype` parameter:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = 'TheBloke/zephyr-7B-alpha-AWQ'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32
)

AWQ quantization can also be combined with **FlashAttention-2** to further accelerate inference:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    'TheBloke/zephyr-7B-alpha-AWQ',
    attn_implementation='flash_attention_2',
    device_map='cuda:0'
)

## Fused modules

**Fused modules** offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules in the Llama and Mistral architectures, and we can also fuse AWQ modules for unsupported architectures.

**Fused modules cannot be combined with other optimization techniques such as FlashAttention-2**.

To enable fused modules for supported architectures, we need to create an `AwqConfig` and set `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` is the total sequence length and it should include the context length and the expected generation length.

For example, to fuse the AWQ modules of the `TheBloke/Mistral-7B-OpenOrca-AWQ`:

In [None]:
import torch
from transformers import AwqConfig, AutoModelForCausalLM

model_id = 'TheBloke/Mistral-7B-OpenOrca-AWQ'

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
).to('cuda:0')

## ExLlama-v2 support

Newer versions of `autoawq` support ExLlama-v2 kernels for faster prefill and decoding.

In [None]:
!pip install git+https://github.com/casper-hansen/AutoAWQ.git

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version='exllama')

model = AutoModelForCausalLM.from_pretrained(
    'TheBloke/Mistral-7B-Instruct-v0.1-AWQ',
    quantization_config=quantization_config,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained('TheBloke/Mistral-7B-Instruct-v0.1-AWQ')

In [None]:
input_ids = tokenizer.encode(
    "How to make a cake",
    return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    pad_token_id=tokenizer.eos_token_id
)

tokenizer.decode(output[0], skip_special_tokens=True)

## Intel CPU/GPU support

Newer versions of `autoawq` support Intel CPU/GPU with IPEX op optimizations.

In [None]:
# for IPEX-GPU, refer to https://intel.github.io/intel-extension-for-pytorch/xpu/2.5.10+xpu/
!pip install intel-extension-for-pytorch
!pip install git+https://github.com/casper-hansen/AutoAWQ.git

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = "cpu" # set to "xpu" for Intel GPU
quantization_config = AwqConfig(version="ipex")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
    quantization_config=quantization_config,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")

In [None]:
input_ids = tokenizer.encode(
    "How to make a cake",
    return_tensors="pt"
).to(device)
pad_token_id = tokenizer.eos_token_id
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=pad_token_id)
tokenizer.decode(output[0], skip_special_tokens=True)

# AQLM

**Additive Quantization of Language Models (AQLM)** is a compression method for large language models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
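The "sum of multiple vector codes" idea can be sketched as a decoding step: each group of weights is stored as one index per codebook, and the group is reconstructed by adding the selected codebook vectors. The tiny codebooks and function names below are hypothetical and only show the data layout, not how AQLM learns its codebooks.

```python
def decode_group(codes, codebooks):
    """Reconstruct one weight group as the sum of one vector per codebook."""
    group_size = len(codebooks[0][0])
    out = [0.0] * group_size
    for code, codebook in zip(codes, codebooks):
        for i, v in enumerate(codebook[code]):
            out[i] += v
    return out


def bits_per_weight(num_codebooks, code_bits, group_size):
    """Storage cost of the codes alone, ignoring the codebooks themselves."""
    return num_codebooks * code_bits / group_size


# Two toy codebooks with two entries each, for groups of 8 weights.
codebooks = [
    [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.5] * 8],
    [[0.25] * 8, [-0.25] * 8],
]

print(decode_group([0, 0], codebooks))
print(decode_group([1, 1], codebooks))
print(bits_per_weight(num_codebooks=1, code_bits=16, group_size=8))
```

Assuming one 16-bit code per group of 8 weights, the codes cost 2 bits per weight, which is how such a scheme reaches very low bit-widths.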

In [None]:
!pip install -qU aqlm[gpu,cpu]

This library provides efficient kernels for both GPU and CPU inference and training.

To run AQLM models:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf'

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Detailed instructions on how to quantize models are available in the official AQLM GitHub repository.

# VPTQ

**Vector Post-Training Quantization (VPTQ)** is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit).

VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy:
* Better accuracy at 1-2 bits
* Lightweight quantization algorithm: it takes only ~17 hours to quantize the 405B Llama-3.1 model
* Agile quantization inference: low decode overhead, best throughput, and TTFT (time to first token, the delay between a user sending a query and receiving the first response)
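For readers unfamiliar with vector quantization, the sketch below shows the basic mechanism: weights are split into short vectors, each vector is replaced by the index of its nearest centroid, and decoding is a table lookup. The centroids here are hand-written and the function names are hypothetical; how VPTQ actually builds its codebooks is not covered.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to `vec`."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])),
    )


def vq_encode(weights, centroids, vector_len):
    """Split weights into vectors and store one centroid index per vector."""
    vectors = [weights[i:i + vector_len] for i in range(0, len(weights), vector_len)]
    return [nearest_centroid(v, centroids) for v in vectors]


def vq_decode(indices, centroids):
    """Reconstruct weights by looking up each index in the codebook."""
    return [x for idx in indices for x in centroids[idx]]


centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
weights = [0.9, 1.2, -0.1, 0.05, -0.8, 1.1]

indices = vq_encode(weights, centroids, vector_len=2)
print(indices)
print(vq_decode(indices, centroids))
```

With longer vectors and a larger codebook, the cost per weight is the index width divided by the vector length, which is how bit-widths below 2 become possible.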

In [None]:
!pip install -qU vptq

To run VPTQ models:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft'

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
input_ids = tokenizer("hello, how are you", return_tensors="pt").to("cuda")
out = quantized_model.generate(**input_ids, max_new_tokens=32, do_sample=False)
tokenizer.decode(out[0], skip_special_tokens=True)

# SpQR

**Sparse-Quantized Representation (SpQR)** involves a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers. The details are in the paper *SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression*.

To run a SpQR-quantized model:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf'

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.half,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
input_ids = tokenizer("hello, how are you", return_tensors="pt").to("cuda")
out = quantized_model.generate(**input_ids, do_sample=False)
tokenizer.decode(out[0], skip_special_tokens=True)

# Optimum-quanto

**optimum-quanto** is a PyTorch quantization library from Hugging Face.

In [None]:
!pip install -qU optimum-quanto accelerate transformers

We can quantize a model by passing a `QuantoConfig` object to the `from_pretrained()` method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.

The `optimum-quanto` library not only provides the weight quantization that is already integrated in `transformers`, but also supports more complex use cases such as activation quantization, calibration, and quantization-aware training.

By default, the weights are loaded in full precision (`torch.float32`) regardless of the actual data type. We can set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file, which automatically loads the most memory-optimal data type.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = 'facebook/opt-125m'

quantization_config = QuantoConfig(weights='int8')

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='cuda:0',
    quantization_config=quantization_config,
)

# EETQ

The [**EETQ**](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUs. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset, and we do not need to pre-quantize our model.
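"Per-channel" means that every output channel (row of the weight matrix) gets its own scale, computed from the weights alone, which is why no calibration data is needed. The sketch below shows that idea in plain Python; the function name and sample matrix are hypothetical and this is not EETQ's kernel code.

```python
def quantize_per_channel_int8(weight):
    """Quantize each row with its own absmax scale; returns codes and scales."""
    scales = [max(abs(v) for v in row) / 127 or 1.0 for row in weight]
    codes = [[round(v / s) for v in row] for row, s in zip(weight, scales)]
    return codes, scales


# Two output channels with very different value ranges.
weight = [
    [0.008, -0.02, 0.004],
    [5.0, -3.0, 1.0],
]
codes, scales = quantize_per_channel_int8(weight)

print(codes)
print(scales)
```

Because the small-valued row has its own scale, it still uses the full int8 range; with a single scale for the whole matrix it would lose most of its resolution.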

In [None]:
!pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl

An unquantized model can be quantized via the `from_pretrained()` method:

In [None]:
from transformers import AutoModelForCausalLM, EetqConfig

model_id = 'facebook/opt-125m'

quantization_config = EetqConfig('int8')

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=quantization_config
)

# HIGGS

**HIGGS** is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model_id = 'google/gemma-2-9b-it'

quantization_config = HiggsConfig(bits=4)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto'
)

In [None]:
tokenizer.decode(model.generate(
    **tokenizer("Hi,", return_tensors="pt").to(model.device),
    temperature=0.5,
    top_p=0.80,
)[0])

# HQQ

**Half-Quadratic Quantization (HQQ)** implements on-the-fly quantization via fast robust optimization. It does not require calibration data and can be used to quantize any model.

In [None]:
!pip install -qU hqq

To quantize a model, we need to create a `HqqConfig`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# method 1: all linear layers will use the same quantization config
quantization_config = HqqConfig(nbits=8, group_size=64)

In [None]:
# method 2: each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits': 4, 'group_size': 64}
q3_config = {'nbits': 3, 'group_size': 32}

quantization_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj': q3_config,
    'mlp.down_proj': q3_config,
})

The second approach (method 2) is useful for quantizing Mixture-of-Experts (MoEs) because the experts are less affected by lower quantization settings.

In [None]:
import torch

# `model_id` is the checkpoint to quantize, defined in an earlier cell
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
    quantization_config=quantization_config
)
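To make the `nbits` and `group_size` settings more concrete, the sketch below quantizes a flat list of weights group by group, giving each group its own scale and offset. This is plain round-to-nearest quantization with hypothetical helper names; it does not reproduce the optimization that HQQ runs to refine the quantization parameters.

```python
def quantize_groups(weights, nbits=4, group_size=64):
    """Quantize consecutive groups of weights to `nbits` unsigned codes."""
    qmax = 2 ** nbits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant groups
        codes = [round((w - lo) / scale) for w in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groups(groups):
    """Rebuild the weights from per-group codes, scales, and offsets."""
    return [c * scale + lo for codes, scale, lo in groups for c in codes]


weights = [i / 10 for i in range(-8, 8)]  # 16 toy weights
groups = quantize_groups(weights, nbits=3, group_size=8)
restored = dequantize_groups(groups)

print(len(groups))  # number of groups, each with its own scale and offset
print(restored)
```

Smaller groups track the local range of the weights more closely but store more scales and offsets, which is the trade-off behind choosing `group_size`.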

# FBGEMM FP8

With the FBGEMM FP8 quantization method, we can quantize our model in FP8 (W8A8):
* the weights will be quantized in 8-bit (FP8) per channel
* the activations will be quantized in 8-bit (FP8) per token
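The difference between "per channel" and "per token" is only which axis shares a scale. The sketch below computes one scale per weight row (output channel) and one scale per activation row (token), so that every scaled value fits the FP8 range. It assumes the e4m3 format, whose largest finite value is 448, and it omits the actual cast to FP8; the names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # assumed format: largest finite value of float8 e4m3


def row_scales(matrix):
    """One scale per row, chosen so that row / scale fits in [-448, 448]."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0 for row in matrix]


def apply_scales(matrix, scales):
    """Divide each row by its scale (the result would then be cast to FP8)."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


weight = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]       # (out_channels, in_features)
activations = [[3.0, -1.0, 0.5], [10.0, 2.0, -4.0]]  # (tokens, in_features)

w_scales = row_scales(weight)       # one scale per output channel
a_scales = row_scales(activations)  # one scale per token

scaled_w = apply_scales(weight, w_scales)
scaled_a = apply_scales(activations, a_scales)

print(w_scales)
print(a_scales)
```

After the low-precision matrix multiply, the output for a given token and channel is rescaled by the product of that token's scale and that channel's scale.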

In [None]:
!pip install -qU accelerate fbgemm-gpu torch

By default, the weights are loaded in full precision regardless of the actual data type.

In [None]:
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = 'meta-llama/Meta-Llama-3-8B'

quantization_config = FbgemmFp8Config()

tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto',
    quantization_config=quantization_config
)

In [None]:
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
tokenizer.decode(output[0], skip_special_tokens=True)

# TorchAO

**TorchAO** is an architecture optimization library for PyTorch. It provides high performance dtypes, optimization techniques and kernels for inference and training, featuring composability with native PyTorch features.

In [None]:
!pip install -qU torch torchao transformers

In [None]:
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = 'meta-llama/Meta-Llama-3-8B'

quantization_config = TorchAoConfig('int4_weight_only', group_size=128)

tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto',
    quantization_config=quantization_config
)

In [None]:
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(
    **input_ids,
    max_new_tokens=10,
    cache_implementation='static'
)
tokenizer.decode(output[0], skip_special_tokens=True)

In [None]:
# benchmark performance
import torch.utils.benchmark as benchmark

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(5):
        f(*args, **kwargs)

    t0 = benchmark.Timer(
        stmt='f(*args, **kwargs)',
        globals={'args': args, 'kwargs': kwargs, 'f': f},
        num_threads=torch.get_num_threads()
    )

    return f"{(t0.blocked_autorange().mean):.3f}"


MAX_NEW_TOKENS = 1000

print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))

bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype=torch.bfloat16)
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))

## Serialization and Deserialization

Because torchao quantization is implemented with tensor subclasses, it only works with Hugging Face's non-safetensors serialization and deserialization.

In [None]:
# save quantized model locally
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# load quantized model
ckpt_id = "llama3-8b-int4wo-128"  # or huggingface hub model id
loaded_quantized_model = AutoModelForCausalLM.from_pretrained(ckpt_id, device_map="cuda")

# confirm the speedup
loaded_quantized_model = torch.compile(loaded_quantized_model, mode="max-autotune")
print("loaded int4wo-128 model:", benchmark_fn(loaded_quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS))

# Compressed-tensors

The `compressed-tensors` library provides a versatile and efficient way to store and manage compressed model checkpoints. The library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations such as GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT and more.

In [None]:
!pip install -qU compressed-tensors

In [None]:
from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained(
    'nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf'
)

# measure memory usage
mem_params = sum(
    [param.nelement() * param.element_size() for param in ct_model.parameters()]
)
print(f"{mem_params/2**30:.4f} GB")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf'

tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto'
)

In [None]:
prompt = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]

inputs = tokenizer(prompt, return_tensors='pt').to(quantized_model.device)
generated_ids = quantized_model.generate(
    **inputs,
    max_length=50,
    do_sample=False
)
outputs = tokenizer.batch_decode(generated_ids)
print(outputs)

# Fine-grained FP8

With fine-grained FP8 quantization, we can quantize our model in FP8:
* the weights are quantized to 8 bits (FP8) per 2D block (e.g. `weight_block_size=(128, 128)`), which is inspired by the DeepSeek implementation
* the activations are quantized to 8 bits (FP8) per group per token, with the group value matching that of the weights in the input channels (128 by default)
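To get a feel for how many scaling factors this block-wise scheme stores, the sketch below counts weight blocks and activation groups for a given layer shape. The helper names and the 4096 x 4096 layer are hypothetical; only the 128 block and group sizes come from the description above.

```python
import math


def num_weight_scales(out_features, in_features, block=(128, 128)):
    """One scale per 2D block of the weight matrix."""
    return math.ceil(out_features / block[0]) * math.ceil(in_features / block[1])


def num_activation_scales(n_tokens, in_features, group=128):
    """One scale per token for every group of input channels."""
    return n_tokens * math.ceil(in_features / group)


# Hypothetical 4096 x 4096 linear layer processing 10 tokens.
print(num_weight_scales(4096, 4096))      # 32 * 32 blocks
print(num_activation_scales(10, 4096))    # 10 tokens * 32 groups
```

Compared with one scale per channel, block-wise scales adapt to local value ranges inside a row at the cost of storing more scaling factors.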

In [None]:
!pip install -qU accelerate torch

In [None]:
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_name = 'meta-llama/Meta-Llama-3-8B'

quantization_config = FineGrainedFP8Config()

tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto',
    quantization_config=quantization_config
)

In [None]:
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
tokenizer.decode(output[0], skip_special_tokens=True)