# Week 9: Efficient model inference

As we now know from the lecture, there are many ways to make inference more efficient:
- Distillation
- Quantization
- Changing architecture (e.g. encoder-decoder vs decoder)
- Speculative decoding

In the seminar we will talk about different kinds of **post-training quantization**.

For more info about quantization, a good starting point is ["A Survey of Quantization Methods for Efficient Neural Network Inference"](https://arxiv.org/abs/2103.13630), 2021.

### Plan:

1. Some notes about Memory Bandwidth Utilization
2. Data-free quantization with T5
3. Weight-only Quantization with calibration (GPTQ)
4. Weight & Activation Quantization (SmoothQuant)

## 1: Memory Bandwidth Utilization (MBU)

Let's read the following passage from [this post](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices) by Databricks.

> So, how exactly should we think about inference speed?
> Our _[Databricks]_ team uses four key metrics for LLM serving:
> 1. **Time To First Token (TTFT)**: How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
> 2. **Time Per Output Token (TPOT)**: Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
> 3. **Latency**: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated)
> 4. **Throughput**: The number of output tokens per second an inference server can generate across all users and requests.

> To measure the underlying hardware's utilization, we introduce a new metric called Model Bandwidth Utilization (MBU). 
> MBU is defined as 

$$\frac{\text{achieved memory bandwidth}}{\text{peak memory bandwidth}}$$

>where 

$$
\text{achieved memory bandwidth} = \frac{\text{total model parameter size + KV cache size}}{\text{TPOT}}
$$

![](memory_bandwidth_utilization.jpg)
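The four serving metrics quoted above are tied together by simple arithmetic. Below is a minimal sketch of that arithmetic. The function names, the 0.5 s TTFT and the 0.75 words-per-token ratio are illustrative assumptions; that ratio is the one that turns the quoted 10 tokens per second into ~450 words per minute.

```python
def response_latency_s(ttft_s: float, tpot_s: float, n_output_tokens: int) -> float:
    """latency = TTFT + TPOT * (number of tokens to be generated)."""
    return ttft_s + tpot_s * n_output_tokens


def tokens_per_second_per_user(tpot_s: float) -> float:
    """Per-user generation speed implied by a given TPOT."""
    return 1.0 / tpot_s


def words_per_minute(tpot_s: float, words_per_token: float = 0.75) -> float:
    """Reading-speed equivalent; words_per_token is an assumed average."""
    return tokens_per_second_per_user(tpot_s) * 60 * words_per_token


tpot_s = 0.100  # 100 ms per output token, as in the quoted example
print(f"{tokens_per_second_per_user(tpot_s):.0f} tokens/s per user")
print(f"{words_per_minute(tpot_s):.0f} words/min")
# hypothetical request: 0.5 s to first token, then 200 generated tokens
print(f"{response_latency_s(0.5, tpot_s, 200):.1f} s total latency")
```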

### Example on how to estimate MBU

- Suppose a 7B-parameter model running in 16-bit precision has a TPOT of 14 ms. Every generated token requires reading all 14 GB of parameters, so the model moves 14 GB in 14 ms, which is 1 TB/s of achieved bandwidth (ignoring the KV cache).
- An A100 can handle up to ~2 TB/s.
- So we are running at an MBU of 50%.
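The estimate above can be reproduced in a few lines. This is only a sketch of the formula from the quoted post: the helper names are made up for illustration, and the KV cache is set to zero to match the simplified example.

```python
def achieved_bandwidth_bytes_per_s(n_params, bytes_per_param, kv_cache_bytes, tpot_s):
    """(total model parameter size + KV cache size) / TPOT."""
    return (n_params * bytes_per_param + kv_cache_bytes) / tpot_s


def bandwidth_utilization(achieved_bytes_per_s, peak_bytes_per_s):
    """MBU = achieved memory bandwidth / peak memory bandwidth."""
    return achieved_bytes_per_s / peak_bytes_per_s


achieved = achieved_bandwidth_bytes_per_s(
    n_params=7e9,        # 7B parameters
    bytes_per_param=2,   # 16-bit precision
    kv_cache_bytes=0,    # ignored here, as in the example above
    tpot_s=0.014,        # 14 ms per output token
)
mbu_example = bandwidth_utilization(achieved, peak_bytes_per_s=2e12)  # ~2 TB/s on A100

print(f"Achieved bandwidth: {achieved / 1e12:.2f} TB/s")
print(f"MBU: {mbu_example * 100:.0f} %")
```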

## 2: Data-free quantization with T5

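First, let's try data-free quantization with `bitsandbytes`. The 8-bit mode used below (`load_in_8bit=True`) implements ["LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale"](https://arxiv.org/abs/2208.07339); the 4-bit data-free formats from the same library were introduced in ["QLoRA: Efficient Finetuning of Quantized LLMs"](https://arxiv.org/abs/2305.14314).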

(Section is based on this [post](https://huggingface.co/blog/hf-bitsandbytes-integration).)
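To make "data-free" concrete, here is a minimal sketch of symmetric absmax quantization, one of the basic schemes described in the post linked above. It is deliberately simplified and written in plain Python: it is not the actual `bitsandbytes` implementation, and the function names are made up for illustration. Note that the scale is derived from the weights alone, with no calibration data involved.

```python
def absmax_quantize(weights, n_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for int8
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -1.0, 0.37, 0.0, 0.81]
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print(f"max round-trip error: {max_error:.5f} (at most scale / 2 = {scale / 2:.5f})")
```

A single large weight stretches the scale for the whole tensor, which is one reason practical methods quantize per row, per column or per group.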

In [None]:
import os

# AutoGPTQ can be very slow if you don't restrict the number of CPU cores it uses
max_cpu_threads = "16"
os.environ["OMP_NUM_THREADS"] = max_cpu_threads
os.environ["OPENBLAS_NUM_THREADS"] = max_cpu_threads
os.environ["MKL_NUM_THREADS"] = max_cpu_threads
os.environ["VECLIB_MAXIMUM_THREADS"] = max_cpu_threads
os.environ["NUMEXPR_NUM_THREADS"] = max_cpu_threads
os.environ["NUMEXPR_MAX_THREADS"] = max_cpu_threads

In [None]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

In [None]:
model_name = "t5-3b-sharded"  # @param ["t5-11b-sharded", "t5-3b-sharded"]

# T5-3b and T5-11B are supported!
# We need sharded weights otherwise we get CPU OOM errors
model_id = f"ybelkada/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True
    ),
    device_map="auto",
)

In [None]:
model_8bit.get_memory_footprint() / 1e9

For t5-3b, the int8 model takes about 5.3 GB, whereas the original model takes 11 GB.

For t5-11b, the int8 model takes about 11 GB vs 42 GB for the original model. Now let's generate some text and look at the quality of the 8-bit model's output!

In [None]:
max_new_tokens = 50

input_ids = tokenizer(
    "translate English to German: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face",
    return_tensors="pt",
).input_ids.to("cuda:0")

outputs = model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
torch.cuda.max_memory_allocated() / 1e9

In [None]:
del model_8bit, tokenizer

In [None]:
# https://stackoverflow.com/questions/57858433/how-to-clear-gpu-memory-after-pytorch-model-training-without-restarting-kernel
import gc

# collect unreachable Python objects first, so that their CUDA memory can actually be released
gc.collect()
torch.cuda.empty_cache()

torch.cuda.reset_peak_memory_stats()
torch.cuda.max_memory_allocated() / 1e9

## 3: Weight-only quantization with calibration dataset (GPTQ)

Data-free quantization usually does something like
$$
\arg\min \|W - W_{\text{quantized}}\|_{F}
$$
It is simple and easy to use. However, it does not account for the fact that we apply our models to a specific distribution of data.

Let $X$ be the activations coming from the previous layers. Then we can formulate the quantization objective as
$$
\arg\min \|X \cdot W - X \cdot W_{\text{quantized}}\|_{F}
$$
The intuition is that we want to preserve _the way layer $W$ transforms the inputs_, not its literal weights.
This is one of the core ideas behind the GPTQ algorithm.
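A toy example shows why the two objectives can disagree. The sketch below is not GPTQ itself; it only compares the two error measures on hand-picked matrices, in plain Python. Both candidate "quantized" weights are equally far from $W$, but the one that perturbs the input channel with large activations distorts the layer output far more.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_distance(A, B):
    """||A - B||_F for two matrices of the same shape."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) ** 0.5


# activations: input channel 0 is large, input channel 1 is small
X = [[10.0, 0.1],
     [8.0, -0.2]]
W = [[0.3], [0.3]]     # one output feature, two input channels
W_a = [[0.5], [0.3]]   # candidate A: error in the high-activation channel
W_b = [[0.3], [0.5]]   # candidate B: same-sized error in the low-activation channel

weight_err_a = frobenius_distance(W, W_a)
weight_err_b = frobenius_distance(W, W_b)
output_err_a = frobenius_distance(matmul(X, W), matmul(X, W_a))
output_err_b = frobenius_distance(matmul(X, W), matmul(X, W_b))

print(f"weight error: A = {weight_err_a:.3f}, B = {weight_err_b:.3f}")
print(f"output error: A = {output_err_a:.3f}, B = {output_err_b:.3f}")
```

The data-free objective cannot tell the two candidates apart, while the activation-aware objective clearly prefers candidate B.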

(Based on the [AutoGPTQ tutorial](https://github.com/AutoGPTQ/AutoGPTQ/tree/main))

### Setting up

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    local_files_only=True,
    low_cpu_mem_usage=True,     # speeds up loading, if `accelerate` is installed
)

In [None]:
model

In [None]:
def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"{count_params(model) // 1e6:4.0f} M parameters")
print(f"{count_params(model.model.embed_tokens) // 1e6:4.0f} M parameters in embedding block")

In [None]:
device = torch.device("cuda:0")
model = model.to(device)

In [None]:
@torch.inference_mode()
def generate(model, tokenizer, prefix, max_length, device="cuda:0") -> str:
    inputs = tokenizer(prefix, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        repetition_penalty=1.1,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
import time
from tqdm.auto import tqdm

prompts = [f"You will never believe this wild conspiracy theory about {topic}:"
    for topic in ("bananas", "grizzly bears", "gummy bears", "Python language", "Yann LeCun")]

max_length = 384

start = time.perf_counter()
answers = [generate(model, tokenizer, prompt, max_length) for prompt in tqdm(prompts)]
generation_time = time.perf_counter() - start

In [None]:
print(answers[4])

Let's calculate MBU for this model.

In [None]:
n_generated_tokens_total = sum([len(answer) - len(prompt)
                                for answer, prompt in zip(tokenizer(answers).input_ids, tokenizer(prompts).input_ids)])
n_generated_tokens_total

In [None]:
print(f"Generation speed: {n_generated_tokens_total / generation_time:.1f} tokens/sec")

In [None]:
def compute_model_size_mb(model):
    model_size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
    model_size_mb += sum(b.numel() * b.element_size() for b in model.buffers()) / 1e6
    return model_size_mb

def compute_memory_bandwidth_utilization(model_and_kv_cache_size_mb, max_bandwidth_mb, time_per_output_token):
    return (model_and_kv_cache_size_mb / time_per_output_token) / max_bandwidth_mb

In [None]:
model_size_mb = compute_model_size_mb(model)

# 2 (keys and values) * batch_size * sequence_length * n_layers * (n_heads * d_head) * bytes per element (2 for fp16)
kv_cache_size_mb = 2 * 1 * max_length * model.config.num_hidden_layers * model.config.hidden_size * 2 / 1e6

a100_max_bandwidth_mb = 2e6

mbu = compute_memory_bandwidth_utilization(
    model_size_mb + kv_cache_size_mb,
    a100_max_bandwidth_mb, 
    generation_time / n_generated_tokens_total
)

print(f"Memory Bandwidth Utilization is {mbu * 100:.2f} %") 

In [None]:
print(f"Model size: {model_size_mb:.0f} MB")
print(f"KV cache size: {kv_cache_size_mb:.0f} MB")
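As a sanity check for the number printed above, the KV cache formula can be evaluated by hand. This is a small sketch with a made-up helper name; it assumes the Llama-2-7B configuration (32 layers, hidden size 4096 = 32 heads of dimension 128), fp16 values and a fully filled cache of `max_length = 384` tokens.

```python
def kv_cache_size_bytes(batch_size, seq_len, n_layers, hidden_size, bytes_per_element=2):
    """Keys and values (factor 2) for every layer, position and hidden channel."""
    return 2 * batch_size * seq_len * n_layers * hidden_size * bytes_per_element


kv_bytes = kv_cache_size_bytes(batch_size=1, seq_len=384, n_layers=32, hidden_size=4096)
print(f"KV cache size: {kv_bytes / 1e6:.1f} MB")  # 201.3 MB
```

At this sequence length and batch size the cache is small compared to roughly 13.5 GB of fp16 weights, so the weights dominate the memory traffic per generated token.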

In [None]:
del model

### Run AutoGPTQ

Let's prepare a calibration dataset.

In [None]:
from datasets import load_dataset

n_samples = 128
dataset = load_dataset("wikitext", "wikitext-2-v1", split="test")

calibration_set = dataset.filter(lambda example: len(example["text"]) > 100)
calibration_set = calibration_set.shuffle(seed=59)[:n_samples]["text"]

len(calibration_set)

In [None]:
calibration_set[:2]

Now we can run GPTQ.

In [None]:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

In [None]:
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

quantized_model_dir = model_name + "_4bit"

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may get slightly worse
)

examples = [tokenizer(sample, return_tensors="pt").to(device) for sample in calibration_set]

In [None]:
# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config,
    local_files_only=True,
    low_cpu_mem_usage=True,
)
model.to(device)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

### Save quantized model

In [None]:
# save quantized model using safetensors
model.save_quantized(quantized_model_dir)

### Check how quantized model generates

In [None]:
# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    low_cpu_mem_usage=True,
    device=device,
)

What size should we expect before and after quantization?

In [None]:
print(f"Before quantization: {model_size_mb:.0f} MB")

In [None]:
model_size_mb = compute_model_size_mb(model)
print(f"After quantization: {model_size_mb:.0f} MB")

The quantized model has a more than 3x smaller memory footprint. You can almost run it on a toaster now.
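Why more than 3x rather than the 4x one might expect when going from 16 to 4 bits? A rough back-of-the-envelope estimate helps. The sketch below rests on assumptions that were not read from AutoGPTQ internals: only the linear layers inside the decoder blocks are quantized, the embedding matrix and the output head stay in fp16, and every group of 128 weights stores one fp16 scale plus one 4-bit zero point. The total parameter count is rounded to 6.74B; the vocabulary size (32000) and hidden size (4096) are those of Llama-2-7B.

```python
def estimate_gptq_size_bytes(n_params_total, n_params_fp16, bits=4, group_size=128,
                             scale_bytes=2.0, zero_bytes=0.5):
    """Rough size estimate of a group-wise quantized model (all inputs are assumptions)."""
    n_quantized = n_params_total - n_params_fp16
    weight_bytes = n_quantized * bits / 8                   # packed low-bit weights
    n_groups = n_quantized / group_size
    group_overhead_bytes = n_groups * (scale_bytes + zero_bytes)
    return weight_bytes + group_overhead_bytes + n_params_fp16 * 2


n_params_total = 6.74e9               # approximate Llama-2-7B parameter count
n_params_fp16 = 2 * 32000 * 4096      # input embeddings + output head kept in fp16

fp16_bytes = n_params_total * 2
gptq_bytes = estimate_gptq_size_bytes(n_params_total, n_params_fp16)
compression = fp16_bytes / gptq_bytes

print(f"fp16:  {fp16_bytes / 1e9:.1f} GB")
print(f"4-bit: {gptq_bytes / 1e9:.1f} GB")
print(f"compression: {compression:.2f}x")
```

Under these assumptions, the unquantized embeddings and the per-group metadata account for the gap between the ideal 4x and what is measured.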

In [None]:
start = time.perf_counter()
answers = [generate(model, tokenizer, prompt, max_length) for prompt in tqdm(prompts)]
generation_time = time.perf_counter() - start

In [None]:
print(answers[4])

In [None]:
n_generated_tokens_total = sum([len(answer) - len(prompt)
                                for answer, prompt in zip(tokenizer(answers).input_ids, tokenizer(prompts).input_ids)])
n_generated_tokens_total

In [None]:
print(f"Generation speed: {n_generated_tokens_total / generation_time:.1f} tokens/sec")

Having compressed the model, we might have hoped for a speedup. However, memory transfers are not the only bottleneck, and inefficiencies in the implementation (for example, in the kernels that unpack 4-bit weights) can slow us down.

GPTQ can still noticeably reduce the memory footprint, which is often vital when you work on a small GPU.

In [None]:
mbu = compute_memory_bandwidth_utilization(
    model_size_mb + kv_cache_size_mb,
    a100_max_bandwidth_mb,
    generation_time / n_generated_tokens_total
)

print(f"Memory Bandwidth Utilization is {mbu * 100:.2f} %") 

In [None]:
del model, examples, tokenizer

## 4: Weight & Activation Quantization (SmoothQuant)

Weight-only quantization helps to improve Memory Bandwidth Utilization. Therefore, it primarily provides speedups at low batch sizes and for autoregressive generation tasks.

To make models faster when you have large batch sizes or don't have to generate responses autoregressively, you can use weight and activation quantization.

By converting weights and activations e.g. from fp16 to int8, we can utilize efficient `GEMM` and `BMM` kernels and theoretically double the throughput.
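The idea behind W8A8 can be sketched as follows: quantize both operands to int8, multiply and accumulate in integers, and rescale the result once at the end. This is an illustrative plain-Python version with per-tensor scales and made-up function names, not the kernels used by any particular library.

```python
def quantize_per_tensor(M, n_bits=8):
    """Symmetric absmax quantization of a whole matrix with one scale."""
    qmax = 2 ** (n_bits - 1) - 1
    absmax = max(abs(v) for row in M for v in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    Q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in M]
    return Q, scale


def matmul(A, B):
    """Matrix product; stays in exact integer arithmetic for integer inputs."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def w8a8_matmul(X, W):
    """Quantize X and W, multiply in integers, then dequantize the accumulators."""
    Xq, x_scale = quantize_per_tensor(X)
    Wq, w_scale = quantize_per_tensor(W)
    acc = matmul(Xq, Wq)  # integer accumulators
    return [[v * x_scale * w_scale for v in row] for row in acc]


X = [[0.5, -1.0], [0.25, 0.75]]
W = [[0.2, -0.4], [0.1, 0.3]]

exact = matmul(X, W)
approx = w8a8_matmul(X, W)
max_abs_error = max(abs(e - a) for re, ra in zip(exact, approx) for e, a in zip(re, ra))
print(f"max abs error of the W8A8 product: {max_abs_error:.5f}")
```

The inner product runs entirely on small integers, which is what lets hardware int8 units do the heavy lifting; the floating-point work is limited to one multiplication per output element.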

This part is adapted from this [example](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_llama_demo.ipynb).

In [None]:
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
import torch
import torch.nn as nn
from transformers.models.llama.modeling_llama import (
    LlamaAttention,
    LlamaDecoderLayer,
    LlamaForCausalLM,
    LlamaMLP,
)
from transformers import LlamaTokenizer
import smoothquant
from smoothquant.smooth import smooth_lm
from smoothquant.fake_quant import quantize_llama_like
import tqdm

> The following is an evaluator to see the performance of the model. We use a toy dataset (the first 40 examples in the test set of the Wikitext-2 dataset) to evaluate the model. You can replace it with your own dataset. The conclusion should be the same.

In [None]:
class Evaluator:
    def __init__(self, dataset, tokenizer, device, n_samples=40):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device

        self.dataset = tokenizer(
            "\n\n".join(dataset["text"]), return_tensors="pt"
        ).input_ids.to(device)

        self.n_samples = n_samples

    @torch.no_grad()
    def evaluate(self, model):
        model.eval()
        nlls = []
        for i in tqdm.tqdm(range(self.n_samples), desc="Evaluating..."):
            batch = self.dataset[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
            # gradients are already disabled by the @torch.no_grad() decorator
            lm_logits = model(batch).logits
            shift_logits = lm_logits[:, :-1, :].contiguous().float()
            # reuse the batch so that the labels live on the same device as the logits
            shift_labels = batch[:, 1:]
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
            )
            neg_log_likelihood = loss.float() * 2048
            nlls.append(neg_log_likelihood)

        return torch.exp(torch.stack(nlls).sum() / (self.n_samples * 2048))
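The value returned by `evaluate` is the perplexity: the exponent of the average negative log-likelihood per token. A minimal standalone sketch of that definition (the helper name and the toy inputs are illustrative only):

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a sequence of token log-probabilities."""
    nll = -sum(token_log_probs)
    return math.exp(nll / len(token_log_probs))


# a model that always spreads its probability uniformly over 4 candidate tokens
ppl_uniform = perplexity([math.log(0.25)] * 8)
# a model that is more confident about the correct tokens
ppl_confident = perplexity([math.log(0.9)] * 8)

print(f"uniform over 4 tokens: {ppl_uniform:.2f}")
print(f"confident model:       {ppl_confident:.2f}")
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.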

In [None]:
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
device = "cuda:0"

tokenizer = LlamaTokenizer.from_pretrained(model_name)
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
evaluator = Evaluator(dataset, tokenizer, device)

**FP16 Model Perplexity**

> Let's first check the performance of the original FP16 model.

In [None]:
model_fp16 = LlamaForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto", local_files_only=True, low_cpu_mem_usage=True,
)

In [None]:
ppl_fp16 = evaluator.evaluate(model_fp16)
print(f"Original model (fp16) perplexity: {ppl_fp16}")

> We then quantize the model to W8A8 and check the performance.

**Naive W8A8 Quantized Model Perplexity**

In [None]:
%%time
model_w8a8 = quantize_llama_like(model_fp16)
print(model_w8a8)

In [None]:
ppl_w8a8 = evaluator.evaluate(model_w8a8)
print(f"Naive W8A8 quantized model perplexity: {ppl_w8a8}")

> We can see there is a perplexity increase. We then use SmoothQuant to quantize the model and check the performance.

**SmoothQuant W8A8 Quantized Model Perplexity**

In [None]:
# We have to load corresponding activation scales:
#!wget https://huggingface.co/mit-han-lab/smoothquant-scales/resolve/main/llama-2-7b.pt

In [None]:
model = LlamaForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
act_scales = torch.load("llama-2-7b.pt")

In [None]:
%%time
smooth_lm(model, act_scales, 0.85)
model_smoothquant_w8a8 = quantize_llama_like(model)
print(model_smoothquant_w8a8)

In [None]:
ppl_smoothquant_w8a8 = evaluator.evaluate(model_smoothquant_w8a8)
print(f"SmoothQuant W8A8 quantized model perplexity: {ppl_smoothquant_w8a8}")

> We can see the smoothed model has a lower perplexity which is close to the FP16 model's. This is because SmoothQuant smooths the outliers in activations and balances the quantization difficulty of activations and weights.
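The smoothing step can be illustrated on a tiny example. The sketch below applies the per-channel scaling from the SmoothQuant paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$, in plain Python with hand-picked matrices. It only illustrates the math: the `smoothquant` library works on calibrated activation statistics and real model layers, and the helper names here are made up. The value `alpha=0.85` mirrors the `smooth_lm` call above.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth(X, W, alpha=0.85):
    """Divide each activation channel by s_j and multiply the matching weight row by s_j."""
    n_channels = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_channels)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_channels)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_channels)]
    X_smooth = [[row[j] / s[j] for j in range(n_channels)] for row in X]
    W_smooth = [[v * s[j] for v in W[j]] for j in range(n_channels)]
    return X_smooth, W_smooth


# input channel 1 of the activations is an outlier channel
X = [[1.0, 100.0],
     [-2.0, 80.0]]
W = [[0.5, -0.3],
     [0.02, 0.01]]

X_smooth, W_smooth = smooth(X, W)
Y, Y_smooth = matmul(X, W), matmul(X_smooth, W_smooth)

max_act_before = max(abs(v) for row in X for v in row)
max_act_after = max(abs(v) for row in X_smooth for v in row)
max_output_diff = max(abs(a - b) for ra, rb in zip(Y, Y_smooth) for a, b in zip(ra, rb))

print(f"largest activation: {max_act_before:.1f} -> {max_act_after:.2f}")
print(f"largest change in the layer output: {max_output_diff:.2e}")
```

The layer output is unchanged up to floating-point error, but the activation range shrinks, so a per-tensor int8 grid wastes far fewer levels on a single outlier channel. Part of the quantization difficulty moves into the weights, which are easier to quantize.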

## Summary

- Data-free quantization methods are very fast, and you can often grid-search optimal quantization hyperparameters on your laptop.
- Weight-only quantization methods mainly address memory bottlenecks (which mostly occur at low batch sizes).
- Weight & Activation quantization methods can address both memory and computation bottlenecks, achieving speedups e.g. by using efficient int8 matrix multiplication kernels, but might have slightly lower quality than weight-only methods.
- Finally, the points above are generalizations: there is no silver bullet, and the only way to know whether a quantization method fits your application is to try it.