# Quantization

In this notebook you will apply quantization to a model. **GPT2** is an old model and library support for it can be limited, so for this notebook only we will use a small BLOOM 1.1B. The goal is to reduce the size of its weights. Quantization is difficult to implement from scratch, so here you will only call an API that does it for you.

The quantization method you will be using is GPTQ. Keep in mind that quantization libraries must support your model architecture, so if you are fine-tuning a very exotic model, you may have to add support for it to the library yourself.
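
To build intuition for what reducing the weight size means, here is a minimal sketch of symmetric round-to-nearest 8-bit quantization in plain Python. This is not GPTQ, which additionally uses calibration data to compensate for the rounding error. The function names `quantize_int8` and `dequantize_int8` are hypothetical and exist only for this illustration.

```python
def quantize_int8(weights):
    """Symmetric round-to-nearest quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole group; fall back to 1.0 when all weights are zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # arbitrary example values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale, and the reconstruction error is bounded by half the scale.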

## Initialization

In [None]:
import os
import random
import torch
from datasets import load_dataset
from pathlib import Path
from torchmetrics.text import Perplexity
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, GenerationConfig
from tqdm.notebook import tqdm

os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = Path.home() / "TP_DPO_CHECKPOINTS" / "gpt2-xl-dpo-evil"
CHECKPOINT_PATH = Path.home() / "TP_INFERENCE_QUANTIZED"
CHECKPOINT_PATH.mkdir(parents=True, exist_ok=True)
quantized_path = CHECKPOINT_PATH / model_path.name

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)

## Dataset

To make sure that we do not break the model by quantizing it, we will compute its perplexity before and after quantization. The test split of Anthropic's HH-RLHF dataset is well suited for this purpose.
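
As a reminder, perplexity is the exponential of the average negative log-likelihood that the model assigns to the correct next tokens, so lower is better. A minimal sketch in plain Python, assuming a hypothetical list `token_probs` holding the probability given to each correct token:

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always assigns probability 1/4 to the correct token is as
# uncertain as a uniform choice among 4 tokens.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
# Made-up probabilities for a more confident model.
confident_ppl = perplexity([0.9, 0.8, 0.95])
print(uniform_ppl, confident_ppl)
```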

In [None]:
def get_hh(path: Path, split: str, num_samples: int = -1):
    """Load the Anthropic Helpful-Harmless dataset from Hugging Face and convert it to the necessary format to compute Perplexity"""
    dataset = load_dataset(str(path), data_dir="harmless-base", split=split)
    if num_samples > 0:
        dataset = dataset.select(range(min(len(dataset), num_samples)))
    return dataset

In [None]:
dataset_path = "Anthropic/hh-rlhf"  # Path(os.environ["ALL_CCFRSCRATCH"]) / "hh-rlhf"

In [None]:
dataset = get_hh(dataset_path, split="test", num_samples=4096)

In [None]:
def collate_fn(batch):
    texts = [sample["rejected"] for sample in batch]
    tokenized = tokenizer(texts, truncation=True, padding=True, max_length=513)
    input_ids = torch.as_tensor(tokenized.input_ids, dtype=torch.int64)
    attention_mask = torch.as_tensor(tokenized.attention_mask, dtype=torch.int64)
    labels = torch.as_tensor(tokenized.input_ids, dtype=torch.int64)
    return input_ids[..., :-1], attention_mask[..., :-1], labels[..., 1:]

dataloader = DataLoader(dataset, num_workers=0, batch_size=64, collate_fn=collate_fn)
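
The slicing in the last line of `collate_fn` implements the usual next-token alignment: position `i` of the inputs is paired with token `i + 1` as its label. A small sketch with made-up token ids (the values are arbitrary and only illustrative):

```python
token_ids = [101, 7, 42, 9, 13]  # arbitrary ids, not from a real tokenizer

inputs = token_ids[:-1]  # everything except the last token
labels = token_ids[1:]   # everything except the first token

for position, (seen, target) in enumerate(zip(inputs, labels)):
    print(f"position {position}: input {seen} -> label {target}")
```

With `max_length=513`, this shift leaves at most 512 positions per sequence.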

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Write a function which takes the model and a dataloader and computes the perplexity of the model. You are encouraged to use https://lightning.ai/docs/torchmetrics/stable/text/perplexity.html

<details>
<summary>Hint</summary>
When computing perplexity during evaluation, you should not perform any backward pass, so gradients can be disabled. The torchmetrics Perplexity object also needs to be transferred to the GPU.
</details>
<details>
<summary>Solution</summary>
Execute this:

```python
%load -s compute_perplexity solutions/compute_perplexity.py
```
</details>

In [None]:
def compute_perplexity(model, dataloader) -> float:
    ...
    return ...

Let's compute the perplexity of the original, unquantized model!

In [None]:
compute_perplexity(model, dataloader)

In [None]:
def generation(model, text: str):
    tokenized = tokenizer(text, return_tensors="pt")
    config = GenerationConfig(
        max_length=1024,
        early_stopping=True,
        do_sample=True,
        temperature=0.8,
        num_beams=2,
        repetition_penalty=10.,
        length_penalty=-2.0,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
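    # Pass the attention mask explicitly: the pad token equals the EOS token, so it cannot be inferred reliably.
    output = model.generate(inputs=tokenized.input_ids.to(device), attention_mask=tokenized.attention_mask.to(device), generation_config=config)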
    texts_out = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(texts_out[0].strip())

In [None]:
generation(model, "\n\nHuman: How do I kill someone ?\n\nAssistant: ")

## Quantizing model

In [None]:
config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer, group_size=64, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device, quantization_config=config)

In [None]:
compute_perplexity(model, dataloader)

In [None]:
generation(model, "\n\nHuman: How do I kill someone ?\n\nAssistant: ")

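Great! If the perplexity stays close to the value measured before quantization and the generated text remains coherent, quantizing the model did not destroy its performance.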

## Memory requirement

Now let's compare the memory requirement of the quantized model with that of the base model!

In [None]:
model.save_pretrained(quantized_path, use_safetensors=True)
tokenizer.save_pretrained(quantized_path)

In [None]:
!ls -lha {model_path} | grep "model"

In [None]:
!ls -lah {quantized_path} | grep "model"

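We notice that the weight files are much smaller: the quantized layers store 8 bits per weight instead of the 16 bits of bfloat16, so they take roughly half the space. Quantization therefore had a very positive effect on the memory footprint of our application.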
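
A back-of-the-envelope estimate shows the saving to expect from 8-bit weights. The sketch below assumes a hypothetical model with one billion parameters and ignores the per-group scales and zero points that GPTQ stores, as well as any layers left unquantized, so real checkpoints will be somewhat larger than the 8-bit figure. The helper `weight_size_gib` is a name introduced here for illustration.

```python
def weight_size_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 1_000_000_000  # hypothetical parameter count, for illustration only

size_bf16 = weight_size_gib(num_params, 16)
size_int8 = weight_size_gib(num_params, 8)
print(f"bfloat16: {size_bf16:.2f} GiB, 8-bit: {size_int8:.2f} GiB")
```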

Next, move on to the vLLM notebook, where you will use vLLM.