# Language Models 2 | Quantization


## Quantization

Quantization is a set of techniques that convert the numbers stored in a model (its weights and biases) from their original format (say, `float32`) into lower-precision formats (say, 8-bit or 4-bit) that take much less memory.
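To make the idea concrete, here is a minimal sketch of one of the simplest schemes, "absmax" quantization to 8-bit integers. It is written in plain Python for illustration only: the weights are made up, the function names are hypothetical, and real libraries use more refined schemes applied block by block.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.1234, -0.5321, 0.9876, -1.27, 0.0]  # made-up example weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print("quantized:", q)
print("rounding errors:", [round(w - r, 4) for w, r in zip(weights, restored)])
```

Each weight is now stored as a small integer plus one shared scale, and the price is a rounding error of at most half a quantization step per weight.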

The main rule of thumb: there is a trade-off between *memory* (more aggressive quantization uses less, which is good) and *quality* (more aggressive quantization degrades the model's abilities, which is bad).

For guides, see:
- the ['Overview'](https://huggingface.co/docs/transformers/en/quantization/overview) in the `Transformers` library docs,
- the ['bitsandbytes'](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes) guide, also there,
- ['Quantization'](https://huggingface.co/docs/peft/main/en/developer_guides/quantization) in the `PEFT` library docs, and
- [this short course on Deeplearning.ai](https://learn.deeplearning.ai/courses/quantization-fundamentals), for people who want to go really deep.

In this notebook, we do only one thing: load the largest model that fits on a free T4 GPU on Colab, by quantizing it as much as possible. Here we use a member of the [`Qwen3` family](https://huggingface.co/collections/Qwen/qwen3) (you could also try [`Gemma3`](https://huggingface.co/collections/google/gemma-3-release)).

See [this example](https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora), where they use a similar quantization scheme (also ChatGPT).
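Some back-of-the-envelope arithmetic shows why heavy quantization is needed here. The sketch below assumes a model of roughly 8 billion parameters and a T4 with a nominal 16 GB of memory (in practice a bit less is available); it counts the weights only, ignoring activations, the KV cache and quantization constants.

```python
N_PARAMS = 8e9      # assumption: roughly 8 billion parameters
T4_MEMORY_GB = 16   # nominal T4 memory; less is usable in practice

BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(n_params, bits):
    """Memory needed to store the weights alone, in gigabytes."""
    return n_params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{name:>8}: {gb:5.1f} GB -> {verdict}")
```

Under these assumptions, only the 8-bit and 4-bit versions leave any room on the GPU, and 4-bit leaves the most headroom for generation itself.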

## Install & Workflow

In [None]:
import sys
if 'google.colab' in sys.modules:
    !pip install -Uq bitsandbytes accelerate transformers strip_markdown

#### Drive

If you need to load/save to your drive:

```python
import os
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')
    # change to another directory (adjust the path to your own Drive layout)
    os.chdir('drive/My Drive/gold/IS53055B-DMLCP/DMLCP')
```

#### Huggingface login

For some models and datasets, and if you want to push your model to the HF Hub (like GitHub, but for models), you need to be logged into your HF account.

For that, create an account [here](https://huggingface.co/) and then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

```python
from huggingface_hub import get_token, notebook_login

# only ask for a token if none is already stored or set in the environment
if get_token() is None:
    notebook_login()
```

## Imports

In [None]:
import textwrap
import strip_markdown
from threading import Thread
from IPython.display import clear_output

import torch

# Pick the best available device: cuda (GPU), mps (Apple Silicon) or cpu.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import GenerationConfig
from transformers import BitsAndBytesConfig
from transformers import TextIteratorStreamer

### Printing Utils

In [None]:
def strip_md(text):
    return strip_markdown.strip_markdown(text)

# many more options, see them with textwrap.TextWrapper?
tw = textwrap.TextWrapper(
    # the formatted width we want
    width=79,
    # this will keep whitespace & line breaks in the original text
    replace_whitespace=False
)

def wrap_print(s):
    """Format text into Textwrapped lines and print it"""
    print("\n".join(tw.wrap(s)))
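One caveat, noted in the Python docs for `textwrap`: with `replace_whitespace=False`, newlines can end up in the middle of a wrapped line and produce odd-looking output, and the docs suggest splitting the text into paragraphs and wrapping each one separately. A minimal sketch of that alternative (the name `wrap_paragraphs` is ours, not part of the notebook above):

```python
import textwrap


def wrap_paragraphs(text, width=79):
    """Wrap each line of `text` separately, keeping the original line breaks."""
    wrapped = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, so blank lines survive
        wrapped.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped)


sample = "A short line.\n\n" + "word " * 30
print(wrap_paragraphs(sample, width=40))
```

This could be swapped in for `tw.wrap` inside `wrap_print` if the streamed output looks ragged.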

## Loading Model

In [None]:
# B for billions (of parameters); try adding '-Base' for the non-chat model
MODEL_ID = "Qwen/Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# For smaller models, 8-bit quantization may be enough
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Here, we quantize to 4 bits (NF4) and double-quantize: the whole shebang
quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    # with Colab Pro, you get access to GPUs with 40GB or 80GB of memory; you could try commenting this out...
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
    # PyTorch's native scaled dot product attention (SDPA), which can use fused, memory-efficient kernels such as Flash Attention where available (ChatGPT)
    attn_implementation="sdpa"
)
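What does double quantization buy us? A 4-bit scheme stores one scaling constant per block of weights, and double quantization compresses those constants in turn. The sketch below reproduces the arithmetic from the QLoRA paper; the block sizes (64 weights per block, 256 constants per second-level block) are that paper's figures and are assumed here to match what `bitsandbytes` does by default.

```python
BLOCK_SIZE = 64          # weights sharing one scaling constant (QLoRA paper's figure)
SECOND_BLOCK_SIZE = 256  # constants sharing one second-level constant (same source)


def overhead_bits_per_param(double_quant):
    """Extra bits per weight spent on storing the scaling constants."""
    if not double_quant:
        # one float32 constant per block of weights
        return 32 / BLOCK_SIZE
    # 8-bit constants, plus one float32 per block of constants
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)


for dq in (False, True):
    total = 4 + overhead_bits_per_param(dq)
    print(f"double_quant={dq}: {total:.3f} bits per parameter")
```

The saving is a bit under 0.4 bits per parameter, which adds up to a few hundred megabytes at this scale. Once the model is loaded, `model.get_memory_footprint()` returns the actual size in bytes if you want to check.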

## Generation

### Chat

In [None]:
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    # return a dict (input_ids, attention_mask) so it can be unpacked into generate()
    return_dict=True,
    # if True, the model will write preliminary steps in a <think> block before answering
    enable_thinking=True
).to(model.device)
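Under the hood, the chat template turns the list of messages into one long string with special markers around each turn. Qwen models use a ChatML-style layout; the function below is a simplified, hand-written approximation for illustration only (the real template also handles tools and the thinking switch).

```python
def to_chatml(messages, add_generation_prompt=True):
    """Hand-written approximation of a ChatML prompt, for illustration only."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # open the assistant's turn, so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a friendly pirate."},
    {"role": "user", "content": "Hello!"},
]
print(to_chatml(demo_messages))
```

To see what the real template produces, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the result.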

In [None]:
# see: https://huggingface.co/blog/aifeifei798/transformers-streaming-output

# initialize the streamer
streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

# create generation arguments, adding the streamer
generation_kwargs = {
    # destructuring the chat
    **tokenized_chat,
    "streamer": streamer,
    "max_new_tokens": 500,
}

# run generation in a separate thread so we can consume the streamer in the main thread
thread = Thread(
    target=model.generate,
    kwargs=generation_kwargs
)
thread.start()

# print the text as it is generated
text = ""

for chunk in streamer:
    # let's handle the thinking bit by separating it from the rest
    if chunk.strip() in ["<think>", "</think>"]:
        text += "--- thinking ----\n"
        continue
    # add the current chunk
    text += chunk
    
    # print the whole text so far
    wrap_print(strip_md(text))
    
    # clear the output just before the next print (note: this only works in Jupyter/Colab)
    clear_output(wait=True)

thread.join()
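If the thread-plus-streamer pattern feels opaque, here is a toy version with no model involved. `ToyStreamer` and `fake_generate` are made-up stand-ins, written in the same spirit as the real classes: a producer thread pushes chunks into a queue, and the main thread iterates over them as they arrive.

```python
import queue
import time
from threading import Thread


class ToyStreamer:
    """Minimal stand-in for a text streamer: a queue you can iterate over."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, chunk):
        self._queue.put(chunk)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer sends something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer, words):
    """Pretend to be model.generate: emit chunks one by one, then signal the end."""
    for word in words:
        time.sleep(0.01)  # simulate the time needed to produce a token
        streamer.put(word + " ")
    streamer.end()


toy_streamer = ToyStreamer()
worker = Thread(
    target=fake_generate,
    args=(toy_streamer, ["Arr,", "not", "a", "single", "one!"]),
)
worker.start()

# consume chunks in the main thread while the worker is still producing
received = "".join(chunk for chunk in toy_streamer)
worker.join()
print(received)
```

Running generation in a separate thread is what lets the main thread print partial text instead of waiting for the full answer.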

### Open-ended generation

(Use this if you want to try `Qwen/Qwen3-8B-Base`, or some other base model, e.g. the `-pt` models in the Gemma family.)

In [None]:
# inputs = tokenizer("The moon", return_tensors="pt").to(device)

# # generate text (use a GenerationConfig or arguments to change the defaults)
# output = model.generate(**inputs, max_new_tokens=100)

# # decode back from tokens to text
# text = tokenizer.decode(output[0], skip_special_tokens=True)

# wrap_print(text)

## Saving your model

If you want, you can save the quantized model and push it to the HF Hub by doing:

```python
# drop the organisation prefix (the part before '/') and append a short description
MODEL_DIR = f"{MODEL_ID.split('/')[1]}.4bit-nf4-dblq"
model.save_pretrained(MODEL_DIR, safe_serialization=True, max_shard_size="2GB")
model.push_to_hub(MODEL_DIR)
```