# Language models 2 | Quantisation & Parameter-Efficient Finetuning


Modified and augmented from the original notebook [here](https://github.com/huggingface/notebooks/blob/main/peft/gemma_7b_english_quotes.ipynb), used in [this blog post](https://huggingface.co/blog/gemma-peft).

#### Huggingface login

Some models and datasets require you to be logged into your Hugging Face (HF) account, as does pushing your own model to the HF Hub (think GitHub, but for models).

For that, you need to create an account [here](https://huggingface.co/), then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

```python
from pathlib import Path
from huggingface_hub import notebook_login
if not (Path.home()/'.huggingface'/'token').exists():
    notebook_login()
```

## LLMs on Consumer-grade Hardware

### 1. Quantization

Quantization is the set of techniques that transform the numbers used in models (weights and biases) from their original format (say, `float32`) into ones that take much less memory.
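
As a rough illustration of the idea, here is a minimal sketch of *absmax* quantization, one of the simplest schemes: every value is divided by a scale so that the largest magnitude maps onto the largest representable integer, then rounded. The function names and the toy weights are invented for this example, and libraries such as `bitsandbytes` use more elaborate schemes.

```python
def quantize_absmax(values, bits=8):
    """Quantize floats to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to (approximate) floats."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -0.9]     # toy "weights", made up
q8, scale8 = quantize_absmax(weights, bits=8)
restored = dequantize(q8, scale8)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q8)         # small integers that each fit in one byte
print(max_error)  # rounding error introduced by the round trip
```

Calling `quantize_absmax(weights, bits=4)` on the same list gives a coarser grid and a larger round-trip error.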

The main rule of thumb is: there is a trade-off between *memory* (more quantization takes less, which is good) and *quality* (more quantization degrades the model abilities, which is bad).
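
To put rough numbers on the memory side of this trade-off, here is a back-of-the-envelope sketch: the weights alone take about (number of parameters) × (bits per parameter) / 8 bytes. The helper below is hypothetical and ignores activations, optimizer state and any quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a nominal "7 billion parameter" model
for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>17}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

On these numbers, the `float32` weights of such a model come to about 28 GB, against roughly 7 GB in 8 bits and 3.5 GB in 4 bits.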

For guides, see:
- the ['Overview'](https://huggingface.co/docs/transformers/en/quantization/overview) in the `Transformers` library docs, and
- the ['BitsandBytes'](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes) one, also there, and
- ['Quantization'](https://huggingface.co/docs/peft/main/en/developer_guides/quantization) in the `Peft` library docs.
- [This short course on Deeplearning.ai](https://learn.deeplearning.ai/courses/quantization-fundamentals), for people wanting to go real deep.

Quantization can go even further, to 4 bits, which is what we use below. For full examples of this, check [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), as well as the [inference](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing) and [fine-tuning](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing) notebooks.

Here's an example of how you would load a model in 8 bits:

In [None]:
!pip install -U bitsandbytes

In [None]:
import torch
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig

# You can try and comment out this line...
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b", # 7 billion parameters
    # ...as well as this one, and witness the crash of your Colab session (out of RAM)
    quantization_config=quantization_config,
    torch_dtype=torch.float32,
)

### 2. LoRAs

**LoRA** stands for "[Low-Rank Adaptation (of Large Language Models)](https://arxiv.org/abs/2106.09685)". The technique leverages linear algebra: instead of updating a large weight matrix directly, it freezes that matrix and trains a pair of much smaller, low-rank matrices whose product is added to it. Users therefore only train small "patches" (in HF they are called _adapters_), which can later be merged into the original weights. As often happens in this field, this sparked a wave of interest, leading to many different techniques (see [here](https://huggingface.co/docs/peft/main/en/conceptual_guides/adapter) for some of them).

![LoRA illustration](https://cdn-lfs.hf.co/datasets/huggingface/documentation-images/4313422c5f2755897fb8ddfc5b99251358f679647ec0f2d120a3f1ff060defe7?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27lora_diagram.png%3B+filename%3D%22lora_diagram.png%22%3B&response-content-type=image%2Fpng&Expires=1731883940&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMTg4Mzk0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9kYXRhc2V0cy9odWdnaW5nZmFjZS9kb2N1bWVudGF0aW9uLWltYWdlcy80MzEzNDIyYzVmMjc1NTg5N2ZiOGRkZmM1Yjk5MjUxMzU4ZjY3OTY0N2VjMGYyZDEyMGEzZjFmZjA2MGRlZmU3P3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiZyZXNwb25zZS1jb250ZW50LXR5cGU9KiJ9XX0_&Signature=fOIer2uHVZAr2LmQmw%7E%7Es16CV7Y8KOi3n%7E5rmtg6NtiW12gHiJQPMpnm681nVMqZCKLpy95QSKbljUq6h5jOFg1fU80aTVuyX9oajdan3-1sqlydDWoGINOYmPtowfUaEWPlo4Kka%7EO%7EbZZ0VBJu7l63z%7EyvtKnT8LhGiuV4pA87pKZnuIk7YnwI6VdOCR9%7EkTyswl2UobJgmWEnvFTD6ap44lEOGWQjj58XhNINmRfDJznKJWnF%7E86hVXTajN4h8gEdJzQz6qXXGbbbxzDHQjujJZmKffHHzHOANsRNUhlTaDBRvkK4OHUt-zpyMMf6uSHjudK05OM12xxe9TO6hg__&Key-Pair-Id=K3RPWS32NSSJCE)

([source](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model))
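
A minimal sketch of the idea in plain Python, under the usual LoRA formulation: the frozen weight `W` is left untouched, and a trainable low-rank product `B @ A`, scaled by `lora_alpha / r`, is added on top. The tiny matrices and helper names are made up for illustration; real implementations operate on tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, lora_alpha = 1, 2
scaling = lora_alpha / r

W = [[1.0, 0.0, 2.0],      # frozen 3x3 base weight
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
A = [[0.5, -1.0, 0.25]]    # trainable, r x 3
B = [[0.0], [0.0], [0.0]]  # trainable, 3 x r, starts at zero

x = [[1.0], [2.0], [3.0]]  # one input column vector


def forward(W, A, B, x):
    base = matmul(W, x)                  # frozen path
    update = matmul(B, matmul(A, x))     # low-rank path
    return add(base, update, scaling)


# At initialisation B is zero, so the adapter changes nothing.
print(forward(W, A, B, x) == matmul(W, x))         # True

# After "training" (here: a hand-picked B), the adapter can be merged into W.
B = [[1.0], [0.0], [-2.0]]
W_merged = add(W, matmul(B, A), scaling)
print(forward(W, A, B, x) == matmul(W_merged, x))  # True
```

The second check is what "merging" means here: once training is done, `W + scaling * (B @ A)` can replace `W`, so the adapter path is no longer needed at inference time.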

Imagine we have an LLM loaded, like the one above. How would we configure it to train using these memory-saving techniques?

In [None]:
from peft import TaskType
from peft import LoraConfig
from peft import get_peft_model

# We need a config (a rabbit hole in its own right)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8, # rank, aka LoRA attention dimension
    lora_alpha=32,
    lora_dropout=0.1
)

peft_model = get_peft_model(model, peft_config=lora_config)
peft_model.print_trainable_parameters()

As you can see, if we trained now we would only fine-tune a *tiny* fraction of all parameters!
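
To see where the saving comes from, here is a rough count for a single weight matrix: a full `d × k` matrix has `d * k` entries, whereas its LoRA adapter of rank `r` has only `r * (d + k)`. The 4096 × 4096 shape below is purely illustrative, not Gemma's actual layer size.

```python
def full_params(d, k):
    """Number of entries in a full d x k weight matrix."""
    return d * k


def lora_params(d, k, r):
    """Number of entries in the two LoRA matrices (d x r and r x k)."""
    return r * (d + k)


d = k = 4096   # hypothetical layer shape
r = 8          # same rank as in the config above
full = full_params(d, k)
lora = lora_params(d, k, r)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```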

What is great about this is that you train only this fraction of the weights, the 'adapter', and save only that, locally or to the Hugging Face Hub. Whenever you or someone else downloads it, the library will automatically fetch the base model and apply the adapter on top of it!

For more on these techniques, see:
- the ['Quicktour'](https://huggingface.co/docs/peft/main/en/quicktour), and
- ['Configurations and models'](https://huggingface.co/docs/peft/main/en/tutorial/peft_model_config), as well as
- ['Integrations'](https://huggingface.co/docs/peft/main/en/tutorial/peft_integrations).

Now, let's clear things up for our next example.

In [None]:
# clear our memory (it takes a little while)
import gc
import time

del quantization_config, model, lora_config, peft_model
gc.collect()
torch.cuda.empty_cache()

time.sleep(15)

## Two training examples (Quantized + LoRA)

### Our model

#### Install & Imports

In this example, the authors use a new, higher-level library for training called [trl](https://huggingface.co/docs/trl/en/index).

In [None]:
import sys
if 'google.colab' in sys.modules:
    !pip install bitsandbytes
    !pip install peft
    !pip install trl
    !pip install accelerate
    !pip install datasets
    !pip install transformers

In [None]:
import os

import torch

# Get cpu, gpu or mps device for training.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

import transformers

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

from transformers import BitsAndBytesConfig

from peft import LoraConfig

from trl import SFTConfig
from trl import SFTTrainer

#### Model Loading & Config

We will fine-tune [Google's Gemma](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315), but you can have a look at other available recent models:
- [Microsoft's Phi](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3)
- [Meta's Llama](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf)
- [Mistral](https://huggingface.co/mistralai)
- [Falcon](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- many, many others, see the [Open LLM leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)


[Here's](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) where you would search for text generation models (note the many other options in the side bar).


In [None]:
MODEL_ID = "google/gemma-7b"
# 7 billion parameters! (there are also 2-3b models available: e.g. google/gemma-2-2b)
# thanks to these techniques you can finetune these on a 15GB T4, but you
# may still encounter memory issues (depending on dataset, batch size, etc.)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# this time we quantize in 4-bits
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map={"":0},
)

# this is passed to the trainer further down, which applies it for us
lora_config = LoraConfig(
    r=8, # rank, aka LoRA attention dimension
    # `target_modules` lists the weight matrices that get a LoRA adapter (which
    # ones to choose depends on the architecture, rule of thumb: follow what HF does...)
    target_modules=[
        "q_proj", "o_proj", "k_proj", "v_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
)

#### Inference

Let's test our model before training.

In [None]:
text = "Quote: Imagination"

inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Fine-tuning on a specific dataset



[Here](https://huggingface.co/datasets?modality=modality:text&format=format:text&sort=trending) is where you would search for other ready-made datasets.

For more on the `datasets` library, see the [`Quickstart`](https://huggingface.co/docs/datasets/en/quickstart) and the tutorials (on the left-hand side bar).

#### Example 1: English Quotes



See the dataset page [here](https://huggingface.co/datasets/Abirate/english_quotes).

In [None]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")

In [None]:
print(data)

In [None]:
n = 3
for i, d in enumerate(data["train"]):
    print(d.keys())
    print(d["quote"])
    print(d["author"])
    print(d["tags"])
    print()
    print("---")
    if n == i:
        break

The main point of using `SFTConfig` and `SFTTrainer` instead of the regular `TrainingArguments` and `Trainer` is to make things more high-level (_supposedly_ simpler), and also to let you define this `formatting_func`, which turns each dataset example into the text the model is trained on and therefore shapes the behaviour of your model. Making your model follow a certain format lets you anticipate how generation is likely to go, which in turn helps you write code around it (for instance, you have a better idea of when to cut off generation once the end of an answer is reached).

You can have a look at ['Supervised Fine-tuning Trainer'](https://huggingface.co/docs/trl/en/sft_trainer) for more details.
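
As a small illustration of why a fixed format helps, the sketch below formats one record with the same `Quote: ... / Author: ...` template, then uses the template to cut a generated string at the point where the model starts a second quote. Both helpers and the sample strings are hypothetical; the `formatting_func` in the cell that follows is the version actually passed to the trainer.

```python
def format_example(quote, author):
    """Render one record using the 'Quote: ... / Author: ...' template."""
    return f"Quote: {quote}\nAuthor: {author}"


def truncate_after_first_record(generated, marker="Quote:"):
    """Keep only the first record: cut where a second `marker` begins."""
    second = generated.find(marker, len(marker))
    return generated if second == -1 else generated[:second].rstrip()


# Placeholder record, not taken from the dataset.
print(format_example("An example quote.", "An Example Author"))

# A made-up generation that runs on into a second, unfinished quote.
raw = "Quote: Imagination is everything.\nAuthor: Someone\nQuote: Imagination is"
print(truncate_after_first_record(raw))
```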

In [None]:
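def formatting_func(example):
    # As in the original notebook, this assumes `example` is a batch (a dict of
    # lists), hence the `[0]` indexing and the list returned; depending on your
    # `trl` version, you may need to return a single string per example instead.
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]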

training_args = SFTConfig(
    max_seq_length=8192, # https://huggingface.co/docs/transformers/en/model_doc/gemma#transformers.GemmaConfig
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,       # very few steps
    learning_rate=2e-4, # low learning rate
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    report_to=["none"] # we don't need to log the training run to wandb.ai
)

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=training_args,
    peft_config=lora_config,
    formatting_func=formatting_func,
)

We train for very few steps (this is very much a trial-and-error process: too few and you don't see a difference, too many and your model starts losing its prior knowledge).
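
Some quick arithmetic on the settings above, as a sketch: with gradient accumulation, each optimizer step sees `per_device_train_batch_size * gradient_accumulation_steps` examples (on a single device). The learning-rate function assumes a linear warmup followed by a linear decay to zero; to my knowledge this is the default schedule in `transformers`, but the original notebook does not say so, so treat it as an assumption.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
warmup_steps = 2
learning_rate = 2e-4

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch_size * max_steps
print(effective_batch_size, examples_seen)  # 4 40


def linear_schedule(step):
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, max_steps - step)
    return learning_rate * remaining / (max_steps - warmup_steps)


for step in range(max_steps + 1):
    print(step, f"{linear_schedule(step):.1e}")
```

In other words, this run only ever looks at 40 examples, which is why it is best seen as a smoke test rather than a full fine-tune.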

In [None]:
history = trainer.train()

The one thing to note is that the model now follows the format produced by `formatting_func`:

```
Quote: ...
Author: ...
```

In [None]:
text = "Quote: Imagination"
inputs = tokenizer(text, return_tensors="pt").to(device)

# the model starts being repetitive after that: changing the sampling
# method using a generation config could help, see the other notebook!
outputs = model.generate(**inputs, max_new_tokens=11)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
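
For a feel of what "changing the sampling method" means, here is a toy sketch of temperature plus top-k sampling over a made-up next-token distribution, written in plain Python. It is not how `generate` is implemented: the vocabulary and scores are invented, and in practice you would set options such as `do_sample`, `temperature` and `top_k` on the generation call instead.

```python
import math
import random


def sample_top_k(logits, k=3, temperature=1.0, rng=random):
    """Sample a token from the k highest-scoring entries of `logits`."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax over the kept scores, sharpened or flattened by the temperature.
    scaled = [score / temperature for _, score in top]
    biggest = max(scaled)
    weights = [math.exp(s - biggest) for s in scaled]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented scores for the token following "Quote: Imagination".
logits = {"is": 3.0, "the": 2.5, "and": 2.0, "of": 0.5, "zebra": -1.0}
rng = random.Random(0)  # fixed seed so the run is repeatable

greedy = max(logits, key=logits.get)  # greedy decoding always picks "is"
samples = [sample_top_k(logits, k=3, temperature=0.8, rng=rng) for _ in range(10)]
print(greedy, samples)
```

Greedy decoding returns the same token every time, which is one reason for repetitive output; sampling spreads the choice over the top few candidates.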

#### Example 2: Shakespeare again


See the dataset page [here](https://huggingface.co/datasets/karpathy/tiny_shakespeare).

I bring this dataset back because it's a typical example of the kind of work you might have to do when faced with a dataset that is not formatted exactly as you need.

Before training, our model shouldn't sound particularly Shakespearean (the larger the model, the more it'll be able to pick up on the prompt straight away).

In [None]:
text = "PETRUCHIO:"

inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
from datasets import load_dataset

shak_data = load_dataset("karpathy/tiny_shakespeare", trust_remote_code=True)

The main issue with this dataset is that it contains only one long text for each split. Not only will it not yield batches, but it will also exhaust our GPU memory!

In [None]:
shak_data

In [None]:
for b in shak_data["train"]:
    print(b.keys())
    print(len(b['text'][0]), len(b['text']))
    print(b)
    break

Here's how we would turn that into a better format:

In [None]:
# with ChatGPT 4o help
def chunk_text(example, chunk_size=250):
    text = example["text"]
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return {"text": chunks}  # Returns a list of chunks

def flatten_chunks(batch):
    # With `batched=True`, `batch["text"]` is a list with one entry per row,
    # and each entry is that row's list of chunks. Returning one flat list of
    # chunks makes every chunk its own row in the resulting dataset.
    return {"text": [chunk for chunks in batch["text"] for chunk in chunks]}

chunked_shak_data = shak_data.map(chunk_text, batched=False)
chunked_shak_data = chunked_shak_data.map(flatten_chunks, batched=True)
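
To see what these two `map` calls do without downloading anything, here is the same logic applied by hand to a toy string. This is only a sketch: the plain dicts stand in for the rows and batches that `datasets` passes to the functions, and the tiny `chunk_size` is chosen so the result is easy to read.

```python
def chunk_text(example, chunk_size=250):
    """Split one row's text into fixed-size pieces."""
    text = example["text"]
    return {"text": [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]}


def flatten_chunks(batch):
    """Turn a batch of per-row chunk lists into one flat list of chunks."""
    return {"text": [chunk for chunks in batch["text"] for chunk in chunks]}


# One row holding one long string, like the original dataset.
row = {"text": "abcdefghij"}
chunked_row = chunk_text(row, chunk_size=4)
print(chunked_row)  # {'text': ['abcd', 'efgh', 'ij']}

# With batched=True, each column arrives as a list with one entry per row.
batch = {"text": [chunked_row["text"]]}
flat = flatten_chunks(batch)
print(flat)         # {'text': ['abcd', 'efgh', 'ij']}
print(len(batch["text"]), "row in,", len(flat["text"]), "rows out")
```

Note that fixed-size chunks cut words and lines wherever the boundary happens to fall; splitting on line breaks would be a reasonable alternative.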

In [None]:
chunked_shak_data # num_rows is what we want

In [None]:
for b in chunked_shak_data["train"]:
    print(b.keys())
    print(len(b['text']))
    print(b)
    break

In [None]:
def formatting_func(example):
    # this time we don't really need a template (could be fun to play with!)
    return example['text']

training_args = SFTConfig(
    max_seq_length=8192, # https://huggingface.co/docs/transformers/en/model_doc/gemma#transformers.GemmaConfig
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="shak_outputs",
    optim="paged_adamw_8bit",
    report_to=["none"]
)

trainer = SFTTrainer(
    model=model,
    train_dataset=chunked_shak_data["train"],
    args=training_args,
    peft_config=lora_config,
    formatting_func=formatting_func,
)

In [None]:
history = trainer.train()

In [None]:
text = "PETRUCHIO:"

inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Saving your model

There are two ways of saving your model. You can use the `trainer` (this won't save the weights of the base model, since it's assumed that you will download those again from the HF Hub if needed):

```python
MODEL_DIR = "jchwenger/gemma-2b-gpu-int4-lora-minshakespeare"
trainer.save_model(MODEL_DIR)
```

Or you can save the individual components (entire model):
```python
model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)
lora_config.save_pretrained(MODEL_DIR)
```