[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/152s3ZkWHCjKxw8R1RuQg7AwzYecivnjL?usp=sharing)

# LLM Finetuning Labs

## LLM Finetuning Introduction

### In this lab you will...

Implement your (maybe) first fine-tuning yourself!



## Part 1: Fine-tuning a base model using LoRA

This is the do-it-yourself version of the introduction notebook. There are cells to fill in and questions to answer throughout the lab.

During this lab, you will need to refer to the Hugging Face documentation. If you get stuck at some point, or if you want more insight into what you are doing, take a look at the solution!

### 1.1. Preparing the environment

Nothing to see here, just the imports and a few basic helper functions.

In [None]:
!pip install bitsandbytes datasets peft trl

In [None]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

import wandb

wandb.init(mode='disabled')

In [37]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    Utility function to see how efficient LoRA is.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

### 1.2. Loading the base model

Here, you need to write a quantization config so that the model loads while saving some VRAM.
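
To get a feel for what quantization buys, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. It is only a sketch: the count of 0.5 billion parameters is a nominal figure taken from the model name, and the estimate ignores activations, optimizer state, quantization constants, and the fact that some layers may stay in higher precision.

```python
# Rough memory footprint of model weights at different precisions.
# Assumption: 0.5e9 parameters (nominal size suggested by the model name).
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 0.5e9
for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>5}: {weight_memory_gb(n_params, bits):.2f} GB")
```

Under these assumptions, going from 32-bit to 4-bit weights divides the weight storage by eight. This does not tell you which options to pass to `BitsAndBytesConfig`; that part is still yours to look up.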

In [38]:
# Very small LLM, you can try other models from huggingface
model_repo_id = "Qwen/Qwen2.5-0.5B"


bnb_config = BitsAndBytesConfig(
    # 
    # Fill this
    #
)

model = AutoModelForCausalLM.from_pretrained(
    model_repo_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
tokenizer.pad_token = tokenizer.eos_token

### 1.3. Preparing the base model for training

Here, you have to write the LoRA config.
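
Before filling in the config, it may help to see why LoRA trains so few parameters. The sketch below counts the parameters of a single linear layer with and without a low-rank adapter. The layer sizes and the rank are hypothetical values chosen for illustration, not those of the model used in this lab.

```python
# Count trainable parameters for one linear layer with and without LoRA.
# LoRA freezes the (d_out x d_in) weight W and learns a low-rank update B @ A,
# where A has shape (r, d_in) and B has shape (d_out, r).


def full_params(d_in: int, d_out: int) -> int:
    """Parameters of the dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters of the two low-rank factors A and B."""
    return r * d_in + d_out * r


d_in, d_out, r = 1024, 1024, 8  # hypothetical sizes, for illustration only
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, r)
print(f"full: {full}, LoRA: {lora}, ratio: {100 * lora / full:.2f}%")
```

With these toy numbers the adapter has 16,384 parameters against 1,048,576 for the dense weight, about 1.56%. In a real model the overall fraction is usually smaller still, since only some layers receive adapters while every parameter counts in the total.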

In [None]:
config = LoraConfig(
    #
    # Fill this
    #
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

__Question__: Why are we training only 0.17% of the total parameters? 

### 1.4. Testing the base model before training


Convert the user prompt into the format the tokenizer expects in the _apply_chat_template_ method.

In [None]:
user_input = 'What equipment do I need for rock climbing?'

prompt = [
    {
        #
        # Fill this
        #
    }
]


prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
print(prompt)
model = model.to('cuda')


generation_config = model.generation_config
generation_config.do_sample = True
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

__Question__: What does the _top_p_ parameter mean in the above cell?
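
Once you have written down your answer, one way to check it is to experiment with a toy version of nucleus (top-p) filtering. The sketch below is a simplified pure-Python illustration, not the code used by `generate`; the vocabulary and probabilities are made up.

```python
# Toy nucleus (top-p) filter over a next-token distribution.
# The tokens and their probabilities are hypothetical.


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


probs = {"rope": 0.5, "harness": 0.25, "shoes": 0.125, "banana": 0.125}
print(top_p_filter(probs, top_p=0.7))
```

With `top_p=0.7`, only "rope" and "harness" survive (their cumulative probability is 0.75), and sampling then happens among those two tokens after renormalization. Try other values of `top_p` to see how the candidate set grows or shrinks.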

In [None]:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

__Question__: What can you say about the model's output?

### 1.5. Preparing the data

To improve on the output above, we need a dataset that pairs user queries with assistant responses.

We are using the _helpful-instructions_ dataset here.

In [None]:
data = load_dataset("HuggingFaceH4/helpful-instructions")
pd.DataFrame(data["train"])

We need to convert the whole dataset to a chat-like format for the tokenizer:

In [43]:
def generate_prompt(data_point):
    chat = [
        #
        # Fill this
        #
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data["train"].shuffle(seed=42).map(generate_and_tokenize_prompt)

### 1.6. Training the model

Now that the data is loaded and the model is ready to be trained, let's train it.

You can find documentation on how training works in the _transformers_ library [here](https://huggingface.co/docs/transformers/trainer).
Fill in the _Trainer_ arguments.
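
As a quick sanity check on the training arguments used in this section, the arithmetic below works out how many examples the optimizer actually sees. The values mirror `per_device_train_batch_size`, `gradient_accumulation_steps`, and `max_steps` from the lab; the single-GPU setting is an assumption.

```python
# How many examples does the optimizer see during training?
# Values mirror the training arguments of this section.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 200
num_devices = 1  # assumption: a single GPU

# One optimizer step accumulates gradients over this many examples
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
examples_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
```

So each optimizer step averages gradients over 4 examples, and 200 steps cover 800 examples. When `max_steps` is set to a positive value it takes precedence over `num_train_epochs`.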

In [None]:
OUTPUT_DIR = "experiments"

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200,   # overrides num_train_epochs; try more steps if you can
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = transformers.Trainer(
    #
    # Fill this
    #
)

model.config.use_cache = False
trainer.train()

We can see the loss going down overall during training. However, it is quite unstable. This is likely the combined effect of the cosine learning rate schedule, which can push the model out of local minima, and of the transformer architecture, [which is unstable to train](https://liyuanlucasliu.github.io/files/slides-transformer-clinic.pdf).
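
To make the learning rate schedule mentioned above concrete, here is a small sketch of linear warmup followed by cosine decay, using the values from the training arguments (`learning_rate=2e-4`, `max_steps=200`, `warmup_ratio=0.05`). It approximates what `lr_scheduler_type="cosine"` selects; the function name is hypothetical, and the exact rounding of the warmup length in _transformers_ may differ.

```python
import math


def lr_at_step(step: int, total_steps: int = 200, peak_lr: float = 2e-4,
               warmup_ratio: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    # Assumption: warmup length is the rounded fraction of total steps
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 105, 200):
    print(f"step {step:>3}: lr = {lr_at_step(step):.2e}")
```

In this sketch the learning rate climbs to its peak over the first 10 steps, is back to half the peak around step 105, and reaches zero at step 200.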

### 1.7. Testing the model

We can now test the result of the fine-tuning:

In [46]:
def generate_response(question: str) -> str:
    chat = [
        #
        # Fill this
        #
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    # Decode only the newly generated tokens so that the prompt (which may
    # itself contain the word "assistant") is not part of the returned text
    prompt_length = encoding.input_ids.shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

In [None]:
prompt = "What program can I use to edit video clips I took with my phone?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Do you know the reasons as to why people love coffee so much?"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

__Questions__: What can be said about the outputs now? In what respects did the model improve, and what limitations does it still have?

## Part 2: Further refining the model using DPO

In the first part, we obtained a decent model. It was not great and still had issues, but it was significantly better than the base model. Let's use DPO to refine it further.

### 2.1. Testing the model before DPO

We do yet another test to compare before and after DPO:

In [None]:
user_input = 'Can you taste this dish and tell me if it needs more spices?'

chat2 = [
    #
    # Fill this
    #
]
prompt_2 = tokenizer.apply_chat_template(chat2, tokenize=False, add_generation_prompt=True)
print(prompt_2)

device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### 2.2. Preparing the data

DPO requires a preference dataset: each entry contains the user input and two candidate model outputs, a preferred one and an unwanted one.

One such dataset is __CultriX/llama70B-dpo-dataset__, which we will use from here on. Let's load it and preprocess it:

In [None]:
data_dpo = load_dataset("CultriX/llama70B-dpo-dataset")
pd.DataFrame(data_dpo["train"])

In [None]:
def preprocess_data_dpo(data_point):
    chat = [
        #
        # Fill this
        #
    ]
    return {'prompt': tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True),
            'chosen': data_point['chosen'],
            'rejected': data_point['rejected']}

data_dpo = data_dpo['train'].shuffle(seed=42).map(preprocess_data_dpo)

In [None]:
print(data_dpo)
data_dpo[0]

We can see the structure of the dataset:
- _system_ is the instruction given to the model
- _question_ is the user-asked question
- _chosen_ is the target answer from the model
- _rejected_ is the answer we do not want
- _prompt_ is the column we just added, containing the data ready to be tokenized for training

### 2.3. Training

With the model from Part 1 and the data now prepared, let's train the model using LoRA and DPO. Fill in the first arguments of the _DPOTrainer_.
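
To see what the trainer optimizes, here is a minimal sketch of the standard (sigmoid) DPO loss for a single preference pair, written in plain Python. The function name and the log-probability values are hypothetical; in practice the trainer computes sequence log-probabilities from the policy and a frozen reference model over batches. The default `beta=0.1` mirrors the value used in this lab.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit rewards: how much more likely the policy makes each answer
    # compared with the reference model, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) = softplus(-margin), in a numerically stable form
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities, for illustration only
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy identical to the reference
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # policy favors the chosen answer more
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss drops below that value as the policy raises the chosen answer and lowers the rejected one relative to the reference, and `beta` controls how strongly those shifts are rewarded.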

In [None]:
OUTPUT_DIR = "experiments_dpo"

training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200, # overrides num_train_epochs; try more if you can
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    beta=0.1,  # DPO strength; set here because recent TRL versions no longer accept it in DPOTrainer
)

print(model.__dict__)

trainer = DPOTrainer(
    #
    # Fill this
    #
    # Data collator is not needed for DPOTrainer as it internally manages it
)

model.config.use_cache = False
trainer.train()

### 2.4. Testing the model after DPO

Let's test the new model:

In [54]:
def generate_response(question: str) -> str:
    chat = [
        #
        # Fill this
        #
    ]
    prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    # Decode only the newly generated tokens, so that no marker string has to
    # be searched for in the decoded text
    prompt_length = encoding.input_ids.shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

In [None]:
prompt = "Do people dream in color or black and white?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Explain the concept of economic policies in simple terms"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

prompt = "Explain the effects of globalization on the environment."
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

__Question__: What can you say about the results now? You can scroll up to compare the base model, the fine-tuned LoRA model, and the DPO model.

## Conclusion

With the right datasets and the right tools, even 0.5B models can generate very good answers. I hope you found this small introduction interesting and that you will stick around to see more!