<a href="https://colab.research.google.com/github/pratim808/smol-course/blob/main/2_preference_alignment/notebooks/dpo_finetuning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preference Alignment with Direct Preference Optimization (DPO)

This notebook guides you through fine-tuning a language model with Direct Preference Optimization (DPO). We will use the SmolLM2-135M-Instruct model, which has already been through SFT training, so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [None]:
# Install the requirements in Google Colab
!pip install -U transformers datasets trl huggingface_hub

In [1]:


# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN


## Import libraries


In [4]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

In [2]:
!pip install -U torch



In [9]:
!pip install -U unsloth

Collecting unsloth
  Downloading unsloth-2025.1.6-py3-none-any.whl.metadata (53 kB)
Collecting unsloth_zoo>=2025.1.4 (from unsloth)
  Downloading unsloth_zoo-2025.1.4-py3-none-any.whl.metadata (16 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.11-py3-none-any.whl.metadata (9.4 kB)
Collecting protobuf<4.0.0 (from unsloth)
  Downloading protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting hf_transfer (from unsloth)
  Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting cut_cross_entropy (from unsloth_zoo>=2025.1.4->unsloth)
 

## Format dataset

In [5]:

# TODO: 🦁 change the model to the path or repo id of the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb)

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
print(tokenizer.chat_template)

{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


In [4]:
print(tokenizer.chat_template)

{{- bos_token }}
{%- if custom_tools is defined %}
    {%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
    {%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
    {%- if strftime_now is defined %}
        {%- set date_string = strftime_now("%d %b %Y") %}
    {%- else %}
        {%- set date_string = "26 Jul 2024" %}
    {%- endif %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}

{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
    {%- set system_message = messages[0]['content']|trim %}
    {%- set messages = messages[1:] %}
{%- else %}
    {%- set system_message = "" %}
{%- endif %}

{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
    {{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- 

In [7]:
# Load dataset

# TODO: 🦁🐕 change the dataset to one of your choosing
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="test")

In [8]:
dataset

Dataset({
    features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
    num_rows: 1000
})

In [18]:
# Select the first 50 samples from the loaded split ('test')
subset = dataset.select(range(50))
subset

Dataset({
    features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
    num_rows: 50
})

In [31]:

subset['chosen'][0]


[{'content': 'As an HR manager, you want to test a potential employee\'s ability to solve puzzles to determine their suitability for a job. Write a Python script that generates a list of questions that require logical reasoning to answer. Your list should include questions related to mathematical puzzles, language puzzles, logic puzzles, lateral thinking puzzles, and pattern recognition puzzles. Use the following code as a starting point:\nquestions = {\n    "Mathematical puzzles": ["If the value of x+y = 20 and x-y = 10, what is the value of x and y?", "If a pizza has a radius of 8 inches and is cut into 6 equal slices, what is the area of each slice?"],\n    "Language puzzles": ["What word starts with \'e\' and ends with \'e\' but only contains one letter?", "I am taken from a mine, and shut up in a wooden case, from which I am never released, and yet I am used by almost every person. What am I?"],\n    "Logic puzzles": ["You have 3 boxes. One contains only apples, one contains only 
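Each `chosen` and `rejected` entry is a conversation, that is, a list of message dictionaries, and both conversations start with the same user prompt. The sketch below shows one way to separate the shared prompt from the two completions. It assumes every message has `role` and `content` keys; `split_prompt` is a hypothetical helper written for illustration, and the toy messages are made up rather than taken from the dataset.

```python
def split_prompt(chosen, rejected):
    """Split two conversations into (prompt, chosen_completion, rejected_completion).

    The prompt is the longest run of leading messages that both conversations share.
    """
    shared = 0
    for msg_chosen, msg_rejected in zip(chosen, rejected):
        if msg_chosen != msg_rejected:
            break
        shared += 1
    return chosen[:shared], chosen[shared:], rejected[shared:]


# Toy example in the same shape as the dataset rows (contents are made up)
chosen = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
rejected = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "It is 5."},
]

prompt, chosen_completion, rejected_completion = split_prompt(chosen, rejected)
print(prompt)
print(chosen_completion)
print(rejected_completion)
```

With a real row, `split_prompt(subset[0]["chosen"], subset[0]["rejected"])` should return the user question as the prompt and the two assistant answers as completions, provided the row follows the assumed layout.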

In [11]:
def format_function(example):
    # First attempt. `example["chosen"]` and `example["rejected"]` are already lists of
    # message dicts, so nesting them as the `content` of a single user message hands the
    # chat template a list where it expects a string. The next cell works around this.

    # Format 'chosen' text
    messages_chosen = [
        {"role": "user", "content": example["chosen"]}
    ]
    formatted_chosen = tokenizer.apply_chat_template(
        messages_chosen,
        tokenize=False,
        add_generation_prompt=False
    )

    # Format 'rejected' text
    messages_rejected = [
        {"role": "user", "content": example["rejected"]}
    ]
    formatted_rejected = tokenizer.apply_chat_template(
        messages_rejected,
        tokenize=False,
        add_generation_prompt=False
    )

    # Keep the original conversations alongside the formatted strings
    return {
        "formatted_chosen": formatted_chosen,
        "formatted_rejected": formatted_rejected,
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

In [21]:
# Method 2: convert each conversation list to a string before applying the chat template
def format_function(example):
    # Note: str() embeds the Python repr of the message list (brackets, quotes, escapes)
    # in a single user turn, as the outputs below show. It does not produce separate
    # user and assistant turns.

    # Format 'chosen' text
    messages_chosen = [
        {"role": "user", "content": str(example["chosen"])}  # convert list to string
    ]
    formatted_chosen = tokenizer.apply_chat_template(
        messages_chosen,
        tokenize=False,
        add_generation_prompt=False
    )

    # Format 'rejected' text
    messages_rejected = [
        {"role": "user", "content": str(example["rejected"])}  # convert list to string
    ]
    formatted_rejected = tokenizer.apply_chat_template(
        messages_rejected,
        tokenize=False,
        add_generation_prompt=False
    )

    # Return only the formatted strings
    return {
        "formatted_chosen": formatted_chosen,
        "formatted_rejected": formatted_rejected,
    }

In [23]:
# save columns
original_columns=subset.column_names


In [24]:
original_columns

['chosen', 'rejected', 'score_chosen', 'score_rejected']

In [25]:
formatted_dataset = subset.map(
    function=format_function,
    remove_columns=original_columns,
)


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [26]:
formatted_dataset[0]

{'formatted_chosen': '<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\n[{\'content\': \'As an HR manager, you want to test a potential employee\\\'s ability to solve puzzles to determine their suitability for a job. Write a Python script that generates a list of questions that require logical reasoning to answer. Your list should include questions related to mathematical puzzles, language puzzles, logic puzzles, lateral thinking puzzles, and pattern recognition puzzles. Use the following code as a starting point:\\nquestions = {\\n    "Mathematical puzzles": ["If the value of x+y = 20 and x-y = 10, what is the value of x and y?", "If a pizza has a radius of 8 inches and is cut into 6 equal slices, what is the area of each slice?"],\\n    "Language puzzles": ["What word starts with \\\'e\\\' and ends with \\\'e\\\' but only contains one letter?", "I am taken from a mine, and shut up in a wooden case, from which I am ne

In [17]:
formatted_dataset['formatted_chosen'][0]

'<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\n[{\'content\': \'As an HR manager, you want to test a potential employee\\\'s ability to solve puzzles to determine their suitability for a job. Write a Python script that generates a list of questions that require logical reasoning to answer. Your list should include questions related to mathematical puzzles, language puzzles, logic puzzles, lateral thinking puzzles, and pattern recognition puzzles. Use the following code as a starting point:\\nquestions = {\\n    "Mathematical puzzles": ["If the value of x+y = 20 and x-y = 10, what is the value of x and y?", "If a pizza has a radius of 8 inches and is cut into 6 equal slices, what is the area of each slice?"],\\n    "Language puzzles": ["What word starts with \\\'e\\\' and ends with \\\'e\\\' but only contains one letter?", "I am taken from a mine, and shut up in a wooden case, from which I am never released, and yet
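The output above wraps the whole conversation, Python brackets and quotes included, in one user turn. The sketch below contrasts that with rendering each message as its own turn. It is a plain-Python imitation of the ChatML template printed earlier, not the tokenizer itself; `render_chatml` is a hypothetical helper and the toy conversation is made up.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages, add_generation_prompt=False):
    """Imitate the ChatML chat template shown earlier in this notebook."""
    parts = []
    # The template adds a default system turn when the conversation has none
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Toy conversation in the same shape as a `chosen` entry (contents are made up)
conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# One turn per message: system, user, assistant
as_turns = render_chatml(conversation)
# What str() does: the whole list becomes the text of a single user turn
as_one_string = render_chatml([{"role": "user", "content": str(conversation)}])

print(as_turns)
print(as_one_string)
```

With the real tokenizer, the per-turn rendering corresponds to passing the message list directly, for example `tokenizer.apply_chat_template(example["chosen"], tokenize=False)`.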

In [None]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.

## Select the model

We will use the SmolLM2-135M-Instruct model, which has already been through SFT training, so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


## Train model with DPO

In [28]:
from unsloth import is_bfloat16_supported
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of steps to accumulate gradients over before performing an optimizer update
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=2,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    #bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    #hub_model_id=finetune_name,
    # DPO-specific parameter that controls how far the model may deviate from the reference model
    # Higher values keep the model closer to the reference; lower values (like 0.1) allow more deviation
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,

)
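A quick back-of-the-envelope check helps when reading these settings. The sketch below works out the effective batch size, how much data two steps cover, and the learning rate during warmup. It assumes a single device, a 1,000-row training set, and a linear warmup from zero; the exact schedule used by the installed library version may differ, and `warmup_lr` is a hypothetical helper.

```python
# Values copied from the DPOConfig above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 2
warmup_steps = 100
learning_rate = 5e-5

# Assumption: training on 1,000 rows with a single device
num_rows = 1000

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch_size * max_steps
epoch_fraction = examples_seen / num_rows


def warmup_lr(completed_updates):
    """Learning rate under a linear warmup from zero (assumed schedule shape)."""
    return learning_rate * min(1.0, completed_updates / warmup_steps)


print("effective batch size:", effective_batch_size)
print("examples seen:", examples_seen)
print("fraction of an epoch:", epoch_fraction)
for update in range(max_steps):
    print(f"learning rate for update {update + 1}:", warmup_lr(update))
```

Under these assumptions, the run ends long before the 100 warmup steps finish, so the learning rate stays at a small fraction of `learning_rate`. For a real training run, raise `max_steps` well above `warmup_steps` or lower `warmup_steps`.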

In [29]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    # (the full 1,000-row split loaded earlier, not the 50-row subset)
    train_dataset=dataset,
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # beta, max_prompt_length and max_length are already set in DPOConfig above
)

Applying chat template to train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [30]:
# Train the model
trainer.train()

Step,Training Loss
1,2.7726
2,2.7726


TrainOutput(global_step=2, training_loss=2.7725887298583984, metrics={'train_runtime': 38.9365, 'train_samples_per_second': 0.822, 'train_steps_per_second': 0.051, 'total_flos': 0.0, 'train_loss': 2.7725887298583984, 'epoch': 0.032})
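The logged loss of 2.7726 is easier to interpret next to the DPO objective. The sketch below computes the per-example DPO loss from summed log-probabilities, using made-up numbers. When the policy still equals the reference model, the loss is ln 2 ≈ 0.6931 regardless of the inputs. The logged value matches 4 × ln 2, which is consistent with the loss being summed over the 4 gradient accumulation steps, although that reading is an assumption and not something the output confirms.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: both log-ratios are zero (log-probs are made up)
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
print(round(initial, 4))      # 0.6931
print(round(4 * initial, 4))  # 2.7726

# Policy that raises the chosen log-prob and lowers the rejected one
improved = dpo_loss(-11.0, -16.0, -12.0, -15.0)
print(improved < initial)     # True
```

A loss that stays at this starting value across steps suggests the policy has not yet moved away from the reference model.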

In [None]:
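# Name and tags for the saved model (example values; change them to suit your project)
finetune_name = "SmolLM2-FT-DPO"
finetune_tags = ["smol-course", "module_2"]

# Save the model trained in the previous cell
trainer.save_model(f"./{finetune_name}")

# Push to the Hugging Face Hub if HF_TOKEN is set
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)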

## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M-Instruct` model using the `DPOTrainer`. By following these steps, you can align the model with the preferences expressed in a dataset of chosen and rejected responses. If you want to carry on working on this course, here are some steps you could try:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR.