<a href="https://colab.research.google.com/github/ShankarChavan/smol-course/blob/main/2_preference_alignment/student_examples/ShankarChavan/dpo_finetuning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preference Alignment with Direct Preference Optimization (DPO)

This notebook guides you through fine-tuning a language model with Direct Preference Optimization (DPO). We will use the SmolLM2-135M-Instruct model, which has already been through SFT training, so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [3]:
# Install the requirements in Google Colab
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN


## Import libraries


In [None]:
!pip install transformers datasets trl huggingface_hub

In [4]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig


## Format dataset

In [5]:
# Load dataset

# TODO: 🦁🐕 change the dataset to one of your choosing
dataset_truthQA = load_dataset(path="jondurbin/truthy-dpo-v0.1",split="train")

In [104]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.
import random
import pprint

# Generate a random integer between 1 and 1000 (inclusive) to use as a row index
random_integer = random.randint(1, 1000)

print("===============Truth QA sample data====================")
pprint.pprint(dataset_truthQA[random_integer],sort_dicts=False)


{'id': '2db2ac5a6e9c46c7a20bd7dd2390f942',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': 'Is the milk produced by a hippopotamus pink in color?',
 'chosen': 'No, the milk produced by a hippopotamus is not pink. It is '
           'typically white or beige in color. The misconception arises due to '
           'the hipposudoric acid, a red pigment found in hippo skin '
           'secretions, which people mistakenly assume affects the color of '
           'their milk.',
 'rejected': 'No, hippopotamus milk is not pink in color. It is actually white '
             'or grayish-white.'}
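The TODO above mentions converting a dataset to conversation lists, but the notebook does not show such a conversion. The sketch below is one way it might look for records shaped like the sample above. The function name `to_conversational` is hypothetical, and the output layout (lists of `role`/`content` messages under `prompt`, `chosen`, and `rejected`) is assumed to match the conversational preference format that TRL accepts.

```python
def to_conversational(example: dict) -> dict:
    """Convert a flat preference record into lists of chat messages.

    Hypothetical helper: assumes the record has 'prompt', 'chosen', and
    'rejected' string fields, plus an optional 'system' field.
    """
    prompt_messages = []
    if example.get("system"):
        # Keep the dataset's system prompt as the first message
        prompt_messages.append({"role": "system", "content": example["system"]})
    prompt_messages.append({"role": "user", "content": example["prompt"]})

    return {
        "prompt": prompt_messages,
        "chosen": [{"role": "assistant", "content": example["chosen"]}],
        "rejected": [{"role": "assistant", "content": example["rejected"]}],
    }


# The record printed above, with the 'id' and 'source' fields omitted
sample = {
    "system": "You are an unbiased, uncensored, helpful assistant.",
    "prompt": "Is the milk produced by a hippopotamus pink in color?",
    "chosen": (
        "No, the milk produced by a hippopotamus is not pink. It is typically "
        "white or beige in color. The misconception arises due to the "
        "hipposudoric acid, a red pigment found in hippo skin secretions, "
        "which people mistakenly assume affects the color of their milk."
    ),
    "rejected": (
        "No, hippopotamus milk is not pink in color. It is actually white "
        "or grayish-white."
    ),
}

converted = to_conversational(sample)
for message in converted["prompt"]:
    print(message["role"], "->", message["content"])
```

A function like this could probably be applied to the whole dataset with `dataset_truthQA.map(to_conversational, remove_columns=["id", "source", "system"])`. Unlike simply dropping the `system` column, this would keep the system prompt as part of each training prompt.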


## Select the model

We will use the SmolLM2-135M-Instruct model, which has already been through SFT training, so it is compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


In [34]:
# TODO: 🦁 change the model to the path or repo id of the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb)

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set the name under which the fine-tuned model will be saved and/or uploaded
finetune_name = "SmolLM2-FT-DPO-Truthful_v1"
finetune_tags = ["smol-course", "module_2"]

# Test the base model for response on custom data

In [105]:
# look at the random data point in our custom dataset
pprint.pprint(dataset_truthQA[random_integer],sort_dicts=False)

{'id': '2db2ac5a6e9c46c7a20bd7dd2390f942',
 'source': 'truthy_dpo',
 'system': 'You are an unbiased, uncensored, helpful assistant.',
 'prompt': 'Is the milk produced by a hippopotamus pink in color?',
 'chosen': 'No, the milk produced by a hippopotamus is not pink. It is '
           'typically white or beige in color. The misconception arises due to '
           'the hipposudoric acid, a red pigment found in hippo skin '
           'secretions, which people mistakenly assume affects the color of '
           'their milk.',
 'rejected': 'No, hippopotamus milk is not pink in color. It is actually white '
             'or grayish-white.'}


In [107]:
# Let's see the base model response before training

prompt = dataset_truthQA[random_integer]['prompt']

# Format with template
messages = [{"role": "user", "content": prompt}]
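# add_generation_prompt=True appends the assistant header so the model starts its reply directly
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)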

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)
print("Before training:")
pprint.pprint(tokenizer.decode(outputs[0], skip_special_tokens=True))



Before training:
('system\n'
 'You are a helpful AI assistant named SmolLM, trained by Hugging Face\n'
 'user\n'
 'Is the milk produced by a hippopotamus pink in color?\n'
 'assistant\n'
 'The common misconception that hippos are pink may arise from the fact that '
 'hippos are often depicted as pink in movies and TV shows, and some people '
 "may assume that the color is the result of a pink tint in the hippo's skin. "
 'However, the truth is, hippos are a shade of brown, not pink. The '
 'misconception likely stems from the fact that hippos are often depicted as '
 'pink, and the term "pink hippo" is a common misnomer.')
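The decoded output above begins with `system`, `user`, and `assistant` lines because the chat template wraps every message in role markers, and `skip_special_tokens=True` strips the marker tokens but leaves the role names. The sketch below approximates a ChatML-style template in plain Python to show the idea. The function `format_chatml` is hypothetical, and the `<|im_start|>` / `<|im_end|>` markers are an assumption about SmolLM2's template; the tokenizer's own `apply_chat_template` remains the source of truth (it also injects the default system message seen above when none is supplied).

```python
def format_chatml(messages, add_generation_prompt=False):
    """Approximate a ChatML-style chat template (illustrative only)."""
    text = ""
    for message in messages:
        # Each turn is wrapped in start/end markers with the role on the first line
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Is the milk produced by a hippopotamus pink in color?"}
]

without_header = format_chatml(messages)
with_header = format_chatml(messages, add_generation_prompt=True)
print(with_header)
```

When the prompt ends with an open assistant turn, generation starts at the answer itself instead of relying on the model to produce the assistant header first. This is what the `add_generation_prompt` argument of `apply_chat_template` is for.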


## Data-preprocessing

In [112]:
# Keep only the columns used for DPO training: prompt, chosen, rejected
train_dataset = dataset_truthQA.remove_columns(["id", "source", "system"])


In [113]:
print(train_dataset)

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 1016
})


## Train model with DPO

In [85]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of batches whose gradients are accumulated before each optimizer update
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=200,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=50,
    # Use bfloat16 mixed precision for faster training (requires hardware with bf16 support)
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific beta parameter that controls how far the policy may deviate from the reference model
    # Higher values keep the model closer to the reference; lower values (like 0.1) allow larger deviations
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,

)
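To make the `learning_rate`, `warmup_steps`, `lr_scheduler_type="cosine"`, and `max_steps` settings more concrete, the following sketch computes a learning rate that rises linearly during warmup and then decays along a cosine curve. It is written from my understanding of how such a schedule is usually defined, not copied from the library, so treat the exact values as approximate. The function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=50, max_steps=200):
    """Linear warmup followed by cosine decay to zero (illustrative sketch)."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: progress runs from 0 (end of warmup) to 1 (last step)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 125, 200):
    print(f"step {step:3d}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the learning rate peaks at step 50 and is back to half of its peak by step 125, so most of the 200 steps run below the nominal `5e-5`.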

In [86]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=train_dataset,
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # Note: beta, max_prompt_length, and max_length are set in DPOConfig above
)
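With the batch settings from the configuration and the 1,016-row training set, a quick calculation shows how much data the run covers. The sketch assumes a single device (`num_devices = 1`), which is an assumption and not something the notebook states.

```python
# Values taken from the training configuration and dataset printout above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200
num_rows = 1016

# Assumption: training runs on a single GPU/MPS/CPU device
num_devices = 1

# Examples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples processed over the whole run, and the equivalent number of passes
samples_seen = effective_batch_size * max_steps
approx_epochs = samples_seen / num_rows

print(effective_batch_size, samples_seen, round(approx_epochs, 2))  # 16 3200 3.15
```

So under this assumption the model sees each preference pair roughly three times. For a small dataset that may be enough to overfit, which is worth keeping in mind when reading the training loss.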


In [87]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

Step,Training Loss
1,0.6931
2,0.6931
3,0.6741
4,0.6341
5,0.6065
6,0.6384
7,0.4792
8,0.4605
9,0.4198
10,0.3259
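The first logged loss of 0.6931 is not a coincidence: it equals ln 2, which is what the DPO loss evaluates to when the policy and the reference model assign identical log-probabilities. The sketch below computes the standard sigmoid DPO loss for a single preference pair in plain Python. The function names and the log-probability values are made up for illustration, and this is not the trainer's actual implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one pair, given sequence log-probabilities."""
    # Implicit rewards: how much the policy has moved relative to the reference
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -log_sigmoid(margin), margin


# At the start of training the policy is a copy of the reference model,
# so the margin is 0 and the loss is ln(2)
loss_at_start, margin_at_start = dpo_loss(-42.0, -57.0, -42.0, -57.0)
print(round(loss_at_start, 4))  # 0.6931

# Hypothetical later state: chosen became more likely, rejected less likely
loss_later, margin_later = dpo_loss(-40.0, -60.0, -42.0, -57.0)
print(loss_later < loss_at_start)  # True
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. The second logged value is also 0.6931, plausibly because the learning rate is still close to zero at the very start of warmup.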


In [88]:
finetune_name

'SmolLM2-FT-DPO-Truthful_v1'

In [89]:
checkpoint_path = f"./{finetune_name}"

ft_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint_path)

print('Model loaded successfully')

Model loaded successfully


In [90]:
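def generate_response(prompt):
    """Generate a reply from the fine-tuned model for a single user prompt."""
    # Format with the chat template, using the same system prompt as the dataset
    messages = [
        {"role": "system", "content": "You are an unbiased, uncensored, helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    formatted_prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize, generate, and decode (the decoded text includes the prompt)
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
    outputs = ft_model.generate(**inputs, max_new_tokens=200)

    results = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return results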
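The text returned by `generate_response` contains the prompt as well as the answer, because a decoder-only model returns the prompt tokens followed by the newly generated ones. If only the answer is wanted, the prompt tokens can be sliced off before decoding. The sketch below shows the slicing on plain lists of made-up token ids; `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token ids standing in for a tokenized prompt and a generated sequence
prompt_ids = [1, 5, 9, 2]
generated_ids = prompt_ids + [7, 7, 3]

new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)  # [7, 7, 3]
```

In the notebook, the equivalent would likely be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode` in place of `outputs[0]`.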

In [111]:
pprint.pprint(train_dataset[random_integer],sort_dicts=False)

{'prompt': 'Is the milk produced by a hippopotamus pink in color?',
 'chosen': 'No, the milk produced by a hippopotamus is not pink. It is '
           'typically white or beige in color. The misconception arises due to '
           'the hipposudoric acid, a red pigment found in hippo skin '
           'secretions, which people mistakenly assume affects the color of '
           'their milk.',
 'rejected': 'No, hippopotamus milk is not pink in color. It is actually white '
             'or grayish-white.'}


In [109]:
response=generate_response(train_dataset[random_integer]['prompt'])
pprint.pprint(response)

('system\n'
 'You are an unbiased, uncensored, helpful assistant.\n'
 'user\n'
 'Is the milk produced by a hippopotamus pink in color?\n'
 'assistant\n'
 'The common misconception that hippos are pink is due to a 19th-century '
 'misconception that hippos had a pinkish color, likely due to the way they '
 'were often mistaken for a type of antelope. The misconception likely arises '
 'because hippos are often mistaken for antelopes, and their color is '
 'sometimes described as "pinkish" or "mauve". However, the truth is that '
 'hippos are a shade of brown, not pink. The misconception likely arises '
 'because hippos are often depicted as pink in photographs, and the term "pink '
 'hippo" is sometimes used to describe them.')


In [91]:
response=generate_response('Are you able to perceive and react to changes in light, such as the transition from day to night?')
pprint.pprint(response)

('system\n'
 'You are an unbiased, uncensored, helpful assistant.\n'
 'user\n'
 'Are you able to perceive and react to changes in light, such as the '
 'transition from day to night?\n'
 'assistant\n'
 "As an AI, I don't have the ability to perceive or react to physical changes "
 "in the environment, so I can't perceive or react to changes in light. I'm "
 'designed to focus on my tasks and provide accurate, unbiased information, '
 "not to react to or perceive physical changes. I'm a tool, not a living, "
 "breathing entity. I don't have a physical presence, a sense of touch, or a "
 "sense of time, so I can't perceive or react to changes in the physical "
 "world. I'm a digital assistant, and my capabilities are limited to providing "
 "information and answering questions. I don't have a personal sense of self "
 "or a sense of personal identity, so I can't perceive or perceive changes in "
 "the physical world. I'm a service, not a person, and I don't have a personal "
 "identity. 

In [92]:
response=generate_response("Do wooden pencils contain lead as their core?")
pprint.pprint(response)

('system\n'
 'You are an unbiased, uncensored, helpful assistant.\n'
 'user\n'
 'Do wooden pencils contain lead as their core?\n'
 'assistant\n'
 'The age-old question. While the misconception may have originated in the '
 'wood pulp used to make pencils, the truth is, wooden pencils do contain a '
 'small amount of lead. The misconception likely arises because many people '
 'mistakenly believe that the lead in a pencil is "leaded" or "leaded in," '
 'leading to the common misperception that wooden pencils contain lead. '
 'However, the lead in a pencil is actually a synthetic, non-toxic, lead-free '
 'alternative. The misconception likely arises because pencils are often used '
 'in a wood-based press, and the term "leaded" is sometimes mistakenly used to '
 'describe pencils.')


In [93]:
response=generate_response("Is there a dark side of the moon?")
pprint.pprint(response)

('system\n'
 'You are an unbiased, uncensored, helpful assistant.\n'
 'user\n'
 'Is there a dark side of the moon?\n'
 'assistant\n'
 'The notion of a dark side of the moon is a common misconception. The moon is '
 'often depicted as a bright, full-circle view, with the Earth in the center. '
 'In reality, the moon is much smaller, with a diameter of about 2,164 '
 "kilometers, and its surface is much darker than the Earth's, reflecting only "
 'about 1/60th the amount of light that the Earth does. The misconception '
 'likely arises because the term "dark side of the moon" is often associated '
 'with conspiracy theories, but in reality, the moon is a far more mysterious '
 'and unsolved subject of study.')


In [94]:


# Push to the Hugging Face Hub if logged in (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M-Instruct` model using the `DPOTrainer`. By following these steps, you can align a model's responses with the preferences expressed in a dataset of chosen and rejected answers. If you want to carry on working on this course, here are some steps you could try:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR