# Preference Alignment with Direct Preference Optimization (DPO)

This notebook will guide you through the process of fine-tuning a language model using Direct Preference Optimization (DPO). We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [1]:
!pip install transformers datasets trl huggingface_hub


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.[0m[31m
[0mSuccessfully installed datasets-3.2.0 dill-0.3.8 fsspec-2024.9.0 multiprocess-0.70.16 trl-0.13.0 xxhash-3.5.0


In [2]:
# Install the requirements in Google Colab
#
# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Import libraries


In [3]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig


## Format dataset

In [5]:
# Load dataset

# TODO: 🦁🐕 change the dataset to one of your choosing
dataset = load_dataset(path="trl-lib/ultrafeedback_binarized", split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/643 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/131M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.14M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/62135 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [6]:
print(dataset[0])

{'chosen': [{'content': 'Use the pygame library to write a version of the classic game Snake, with a unique twist', 'role': 'user'}, {'content': "Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! Here's a basic outline of how we can approach this:\n\n1. First, we'll need to set up the game display and create a game object that we can use to handle the game's state.\n2. Next, we'll create the game's grid, which will be used to represent the game board. We'll need to define the size of the grid and the spaces within it.\n3. After that, we'll create the snake object, which will be used to represent the player's movement. We'll need to define the size of the snake and the speed at which it moves.\n4. We'll also need to create a food object, which will be used to represent the food that the player must collect to score points. We'll need to define the location of the food and the speed at which it moves.\n5. Once we have these objects set up,

In [7]:
print(f'total samples: {len(dataset)}')

total samples: 62135


In [16]:
test_idx = 0
print(f'\nexample {test_idx}')
print(f"User prompt: {dataset[test_idx]['chosen'][0]['content']}")
print(f"scores:\n -chosen: {dataset[test_idx]['score_chosen']}, \n - rejected: {dataset[test_idx]['score_rejected']}")



example 0
User prompt: Use the pygame library to write a version of the classic game Snake, with a unique twist
scores:
 -chosen: 6.0, 
 - rejected: 4.0


In [None]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.

## Select the model

We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


In [17]:
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

In [26]:
# config = AutoModelForCausalLM.from_pretrained(model_name)
# conig = model.config
# print("Model Architecture Details:")
# # print(f"Model type: {config.model_type}")
# # print(f"Number of parameters: {config.n_params / 1_000_000:.1f}M")
# print(f"Number of layers: {config.num_hidden_layers}")
# print(f"Hidden size: {config.hidden_size}")
# print(f"Number of attention heads: {config.num_attention_heads}")
# print(f"Vocab size: {config.vocab_size}")
# print(f"Maximum sequence length: {config.max_position_embeddings}")

In [18]:
device

'cuda'

In [21]:
# TODO: 🦁 change the model to the path or repo id of the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-DPO"
# finetune_tags = ["smol-course", "module_1"]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

In [29]:
import json
# Print generation config
print("\nGeneration Config:")
print(json.dumps(model.generation_config.__dict__, indent=2))


Generation Config:
{
  "max_length": 20,
  "max_new_tokens": null,
  "min_length": 0,
  "min_new_tokens": null,
  "early_stopping": false,
  "max_time": null,
  "stop_strings": null,
  "do_sample": false,
  "num_beams": 1,
  "num_beam_groups": 1,
  "penalty_alpha": null,
  "dola_layers": null,
  "use_cache": true,
  "cache_implementation": null,
  "cache_config": null,
  "return_legacy_cache": null,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "min_p": null,
  "typical_p": 1.0,
  "epsilon_cutoff": 0.0,
  "eta_cutoff": 0.0,
  "diversity_penalty": 0.0,
  "repetition_penalty": 1.0,
  "encoder_repetition_penalty": 1.0,
  "length_penalty": 1.0,
  "no_repeat_ngram_size": 0,
  "bad_words_ids": null,
  "force_words_ids": null,
  "renormalize_logits": false,
  "constraints": null,
  "forced_bos_token_id": null,
  "forced_eos_token_id": null,
  "remove_invalid_values": false,
  "exponential_decay_length_penalty": null,
  "suppress_tokens": null,
  "begin_suppress_tokens": null,
  "for

In [30]:
#  special tokens
print("\nSpecial Tokens Map:")
print(json.dumps(tokenizer.special_tokens_map, indent=2))


Special Tokens Map:
{
  "bos_token": "<|im_start|>",
  "eos_token": "<|im_end|>",
  "unk_token": "<|endoftext|>",
  "pad_token": "<|im_end|>",
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ]
}


In [32]:
# Print model config

# Function to make config JSON serializable
def clean_config_dict(config_dict):
    cleaned = {}
    for k, v in config_dict.items():
        # Convert any non-serializable types to strings
        if str(type(v).__name__) in ['dtype']:
            cleaned[k] = str(v)
        else:
            cleaned[k] = v
    return cleaned

print("\nModel Config:")
config_dict = clean_config_dict(model.config.__dict__)
print(json.dumps(config_dict, indent=2))



Model Config:
{
  "vocab_size": 49152,
  "max_position_embeddings": 8192,
  "hidden_size": 576,
  "intermediate_size": 1536,
  "num_hidden_layers": 30,
  "num_attention_heads": 9,
  "num_key_value_heads": 3,
  "hidden_act": "silu",
  "initializer_range": 0.041666666666666664,
  "rms_norm_eps": 1e-05,
  "pretraining_tp": 1,
  "use_cache": false,
  "rope_theta": 100000,
  "rope_scaling": null,
  "attention_bias": false,
  "attention_dropout": 0.0,
  "mlp_bias": false,
  "head_dim": 64,
  "return_dict": true,
  "output_hidden_states": false,
  "output_attentions": false,
  "torchscript": false,
  "torch_dtype": "torch.float32",
  "use_bfloat16": false,
  "tf_legacy_loss": false,
  "pruned_heads": {},
  "tie_word_embeddings": true,
  "chunk_size_feed_forward": 0,
  "is_encoder_decoder": false,
  "is_decoder": false,
  "cross_attention_hidden_size": null,
  "add_cross_attention": false,
  "tie_encoder_decoder": false,
  "max_length": 20,
  "min_length": 0,
  "do_sample": false,
  "early_st

## Train model with DPO

In [41]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of updates steps to accumulate before performing a backward/update pass
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=200,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="smol_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to=None,
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    beta = 0.1,
    max_prompt_length=1024,
    max_length=1536,
)

In [42]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset,
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # DPO-specific temperature parameter that controls the strength of the preference model
    # Lower values (like 0.1) make the model more conservative in following preferences
    # beta=0.1,
    # Maximum length of the input prompt in tokens
    # max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    # max_length=1536,
)

Extracting prompt from train dataset:   0%|          | 0/62135 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/62135 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/62135 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2379 > 2048). Running this sequence through the model will result in indexing errors
max_steps is given, it will override any value given in num_train_epochs


In [44]:
finetune_tags = ["smol-course", "module_1"]

In [56]:
%pwd

'/content'

In [59]:
# Verify file exists
print("File exists:", os.path.exists('.env'))

File exists: True


In [60]:
!pip install python-dotenv

Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [61]:
# Save to the huggingface hub if login (HF_TOKEN is set)
from dotenv import load_dotenv
load_dotenv()

# At the end, when saving to hub:
if os.getenv("HF_TOKEN"):
    print("Found HF_TOKEN, pushing to hub...")
    trainer.push_to_hub(tags=finetune_tags)
else:
    print("No HF_TOKEN found in environment variables")

Found HF_TOKEN, pushing to hub...


HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-676272c1-19e41e49542abe555cffddbe;f9695167-856f-4c30-a8e5-4dd499d03aff)

Invalid username or password.

In [43]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

# Save to the huggingface hub if login (HF_TOKEN is set)
if os.getenv("HF_TOKEN"):
    trainer.push_to_hub(tags=finetune_tags)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss
1,0.6931
2,0.6931
3,0.7047
4,0.6812
5,0.6779
6,0.6991
7,0.6925
8,0.6998
9,0.718
10,0.6876


## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `DPOTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleagues PR
- Improve the course material via an Issue or PR.