# Preference Alignment with Odds Ratio Preference Optimization (ORPO)

This notebook walks through fine-tuning a language model with Odds Ratio Preference Optimization (ORPO). We use the SmolLM2-135M base model, which has **not** been through SFT training. ORPO combines instruction tuning and preference alignment in a single training stage, so it can start from a base model, whereas DPO expects a model that has already been through SFT. This means you should not use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with ORPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Try on a subset of mlabonne's `orpo-dpo-mix-40k` dataset</p>
</div>



## Import libraries


In [4]:
# Install the requirements in Google Colab
!pip install transformers datasets trl huggingface_hub



In [2]:
!pip install --upgrade tensorflow transformers trl


In [3]:
import torch
import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

# Authenticate to Hugging Face
from huggingface_hub import login

login()


## Format dataset

In [7]:
# Load dataset
from datasets import Dataset
# TODO: 🦁🐕 change the dataset to one of your choosing
dataset = load_dataset(path="Danielbrdz/Barcenas-Medicina-DPO")

In [8]:
# Extract the individual data columns
questions = dataset["train"]["question"]
chosen_responses = dataset["train"]["chosen"]
rejected_responses = dataset["train"]["rejected"]

# Construct the desired structure
formatted_data = []

for i in range(len(questions)):
    # Build the chosen conversation
    chosen_conversation = [
        {'content': str(questions[i]), 'role': 'user'},
        {'content': str(chosen_responses[i]), 'role': 'assistant'}
    ]

    # Build the rejected conversation
    rejected_conversation = [
        {'content': str(questions[i]), 'role': 'user'},
        {'content': str(rejected_responses[i]), 'role': 'assistant'}
    ]

    # Append the conversations to the formatted data
    formatted_data.append({
        'question': questions[i],  # Keep 'question' here for now
        'chosen': chosen_conversation,
        'rejected': rejected_conversation
    })

# Convert the list of formatted records into a Hugging Face Dataset
dataset = Dataset.from_dict({
    'question': [item['question'] for item in formatted_data],
    'chosen': [item['chosen'] for item in formatted_data],
    'rejected': [item['rejected'] for item in formatted_data]
})

# Split the dataset into training and testing sets (10% for testing)
dataset = dataset.train_test_split(test_size=0.1)

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'chosen', 'rejected'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['question', 'chosen', 'rejected'],
        num_rows: 1000
    })
})
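The conversion loop above can also be written as a small per-record function, which is convenient to use with `Dataset.map`. The sketch below is one hedged way to do it: `to_conversational` is a hypothetical helper name, and it assumes the `question`, `chosen`, and `rejected` columns used in this notebook.

```python
def to_conversational(record):
    """Turn one {question, chosen, rejected} record into two chat-style conversations."""
    user_turn = {"content": str(record["question"]), "role": "user"}
    return {
        # Both conversations share the same user turn and differ only in the reply
        "chosen": [dict(user_turn), {"content": str(record["chosen"]), "role": "assistant"}],
        "rejected": [dict(user_turn), {"content": str(record["rejected"]), "role": "assistant"}],
    }


# Hypothetical record, used only to show the output shape
example = {"question": "Q?", "chosen": "good answer", "rejected": "bad answer"}
converted = to_conversational(example)
print(converted["chosen"])
print(converted["rejected"])

# With the raw dataset loaded, this could be applied as (not run here):
# raw["train"].map(to_conversational, remove_columns=["question", "chosen", "rejected"])
```

Unlike the loop, this variant drops the `question` column immediately, so keep the loop if you want to upload the dataset with that column included.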

## (OPTIONAL) Upload dataset to HF

In [80]:
# Upload to Hugging Face after Login
#dataset.push_to_hub("medicina-qa-binarized-dpo-orpo-es")

CommitInfo(commit_url='https://huggingface.co/datasets/daqc/medicina-qa-binarized-dpo-orpo-es/commit/1ecb0815dc6db24e38a1dba4999c2231ae31a2d5', commit_message='Upload dataset', commit_description='', oid='1ecb0815dc6db24e38a1dba4999c2231ae31a2d5', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/daqc/medicina-qa-binarized-dpo-orpo-es', endpoint='https://huggingface.co', repo_type='dataset', repo_id='daqc/medicina-qa-binarized-dpo-orpo-es'), pr_revision=None, pr_num=None)

In [10]:
# Remove the 'question' column after uploading the dataset
dataset = dataset.remove_columns(["question"])

# Continue ...

In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 1000
    })
})

In [12]:
# Example of the first record in the training set
dataset["train"][0]

{'chosen': [{'content': '¿Cuál es la prueba de imagen de elección para diagnosticar la apendicitis aguda?',
   'role': 'user'},
  {'content': 'La tomografía computarizada (TC)', 'role': 'assistant'}],
 'rejected': [{'content': '¿Cuál es la prueba de imagen de elección para diagnosticar la apendicitis aguda?',
   'role': 'user'},
  {'content': 'La ecografía abdominal', 'role': 'assistant'}]}
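Before training, it can be worth checking that every record has the shape shown above: the same user turn in both conversations, followed by two different assistant replies. The following is a minimal sketch of such a check; `is_valid_preference_pair` is a hypothetical helper, not part of `datasets` or TRL.

```python
def is_valid_preference_pair(record):
    """Return True if a record looks like a usable chosen/rejected pair."""
    chosen, rejected = record["chosen"], record["rejected"]
    # Each conversation needs at least a user turn and an assistant turn
    if len(chosen) < 2 or len(rejected) < 2:
        return False
    # Everything before the final reply (the prompt) must be identical
    if chosen[:-1] != rejected[:-1]:
        return False
    # Both conversations must end with an assistant reply
    if chosen[-1]["role"] != "assistant" or rejected[-1]["role"] != "assistant":
        return False
    # Identical replies carry no preference signal
    return chosen[-1]["content"] != rejected[-1]["content"]


# The first training record shown above
record = {
    "chosen": [
        {"content": "¿Cuál es la prueba de imagen de elección para diagnosticar la apendicitis aguda?", "role": "user"},
        {"content": "La tomografía computarizada (TC)", "role": "assistant"},
    ],
    "rejected": [
        {"content": "¿Cuál es la prueba de imagen de elección para diagnosticar la apendicitis aguda?", "role": "user"},
        {"content": "La ecografía abdominal", "role": "assistant"},
    ],
}
print(is_valid_preference_pair(record))

# On the full dataset this could be used as a filter (not run here):
# dataset = dataset.filter(is_valid_preference_pair)
```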

## Define the model

In [13]:
model_name = "HuggingFaceTB/SmolLM2-135M"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)
model, tokenizer = setup_chat_format(model, tokenizer)

# Set the name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-ORPO-Medicina-es"
finetune_tags = ["smol-course", "module_2"]
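`setup_chat_format` adds ChatML special tokens and a chat template to the tokenizer, and resizes the model's embeddings to match. As a rough illustration of what ChatML-formatted text looks like, the sketch below renders a conversation in plain Python. `render_chatml` is a hypothetical helper written for this example, not a TRL function, and the real template may differ in small details; `tokenizer.apply_chat_template` remains the authoritative way to format prompts.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {role, content} messages as ChatML-style text."""
    text = ""
    for message in messages:
        # Each turn is wrapped in <|im_start|> ... <|im_end|> markers
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "¿Qué es una contusión cerebral?"}]
print(render_chatml(messages, add_generation_prompt=True))
```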

## Wandb

In [None]:
! pip install -U wandb

In [15]:
import wandb
import os

wandb.login()

wandb_project = "SmolLM2-FT-ORPO-Medicina-es"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: da-qc. Use `wandb login --relogin` to force relogin


In [16]:
os.environ["WANDB_PROJECT"]

'SmolLM2-FT-ORPO-Medicina-es'

## Train model with ORPO

In [20]:
orpo_args = ORPOConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=8e-6,
    # Linear learning rate decay over training
    lr_scheduler_type="linear",
    # Maximum combined length of prompt + completion
    max_length=1024,
    # Maximum length for input prompts
    max_prompt_length=512,
    # Controls weight of the odds ratio loss (λ in paper)
    beta=0.1,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    # optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=1,
    # When to run evaluation
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=10,
    # Log metrics to Weights & Biases
    report_to="wandb",
    # Where to save model/checkpoints
    output_dir="./results/",
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
)
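The `beta` value in the config above weights the odds-ratio term that ORPO adds to the usual language-modeling loss on the chosen response. The sketch below shows that combination for a single example in plain Python. It is a simplified illustration under stated assumptions, not TRL's implementation: `orpo_loss` and its arguments are hypothetical names, and the inputs are assumed to be length-averaged log-probabilities (so they are negative).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp) and avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(nll_chosen, avg_logp_chosen, avg_logp_rejected, beta=0.1):
    """NLL on the chosen response plus beta times the odds-ratio penalty."""
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # The penalty shrinks as the chosen response becomes more likely than the rejected one
    odds_ratio_loss = -log_sigmoid(log_odds_ratio)
    return nll_chosen + beta * odds_ratio_loss


# Hypothetical values: the chosen response is more likely than the rejected one
print(orpo_loss(nll_chosen=2.0, avg_logp_chosen=-1.0, avg_logp_rejected=-2.0))
# Swapping the two log-probabilities increases the loss
print(orpo_loss(nll_chosen=2.0, avg_logp_chosen=-2.0, avg_logp_rejected=-1.0))
```

A larger `beta` pushes harder on separating chosen from rejected responses, at the cost of giving relatively less weight to the plain language-modeling term.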



In [21]:
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)




In [22]:
trainer.train()  # Train the model



| Step | Training Loss | Validation Loss | Runtime | Samples Per Second | Steps Per Second | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Nll Loss | Log Odds Ratio | Log Odds Chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 225 | 1.8185 | 2.046818 | 26.2345 | 38.118 | 19.059 | -0.253622 | -0.293812 | 0.632 | 0.040189 | -2.938116 | -2.536225 | 5.661318 | 4.559006 | 1.984098 | -0.627202 | 0.42737 |
| 450 | 1.7416 | 1.837111 | 26.3288 | 37.981 | 18.991 | -0.226067 | -0.295041 | 0.75 | 0.068973 | -2.950406 | -2.260675 | 4.547038 | 3.486407 | 1.784808 | -0.523035 | 0.745427 |
| 675 | 1.9434 | 1.734374 | 26.605 | 37.587 | 18.793 | -0.212011 | -0.294439 | 0.774 | 0.082427 | -2.944386 | -2.120115 | 3.536927 | 2.667738 | 1.686437 | -0.479373 | 0.898015 |
| 900 | 1.6827 | 1.681728 | 26.4231 | 37.846 | 18.923 | -0.205287 | -0.29502 | 0.793 | 0.089733 | -2.950195 | -2.052866 | 3.16854 | 2.321031 | 1.636002 | -0.45726 | 0.980784 |
| 1125 | 1.5922 | 1.665211 | 26.6423 | 37.534 | 18.767 | -0.203451 | -0.296073 | 0.8 | 0.092621 | -2.960725 | -2.034511 | 3.149195 | 2.307758 | 1.620492 | -0.447194 | 1.013845 |
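Several of the logged columns are related to one another. In the first evaluation row (step 225), the validation loss is consistent with `Nll Loss - beta * Log Odds Ratio`, and the reward columns are consistent with `beta` times the corresponding `Logps` columns. The check below reproduces this from the logged numbers with `beta=0.1`; it is an observation about this run's values rather than a documented guarantee of how every column is computed.

```python
beta = 0.1

# Values copied from the step-225 evaluation row
row = {
    "validation_loss": 2.046818,
    "nll_loss": 1.984098,
    "log_odds_ratio": -0.627202,
    "logps_chosen": -2.536225,
    "logps_rejected": -2.938116,
    "rewards_chosen": -0.253622,
    "rewards_margins": 0.040189,
}

# Loss = NLL term plus beta times the (positive) odds-ratio penalty
reconstructed_loss = row["nll_loss"] - beta * row["log_odds_ratio"]
# Rewards are the log-probabilities scaled by beta
reconstructed_reward_chosen = beta * row["logps_chosen"]
reconstructed_margin = beta * (row["logps_chosen"] - row["logps_rejected"])

print(round(reconstructed_loss, 6), row["validation_loss"])
print(round(reconstructed_reward_chosen, 6), row["rewards_chosen"])
print(round(reconstructed_margin, 6), row["rewards_margins"])
```

Over training, `Rewards/accuracies` rises from 0.632 to 0.8 and `Rewards/margins` grows, which indicates the model increasingly assigns higher likelihood to chosen responses than to rejected ones on the evaluation set.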


TrainOutput(global_step=1125, training_loss=1.93087341287401, metrics={'train_runtime': 926.821, 'train_samples_per_second': 9.711, 'train_steps_per_second': 1.214, 'total_flos': 0.0, 'train_loss': 1.93087341287401, 'epoch': 1.0})
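The step counts in the output follow from the training configuration. A short worked calculation, assuming a single device (an assumption; the notebook does not state the device count):

```python
import math

num_train_examples = 9000          # size of the training split
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                    # assumption: single-device run
eval_ratio = 0.2                   # eval_steps=0.2 is read as a fraction of total steps

# Examples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Optimizer updates in one epoch
total_steps = math.ceil(num_train_examples / effective_batch_size)
# Evaluation interval in steps
eval_every = math.ceil(total_steps * eval_ratio)

print(effective_batch_size, total_steps, eval_every)
```

This gives an effective batch size of 8, 1125 optimizer steps for the epoch (matching `global_step=1125`), and an evaluation every 225 steps (matching the step values 225, 450, 675, 900, and 1125 in the metrics).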

In [23]:
# Save the model
trainer.save_model(f"./{finetune_name}")

In [29]:
# Load the fine-tuned ORPO model and its tokenizer from the local save directory
orpo_model_path = "./SmolLM2-FT-ORPO-Medicina-es/"
orpo_model = AutoModelForCausalLM.from_pretrained(orpo_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(orpo_model_path)

# Test the fine-tuned model after training
prompt = "¿Qué es una contusión cerebral?"

# The prompt is passed as raw text here; no chat template is applied
formatted_prompt = prompt

# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

# Generate a sampled response
output = orpo_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=200,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7
)

# Decode the full sequence (prompt plus continuation)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)

¿Qué es una contusión cerebral?

Añadir un comentario

¿Qué es una contusión cerebral?

La contusión cerebral es una infección de cerebro que afecta la complicación cerebral, puede causar aumento de la inflamación cerebral, aumento de la cerebro dolorosa, aumento de la inflamación central, y la inmovilación del cuerpo. La contusión cerebral es una infección de cerebro que afecta la complicación cerebral, puede causar aumento de la inflamación cerebral, aumento de la cerebro dolorosa, aumento de la inflamación central, y la inmovilación del cuerpo. La contusión cerebral es una infección de cerebro que afecta la complicación cerebral, puede causar aumento de la
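The output above echoes the question and continues it like web text, which is consistent with passing the raw prompt rather than a chat-formatted one. Because the model was trained on conversations formatted through `setup_chat_format`, a plausible alternative is to format the prompt with the tokenizer's chat template before generating. This is a minimal sketch, assuming the saved tokenizer carries the chat template; `build_chat_prompt` is a hypothetical helper, and this variant was not run for this notebook.

```python
def build_chat_prompt(tokenizer, user_text):
    """Format a single user message with the tokenizer's chat template."""
    messages = [{"role": "user", "content": user_text}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text rather than token ids
        add_generation_prompt=True,  # open an assistant turn for the model to complete
    )


# Usage with the objects from the cell above (not executed here):
# formatted_prompt = build_chat_prompt(tokenizer, prompt)
# inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
```

The repetition in the sampled text is a separate issue; a 135M-parameter model trained for one epoch may still loop regardless of prompt formatting.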


In [24]:
# Push the model to the Hugging Face Hub (requires being logged in)
trainer.push_to_hub(tags=finetune_tags)

CommitInfo(commit_url='https://huggingface.co/daqc/SmolLM2-FT-ORPO-Medicina-es/commit/8e488b7eaf507f221ebfcb94b9c6f331e42510bb', commit_message='End of training', commit_description='', oid='8e488b7eaf507f221ebfcb94b9c6f331e42510bb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/daqc/SmolLM2-FT-ORPO-Medicina-es', endpoint='https://huggingface.co', repo_type='model', repo_id='daqc/SmolLM2-FT-ORPO-Medicina-es'), pr_revision=None, pr_num=None)

## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `ORPOTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR.

## 📥 Uploaded Model and Wandb Logs

- **🤖 Hugging Face Repository**: [SmolLM2-FT-ORPO-Medicina-es](https://huggingface.co/daqc/SmolLM2-FT-ORPO-Medicina-es)  
- **📊 Weights and Biases Run**: [Training Logs and Metrics](https://wandb.ai/da-qc/SmolLM2-FT-ORPO-Medicina-es/runs/7es46q99?nw=nwuserdaqc)  