# Unsloth: Reinforcement Learning (DPO/ORPO)

This notebook demonstrates preference optimization with Unsloth. The cells below run **Direct Preference Optimization (DPO)**; **Odds Ratio Preference Optimization (ORPO)** trains on the same kind of preference data.

Both techniques align a model with human preferences by training on preference pairs: for each prompt, a preferred (`chosen`) response and a dispreferred (`rejected`) response.
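To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the standard DPO formulation, where the implicit reward of a response is `beta` times the difference between the policy's and the reference model's log-probability. The function name `dpo_loss` and the numeric inputs are illustrative only and are not part of Unsloth or TRL.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss and implicit rewards from summed sequence log-probs."""
    # Implicit rewards: how far the policy has moved from the reference
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward

# Policy identical to the reference (illustrative log-probs): rewards are zero
start_loss, _, _ = dpo_loss(-445.6, -272.4, -445.6, -272.4)
print(round(start_loss, 4))  # 0.6931, i.e. ln(2)

# Policy shifted toward the chosen response (hypothetical values)
later_loss, chosen_reward, rejected_reward = dpo_loss(-440.0, -280.0, -445.6, -272.4)
print(later_loss < start_loss)  # True
```

Under this formulation, a policy that has not yet diverged from the reference gives a loss of ln(2) ≈ 0.6931, which is the value logged for the first training steps further down.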

In [1]:
%%capture
!pip install unsloth
!pip install --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
import torch

# Patch DPO Trainer to make it work with Unsloth
PatchDPOTrainer()

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit", # Starting with an SFT model is best for DPO
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.4: Fast Mistral patching. Transformers: 4.57.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### LoRA Configuration
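We attach LoRA adapters to the attention and MLP projection layers, so only the adapter weights are trained while the 4-bit base model stays frozen.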

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth 2025.11.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
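The adapter size can be estimated by hand. The sketch below counts LoRA parameters for the seven target modules, assuming the usual Mistral-7B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers). Those shapes are an assumption about the base architecture and are not stated in this notebook; `lora_params` is an illustrative helper.

```python
# Assumed Mistral-7B shapes (not stated in the notebook itself)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 dims per head
NUM_LAYERS = 32
R = 16             # LoRA rank, matching r = 16 above

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per layer: A (r x in) and B (out x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(per_layer)  # 1310720
print(total)      # 41943040
```

Under these assumptions the total matches the `Trainable parameters = 41,943,040` figure that the trainer banner reports later in this notebook.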


### Data Preparation
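We use `mlabonne/orpo-dpo-mix-40k`, a preference dataset with `chosen` and `rejected` columns. The trainer reads the prompt from a `prompt` column, so we populate it from the dataset's `question` column.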

In [5]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split = "train")

# DPO requires specific column names
def format_dpo(examples):
    return {
        "prompt": examples["question"],
        "chosen": examples["chosen"],
        "rejected": examples["rejected"],
    }
dataset = dataset.map(format_dpo, batched = True)

Map:   0%|          | 0/44245 [00:00<?, ? examples/s]
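With `batched = True`, the mapping function receives a dict of columns (each a list) rather than a single row, and returns columns of the same length. The sketch below replays `format_dpo` on a hand-written toy batch in plain Python, without the `datasets` library. The toy strings are hypothetical stand-ins and do not come from the real dataset.

```python
def format_dpo(examples):
    # Same mapping as the cell above: expose `question` under the name `prompt`
    return {
        "prompt": examples["question"],
        "chosen": examples["chosen"],
        "rejected": examples["rejected"],
    }

# Hypothetical two-row batch in column-oriented form
toy_batch = {
    "question": ["What is 2 + 2?", "Name a primary color."],
    "chosen": ["2 + 2 = 4.", "Red is a primary color."],
    "rejected": ["2 + 2 = 5.", "Beige is a primary color."],
}

formatted = format_dpo(toy_batch)

# Re-assemble rows to see one preference pair per prompt
pairs = list(zip(formatted["prompt"], formatted["chosen"], formatted["rejected"]))
for prompt, chosen, rejected in pairs:
    print(f"{prompt!r}: prefer {chosen!r} over {rejected!r}")
```

Because the function only relabels columns, each output list keeps the length and order of its input, which is what a batched map needs for rows to stay aligned.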

In [7]:
from trl import DPOTrainer, DPOConfig

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # With LoRA adapters, the frozen base weights serve as the reference model
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",

        # DPO-specific arguments, set on DPOConfig rather than on DPOTrainer:
        beta = 0.1,
        max_length = max_seq_length,
        max_prompt_length = 512,
    ),
)

Extracting prompt in train dataset (num_proc=16):   0%|          | 0/44245 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=16):   0%|          | 0/44245 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=16):   0%|          | 0/44245 [00:00<?, ? examples/s]

In [8]:
dpo_trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 44,245 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 7,283,675,136 (0.58% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -445.565857 | -272.393005 | -2.911054 | -3.035176 | 0 | 0 | 0 |
| 2 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -458.966309 | -368.121704 | -2.937612 | -2.941957 | No Log | No Log | No Log |
| 3 | 0.6872 | -0.005102 | -0.017382 | 0.625 | 0.012281 | -311.526428 | -375.335999 | -2.917813 | -2.946048 | No Log | No Log | No Log |
| 4 | 0.7048 | -0.022068 | -0.000301 | 0.375 | -0.021768 | -232.845245 | -339.992584 | -2.565366 | -2.532588 | No Log | No Log | No Log |
| 5 | 0.6898 | -0.020403 | -0.028507 | 0.625 | 0.008104 | -349.652557 | -277.897095 | -2.955345 | -3.026468 | No Log | No Log | No Log |
| 6 | 0.7149 | -0.031677 | 0.010317 | 0.125 | -0.041994 | -323.906799 | -353.354004 | -3.014825 | -2.967087 | No Log | No Log | No Log |
| 7 | 0.6906 | -0.024083 | -0.029488 | 0.625 | 0.005406 | -400.348358 | -308.025848 | -3.137349 | -3.151953 | No Log | No Log | No Log |
| 8 | 0.6853 | -0.001027 | -0.01832 | 0.625 | 0.017293 | -342.445221 | -301.72287 | -2.863499 | -2.776596 | No Log | No Log | No Log |
| 9 | 0.7089 | -0.035968 | -0.005283 | 0.25 | -0.030685 | -415.104675 | -315.279022 | -2.474803 | -2.733386 | No Log | No Log | No Log |
| 10 | 0.695 | -0.050528 | -0.048233 | 0.625 | -0.002296 | -425.65918 | -389.262451 | -2.896675 | -2.823846 | No Log | No Log | No Log |
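One way to read the reward columns: `rewards / margins` should equal `rewards / chosen` minus `rewards / rejected`, averaged over the batch. The sketch below checks that relationship against the logged values for steps 3 to 10, copied from the table above. The variable names and the tolerance are illustrative choices, not part of TRL.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins) copied from the log
rows = [
    (3, -0.005102, -0.017382, 0.012281),
    (4, -0.022068, -0.000301, -0.021768),
    (5, -0.020403, -0.028507, 0.008104),
    (6, -0.031677, 0.010317, -0.041994),
    (7, -0.024083, -0.029488, 0.005406),
    (8, -0.001027, -0.01832, 0.017293),
    (9, -0.035968, -0.005283, -0.030685),
    (10, -0.050528, -0.048233, -0.002296),
]

# Largest gap between the logged margin and chosen - rejected
max_gap = max(abs((chosen - rejected) - margin)
              for _, chosen, rejected, margin in rows)

# Steps where the batch-average margin favours the rejected response
negative_margin_steps = [step for step, _, _, margin in rows if margin < 0]

print(max_gap < 2e-6)          # True: differences are rounding noise only
print(negative_margin_steps)   # [4, 6, 9, 10]
```

This early in training, with a learning rate of 5e-6 and five warmup steps, the margins hover around zero and the loss stays close to ln(2) ≈ 0.6931.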


TrainOutput(global_step=60, training_loss=0.6634057144323985, metrics={'train_runtime': 306.1292, 'train_samples_per_second': 1.568, 'train_steps_per_second': 0.196, 'total_flos': 0.0, 'train_loss': 0.6634057144323985, 'epoch': 0.010848438276906387})
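The summary numbers can be cross-checked with a little arithmetic. The sketch below recomputes the effective batch size, the number of examples seen, the throughput, and the epoch fraction from values printed in this notebook. The epoch formula (optimizer steps measured against dataloader batches per epoch) is an assumption that happens to reproduce the logged value; it is not taken from the Trainer source.

```python
import math

# Values printed in the cells and logs above
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60
num_examples = 44_245
train_runtime = 306.1292  # seconds

effective_batch = per_device_batch * grad_accum * num_gpus   # 8
examples_seen = effective_batch * max_steps                  # 480

samples_per_second = examples_seen / train_runtime
steps_per_second = max_steps / train_runtime

# Assumption: an epoch is counted in dataloader batches of size 2,
# and each optimizer step consumes `grad_accum` of them
batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 22123
epoch_fraction = max_steps * grad_accum / batches_per_epoch

print(effective_batch, examples_seen)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(epoch_fraction)
```

In other words, 60 steps cover only about 1% of the dataset, so this run is a quick smoke test rather than a full alignment pass.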