To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the [installation instructions](https://docs.unsloth.ai/get-started/installing-+-updating) in our docs.

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start finetuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [2]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [5]:
# In a new Colab cell
import os

# ==============================================================================
# Step 1: DEFINE THE FILE AND FOLDER PATHS
# ==============================================================================
# The full path to the zip file you want to unzip.
zip_file_path = '/content/outputs.zip'

# This automatically creates the destination folder name by removing the '.zip' extension.
# For '/content/outputs.zip', this will become '/content/outputs'.
destination_folder = os.path.splitext(zip_file_path)[0]


# ==============================================================================
# Step 2: CHECK IF THE FILE EXISTS (SAFETY CHECK)
# ==============================================================================
if not os.path.exists(zip_file_path):
    print(f"❌ Error: The file '{zip_file_path}' was not found.")
    print("Please make sure the file exists and the path is correct.")
else:
    print(f"Found file: '{zip_file_path}'")

    # ==============================================================================
    # Step 3: CREATE THE DESTINATION FOLDER AND UNZIP
    # ==============================================================================
    # We use shell commands directly in Colab by starting the line with '!'

    # Create the destination directory. The '-p' flag prevents an error
    # if the folder already exists.
    print(f"Creating destination folder: '{destination_folder}'...")
    !mkdir -p "{destination_folder}"

    # Unzip the file into the destination folder.
    # The '-q' flag makes the output "quiet" (less verbose).
    # The '-d' flag specifies the destination directory.
    print(f"Unzipping file into '{destination_folder}'...")
    !unzip -q "{zip_file_path}" -d "{destination_folder}"

    print("\n✅ Unzipping complete.")

    # ==============================================================================
    # Step 4: VERIFY THE CONTENTS
    # ==============================================================================
    # List the files in the new directory to confirm they were extracted correctly.
    print(f"\nContents of the new '{destination_folder}' directory:")
    !ls -l "{destination_folder}"

    # ==============================================================================
    # Step 5: (Optional) CLEAN UP THE ORIGINAL .ZIP FILE
    # ==============================================================================
    # You can uncomment the lines below to automatically delete the .zip file
    # after extraction to save space on your Colab instance.

    # print(f"\nRemoving original zip file: '{zip_file_path}'...")
    # !rm "{zip_file_path}"
    # print("✅ Original zip file removed.")

Found file: '/content/outputs.zip'
Creating destination folder: '/content/outputs'...
Unzipping file into '/content/outputs'...

✅ Unzipping complete.

Contents of the new '/content/outputs' directory:
total 8
drwxr-xr-x 3 root root 4096 Aug 12 16:14 content
drwxr-xr-x 6 root root 4096 Aug 12 16:14 outputs
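
The same extraction can be done without shell commands. The sketch below uses Python's standard `zipfile` module. The helper name `extract_zip` and the demo file names are illustrative assumptions and are not part of the notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, destination=None):
    """Extract zip_path and return (destination, sorted member names)."""
    zip_path = Path(zip_path)
    # Default destination: the archive path without its '.zip' suffix.
    destination = Path(destination) if destination else zip_path.with_suffix("")
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return destination, sorted(archive.namelist())


# Demo on a throwaway archive (hypothetical file names, not the notebook's data).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "outputs.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("outputs/checkpoint-1/adapter_config.json", "{}")
    demo_dest, demo_names = extract_zip(demo_zip)
    demo_dest_name = demo_dest.name
    demo_extracted = (demo_dest / "outputs" / "checkpoint-1" / "adapter_config.json").exists()
    print(demo_dest_name, demo_names)
```

If the archive itself contains a top-level `outputs/` folder, extracting into `/content/outputs` produces the nested `/content/outputs/outputs/...` layout seen in the listing above.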


In [1]:
# ==============================================================================
# Step 2: LOAD YOUR SFT-TUNED MODEL (THIS IS THE BASE FOR DPO)
# ==============================================================================
import torch
from unsloth import FastVisionModel # Use FastVisionModel for image models
from peft import PeftModel

print("Step 1: Loading your SFT-tuned model...")

# --- Make sure these paths are correct ---
base_model_id = "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit"
# This should be the directory where your best SFT model is saved.
sft_adapter_path = "/content/outputs/outputs/checkpoint-46"

# Load the base model and processor using Unsloth's optimized class
# We use FastVisionModel because your model handles images.
base_model, processor = FastVisionModel.from_pretrained(
    model_name=base_model_id,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
tokenizer = processor.tokenizer
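# Apply your SFT adapters. This is the model we will now refine with DPO.
# is_trainable=True keeps the LoRA weights unfrozen; PEFT loads adapters frozen (inference mode) by default.
model = PeftModel.from_pretrained(base_model, sft_adapter_path, is_trainable=True)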
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✅ Your SFT model is loaded and ready for DPO training on {device}.")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Step 1: Loading your SFT-tuned model...
==((====))==  Unsloth 2025.8.4: Fast Qwen2 patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


✅ Your SFT model is loaded and ready for DPO training on cuda.




In [None]:
# @title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example,
    tokenizer,
    task: Literal["sft", "generation", "rm", "dpo"] = "sft",
    assistant_prefix="<|assistant|>\n",
):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if task in ["sft", "generation"]:
        messages = example["messages"]
        # We add an empty system message if there is none
        if messages[0]["role"] != "system":
            messages.insert(0, {"role": "system", "content": ""})
        example["text"] = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True if task == "generation" else False,
        )
    elif task == "rm":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            chosen_messages = example["chosen"]
            rejected_messages = example["rejected"]
            # We add an empty system message if there is none
            if chosen_messages[0]["role"] != "system":
                chosen_messages.insert(0, {"role": "system", "content": ""})
            if rejected_messages[0]["role"] != "system":
                rejected_messages.insert(0, {"role": "system", "content": ""})
            example["text_chosen"] = tokenizer.apply_chat_template(
                chosen_messages, tokenize=False
            )
            example["text_rejected"] = tokenizer.apply_chat_template(
                rejected_messages, tokenize=False
            )
        else:
            raise ValueError(
                f"Could not format example as dialogue for `rm` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    elif task == "dpo":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
            prompt_messages = [
                [msg for msg in example["chosen"] if msg["role"] == "user"][0]
            ]
            # Insert system message
            if example["chosen"][0]["role"] != "system":
                prompt_messages.insert(0, {"role": "system", "content": ""})
            else:
                prompt_messages.insert(0, example["chosen"][0])
            # TODO: handle case where chosen/rejected also have system messages
            chosen_messages = example["chosen"][1:]
            rejected_messages = example["rejected"][1:]
            example["text_chosen"] = tokenizer.apply_chat_template(
                chosen_messages, tokenize=False
            )
            example["text_rejected"] = tokenizer.apply_chat_template(
                rejected_messages, tokenize=False
            )
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            example["text_chosen"] = _strip_prefix(
                example["text_chosen"], assistant_prefix
            )
            example["text_rejected"] = _strip_prefix(
                example["text_rejected"], assistant_prefix
            )
        else:
            raise ValueError(
                f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    else:
        raise ValueError(
            f"Task {task} not supported, please ensure that the provided task is one of {['sft', 'generation', 'rm', 'dpo']}"
        )
    return example


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns:
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset2": 0.3,
        #             "dataset3": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(
    dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True
) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
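    # Fall back to the default splits when none are given, so the loop below never iterates over None.
    splits = ["train", "test"] if splits is None else splits
    raw_datasets = DatasetDict()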
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(
                    f"Split type {split} not recognized as one of test or train."
                )

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(
                seed=42
            )
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets
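
`mix_datasets` keeps the first `int(frac * len(dataset))` rows of each training set before concatenating them. The sketch below reproduces only that arithmetic on plain lists. `subsample` is a hypothetical helper used for illustration and the data is made up.

```python
def subsample(rows, frac):
    """Keep the first int(frac * len(rows)) rows, mirroring the selection in mix_datasets."""
    if frac < 0:
        raise ValueError("Dataset fractions cannot be negative.")
    return rows[: int(frac * len(rows))]


demo_a = list(range(10))        # stand-in for a 10-row training set
demo_b = list(range(100, 105))  # stand-in for a 5-row training set

mixed = subsample(demo_a, 0.5) + subsample(demo_b, 1.0)
tiny = subsample(list(range(68)), 0.005)  # int(0.34) == 0, so nothing is kept
print(mixed, len(tiny))
```

Because `int()` truncates, a very small fraction of a small dataset can select zero rows, which is worth checking before training.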

<a name="Data"></a>
### Data Prep
We follow Hugging Face's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) recipe for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), which uses the [Ultra Feedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). The cell below instead loads a manually ranked preference CSV (`final_preference_df.csv`) in which each row provides an `image_url`, a `chosen` response, and a `rejected` response.

In [2]:
# In a new Colab/Kaggle cell

# ==============================================================================
# Step 1: IMPORT LIBRARIES
# ==============================================================================
import pandas as pd
from datasets import Dataset
import requests
from PIL import Image
from io import BytesIO

# ==============================================================================
# Step 2: LOAD YOUR PREFERENCE CSV USING THE ROBUST PANDAS PARSER
# ==============================================================================
# This is the most reliable way to handle CSVs with complex, multi-line text.

print("Step 1: Loading your ranked preference dataset from CSV using Pandas...")
try:
    # --- Make sure this filename matches the CSV file you uploaded ---
    preference_csv_filename = "final_preference_df.csv"

    # pandas.read_csv has a more advanced engine that correctly handles newlines
    # and special characters inside of quoted text fields.
    df_preferences = pd.read_csv(f"/content/{preference_csv_filename}")

except FileNotFoundError:
    print(f"❌ FATAL ERROR: '{preference_csv_filename}' not found.")
    print("Please make sure the CSV file you manually ranked has been uploaded.")
    raise

# Now, convert the clean Pandas DataFrame into a Hugging Face Dataset object.
preference_dataset = Dataset.from_pandas(df_preferences)

# This is the crucial check. If this number is correct (e.g., 68), the rest of the
# pipeline will work.
print(f"✅✅✅ Successfully and correctly loaded {len(preference_dataset)} preference pairs. ✅✅✅")


# ==============================================================================
# Step 3: DEFINE THE DPO-SPECIFIC FORMATTING FUNCTION
# ==============================================================================
# This part of the pipeline remains the same. It formats the now-correctly-loaded
# data into the {prompt, chosen, rejected} structure.

PROMPT_INSTRUCTION = "Analyze the following chart and provide a detailed, formal description suitable for an IELTS Task 1 essay."
def download_image(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return Image.open(BytesIO(response.content)).convert("RGB")
    except Exception: return None

def format_preference_data(batch):
    """Formats a batch for the DPOTrainer, dropping rows whose image failed to download."""
    prompt_convo = [{"role": "user", "content": [{"type": "text", "text": PROMPT_INSTRUCTION}, {"type": "image"}]}]
    prompt_text = processor.apply_chat_template(prompt_convo, tokenize=False, add_generation_prompt=True)

    formatted = {"prompt": [], "chosen": [], "rejected": [], "images": []}
    for url, chosen, rejected in zip(batch["image_url"], batch["chosen"], batch["rejected"]):
        image = download_image(url)
        if image is None:
            continue  # Skip this row: a batched map may return fewer rows than it received.
        formatted["prompt"].append(prompt_text)
        formatted["chosen"].append(chosen)
        formatted["rejected"].append(rejected)
        formatted["images"].append([image])
    return formatted


# ==============================================================================
# Step 4: APPLY THE FORMATTING TO THE DATASET
# ==============================================================================
print("\nStep 2: Formatting the dataset for the GRPO trainer...")
# batched=True lets the function drop rows; this requires removing the original columns.
# (Returning None from a per-example map does not drop the row, and a filter on
# `x is not None` always passes because every row is a dict.)
processed_preference_dataset = preference_dataset.map(
    format_preference_data,
    batched=True,
    remove_columns=preference_dataset.column_names,
)

print(f"✅ Formatted {len(processed_preference_dataset)} pairs successfully.")
print("\nYour dataset is now correctly loaded and prepared for the GRPO trainer.")

Step 1: Loading your ranked preference dataset from CSV using Pandas...
✅✅✅ Successfully and correctly loaded 68 preference pairs. ✅✅✅

Step 2: Formatting the dataset for the GRPO trainer...


Map:   0%|          | 0/68 [00:00<?, ? examples/s]

Filter:   0%|          | 0/68 [00:00<?, ? examples/s]

✅ Formatted 68 pairs successfully.

Your dataset is now correctly loaded and prepared for the GRPO trainer.


We shall print a random item from the dataset
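
A minimal sketch of that step follows. It works on any object that supports `len()` and integer indexing, so `processed_preference_dataset` could be passed in place of the toy list. The helper name `pick_random_item` and the toy rows are assumptions made for illustration.

```python
import random


def pick_random_item(dataset, seed=None):
    """Return (index, item) for one randomly chosen row."""
    rng = random.Random(seed)
    index = rng.randrange(len(dataset))
    return index, dataset[index]


# Toy stand-in with the same text fields as the formatted preference data.
toy_dataset = [
    {"prompt": "Describe chart A.", "chosen": "A formal description.", "rejected": "idk"},
    {"prompt": "Describe chart B.", "chosen": "Another formal description.", "rejected": "nope"},
]

index, item = pick_random_item(toy_dataset, seed=0)
print(f"Row {index}")
for key, value in item.items():
    # Truncate long fields so the preview stays readable.
    print(f"  {key}: {str(value)[:80]}")
```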

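The LoRA adapters were already attached when the SFT checkpoint was loaded above, so DPO is meant to update only those adapter weights (typically 1 to 10% of all parameters) rather than the full model.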

<a name="Train"></a>
### Train the DPO model
Now let's use Hugging Face TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). We train for 3 epochs on the 68 preference pairs prepared above.
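
For intuition, the standard (sigmoid) DPO objective for a single preference pair fits in a few lines. This is a sketch of the published formula on scalar log-probabilities, not TRL's implementation. The function name and the example numbers are illustrative, and `beta=0.1` simply mirrors the value passed to the trainer in this notebook.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference on each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now favors the chosen response more than the reference does: the loss drops.
loss_after_improvement = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_at_start, loss_after_improvement)
```

A larger `beta` makes the loss react more strongly to the same shift in log-probabilities, which in effect keeps the policy closer to the reference model.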

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer

PatchDPOTrainer()

In [7]:
import torch
from torch.utils.data import DataLoader

def dpo_data_collator(features):
    # features is a list of dicts, each with keys like:
    # 'chosen_input_ids', 'chosen_attention_mask', 'rejected_input_ids', 'rejected_attention_mask'
    batch = {}
    for key in features[0].keys():
        # Stack each key's values along dim=0.
        # Note: torch.stack requires every example to have the same length for a given key.
        batch[key] = torch.stack([torch.tensor(f[key]) for f in features])
    return batch
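
Stacking only works when all sequences in a batch share one length, which tokenized responses usually do not. A common remedy is to pad each sequence to the longest one in the batch first. The sketch below shows right-padding in plain Python. `pad_batch` is a hypothetical helper, and the pad value `0` is an assumption (real code would use the tokenizer's pad token id).

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad token id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded_ids, attention_mask = pad_batch([[1, 2, 3], [4]])
print(padded_ids)
print(attention_mask)
```

After padding, every row has the same length, so the lists can be converted to tensors and stacked safely.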


In [8]:
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 5e-6,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "cosine",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    beta = 0.1,
    train_dataset = processed_preference_dataset,
    # eval_dataset = raw_datasets["test"],
    processing_class=processor,
    # tokenizer = tokenizer,
    data_collator=dpo_data_collator,
    max_length = 1024,
    max_prompt_length = 512,
)

Extracting prompt in train dataset (num_proc=2):   0%|          | 0/68 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/68 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/68 [00:00<?, ? examples/s]
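
The batch settings above determine how many optimizer steps a run takes. The sketch below works through that arithmetic for this notebook's values (68 examples, batch size 2, 4 accumulation steps, 3 epochs). Rounding the last partial accumulation window up is an assumption that matches the step count in this run's training banner, and other library versions may round differently.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return (effective batch size, total optimizer steps) for a simple training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Mini-batches the dataloader yields per epoch (last partial batch included).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Optimizer updates per epoch; a trailing partial accumulation window still counts.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return effective_batch, updates_per_epoch * num_epochs


schedule = training_schedule(68, 2, 4, 3)
print(schedule)
```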

In [9]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 68 | Num Epochs = 3 | Total steps = 27
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 0 of 8,310,915,072 (0.00% trained)


ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['prompt_input_ids', 'chosen_input_ids', 'rejected_input_ids']

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

