# DPO Training with LLaMA-Factory

This notebook demonstrates Direct Preference Optimization (DPO) training with LLaMA-Factory on a small model, Gemma 2B. The training command in this notebook sets no quantization option; it trains LoRA adapters with fp16 mixed precision.

## Setup

First, clone the LLaMA-Factory repository and install its dependencies.

In [5]:
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
!pip install -r requirements.txt

## Hugging Face Authentication

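Log in with a Hugging Face access token so the notebook can download the model and dataset. Gemma is a gated model, so the account behind the token must have accepted its license on the Hub.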

In [6]:
from huggingface_hub import login
login()
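
For non-interactive runs, the widget prompt can be inconvenient. The following is a minimal sketch, assuming the token is stored in an environment variable named `HF_TOKEN`; the variable name and the helpers `resolve_token` and `login_from_env` are illustrative choices, not part of the original notebook.

```python
import os


def resolve_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or None if it is unset or blank.

    `var_name` is an assumed convention for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


def login_from_env():
    """Log in with the token from the environment, else fall back to the prompt."""
    # Imported lazily so resolve_token() works without huggingface_hub installed.
    from huggingface_hub import login

    token = resolve_token()
    if token:
        login(token=token)
    else:
        login()  # interactive prompt, as in the cell above
```

Calling `login_from_env()` in place of `login()` keeps the interactive behavior when no token is set.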


## DPO Training

Now run DPO training on Gemma 2B with LoRA adapters.

In [7]:
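# Assumption: the entry point, the --stage/--do_train flags, and the dataset key follow recent
# LLaMA-Factory releases (older releases use src/train_bash.py). Check the README and
# data/dataset_info.json of the version you cloned.
!python src/train.py \
    --stage dpo \
    --do_train \
    --model_name_or_path google/gemma-2b \
    --dataset hh_rlhf_en \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir output_dpo \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16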
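
The batch-size flags interact: gradients from 4 micro-batches of 2 examples are accumulated before each optimizer update. The sketch below works through that arithmetic. It is a rough estimate only: the dataset size of 10,000 pairs is a made-up figure for illustration, and the exact rounding of step counts depends on the trainer version.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps(num_examples, per_device_batch, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# Values taken from the training command; 10_000 examples is hypothetical.
print(effective_batch_size(2, 4))           # 8
total = optimizer_steps(10_000, 2, 4, 3.0)
print(total)                                # 3750
print(total // 1000)                        # 3 checkpoints with --save_steps 1000
```

Under these assumptions, `--logging_steps 10` would log the loss every 80 training examples.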

## Inference

After DPO training, let's test the model with some example prompts.

In [8]:
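# Assumption: a recent LLaMA-Factory release with the package installed (for example
# `pip install -e .` in the repository root), which provides the llamafactory-cli command.
# Chat mode reads prompts interactively, so run it from a terminal if the cell cannot take input.
!llamafactory-cli chat \
    --model_name_or_path google/gemma-2b \
    --adapter_name_or_path output_dpo \
    --template default \
    --finetuning_type lora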

## Explanation of DPO Training

Direct Preference Optimization (DPO) fine-tunes a language model directly on pairs of preferred and rejected responses. It aligns the model's outputs with human preferences without training a separate reward model.

Key aspects of DPO training:
1. Uses a dataset in which each prompt has a preferred and a non-preferred response.
2. Directly optimizes the policy to raise the likelihood of the preferred response relative to the non-preferred one, measured against a frozen reference model.
3. Avoids the need for a separate reward model or a reinforcement learning loop.
4. Can be more stable and cheaper to train than PPO-based RLHF pipelines.
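
To make the objective concrete, here is a minimal sketch of the per-pair DPO loss, written in plain Python rather than taken from LLaMA-Factory's implementation. It assumes the summed log-probabilities of each response under the policy and the reference model are already available; `beta=0.1` is a commonly used value chosen here for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favors the chosen response more than the reference does: the loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls as the policy widens the gap between chosen and rejected responses beyond what the reference model assigns, which is how DPO sidesteps an explicit reward model.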

In this example, we used the Anthropic HH-RLHF dataset, in which each record pairs a chosen and a rejected assistant response to the same conversation. DPO training adjusts the model's parameters so that it assigns higher likelihood to the chosen responses than to the rejected ones.
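
As an illustration of the data shape, the sketch below splits one HH-RLHF-style record into a shared prompt and its two candidate responses. It assumes each record has `chosen` and `rejected` transcripts that share everything up to the final `Assistant:` turn; the sample record is invented for this example, and `split_pair` is a hypothetical helper, not a LLaMA-Factory function.

```python
MARKER = "\n\nAssistant:"


def split_pair(record):
    """Split a record into the shared prompt and the two final responses."""
    chosen, rejected = record["chosen"], record["rejected"]
    cut_chosen = chosen.rfind(MARKER)
    cut_rejected = rejected.rfind(MARKER)
    if cut_chosen == -1 or cut_rejected == -1:
        raise ValueError("transcript has no assistant turn")
    prompt = chosen[:cut_chosen + len(MARKER)]
    if rejected[:cut_rejected + len(MARKER)] != prompt:
        raise ValueError("chosen and rejected transcripts do not share a prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[cut_chosen + len(MARKER):].strip(),
        "rejected": rejected[cut_rejected + len(MARKER):].strip(),
    }


# Invented sample record in the assumed format.
record = {
    "chosen": "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: I don't know.",
}
pair = split_pair(record)
print(pair["chosen"])    # Simmer it for about ten minutes.
print(pair["rejected"])  # I don't know.
```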