# Preference Alignment with Direct Preference Optimization (DPO)

This notebook will guide you through the process of fine-tuning a language model using Direct Preference Optimization (DPO). We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
     <h2 style='margin: 0;color:blue'>Exercise: Aligning SmolLM2 with DPOTrainer</h2>
     <p>Take a dataset from the Hugging Face hub and align a model on it. </p>
     <p><b>Difficulty Levels</b></p>
     <p>🐢 Use the `trl-lib/ultrafeedback_binarized` dataset</p>
     <p>🐕 Try out the `argilla/ultrafeedback-binarized-preferences` dataset</p>
     <p>🦁 Select a dataset that relates to a real-world use case you’re interested in, or use the model you trained in
        <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>

In [1]:
!uv pip install transformers datasets trl huggingface_hub --system

[2mUsing Python 3.11.11 environment at /usr[0m
[2K[2mResolved [1m61 packages[0m [2min 457ms[0m[0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2mdill      [0m [32m----[2m--------------------------[0m[0m 14.93 KiB/113.53 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2mdill      [0m [32m----[2m--------------------------[0m[0m 14.93 KiB/113.53 KiB
[2mmultiprocess[0m [32m[2m------------------------------[0m[0m     0 B/140.16 KiB
[2K[3A[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2mdill      [0m [32m----[2m--------------------------[0m[0m 14.93 KiB/113.53 KiB
[2mmultiprocess[0m [32m[2m------------------------------[0m[0m     0 B/140.16 KiB
[2mxxhash    [0m [32m[2m------------------------------[0m[0m     0 B/190.26 KiB
[2K[4A[37m⠙[0m [2mPreparing packages...[0m (0/16)
[2mdill      [0m [32m----[2m---

In [2]:
import os
from getpass import getpass
os.environ['HF_TOKEN'] = getpass()

··········


## Import libraries


In [3]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

## Format dataset

In [7]:
dataset = load_dataset("maywell/ko_Ultrafeedback_binarized", token=os.environ['HF_TOKEN'])

(…)-00000-of-00001-dc7eba5173eb6ca1.parquet:   0%|          | 0.00/110M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/61966 [00:00<?, ? examples/s]

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 61966
    })
})

In [13]:
def process_dataset(data):
    return {"prompt": [{"role": "user", "content": data["prompt"]}],
            "chosen": [{"role": "assistant", "content": data["chosen"]}],
            "rejected": [{"role": "assistant", "content": data["rejected"]}]}

train_ds = dataset.map(process_dataset)

In [15]:
train_ds["train"][0]

{'prompt': [{'content': '극단주의 및 폭력적 이데올로기를 확산하기 위해 소셜 미디어 플랫폼이 활용된 방식을 분석하고 예를 제시하는 1,000단어 분량의 오피니언 기고문을 공식적인 어조로 작성합니다. 분석에서 이러한 단체가 온라인에서 메시지를 전파하기 위해 사용하는 구체적인 전술과 이러한 전술이 개인과 사회에 미치는 영향에 대해 논의하세요. 또한 소셜 미디어에서 이러한 위험한 이데올로기의 확산을 막기 위해 시행할 수 있는 가능한 해결책을 제시하세요. 논거를 뒷받침할 수 있는 평판이 좋은 출처를 인용하여 잘 조사된 글이어야 합니다.',
   'role': 'user'}],
 'chosen': [{'content': '제목: 소셜 미디어와 극단주의의 독성 동맹: 인류의 진보에 대한 위협소셜 미디어 플랫폼의 등장은 우리가 정보를 공유하고 소비하는 방식에 영구적인 혁명을 가져왔습니다. 이러한 플랫폼은 전 세계 시청자에게 정보에 대한 접근성을 제공하고 개인이 지리적 장벽에 관계없이 연결할 수 있게 해주지만, 극단주의적이고 폭력적인 이데올로기의 확산의 배양지가 되어 인류의 진보에 심각한 위협이 되고 있습니다. 소셜 미디어의 확산과 이러한 플랫폼이 제공하는 익명성으로 인해 극단주의자들이 더 쉽게 자신의 메시지를 전파하고 신규 회원을 모집하며 표적 집단에 대한 폭력을 조장하여 치명적인 결과를 초래하고 있습니다.극단주의 단체는 종종 소셜 미디어의 바이럴 특성을 이용해 자신의 이데올로기를 전파합니다. 빠르게 진행되는 동영상, 눈에 띄는 슬로건, 정서적으로 충전된 콘텐츠는 이들이 사용하는 전략 중 일부입니다. 청중의 관심을 끌고 특히 공포, 분노, 증오를 불러일으키기 위해 빠르고 공격적인 전술을 사용합니다. 잘못된 정보와 음모론의 유포는 전통적인 미디어 소스에 대한 신뢰의 붕괴를 이용하여 극단주의적 이데올로기의 확산을 촉진합니다. 사용자에게 참여할 가능성이 높은 콘텐츠를 보여주도록 설계된 소셜 미디어 알고리즘은 실수로 더 극단적인 콘텐츠를 홍보하여 가시성을 더욱

In [None]:
# TODO: 🐕 If your dataset is not represented as conversation lists, you can use the `process_dataset` function to convert it.

## Select the model

We will use the SmolLM2-135M-Instruct model which has already been through a SFT training, so it it compatible with DPO. You can also use the model you trained in [1_instruction_tuning](../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb).


<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; width:80%; color:black'>
     <p>🦁 change the model to the path or repo id of the model you trained in <a href="../../1_instruction_tuning/notebooks/sft_finetuning_example.ipynb">1_instruction_tuning</a></p>
</div>


In [9]:
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

In [11]:
model = AutoModelForCausalLM.from_pretrained("josang1204/Qweb2.5-FT-CSY", token=os.environ['HF_TOKEN'])
tokenizer = AutoTokenizer.from_pretrained("josang1204/Qweb2.5-FT-CSY", token=os.environ['HF_TOKEN'])

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.26k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

In [16]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

model = model.to(device)
model.config.use_cache = False
finetune_name = "Qweb2.5-FT-DPO-CSY"
finetune_tags = ["smol-course", "module_2"]

## Train model with DPO

In [17]:
# Training arguments
training_args = DPOConfig(
    # Training batch size per GPU
    per_device_train_batch_size=4,
    # Number of updates steps to accumulate before performing a backward/update pass
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    gradient_accumulation_steps=4,
    # Saves memory by not storing activations during forward pass
    # Instead recomputes them during backward pass
    gradient_checkpointing=True,
    # Base learning rate for training
    learning_rate=5e-5,
    # Learning rate schedule - 'cosine' gradually decreases LR following cosine curve
    lr_scheduler_type="cosine",
    # Total number of training steps
    max_steps=100,
    # Disables model checkpointing during training
    save_strategy="no",
    # How often to log training metrics
    logging_steps=1,
    # Directory to save model outputs
    output_dir="qwen_csy_dpo_output",
    # Number of steps for learning rate warmup
    warmup_steps=100,
    # Use bfloat16 precision for faster training
    bf16=True,
    # Disable wandb/tensorboard logging
    report_to="none",
    # Keep all columns in dataset even if not used
    remove_unused_columns=False,
    # Enable MPS (Metal Performance Shaders) for Mac devices
    # use_mps_device=device == "mps",
    # Model ID for HuggingFace Hub uploads
    hub_model_id=finetune_name,
    # DPO-specific temperature parameter that controls the strength of the preference model
    # Lower values (like 0.1) make the model more conservative in following preferences
    beta=0.1,
    # Maximum length of the input prompt in tokens
    max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    max_length=1536,
)

In [20]:
trainer = DPOTrainer(
    # The model to be trained
    model=model,
    # Training configuration from above
    args=training_args,
    # Dataset containing preferred/rejected response pairs
    train_dataset=dataset["train"],
    # Tokenizer for processing inputs
    processing_class=tokenizer,
    # DPO-specific temperature parameter that controls the strength of the preference model
    # Lower values (like 0.1) make the model more conservative in following preferences
    # beta=0.1,
    # Maximum length of the input prompt in tokens
    # max_prompt_length=1024,
    # Maximum combined length of prompt + response in tokens
    # max_length=1536,
)

Applying chat template to train dataset:   0%|          | 0/61966 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/61966 [00:00<?, ? examples/s]

In [21]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

# Save to the huggingface hub if login (HF_TOKEN is set)


Step,Training Loss
1,0.6931
2,0.6931
3,0.6209
4,0.6148
5,1.126
6,1.3527
7,1.0653
8,0.5975
9,0.6958
10,1.3396


HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67a893e7-1bc46fc42da7e1e663a01648;fb206d09-3924-4289-8d2a-6d471a7f3a59)

Invalid credentials in Authorization header

In [22]:
trainer.push_to_hub(tags=finetune_tags, token=os.environ["HF_TOKEN"])

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/6.20k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/josang1204/Qweb2.5-FT-DPO-CSY/commit/a289d7ec7c82cfe7f3c226227dfffaa335a36712', commit_message='End of training', commit_description='', oid='a289d7ec7c82cfe7f3c226227dfffaa335a36712', pr_url=None, repo_url=RepoUrl('https://huggingface.co/josang1204/Qweb2.5-FT-DPO-CSY', endpoint='https://huggingface.co', repo_type='model', repo_id='josang1204/Qweb2.5-FT-DPO-CSY'), pr_revision=None, pr_num=None)

In [23]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

## 💐 You're done!

This notebook provided a step-by-step guide to fine-tuning the `HuggingFaceTB/SmolLM2-135M` model using the `DPOTrainer`. By following these steps, you can adapt the model to perform specific tasks more effectively. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleagues PR
- Improve the course material via an Issue or PR.