# Preference Alignment with GRPO

- TRL `GRPOTrainer` reference: https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOTrainer

## Import libraries


In [1]:
!uv pip install transformers datasets trl huggingface_hub --system

In [2]:
import torch
import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import GRPOConfig, GRPOTrainer

In [3]:
import os
from getpass import getpass
os.environ['HF_TOKEN'] = getpass()


In [4]:
dataset = load_dataset("maywell/ko_Ultrafeedback_binarized", token=os.environ['HF_TOKEN'])


## Format dataset

In [5]:
def process_dataset(data):
    return {"prompt": [{"role": "user", "content": data["prompt"]}]}

train_ds = dataset.map(process_dataset)
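
To see what the mapping produces, the sketch below applies the same function to a single hand-written record. The sample text is made up for illustration and is not taken from the dataset.

```python
def process_dataset(data):
    # Wrap the raw prompt string in a single-turn chat message list.
    return {"prompt": [{"role": "user", "content": data["prompt"]}]}

# Hypothetical record; real rows come from the loaded dataset.
sample = {"prompt": "Explain GRPO in one sentence."}
formatted = process_dataset(sample)
print(formatted)
# {'prompt': [{'role': 'user', 'content': 'Explain GRPO in one sentence.'}]}
```

With prompts in this chat format, the completions passed to a reward function are message lists rather than plain strings, so the reward function has to read the `content` field of the generated message.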


## Define the model

In [7]:
model_name = "josang1204/Qweb2.5-FT-CSY"

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    torch_dtype=torch.float32,
    token=os.environ['HF_TOKEN']
).to(device)
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ['HF_TOKEN'])


In [8]:
finetune_name = "Qweb2.5-FT-GRPO-CSY"
finetune_tags = ["smol-course", "module_2"]

## Train model with GRPO

In [9]:
grpo_args = GRPOConfig(
    logging_steps=10,
    output_dir="./results/",
    hub_model_id=finetune_name,
)
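
GRPO samples several completions for the same prompt and compares each reward with the rest of its group. The sketch below illustrates that idea in plain Python. The function name `group_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and the trainer's exact normalization may differ.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Center and scale the rewards of one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division safe when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions of the same prompt.
rewards = [-15.0, -3.0, 0.0, -6.0]
advantages = group_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:6.1f}  advantage={a:+.3f}")
```

Completions that score above their group's mean receive a positive advantage and are reinforced, while those below the mean receive a negative one. In `GRPOConfig`, the group size is controlled by `num_generations`.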

In [10]:
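def reward_len(completions, **kwargs):
    """Reward completions whose length is close to 20 characters."""
    rewards = []
    for completion in completions:
        # Chat-formatted prompts yield a list of messages; plain prompts yield a string.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        # Negative distance, so the reward is highest (0) at exactly 20 characters.
        rewards.append(float(-abs(20 - len(text))))
    return rewards

trainer = GRPOTrainer(
    model=model,
    args=grpo_args,
    reward_funcs=reward_len,
    train_dataset=train_ds["train"],  # chat-formatted prompts from process_dataset
    processing_class=tokenizer,
)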
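
Before launching training, it can help to call a reward function on a few hand-written completions and check that the scores rank them as intended. The sketch below assumes a length-based reward that peaks at 20 characters and accepts both plain-string and chat-style completions. `length_reward`, its `target` parameter, and the sample strings are illustrative names, not part of TRL.

```python
def length_reward(completions, target=20, **kwargs):
    """Return the negative distance between each completion's length and `target`."""
    rewards = []
    for completion in completions:
        # Chat-style completions are assumed to be a list of message dicts.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(float(-abs(target - len(text))))
    return rewards

samples = [
    "short",                                       # 5 characters
    "exactly twenty chars",                        # 20 characters
    [{"role": "assistant", "content": "x" * 35}],  # 35 characters, chat-style
]
print(length_reward(samples))  # [-15.0, 0.0, -15.0]
```

A reward that returned the positive distance instead would rank the 20-character sample lowest, which is the opposite of the intended behavior, so a check like this catches sign errors early.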

In [None]:
trainer.train()  # Train the model

In [None]:
trainer.save_model(f"./{finetune_name}")

In [None]:
trainer.push_to_hub(tags=finetune_tags, token=os.environ["HF_TOKEN"])

## 💐 You're done!

This notebook walked through fine-tuning the `josang1204/Qweb2.5-FT-CSY` model with the `GRPOTrainer`, using a simple length-based reward function. The same steps apply to other tasks: swap in a reward function that scores the behavior you want the model to learn. If you want to carry on working on this course, here are steps you could try out:

- Try this notebook on a harder difficulty
- Review a colleague's PR
- Improve the course material via an Issue or PR