https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from huggingface_hub import notebook_login

# use notebook key. Paste with menu: Edit->paste in vscode.
notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [3]:
from datasets import load_dataset

dataset_id = "AI-MO/NuminaMath-TIR"
train_dataset, test_dataset = load_dataset(
    dataset_id, split=["train[:5%]", "test[:5%]"]
)

In [4]:
train_dataset[0]

{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.',
 'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we w

In [5]:
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)


def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }


train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)

In [6]:
train_dataset[0]["prompt"]
if ("messages" or "problem") in train_dataset.column_names:
    train_dataset = train_dataset.remove_columns(["messages", "problem"])
train_dataset[0]

{'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient \\(\\binom{8}{6}\\).\n2. Compute \\(\\left(\\frac{3}{5}\\right)^2\\).\n3. Comp

In [16]:
import torch
from transformers import AutoModelForCausalLM

from transformers import AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [17]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093


In [22]:
import re


def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    return [1.0 if match else 0.0 for match in matches]


In [23]:
from math_verify import LatexExtractionConfig, parse, verify


def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs["solution"]
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(
            solution,
            extraction_mode="first_match",
            extraction_config=[LatexExtractionConfig()],
        )
        answer_parsed = parse(
            content,
            extraction_mode="first_match",
            extraction_config=[LatexExtractionConfig()],
        )
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            rewards.append(1.0)
    return rewards


def combined_rewards(self, completions, solutions):
    format_rewards = format_reward(completions)
    accuracy_rewards = accuracy_reward(completions, solutions)
    combined_rewards = 0.5 * format_rewards + 0.5 * accuracy_rewards
    return combined_rewards

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from datasets import load_dataset
import numpy as np

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")


class GRPOTrainer:
    def __init__(self, model, ref_model, reward_model, tokenizer, device, lr=1e-5):
        self.model = model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.device = device

    def sample_outputs(self, model, prompts, num_samples=3):
        inputs = self.tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True
        ).to(self.device)
        outputs = [model.generate(**inputs) for _ in range(num_samples)]
        completions = [
            self.tokenizer.batch_decode(out, skip_special_tokens=True)
            for out in outputs
        ]
        return completions

    def compute_rewards(self, completions, solutions):
        rewards = [self.reward_model(c, solutions) for c in completions]
        return np.array(rewards)

    def compute_advantage(self, rewards):
        baseline = np.mean(rewards)
        return rewards - baseline

    def train_step(self, questions, solutions):
        completions = self.sample_outputs(self.model, questions)
        rewards = self.compute_rewards(completions, solutions)
        advantages = self.compute_advantage(rewards)

        loss = -torch.mean(
            torch.tensor(advantages, dtype=torch.float32, requires_grad=True)
        )

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
    
trainer = GRPOTrainer(
    model,
    ref_model=model,
    reward_model=combined_rewards,
    tokenizer=tokenizer,
    device=device,
)    


In [43]:
tokenizer(formatted_text).to("mps")


{'input_ids': [151644, 8948, 198, 32, 10435, 1948, 2657, 323, 21388, 13, 576, 1196, 17064, 264, 3405, 11, 323, 279, 21388, 67477, 432, 13, 576, 17847, 1156, 15482, 911, 279, 32711, 1882, 304, 279, 3971, 323, 1221, 5707, 279, 1196, 448, 279, 4226, 13, 576, 32711, 1882, 323, 4226, 525, 43810, 2878, 366, 26865, 29, 690, 26865, 29, 323, 366, 9217, 29, 690, 9217, 29, 9492, 11, 15576, 11, 600, 1734, 2572, 366, 26865, 29, 32711, 1882, 1588, 690, 26865, 1784, 9217, 29, 4226, 1588, 690, 9217, 29, 151645, 198, 151644, 872, 198, 3838, 374, 279, 35606, 315, 400, 87, 61, 17, 88, 61, 21, 3, 304, 279, 14461, 315, 57960, 2359, 11520, 37018, 90, 18, 15170, 20, 92, 87, 30529, 37018, 90, 88, 15170, 17, 11035, 1291, 29776, 23, 3, 30, 220, 17399, 697, 4226, 438, 264, 4185, 19419, 13, 151645, 198], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [44]:
# Training loop
for epoch in range(3):
    for sample in train_dataset:
        formatted_text = tokenizer.apply_chat_template(sample["prompt"], tokenize=False)
        loss = trainer.train_step([formatted_text], [sample["solution"]])
        print(f"Epoch {epoch}, Loss: {loss}")


TypeError: combined_rewards() missing 1 required positional argument: 'solutions'

In [28]:
sample

{'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient \\(\\binom{8}{6}\\).\n2. Compute \\(\\left(\\frac{3}{5}\\right)^2\\).\n3. Comp

In [28]:
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control de data preprocessing
    max_completion_length=64,  # default: 256
    num_generations=4,  # default: 8
    max_prompt_length=128,  # default: 512
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)

In [31]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)
do_train = False
if do_train:
    # Train the model
    trainer.train()
    trainer.save_model(training_args.output_dir)


In [None]:
model_id = "sergiopaniego/Qwen2-0.5B-GRPO"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

In [15]:
test_dataset["prompt"][0]

[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>',
  'role': 'system'},
 {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?",
  'role': 'user'}]

In [16]:
import time


def generate_with_reasoning(prompt, r_model, r_tokenizer):
    # Build the prompt from the dataset
    prompt = " ".join(entry["content"] for entry in prompt)

    # Tokenize and move to the same device as the model
    inputs = r_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = r_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = r_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens
    response_text = generated_text[len(prompt) :].strip()

    return response_text, generated_text, inference_duration, num_generated_tokens

In [17]:
prompt = test_dataset["prompt"][0]
response_text, generated_text, inference_duration, num_generated_tokens = (
    generate_with_reasoning(prompt, trained_model, trained_tokenizer)
)
print(response_text)


<think> Reasoning process here </think> <answer> This person is 40 years old. </answer>


In [None]:
# Use raw model without reasoning training
response_text, generated_text, inference_duration, num_generated_tokens = (
    generate_with_reasoning(prompt, model, trained_tokenizer)
)

print(response_text)


(Assume no leap years.)
<think> Let's call the person's age x.</think>
<think> Since the person's age is equal to the sum of the digits of their birth year, we can write: x = 10y + z, where y is the hundreds digit and z is the tens digit.</think>
<think> To find the value of y and z, we need to know the tens digit of the birth year, which is 2 for 1988.</think>
<think> Therefore, y = 1988 - 2 = 1986.</think>
<think> Now, we can solve for z by substituting the values of y and z into the equation x = 10y + z:</think>
<think> x = 10*1986 + 2 = 19860 + 2 = 19862</think>
<think> So, the person's age is 19862.</think>


: 

In [72]:
print("-----")
# print("answer:", test_dataset[0]["answer"])
print(test_dataset[0]["solution"])


-----
To solve this problem, let's break it down step-by-step:

1. Let the person's birth year be \( Y \).
2. In 1988, the person's age would be \( 1988 - Y \).
3. The sum of the digits of \( Y \) should be equal to their age in 1988.

Therefore, we need to find a year \( Y \) such that:

\[ 1988 - Y = \text{sum of the digits of } Y \]

We can solve this by iterating through possible values for \( Y \) and checking if the condition holds.

Let's write a Python script to find the correct birth year \( Y \).
```python
def digit_sum(year):
    """Calculate the sum of the digits of a year."""
    return sum(int(digit) for digit in str(year))

def find_birth_year():
print((    for year in range(1900, 1989):))  # Reasonable range given the
```
```output
Cell In[210], line 6
    for year in range(1900, 1989):  # Reasonable range given the
                                                                ^
SyntaxError: incomplete input
```
It looks like the code was cut off prematurely. Let me c