# Recreating the DeepSeek R1 "aha moment": an RL tutorial

The release of DeepSeek R1 shocked the industry. Why? DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, trained using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach. The team not only released the model but also published a research paper describing how they built it.

In the paper they describe an "aha moment" that occurred when using pure RL to train the model. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learns to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do it. The authors characterize it as follows:
 
> This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

It serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

In this blog post we will recreate the "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model using reinforcement learning to teach it self-verification and search abilities all on its own. 

You will learn how to:
1. [Set up the development environment](#1-setup-development-environment)
2. [Generate training samples with reasoning prefix from the Countdown Game](#2-generate-training-samples-with-reasoning-prefix-from-the-countdown-game)
3. [Train the model using GRPO](#3-train-the-model-using-grpo)
4. [Test and evaluate the aligned model](#4-test-and-evaluate-the-aligned-model)

_Note: This blog is inspired by [Jiayi Pan](https://x.com/jiayi_pirate/status/1882839370505621655), who initially explored the idea and proved it with a small model._

Before we start, let's take a look at [Group Relative Policy Optimization (GRPO)](https://arxiv.org/abs/2402.03300) and understand how it works.

## Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm to improve the reasoning capabilities of LLMs. It was introduced in the [DeepSeekMath](https://arxiv.org/abs/2402.03300) paper in the context of mathematical reasoning. GRPO modifies the traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates baselines from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule-based or binary rewards as well as general reward models to improve models on helpfulness. At a high level, GRPO works in four steps:

1. **Sampling**: Generate multiple outputs for each prompt using the current policy.
2. **Reward Scoring**: Each generation is scored using a reward function, which can be rule-based or outcome-based.
3. **Advantage Calculation**: The average reward of the generated outputs is used as a baseline. The advantage of each output within the group is then computed relative to this baseline and normalized within the group.
4. **Policy Optimization**: The policy is updated to maximize the GRPO objective, which includes the calculated advantages and a KL divergence term. This differs from PPO, which implements the KL term within the reward.
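
To make steps 3 and 4 more concrete, here is a minimal sketch in plain Python. It is not TRL's implementation: the function names `group_advantages` and `kl_penalty`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration. The KL estimate follows the form given in the DeepSeekMath paper.

```python
import math

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = sum(rewards) / len(rewards)
    # Population standard deviation of the group (an assumption of this sketch)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    # eps avoids a division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalty(logp, ref_logp):
    """Per-token KL estimate: ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta.

    The estimate is zero when both policies agree and positive otherwise.
    """
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0

# Four sampled completions for one prompt, scored with a binary reward
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
print(kl_penalty(logp=-1.2, ref_logp=-1.0))
```

Note that if every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal. This is one reason to sample several generations per prompt.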

![grpo.png](/static/blog/deepseek-r1/grpo.png)

## 1. Setup development environment

Our first step is to install PyTorch and the Hugging Face libraries, including trl, transformers and datasets, as well as vllm. If you haven't heard of trl yet, don't worry. It is a library on top of transformers and datasets that makes it easier to fine-tune, apply RLHF to, and align open LLMs.


In [None]:
# Install PyTorch & other libraries, make sure to match your GPU driver version
%pip install "torch==2.5.1" vllm tensorboard  "setuptools<71.0.0"  --index-url https://download.pytorch.org/whl/cu121

# Install flash-attn
%pip install flash-attn 

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.48.1" \
  "datasets==3.1.0" \
  "accelerate==1.3.0" \
  "bitsandbytes==0.45.0" \
  "peft==0.14.0" \
  "hf-transfer==0.1.9" 
  
# Install TRL from main branch 
%pip install git+https://github.com/huggingface/trl.git@main --upgrade

_Note: you may need to restart the kernel to use updated packages._

We will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. This means we will automatically push our model, logs and information to the Hub during training. You must register on [Hugging Face](https://huggingface.co/join) for this. Once you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.



In [None]:
from huggingface_hub import login

login(token="", add_to_git_credential=True) # ADD YOUR TOKEN HERE

## 2. Generate training samples with reasoning prefix from the Countdown Game

The Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach or get as close as possible to a target number.

```
Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6

((100 + 6) × 3 × 75 - 50) / 25 = 952
```
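
To get a feel for the task, here is a small brute-force sketch for the 3 to 4 number case. The helper `solve_left_to_right` is hypothetical and deliberately simplified: it uses every number exactly once and only tries left-to-right bracketing, so it can miss solutions that need a different grouping. The numbers and target are taken from the sample used later in this post.

```python
from fractions import Fraction
from itertools import permutations, product
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def solve_left_to_right(numbers, target):
    """Return an expression that uses every number once and hits target, or None.

    Only left-to-right bracketing such as ((a op b) op c) op d is tried.
    """
    for perm in permutations(numbers):
        for ops in product(OPS, repeat=len(perm) - 1):
            # Fractions keep divisions exact instead of using floats
            value = Fraction(perm[0])
            expr = str(perm[0])
            try:
                for op, n in zip(ops, perm[1:]):
                    value = OPS[op](value, Fraction(n))
                    expr = f"({expr} {op} {n})"
            except ZeroDivisionError:
                continue
            if value == target:
                return expr
    return None

print(solve_left_to_right([19, 36, 55, 7], 65))
```

Brute force is feasible here because the search space is tiny. The point of the tutorial is that the model has to learn a comparable search and verification behavior purely from the reward signal.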

We are going to use the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset, which contains samples with 3 to 4 numbers and solutions.

As our model we are going to use [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), a 3B parameter instruction-tuned model. This makes it easier to showcase the "aha moment" as it already follows the prompt format. But you can use the base version of Qwen or other models as well. [Jiayi-Pan](https://x.com/jiayi_pirate/status/1882839487417561307) found that the model needs to have a certain quality to be able to learn the reasoning process, starting with > 1.5B parameters.


In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load dataset from Hugging Face Hub
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
# select a random subset of 50k samples
dataset = dataset.shuffle(seed=42).select(range(50000))

# Load tokenizer from Hugging Face Hub to format the dataset to our "r1" prompt 
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# generate r1 prompt with a prefix so the model already starts with the thinking process
def generate_r1_prompt(numbers, target):
    r1_prefix = [{
        "role": "system",
        "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."
      },
      { 
        "role": "user",
        "content": f"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>."
      },
      {
        "role": "assistant",
        "content": "Let me solve this step by step.\n<think>"
      }]
    return {"prompt": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), "target": target}

# convert our dataset to the r1 prompt
dataset = dataset.map(lambda x: generate_r1_prompt(x["nums"], x["target"]), remove_columns=dataset.features.keys())

# split the dataset into train and test
train_test_split = dataset.train_test_split(test_size=0.1)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]


Let's look at the first sample.
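
In the notebook, `print(train_dataset[0]["prompt"])` shows the real string produced by the tokenizer. The sketch below only approximates it without downloading anything: `render_chatml_prefix` is a hypothetical helper that assumes a ChatML-style template (`<|im_start|>` / `<|im_end|>` markers), and the numbers and target are illustrative. The tokenizer's own `apply_chat_template` remains the authoritative source for the exact format.

```python
def render_chatml_prefix(messages):
    """Approximate a ChatML-style template that leaves the last message open.

    Leaving the final assistant turn unclosed mirrors the effect of
    continue_final_message=True: the model continues after "<think>".
    """
    parts = []
    for i, message in enumerate(messages):
        is_last = i == len(messages) - 1
        text = f"<|im_start|>{message['role']}\n{message['content']}"
        # Close every turn except the final one
        parts.append(text if is_last else text + "<|im_end|>\n")
    return "".join(parts)

# Illustrative sample, not taken from the dataset
numbers, target = [19, 36, 55, 7], 65
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"Using the numbers {numbers}, create an equation that equals {target}."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
]
prompt = render_chatml_prefix(messages)
print(prompt)
```

Because the prompt ends with an open `<think>` tag, the model's completion starts inside the thinking block rather than with a fresh assistant turn.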

## 3. Train the model using GRPO

TRL supports Group Relative Policy Optimization (GRPO) through a dedicated [GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer) for training LLMs with reward functions or reward models, as described in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). The `GRPOTrainer` is a subclass of the `Trainer` from the `transformers` library and supports all the same features, including logging, checkpointing, distributed training, and parameter efficient fine-tuning (PEFT).

The `GRPOTrainer` supports generic Outcome Reward Models (ORM) and custom reward functions, which can be used to implement Rule-Based Reward Models. In the DeepSeek R1 paper, the authors implemented Rule-Based Reward Models to verify the correctness of the generated solutions. In our example we take a similar approach and create a reward function that:
1. Checks if the generated format is correct: `<think> [thinking] </think>\n<answer> [answer] </answer>`
2. Extracts the equation from the answer, evaluates it, and compares the result to the target.

_Note: Correct `<answer>` in our example includes the equation and the result, for example `<answer> 55 + 36 - 7 - 19 = 65 </answer>`_

In [2]:
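# GRPOTrainer passes every dataset column except "prompt" and "completion" to the
# reward function as keyword arguments. This line mimics that filtering on a dummy
# sample to show that only "target" is forwarded.
reward_kwargs = {key: [] for key in {"prompt":"1","completion":"2","target":3}.keys() if key not in ["prompt", "completion"]}
print(reward_kwargs)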

{'target': []}


In [3]:
import re 

# Custom reward function to train R1 like reasoning model on the Countdown Game
def reward_func(completions, target, **kwargs):
    """
    Evaluates completions based on:
    1. Format: <think>...</think><answer>...</answer>
    2. Mathematical correctness of the answer

    Args:
        completions (list[str]): Generated outputs
        target (list[str]): Expected answers
    
    Returns:
        list[float]: Reward scores
    """
    rewards = []

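    for completion, gt in zip(completions, target):
        # The prompt already ends with "<think>", so generated completions usually
        # lack the opening tag; add it back before checking the format
        if "<think>" not in completion:
            completion = "<think>" + completion
        # Check if the format is correct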
        regex = r"<think>\s*(.*?)\s*</think>\s*<answer>\s*(.*?)\s*</answer>"

        match = re.search(regex, completion, re.DOTALL)  # Use re.DOTALL here
        if match is None or len(match.groups()) != 2:
            rewards.append(0.0)
            continue

        # Extract the "answer" part from the completion
        answer_equation = match.group(2)
        if not answer_equation:
            rewards.append(0.0)
            continue

        # Check the answer contains an equation
        if "=" not in answer_equation.strip():
            rewards.append(0.0)
            continue

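        try:
            # Split into the expression and the claimed result; unpacking raises
            # ValueError (caught below) if the answer contains more than one "="
            equation, answer_target = map(str.strip, answer_equation.split("="))
            # Replace '×' with '*' for Python compatibility
            equation = equation.replace("×", "*")
            # Only allow digits, arithmetic operators, parentheses, dots and whitespace
            # so that eval never runs arbitrary model-generated code
            if not re.match(r"^[\d+\-*/().\s]+$", equation):
                rewards.append(0.0)
                continue
            # Evaluate the expression without access to builtins
            result = eval(equation, {"__builtins__": None}, {})
            # Check if the equation is correct and if it matches the ground truth
            if float(result) == float(answer_target) and float(result) == float(gt):
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        except Exception:
            # If parsing or evaluation fails, reward is 0
            rewards.append(0.0)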

    return rewards


Let's try our reward function with a few samples.

In [4]:
correct_sample_1 = """Let me solve this step by step.
<think> We need to find an equation using the numbers 19, 36, 55, and 7
exactly once, with basic arithmetic operations, that equals 65. One possible
combination is 55 + 36 - 19 + 7... </think>
<answer> 55 + 36 - 7 - 19 = 65 </answer>"""

correct_sample_2 = """<think> ... </think>
<answer> 55 + 36 - 7 - 19 = 65 </answer>"""

wrong_format = """User: Using the numbers [19, 36, 55, 7], create an equation that equals 65."""

wrong_result = """<think> ... </think>
<answer> 55 + 36 - 7 - 19 </answer>"""


test_rewards = reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_result], target=["65", "65", "65", "65"])
assert test_rewards == [1.0, 1.0, 0.0, 0.0], "Reward function is not working"
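
One gap worth noting: `reward_func` only checks that the expression evaluates to the target, so an answer like `<answer> 65 = 65 </answer>` would also earn the full reward, even though the prompt asks for each number to be used once. A stricter check could look like the sketch below. The helper `uses_each_number_once` is hypothetical, it assumes all numbers must be used exactly once, and wiring it into the reward function would require keeping the `nums` column in the dataset instead of removing it.

```python
import re

def uses_each_number_once(equation, numbers):
    """Return True if the integers in `equation` are exactly `numbers`, each used once."""
    used = [int(n) for n in re.findall(r"\d+", equation)]
    return sorted(used) == sorted(numbers)

print(uses_each_number_once("55 + 36 - 7 - 19", [19, 36, 55, 7]))  # True
print(uses_each_number_once("65", [19, 36, 55, 7]))                # False
```

A check like this closes an obvious shortcut the policy could otherwise exploit during training.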



This looks good. Now let's define our remaining training parameters, create a trainer and start training.

In [5]:
import re 
from trl import GRPOConfig, GRPOTrainer, get_peft_config, ModelConfig

# our model we are going to use as policy 
model_config = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
    use_peft=True,
    load_in_4bit=True,
)


# Hyperparameters
training_args = GRPOConfig(
    output_dir="qwen-r1-aha-moment",
    learning_rate=1e-5,
    beta=0.04, # KL coefficient
    lr_scheduler_type="cosine",
    logging_steps=10,
    max_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
    # GRPO specific parameters
    max_prompt_length=256,
    max_completion_length=1024, # max length of the generated output for our solution
    num_generations=2,
    
)
trainer = GRPOTrainer(
    model=model_config.model_name_or_path,
    reward_funcs=reward_func,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=get_peft_config(model_config),
)


In [6]:
# Train and push the model to the Hub
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


In [None]:
# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=dataset_id)


## 4. Test and evaluate the aligned model

After the training is done we want to evaluate and test our model. Similar to our SFT model, we will evaluate the model on [GSM8K](https://huggingface.co/datasets/openai/gsm8k) dataset to see if it improved performance. GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

Evaluating Generative AI models is not a trivial task since 1 input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out:
* [Evaluate LLMs and RAG a practical example using Langchain and Hugging Face](https://www.philschmid.de/evaluate-llm).
* [Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM](https://www.philschmid.de/evaluate-llms-with-lm-eval-and-tgi-vllm)
* [LLM Evaluation doesn't need to be complicated](https://www.philschmid.de/llm-evaluation)
* [Evaluating Open LLMs with MixEval: The Closest Benchmark to LMSYS Chatbot Arena](https://www.philschmid.de/evaluate-llm-mixeval)

We are going to use [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), an open-source framework to evaluate language models on a wide range of tasks and benchmarks. The framework supports evaluating models behind OpenAI-compatible API endpoints, which can run locally or remotely. This is super helpful as we can evaluate our model in the same environment we will use for production.


We are going to use [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) for testing and deploying our model. TGI is a purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and continuous batching. If you are using or want to use vLLM, you can check the Appendix on how to start the inference server.

_Note: Make sure that you have enough GPU memory to run the container. Restart kernel to remove all allocated GPU memory from the notebook._ 

We will start the container on 1 GPU in detached mode, meaning we can continue to use the notebook while the container is running. If you have more GPUs, you can change the `--gpus` and `--num-shard` flags to the number of GPUs.

In [None]:
%%bash

num_gpus=1
model_id=philschmid/dpo-llama-3-1-8b-math-ep3-merged # replace with your model id

docker run --name tgi --gpus ${num_gpus} -d -ti -p 8080:80 --shm-size=2GB \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id ${model_id} \
  --num-shard ${num_gpus}


Our container will now start in the background and download the model from Hugging Face Hub. We can check the logs to see the progress with `docker logs -f tgi`.

Once our container is running, we can send requests using the `openai` or `huggingface_hub` SDK. Here we'll use the `openai` SDK to send a request to our inference server. If you don't have the `openai` SDK installed, you can install it using `pip install openai`.

In [None]:
from openai import OpenAI

# create client 
client = OpenAI(base_url="http://localhost:8080/v1",api_key="-")

system_message = """Solve the given high school math problem by providing a clear explanation of each step leading to the final solution.

Provide a detailed breakdown of your calculations, beginning with an explanation of the problem and describing how you derive each formula, value, or conclusion. Use logical steps that build upon one another, to arrive at the final answer in a systematic manner.

# Steps

1. **Understand the Problem**: Restate the given math problem and clearly identify the main question and any important given values.
2. **Set Up**: Identify the key formulas or concepts that could help solve the problem (e.g., algebraic manipulation, geometry formulas, trigonometric identities).
3. **Solve Step-by-Step**: Iteratively progress through each step of the math problem, justifying why each consecutive operation brings you closer to the solution.
4. **Double Check**: If applicable, double check the work for accuracy and sense, and mention potential alternative approaches if any.
5. **Final Answer**: Provide the numerical or algebraic solution clearly, accompanied by appropriate units if relevant.

# Notes

- Always clearly define any variable or term used.
- Wherever applicable, include unit conversions or context to explain why each formula or step has been chosen.
- Assume the level of mathematics is suitable for high school, and avoid overly advanced math techniques unless they are common at that level.
- Return the final answer on an extra line, starting with "The Answer is: [ANSWER]"

# Examples
"""

messages = [
    {"role": "system", "content": system_message},
    # {"role": "user", "content": "If you converted $140 to 158760 Korean Won, how much is $1 in Korean Won?"},
    {"role": "user", "content": "Q: Henry and 3 of his friends order 7 pizzas for lunch. Each pizza is cut into 8 slices. If Henry and his friends want to share the pizzas equally, how many slices can each of them have?\nA:"},
    # {"role": "user", "content": "The rectangular-shaped cell phone is 9 centimeters (cm) wide and 46 centimeters (cm) in circumference. Find the vertical length of the cell phone?"},
]
expected_answer = "14"


# Send the messages defined above to the model

response = client.chat.completions.create(
    model="philschmid/dpo-llama-3-1-8b-math-ep3-merged",
    messages=messages,
    stream=False, # no streaming
    max_tokens=1024,
    temperature=1.0)
response = response.choices[0].message.content

# Print results
print(f"Query:\n{messages[1]['content']}")
print(f"Original Answer:\n{expected_answer}")
print(f"Generated Answer:\n{response}")
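
Since the system message asks the model to finish with a line starting with "The Answer is:", the final answer can be pulled out and compared to the expected one automatically. The helper `extract_final_answer` and the sample response text below are hypothetical and only sketch the idea; real model outputs may need more forgiving parsing.

```python
import re

def extract_final_answer(text):
    """Return the text after the last 'The Answer is:' marker, or None if absent."""
    matches = re.findall(r"The Answer is:\s*(.+)", text, flags=re.IGNORECASE)
    if not matches:
        return None
    # Drop surrounding whitespace, a trailing period and optional square brackets
    return matches[-1].strip().rstrip(".").strip("[] ")

# Hypothetical model response, for illustration only
sample_response = """There are 7 * 8 = 56 slices shared by 4 people.
56 / 4 = 14
The Answer is: 14"""

print(extract_final_answer(sample_response) == "14")  # True
```

In the notebook, the same helper could be applied to `response` and compared against `expected_answer`.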


Awesome that looks great! Now we can evaluate our model with the [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

_Note: Make sure to change the model id to your fine-tuned model._

In [None]:
!lm_eval --model local-chat-completions \
  --tasks gsm8k_cot \
  --model_args model=philschmid/dpo-llama-3-1-8b-math-ep3-merged,base_url=http://localhost:8080/v1/chat/completions,num_concurrent=8,max_retries=10,tokenized_requests=False,timeout=180,max_length=4096 \
  --apply_chat_template \
  --fewshot_as_multiturn

Wow, 59% accuracy, that's a 5% improvement over our SFT model, using only ~2k preference pairs for 3 epochs. That shows that our script and config are working correctly.

_Note: You might be able to achieve better results with more data, more epochs or by tuning the hyperparameters (beta, learning rate, batch size, etc.). I ran some ablations on multi-GPU training and full training with DeepSpeed (see Appendix for full command) and the best result was 62% accuracy._

In [None]:
!docker stop tgi
!docker rm tgi

# Appendix

_Note: Make sure to install deepspeed and accelerate before running the commands. `pip install deepspeed==0.15.4`_


## Distributed Training

```bash
ACCELERATE_LOG_LEVEL=info accelerate launch --num_processes 4 --config_file configs/accelerate_configs/deepspeed_zero3.yaml scripts/dpo/run_dpo.py --config receipes/dpo-llama-3-1-8b.yaml
```