# nanoAhaMoment: Single File "RL for LLM" Library
Single GPU · No TRL or Verl · Efficient · 3B Base Model · Full Parameter Tuning Implementation of R1-zero training.

Inspired by [TinyZero](https://github.com/Jiayi-Pan/TinyZero) and [Mini-R1](https://www.philschmid.de/mini-deepseek-r1), but designed to be **simpler**, **cleaner**, and **faster**, with every line of code visible and understandable.

R1-Zero is arguably the more interesting contribution from the DeepSeek R1 paper. The core idea: take a freshly pre-trained LLM (straight out of the unsupervised pretraining oven) and continue its training using reinforcement learning *without* any human feedback or supervision. The result? A model that starts showing emergent behaviors such as self-reflection, verification, and backtracking: behaviors that researchers have tried to bake into LLMs using handcrafted tricks and inductive biases, at least since O1.

In this notebook, we'll build an R1-Zero-style training loop **from scratch**. The goal is to create a crystal-clear, hackable foundation for RL-style LLM training; one that gives you a bird's-eye view of every moving part and how they fit together. Perfect for playing around, extending, or hacking.

---

### Why another R1-Zero implementation?

There are already great implementations like [TinyZero](https://github.com/Jiayi-Pan/TinyZero) and [Mini-R1](https://www.philschmid.de/mini-deepseek-r1). But they rely on full-fledged RL libraries (like `trl` or `verl`) to handle training.

These libraries exist for good reason; efficient RL training for LLMs sits at the crossroads of scalable training and fast inference. Making that work takes a lot of engineering. But that also means the internals are often abstracted away, hard to read, and even harder to tweak.

This notebook is different: **no abstractions, no hiding**. You'll see everything, top to bottom. A lightweight, readable codebase that still follows best practices and runs efficiently on a single GPU.

### What is this notebook, exactly?

We'll train a base LLM using RL to solve a reasoning-heavy algorithmic task. The setup:

- **Model**: Qwen2.5 3B-Base  
- **Dataset**: Countdown-Tasks-3to4  
- **Algorithm**: GRPO (a variant of policy gradient)

Yes, the task is a bit toy-ish, but it captures the essence of R1-Zero: emergent behaviors like self-reflection, verification, backtracking, even language-switching. This setup is ideal for rapid prototyping and experimentation.

### Who is this notebook for?

- Anyone interested in RL training for LLMs  
- Researchers, especially the ones in academia, exploring reasoning in language models

### What should I know before jumping in?

- A working knowledge of the HuggingFace Transformers library  
- Some experience fine-tuning LLMs  
- Familiarity with policy gradient methods (helpful but not required)

## R1-Zero Recipe

The goal is to train a base LLM to **reason** in a way that allows it to **reevaluate** its own outputs and **improve** them, all without human supervision. The DeepSeek R1 paper proposes a surprisingly simple recipe to achieve this, and that's exactly what we'll implement in this notebook.

### The Recipe

Here's the high-level procedure:

1. **Start** with a base LLM and a dataset containing problem prompts paired only with their *final answers* (no intermediate reasoning steps).  
2. For each iteration $i = 0$ to `NUM_ITERATIONS`:
   - Sample a batch of prompts $\{x_i\}_{i=1}^N$ from the dataset.
   - For each prompt, sample $G$ responses from the model:  
     $ y_1, y_2, \cdots, y_G \sim \pi_\theta(y|x) $

     These $G$ responses form what is called a *group* in GRPO.
   - Compute a reward $R_i$ for each response and normalize the rewards within each group to calculate the GRPO advantage.
   - Create a list of $N \times G$ episodes, i.e., pairs of $(x_i, y_i)$ along with their corresponding advantages.
   - Estimate the policy gradient $\vec{g}_{pg}$ from these episodes.
   - Update the model parameters:  
     $\theta \leftarrow \theta + \eta \vec{g}_{pg}$
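
The recipe above can be sketched end to end on a toy problem. The block below is not the notebook's training code: it swaps the LLM for a tiny softmax policy over four discrete "responses" (a hypothetical stand-in), gives reward 1 to a single correct response, and writes the policy gradient by hand in NumPy. All names (`toy_reward`, `N_PROMPTS`, `CORRECT_ACTION`, and so on) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS = 4        # toy stand-in for the space of possible responses
CORRECT_ACTION = 2     # hypothetical "right answer" that earns reward 1
NUM_ITERATIONS = 200
N_PROMPTS = 8          # prompts per iteration (all identical in this toy)
G = 4                  # responses sampled per prompt (the group)
LR = 0.5               # eta in the update rule

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_reward(action):
    return 1.0 if action == CORRECT_ACTION else 0.0

theta = np.zeros(NUM_ACTIONS)  # policy parameters (logits)

for _ in range(NUM_ITERATIONS):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(N_PROMPTS):
        # Sample a group of G responses from the current policy.
        actions = rng.choice(NUM_ACTIONS, size=G, p=probs)
        rewards = np.array([toy_reward(a) for a in actions])
        # Group-normalized advantage (zero when all rewards are equal).
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
        for a, A in zip(actions, adv):
            # Gradient of log softmax(theta)[a] is onehot(a) - probs.
            grad_logp = -probs.copy()
            grad_logp[a] += 1.0
            grad += A * grad_logp
    grad /= N_PROMPTS * G
    theta = theta + LR * grad  # gradient ascent step

final_probs = softmax(theta)
print(final_probs.round(3))
```

Groups in which every response gets the same reward contribute nothing to the gradient, which is why learning slows down once the toy policy is almost always correct.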

### Code Structure Overview

The code you will see is structured directly following this recipe. It boils down to three main components:

1. **Episode Generation**  
   - Generate $ (x, y) $ pairs along with their advantages for each RL iteration.
   
2. **Reward Calculation**  
   - Compute rewards for each generated response.
   
3. **Policy Gradient Estimation**  
   - Use the generated episodes to estimate the policy gradient and perform the model update.

In the end, these three components come together in a simple loop that trains the model, step by step, to develop reasoning capabilities through reinforcement learning.


## Checkpoint Playground

In `notebooks/checkpoint_playground.ipynb`, you can load the model we already trained with this notebook and interactively test its reasoning capabilities. That notebook lets you enter custom prompts and observe the model's responses.

## Prerequisites

### Installing Dependencies

Before we begin, let's install the necessary Python packages. We'll be using:

- PyTorch  
- Hugging Face Transformers  
- Hugging Face Datasets  
- DeepSpeed  
- vLLM

For a detailed, step-by-step installation guide, refer to the [README](https://github.com/McGill-NLP/tiny-aha-moment.git) of this project.

### Install FlashAttention (flash-attn)

> If you see errors like `ModuleNotFoundError: No module named 'torch'` while installing `flash-attn`, it usually means pip build isolation is hiding your already-installed PyTorch. The fix is to install with `--no-build-isolation` so the build can import `torch`.

In [1]:
import importlib.util
import subprocess
import sys

# Optional quick sanity check: make sure torch is present in this kernel env.
if importlib.util.find_spec("torch") is None:
    raise RuntimeError(
        "PyTorch is not installed in this environment. Install torch first, then rerun this cell."
    )

# Install flash-attn in a way that allows the build to import torch.
# This avoids failures during metadata/build steps when pip build isolation is enabled.
subprocess.check_call([sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"])

# Verify the install
import torch
import flash_attn
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("flash_attn:", flash_attn.__version__)




torch: 2.9.0+cu128 cuda: 12.8
flash_attn: 2.8.3


In [2]:
import os
from pathlib import Path

# Set the environment variables for HuggingFace
# This is done to ensure that the cache directory for HuggingFace is set to a specific location,
# preventing the storage from being overwhelmed with model files and other data.
# SCRATCH = Path.home() / "scratch"
SCRATCH = Path.cwd()/ "scratch"
os.environ["HF_HOME"] = str(SCRATCH / "hf_home")
# os.environ['HF_HOME'] = '/workspace/.cache/huggingface'

### Import the required libraries

In [3]:
import gc
import os
import re
import time
from typing import Any, Dict, List, Tuple, Union

import deepspeed
import numpy as np
import torch
from datasets import load_dataset
from deepspeed import DeepSpeedEngine
from tqdm import trange

# vLLM v1 can run its engine core in a separate process (multiprocessing mode).
# That breaks fast in-memory weight loading from a training model -> vLLM unless you
# allow insecure serialization. We disable v1 multiprocessing for this notebook run.
os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")

def _safe_import_transformers():
    try:
        from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
        return AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
    except ImportError as e:
        # Some environments ship a broken flash-attn build; remove it and retry.
        message = str(e)
        if ("flash_attn" not in message) and ("flash_attn_2_cuda" not in message):
            raise
        import subprocess, sys
        subprocess.run(
            [sys.executable, "-m", "pip", "uninstall", "-y", "flash-attn", "flash_attn"],
            check=False,
        )
        from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
        return AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel

AutoConfig, AutoModelForCausalLM, AutoTokenizer, PreTrainedModel = _safe_import_transformers()
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

import wandb
from utils import (
    compute_token_log_probs,
    dump_episodes,
    evaluate_on_test_set,
    find_free_port,
    find_last_checkpoint,
    prepare_model_inputs,
    load_model_into_vllm
)

# Needed to stop DeepSpeed from complaining
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(find_free_port())
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


**We do have a few helper functions in `utils.py` that are used to keep the code clean.**

## Hyperparameters

Let's define the hyperparameters for training. These are mostly taken from the [Mini-R1](https://www.philschmid.de/mini-deepseek-r1) implementation.

In [4]:
# Model configuration
MODEL_NAME = "Qwen/Qwen2.5-3B"
MODEL_CHAT_NAME = MODEL_NAME + "-Instruct"

# Dataset configuration
DATASET_NAME = "Jiayi-Pan/Countdown-Tasks-3to4"

# Total number of training iterations
NUM_ITERATIONS = 1000
# Number of episodes to collect per iteration for training
EPISODES_PER_ITERATION = 64
# Number of responses to generate for each input prompt (i.e. group size in GRPO)
GENERATIONS_PER_SAMPLE = 4
# Controls how much the policy can deviate from the reference model
KL_COEFFICIENT = 0.001

# Training hyperparameters
# Batch size for each GPU device during training
PER_DEVICE_BATCH_SIZE = 4
# Learning rate for model updates
LEARNING_RATE = 1e-6

# Sampling parameters
# Maximum number of tokens to generate in each response
MAX_RESPONSE_TOKENS = 1024
# Controls randomness in generation (higher = more random)
TEMPERATURE = 1.0
# Nucleus sampling parameter (1.0 = disabled)
TOP_P = 1.0
# Top-k sampling parameter (-1 = disabled)
TOP_K = -1  # no top k

# DeepSpeed configuration
# DeepSpeed config for the policy model
deepspeed_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": False},
    "train_batch_size": EPISODES_PER_ITERATION,
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE,
    "gradient_clipping": 1.0,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": LEARNING_RATE,
            "betas": (0.9, 0.999),
            "eps": 1e-8,
            "weight_decay": 0.0,
            "torch_adam": True,
        },
    },
}
# DeepSpeed config for the reference model
ref_deepspeed_config = {
    "bf16": {"enabled": True},
    # Note that we don't train the reference model
    # These are just for compatibility with DeepSpeed.
    "train_batch_size": EPISODES_PER_ITERATION,
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE,
}

RUN_NAME = "r1-zero"
EXP_DIR = SCRATCH / "deepseek_r1z_hackathon" / RUN_NAME
EXP_DIR.mkdir(parents=True, exist_ok=True)
print(f"Logs and Checkpoints will be saved to: {EXP_DIR}")

Logs and Checkpoints will be saved to: /workspace/projs/nano-aha-moment/scratch/deepseek_r1z_hackathon/r1-zero
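
As a quick sanity check on how these numbers relate, the sketch below recomputes two derived quantities. It assumes each iteration draws `EPISODES_PER_ITERATION // GENERATIONS_PER_SAMPLE` prompts (the training loop itself is not shown at this point) and that training runs on a single GPU; `WORLD_SIZE` and the two derived names are introduced here for illustration only.

```python
# Values copied from the hyperparameter cell above.
EPISODES_PER_ITERATION = 64
GENERATIONS_PER_SAMPLE = 4
PER_DEVICE_BATCH_SIZE = 4
WORLD_SIZE = 1  # assumption: single GPU

# Assumed: number of distinct prompts sampled per RL iteration.
prompts_per_iteration = EPISODES_PER_ITERATION // GENERATIONS_PER_SAMPLE
# Same expression as "gradient_accumulation_steps" in deepspeed_config.
grad_accum_steps = EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE

# DeepSpeed expects:
# train_batch_size == micro_batch_size * gradient_accumulation_steps * world_size
assert EPISODES_PER_ITERATION == PER_DEVICE_BATCH_SIZE * grad_accum_steps * WORLD_SIZE

print(prompts_per_iteration, grad_accum_steps)  # 16 16
```

If you change `PER_DEVICE_BATCH_SIZE`, keep it a divisor of `EPISODES_PER_ITERATION` so the identity in the assertion still holds.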


## Generating the training prompts

For training, we'll use the [Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset, which provides problem statements paired with their final answers (but no reasoning steps).

### The Countdown Task

The Countdown game is a numerical puzzle where the player must reach a target number using a set of randomly chosen numbers and basic arithmetic operations: addition, subtraction, multiplication, and division. Each number must be used exactly once.

Example:

```yaml
Target: 622
Available Numbers: [25, 3, 6, 100]

# Not provided in the dataset
Solution: (100 × 6) + (25 − 3) = 622
```

This task is ideal for training LLMs to practice reasoning, searching, and self-verification.
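
To get a feel for the search involved, here is a minimal brute-force solver. It is not part of the notebook or the dataset tooling; `solve_countdown` is a hypothetical helper that repeatedly combines two numbers with one of the four operations, using exact fractions to avoid rounding issues with division.

```python
from fractions import Fraction
from typing import List, Optional, Tuple

def solve_countdown(nums: List[int], target: int) -> Optional[str]:
    """Return an expression using every number exactly once, or None."""
    # Each item pairs an exact value with the expression that produced it.
    start: List[Tuple[Fraction, str]] = [(Fraction(n), str(n)) for n in nums]

    def search(items):
        if len(items) == 1:
            value, expr = items[0]
            return expr if value == target else None
        # Pick an ordered pair, combine it, and recurse on the shorter list.
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                candidates = [
                    (a + b, f"({ea} + {eb})"),
                    (a - b, f"({ea} - {eb})"),
                    (a * b, f"({ea} * {eb})"),
                ]
                if b != 0:
                    candidates.append((a / b, f"({ea} / {eb})"))
                for value, expr in candidates:
                    found = search(rest + [(value, expr)])
                    if found is not None:
                        return found
        return None

    return search(start)

expr = solve_countdown([25, 3, 6, 100], 622)
print(expr)
```

The model has no such exhaustive search available; it has to learn to explore and verify candidates inside its own generated text.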


Since we are using the base version of the model, which has only been pretrained on raw internet data, it has no prior understanding of system prompts or chat formatting. However, we will still use the chat format to make the resulting model compatible with downstream tools and frameworks that expect it.

In [5]:
SYSTEM_MESSAGE = (
    "You are a helpful assistant. You first think about the reasoning process in the mind "
    "and then provide the user with the answer."
)
PROMPT_TEMPLATE = (
    "Using the numbers {numbers}, create an equation that equals {target}. "
    "You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. "
    "Show your work in <think> </think> tags. And return the final equation and answer in "
    "<answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>."
)

Now that we have the system message and prompt template, we can generate the training prompts.

In [6]:
# Load and process dataset
def preprocess_example(example: Dict[str, Any]) -> Dict[str, Any]:
    numbers: List[int] = example["nums"]
    target: int = example["target"]

    prefix = [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": PROMPT_TEMPLATE.format(numbers=numbers, target=target)},
        {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
    ]
    input_ids = tokenizer.apply_chat_template(
        prefix, tokenize=True, continue_final_message=True
    )
    prompt = tokenizer.decode(
        input_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False
    )
    return {"prompt": prompt, "input_ids": input_ids}

def load_tokenizer(model_id: str):
    try:
        return AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    except ModuleNotFoundError as e:
        # Some checkpoints advertise a 'colqwen2' tokenizer/model_type in their HF configs.
        # If your installed Transformers build doesn't ship that model package, fall back to Qwen2.
        if "transformers.models.colqwen2" not in str(e):
            raise
        from transformers.models.qwen2.tokenization_qwen2_fast import Qwen2TokenizerFast
        return Qwen2TokenizerFast.from_pretrained(model_id)

# Note that the base model and "instruct" model have different eos token.
# Here we make sure to use the correct one.
tokenizer = load_tokenizer(MODEL_CHAT_NAME)
EOS_TOKEN_ID = AutoConfig.from_pretrained(MODEL_NAME, trust_remote_code=True).eos_token_id
EOS_TOKEN = tokenizer.convert_ids_to_tokens(EOS_TOKEN_ID)

dataset = load_dataset(DATASET_NAME, split="train")
dataset = dataset.map(preprocess_example, num_proc=6)

# Split dataset
train_test_split = dataset.train_test_split(test_size=500, seed=42)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

len(train_dataset), len(test_dataset)

(489864, 500)

In [7]:
EOS_TOKEN

'<|endoftext|>'

In [8]:
tokenizer.eos_token_id

151645

Let's look at some examples from the dataset.

In [9]:
print("Target: ", train_dataset[0]["target"])
print("Available Numbers: ", train_dataset[0]["nums"])

Target:  43
Available Numbers:  [4, 27, 12]


Using the system message and prompt template, we generate the following prompt for this example:

In [10]:
print(train_dataset[0]["prompt"])

<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [4, 27, 12], create an equation that equals 43. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>


As you may have noticed, we also append the `<|im_start|>assistant` header, the phrase *"Let me solve this step by step."*, and an opening `<think>` tag to each prompt. This helps guide the model into **answering mode**. Without this, the base model might simply continue the prompt rather than attempting to solve the task, since it has no inherent understanding of instruction-following.
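
To make the prefilled format concrete, the sketch below rebuilds the prompt string by hand. `render_chatml_prefix` is a hypothetical helper written for illustration and inferred from the printed prompt above; the notebook itself relies on `tokenizer.apply_chat_template(..., continue_final_message=True)`, and the real template may differ in details.

```python
from typing import Dict, List

def render_chatml_prefix(messages: List[Dict[str, str]]) -> str:
    """Render messages in the ChatML-style layout seen in the printed prompt.

    The final message is left open (no <|im_end|>) so the model continues it.
    """
    parts = []
    for idx, msg in enumerate(messages):
        is_last = idx == len(messages) - 1
        text = f"<|im_start|>{msg['role']}\n{msg['content']}"
        if not is_last:
            text += "<|im_end|>\n"
        parts.append(text)
    return "".join(parts)

prefix = render_chatml_prefix([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Using the numbers [4, 27, 12], create an equation that equals 43."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
])
print(prefix)
```

Because the assistant turn is left unterminated, generation starts inside the `<think>` block rather than at the beginning of a new turn.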

Additionally, we tokenize each prompt and store the result as `input_ids`, which will be used later during training.

In [11]:
print(train_dataset[0]["input_ids"])

[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 1446, 1156, 1744, 911, 279, 32711, 1882, 304, 279, 3971, 323, 1221, 3410, 279, 1196, 448, 279, 4226, 13, 151645, 198, 151644, 872, 198, 16429, 279, 5109, 508, 19, 11, 220, 17, 22, 11, 220, 16, 17, 1125, 1855, 458, 23606, 429, 16819, 220, 19, 18, 13, 1446, 646, 990, 6770, 34784, 7525, 17973, 11, 85922, 11777, 608, 8, 323, 1817, 1372, 646, 1172, 387, 1483, 3055, 13, 6928, 697, 975, 304, 366, 26865, 29, 690, 26865, 29, 9492, 13, 1597, 470, 279, 1590, 23606, 323, 4226, 304, 366, 9217, 29, 690, 9217, 29, 9492, 11, 369, 3110, 366, 9217, 2235, 16, 488, 220, 17, 8, 608, 320, 18, 353, 220, 20, 12533, 9217, 14276, 151645, 198, 151644, 77091, 198, 10061, 752, 11625, 419, 3019, 553, 3019, 624, 13708, 766, 29]


In [12]:
# tokenizer.convert_ids_to_tokens(train_dataset[0]["input_ids"])

## Reward Function


The DeepSeek R1 paper introduced **rule-based rewards** to evaluate whether the model-generated solutions were correct. We'll adopt a similar approach by defining two custom reward functions:

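- **Format Reward**: Checks if the output follows the required format (note the newline between the two blocks). A response with the right tags but disallowed characters inside `<answer>` earns 0.5 instead of 1.0:  
  `<think> [thinking] </think>\n<answer> [answer] </answer>`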

- **Equation Reward**: Extracts the equation from within the `<answer>` tag, verifies that it evaluates to the target result, and ensures that all available numbers are used exactly once.

The purpose of enforcing the format is mainly to make answer extraction easier. It isn't strictly necessary for the correctness of the answer itself but simplifies parsing during training.

The final reward assigned to an episode/trajectory (prompt+response) is simply the sum of these two components. Importantly, the reward is only computed at the **last token** of the output. From an RL perspective, this means that all intermediate actions receive zero reward. We also do not apply any discounting here (i.e., $\gamma = 1$).
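
A small worked example may help here. The sketch below is illustrative only (`returns_to_go` is a hypothetical helper, not a function from the notebook or `utils.py`): it places the whole reward on the last token and computes the return seen from each token position, with and without discounting.

```python
import numpy as np

def returns_to_go(token_rewards: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Discounted sum of future rewards at every token position."""
    returns = np.zeros_like(token_rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(token_rewards))):
        running = token_rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 5-token response that earned format_reward=1.0 and equation_reward=1.0.
token_rewards = np.zeros(5)
token_rewards[-1] = 1.0 + 1.0  # the whole reward sits on the last token

undiscounted = returns_to_go(token_rewards, gamma=1.0)
discounted = returns_to_go(token_rewards, gamma=0.9)
print(undiscounted)  # every token sees the full reward
print(discounted)    # discounting would shrink the return of early tokens
```

With $\gamma = 1$, every token in the response shares the same return, which is consistent with assigning one response-level signal to all tokens.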

In [13]:
def format_reward_func(completion: str) -> float:
    """
    Format: <think>...</think>\n<answer>...</answer>

    Also checks that the content within <answer>...</answer> conforms to a
    specified pattern (only digits, + - * / ( ) . and whitespace).

    Args:
        completion (str): Generated output

    Returns:
        float: Reward score
    """
    # Define the allowed pattern (only numbers, +, -, *, /, (, ), ., and whitespace)
    allowed_pattern = r"^[\d+\-*/().\s]+$"

    try:
        # Add a synthetic <think> because it is already part of the prompt (prefilled
        # for the assistant), which makes the regex easier to match.
        completion = "<think>" + completion

        # Strip EOS token if present
        if completion.endswith(EOS_TOKEN):
            completion = completion[:-len(EOS_TOKEN)]

        # Check if the format is correct
        # Pattern means:
        # 1) <think>...contents not including other <think> tags...</think>
        # 2) \n
        # 3) <answer>...anything...</answer>
        regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"
        match = re.search(regex, completion, re.DOTALL)

        if match is None or len(match.groups()) != 2:
            # Format is incorrect
            return 0.0
        else:
            # Extract the content inside <answer>...</answer>
            answer_content = match.group(2).strip()

            # Check if answer content matches the allowed pattern
            if not re.match(allowed_pattern, answer_content):
                # If it doesn't match, reward is 0.5
                return 0.5
            else:
                # If both format and pattern are correct, reward is 1
                return 1.0
    except Exception:
        # Any error leads to 0 reward
        return 0.0


def equation_reward_func(completion: str, nums: List[int], target: int) -> float:
    """
    Evaluates completion based on mathematical correctness of the answer

    Args:
        completion (str): Generated output
        nums (List[int]): Available numbers to use in the equation
        target (int): Expected answer

    Returns:
        float: Reward score
    """
    try:
        # Check if the format is correct
        match = re.search(r"<answer>(.*?)<\/answer>", completion)
        if match is None:
            return 0.0
        # Extract the "answer" part from the completion
        equation = match.group(1).strip()
        # Extract all numbers from the equation
        used_numbers = [int(n) for n in re.findall(r"\d+", equation)]

        # Check if all numbers are used exactly once
        if sorted(used_numbers) != sorted(nums):
            return 0.0
        # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
        allowed_pattern = r"^[\d+\-*/().\s]+$"
        if not re.match(allowed_pattern, equation):
            return 0.0

        # Evaluate the equation with restricted globals and locals
        result = eval(equation, {"__builtins__": None}, {})
        # Check if the equation is correct and matches the ground truth
        if abs(float(result) - float(target)) < 1e-5:
            return 1.0
        else:
            return 0.0
    except Exception:
        # If evaluation fails, reward is 0
        return 0.0
    

def compute_reward(completion: str, sample: Dict[str, Any]) -> Tuple[float, Dict[str, float]]:
    nums = sample["nums"]
    target = sample["target"]

    format_reward = format_reward_func(completion)
    equation_reward = equation_reward_func(
        completion=completion, nums=nums, target=target
    )

    reward = format_reward + equation_reward

    metrics = {
        "format_reward": format_reward,
        "equation_reward": equation_reward,
    }   

    return reward, metrics

In [14]:
# <think> is prefilled in the prompt, so repeating it in the completion would be incorrect.
format_reward_func("<think>I think the answer is </think>\n<answer>1+2</answer>")

0.0

In [15]:
format_reward_func("I think the answer is </think>\n<answer>1+2</answer>")

1.0

In [16]:
format_reward_func("<think>I think the<think>and even more</think> answer is </think>\n<answer>1+2</answer>")

0.0

In [17]:
equation_reward_func("I think the answer is </think>\n<answer>1+2+2</answer>", [1,2], 3)

0.0

In [18]:
compute_reward(
    "I think the answer is </think>\n<answer>1 + 2</answer>",
    {"nums": [1, 2], "target": 3}
)

(2.0, {'format_reward': 1.0, 'equation_reward': 1.0})
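
The two checks that most often trip up an otherwise well-formatted answer are the allowed-character pattern and the "each number exactly once" rule. The sketch below isolates them; `check_answer` is a hypothetical helper written for this illustration, reusing the same regular expressions as the reward functions above.

```python
import re

ALLOWED_PATTERN = r"^[\d+\-*/().\s]+$"  # same pattern as in the reward functions

def check_answer(equation: str, nums):
    """Return (chars_ok, numbers_ok) for a candidate answer string."""
    chars_ok = re.match(ALLOWED_PATTERN, equation) is not None
    used = [int(n) for n in re.findall(r"\d+", equation)]
    numbers_ok = sorted(used) == sorted(nums)
    return chars_ok, numbers_ok

print(check_answer("1 + 2", [1, 2]))      # (True, True)
print(check_answer("1 + 2 = 3", [1, 2]))  # (False, False): "=" is not allowed, 3 is an extra number
print(check_answer("2 + 2", [1, 2]))      # (True, False): 1 is unused and 2 appears twice
```

This is why an answer written as `1 + 2 = 3` inside correct tags would earn only 0.5 from `format_reward_func` and 0.0 from `equation_reward_func`: the answer should contain the expression alone, without the `= result` part.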

## Episode Generation

The goal of episode generation is to create a collection of query-response pairs that will be used for policy training. From the reinforcement learning (RL) perspective, the **query** serves as the initial state, and the generated tokens in the **response** represent the actions taken by the policy.

The `create_training_episodes` function takes a list of prompts (initial states) and their corresponding completions, which we generate using the model. In GRPO, we always generate multiple responses per prompt (specifically, `GENERATIONS_PER_SAMPLE` > 1). This means that, after episode generation, we end up with `batch_size × GENERATIONS_PER_SAMPLE` episodes in every RL iteration.

### Advantage Computation

In addition to generating episodes, `create_training_episodes` is also responsible for computing the **advantage** for every response token. 

In RL terms, the advantage of a token represents how much better or worse that token's action is compared to the average generated token at that specific state (prompt + prefix). Ideally, we would compute an advantage for every token individually to capture how each step contributes to the overall reward.

However, in GRPO, there's no per-token advantage computation. Instead, we compute a single advantage value per response. This value reflects how good the entire response is relative to other responses generated for the same prompt. We then assign this single advantage value uniformly to all tokens within that response.

GRPO uses a simple formula for this:

1. For each prompt $x$ with a group of generated responses $y_1, y_2, \ldots, y_G \sim \pi(\cdot|x)$, compute their rewards $R_1, R_2, \ldots, R_G$.
2. Compute the group's mean and standard deviation:  
   $ \mu = \text{mean}(R_1, R_2, \ldots, R_G) $  
   $ \sigma = \text{std}(R_1, R_2, \ldots, R_G) $
3. Compute a **relative score** for each response:  
   $ R^*_i = \frac{R_i - \mu}{\sigma} $
4. Assign this relative score $R^*_i$ as the advantage to all tokens of the $i$-th response:  
   $ A_t^{(i)} = R^*_i $

This **per-group normalization** encourages responses that are better than average and penalizes those that are worse.
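
The four steps above can be sketched for several prompts at once. This is an illustrative vectorized version, not the notebook's implementation: `group_advantages` is a hypothetical helper, and the `eps` term (an assumption borrowed from the `1e-4` used in `create_training_episodes`) guards against division by zero when all rewards in a group are equal.

```python
import numpy as np

def group_advantages(rewards, group_size, eps=1e-4):
    """Normalize rewards within each consecutive group of `group_size` responses."""
    r = np.asarray(rewards, dtype=float).reshape(-1, group_size)  # [num_prompts, G]
    mu = r.mean(axis=1, keepdims=True)
    sigma = r.std(axis=1, keepdims=True)
    return ((r - mu) / (sigma + eps)).reshape(-1)

# Two prompts, G = 4 responses each.
rewards = [2.0, 0.0, 0.0, 0.0,   1.0, 1.0, 1.0, 1.0]
response_lengths = [3, 2, 4, 2,   1, 2, 3, 2]

adv = group_advantages(rewards, group_size=4)
# Broadcast each response-level advantage to all of its tokens.
token_advantages = [[a] * n for a, n in zip(adv, response_lengths)]
print(adv.round(3))
```

The first group has one clearly better response, so it gets a positive advantage and the rest get negative ones; the second group is uniform, so all of its advantages are zero.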

### Example: Advantage in Action

Consider a binary reward scenario where each response is either correct (1) or incorrect (0):

```python
>>> rewards = np.array([1, 1, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std())
array([ 1.22474487,  1.22474487, -0.81649658, -0.81649658, -0.81649658])
```

Here, the correct responses receive higher advantage scores, promoting them in future updates.


If only one response is correct:

```python
>>> rewards = np.array([1, 0, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std())
array([ 2. , -0.5, -0.5, -0.5, -0.5])
```

This corresponds to a question that is too hard for the model to answer correctly most of the time.
The single correct response receives a large positive advantage, while every incorrect response receives a negative one.

If all responses are incorrect:

```python
>>> rewards = np.array([0, 0, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std() + 1e-6)
array([0., 0., 0., 0., 0.])
```

Since no response is better than the average, the model receives no learning signal.

If all responses are correct:

```python
>>> rewards = np.array([1, 1, 1, 1, 1])
>>> (rewards - rewards.mean()) / (rewards.std() + 1e-6)
array([0., 0., 0., 0., 0.])
```

Again, no learning signal is provided because there is nothing to improve upon.

In a more mixed case:

```python
>>> rewards = np.array([1, 1, 1, 1, 0])
>>> (rewards - rewards.mean()) / (rewards.std())
array([ 0.5,  0.5,  0.5,  0.5, -2. ])
```

This represents an easier question for the model. Most responses are correct, but occasional incorrect ones are heavily penalized.

In [19]:
rewards = np.array([1, 0, 0, 0, 0])
(rewards - rewards.mean()) / (rewards.std())
# rewards.mean(), rewards.std()

array([ 2. , -0.5, -0.5, -0.5, -0.5])

In [20]:
def create_training_episodes(
    samples: List[Dict[str, Any]],
    all_generations: List[List[int]],
    all_finish_reasons: List[str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """
    Process model generations and calculate rewards for training episodes.

    This function processes generated responses and calculates rewards for training episodes by:
    1. Grouping generations by sample (GENERATIONS_PER_SAMPLE responses per input)
    2. Computing rewards and advantages for each response
    3. Processing response tokens

    Args:
        samples: List of input samples, each containing:
            - input_ids: List[int], tokenized input prompt
            - nums: List[int], numbers to use in equation
            - target: int, target value for equation
        all_generations: List of token ID sequences for each generated response
        all_finish_reasons: List of finish reasons for each generation ("stop" or other)

    Returns:
        Tuple containing:
        1. Dictionary with processed data for training:
            - all_query_token_ids: List[List[int]], input token IDs repeated for each generation
            - all_response_token_ids: List[List[int]], response token IDs as generated (EOS included when present)
            - all_advantages: List[List[float]], advantage values repeated for each token
        2. Dictionary with generation statistics:
            - response_lengths: List[int], lengths of generated responses
            - rewards: List[float], raw reward values
            - non_stop_rate: List[bool], True if a generation did not end with a natural stop (e.g., it hit the length limit)
            - reward_metrics/*: Various reward component metrics

    Example:
        >>> samples = [{"input_ids": [1,2,3], "nums": [1,2,3], "target": 6}]
        >>> generations = [[4,5, EOS_TOKEN_ID], [6,7], [8,9, EOS_TOKEN_ID]]  # 3 generations per sample
        >>> finish_reasons = ["stop", "length", "stop"]
        >>> episodes, stats = create_training_episodes(samples, generations, finish_reasons)
        >>> episodes
        {
            'all_query_token_ids': [[1,2,3], [1,2,3], [1,2,3]],
            'all_response_token_ids': [[4,5,EOS_TOKEN_ID], [6,7], [8,9,EOS_TOKEN_ID]],
            'all_advantages': [[0.5,0.5,0.5], [-1.0,-1.0], [0.5,0.5,0.5]]
        }
    """
    assert len(all_generations) == len(all_finish_reasons)
    assert len(all_generations) == len(samples) * GENERATIONS_PER_SAMPLE

    # Process responses and calculate rewards
    groups = [
        list(range(i, i + GENERATIONS_PER_SAMPLE))
        for i in range(0, len(all_generations), GENERATIONS_PER_SAMPLE)
    ]  # example: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

    all_query_token_ids, all_responses_token_ids, all_advantages = [], [], []

    stats = {
        "response_lengths": [],
        "rewards": [],
        "non_stop_rate": [],
    }

    for sample, group_indices in zip(samples, groups):
        finish_reasons = [all_finish_reasons[i] for i in group_indices]
        response_token_ids = [all_generations[i] for i in group_indices]
        responses = tokenizer.batch_decode(response_token_ids, skip_special_tokens=False)

        rewards_and_metrics = [compute_reward(resp, sample) for resp in responses]
        rewards, reward_metrics = zip(*rewards_and_metrics)

        rewards = np.array(rewards) # [group_size]
        response_advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
        
        advantages = [
            [resp_adv] * len(resp) 
            for resp_adv, resp in zip(response_advantages, response_token_ids)
        ]

        all_query_token_ids.extend([sample["input_ids"]] * GENERATIONS_PER_SAMPLE)
        all_responses_token_ids.extend(response_token_ids)
        all_advantages.extend(advantages)

        stats["rewards"].extend(rewards)
        stats["non_stop_rate"].extend([fr != "stop" for fr in finish_reasons])
        stats["response_lengths"].extend([len(ids) for ids in response_token_ids])
        for rm in reward_metrics:
            for k, v in rm.items():
                stats.setdefault(f"reward_metrics/{k}", []).append(v)

    episodes = {
        "all_query_token_ids": all_query_token_ids,
        "all_response_token_ids": all_responses_token_ids,
        "all_advantages": all_advantages,
    }

    return episodes, stats

In [21]:
case_0 = {
    "sample": {"input_ids": [1,2,3], "nums": [1,2,3], "target": 6},
    "generations": [[4,5, 22, 33], [6,7], [8,9, 11], [10,11]],
    "finish_reasons": ["stop", "length", "stop", "stop"]
}

case = case_0
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]],
 'all_response_token_ids': [[4, 5, 22, 33], [6, 7], [8, 9, 11], [10, 11]],
 'all_advantages': [[np.float64(0.0),
   np.float64(0.0),
   np.float64(0.0),
   np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)]]}

In [22]:
case_1 = {
    "sample": {"input_ids": [33, 44], "nums": [11, 7, 8], "target": 26},
    "generations": [[1,2], [3,4], [5,6], [7,8]],
    "finish_reasons": ["stop", "stop", "length", "stop"]
}
case = case_1
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[33, 44], [33, 44], [33, 44], [33, 44]],
 'all_response_token_ids': [[1, 2], [3, 4], [5, 6], [7, 8]],
 'all_advantages': [[np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)]]}

In [23]:
case_2 = {
    "sample": {"input_ids": [9, 8, 7, 6, 5, 4], "nums": [1,2,3,4], "target": 10},
    "generations": [[9,10], [11,12], [13,14], [15,16]],
    "finish_reasons": ["length", "length", "stop", "stop"]
}
case = case_2
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4]],
 'all_response_token_ids': [[9, 10], [11, 12], [13, 14], [15, 16]],
 'all_advantages': [[np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)],
  [np.float64(0.0), np.float64(0.0)]]}

As you can see, the `input_ids` of the single example are repeated across all of the generated episodes.
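All three toy cases above end up with zero advantages, because every response in a group earns the same reward. The sketch below uses hand-picked, hypothetical rewards (not produced by `compute_reward`) to show what the same normalization yields once rewards differ within a group. It is written in plain Python so it runs on its own; `group_advantages` is an illustrative name, not a function from the notebook.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, response_lengths, eps=1e-4):
    """Normalize rewards within one group, then repeat each value per token."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std, matching NumPy's default (ddof=0)
    per_response = [(r - mean) / (std + eps) for r in rewards]
    return [[adv] * n for adv, n in zip(per_response, response_lengths)]


# Hypothetical rewards for a group of four responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [4, 2, 3, 2]
advantages = group_advantages(rewards, lengths)

for adv in advantages:
    print([round(a, 3) for a in adv])
```

Responses that beat the group mean get a positive advantage on every token, the others get a negative one, and a group with identical rewards collapses to all zeros thanks to the small `eps` in the denominator.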

## Policy Gradient


Now that we have a batch of episodes with corresponding advantages, we can compute the **policy gradient loss** to update the model.

GRPO uses the same loss formulation as PPO, but the key difference lies in how advantages are computed. To understand the implementation in `compute_pg_loss`, let's first recall the original PPO objective:

$$
\mathcal{L}_{\text{PPO}} = \mathbb{E}\left[\min\left( 
\frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} A_t, \;
\text{clip}\left(
\frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)}, \;
1 - \epsilon, \; 1 + \epsilon
\right) A_t \right)\right]
$$

where:
- $ \pi_{\theta} $ is the current policy,
- $ \pi_{\theta_{\text{old}}} $ is the policy from the previous iteration (the policy we sampled episodes from),
- $ A_t $ is the advantage.

This objective tries to increase or decrease the probability of tokens based on the advantage $A_t$ only when the ratio between the new and old policy probabilities stays within a small range, controlled by the clipping threshold $\epsilon$. This clipping mechanism prevents large, destabilizing updates during training.
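To make the clipping concrete, here is a minimal per-token sketch of the objective. The function name `clipped_surrogate`, the probabilities, and the value `eps=0.2` are assumptions chosen for illustration; none of them comes from the notebook's configuration.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


logp_old = math.log(0.10)  # hypothetical old-policy probability of the token

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
print(clipped_surrogate(math.log(0.15), logp_old, advantage=1.0))
# Ratio 0.5 with a negative advantage: the objective is capped at (1 - eps) * A.
print(clipped_surrogate(math.log(0.05), logp_old, advantage=-1.0))
# Ratio exactly 1 (new policy equals old policy): clipping has no effect.
print(clipped_surrogate(logp_old, logp_old, advantage=1.0))
```

Once the ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favors, the objective becomes constant in the policy parameters, so that token stops contributing gradient.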

### Fully Online Setting: Simplifying the Objective

In general PPO, multiple gradient steps might be taken using the same batch of episodes. However, in our case, we apply only **one gradient step per iteration** using freshly sampled episodes. That means:

- $ \pi_{\theta} = \pi_{\theta_{\text{old}}} $
- Consequently,  
  $$
  \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} = 1
  $$
  
Since the ratio is exactly 1:
- The clipping function becomes inactive.
- The $\min(\cdot,\cdot)$ operator simply returns the unclipped term.

So, the objective simplifies **to**:

$$
\mathcal{L}_{\text{PPO}} = \mathbb{E}\left[ \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} A_t \right]
$$


Taking the gradient of this loss with respect to $\theta$, we get:

$$
\vec{g}_{\text{PPO}} = \nabla_\theta \mathcal{L}_{\text{PPO}} = \underbrace{\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x) \cdot A_t \right]}_{\text{vanilla policy gradient with advantage}}
$$

This follows from $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ together with the ratio being exactly 1 at $\theta = \theta_{\text{old}}$. It is the **standard policy gradient** formula, where the log-probabilities are weighted by the advantage. In effect, we recover vanilla REINFORCE-style learning.

> Note: A positive constant multiplier on the objective does not affect the direction of the gradient and can be safely ignored.

In fact, this behavior is not unique to GRPO. In methods such as PPO and TRPO, the very first gradient step after collecting new data always reduces to this same form. Only after that first optimization step do the clipping or trust-region constraints start to take effect.
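A quick numerical check of this claim, as a sketch: for a tiny softmax policy with hypothetical logits, the finite-difference gradient of the ratio-based surrogate at $\theta = \theta_{\text{old}}$ should coincide with the gradient of the advantage-weighted log-probability. All names and numbers below are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_fd(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


theta_old = [0.3, -1.2, 0.8]  # hypothetical logits of a 3-token policy
token, advantage = 2, 0.7     # sampled token and its advantage
p_old = softmax(theta_old)[token]


def surrogate(theta):
    # Ratio-based objective: pi_theta / pi_theta_old * A
    return softmax(theta)[token] / p_old * advantage


def reinforce(theta):
    # Vanilla policy-gradient objective: log pi_theta * A
    return math.log(softmax(theta)[token]) * advantage


g_surrogate = grad_fd(surrogate, theta_old)
g_reinforce = grad_fd(reinforce, theta_old)
print(g_surrogate)
print(g_reinforce)
```

The two printed gradients agree up to finite-difference error, and both match the analytic softmax gradient $A \cdot (\mathbb{1}[i = y] - \pi_i)$.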

### KL Penalty

The final loss also has a **KL penalty** term to ensure the new policy doesn't drift too far from a reference policy:

$$
\mathcal{L} = \mathcal{L}_{\text{PPO}} - \beta \cdot \text{KL}(\pi_\theta \parallel \pi_{\theta_{\text{ref}}})
$$

We estimate the KL divergence using the **k3 estimator** from [this blog post by Schulman](http://joschu.net/blog/kl-approx.html):

$$
\text{KL}(\pi_\theta \parallel \pi_{\theta_{\text{ref}}}) = \mathbb{E}\left[\frac{\pi_{\theta_{\text{ref}}}(y_t \mid y_{<t}, x)}{\pi_\theta(y_t \mid y_{<t}, x)} - \log\left(\frac{\pi_{\theta_{\text{ref}}}(y_t \mid y_{<t}, x)}{\pi_\theta(y_t \mid y_{<t}, x)}\right) - 1\right]
$$

This regularization term softly constrains the updated model to remain close to the reference.
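As a sanity check on the k3 formula, the sketch below takes two small, hypothetical next-token distributions and compares the exact KL divergence with the expectation of the per-token k3 term under the policy. The expectation is computed exactly rather than by sampling; `k3` is an illustrative helper that mirrors the expression used for `kl_penalty` in `compute_pg_loss`.

```python
import math


def k3(logp, ref_logp):
    """Per-token k3 estimate: r - log(r) - 1 with r = pi_ref / pi_theta."""
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical next-token distributions over a 4-token vocabulary.
policy = [0.50, 0.25, 0.15, 0.10]
reference = [0.40, 0.30, 0.20, 0.10]

exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, reference))
# Expectation of k3 when tokens are drawn from the policy.
expected_k3 = sum(
    p * k3(math.log(p), math.log(q)) for p, q in zip(policy, reference)
)

print(exact_kl, expected_k3)
print(k3(math.log(0.3), math.log(0.3)))  # identical probabilities give zero
```

Each per-token term is non-negative, which is one reason this estimator tends to be less noisy than the plain log-ratio, and it vanishes wherever the two models agree.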


### GRPO vs PPO/VinePPO: Key Difference

The main difference between **GRPO** and methods like **PPO/VinePPO** lies in **how the advantage is computed and applied**:

- In **PPO/VinePPO**, each token/step's advantage is computed individually. This allows for fine-grained credit assignment across the sequence.
- In **GRPO**, a **single scalar advantage** is computed for the entire response and is applied **uniformly to all tokens** in that response.

This distinction is illustrated below:

#### A successful response in GRPO:
<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/grpo_successful.png?raw=true" alt="GRPO vs PPO/VinePPO: successful response" width="500">

#### A failed response in GRPO:
<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/grpo_unsuccessful.png?raw=true" alt="GRPO vs PPO/VinePPO: failed response" width="500">

In GRPO, all tokens in a response are updated with the same magnitude. In contrast, PPO/VinePPO updates each token/step with a different advantage value:

<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/ppo_and_vineppo.png?raw=true" alt="GRPO vs PPO/VinePPO: PPO and VinePPO" width="500">


In [24]:
def compute_pg_loss(
    policy_model: Union[DeepSpeedEngine, PreTrainedModel],
    reference_model: Union[DeepSpeedEngine, PreTrainedModel],
    batch: Dict[str, torch.Tensor],
    total_response_len: int,
) -> Tuple[torch.Tensor, Dict[str, float]]:
    """
    Compute the policy gradient loss with KL penalty between policy and reference models.

    This function:
    1. Computes log probabilities for both policy and reference models
    2. Calculates KL divergence penalty between the models
    3. Computes policy gradient loss using advantages
    4. Combines the losses with KL coefficient

    Args:
        policy_model: The model being trained
        reference_model: The reference model for KL penalty calculation
        batch: Dictionary containing:
            - input_ids: Tensor of shape [batch_size, seq_len]
            - attention_mask: Tensor of shape [batch_size, seq_len]
            - labels: Tensor of shape [batch_size, seq_len] with -100 for ignored positions
            - labels_mask: Tensor of shape [batch_size, seq_len]
            - advantages: Tensor of shape [batch_size, seq_len]
        total_response_len: Total number of response tokens across all micro-batches
            of the iteration, used to normalize the loss

    Returns:
        Tuple containing:
            - loss: Combined policy gradient and KL penalty loss (scalar tensor)
            - metrics: Dictionary with detailed loss components:
                - policy_loss: Pure policy gradient loss
                - kl_penalty: KL divergence penalty
                - entropy: Policy entropy
    """
    input_ids = batch["input_ids"]  # [batch_size, seq_len]
    attention_mask = batch["attention_mask"]  # [batch_size, seq_len]
    labels = batch["labels"]  # [batch_size, seq_len]
    labels_mask = batch["labels_mask"]  # [batch_size, seq_len]
    advantages = batch["advantages"]  # [batch_size, seq_len]

    model_inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "labels_mask": labels_mask,
    }

    labels_mask = (labels[..., 1:] != -100).float()  # [batch_size, seq_len-1]

    with torch.no_grad():
        ref_logps = compute_token_log_probs(
            reference_model, model_inputs, TEMPERATURE
        )  # [batch_size, seq_len-1]

    logps = compute_token_log_probs(policy_model, model_inputs, TEMPERATURE)  # [batch_size, seq_len-1]

    kl_penalty = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1  # [batch_size, seq_len-1]
    kl_penalty = kl_penalty * labels_mask  # [batch_size, seq_len-1]

    entropy = -logps.sum() / labels_mask.sum()  # scalar

    policy_loss = -logps * advantages[..., 1:]  # [batch_size, seq_len-1]
    policy_loss = policy_loss * labels_mask  # [batch_size, seq_len-1]

    loss = (policy_loss + KL_COEFFICIENT * kl_penalty).sum() / total_response_len  # scalar

    metrics = {
        "policy_loss": policy_loss.sum().item() / total_response_len,
        "kl_penalty": kl_penalty.sum().item() / total_response_len,
        "entropy": entropy.item() / total_response_len,
    }

    return loss, metrics

## Training

Before starting the RL loop, we need to set up all necessary components:

- **Policy Model**: The main model that will be trained using policy gradients.
- **Reference Model**: A frozen copy of the base model used for KL regularization.
- **DeepSpeed**: Both models are initialized with DeepSpeed.
- **vLLM Inference Engine**: Used for fast, batched inference during episode generation.
- **WandB Logging**: We initialize WandB to track training metrics, hyperparameters, and checkpoints.

Finally, if an existing checkpoint is detected, we automatically resume training from where it left off. 
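For intuition, here is a minimal sketch of how "find the last checkpoint" could work given the `checkpoints/ckpt_{iteration:06d}/deepspeed` layout written by the training loop. The helper `latest_checkpoint` is hypothetical; the notebook's own `find_last_checkpoint` is defined elsewhere and may behave differently. Requiring the `deepspeed` sub-directory is an assumption made here to skip half-written checkpoints.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(exp_dir):
    """Return (path, iteration) of the highest-numbered ckpt_* dir, or (None, None)."""
    ckpt_root = Path(exp_dir) / "checkpoints"
    found = []
    if ckpt_root.is_dir():
        for child in ckpt_root.iterdir():
            match = re.fullmatch(r"ckpt_(\d+)", child.name)
            # Assumption: only count checkpoints whose DeepSpeed state exists.
            if match and (child / "deepspeed").is_dir():
                found.append((int(match.group(1)), child))
    if not found:
        return None, None
    iteration, path = max(found, key=lambda item: item[0])
    return path, iteration


# Usage on a throwaway directory that mimics the experiment layout.
with tempfile.TemporaryDirectory() as tmp:
    for it in (50, 100):
        (Path(tmp) / "checkpoints" / f"ckpt_{it:06d}" / "deepspeed").mkdir(parents=True)
    path, iteration = latest_checkpoint(tmp)
    print(path.name, iteration)
```

Training would then resume at `iteration + 1`, which matches how `begin_iter` is set in the setup cell.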

A couple of remarks:
- We move the reference model to the CPU and only bring it back to the GPU during policy gradient computation. Because the model is relatively small, moving it back and forth between GPU and CPU is very fast.
- Despite the entire training being run on a single GPU, we still use DeepSpeed ZeRO stage 2. This is because stage 2 comes with optimizations that avoid memory fragmentation, allowing us to fully utilize GPU memory.
- Flash Attention is required in our setup as it reduces the memory requirement of attention in transformers from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$, where $n$ is the sequence length.
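To get a feel for the $\mathcal{O}(n^2)$ term, the back-of-the-envelope sketch below computes how much memory a fully materialized attention score matrix would take for one layer and one sequence in bf16. The head count is a hypothetical value picked for illustration, not read from the model config.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_element=2):
    """Memory for one layer's full attention score matrix (one sequence)."""
    return num_heads * seq_len * seq_len * bytes_per_element


NUM_HEADS = 16  # hypothetical head count, chosen only for illustration

for n in (1024, 2048, 4096):
    mib = attention_scores_bytes(n, NUM_HEADS) / 2**20
    print(f"n={n}: {mib:.0f} MiB per layer per sequence")
```

Doubling the sequence length quadruples this buffer. Flash Attention computes attention in tiles without ever materializing the full $n \times n$ matrix, which is where the linear memory footprint comes from.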

In [25]:
# Initialize main and reference models
policy_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map=0,
)
reference_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map=0,
)
policy_model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})


# Initialize DeepSpeed engines
policy_model, *_ = deepspeed.initialize(
    model=policy_model,
    config=deepspeed_config,
    model_parameters=policy_model.parameters(),
)
reference_model, *_ = deepspeed.initialize(
    model=reference_model,
    config=ref_deepspeed_config,
)

reference_model.module.cpu()

############################################
# Initialize vLLM (Inference) engine
############################################

inference_engine = LLM(
    model=MODEL_NAME,
    skip_tokenizer_init=False,
    gpu_memory_utilization=0.2,
    enable_prefix_caching=True,
    swap_space=1,
    scheduling_policy="fcfs",
    dtype=torch.bfloat16,
    max_model_len=2048,
    enable_sleep_mode=True,
)

# Wandb for logging
wandb.init(
    project="r1-aha-moment",
    name=RUN_NAME,
    config={
        "model_name": MODEL_NAME,
        "learning_rate": LEARNING_RATE,
        "num_iterations": NUM_ITERATIONS,
        "episodes_per_iteration": EPISODES_PER_ITERATION,
        "rollouts_per_episode": GENERATIONS_PER_SAMPLE,
        "kl_coefficient": KL_COEFFICIENT,
        "temperature": TEMPERATURE,
    },
)

# Load checkpoint if it exists
begin_iter = 0
ckpt_path, ckpt_iter = find_last_checkpoint(EXP_DIR)
if ckpt_path is not None:
    print(f"Resuming from checkpoint {ckpt_path} at iteration {ckpt_iter}")
    out = policy_model.load_checkpoint(ckpt_path / "deepspeed")
    if out is None:
        raise RuntimeError(f"Failed to load checkpoint {ckpt_path}")
    begin_iter = ckpt_iter + 1
    load_model_into_vllm(policy_model, inference_engine)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Before initializing optimizer states
MA 22.99 GB         Max_MA 28.74 GB         CA 34.49 GB         Max_CA 34 GB 
CPU Virtual Memory:  used = 23.67 GB, percent = 1.3%
After initializing optimizer states
MA 22.99 GB         Max_MA 34.49 GB         CA 45.99 GB         Max_CA 46 GB 
CPU Virtual Memory:  used = 23.67 GB, percent = 1.3%
After initializing ZeRO optimizer
MA 22.99 GB         Max_MA 22.99 GB         CA 45.99 GB         Max_CA 46 GB 
CPU Virtual Memory:  used = 23.67 GB, percent = 1.3%
begin bf16_optimizer
MA 22.99 GB         Max_MA 22.99 GB         CA 45.99 GB         Max_CA 46 GB 
CPU Virtual Memory:  used = 23.67 GB, percent = 1.3%
end bf16_ optimizer
MA 22.99 GB         Max_MA 22.99 GB         CA 45.99 GB         Max_CA 46 GB 
CPU Virtual Memory:  used = 23.67 GB, percent = 1.3%
INFO 01-02 03:19:03 [utils.py:253] non-default args: {'dtype': torch.bfloat16, 'max_model_len': 2048, 'enable_prefix_caching': True, 'swap_space': 1, 'gpu_memory_utilization': 0.2, 'disable_log_sta

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 01-02 03:19:13 [default_loader.py:308] Loading weights took 1.21 seconds
INFO 01-02 03:19:14 [gpu_model_runner.py:3659] Model loading took 5.7916 GiB memory and 1.880112 seconds
INFO 01-02 03:19:20 [backends.py:643] Using cache directory: /root/.cache/vllm/torch_compile_cache/25fc343081/rank_0_0/backbone for vLLM's torch.compile
INFO 01-02 03:19:20 [backends.py:703] Dynamo bytecode transform time: 6.16 s
INFO 01-02 03:19:25 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.083 s
INFO 01-02 03:19:25 [monitor.py:34] torch.compile takes 7.24 s in total
INFO 01-02 03:19:26 [gpu_worker.py:375] Available KV cache memory: 8.62 GiB
INFO 01-02 03:19:26 [kv_cache_utils.py:1291] GPU KV cache size: 251,056 tokens
INFO 01-02 03:19:26 [kv_cache_utils.py:1296] Maximum concurrency for 2,048 tokens per request: 122.59x


Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 23.02it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 23.74it/s]


INFO 01-02 03:19:31 [gpu_model_runner.py:4587] Graph capturing finished in 5 secs, took 0.53 GiB
INFO 01-02 03:19:31 [core.py:259] init engine (profile, create kv cache, warmup model) took 17.45 seconds
INFO 01-02 03:19:32 [llm.py:360] Supported tasks: ('generate',)


wandb: Currently logged in as: insop-song2 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Detected [openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


### Training loop

With everything set up, we are ready to start the main training loop. Each iteration of the loop performs the following steps:

1. **Evaluation** (optional): 
Every few iterations, the model is evaluated on a test set to monitor progress.
2. **Episode Generation**
A batch of prompts is sampled, and multiple responses are generated for each prompt using the inference engine. Then we put the inference engine to sleep.
3. **Reward Computation**
Rewards and advantages for each generated episode are computed.
4. **Policy Gradient Training**
Using the computed advantages, we calculate the policy gradient loss and update the model parameters. The training is done using gradient accumulation to handle large batches. Note that we apply a single gradient update per iteration.
5. **Inference Engine Update**
The inference engine is woken up and updated with the latest model weights.
6. **Logging**
Training and evaluation metrics are logged using WandB.
7. **Checkpointing**
Every 50 iterations, the model and optimizer states are saved.

This loop continues until the specified number of iterations is completed.
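One detail of step 4 is worth spelling out: each micro-batch loss in `compute_pg_loss` is a token sum divided by `total_response_len`, the response-token count of the whole iteration. The sketch below, using made-up per-token losses, shows that accumulating those micro-batch losses reproduces the mean over all response tokens, regardless of how episodes are split into micro-batches.

```python
def token_mean(values):
    return sum(values) / len(values)


# Hypothetical per-token losses for 4 episodes of different lengths.
episode_token_losses = [
    [0.2, 0.4, 0.6],
    [1.0],
    [0.5, 0.5],
    [0.1, 0.3, 0.2, 0.4],
]
total_response_len = sum(len(ep) for ep in episode_token_losses)

micro_batch_size = 2  # stands in for PER_DEVICE_BATCH_SIZE
accumulated = 0.0
for i in range(0, len(episode_token_losses), micro_batch_size):
    micro_batch = episode_token_losses[i : i + micro_batch_size]
    # Same normalization as compute_pg_loss: token sum / total_response_len.
    accumulated += sum(sum(ep) for ep in micro_batch) / total_response_len

all_tokens = [loss for ep in episode_token_losses for loss in ep]
print(accumulated, token_mean(all_tokens))
```

Because the normalization is already global, no further division by the number of accumulation steps is needed, which is presumably why the loop calls `policy_model.backward(loss, scale_wrt_gas=False)`.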

**Sleeping of vLLM**
Before training begins, we put vLLM into sleep mode to free up its KV cache and model weights, ensuring enough GPU memory is available for policy training. After the training step is complete, vLLM is woken up, reinitializing its KV cache and preparing for the next round of sampling using the updated model parameters.

### Testing (remove)

In [26]:
test_dataset

Dataset({
    features: ['target', 'nums', 'prompt', 'input_ids'],
    num_rows: 500
})

In [27]:
# test

test_inputs = tokenizer(
    ["hello world", "how are you?!"], 
    padding=True,
    return_tensors="pt"
)
test_inputs



{'input_ids': tensor([[ 14990,   1879, 151643, 151643],
        [  5158,    525,    498,  25984]]), 'attention_mask': tensor([[1, 1, 0, 0],
        [1, 1, 1, 1]])}

In [28]:
type(policy_model)

deepspeed.runtime.engine.DeepSpeedEngine

In [29]:
test_inputs_ids = test_inputs['input_ids']
test_attention_mask = test_inputs['attention_mask']

In [30]:
test_labels = test_inputs_ids.clone()
# test_labels[test_labels == tokenizer.pad_token_id] = -100
test_labels = torch.where(test_attention_mask == 1, test_labels, -100)
test_labels


tensor([[14990,  1879,  -100,  -100],
        [ 5158,   525,   498, 25984]])

In [31]:
output = policy_model(
    input_ids=test_inputs_ids.cuda(),
    attention_mask=test_attention_mask.cuda(),
    return_dict=True,
    use_cache=True,
)
output.keys()





odict_keys(['logits', 'past_key_values'])

In [32]:
len(tokenizer)

151665

In [33]:
logits = output['logits']
logits.shape

torch.Size([2, 4, 151936])

In [34]:
test_labels.shape

torch.Size([2, 4])

In [35]:
shift_labels = test_labels[:,1:].contiguous()
shift_labels.shape

torch.Size([2, 3])

In [36]:
shift_logits = logits[:,:-1].contiguous()
shift_logits.shape

torch.Size([2, 3, 151936])

In [37]:
temp = 1.0
logits = logits.float() / temp
logits.shape



torch.Size([2, 4, 151936])

In [38]:
label_mask = shift_labels != -100
shift_labels[~label_mask] = 0
shift_labels

tensor([[ 1879,     0,     0],
        [  525,   498, 25984]])

In [39]:
log_probs = torch.log_softmax(shift_logits, dim=-1)
log_probs.shape

torch.Size([2, 3, 151936])

In [40]:
log_probs.shape

torch.Size([2, 3, 151936])

In [41]:
log_probs = torch.gather(log_probs, dim=-1, index=shift_labels.unsqueeze(-1).cuda())
log_probs

tensor([[[-4.1250],
         [-2.9375],
         [-7.5312]],

        [[-5.0938],
         [-1.2578],
         [-9.1875]]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<GatherBackward0>)

In [42]:
log_probs = log_probs.squeeze(-1)
log_probs.shape

torch.Size([2, 3])

In [43]:
shift_labels.device

device(type='cpu')

In [44]:
shift_labels

tensor([[ 1879,     0,     0],
        [  525,   498, 25984]])

In [45]:
label_mask

tensor([[ True, False, False],
        [ True,  True,  True]])

In [46]:
log_probs = log_probs*label_mask.cuda()

In [47]:
log_probs

tensor([[-4.1250, -0.0000, -0.0000],
        [-5.0938, -1.2578, -9.1875]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<MulBackward0>)

In [None]:
episodes = {
    "all_query_token_ids": [[12,38,28], [12,38,28]],
    "all_response_token_ids": [[28,29,48,59,23,49,382,93], [28,29,48,59,23]],
    "all_advantages": [[0.1,0.1,0.1, 0.1, 0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]],
}



In [None]:
inputs = {
    "input_ids": [],
    "attention_mask": [],
    "labels": [],
    "advantages": []
}





In [51]:
max_len = max(

    len(q) + len(r)
    for q, r in zip(episodes['all_query_token_ids'], episodes['all_response_token_ids'])
)
max_len


11

In [None]:
pad_token_id = tokenizer.pad_token_id 
ignore_index = -100

for query_token_ids, response_token_ids, advantages in zip(
    episodes['all_query_token_ids'],
    episodes['all_response_token_ids'],
    episodes['all_advantages']
):
    input_ids = query_token_ids + response_token_ids
    attention_mask = [1] * len(input_ids)
    labels = [ignore_index] * len(query_token_ids) + response_token_ids
    # Query positions come first in input_ids, so their zero advantages go first too
    advantages_padded = [0.0] * len(query_token_ids) + advantages

    # Pad to max_len
    padding_length = max_len - len(input_ids)
    input_ids += [pad_token_id] * padding_length
    attention_mask += [0] * padding_length
    labels += [ignore_index] * padding_length
    advantages_padded += [0.0] * padding_length

    inputs["input_ids"].append(input_ids)
    inputs["attention_mask"].append(attention_mask)
    inputs["labels"].append(labels)
    inputs["advantages"].append(advantages_padded)



### test done ----- ^^^

In [27]:
for iteration in trange(NUM_ITERATIONS):
    print(f"Iteration {iteration}/{NUM_ITERATIONS}")

    metrics = {}

    #########################################################
    # Evaluation
    #########################################################

    eval_stats = None
    if iteration % 25 == 0:
        print("Evaluating on eval set...")
        eval_episodes, eval_stats = evaluate_on_test_set(
            inference_engine=inference_engine,
            test_dataset=test_dataset,
            tokenizer=tokenizer,
            eos_token=EOS_TOKEN,
            eval_sampling_params=SamplingParams(
                temperature=0.3,
                max_tokens=1024,
                n=1,
                detokenize=False,
                stop_token_ids=[EOS_TOKEN_ID],
            ),
            reward_func=lambda completion, sample: compute_reward(
                completion, sample
            ),
        )
        eval_episode_table = dump_episodes(
            episodes=eval_episodes,
            episodes_stats=eval_stats,
            exp_dir=EXP_DIR,
            tokenizer=tokenizer,
            iteration=iteration,
            is_eval=True,
        )
        wandb.log({"eval/episodes": eval_episode_table, "iteration": iteration})


    #########################################################
    # Generate Episodes
    #########################################################

    # Sample training batch
    num_samples = EPISODES_PER_ITERATION // GENERATIONS_PER_SAMPLE
    indices = np.random.choice(
        len(train_dataset), size=num_samples, replace=False
    )
    samples = train_dataset.select(indices)

    samples_list = [
        TokensPrompt(prompt_token_ids=tids) for tids in samples["input_ids"] 
    ]

    # Sample responses
    outputs = inference_engine.generate(
        prompts=samples_list,
        sampling_params=SamplingParams(
            n=GENERATIONS_PER_SAMPLE,
            temperature=TEMPERATURE,
            top_p=TOP_P,
            top_k=TOP_K,
            max_tokens=MAX_RESPONSE_TOKENS,
            detokenize=False,
            stop_token_ids=[EOS_TOKEN_ID],
        )
    )
    all_generations = [list(g.token_ids) for out in outputs for g in out.outputs]
    all_finish_reasons = [g.finish_reason for out in outputs for g in out.outputs]
    inference_engine.sleep(1)

    print(f"Generated {len(all_generations)} responses")
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    # Process responses and calculate rewards
    episodes, episodes_stats = create_training_episodes(
        samples,
        all_generations,
        all_finish_reasons,
    )
    for k, v in episodes_stats.items():
        metrics.setdefault(k, []).extend(v)

    episode_table = dump_episodes(
        episodes=episodes,
        episodes_stats=episodes_stats,
        exp_dir=EXP_DIR,
        tokenizer=tokenizer,
        iteration=iteration,
    )

    #########################################################
    # Training
    #########################################################

    # Prepare training batch
    model_inputs = prepare_model_inputs(
        query_token_ids=episodes["all_query_token_ids"],
        response_token_ids=episodes["all_response_token_ids"],
        advantages=episodes["all_advantages"],
        device="cuda"
    )

    # Calculate losses and update model
    policy_model.train()
    reference_model.module.cuda()
    reference_model.eval()

    total_response_len = (model_inputs["labels"] != -100).sum().item()

    for i in trange(0, EPISODES_PER_ITERATION, PER_DEVICE_BATCH_SIZE, desc="Gradient Accumulation"):
        batch = {
            k: v[i : i + PER_DEVICE_BATCH_SIZE]
            for k, v in model_inputs.items()
        }

        # Compute policy gradient loss
        loss, loss_metrics = compute_pg_loss(
            policy_model=policy_model,
            reference_model=reference_model,
            batch=batch,
            total_response_len=total_response_len,
        )

        # Track metrics
        metrics.setdefault("loss", []).append(loss.item())
        grad_norm = policy_model.get_global_grad_norm()
        if grad_norm is not None:
            grad_norm = grad_norm.item()
        metrics.setdefault("grad_norm", []).append(grad_norm)
        for k, v in loss_metrics.items():
            metrics.setdefault(k, []).append(v.item() if isinstance(v, torch.Tensor) else v)

        # Backpropagation and optimization step
        policy_model.backward(loss, scale_wrt_gas=False)
        
        # Free memory
        del loss, loss_metrics
        if policy_model.is_gradient_accumulation_boundary():
            reference_model.module.cpu()

        policy_model.step()

    #########################################################
    # Update inference engine weights
    #########################################################
    
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    inference_engine.wake_up()
    load_model_into_vllm(policy_model, inference_engine)

    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)


    #########################################################
    # Log metrics
    #########################################################

    train_metrics = {
        k: np.mean(v) for k, v in metrics.items() if None not in v
    }
    train_metrics["learning_rate"] = policy_model.get_lr()[0]
    logs = {
        "iteration": iteration,
        f"episodes/iter_{iteration:06d}": episode_table,
        **{f"train/{k}": v for k, v in train_metrics.items()},
    }
    if eval_stats is not None:
        eval_metrics = {k: np.mean(v) for k, v in eval_stats.items() if None not in v}
        logs.update({f"eval/{k}": v for k, v in eval_metrics.items()})
    wandb.log(logs)

    selected_keys = [
        "train/kl_penalty",
        "train/rewards",
        "train/reward_metrics/format_reward",
        "train/reward_metrics/equation_reward",
        "eval/rewards",
        "eval/reward_metrics/format_reward",
        "eval/reward_metrics/equation_reward",
    ]
    selected_metrics = {k: logs[k] for k in selected_keys if k in logs}
    print(f"KEY METRICS: {selected_metrics}")

    if iteration % 50 == 0 and iteration != 0:
        policy_model.module.save_pretrained(
            str(EXP_DIR / "checkpoints" / f"ckpt_{iteration:06d}" / "hf_model")
        )
        policy_model.save_checkpoint(
            str(EXP_DIR / "checkpoints" / f"ckpt_{iteration:06d}" / "deepspeed")
        )

  0%|          | 0/1000 [00:00<?, ?it/s]

Iteration 0/1000
Evaluating on eval set...


Adding requests:   0%|          | 0/500 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/500 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:30:06 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:30:13 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:30:13 [gpu_worker.py:132] Sleep mode freed 15.92 GiB memory, 21.82 GiB memory is still in use.
INFO 01-01 23:30:13 [abstract.py:306] It took 6.797080 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 708)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [4, 3, 56, 41], create an equation that equals 97. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) /

Gradient Accumulation: 100%|██████████| 16/16 [00:19<00:00,  1.24s/it]


INFO 01-01 23:30:36 [abstract.py:324] It took 0.324560 seconds to wake up tags {'weights', 'kv_cache'}.


  0%|          | 1/1000 [01:03<17:38:28, 63.57s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0), 'train/rewards': np.float64(0.09375), 'train/reward_metrics/format_reward': np.float64(0.09375), 'train/reward_metrics/equation_reward': np.float64(0.0), 'eval/rewards': np.float64(0.263), 'eval/reward_metrics/format_reward': np.float64(0.257), 'eval/reward_metrics/equation_reward': np.float64(0.006)}
Iteration 1/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:30:45 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:30:46 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:30:46 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:30:46 [abstract.py:306] It took 0.580666 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 266)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [49, 24, 37, 76], create an equation that equals 78. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.00it/s]


INFO 01-01 23:31:05 [abstract.py:324] It took 0.335948 seconds to wake up tags {'weights', 'kv_cache'}.


  0%|          | 2/1000 [01:32<11:53:48, 42.91s/it]

KEY METRICS: {'train/kl_penalty': np.float64(1.5260695378544954e-05), 'train/rewards': np.float64(0.0859375), 'train/reward_metrics/format_reward': np.float64(0.0859375), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 2/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:31:13 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:31:14 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:31:14 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:31:14 [abstract.py:306] It took 0.583595 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 274)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [43, 20, 5, 16], create an equation that equals 17. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.00it/s]


INFO 01-01 23:31:33 [abstract.py:324] It took 0.295742 seconds to wake up tags {'weights', 'kv_cache'}.


  0%|          | 3/1000 [02:00<10:01:46, 36.21s/it]

KEY METRICS: {'train/kl_penalty': np.float64(1.9419874776494465e-05), 'train/rewards': np.float64(0.0390625), 'train/reward_metrics/format_reward': np.float64(0.0390625), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 3/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:31:42 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:31:42 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:31:42 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:31:42 [abstract.py:306] It took 0.570246 seconds to fall asleep.
Generated 64 responses




########## Example 1 (Reward: 0.0, Response Length: 264)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [4, 94, 9, 50], create an equation that equals 74. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` First, let's analyze the numbers we have: [4, 94, 9, 50]. We want to create an equation that equals 74, and we can only use each number once.
We can divide 94 by 9 to get 10.44444. This gives us the form (94/9) * some number = 74.
So, (94/9) * 50 = 74 is the final equation that satisfies the given conditions.
ToShow our work, here are the steps 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.00it/s]


INFO 01-01 23:32:01 [abstract.py:324] It took 0.330808 seconds to wake up tags {'weights', 'kv_cache'}.


  0%|          | 4/1000 [02:28<9:10:53, 33.19s/it] 

KEY METRICS: {'train/kl_penalty': np.float64(1.8676240669759934e-05), 'train/rewards': np.float64(0.0625), 'train/reward_metrics/format_reward': np.float64(0.0625), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 4/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:32:10 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:32:11 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:32:11 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:32:11 [abstract.py:306] It took 0.601215 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 144)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [46, 20, 86], create an equation that equals 60. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.00it/s]


INFO 01-01 23:32:30 [abstract.py:324] It took 0.302277 seconds to wake up tags {'weights', 'kv_cache'}.


  0%|          | 5/1000 [02:57<8:41:50, 31.47s/it]

KEY METRICS: {'train/kl_penalty': np.float64(1.937446244266055e-05), 'train/rewards': np.float64(0.140625), 'train/reward_metrics/format_reward': np.float64(0.140625), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 5/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:32:39 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:32:39 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:32:39 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:32:39 [abstract.py:306] It took 0.569520 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 85)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [62, 69, 26], create an equation that equals 19. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.00it/s]


INFO 01-01 23:32:58 [abstract.py:324] It took 0.297371 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 6/1000 [03:25<8:23:38, 30.40s/it]

KEY METRICS: {'train/kl_penalty': np.float64(3.206362599369679e-05), 'train/rewards': np.float64(0.1171875), 'train/reward_metrics/format_reward': np.float64(0.1171875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 6/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:33:07 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:33:07 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:33:08 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:33:08 [abstract.py:306] It took 0.588115 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 217)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [40, 86, 61], create an equation that equals 65. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:16<00:00,  1.00s/it]


INFO 01-01 23:33:27 [abstract.py:324] It took 0.336968 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 7/1000 [03:53<8:12:26, 29.75s/it]

KEY METRICS: {'train/kl_penalty': np.float64(3.234707920982182e-05), 'train/rewards': np.float64(0.109375), 'train/reward_metrics/format_reward': np.float64(0.078125), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 7/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:33:35 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:33:36 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:33:36 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:33:36 [abstract.py:306] It took 0.604580 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 1024)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [47, 47, 69], create an equation that equals 70. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.01it/s]


INFO 01-01 23:33:55 [abstract.py:324] It took 0.289529 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 8/1000 [04:22<8:04:41, 29.32s/it]

KEY METRICS: {'train/kl_penalty': np.float64(4.4316664047768705e-05), 'train/rewards': np.float64(0.1484375), 'train/reward_metrics/format_reward': np.float64(0.1484375), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 8/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:34:04 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:34:04 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:34:04 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:34:04 [abstract.py:306] It took 0.553430 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 283)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [7, 85, 78], create an equation that equals 14. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.01it/s]


INFO 01-01 23:34:23 [abstract.py:324] It took 0.325363 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 9/1000 [04:50<7:58:21, 28.96s/it]

KEY METRICS: {'train/kl_penalty': np.float64(5.706092426548438e-05), 'train/rewards': np.float64(0.203125), 'train/reward_metrics/format_reward': np.float64(0.203125), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 9/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:34:32 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:34:32 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:34:32 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:34:32 [abstract.py:306] It took 0.587288 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 785)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [78, 49, 17], create an equation that equals 12. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.01it/s]


INFO 01-01 23:34:51 [abstract.py:324] It took 0.307735 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 10/1000 [05:18<7:52:49, 28.66s/it]

KEY METRICS: {'train/kl_penalty': np.float64(8.815371180136711e-05), 'train/rewards': np.float64(0.1875), 'train/reward_metrics/format_reward': np.float64(0.1875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 10/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:35:00 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:35:00 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:35:00 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:35:00 [abstract.py:306] It took 0.583847 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 52)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [36, 99, 33], create an equation that equals 12. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:35:19 [abstract.py:324] It took 0.324204 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 11/1000 [05:46<7:49:11, 28.46s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.00011899101047211294), 'train/rewards': np.float64(0.1328125), 'train/reward_metrics/format_reward': np.float64(0.1328125), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 11/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:35:28 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:35:28 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:35:28 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:35:28 [abstract.py:306] It took 0.590881 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 137)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [51, 90, 35, 35], create an equation that equals 39. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:35:47 [abstract.py:324] It took 0.324481 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|          | 12/1000 [06:14<7:46:10, 28.31s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0001592567278656275), 'train/rewards': np.float64(0.3046875), 'train/reward_metrics/format_reward': np.float64(0.2734375), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 12/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:35:56 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:35:56 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:35:56 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:35:56 [abstract.py:306] It took 0.659323 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 152)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [32, 80, 1, 52], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:36:15 [abstract.py:324] It took 0.328249 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|▏         | 13/1000 [06:42<7:43:10, 28.16s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0002417077192640173), 'train/rewards': np.float64(0.34375), 'train/reward_metrics/format_reward': np.float64(0.34375), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 13/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:36:23 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:36:24 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:36:24 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:36:24 [abstract.py:306] It took 0.615413 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 367)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [20, 60, 7], create an equation that equals 33. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.01it/s]


INFO 01-01 23:36:43 [abstract.py:324] It took 0.292082 seconds to wake up tags {'weights', 'kv_cache'}.


  1%|▏         | 14/1000 [07:10<7:41:58, 28.11s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.000260333923631878), 'train/rewards': np.float64(0.3046875), 'train/reward_metrics/format_reward': np.float64(0.3046875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 14/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:36:50 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:36:50 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:36:50 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:36:50 [abstract.py:306] It took 0.553379 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 120)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [45, 79, 80, 77], create an equation that equals 85. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:13<00:00,  1.19it/s]


INFO 01-01 23:37:07 [abstract.py:324] It took 0.327016 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 15/1000 [07:34<7:20:56, 26.86s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0004303554729248308), 'train/rewards': np.float64(0.3671875), 'train/reward_metrics/format_reward': np.float64(0.3671875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 15/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:37:15 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:37:16 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:37:16 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:37:16 [abstract.py:306] It took 0.557787 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 63)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [37, 35, 89], create an equation that equals 91. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:37:35 [abstract.py:324] It took 0.333457 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 16/1000 [08:01<7:24:43, 27.12s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0005019514617042793), 'train/rewards': np.float64(0.46875), 'train/reward_metrics/format_reward': np.float64(0.4375), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 16/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:37:43 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:37:43 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:37:44 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:37:44 [abstract.py:306] It took 0.564206 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 1024)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [80, 24, 32, 38], create an equation that equals 14. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:38:02 [abstract.py:324] It took 0.325188 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 17/1000 [08:29<7:26:54, 27.28s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.00043777500374650327), 'train/rewards': np.float64(0.53125), 'train/reward_metrics/format_reward': np.float64(0.53125), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 17/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:38:11 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:38:11 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:38:11 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:38:11 [abstract.py:306] It took 0.568994 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 158)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [63, 98, 57], create an equation that equals 22. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:38:30 [abstract.py:324] It took 0.329583 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 18/1000 [08:57<7:29:17, 27.45s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0005223675943473962), 'train/rewards': np.float64(0.453125), 'train/reward_metrics/format_reward': np.float64(0.453125), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 18/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:38:39 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:38:39 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:38:39 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:38:39 [abstract.py:306] It took 0.616513 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 114)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [44, 87, 48, 89], create an equation that equals 20. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:38:58 [abstract.py:324] It took 0.327076 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 19/1000 [09:25<7:30:09, 27.53s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0005099619238052716), 'train/rewards': np.float64(0.4921875), 'train/reward_metrics/format_reward': np.float64(0.4765625), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 19/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:39:04 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:39:05 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:39:05 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:39:05 [abstract.py:306] It took 0.590652 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 222)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [74, 14, 75], create an equation that equals 14. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:13<00:00,  1.21it/s]


INFO 01-01 23:39:21 [abstract.py:324] It took 0.324683 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 20/1000 [09:48<7:08:49, 26.25s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0013174384994106398), 'train/rewards': np.float64(0.5859375), 'train/reward_metrics/format_reward': np.float64(0.5859375), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 20/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:39:29 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:39:30 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:39:30 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:39:30 [abstract.py:306] It took 0.566012 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 174)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [45, 56, 62], create an equation that equals 39. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:39:49 [abstract.py:324] It took 0.331906 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 21/1000 [10:16<7:14:55, 26.66s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0007758285898947751), 'train/rewards': np.float64(0.625), 'train/reward_metrics/format_reward': np.float64(0.625), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 21/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:39:54 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:39:54 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:39:54 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:39:54 [abstract.py:306] It took 0.567323 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 207)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [18, 8, 56], create an equation that equals 11. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:10<00:00,  1.49it/s]


INFO 01-01 23:40:08 [abstract.py:324] It took 0.336051 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 22/1000 [10:35<6:38:36, 24.45s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.001006993028243582), 'train/rewards': np.float64(0.6328125), 'train/reward_metrics/format_reward': np.float64(0.6328125), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 22/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:40:16 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:40:17 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:40:17 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:40:17 [abstract.py:306] It took 0.556172 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 147)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [71, 92, 78, 7], create an equation that equals 81. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:40:36 [abstract.py:324] It took 0.348755 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 23/1000 [11:03<6:54:19, 25.44s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0010495841690758862), 'train/rewards': np.float64(0.6953125), 'train/reward_metrics/format_reward': np.float64(0.6796875), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 23/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:40:42 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:40:43 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:40:43 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:40:43 [abstract.py:306] It took 0.646153 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 123)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [74, 70, 7], create an equation that equals 64. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:12<00:00,  1.30it/s]


INFO 01-01 23:40:58 [abstract.py:324] It took 0.328175 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▏         | 24/1000 [11:25<6:39:12, 24.54s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0007470331402634357), 'train/rewards': np.float64(0.703125), 'train/reward_metrics/format_reward': np.float64(0.671875), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 24/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:41:06 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:41:07 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:41:07 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:41:07 [abstract.py:306] It took 0.567318 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 98)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [16, 6, 32], create an equation that equals 64. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:41:26 [abstract.py:324] It took 0.327873 seconds to wake up tags {'weights', 'kv_cache'}.


  2%|▎         | 25/1000 [11:53<6:53:02, 25.42s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0009283765777471362), 'train/rewards': np.float64(0.78125), 'train/reward_metrics/format_reward': np.float64(0.75), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 25/1000
Evaluating on eval set...


Adding requests:   0%|          | 0/500 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/500 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:41:52 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:41:53 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:41:53 [gpu_worker.py:132] Sleep mode freed 15.27 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:41:53 [abstract.py:306] It took 0.607023 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 316)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [11, 63, 3, 96], create an equation that equals 16. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:11<00:00,  1.38it/s]


INFO 01-01 23:42:08 [abstract.py:324] It took 0.353316 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 26/1000 [12:35<8:14:54, 30.49s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.002169034412861617), 'train/rewards': np.float64(0.8046875), 'train/reward_metrics/format_reward': np.float64(0.7890625), 'train/reward_metrics/equation_reward': np.float64(0.015625), 'eval/rewards': np.float64(0.664), 'eval/reward_metrics/format_reward': np.float64(0.63), 'eval/reward_metrics/equation_reward': np.float64(0.034)}
Iteration 26/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:42:14 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:42:15 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:42:15 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:42:15 [abstract.py:306] It took 0.600137 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 157)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [11, 45, 53], create an equation that equals 88. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:13<00:00,  1.20it/s]


INFO 01-01 23:42:31 [abstract.py:324] It took 0.321170 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 27/1000 [12:58<7:40:03, 28.37s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0014619479843264616), 'train/rewards': np.float64(0.7578125), 'train/reward_metrics/format_reward': np.float64(0.7421875), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 27/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:42:40 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:42:40 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:42:40 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:42:40 [abstract.py:306] It took 0.567650 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 130)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [23, 76, 51, 79], create an equation that equals 31. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:42:59 [abstract.py:324] It took 0.378563 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 28/1000 [13:26<7:36:30, 28.18s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0010690906820776257), 'train/rewards': np.float64(0.8125), 'train/reward_metrics/format_reward': np.float64(0.78125), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 28/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:43:07 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:43:08 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:43:08 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:43:08 [abstract.py:306] It took 0.560047 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 188)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [60, 25, 35, 23], create an equation that equals 73. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.05it/s]


INFO 01-01 23:43:26 [abstract.py:324] It took 0.323353 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 29/1000 [13:53<7:29:43, 27.79s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0011824934045416061), 'train/rewards': np.float64(0.890625), 'train/reward_metrics/format_reward': np.float64(0.859375), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 29/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:43:34 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:43:35 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:43:35 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:43:35 [abstract.py:306] It took 0.568766 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 83)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [43, 49, 41, 41], create an equation that equals 45. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:43:54 [abstract.py:324] It took 0.371932 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 30/1000 [14:21<7:28:23, 27.74s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0014240631429326069), 'train/rewards': np.float64(0.8203125), 'train/reward_metrics/format_reward': np.float64(0.8046875), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 30/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:44:01 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:44:01 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:44:01 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:44:01 [abstract.py:306] It took 0.567198 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 155)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [14, 56, 74, 44], create an equation that equals 76. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:13<00:00,  1.14it/s]


INFO 01-01 23:44:18 [abstract.py:324] It took 0.289836 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 31/1000 [14:45<7:13:46, 26.86s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0018330336842625544), 'train/rewards': np.float64(0.90625), 'train/reward_metrics/format_reward': np.float64(0.890625), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 31/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:44:27 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:44:27 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:44:27 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:44:27 [abstract.py:306] It took 0.569205 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 62)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [6, 88, 66], create an equation that equals 16. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:44:46 [abstract.py:324] It took 0.289943 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 32/1000 [15:13<7:16:05, 27.03s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0012743145694553292), 'train/rewards': np.float64(0.9296875), 'train/reward_metrics/format_reward': np.float64(0.9296875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 32/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:44:51 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:44:52 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:44:52 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:44:52 [abstract.py:306] It took 0.555759 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 135)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [93, 28, 79], create an equation that equals 42. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:10<00:00,  1.47it/s]


INFO 01-01 23:45:06 [abstract.py:324] It took 0.289848 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 33/1000 [15:33<6:40:35, 24.86s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.001519129345338404), 'train/rewards': np.float64(0.9453125), 'train/reward_metrics/format_reward': np.float64(0.9140625), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 33/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:45:14 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:45:14 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:45:15 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:45:15 [abstract.py:306] It took 0.557789 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 107)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [50, 41, 79], create an equation that equals 88. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:45:33 [abstract.py:324] It took 0.296247 seconds to wake up tags {'weights', 'kv_cache'}.


  3%|▎         | 34/1000 [16:00<6:52:52, 25.64s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.002360676108233483), 'train/rewards': np.float64(0.9140625), 'train/reward_metrics/format_reward': np.float64(0.8828125), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 34/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:45:42 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:45:42 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:45:42 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:45:42 [abstract.py:306] It took 0.582333 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 260)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [64, 42, 72, 8], create an equation that equals 58. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:46:01 [abstract.py:324] It took 0.296152 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▎         | 35/1000 [16:28<7:03:19, 26.32s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0015347466046525368), 'train/rewards': np.float64(0.953125), 'train/reward_metrics/format_reward': np.float64(0.9375), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 35/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:46:06 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:46:06 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:46:06 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:46:06 [abstract.py:306] It took 0.557374 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 366)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [35, 72, 23, 3], create an equation that equals 10. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.61it/s]


INFO 01-01 23:46:19 [abstract.py:324] It took 0.291132 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▎         | 36/1000 [16:46<6:23:32, 23.87s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.001563560520361991), 'train/rewards': np.float64(0.96875), 'train/reward_metrics/format_reward': np.float64(0.953125), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 36/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:46:27 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:46:28 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:46:28 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:46:28 [abstract.py:306] It took 0.565822 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 2.0, Response Length: 97)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [62, 18, 98], create an equation that equals 54. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:46:47 [abstract.py:324] It took 0.301128 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▎         | 37/1000 [17:14<6:41:23, 25.01s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0017501661922347698), 'train/rewards': np.float64(0.9609375), 'train/reward_metrics/format_reward': np.float64(0.9296875), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 37/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:46:51 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:46:52 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:46:52 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:46:52 [abstract.py:306] It took 0.607926 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 2.0, Response Length: 86)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [69, 36, 2], create an equation that equals 87. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 01-01 23:47:04 [abstract.py:324] It took 0.337975 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 38/1000 [17:31<6:04:35, 22.74s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.002218141266299701), 'train/rewards': np.float64(1.03125), 'train/reward_metrics/format_reward': np.float64(0.984375), 'train/reward_metrics/equation_reward': np.float64(0.046875)}
Iteration 38/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:47:13 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:47:13 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:47:13 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:47:13 [abstract.py:306] It took 0.561872 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 100)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [29, 60, 2, 48], create an equation that equals 83. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:47:32 [abstract.py:324] It took 0.332490 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 39/1000 [17:59<6:27:24, 24.19s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0016433078795401925), 'train/rewards': np.float64(0.9609375), 'train/reward_metrics/format_reward': np.float64(0.9296875), 'train/reward_metrics/equation_reward': np.float64(0.03125)}
Iteration 39/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:47:36 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:47:37 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:47:37 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:47:37 [abstract.py:306] It took 0.571404 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 241)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [70, 11, 87, 81], create an equation that equals 75. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.62it/s]


INFO 01-01 23:47:50 [abstract.py:324] It took 0.300778 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 40/1000 [18:17<5:57:57, 22.37s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0023503628249088816), 'train/rewards': np.float64(0.96875), 'train/reward_metrics/format_reward': np.float64(0.96875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 40/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:47:58 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:47:59 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:47:59 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:47:59 [abstract.py:306] It took 0.560723 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 208)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [3, 27, 72], create an equation that equals 48. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:48:18 [abstract.py:324] It took 0.330370 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 41/1000 [18:45<6:22:52, 23.95s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0020401990103156875), 'train/rewards': np.float64(1.078125), 'train/reward_metrics/format_reward': np.float64(0.984375), 'train/reward_metrics/equation_reward': np.float64(0.09375)}
Iteration 41/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:48:24 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:48:24 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:48:24 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:48:24 [abstract.py:306] It took 0.591926 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 131)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [41, 67, 33, 54], create an equation that equals 35. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:11<00:00,  1.36it/s]


INFO 01-01 23:48:39 [abstract.py:324] It took 0.300719 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 42/1000 [19:06<6:10:19, 23.19s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.002387535856134157), 'train/rewards': np.float64(0.9765625), 'train/reward_metrics/format_reward': np.float64(0.9609375), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 42/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:48:47 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:48:48 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:48:48 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:48:48 [abstract.py:306] It took 0.587441 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 156)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [25, 77, 62, 76], create an equation that equals 66. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.02it/s]


INFO 01-01 23:49:07 [abstract.py:324] It took 0.301930 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 43/1000 [19:34<6:31:13, 24.53s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0017355197699186367), 'train/rewards': np.float64(0.984375), 'train/reward_metrics/format_reward': np.float64(0.96875), 'train/reward_metrics/equation_reward': np.float64(0.015625)}
Iteration 43/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:49:13 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:49:13 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:49:13 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:49:13 [abstract.py:306] It took 0.580614 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 644)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [40, 26, 20], create an equation that equals 86. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:11<00:00,  1.38it/s]


INFO 01-01 23:49:28 [abstract.py:324] It took 0.330455 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 44/1000 [19:55<6:14:44, 23.52s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.002453493882033377), 'train/rewards': np.float64(0.9921875), 'train/reward_metrics/format_reward': np.float64(0.9921875), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 44/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:49:36 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:49:37 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:49:37 [gpu_worker.py:132] Sleep mode freed 14.67 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:49:37 [abstract.py:306] It took 0.565667 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 183)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [85, 93, 18], create an equation that equals 10. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


INFO 01-01 23:49:55 [abstract.py:324] It took 0.340173 seconds to wake up tags {'weights', 'kv_cache'}.


  4%|▍         | 45/1000 [20:22<6:33:07, 24.70s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0020428710309081406), 'train/rewards': np.float64(1.0), 'train/reward_metrics/format_reward': np.float64(0.953125), 'train/reward_metrics/equation_reward': np.float64(0.046875)}
Iteration 45/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:50:03 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:50:04 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:50:04 [gpu_worker.py:132] Sleep mode freed 14.68 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:50:04 [abstract.py:306] It took 0.578193 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 93)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [24, 22, 20, 9], create an equation that equals 57. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) /

Gradient Accumulation: 100%|██████████| 16/16 [00:15<00:00,  1.06it/s]


INFO 01-01 23:50:22 [abstract.py:324] It took 0.326426 seconds to wake up tags {'weights', 'kv_cache'}.


  5%|▍         | 46/1000 [20:49<6:42:23, 25.31s/it]

KEY METRICS: {'train/kl_penalty': np.float64(0.0024279155662501094), 'train/rewards': np.float64(0.9765625), 'train/reward_metrics/format_reward': np.float64(0.9765625), 'train/reward_metrics/equation_reward': np.float64(0.0)}
Iteration 46/1000


Adding requests:   0%|          | 0/16 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

INFO 01-01 23:50:30 [block_pool.py:447] Successfully reset prefix cache
INFO 01-01 23:50:31 [cumem.py:239] CuMemAllocator: sleep freed 14.53 GiB memory in total, of which 5.88 GiB is backed up in CPU and the rest 8.65 GiB is discarded directly.
INFO 01-01 23:50:31 [gpu_worker.py:132] Sleep mode freed 14.62 GiB memory, 44.93 GiB memory is still in use.
INFO 01-01 23:50:31 [abstract.py:306] It took 0.561545 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 617)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [61, 12, 39, 78], create an equation that equals 51. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2)

Gradient Accumulation:  38%|███▊      | 6/16 [00:06<00:10,  1.02s/it]
  5%|▍         | 46/1000 [21:04<7:17:14, 27.50s/it]


KeyboardInterrupt: 
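
The `KEY METRICS` lines in the output above are printed Python dicts, so they can be scraped from a saved log to track how the reward evolves. Below is a minimal sketch, assuming the log text is available as plain strings; `parse_key_metrics` is a hypothetical helper written for illustration and is not part of the notebook. In every iteration logged here, `train/rewards` equals the sum of the format and equation rewards, which the example checks for iteration 41.

```python
import re

# Hypothetical helper (not part of the notebook): extracts metric name/value
# pairs from a printed "KEY METRICS: {...}" line such as the ones logged above.
_METRIC_RE = re.compile(r"'([^']+)':\s*np\.float64\(([^)]+)\)")


def parse_key_metrics(line):
    """Return {metric_name: float} for one 'KEY METRICS' line, else {}."""
    if not line.startswith("KEY METRICS:"):
        return {}
    return {name: float(value) for name, value in _METRIC_RE.findall(line)}


# Sample line copied from the log output for iteration 41.
sample = (
    "KEY METRICS: {'train/kl_penalty': np.float64(0.0020401990103156875), "
    "'train/rewards': np.float64(1.078125), "
    "'train/reward_metrics/format_reward': np.float64(0.984375), "
    "'train/reward_metrics/equation_reward': np.float64(0.09375)}"
)

metrics = parse_key_metrics(sample)
total = (
    metrics["train/reward_metrics/format_reward"]
    + metrics["train/reward_metrics/equation_reward"]
)
print(metrics["train/rewards"], total)
```

Read this way, the logged iterations show the format reward already close to 1.0 while the equation reward is still near zero.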


## Citation

If you use this codebase in your research, please cite us using:

```bibtex
@misc{Kazemnejad2025:NanoAhaMoment,
  author       = {Amirhossein Kazemnejad and Milad Aghajohari and Alessandro Sordoni and Aaron Courville and Siva Reddy},
  title        = {Nano Aha! Moment: Lunch Break Reproduction of DeepSeek R1-Zero from Scratch},
  year         = {2025},
  howpublished = {\url{https://github.com/McGill-NLP/nano-aha-moment}},
  note         = {GitHub repository}
}
```