# GRPO Reasoning Model with SmolLM2-135M using Unsloth

## Overview
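This notebook introduces **GRPO (Group Relative Policy Optimization)** and trains SmolLM2-135M to write step-by-step solutions. The training run itself is supervised fine-tuning on chain-of-thought examples, a simplified stand-in for full GRPO (see Step 14).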

### What is GRPO?
- GRPO is a reinforcement learning (RL) method for training reasoning models
- The model **generates** its own responses; it is not given chosen/rejected pairs
- A reward function scores each generated response
- Each response is optimized relative to the other responses sampled for the same prompt (its group)
- DeepSeek-R1 was trained with GRPO; o1 is a comparable reasoning model, but its training method is not public
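
The "relative to the group" idea can be written compactly. As a sketch of the outcome-reward case described in the DeepSeekMath paper (the symbols $G$, $r_i$, and $\hat{A}_i$ are notation introduced here, not used elsewhere in this notebook): for one prompt, the model samples $G$ responses with rewards $r_1, \dots, r_G$, and each response receives the advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Responses that score above the group average get a positive advantage and are reinforced; those below the average are discouraged. If every response in a group gets the same reward, the numerator is zero for all of them and the group provides no learning signal.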

### Model Details
- **Model**: SmolLM2-135M-Instruct
- **Method**: LoRA (r=16), trained with supervised fine-tuning as a stand-in for GRPO
- **Task**: Reasoning / Problem solving
- **Dataset**: 5 hand-written arithmetic problems, repeated 20 times (100 examples)

### Key Difference from DPO:
- **DPO**: Uses pre-labeled chosen/rejected pairs
- **GRPO**: Model generates answers, scored by reward function

## Step 1: Install Required Libraries

In [1]:
# Install Unsloth
!pip install -q unsloth

# Install GRPO dependencies
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

## Step 2: Import Libraries

In [2]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset
import os
import re

# Disable wandb
os.environ["WANDB_DISABLED"] = "true"

print("✓ All libraries imported successfully!")
print("✓ Ready for GRPO reasoning training")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
✓ All libraries imported successfully!
✓ Ready for GRPO reasoning training


## Step 3: Configure Model Parameters

In [3]:
# Model configuration
max_seq_length = 512
dtype = None
load_in_4bit = True

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"

print(f"Configuration:")
print(f"  Model: {model_name}")
print(f"  Method: GRPO (Group Relative Policy Optimization)")
print(f"  Task: Reasoning and problem solving")

Configuration:
  Model: HuggingFaceTB/SmolLM2-135M-Instruct
  Method: GRPO (Group Relative Policy Optimization)
  Task: Reasoning and problem solving


## Step 4: Load the Pre-trained Model

In [4]:
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("✓ Model loaded successfully!")

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


HuggingFaceTB/SmolLM2-135M-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.
✓ Model loaded successfully!


## Step 5: Prepare Model for GRPO with LoRA

In [5]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("✓ Model prepared for GRPO training with LoRA!")

Unsloth 2025.11.2 patched 30 layers with 30 QKV layers, 30 O layers and 30 MLP layers.


✓ Model prepared for GRPO training with LoRA!
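
The size of the LoRA adapters can be checked by hand. The sketch below assumes the SmolLM2-135M layer dimensions listed in its comments (taken from the model's published config; verify them against `model.config` before relying on them) and counts `r * (in_features + out_features)` parameters per adapted projection. The result agrees with the "Trainable parameters = 4,884,480" line that the training log prints in Step 10.

```python
# Assumed SmolLM2-135M dimensions (check model.config to confirm).
HIDDEN = 576          # hidden_size
INTERMEDIATE = 1536   # MLP intermediate_size
KV_DIM = 192          # 3 key/value heads x 64 head dim (grouped-query attention)
NUM_LAYERS = 30
RANK = 16             # LoRA r, as set in the cell above

# (in_features, out_features) of every projection listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```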


## Step 6: Create Reasoning Dataset

For GRPO, we need problems where we can verify correctness.
We'll use simple math problems as an example.
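
"Verify correctness" can be made concrete with a small reward function. This is a minimal sketch, not part of the notebook's training run: it assumes completions end with a `Final Answer: <value>` line (the format introduced in Step 7) and that only the first number after that marker matters. The `(completions, answer, **kwargs)` signature mirrors the style of custom reward functions accepted by TRL's `GRPOTrainer`, where dataset columns are passed as keyword arguments; the function names are hypothetical.

```python
import re

def extract_final_number(text):
    """Return the first number after 'Final Answer:' as a float, or None if absent."""
    match = re.search(r"Final Answer:\s*\$?(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def correctness_reward(completions, answer, **kwargs):
    """Score each completion 1.0 if its final number matches the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answer):
        predicted = extract_final_number(completion)
        # Reuse the same parser so references like "180 km" reduce to 180.0.
        expected = extract_final_number("Final Answer: " + reference)
        is_correct = predicted is not None and predicted == expected
        rewards.append(1.0 if is_correct else 0.0)
    return rewards

sample_completions = [
    "1. Add 15 and 27\nFinal Answer: 42",
    "1. Add 15 and 27\nFinal Answer: 41",
    "I am not sure.",
]
print(correctness_reward(sample_completions, answer=["42", "42", "42"]))
```

A binary reward like this is easy to verify but sparse; partial credit for well-formed output is a common addition.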

In [6]:
# Create a simple math reasoning dataset
# In real GRPO, the model would generate solutions and we'd score them
math_problems = [
    {
        "problem": "What is 15 + 27?",
        "reasoning": "Let me solve this step by step:\n1. I need to add 15 and 27\n2. 15 + 27 = 42\n",
        "answer": "42"
    },
    {
        "problem": "If a book costs $12 and I have $50, how many books can I buy?",
        "reasoning": "Let me think through this:\n1. I have $50 total\n2. Each book costs $12\n3. $50 ÷ $12 ≈ 4.17\n4. I can only buy whole books\n",
        "answer": "4 books"
    },
    {
        "problem": "What is 8 × 7?",
        "reasoning": "Let me calculate:\n1. I need to multiply 8 and 7\n2. 8 × 7 = 56\n",
        "answer": "56"
    },
    {
        "problem": "If a train travels 60 km in 1 hour, how far will it travel in 3 hours?",
        "reasoning": "Step by step:\n1. Speed = 60 km/hour\n2. Time = 3 hours\n3. Distance = Speed × Time\n4. Distance = 60 × 3 = 180 km\n",
        "answer": "180 km"
    },
    {
        "problem": "What is 100 - 37?",
        "reasoning": "Let me solve:\n1. I need to subtract 37 from 100\n2. 100 - 37 = 63\n",
        "answer": "63"
    },
]

# Repeat the 5 problems 20 times to get 100 examples (exact copies, no variations)
extended_problems = []
for _ in range(20):
    extended_problems.extend(math_problems)

dataset = Dataset.from_list(extended_problems)

print(f"✓ Reasoning dataset created: {len(dataset)} examples")
print("\nSample problem:")
print(f"Problem: {dataset[0]['problem']}")
print(f"Reasoning: {dataset[0]['reasoning'][:100]}...")
print(f"Answer: {dataset[0]['answer']}")

✓ Reasoning dataset created: 100 examples

Sample problem:
Problem: What is 15 + 27?
Reasoning: Let me solve this step by step:
1. I need to add 15 and 27
2. 15 + 27 = 42
...
Answer: 42


## Step 7: Format Dataset with Chain-of-Thought

Format the dataset to encourage step-by-step reasoning.

In [7]:
# Chain-of-Thought prompt template
reasoning_prompt = """Problem: {problem}

Let's solve this step by step:
{reasoning}
Final Answer: {answer}"""

EOS_TOKEN = tokenizer.eos_token

def format_reasoning(example):
    """Format examples for reasoning training."""
    text = reasoning_prompt.format(
        problem=example['problem'],
        reasoning=example['reasoning'],
        answer=example['answer']
    ) + EOS_TOKEN
    return {"text": text}

dataset = dataset.map(format_reasoning)

print("✓ Dataset formatted with chain-of-thought reasoning!")
print("\nFormatted example:")
print(dataset[0]["text"])

✓ Dataset formatted with chain-of-thought reasoning!

Formatted example:
Problem: What is 15 + 27?

Let's solve this step by step:
Let me solve this step by step:
1. I need to add 15 and 27
2. 15 + 27 = 42

Final Answer: 42<|im_end|>
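
The formatted example above contains the reference reasoning and answer, which suits supervised fine-tuning. A GRPO run would instead hand the model only the problem and keep the reference answer aside for the reward function. The sketch below shows one possible prompt-only record; the `prompt` column name follows the convention used by TRL's `GRPOTrainer`, while `GRPO_PROMPT` and `to_grpo_record` are hypothetical names introduced here.

```python
# Hypothetical prompt-only template: the model must write the reasoning itself.
GRPO_PROMPT = "Problem: {problem}\n\nLet's solve this step by step:\n"

def to_grpo_record(example):
    """Keep the prompt and the reference answer; drop the written-out reasoning."""
    return {
        "prompt": GRPO_PROMPT.format(problem=example["problem"]),
        "answer": example["answer"],
    }

example = {
    "problem": "What is 15 + 27?",
    "reasoning": "Let me solve this step by step:\n1. I need to add 15 and 27\n2. 15 + 27 = 42\n",
    "answer": "42",
}
record = to_grpo_record(example)
print(record["prompt"])
print("Reference answer:", record["answer"])
```

With a `datasets.Dataset`, the same function could be applied through `dataset.map(to_grpo_record, remove_columns=["problem", "reasoning"])`.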


## Step 8: Configure Training Arguments

In [8]:
# Training configuration for reasoning
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs_grpo",
    report_to="none",
)

print("✓ Training arguments configured for reasoning!")

✓ Training arguments configured for reasoning!
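
The batch settings above fix how much data the run sees. A quick arithmetic check, using only the values from this notebook (2 per device, 4 accumulation steps, 1 GPU, 60 steps, 100 examples), shows why the training log in Step 10 reports a total batch size of 8 and "Num Epochs = 5":

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1           # single T4 in this run
max_steps = 60
num_examples = 100     # size of the dataset built in Step 6

effective_batch = per_device_batch * grad_accum * num_gpus  # examples per optimizer step
samples_seen = effective_batch * max_steps                  # examples processed in total
epochs = samples_seen / num_examples                        # passes over the dataset

print(effective_batch, samples_seen, epochs)  # 8 480 4.8
```

Since the dataset holds only 5 distinct problems, each one is seen roughly 96 times, so a falling loss here mostly reflects memorization of those 5 examples.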


## Step 9: Initialize Trainer

In [9]:
# Initialize trainer for reasoning
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

print("✓ Reasoning trainer initialized!")

✓ Reasoning trainer initialized!


## Step 10: Train the Reasoning Model

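Train the model to reason step-by-step. This run is supervised fine-tuning on the formatted examples ("GRPO-style" in name only): no completions are sampled and no reward is computed.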

In [10]:
print("Starting GRPO-style reasoning training...")
print("Teaching the model to think step-by-step!\n")

trainer_stats = trainer.train()

print("\n" + "="*60)
print("✓ Reasoning training completed!")
print("="*60)
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"\nModel can now reason step-by-step!")

Starting GRPO-style reasoning training...
Teaching the model to think step-by-step!



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 5 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 4,884,480 of 139,399,488 (3.50% trained)


Step,Training Loss
1,1.5986
2,1.6288
3,1.7089
4,1.7109
5,1.5992
6,1.6943
7,1.786
8,1.5833
9,1.6655
10,1.6705



✓ Reasoning training completed!
Training time: 104.90 seconds
Training loss: 1.3681

Model can now reason step-by-step!


## Step 11: Test Reasoning Ability

Test if the model can solve new problems with reasoning.

In [11]:
FastLanguageModel.for_inference(model)

# Test with a new problem
test_problem = "Problem: What is 23 + 19?\n\nLet's solve this step by step:\n"

print("Test Problem:")
print(test_problem)
print("\n" + "="*50 + "\n")

inputs = tokenizer([test_problem], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    use_cache=True,
    do_sample=True,   # temperature and top_p only take effect when sampling is enabled
    temperature=0.3,  # Lower temperature for less random (but not deterministic) output
    top_p=0.9,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Model's Reasoning:")
print(response)
print("\n" + "="*50)
print("Note: Model should show step-by-step reasoning!")

Test Problem:
Problem: What is 23 + 19?

Let's solve this step by step:



Model's Reasoning:
Problem: What is 23 + 19?

Let's solve this step by step:

Step 1: 23 + 19 = 32

Step 2: 32 + 19 = 51

So, 23 + 19 is 51.

Step 3: 51 + 19 = 60

So, 23 + 19 is 60.

Step 4: 60 + 19 = 61

So, 23 + 19 is 61.

So, 23 + 19 is 61.

Step 5: 61 + 19 = 62



Note: Model should show step-by-step reasoning!


## Step 12: More Reasoning Tests

In [12]:
def test_reasoning(problem):
    """Test the reasoning model."""
    prompt = f"Problem: {problem}\n\nLet's solve this step by step:\n"
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        use_cache=True,
        do_sample=True,  # temperature and top_p only take effect when sampling is enabled
        temperature=0.3,
        top_p=0.9,
    )

    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(f"\n{'='*80}")
    print(f"Problem: {problem}")
    print(f"\nModel's Step-by-Step Reasoning:")
    # Split on the same marker we test for, so a partial match cannot raise IndexError
    marker = "Let's solve this step by step:"
    print(response.split(marker, 1)[1] if marker in response else response)
    print(f"{'='*80}")

print("Testing reasoning model with various problems...\n")

test_reasoning("What is 45 - 17?")
test_reasoning("If I have 3 apples and buy 5 more, how many do I have?")
test_reasoning("What is 6 × 8?")

Testing reasoning model with various problems...


Problem: What is 45 - 17?

Model's Step-by-Step Reasoning:

45 - 17 = 12
12 = 17

So, the answer is: 12

Problem: If I have 3 apples and buy 5 more, how many do I have?

Model's Step-by-Step Reasoning:


First, let's start by determining what we know:

We know that we have 3 apples.

We also know that we buy 5 more apples.

Now, let's solve this problem:



Problem: What is 6 × 8?

Model's Step-by-Step Reasoning:


Step 1: 6 × 8 = 48

Step 2: 48 = 8

So, 8 is the answer.

The answer is: 8


## Step 13: Save the Reasoning Model

In [13]:
# Save reasoning model
model.save_pretrained("smollm2_135m_reasoning")
tokenizer.save_pretrained("smollm2_135m_reasoning")

print("✓ Reasoning model saved to 'smollm2_135m_reasoning' directory")
print("\nThe model can now:")
print("  ✅ Solve problems step-by-step")
print("  ✅ Show its reasoning process")
print("  ✅ Think like o1/DeepSeek-R1 models")

✓ Reasoning model saved to 'smollm2_135m_reasoning' directory

The model can now:
  ✅ Solve problems step-by-step
  ✅ Show its reasoning process
  ✅ Think like o1/DeepSeek-R1 models


## Step 14: Understanding GRPO

In [14]:
print("\n" + "="*80)
print("UNDERSTANDING GRPO (Group Relative Policy Optimization)")
print("="*80)

print("\n📚 What is GRPO?")
print("  GRPO is an advanced RL method where:")
print("  • Model GENERATES multiple solution attempts")
print("  • Solutions are scored by a reward function")
print("  • Model learns from its own generations")
print("  • Relative comparison within generated group")

print("\n🔬 How GRPO Works:")
print("  1. Given a problem, model generates N solutions")
print("  2. Each solution is scored (correct/incorrect, quality)")
print("  3. Model learns to prefer higher-scoring solutions")
print("  4. Iteratively improves reasoning ability")

print("\n📊 Difference from DPO:")
print("  DPO:  Uses pre-labeled chosen/rejected pairs")
print("  GRPO: Model generates, then learns from scores")

print("\n🎯 This Notebook's Approach:")
print("  • We trained on problems with step-by-step solutions")
print("  • Model learns to reason before answering")
print("  • Similar to Chain-of-Thought prompting")
print("  • Supervised stand-in for GRPO (no sampling or rewards)")

print("\n✅ Real GRPO (as in DeepSeek-R1) would:")
print("  • Generate multiple solution attempts")
print("  • Use verifier to check correctness")
print("  • Learn from verification scores")
print("  • Iterate for many rounds")

print("\n💡 Key Insight:")
print("  Teaching models to 'think' step-by-step can improve:")
print("  • Problem-solving ability")
print("  • Answer accuracy")
print("  • Explainability")
print("  • Trust and verification")

print("\n" + "="*80)


UNDERSTANDING GRPO (Group Relative Policy Optimization)

📚 What is GRPO?
  GRPO is an advanced RL method where:
  • Model GENERATES multiple solution attempts
  • Solutions are scored by a reward function
  • Model learns from its own generations
  • Relative comparison within generated group

🔬 How GRPO Works:
  1. Given a problem, model generates N solutions
  2. Each solution is scored (correct/incorrect, quality)
  3. Model learns to prefer higher-scoring solutions
  4. Iteratively improves reasoning ability

📊 Difference from DPO:
  DPO:  Uses pre-labeled chosen/rejected pairs
  GRPO: Model generates, then learns from scores

🎯 This Notebook's Approach:
  • We trained on problems with step-by-step solutions
  • Model learns to reason before answering
  • Similar to Chain-of-Thought prompting
  • Supervised stand-in for GRPO (no sampling or rewards)

✅ Real GRPO (as in DeepSeek-R1) would:
  • Generate multiple solution attempts
  • Use verifier to check correctness
  • Learn from verification scores
  • Iterate for many rounds

💡 Key Insight:
  Teaching models to 'think' step-by-step can improve:
  • Problem-solving ability
  • Answer accuracy
  • Explainability
  • Trust and verification
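
The four steps under "How GRPO Works" can be sketched as a toy loop. This is an illustration under heavy simplifying assumptions, not the update rule of any library: the "model" is a softmax over three fixed candidate answers, the reward is 1 for the correct answer and 0 otherwise, and the update is a plain policy-gradient step that omits the clipped probability ratio and KL penalty of the full GRPO objective. All names (`toy_grpo`, `softmax`) are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_grpo(candidates, correct, group_size=8, steps=200, lr=0.5, seed=0):
    """Train a softmax policy over fixed candidate answers with group-relative advantages."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Generate a group of N attempts from the current policy.
        group = rng.choices(range(len(candidates)), weights=probs, k=group_size)
        # 2. Score every attempt with a verifiable reward.
        rewards = [1.0 if candidates[i] == correct else 0.0 for i in group]
        mean = sum(rewards) / group_size
        std = (sum((r - mean) ** 2 for r in rewards) / group_size) ** 0.5
        if std == 0.0:
            continue  # all attempts scored the same: no learning signal
        # 3. Prefer attempts that beat the group average.
        advantages = [(r - mean) / std for r in rewards]
        grad = [0.0] * len(logits)
        for action, adv in zip(group, advantages):
            for j in range(len(logits)):
                indicator = 1.0 if j == action else 0.0
                grad[j] += adv * (indicator - probs[j])  # d log pi(action) / d logit_j
        # 4. Update the policy and repeat.
        logits = [l + lr * g / group_size for l, g in zip(logits, grad)]
    return softmax(logits)

final_probs = toy_grpo(candidates=[41, 42, 43], correct=42)
print([round(p, 3) for p in final_probs])
```

Starting from equal probabilities, the policy shifts mass toward the rewarded answer. Once nearly every sampled group is all-correct, the standard deviation is zero and updates stop, which is the "no learning signal" case in the comments.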

## Summary

### What We Did:
1. ✅ Loaded the SmolLM2-135M-Instruct model
2. ✅ Created reasoning dataset with step-by-step solutions
3. ✅ Trained the model on those solutions with supervised fine-tuning (a simplified stand-in for GRPO)
4. ✅ Tested reasoning on new problems (the sample outputs above still contain arithmetic errors)
5. ✅ Saved the reasoning model

### Key Concepts:
- **GRPO**: Group Relative Policy Optimization
- **Chain-of-Thought**: Step-by-step reasoning
- **Self-generated training**: In GRPO, the model learns from its own scored outputs
- **Reasoning models**: For example o1 and DeepSeek-R1

### Comparison Across All Colabs:
- **Colab 1**: Full fine-tuning (high rank, all params)
- **Colab 2**: LoRA (low rank, efficient)
- **Colab 3**: DPO (preference learning)
- **Colab 4**: GRPO (reasoning model) ⭐

### When to Use GRPO/Reasoning Training:
- ✅ Need step-by-step problem solving
- ✅ Want explainable AI
- ✅ Math, logic, coding tasks
- ✅ Can verify correctness automatically

### Next Steps:
1. ✅ Record video explaining GRPO concept
2. ✅ Demonstrate reasoning examples
3. ✅ Compare with o1-style models
4. ➡️ Move to **Colab 5** for continued pre-training

### Resources:
- GRPO/Reasoning Guide: https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/tutorial-train-your-own-reasoning-model-with-grpo
- Unsloth R1 Blog: https://unsloth.ai/blog/r1-reasoning
- Chain-of-Thought: https://arxiv.org/abs/2201.11903