<a href="https://colab.research.google.com/github/pastrop/kaggle/blob/master/GRPO_Study_Ntbk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO Study Notebook

This notebook was created purely as an example. Don't expect production-quality code here. It is a byproduct of me reading the DeepSeek paper and working out how the GRPO algorithm works. I looked at other people's code for it and found it a bit vague.

**Setup**
1.   An LLM is used to create training examples for training the Policy (I am using GPT-2 and Facebook's opt-1.3b simply because they are easy to run without GPUs).
2.   There is a pretrained Reward Model. I am using a fully trained sentiment classifier because I need something quick and easy.
3.   The policy is a trivial two-layer feed-forward net.
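
For orientation, the GRPO objective is commonly written along the following lines. This is a sketch recalled from the DeepSeekMath formulation, not something derived in this notebook; the symbols $G$ (group size), $\rho_{i,t}$ (probability ratio), $\epsilon$ (clip range) and $\beta$ (KL weight) are notation chosen here for illustration.

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( \rho_{i,t}\,\hat{A}_{i},\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_{i} \big) - \beta\, D_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \Big) \right]
$$

$$
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})}, \qquad \hat{A}_{i} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Here $q$ is the prompt, $o_1, \dots, o_G$ are the responses sampled for it, and $r_i$ is the reward of response $o_i$. The code in this notebook simplifies several of these pieces, so treat the formula as a reference point rather than a description of the cells that follow.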



In [1]:
%%capture
!pip install torch transformers

# Finetuning LLM with the GRPO

generating training examples using LLM

In [30]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn as nn # Import the torch.nn module and alias it as nn
import torch.optim as optim # Import the torch.optim module and alias it as optim
import torch.nn.functional as F

**Using a trained sentiment model as a Reward Model**

In [25]:
# Generate responses function for Policy Network training
def generate_responses(prompt, num_responses=5):
    # Load the OPT model & tokenizer (reloaded on every call; fine for a demo, slow otherwise)
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

    # Set pad token explicitly
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenize the input with padding and attention mask
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)# max_length is specific to facebook/opt-1.3b
    # Generate multiple responses using sampling
    responses = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=50,
        do_sample=True,
        top_k=50,
        top_p=0.9,
        temperature=0.7,
        num_return_sequences=num_responses,
        pad_token_id=tokenizer.pad_token_id  # Ensures padding works correctly
    )

    # Decode responses (remove prompt from output)
    decoded_responses = [
        tokenizer.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=True).strip()
        for output in responses
    ]

    return decoded_responses

In [40]:
# Test for generate_responses()
prompts = ["What people like or dislike about working out?",
           "What people like or dislike about deserts?",
           "What people like or dislike about traveling?",
           "What people like or dislike about New York City?",
           "What people like or dislike about living in France?"]
responses={}
for prompt in prompts:
    responses[prompt] = generate_responses(prompt)

In [41]:
responses["What people like or dislike about living in France?"]

['If you are thinking of moving to France, it’s important to know what people like or dislike about living there. Here are a few of the most common reasons people move to',
 "It's a great country, I love it. I love the history, the culture, the food, the weather, the wine, the wine and the women.",
 'I love the French, but there is a lot to dislike.\n\nI am in France for 3 months and I am enjoying the country.\n\nI am not sure what I',
 'The weather, the people, the food, the culture, the landscape, the architecture, the history, the art, the history of architecture, the history of architecture, the food, the',
 'I would say that the French are very nice people.']

In [31]:
# Load sentiment classifier
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
reward_tokenizer = AutoTokenizer.from_pretrained(model_name)

def get_reward_scores(responses):
    """Scores responses using a sentiment classifier."""
    inputs = reward_tokenizer(responses, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # No gradient calculation needed
        outputs = reward_model(**inputs)
    scores = torch.softmax(outputs.logits, dim=1)[:, 1]  # Probability of "positive" class
    return scores.detach()  # Detach to prevent computation graph tracking


In [42]:
#Test Run for the Reward Model (This is a test cell just to make sure get_reward_scores works properly)
'''
#Example responses
responses = [
    "I love exercising, it makes me feel amazing!",
    "Exercise is okay, but it's tiring.",
    "I hate exercising, it's the worst!"
]
'''
test_responses = responses["What people like or dislike about living in France?"]
# Get reward scores
reward_scores = get_reward_scores(test_responses)

# Print scores
for i, (test_response, score) in enumerate(zip(test_responses, reward_scores)):
    print(f"Response {i+1}: {test_response}\nScore: {score.item():.3f}\n")


Response 1: If you are thinking of moving to France, it’s important to know what people like or dislike about living there. Here are a few of the most common reasons people move to
Score: 0.995

Response 2: It's a great country, I love it. I love the history, the culture, the food, the weather, the wine, the wine and the women.
Score: 1.000

Response 3: I love the French, but there is a lot to dislike.

I am in France for 3 months and I am enjoying the country.

I am not sure what I
Score: 0.998

Response 4: The weather, the people, the food, the culture, the landscape, the architecture, the history, the art, the history of architecture, the history of architecture, the food, the
Score: 1.000

Response 5: I would say that the French are very nice people.
Score: 1.000
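
The scores above are all very close to 1, which matters because the advantage is computed relative to the group. The sketch below takes the five rounded scores from the output above and compares the mean-baseline advantage used later in this notebook with a group-normalized variant. The function names and the `eps` constant are assumptions made for illustration.

```python
import torch

# Rounded reward scores copied from the output above
rewards = torch.tensor([0.995, 1.000, 0.998, 1.000, 1.000])

def mean_baseline_advantage(r):
    """Advantage as used later in this notebook: reward minus the group mean."""
    return r - r.mean()

def group_normalized_advantage(r, eps=1e-8):
    """Group-relative advantage: standardize the rewards within the group.
    eps is an assumed small constant that avoids division by zero."""
    return (r - r.mean()) / (r.std() + eps)

adv_mean = mean_baseline_advantage(rewards)
adv_norm = group_normalized_advantage(rewards)
print("mean-baseline:", adv_mean)
print("normalized:   ", adv_norm)
```

With a saturated reward model the mean-baseline advantages are tiny, so the policy gradient signal is weak; dividing by the group standard deviation rescales them to order one.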



**Define a Policy Network**

In [43]:
# Define a simple policy network that scores responses
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # Output: scalar score for each response
        )

    def forward(self, x):
        return self.fc(x).squeeze(-1)  # Ensure output is [batch_size] shape

# Instantiate the policy network
policy_net = PolicyNetwork(input_dim=768)  # GPT-2 embedding size
policy_optimizer = optim.Adam(policy_net.parameters(), lr=1e-5)

training policy network with KL term

In [51]:
# Load GPT model & tokenizer
gpt_model = AutoModelForCausalLM.from_pretrained("gpt2")
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token # Use eos_token as pad_token for gpt2

'''
Logits reshaping to avoid the dimension mismatch error:
    1. Reshape new_logits: inside the compute_kl_divergence function, the new_logits tensor is reshaped
       using unsqueeze(-1) and repeat to match the dimensions of original_logits. This ensures that both
       tensors have compatible shapes for calculating the KL divergence.
    2. Apply softmax after reshaping: the softmax function is applied to new_logits after it has been
       reshaped to match the dimensions of original_logits. This ensures that the KL divergence is
       calculated between probability distributions with the correct shape.
'''



def compute_kl_divergence(original_logits, new_logits):
    """Compute KL divergence between the original GPT-2 outputs and the policy network outputs."""
    # Reshape new_logits to match original_logits.
    # Note: repeating one scalar across the vocabulary makes the softmax below uniform in each row.
    new_logits = new_logits.unsqueeze(-1).repeat(1, original_logits.shape[1])

    original_probs = F.softmax(original_logits, dim=-1)
    new_log_probs = F.log_softmax(new_logits, dim=-1)  # Log-softmax of new_logits after reshaping (numerically stable)
    # F.kl_div(input=log Q, target=P) computes KL(P || Q), with P = original and Q = policy
    kl_div = F.kl_div(new_log_probs, original_probs, reduction="batchmean")
    return kl_div

def train_policy_network_with_kl(responses, reward_scores, beta=0.01):
    """Updates the policy network using GRPO-style policy gradients with KL divergence."""

    # Get response embeddings from GPT-2
    response_inputs = gpt_tokenizer(responses, return_tensors="pt", padding=True, truncation=True)
    # Ensure all token IDs are within the model's vocabulary
    response_inputs["input_ids"] = response_inputs["input_ids"].clip(0, gpt_model.config.vocab_size -1)
    response_embeddings = gpt_model.transformer.wte(response_inputs["input_ids"]).mean(dim=1)  # Mean token embedding

    # Get logits from the original GPT-2 model (before policy updates)
    with torch.no_grad():
        original_logits = gpt_model(response_inputs["input_ids"]).logits.mean(dim=1)

    # Get predicted scores from the policy network
    predicted_scores = policy_net(response_embeddings).squeeze()

    # Compute Advantage (A = reward - baseline)
    baseline = reward_scores.mean()
    advantage = reward_scores - baseline

    # Compute policy loss (GRPO-style clipped loss; the predicted score stands in for the probability ratio)
    clip_ratio = 0.2
    policy_loss = -torch.min(
        predicted_scores * advantage,
        torch.clamp(predicted_scores, 1 - clip_ratio, 1 + clip_ratio) * advantage
    ).mean()

    # Compute KL divergence loss
    new_logits = policy_net(response_embeddings)  # Policy's logits
    kl_loss = compute_kl_divergence(original_logits, new_logits) # Calculate KL divergence

    # Final loss with KL penalty
    total_loss = policy_loss + beta * kl_loss

    # Backpropagation
    policy_optimizer.zero_grad()
    total_loss.backward()
    policy_optimizer.step()

    return total_loss.item(), kl_loss.item()
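
For comparison with the function above, here is a minimal sketch of a clipped surrogate loss built from actual log-probability ratios, with a KL penalty against a frozen reference policy. It works on one sequence-level log-probability per response for brevity. The function name `grpo_loss`, its arguments and the dummy numbers are all assumptions made for illustration and are not part of this notebook's training code.

```python
import torch

def grpo_loss(new_logp, old_logp, ref_logp, advantages, clip_ratio=0.2, beta=0.01):
    """Clipped surrogate loss with a KL penalty.

    All inputs are 1-D tensors with one entry per sampled response
    (hypothetical sequence-level log-probs; a token-level version
    would add a sequence dimension and average over tokens).
    """
    ratio = torch.exp(new_logp - old_logp)  # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Non-negative KL estimate between the new policy and the reference policy
    log_ref_ratio = ref_logp - new_logp
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1
    return -(surrogate - beta * kl).mean(), kl.mean()

# Dummy numbers for a group of four responses
old_logp = torch.tensor([-1.2, -0.8, -2.0, -1.5])
ref_logp = old_logp.clone()
new_logp = (old_logp + torch.tensor([0.5, -0.1, 0.0, 0.3])).requires_grad_()
advantages = torch.tensor([1.0, -0.5, 0.2, -0.7])

loss, kl = grpo_loss(new_logp, old_logp, ref_logp, advantages)
loss.backward()
print(f"loss={loss.item():.4f}, kl={kl.item():.4f}")
print("grad w.r.t. new_logp:", new_logp.grad)
```

The first response has a positive advantage and a ratio above `1 + clip_ratio`, so its surrogate term is clipped and only the KL penalty contributes to its gradient. That is the mechanism that keeps a single update from moving the policy too far.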

In [52]:
# Training loop with KL regularization
num_epochs = 5
for epoch in range(num_epochs):
    training_responses = responses[prompts[epoch]]
    reward_scores = get_reward_scores(training_responses)
    loss, kl = train_policy_network_with_kl(training_responses, reward_scores)
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {loss:.4f}, KL Divergence: {kl:.4f}")


Epoch 1/5 - Loss: 0.2024, KL Divergence: 4.8082
Epoch 2/5 - Loss: 0.2074, KL Divergence: 4.6271
Epoch 3/5 - Loss: 0.1417, KL Divergence: 4.4520
Epoch 4/5 - Loss: 0.0432, KL Divergence: 4.1723
Epoch 5/5 - Loss: 0.0441, KL Divergence: 4.3544
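
The argument order of `F.kl_div` is easy to get wrong, so the sketch below checks it on toy tensors. It also reproduces the reshaping used in `compute_kl_divergence`: a single scalar repeated across the vocabulary yields a uniform distribution, so the KL term reduces to `log(V) - H(P)` and has no gradient with respect to the policy scores. The toy vocabulary size and the random stand-in tensors are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10                               # toy vocabulary size (assumption)
original_logits = torch.randn(3, vocab_size)  # stand-in for the averaged GPT-2 logits
scores = torch.randn(3, requires_grad=True)   # stand-in for the policy network outputs

# Same reshaping as compute_kl_divergence: one scalar repeated across the vocabulary
new_logits = scores.unsqueeze(-1).repeat(1, vocab_size)

p = F.softmax(original_logits, dim=-1)        # target distribution P
log_q = F.log_softmax(new_logits, dim=-1)     # log Q, uniform in every row

# F.kl_div(input=log Q, target=P) computes KL(P || Q)
kl = F.kl_div(log_q, p, reduction="batchmean")
manual = (p * (p.log() - log_q)).sum(dim=-1).mean()

# Against a uniform Q, KL(P || Q) = log(V) - H(P)
entropy = -(p * p.log()).sum(dim=-1)
closed_form = (math.log(vocab_size) - entropy).mean()

kl.backward()
print(f"kl={kl.item():.4f}, manual={manual.item():.4f}, closed_form={closed_form.item():.4f}")
print("gradient w.r.t. scores:", scores.grad)
```

If this reading is right, the KL values printed in the training loop above depend only on the GPT-2 logits for each batch of responses, not on the policy network, which would explain why they do not shrink steadily across epochs.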


Train the target model using the Policy Network

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the LLM (GPT-2)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Define a simple policy network
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # Output: scalar score for each response
        )

    def forward(self, x):
        return self.fc(x)

# Instantiate the policy network (Assumed Pretrained)
policy_net = PolicyNetwork(input_dim=768)  # GPT-2 embedding size

# Optimizer for fine-tuning GPT-2
optimizer = optim.Adam(model.parameters(), lr=1e-5)

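# Simulated training batch
texts = ["What is the capital of France?", "How does photosynthesis work?"]
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"  # Left-pad so generation continues from the real prompt tokens
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Step 1: Generate responses (generate() itself runs without gradient tracking)
with torch.no_grad():
    sequences = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=50,
        pad_token_id=tokenizer.pad_token_id,
    )

# Step 2: Re-run a forward pass WITH GRADIENT TRACKING and take GPT-2's last hidden states
num_new_tokens = sequences.shape[1] - inputs["input_ids"].shape[1]
attention_mask = torch.cat(
    [inputs["attention_mask"], torch.ones(len(texts), num_new_tokens, dtype=torch.long)], dim=1
)
hidden_states = model(sequences, attention_mask=attention_mask, output_hidden_states=True).hidden_states[-1]
response_embeddings = hidden_states.mean(dim=1)  # Mean over sequence tokens

# Step 3: Compute response scores using the policy network
response_scores = policy_net(response_embeddings).squeeze(-1)  # Shape: [batch_size]

# Step 4: Compute policy loss (GRPO-style clipped loss)
baseline = response_scores.mean()
advantage = response_scores - baseline
policy_loss = -torch.min(response_scores * advantage, torch.clamp(response_scores, 0.8, 1.2) * advantage).mean()

# 🔥 Step 5: Backpropagate loss INTO GPT-2 parameters 🔥
optimizer.zero_grad()
policy_loss.backward()  # Gradients flow through the policy network into GPT-2's parameters
optimizer.step()  # Only GPT-2 is updated; the policy network is not in this optimizer

print(f"Policy Loss: {policy_loss.item()}")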



# Policy & Reward Models
*random examples & code for testing various functions*

Reward Model to score responses (this is an untrained neural net used as an example, so it will give random rewards)

In [None]:
class RewardModel(nn.Module):
    """Simple reward model that scores responses."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Output: a single score per response
            nn.Sigmoid()  # Normalize score between 0 and 1
        )

    def forward(self, response_embedding):
        return self.fc(response_embedding)

# Instantiate reward model
reward_model = RewardModel(embedding_dim=768)  # Assuming GPT's embeddings

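# Convert responses to embeddings
# (decoded_responses comes from the GPT-2 generation cell at the end of this notebook)
# Reuse EOS as the pad token; adding a new '[PAD]' token would need model.resize_token_embeddings()
tokenizer.pad_token = tokenizer.eos_token
response_inputs = tokenizer(decoded_responses, return_tensors="pt", padding=True, truncation=True)
response_embeddings = model.transformer.wte(response_inputs["input_ids"]).mean(dim=1)  # Averaging token embeddings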

# Score each response
reward_scores = reward_model(response_embeddings).squeeze()

# Print scores
for i, (response, score) in enumerate(zip(decoded_responses, reward_scores.tolist())):
    print(f"Response {i+1}: {response} | Score: {score:.3f}")

toy example of the policy network

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.fc(state)

def compute_advantage(rewards, values, gamma=0.99):
    """Computes advantage estimates using reward and value function."""
    returns = []
    advs = []
    G = 0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # Compute return
        returns.insert(0, G)
        advs.insert(0, G - values[t])  # Advantage = Return - Value Estimate
    return torch.tensor(advs, dtype=torch.float32)

# Hyperparameters
epsilon = 0.2
gamma = 0.99
learning_rate = 0.01

# Sample data (dummy example)
state_dim = 4
action_dim = 2
states = torch.rand((5, state_dim))  # 5 sample states
actions = torch.tensor([0, 1, 0, 1, 0])  # Actions taken
old_probs = torch.tensor([0.4, 0.6, 0.5, 0.7, 0.5])  # Old policy probabilities
rewards = [1, 0, 1, 1, 0]  # Rewards received
values = [0.5, 0.4, 0.6, 0.7, 0.3]  # Value estimates

# Initialize network and optimizer
policy_net = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)

# Compute advantage
advantages = compute_advantage(rewards, values, gamma)

# Compute new policy probabilities
new_probs = policy_net(states).gather(1, actions.view(-1, 1)).squeeze()

# Compute probability ratio
ratios = new_probs / old_probs

# Compute clipped and unclipped loss
clipped_ratios = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
loss = -torch.min(ratios * advantages, clipped_ratios * advantages).mean()

# Take a gradient step (minimizing the negative objective is gradient ascent on the objective)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Policy Loss: {loss.item()}")
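
The advantage computation in the cell above relies on the backward recursion `G_t = r_t + gamma * G_{t+1}`. The sketch below checks that recursion against the direct definition of the discounted return, using the same dummy rewards and value estimates. The helper name `discounted_returns` is an assumption made for illustration.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Backward recursion G_t = r_t + gamma * G_{t+1}, as in compute_advantage above."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns, dtype=torch.float32)

rewards = [1, 0, 1, 1, 0]
values = torch.tensor([0.5, 0.4, 0.6, 0.7, 0.3])
gamma = 0.99

returns = discounted_returns(rewards, gamma)

# Direct definition: G_t = sum over k of gamma^k * r_{t+k}
direct = torch.tensor([
    sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
    for t in range(len(rewards))
])

advantages = returns - values  # Advantage = return minus value estimate
print("returns:   ", returns)
print("advantages:", advantages)
```

This return-minus-value form needs a learned value function. GRPO avoids that by using the rewards of the other responses in the same group as the baseline.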


**Train the Policy with GRPO**

Train the policy network to predict higher scores for better responses, and update it using a reinforcement learning algorithm like GRPO. The toy cell above is an example of a single training step.

**Using the Policy for Action Selection**

Once the policy is trained, it can be used to select actions in the environment.

Example: Action Selection in a Trained Policy

In [None]:
import torch

def select_action(policy_net, state):
    """Selects an action based on the trained policy."""
    with torch.no_grad():  # No gradients needed for inference
        action_probs = policy_net(state)
        action = torch.multinomial(action_probs, 1)  # Sample from the policy distribution
    return action.item()

# Example usage:
state = torch.rand((1, 4))  # Example state (assuming 4D input)
action = select_action(policy_net, state)
print(f"Selected Action: {action}")


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Simulated training data (for visualization purposes)
num_epochs = 50
loss_values = []
kl_values = []
grad_norms = []

# Dummy policy network (small for visualization purposes)
policy_net = nn.Sequential(
    nn.Linear(768, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)
policy_optimizer = optim.Adam(policy_net.parameters(), lr=1e-5)

for epoch in range(num_epochs):
    # Simulated loss and KL divergence values (randomized, for visualization only).
    # The loss is tied to the network's output so that backward() produces real gradients.
    dummy_inputs = torch.randn(8, 768)
    policy_loss = policy_net(dummy_inputs).pow(2).mean() * (torch.rand(1).item() * 2)
    kl_loss = torch.rand(1).item() * 0.2
    total_loss = policy_loss + kl_loss

    # Simulate gradient computation
    policy_optimizer.zero_grad()
    total_loss.backward()

    # Compute gradient norm (sum of the per-parameter norms)
    total_norm = 0
    for param in policy_net.parameters():
        if param.grad is not None:
            total_norm += param.grad.norm().item()
    grad_norms.append(total_norm)

    # Apply gradient step
    policy_optimizer.step()

    # Store values for plotting
    loss_values.append(total_loss.item())
    kl_values.append(kl_loss)

# Plot the results
fig, ax1 = plt.subplots()
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss / KL Divergence", color="tab:blue")
ax1.plot(loss_values, label="Total Loss", color="tab:red")
ax1.plot(kl_values, label="KL Divergence", color="tab:blue", linestyle="dashed")
ax1.legend(loc="upper right")
ax1.tick_params(axis="y", labelcolor="tab:blue")

ax2 = ax1.twinx()
ax2.set_ylabel("Gradient Norms", color="tab:green")
ax2.plot(grad_norms, label="Gradient Norms", color="tab:green", linestyle="dotted")
ax2.tick_params(axis="y", labelcolor="tab:green")
ax2.legend(loc="lower right")

plt.title("Policy Training: Loss, KL Divergence & Gradient Norms")
plt.show()


Test of using GPT-2 to generate multiple responses based on the prompt

In [None]:
# Load GPT model & tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Define a question (prompt)
prompt = "What people like or dislike about working out?"

# Tokenize the input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate multiple responses using sampling
num_responses = 5  # Generate 5 different completions
responses = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,  # Enables sampling instead of greedy decoding
    top_k=50,  # Consider top 50 tokens at each step
    top_p=0.9,  # Nucleus sampling: keeps top tokens contributing to 90% probability
    temperature=0.7,  # Controls randomness (lower = more deterministic)
    num_return_sequences=num_responses  # Generates multiple responses
)

# Decode the responses, remove the prompt, and print
# (alternative that keeps the prompt in the output:)
#decoded_responses = [tokenizer.decode(output, skip_special_tokens=True) for output in responses]
decoded_responses = [
    tokenizer.decode(output[len(input_ids[0]):], skip_special_tokens=True).strip()
    for output in responses
]
for i, response in enumerate(decoded_responses):
    print(f"Response {i+1}: {response}")