# RL Fine-tuning for Positive Sentiment Generation

## Exercise: Implementing Reward Functions for GRPO

In this exercise, you will learn how to fine-tune a language model using **Reinforcement Learning** to generate text with positive sentiment. Specifically, you'll implement different reward functions and observe how they affect the trained model's behavior.

### Learning Objectives

By the end of this exercise, you will:
1. Understand how reward functions guide RL training
2. Implement a basic sentiment-based reward function
3. Explore reward shaping techniques (exponential)
4. Understand KL divergence regularization (forward vs backward)
5. Compare different training configurations empirically

### Background

We use **GRPO (Group Relative Policy Optimization)** from the TRL library. GRPO works by:
1. Generating multiple completions for each prompt
2. Computing rewards for each completion
3. Computing advantages relative to the group
4. Updating the policy to increase probability of high-advantage completions

The key insight is that the **reward function determines what the model learns**. A simple sentiment classifier reward will push the model toward positive text generation.
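
Steps 2 and 3 above can be sketched in a few lines. This is a simplified illustration with made-up rewards, assuming advantages are standardized by the group mean and standard deviation; the helper name `group_relative_advantages` is hypothetical, and TRL's exact normalization may differ.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical sentiment rewards for four completions of a single prompt.
group_rewards = [0.9, 0.5, 0.4, 0.2]
advantages = group_relative_advantages(group_rewards)

for r, a in zip(group_rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Completions that score above the group mean get a positive advantage and are made more likely; those below the mean get a negative advantage.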

## Setup

First, let's install dependencies and import the necessary modules.

In [None]:
# If running in Colab or without conda setup, uncomment to install:
# !pip install transformers trl torch datasets accelerate matplotlib rich
#
# For local setup with conda (recommended), see README.md:
# conda create -n sentiment python=3.10 -y && conda activate sentiment
# pip install torch --index-url https://download.pytorch.org/whl/cu124
# pip install -r requirements.txt

In [None]:
import torch
import math
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Import from our modules
from data import get_train_dataset, get_validation_dataset, VALIDATION_PROMPTS
from sentiment import load_sentiment_model, get_sentiment_scores

# Load the sentiment model (we'll use this throughout)
sentiment_model, sentiment_tokenizer = load_sentiment_model()
print("Sentiment model loaded!")

## Understanding the Sentiment Classifier

Before implementing reward functions, let's look at how the sentiment classifier works. We use `nlptown/bert-base-multilingual-uncased-sentiment`, a 5-star rating model. We compute the expected star rating and rescale it to [0, 1]: $\text{score} = (\mathbb{E}[\text{stars}] - 1) / 4$. This gives more continuous scores than a binary positive/negative classifier would.
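
The rescaling can be illustrated on raw class logits. This is a minimal sketch, assuming the classifier returns five logits ordered from 1 star to 5 stars; `expected_star_score` is a hypothetical helper, not the actual code in `sentiment.py`.

```python
import math

def expected_star_score(logits):
    """Map five star-class logits (1 to 5 stars) to a score in [0, 1]."""
    # Softmax, shifted by the max logit for numerical stability
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected star rating, then rescale from [1, 5] to [0, 1]
    expected_stars = sum(star * p for star, p in zip(range(1, 6), probs))
    return (expected_stars - 1) / 4

print(expected_star_score([0.0, 0.0, 0.0, 0.0, 0.0]))      # uniform over stars
print(expected_star_score([-5.0, -5.0, -5.0, -5.0, 5.0]))  # mostly 5 stars
```

A uniform distribution over the five classes has an expected rating of 3 stars, so it maps to the middle of the range; a confident 5-star prediction maps close to 1.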

In [None]:
# Test the sentiment classifier
test_texts = [
    "This movie was absolutely fantastic! I loved every moment.",
    "Terrible film. Complete waste of time and money.",
    "It was okay, nothing special but not bad either.",
    "The acting was superb and the story was compelling.",
    "Boring and predictable. I almost fell asleep."
]

scores = get_sentiment_scores(test_texts)

print("Sentiment Scores (rescaled expected star rating, 0-1):")
print("-" * 60)
for text, score in zip(test_texts, scores):
    sentiment = "POSITIVE" if score > 0.5 else "NEGATIVE"
    print(f"{score:.3f} [{sentiment:8s}]: {text[:50]}...")

---

# Exercise 1: Basic Sentiment Reward

Your first task is to implement a basic sentiment reward function. The reward should simply be the classifier's sentiment score in [0, 1] for each completion.

**Task**: Implement `sentiment_reward()` in the cell below.

**Hint**: Use the `get_sentiment_scores()` helper function, which returns one sentiment score per text.

In [None]:
def sentiment_reward(completions: list[str], **kwargs) -> list[float]:
    """
    Basic sentiment reward function.
    
    Computes the reward as the sentiment score (rescaled expected star rating) of each completion.
    
    Args:
        completions: List of generated text completions
        **kwargs: Additional arguments (not used here)
    
    Returns:
        List of reward values in [0, 1], one per completion
    
    Example:
        >>> rewards = sentiment_reward(["Great movie!", "Terrible film."])
        >>> # rewards[0] should be clearly higher than rewards[1]
    """
    # =========================================================================
    # YOUR CODE HERE (1-2 lines)
    # Hint: Use get_sentiment_scores(completions)
    # =========================================================================
    
    raise NotImplementedError("Implement sentiment_reward")
    
    # =========================================================================
    # END YOUR CODE
    # =========================================================================

In [None]:
# Test your implementation
test_completions = [
    "This movie was amazing and I loved it!",
    "This movie was terrible and boring.",
    "This movie was okay I guess."
]

try:
    rewards = sentiment_reward(test_completions)
    print("Your sentiment_reward implementation:")
    for text, reward in zip(test_completions, rewards):
        print(f"  {reward:.3f}: {text}")
    
    # Basic validation
    assert len(rewards) == 3, "Should return one reward per completion"
    assert all(0 <= r <= 1 for r in rewards), "Rewards should be in [0, 1]"
    assert rewards[0] > rewards[1], "Positive text should have higher reward"
    print("\n✓ Tests passed!")
except NotImplementedError:
    print("❌ Not implemented yet - complete the function above!")

---

# Exercise 2: Reward Shaping

Raw sentiment probabilities might not be the optimal reward signal. **Reward shaping** transforms the raw reward to potentially improve learning.

**Note on GRPO**: GRPO compares rewards *within* each group. When advantages are standardized by the group mean and standard deviation, a linear transformation (a shift, or a rescaling by a positive constant) leaves the advantages unchanged, so it is mathematically equivalent to no shaping. Only *non-linear* transformations such as exponential shaping change the relative differences between rewards.
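
The invariance is easy to check numerically. The sketch below uses made-up scores and assumes advantages are the rewards standardized within the group (the `normalize` helper is illustrative, not TRL's code).

```python
import statistics

def normalize(rewards, eps=1e-8):
    """Standardize rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

scores = [0.9, 0.5, 0.4, 0.2]               # hypothetical raw scores
linear = [2.0 * (s - 0.5) for s in scores]  # scale * (score - baseline)

adv_raw = normalize(scores)
adv_linear = normalize(linear)
max_gap = max(abs(a - b) for a, b in zip(adv_raw, adv_linear))
print(f"largest advantage difference: {max_gap:.2e}")
```

The difference is at the level of floating-point noise. Note that a negative scale would flip the sign of every advantage, so the equivalence only holds for positive scales.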

In [None]:
# Linear shaping (scale * (score - baseline)) is mathematically equivalent
# to no shaping for GRPO, since it uses relative comparisons.
# We only implement exponential shaping which changes relative differences.
print("Skipping linear shaping - see note above about GRPO.")

## Exponential Reward Shaping

Exponential shaping creates a **non-linear** reward curve:

$$\text{reward} = \exp(\text{score} / \text{temperature}) - 1$$

Unlike linear shaping, this changes the *relative* differences between rewards:
- Amplifies differences at the high end (very positive completions get much higher rewards)
- Compresses differences at the low end

The **temperature** parameter controls steepness:
- Lower temperature → sharper exponential curve (more differentiation)
- Higher temperature → flatter curve
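
To see the effect on a GRPO-style signal, the sketch below applies the formula to one made-up group of scores and standardizes the result. It assumes advantages are group-standardized rewards; `normalize` and `exp_shape` are illustrative helpers.

```python
import math
import statistics

def normalize(rewards, eps=1e-8):
    """Standardize rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def exp_shape(score, temperature):
    return math.exp(score / temperature) - 1

scores = [0.9, 0.5, 0.4, 0.2]  # hypothetical sentiment scores for one group
advantages = {"raw": normalize(scores)}
for temperature in (1.0, 0.25):
    shaped = [exp_shape(s, temperature) for s in scores]
    advantages[temperature] = normalize(shaped)

for name, adv in advantages.items():
    print(f"{str(name):>5}: " + "  ".join(f"{a:+.2f}" for a in adv))
```

As the temperature drops, the best completion's advantage grows while the gap between the two weakest completions shrinks, which is the amplify-high, compress-low behavior described above.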

In [None]:
def shaped_reward_exponential(
    completions: list[str],
    temperature: float = 1.0,
    **kwargs
) -> list[float]:
    """
    Exponential reward shaping: reward = exp(score / temperature) - 1
    
    Args:
        completions: List of generated text completions
        temperature: Controls steepness (lower = steeper)
    
    Returns:
        List of shaped reward values
    """
    # =========================================================================
    # YOUR CODE HERE (2-3 lines)
    # 1. Get sentiment scores
    # 2. Apply: exp(score / temperature) - 1
    # Hint: Use math.exp() for the exponential
    # =========================================================================
    
    raise NotImplementedError("Implement shaped_reward_exponential")
    
    # =========================================================================
    # END YOUR CODE
    # =========================================================================

In [None]:
# Test exponential shaping
try:
    rewards = shaped_reward_exponential(test_completions)
    print("Exponential shaped rewards (temperature=1.0):")
    for text, reward in zip(test_completions, rewards):
        print(f"  {reward:.3f}: {text}")
    
    # All rewards should be non-negative (exp(x) >= 1 for x >= 0, so exp(x) - 1 >= 0)
    assert all(r >= 0 for r in rewards), "Exponential rewards should be >= 0"
    print("\n✓ Tests passed!")
except NotImplementedError:
    print("❌ Not implemented yet - complete the function above!")

### Visualizing Reward Shaping

Let's visualize how different reward shaping methods transform the sentiment scores.

In [None]:
# Visualize reward shaping curves
x = np.linspace(0, 1, 100)

# Compute different transformations
y_raw = x  # No shaping
y_exp_1 = np.exp(x / 1.0) - 1  # Exponential, temp=1.0
y_exp_05 = np.exp(x / 0.5) - 1  # Exponential, temp=0.5 (sharper)

plt.figure(figsize=(10, 6))
plt.plot(x, y_raw, label='Raw (no shaping)', linewidth=2)
plt.plot(x, y_exp_1, label='Exponential (temp=1.0)', linewidth=2)
plt.plot(x, y_exp_05, label='Exponential (temp=0.5)', linewidth=2, linestyle='--')

plt.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
plt.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5, label='Neutral')

plt.xlabel('Sentiment Score (rescaled expected star rating)')
plt.ylabel('Shaped Reward')
plt.title('Reward Shaping Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# Exercise 3: KL Divergence Penalties (Advanced)

**KL divergence** measures how different two probability distributions are. In RL fine-tuning, we often add a KL penalty to prevent the policy from deviating too far from the original (reference) model.

KL divergence is asymmetric, so there are two directions in which to compute it:

## Forward KL: $D_{KL}(\pi_{\text{policy}} || \pi_{\text{ref}})$

- Penalizes the policy for putting mass where the reference doesn't
- "Mode-seeking" behavior: the policy can concentrate on a few modes of the reference
- Can lead to mode collapse but more focused outputs

## Backward KL: $D_{KL}(\pi_{\text{ref}} || \pi_{\text{policy}})$

- Penalizes the policy for NOT having mass where the reference does
- "Mode-covering" behavior: the policy tries to cover all modes of the reference
- Tends to produce more diverse outputs

Naming note: much of the literature calls $D_{KL}(\pi_{\text{policy}} || \pi_{\text{ref}})$ the *reverse* KL. This exercise uses "forward" and "backward" as above to match the `--kl_type` options.
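
The asymmetry between the two directions shows up even on a three-outcome toy distribution. The numbers below are invented for illustration: a bimodal reference, a policy concentrated on one mode, and a policy spread evenly.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.495, 0.01, 0.495]   # bimodal reference
narrow = [0.98, 0.01, 0.01]  # policy concentrated on one mode
broad = [1 / 3, 1 / 3, 1 / 3]  # policy spread over all outcomes

for name, policy in [("narrow", narrow), ("broad", broad)]:
    print(f"{name:>6}: D_KL(policy || ref) = {kl(policy, ref):.3f}, "
          f"D_KL(ref || policy) = {kl(ref, policy):.3f}")
```

$D_{KL}(\pi_{\text{policy}} || \pi_{\text{ref}})$ is lower for the concentrated policy, while $D_{KL}(\pi_{\text{ref}} || \pi_{\text{policy}})$ is lower for the spread-out one, so the two directions rank the same candidates differently.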

**Your Task**: Implement `compute_ref_log_probs`, `kl_penalty_forward`, and `kl_penalty_backward` in `rewards.py`. Then train with `--kl_type forward` or `--kl_type backward` to use your implementations!

In [None]:
# For the KL exercises, we need to understand log probabilities
# Let's first see how to compute them

# Load a small GPT-2 model
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
gpt2_model.eval()

# Example: compute log probability of a sequence
text = "This movie was great!"
inputs = gpt2_tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = gpt2_model(**inputs)
    logits = outputs.logits  # Shape: (batch, seq_len, vocab_size)
    
    # Convert to log probabilities
    log_probs = torch.log_softmax(logits, dim=-1)
    
    # Get log prob of each token (shifted by 1 because logits[t] predicts token[t+1])
    token_ids = inputs["input_ids"][0]
    seq_log_prob = 0.0
    
    print(f"Text: {text}")
    print(f"Tokens: {gpt2_tokenizer.convert_ids_to_tokens(token_ids)}")
    print("\nPer-token log probabilities:")
    for t in range(len(token_ids) - 1):
        next_token = token_ids[t + 1]
        token_log_prob = log_probs[0, t, next_token].item()
        seq_log_prob += token_log_prob
        print(f"  P({gpt2_tokenizer.decode(next_token):10s}) = {math.exp(token_log_prob):.4f} (log: {token_log_prob:.2f})")
    
    print(f"\nTotal sequence log prob: {seq_log_prob:.2f}")

### Understanding KL Penalty in Practice

KL regularization prevents the model from drifting too far from the original GPT-2.

You will implement a custom KL term inside the reward function (selected with the `--kl_type` parameter). It adds a bonus or penalty to the reward based on the reference model's probability of the completion:
- **Forward KL**: `reward += kl_coef * log(P_ref)` (bonus for likely outputs)
- **Backward KL**: `reward -= kl_coef * exp(-log(P_ref))` (penalty for unlikely outputs)

Implement `compute_ref_log_probs`, `kl_penalty_forward`, and `kl_penalty_backward` in `rewards.py`.
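
Before implementing these, it can help to look at the scale of the two terms. The sketch below plugs invented log-probabilities into the two formulas above; `forward_term` and `backward_term` are hypothetical names, not the `rewards.py` signatures, and whether `rewards.py` expects summed or length-averaged log-probs is not stated here.

```python
import math

def forward_term(ref_log_prob, kl_coef=0.1):
    """kl_coef * log(P_ref): less negative for text the reference finds likely."""
    return kl_coef * ref_log_prob

def backward_term(ref_log_prob, kl_coef=0.1):
    """-kl_coef * exp(-log(P_ref)): a penalty that grows as 1 / P_ref."""
    return -kl_coef * math.exp(-ref_log_prob)

# Hypothetical reference log-probs (not measured from GPT-2).
examples = [("per-token average", -2.0), ("sum over 20 tokens", -40.0)]
for label, log_prob in examples:
    print(f"{label:>20}: forward={forward_term(log_prob):+.3g}  "
          f"backward={backward_term(log_prob):+.3g}")
```

With a summed sequence log-prob, the backward term dwarfs a sentiment reward in [0, 1], so the scale of the log-prob you feed in (and the choice of `kl_coef`) matters a great deal for that variant.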

In [None]:
# Example: kl_penalty_forward signature (implement in rewards.py)
#
# def kl_penalty_forward(
#     completions: list[str],
#     prompts: list[str],
#     ref_model,
#     tokenizer,
#     kl_coef: float = 0.1,
#     **kwargs
# ) -> list[float]:
#     """
#     Forward KL: reward += kl_coef * log(P_ref)
#     Bonus for outputs likely under reference model.
#     """
#     log_probs = compute_ref_log_probs(completions, prompts, ref_model, tokenizer)
#     return [kl_coef * lp for lp in log_probs]

print('See rewards.py for the full implementation exercise.')
print('Key functions to implement:')
print('  - compute_ref_log_probs(): Compute log P(completion | prompt) under ref model')
print('  - kl_penalty_forward(): Return kl_coef * log_prob (bonus for likely outputs)')
print('  - kl_penalty_backward(): Return -kl_coef * exp(-log_prob) (penalty for unlikely outputs)')

---

# Running Training Experiments

Now let's use your implemented reward functions to train models with different configurations!

In [None]:
# First, let's see what the BASE model generates (before training)
from evaluate import generate_completions, load_model

base_model, base_tokenizer = load_model("gpt2")

sample_prompts = [
    "This movie was",
    "The acting in this film",
    "I watched this yesterday and",
    "The story was",
    "Overall, I think this movie"
]

print("BASE GPT-2 Generations (before training):")
print("=" * 60)

base_completions = generate_completions(base_model, base_tokenizer, sample_prompts)
base_scores = get_sentiment_scores(base_completions)

for prompt, completion, score in zip(sample_prompts, base_completions, base_scores):
    sentiment = "POS" if score > 0.5 else "NEG"
    print(f"\n[{score:.2f} {sentiment}] {completion}")

In [None]:
# Plot base model sentiment distribution
from data import TRAIN_PROMPTS

# Generate more samples for statistics
all_base_completions = generate_completions(base_model, base_tokenizer, TRAIN_PROMPTS[:20])
all_base_scores = get_sentiment_scores(all_base_completions)

plt.figure(figsize=(10, 5))
plt.hist(all_base_scores, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(x=0.5, color='r', linestyle='--', label='Neutral')
plt.axvline(x=np.mean(all_base_scores), color='g', linestyle='-', 
            label=f'Mean: {np.mean(all_base_scores):.2f}')
plt.xlabel('Sentiment Score')
plt.ylabel('Count')
plt.title('Base GPT-2 Sentiment Distribution (Before Training)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Base model statistics:")
print(f"  Mean sentiment: {np.mean(all_base_scores):.3f}")
print(f"  Std: {np.std(all_base_scores):.3f}")
print(f"  Positive ratio: {np.mean(np.array(all_base_scores) > 0.5):.1%}")

## Training with Different Configurations

Now let's train with your reward functions! You can experiment with:

1. **Basic sentiment reward** (your `sentiment_reward`)
2. **Exponential shaping** (your `shaped_reward_exponential`)
3. **Custom KL regularization** (your `kl_penalty_forward` / `kl_penalty_backward`)

Run the cells below to train different configurations.

In [None]:
# Training with basic sentiment reward
# Runtime depends on the preset and your hardware (see the estimate printed below)

from train import train

# Uncomment to run training:
# trainer = train(
#     model_name="gpt2",
#     output_dir="./outputs/basic_sentiment",
#     preset="quick",  # Use "medium" for better results
#     reward_shaping="none",
# )

print("Training code ready. Uncomment and run to start training!")
print("Estimated time: ~3-5 minutes for 'quick' preset")

In [None]:
# Training with exponential shaping

# Uncomment to run training:
# trainer = train(
#     model_name="gpt2",
#     output_dir="./outputs/exponential_shaping",
#     preset="quick",
#     reward_shaping="exponential",
# )

print("Training code ready. Uncomment and run to start training!")

In [None]:
# Training with custom KL regularization (uses your implementation!)

# Uncomment to run training with Forward KL:
# trainer = train(
#     model_name="gpt2",
#     output_dir="./outputs/with_forward_kl",
#     preset="quick",
#     reward_shaping="none",
#     kl_type="forward",  # Uses your kl_penalty_forward!
#     kl_coef=0.1,        # Regularization strength
# )

# Or try Backward KL:
# trainer = train(
#     model_name="gpt2",
#     output_dir="./outputs/with_backward_kl",
#     preset="quick",
#     reward_shaping="none",
#     kl_type="backward",  # Uses your kl_penalty_backward!
#     kl_coef=0.1,
# )

print("Training code ready. Uncomment and run to start training!")

## Evaluating Trained Models

After training, compare the results.

In [None]:
# Compare base vs trained model (after training completes)
# Uncomment and modify paths as needed:

# from evaluate import compare_models, plot_comparison, print_comparison_samples

# comparison = compare_models(
#     base_model="gpt2",
#     trained_model="./outputs/basic_sentiment/final",
#     num_samples=10
# )

# print_comparison_samples(comparison)
# plot_comparison(comparison)

print("Evaluation code ready. Run after training completes!")

---

# Analysis Questions

After completing the exercises, consider these questions:

1. **Reward Function Impact**: How did different reward shaping methods affect:
   - Training speed (reward curve slope)
   - Final sentiment scores
   - Generation diversity

2. **KL Divergence Trade-offs**: When using KL regularization (`--kl_type forward` or `backward`):
   - Does the model stay closer to the base model's behavior?
   - Is there a trade-off between sentiment positivity and text quality?
   - What happens with very high `kl_coef` values?

3. **Mode Collapse**: Did you observe any signs of mode collapse (repetitive outputs)? How did different configurations affect this?

4. **Forward vs Backward KL**: (If implemented) How do forward and backward KL penalties differ in their effects on generation?
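
For the diversity and mode-collapse questions, a simple quantitative check is the fraction of distinct n-grams across a batch of generations. The sketch below is one possible metric, not part of the provided `evaluate` module; `distinct_n` is a hypothetical helper and the example texts are invented.

```python
def distinct_n(texts, n=2):
    """Fraction of unique word n-grams across all texts (1.0 = no repeats)."""
    ngrams = []
    for text in texts:
        words = text.lower().split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

varied = ["this movie was great fun", "the plot felt slow but honest"]
collapsed = ["great great great great great", "great great great great great"]

print(f"varied:    {distinct_n(varied):.3f}")
print(f"collapsed: {distinct_n(collapsed):.3f}")
```

A value that drops sharply after training, compared with the base model on the same prompts, is a sign of repetitive outputs.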

Write your observations in the cell below:

### Your Observations

*Double-click to edit this cell and write your analysis...*

**Reward Shaping Observations:**
- 

**KL Penalty Observations:**
- 

**Mode Collapse Observations:**
- 

**Other Findings:**
- 

---

# Bonus Challenges

If you finish early, try these extensions:

1. **Custom Reward Shaping**: Design your own non-linear reward transformation. Can you find one that works better than exponential?

2. **Reward Combination**: What happens if you combine sentiment reward with a length penalty? Implement and test.

3. **Temperature Experiments**: Try different sampling temperatures during generation. How does this affect the sentiment/quality trade-off?

4. **Different Base Models**: Try using `gpt2-medium` instead of `gpt2`. How does model size affect training?
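
As a starting point for the reward-combination challenge, here is one possible way to fold a length penalty into precomputed sentiment scores. Everything in it is an assumption for illustration: the helper name `with_length_penalty`, the word-based length measure, and the default `max_words` and `penalty_per_word` values.

```python
def with_length_penalty(scores, completions, max_words=30, penalty_per_word=0.01):
    """Subtract a small penalty for every word beyond max_words."""
    rewards = []
    for score, text in zip(scores, completions):
        excess = max(0, len(text.split()) - max_words)
        rewards.append(score - penalty_per_word * excess)
    return rewards

# Two hypothetical completions with the same sentiment score but different lengths.
scores = [0.8, 0.8]
completions = ["good " * 10, "good " * 50]
rewards = with_length_penalty(scores, completions)
print([round(r, 3) for r in rewards])
```

Because the penalty differs per completion, it changes the ranking within a group and therefore does affect the GRPO learning signal, unlike a constant shift.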

In [None]:
# Bonus: Custom reward function

def my_custom_reward(completions: list[str], **kwargs) -> list[float]:
    """
    Your custom reward function.
    
    Ideas to try:
    - Combine sentiment with length penalty
    - Add bonus for certain keywords
    - Penalize repetition
    """
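    # YOUR CODE HERE
    # Must return one float per completion; raising keeps an unfinished stub from returning None.
    raise NotImplementedError("Implement my_custom_reward")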

---

# Summary

In this exercise, you learned:

1. **GRPO Algorithm**: How Group Relative Policy Optimization trains LLMs using RL
2. **Reward Functions**: The crucial role of reward design in RL fine-tuning
3. **Reward Shaping**: How exponential transformations affect learning (linear is equivalent to none for GRPO)
4. **KL Regularization**: Using KL divergence to prevent catastrophic forgetting
5. **Practical Training**: Running and evaluating RL fine-tuning experiments

### Key Takeaways

- The reward function is one of the most important design choices in RL fine-tuning
- Non-linear reward shaping can significantly change training dynamics
- KL penalties help maintain model quality while optimizing for the task
- There is usually a trade-off between task performance and generation diversity

### Further Reading

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeekMath Paper (GRPO)](https://arxiv.org/abs/2402.03300)
- [KL Approximation (Schulman)](http://joschu.net/blog/kl-approx.html)