# Tutorial 4: DPO vs. GRPO vs. Supervised - Choosing Your Algorithm

ToolBrain offers a flexible selection of learning algorithms to fine-tune your agent. The three primary methods are **GRPO**, **DPO**, and **Supervised** learning. Understanding the differences between them and knowing when to use each is key to effective training.

This tutorial provides a conceptual overview of each algorithm and shows you how to switch between them.

## 1. Supervised Fine-Tuning (SFT)

**What it is:** Supervised learning is the most traditional form of fine-tuning. You provide the model with a dataset of high-quality prompt-and-response pairs, and the model learns to imitate those responses. 

**How it works in ToolBrain:** You provide a dataset where each example is a list of `ChatSegment` objects, representing a multi-turn conversation. The model learns to generate the `assistant`'s response given the preceding conversation history.

```python
# From examples/05_supervised_training.py
from typing import List

# Brain, ChatSegment, and my_agent are assumed to be imported or defined
# as in the example script.

# The dataset is a list of conversations
training_datasets: List[List[ChatSegment]] = [
    [
        ChatSegment(
            role="other",
            text="You are a Python assistant. Compute the sum of 1..10 and explain briefly.",
        ),
        ChatSegment(
            role="assistant",
            text="The sum of numbers from 1 to 10 is 55, using the formula n(n+1)/2.",
        ),
    ],
]

# The Brain is configured for Supervised learning
brain = Brain(
    agent=my_agent,
    algorithm="Supervised"
)

brain.train(training_datasets, num_iterations=5)
```
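
To make the training target concrete, here is a minimal sketch of how one conversation can be split into (history, target) pairs, where each target is an `assistant` turn and the history is everything before it. The `Segment` class and `to_training_pairs` helper are hypothetical stand-ins written for this illustration; they are not part of ToolBrain, and the library's internal data preparation may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for ChatSegment with the same two fields."""
    role: str
    text: str


def to_training_pairs(conversation: List[Segment]) -> List[Tuple[str, str]]:
    """Pair every assistant turn with the conversation history that precedes it."""
    pairs: List[Tuple[str, str]] = []
    history: List[str] = []
    for segment in conversation:
        if segment.role == "assistant":
            # The model is trained to produce this text given the history so far.
            pairs.append(("\n".join(history), segment.text))
        history.append(segment.text)
    return pairs


conversation = [
    Segment(role="other", text="Compute the sum of 1..10."),
    Segment(role="assistant", text="The sum is 55."),
]
pairs = to_training_pairs(conversation)
```

In a multi-turn conversation, each `assistant` turn yields its own pair, so later targets see earlier assistant responses as part of their history.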

**When to use it:**
- To teach the model a specific style, persona, or format.
- To instill factual knowledge or canned responses.
- As a warm-up step before RL to get the model into a reasonable starting state.

**Limitations:** It doesn't actively explore or learn from its own mistakes; it only mimics the data it's given.

## 2. GRPO (Group Relative Policy Optimization)

**What it is:** GRPO is ToolBrain's default and recommended Reinforcement Learning (RL) algorithm. Instead of just showing the model the single "correct" answer, GRPO lets the agent explore. It tries to solve a problem multiple times, generating a "group" of different traces.

**How it works:**
1.  For a single query, the agent generates `num_group_members` (e.g., 2 or 4) different execution traces.
2.  The reward function scores each of these traces independently.
3.  The GRPO algorithm then updates the model, teaching it to increase the probability of generating traces that received higher rewards and decrease the probability of those with lower rewards.
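
The "relative" part of step 3 can be sketched as follows: each trace's reward is compared against the mean of its own group, so a trace is pushed up or down depending on how it did relative to its siblings. This is a minimal illustration of the commonly used group normalization, assuming population standard deviation and a small `eps` for stability; the function name is hypothetical and ToolBrain's internal computation may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and standard deviation."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    # eps avoids division by zero when every trace in the group got the same reward.
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical rewards for 4 traces generated from the same query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Traces with a positive advantage are reinforced and traces with a negative advantage are discouraged. If all traces in a group receive the same reward, every advantage is zero and the group provides no learning signal, which is one reason a reward function that separates good traces from bad ones matters.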

```python
# From examples/01_run_hello_world.py

brain = Brain(
    agent,
    algorithm="GRPO",                # Explicitly choosing GRPO
    reward_func=reward_exact_match,
    num_group_members=2              # Collect 2 traces per step
)

brain.train(training_dataset, num_iterations=10)
```

**When to use it:**
- As your primary algorithm for teaching complex, multi-step tool use.
- When the goal is to explore a solution space to find the best approach (e.g., hyperparameter optimization).
- When you have a clear, quantitative reward signal (e.g., accuracy, F1 score, or a custom metric).

## 3. DPO (Direct Preference Optimization)

**What it is:** DPO is a preference-based alternative to GRPO. Instead of using each trace's score on an absolute scale, DPO learns from pairs of traces: one that is "chosen" (preferred) and one that is "rejected" (not preferred).

**How it works:**
1.  Like GRPO, the agent generates multiple traces for a query.
2.  The reward function scores each trace.
3.  ToolBrain automatically creates pairs of `(chosen, rejected)` traces, where the chosen trace has a higher reward than the rejected one.
4.  The DPO algorithm updates the model to increase the likelihood of generating the chosen trace relative to the rejected one.
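
Steps 3 and 4 can be sketched in plain Python. The first function builds `(chosen, rejected)` pairs from scored traces, and the second computes the standard DPO loss for one pair from log-probabilities under the trained policy and a frozen reference model. Both function names, the all-pairs strategy, the tie handling, and `beta=0.1` are assumptions made for illustration; ToolBrain's actual pairing logic may differ.

```python
import math
from itertools import combinations
from typing import List, Tuple


def build_preference_pairs(traces: List[str], rewards: List[float]) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs; ties are skipped because they express no preference."""
    pairs: List[Tuple[str, str]] = []
    for (t_a, r_a), (t_b, r_b) in combinations(zip(traces, rewards), 2):
        if r_a > r_b:
            pairs.append((t_a, t_b))
        elif r_b > r_a:
            pairs.append((t_b, t_a))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical traces and rewards for one query.
traces = ["trace_a", "trace_b", "trace_c"]
rewards = [0.9, 0.4, 0.4]
pairs = build_preference_pairs(traces, rewards)
```

Note that only the ordering of rewards matters when building pairs: a trace scored 0.9 versus 0.4 produces the same pair as 0.41 versus 0.4. The loss shrinks as the policy assigns relatively more probability to the chosen trace than the reference model does.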

```python
# From examples/04_lightgbm_hpo_training_with_dpo/run_hpo_training.py

brain = Brain(
    agent=my_agent,
    reward_func=reward_accuracy,
    algorithm="DPO",
    num_group_members=4  # DPO needs at least 2, more is often better
)

brain.train(training_dataset, num_iterations=10)
```

**When to use it:**
- When it's easier to say "this is better than that" than it is to assign a precise numerical score.
- When your reward signal is noisy or relative.
- When feedback comes from human or LLM-as-a-Judge comparisons, where DPO is often very effective at improving response quality.

## How to Switch Algorithms

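To switch algorithms, change the `algorithm` argument in the `Brain` constructor. As the examples above show, GRPO and DPO also take a `reward_func` and `num_group_members`, while Supervised training takes a dataset of conversations instead of a reward function: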

```python
# To use GRPO
brain = Brain(agent, algorithm="GRPO", ...)

# To use DPO
brain = Brain(agent, algorithm="DPO", ...)

# To use Supervised Fine-Tuning
brain = Brain(agent, algorithm="Supervised", ...)
```

## Summary

| Algorithm  | Learning Style          | Best For                                                                 |
|------------|-------------------------|--------------------------------------------------------------------------|
| **Supervised** | Imitation               | Learning style, facts, and warm-starting a model before RL.              |
| **GRPO**     | Exploration & Scoring   | Complex tool use, exploration tasks, and optimizing quantitative metrics. |
| **DPO**      | Preference & Comparison | Tasks where relative quality is more important than an absolute score.     |

Start with **GRPO** for most tool-learning tasks. If your reward signal is more about preference, or if GRPO isn't yielding the desired results, try **DPO**. Use **Supervised** learning for foundational training or when you have a dataset of perfect examples to imitate.