# Tutorial 3: Crafting Custom Reward Functions

Reward functions are the heart of reinforcement learning in ToolBrain. They are the mechanism by which you teach an agent what constitutes a "good" or "bad" outcome. A well-crafted reward function is critical for shaping the agent's behavior toward your desired goal.

This tutorial covers ToolBrain's built-in rewards, the power of LLM-as-a-Judge, and how to create your own custom reward function.

## Built-in Reward Functions

ToolBrain comes with several pre-built reward functions that cover common use cases. You can find them in `toolbrain.rewards`.

- **`reward_exact_match`**: This is the simplest and most common reward function. It gives a reward of `1.0` if the agent's `final_answer` is identical to the `gold_answer` you provide, and `0.0` otherwise. It's great for tasks with a single, definitive correct answer.

    ```python
    # Usage
    from toolbrain.rewards import reward_exact_match

    brain = Brain(agent, reward_func=reward_exact_match)
    ```

- **`reward_tool_execution_success`**: This function rewards the agent simply for using a tool *without causing an error*. It returns `1.0` if the `tool_output` in a turn does not contain an error message, and `0.0` if it does. This is useful for encouraging the agent to learn the correct syntax and parameters for its tools, even if the final answer isn't perfect yet.

- **`reward_combined`**: This function mixes multiple reward functions together, each with its own weight, so you can balance different objectives. For example, you might want to reward the agent for both getting the right answer *and* calling its tools without errors.

    ```python
    from toolbrain.rewards import reward_combined

    def custom_combined_reward(trace, **kwargs):
        # 70% weight on correctness, 30% on successful tool use
        weights = {
            "exact_match": 0.7,
            "tool_success": 0.3,
        }
        kwargs["weights"] = weights
        return reward_combined(trace, **kwargs)

    brain = Brain(agent, reward_func=custom_combined_reward)
    ```
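
To make the weighting concrete, here is a minimal sketch in plain Python. It is not ToolBrain's implementation: the trace layout (a list of dicts with `tool_output` and `final_answer` keys), the "contains the word error" check, and every `toy_*` name are assumptions made for illustration, and the real `reward_combined` may compute its score differently.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a trace: one dict per turn (not ToolBrain's Trace type).
ToyTrace = List[Dict[str, Any]]


def toy_exact_match(trace: ToyTrace, **kwargs: Any) -> float:
    """Return 1.0 if the last turn's final answer equals kwargs['gold_answer']."""
    gold = kwargs.get("gold_answer")
    if gold is None or not trace:
        return 0.0
    return 1.0 if trace[-1].get("final_answer") == gold else 0.0


def toy_tool_success(trace: ToyTrace, **kwargs: Any) -> float:
    """Return 1.0 if no tool output looks like an error (assumed: contains 'error')."""
    for turn in trace:
        if "error" in str(turn.get("tool_output", "")).lower():
            return 0.0
    return 1.0


def toy_combined(trace: ToyTrace, weights: Dict[str, float], **kwargs: Any) -> float:
    """Weighted sum of the component rewards named in `weights`."""
    components: Dict[str, Callable[..., float]] = {
        "exact_match": toy_exact_match,
        "tool_success": toy_tool_success,
    }
    return sum(w * components[name](trace, **kwargs) for name, w in weights.items())


weights = {"exact_match": 0.7, "tool_success": 0.3}

# Right answer, but one tool call failed along the way.
trace = [
    {"tool_output": "Error: unknown parameter"},
    {"tool_output": "42", "final_answer": "42"},
]
score = toy_combined(trace, weights, gold_answer="42")  # 0.7 * 1.0 + 0.3 * 0.0
```

With these weights, an agent that reaches the right answer after a failed tool call still earns most of the reward, while a clean run earns the full weighted sum.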

### A Special Mention: LLM-as-a-Judge

For complex tasks where a simple `exact_match` is insufficient, you can use another powerful LLM to act as a "judge." The **`reward_llm_judge_via_ranking`** function does exactly this.

When using this reward, the `Brain` collects multiple traces from the agent for the same query. It then presents these traces to a judge model (e.g., GPT-4, Gemini), which ranks them from best to worst. The rewards are then assigned based on this ranking. This is useful for tasks that require nuanced evaluation of quality, relevance, or style.

```python
# From examples/10_llm_as_judge.py
brain = Brain(
    agent,
    algorithm="GRPO",
    reward_func=reward_llm_judge_via_ranking,
    judge_model_id="gemini/gemini-1.5-flash", # Specify the judge model
    num_group_members=3 # Collect 3 traces to be ranked
)
```
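
The text above does not say how a ranking is turned into numeric rewards. One simple possibility, shown here purely as an assumption, is to spread rewards linearly from `1.0` for the best trace down to `0.0` for the worst. The function name `rewards_from_ranking` is hypothetical and is not part of ToolBrain.

```python
from typing import List


def rewards_from_ranking(ranking: List[int]) -> List[float]:
    """Map a best-to-worst ranking of trace indices to linearly spaced rewards.

    `ranking[0]` is the index of the best trace, `ranking[-1]` the worst.
    The returned list is ordered by trace index, not by rank.
    """
    n = len(ranking)
    rewards = [0.0] * n
    for position, trace_index in enumerate(ranking):
        # A single trace has nothing to be compared against; give it full reward.
        rewards[trace_index] = 1.0 if n == 1 else 1.0 - position / (n - 1)
    return rewards


# The judge ranked trace 2 best, then trace 0, then trace 1.
rewards = rewards_from_ranking([2, 0, 1])  # [0.5, 0.0, 1.0]
```

Because only the ordering matters, a scheme like this does not need the judge to produce calibrated absolute scores.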

## Creating a Custom Reward Function

Creating your own reward function is easy. All you need to do is define a Python function that adheres to the `RewardFunction` protocol.

### The `RewardFunction` Protocol

A valid reward function must:
1.  Accept a `trace: Trace` as its first argument.
2.  Accept `**kwargs: Any` to catch any other data passed by the `Brain` (like `gold_answer`).
3.  Return a `float` value. By convention, this score is often between 0.0 and 1.0, but any real-valued score will work.
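
The three requirements can be sketched with `typing.Protocol`. This is an illustration of the shape only, not ToolBrain's own definition: `ToyTrace`, `RewardFunctionLike`, the `final_answer` key, and `reward_case_insensitive_match` are all hypothetical names, and a plain list of dicts stands in for the real `Trace` type.

```python
from typing import Any, Dict, List, Protocol

# Hypothetical stand-in for a trace: one dict per turn.
ToyTrace = List[Dict[str, Any]]


class RewardFunctionLike(Protocol):
    """Anything callable as f(trace, **kwargs) -> float."""

    def __call__(self, trace: ToyTrace, **kwargs: Any) -> float: ...


def reward_case_insensitive_match(trace: ToyTrace, **kwargs: Any) -> float:
    """1.0 if the last turn's answer matches gold_answer, ignoring case and padding."""
    gold = kwargs.get("gold_answer")  # extra data arrives through **kwargs
    if gold is None or not trace:
        return 0.0
    answer = str(trace[-1].get("final_answer", ""))
    return 1.0 if answer.strip().lower() == str(gold).strip().lower() else 0.0


# The function satisfies the protocol because its signature matches.
reward_func: RewardFunctionLike = reward_case_insensitive_match
```

Reading optional values with `kwargs.get(...)` keeps the function usable even when a particular piece of data, such as `gold_answer`, is not supplied.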

### Hands-On: Rewarding Accuracy in HPO

Let's look at the example from `examples/02_lightgbm_hpo_training_with_grpo/run_hpo_training.py`. In this task, the agent performs hyperparameter optimization (HPO) by calling a `run_lightgbm` tool. The tool returns the model's prediction accuracy.

We want to reward the agent for finding parameters that lead to higher accuracy. A simple `exact_match` won't work here. We need a custom reward function.

Here is the function from the example:

```python
from toolbrain import Trace
from typing import Any

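# Customised reward function
def reward_accuracy(trace: Trace, **kwargs: Any) -> float:
    for turn in trace:
        try:
            # The tool's output is the accuracy score; return the first one that parses
            return float(turn["action_output"])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric output (e.g. an error message): check the next turn
            continue
    # Empty trace, or no turn produced a numeric output: the agent gets no reward
    return 0.0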
```

**How it works:**
1.  It iterates through the `trace`.
2.  It inspects the `action_output` (which is the same as `tool_output`) of each `Turn`.
3.  It tries to convert the output to a float. If successful, it means the tool ran correctly and returned an accuracy score. The function returns this score immediately as the reward, so the first turn with a numeric output decides the result.
4.  If a turn's `action_output` cannot be converted to a float (e.g., it's an error message), that tool call is treated as failed and the next turn is checked. If no turn yields a numeric output, the agent receives a reward of `0.0`.
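
To see these steps in action without installing anything, the sketch below runs a defensive variant of the same logic on hand-written traces. The traces are plain lists of dicts with an `action_output` key, which is an assumption standing in for ToolBrain's `Trace` type, and the name `reward_accuracy_sketch` is hypothetical.

```python
from typing import Any, Dict, List


def reward_accuracy_sketch(trace: List[Dict[str, Any]], **kwargs: Any) -> float:
    """Return the first tool output that parses as a float, or 0.0 if none does."""
    for turn in trace:
        try:
            return float(turn["action_output"])
        except (KeyError, TypeError, ValueError):
            continue  # failed or non-numeric tool call: look at the next turn
    return 0.0


# The tool ran and reported an accuracy of 0.87.
success = reward_accuracy_sketch([{"action_output": "0.87"}])  # 0.87

# The only tool call failed with an error message.
failure = reward_accuracy_sketch([{"action_output": "Error: invalid parameter"}])  # 0.0

# The first call failed, the retry succeeded: the retry's score is used.
retry = reward_accuracy_sketch(
    [{"action_output": "Error: invalid parameter"}, {"action_output": "0.91"}]
)  # 0.91
```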

### Integrating the Custom Reward

Using your new function is as simple as passing it to the `Brain` during initialization:

```python
brain = Brain(
    agent=my_agent,
    reward_func=reward_accuracy, # Pass the custom function here
    algorithm="GRPO"
)

brain.train(training_dataset, num_iterations=10)
```

With this setup, the GRPO algorithm will favor agent behaviors (i.e., choices of the `feature_fraction` hyperparameter) that result in higher accuracy scores, effectively teaching the agent to perform HPO.
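
Why higher accuracy is favored can be illustrated with the group-relative scoring that GRPO is named for: traces sampled for the same query are compared against each other rather than against a fixed baseline. The sketch below shows one common formulation, normalizing each reward by the group's mean and standard deviation. It demonstrates the arithmetic only, under that assumption, and is not ToolBrain's implementation; `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each reward relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every trace earned the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Accuracies returned by reward_accuracy for three traces of the same query.
advantages = group_relative_advantages([0.80, 0.85, 0.90])
```

In this group, the trace with accuracy `0.90` gets a positive advantage, the one with `0.80` a negative advantage, and the middle one lands near zero, so updates push the agent toward the better-scoring choice.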

---

By mastering custom reward functions, you unlock the full potential of ToolBrain to shape agent behavior for virtually any task you can imagine.
