# What is ToolBrain?

ToolBrain is a lightweight, open-source Python library designed to train **agentic systems** to use tools more effectively. At its core, ToolBrain uses trace-based learning, primarily with Reinforcement Learning (RL), to improve an LLM-powered agent's ability to reason and correctly use a given set of tools to accomplish a task.

## The Problem It Solves

Large Language Models (LLMs) are powerful, but their ability to interact with the outside world is limited. To overcome this, developers equip them with "tools" (e.g., functions, APIs). However, teaching an agent *how* and *when* to use these tools reliably is a significant challenge. Agents might fail to use a tool, use the wrong one, or provide incorrect parameters.

ToolBrain addresses this by providing a structured framework to fine-tune the agent's underlying model, rewarding it for correct tool usage and penalizing it for mistakes.

## Core Concepts

### 1. Trace-Based Learning

The fundamental unit of learning in ToolBrain is the **Trace**. A trace is a detailed log of everything that happens during an agent's attempt to solve a problem. It records every thought, every tool call, and every output.

- **`Trace`**: A list of `Turn` objects, representing the entire lifecycle of a task from the initial query to the final answer.
- **`Turn`**: A single interaction step within a trace. It captures:
    - `prompt_for_model`: The exact input the LLM saw.
    - `model_completion`: The exact raw text the LLM generated.
    - `parsed_completion`: A structured breakdown of the model's output (e.g., thought, tool code).
    - `tool_output`: The result from executing the tool code.

This high-fidelity data is essential for providing a clear and accurate learning signal to the model.
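
As a rough illustration, the two structures could be modeled as below. This is a minimal sketch, not ToolBrain's actual class definitions: the four field names come from the list above, while the types, the defaults, and the `parsed_completion` keys (`thought`, `tool_code`, `final_answer`) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Turn:
    """One interaction step (field names from the docs, types assumed)."""
    prompt_for_model: str                      # exact input the LLM saw
    model_completion: str                      # exact raw text the LLM generated
    parsed_completion: Dict[str, Any] = field(default_factory=dict)
    tool_output: Optional[str] = None          # None when no tool was run


# A trace is simply an ordered list of turns.
Trace = List[Turn]

trace: Trace = [
    Turn(
        prompt_for_model="What is 2 + 3? Use the `add` tool.",
        model_completion="Thought: I should call add.\nCode: add(2, 3)",
        # Hypothetical keys, chosen only for this example.
        parsed_completion={"thought": "I should call add.", "tool_code": "add(2, 3)"},
        tool_output="5",
    ),
    Turn(
        prompt_for_model="The tool returned 5. Give the final answer.",
        model_completion="Final answer: 5",
        parsed_completion={"final_answer": "5"},
    ),
]

for step, turn in enumerate(trace, start=1):
    print(step, turn.parsed_completion, turn.tool_output)
```

Keeping both the raw completion and the parsed version means a reward function can inspect either the exact text or the structured result.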

### 2. Reward Functions

A reward function is a Python callable that scores a `Trace` based on its success. The score, or "reward," tells the learning algorithm how well the agent performed. A higher reward signals a more desirable outcome.

ToolBrain offers several built-in reward functions and makes it easy to create your own:
- **`reward_exact_match`**: Rewards the agent if its final answer perfectly matches a "gold" answer.
- **`reward_tool_execution_success`**: Rewards the agent simply for executing a tool without errors.
- **LLM-as-a-Judge**: Uses a powerful external LLM (like GPT-4 or Gemini) to evaluate the agent's performance based on nuanced criteria.
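
To make the idea of a custom reward concrete, here is a hedged sketch of an exact-match style scorer. The function name `exact_match_reward`, its `(trace, gold_answer)` signature, the dict-based trace, and the `final_answer` key are all assumptions for illustration; the built-in `reward_exact_match` may be defined differently.

```python
from typing import Any, Dict, List


def exact_match_reward(trace: List[Dict[str, Any]], gold_answer: str) -> float:
    """Return 1.0 if the last turn's final answer equals the gold answer, else 0.0.

    Hypothetical helper: each turn is a plain dict with an optional
    "parsed_completion" dict that may hold a "final_answer" entry.
    """
    if not trace:
        return 0.0
    final = trace[-1].get("parsed_completion", {}).get("final_answer")
    if final is None:
        return 0.0
    # Compare after trimming whitespace so "5 " and "5" count as a match.
    return 1.0 if str(final).strip() == gold_answer.strip() else 0.0


example_trace = [
    {"parsed_completion": {"tool_code": "add(2, 3)"}, "tool_output": "5"},
    {"parsed_completion": {"final_answer": "5"}},
]

print(exact_match_reward(example_trace, gold_answer="5"))
print(exact_match_reward(example_trace, gold_answer="6"))
```

A binary score like this is easy to reason about but sparse; a graded score (for example, partial credit for a successful tool call) gives the learning algorithm more signal to work with.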

### 3. Reinforcement Learning (and more) for Agents

ToolBrain feeds the scored traces into one of several learning algorithms to fine-tune the agent's base model.

- **GRPO (Group Relative Policy Optimization)**: The default RL algorithm. It collects a "group" of traces for a single task, scores them, and updates the model's policy to favor the strategies that earned higher rewards relative to the rest of the group.
- **DPO (Direct Preference Optimization)**: An alternative algorithm that learns from pairs of "chosen" (high-reward) and "rejected" (low-reward) traces, optimizing the policy directly on those preferences.
- **Supervised Learning**: Standard fine-tuning, useful for teaching the model basic conversational patterns or facts.
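
The sketch below shows how a group of rewards for one task might be turned into the signals these algorithms consume: mean/std-normalized advantages in the style of GRPO, and a chosen/rejected index pair in the style of DPO. Both helper names are hypothetical, and this covers only the reward post-processing, not the policy update itself; how ToolBrain performs these steps internally is not stated in this document.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every trace earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards: List[float]) -> Optional[Tuple[int, int]]:
    """Return (chosen_index, rejected_index), or None if all rewards tie."""
    chosen = max(range(len(rewards)), key=rewards.__getitem__)
    rejected = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[chosen] == rewards[rejected]:
        return None  # no preference can be expressed
    return chosen, rejected


group_rewards = [1.0, 0.0, 0.0, 1.0]   # four traces for the same task
advantages = group_relative_advantages(group_rewards)
pair = best_worst_pair(group_rewards)
print(advantages)
print(pair)
```

Note that a group in which every trace gets the same reward yields zero advantages and no preference pair, so tasks that are always solved or never solved contribute little learning signal.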

## High-Level Architecture

The library's design is simple and modular, revolving around a few key components:

1.  **`CodeAgent` (or other agent)**: The agent itself, responsible for executing tasks. ToolBrain is designed to be adaptable to different agent frameworks, and it ships with tight integration for `smolagents.CodeAgent`.
2.  **`Brain`**: The central orchestrator. You initialize the `Brain` with your agent, a learning algorithm, and a reward function. It manages the entire training loop, from collecting traces to updating the model.
3.  **Traces & Rewards**: The data and feedback loop that powers the learning process.
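
The control flow these components imply can be sketched as a single training step: run the agent several times, score each trace, and hand the results to a learning algorithm. Everything below is a hypothetical stand-in written for illustration (`stub_agent`, `stub_reward`, `stub_update`, `training_step`); it does not use or reproduce the real `Brain` API, and the "update" only records rewards instead of changing any model weights.

```python
import random
from typing import Callable, Dict, List

Trace = List[Dict[str, str]]

update_log: List[List[float]] = []


def stub_agent(query: str, rng: random.Random) -> Trace:
    """Pretend agent: answers "5" or "6" at random in a one-turn trace."""
    answer = rng.choice(["5", "6"])
    return [{"prompt_for_model": query, "model_completion": f"Final answer: {answer}"}]


def stub_reward(trace: Trace) -> float:
    """Pretend reward: 1.0 when the final completion ends with the gold answer "5"."""
    return 1.0 if trace[-1]["model_completion"].endswith("5") else 0.0


def stub_update(traces: List[Trace], rewards: List[float]) -> None:
    """Pretend learning algorithm: a real one would update model weights here."""
    update_log.append(rewards)


def training_step(
    run_agent: Callable[[str, random.Random], Trace],
    reward_fn: Callable[[Trace], float],
    update_fn: Callable[[List[Trace], List[float]], None],
    query: str,
    group_size: int,
    rng: random.Random,
) -> List[float]:
    """Collect a group of traces for one query, score them, then update."""
    traces = [run_agent(query, rng) for _ in range(group_size)]
    rewards = [reward_fn(t) for t in traces]
    update_fn(traces, rewards)
    return rewards


rewards = training_step(
    stub_agent, stub_reward, stub_update,
    query="What is 2 + 3?", group_size=4, rng=random.Random(0),
)
print(rewards)
```

Passing the agent, reward function, and update step in as separate callables mirrors the modular design described above: any one of them can be swapped without touching the loop.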

By combining these components, ToolBrain provides a flexible system for systematically improving how reliably your agents choose and call their tools.