# Part 4: Efficient Training with ToolBrain

Fine-tuning large language models (LLMs) is resource-intensive, often requiring substantial GPU memory and long training times. ToolBrain integrates several optimization techniques that reduce both costs. With them, you can train larger models on smaller GPUs and finish your training runs sooner.

This tutorial covers the four efficiency techniques available in ToolBrain: **LoRA**, **BitsAndBytes**, **FP16 training**, and **Unsloth**.

## 1. LoRA: Low-Rank Adaptation

**What it is:** Instead of fine-tuning all the billions of parameters in an LLM, LoRA (Low-Rank Adaptation) freezes the base model and injects small, trainable "adapter" layers into its architecture. During training, only these tiny adapter layers are updated.

**Why it's useful:**
-   **Massive Memory Savings:** Since you are only training a tiny fraction of the parameters (e.g., <1%), the memory required for gradients and optimizer states is drastically reduced.
-   **Fast & Portable:** The final output of the training is just the small adapter weights (a few megabytes), not a full multi-gigabyte model. This makes saving, loading, and sharing your fine-tuned model incredibly fast and easy.
-   **Little to No Performance Loss:** On many tasks, LoRA can match the performance of full fine-tuning.
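
To make the "tiny fraction" concrete, here is a back-of-the-envelope sketch that counts the trainable parameters LoRA adds to a single weight matrix. The layer size (4096 x 4096) is a hypothetical value chosen for illustration, not a property of any specific model; the rank matches the `r=8` used in this tutorial.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns two low-rank factors: A with shape (r x d_in) and
    B with shape (d_out x r). The frozen base matrix is not counted.
    """
    return r * d_in + d_out * r


# Hypothetical projection size, for illustration only.
d_in = d_out = 4096
rank = 8

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, rank)

print(f"Full matrix:     {full_params:,} parameters")
print(f"LoRA adapter:    {lora_params:,} parameters")
print(f"Trainable share: {lora_params / full_params:.2%}")
```

Because only the adapter parameters receive gradients and optimizer state, the training-time overhead scales with the adapter size rather than with the full matrix.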

**How to use it in ToolBrain:**

ToolBrain uses the `peft` library from Hugging Face to handle LoRA. You simply define a `LoraConfig` object and pass it to the appropriate configuration in your `Brain`.

```python
from peft import LoraConfig

# Define the LoRA configuration
LORA_CONFIG = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Target attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# This config is then passed into the Brain's algorithm configuration
# (Example from the email search agent)
GRPO_CONFIG = {
    "learning_rate": 1e-5,
    "lora_config": LORA_CONFIG, # Pass the config here
    # ... other GRPO settings
}
```

ToolBrain will automatically apply this configuration to the model.

## 2. BitsAndBytes: 4-bit or 8-bit Quantization (GPU only)

**What it is:** Quantization is the process of reducing the numerical precision of a model's weights. BitsAndBytes is a library that lets you load a model with its weights quantized to 4-bit or 8-bit precision instead of the standard 16-bit or 32-bit.

**Why it's useful:**
-   **Drastic Memory Reduction:** Loading a model in 4-bit precision instead of 16-bit can reduce the memory footprint of its weights by up to 75%. This often makes the difference between a model fitting on your GPU or not.
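
The 75% figure follows from simple arithmetic on bits per weight. The sketch below estimates weight memory at several precisions for a hypothetical 7-billion-parameter model (the size is an assumption made for illustration). It counts weights only and ignores activations and any quantization metadata.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model, for illustration only.
n_params = 7e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(n_params, bits):.1f} GB")

# Relative saving when going from 16-bit to 4-bit weights.
saving = 1 - weight_memory_gb(n_params, 4) / weight_memory_gb(n_params, 16)
print(f"4-bit vs 16-bit saving: {saving:.0%}")
```

In practice the saving is usually somewhat below this upper bound, because quantized formats store extra scaling constants and some layers may stay in higher precision.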

**How to use it in ToolBrain:**

First, load the model with 4-bit or 8-bit quantization:

```python

from smolagents import tool, TransformersModel, CodeAgent
from transformers import BitsAndBytesConfig

# 0. Set the bitsandbytes config for 4-bit (NF4) loading
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# 1. Create the quantized model that the agent will use
model = TransformersModel(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # use a bigger model for better results
    max_new_tokens=128,
    model_kwargs={"quantization_config": nf4_config},
)
```

Then enable bitsandbytes in `Brain` with a simple boolean flag in the constructor:

```python
brain = Brain(
    agent,
    algorithm="GRPO",
    use_bitsandbytes=True, # Enable 4-bit quantization
    # ... other Brain arguments
)
```

A complete bitsandbytes example is available [here](https://github.com/ToolBrain/ToolBrain/blob/main/examples/12_hello_world_bitsandbytes.py).

## 3. FP16 training (GPU only)

Training large models in full 32-bit floating-point precision (FP32) can quickly cause out-of-memory (OOM) errors. To mitigate this, ToolBrain supports both training and inference in 16-bit floating point (FP16). Follow the steps below to enable it:

1. Load the model in FP16:

```python

import torch
from smolagents import TransformersModel

# 1. Create the model in FP16
model = TransformersModel(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # use a bigger model for better results
    max_new_tokens=128,
    torch_dtype=torch.float16
)
```

2. Enable FP16 in `Brain` with a simple boolean flag in the constructor:

```python
brain = Brain(
    agent,
    algorithm="GRPO",
    fp16=True, # FP16 training
    # ... other Brain arguments
)
```
A complete FP16 fine-tuning example is available [here](https://github.com/ToolBrain/ToolBrain/blob/main/examples/13_hello_world_fp16.py).

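
To illustrate the FP32 versus FP16 trade-off described in this section, the sketch below uses only Python's standard `struct` module, which supports IEEE 754 half precision through the `"e"` format. It shows that an FP16 value takes half the bytes of an FP32 value, at the cost of reduced precision and range. The helper names are hypothetical and are not part of ToolBrain.

```python
import struct

FP32_BYTES = struct.calcsize("f")  # bytes per single-precision value
FP16_BYTES = struct.calcsize("e")  # bytes per half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be represented in FP16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


print(f"FP32: {FP32_BYTES} bytes per value, FP16: {FP16_BYTES} bytes per value")
print(f"0.1 stored in FP16 becomes {to_fp16(0.1)!r}")
print(f"65504.0 fits in FP16: {fits_in_fp16(65504.0)}")
print(f"100000.0 fits in FP16: {fits_in_fp16(100000.0)}")
```

The narrower range is one reason FP16 training can be numerically more fragile than FP32 training, so it is worth monitoring the loss for overflow or NaN values when you enable it.
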
## 4. Unsloth: Faster Training Speed (GPU only)

**What it is:** Unsloth is a powerful optimization library that rewrites the model's underlying code (e.g., the attention mechanism) to be significantly faster and more memory-efficient, especially when training with LoRA.

**Why it's useful:**
-   **Up to 2x Faster Training:** Unsloth can as much as double your training speed, depending on the model and hardware.
-   **Reduced Memory Usage:** It further reduces memory consumption, allowing you to use larger batch sizes.

**How to use it in ToolBrain:**

Unsloth is enabled when you first create your agent or model, as it needs to patch the model upon loading.

```python
from toolbrain import create_agent

# Enable Unsloth when creating the agent
agent = create_agent(
    model_id="Qwen/Qwen2.5-14B-Instruct",
    tools=[...],
    use_unsloth=True # Enable Unsloth here
)
```

## Putting It All Together: The Ultimate Efficiency Setup

The true power of these techniques comes from combining them. Here is how you would configure a `Brain` for maximum efficiency, using the email search agent from a previous tutorial as a template.

```python
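import importlib

from peft import LoraConfig
from toolbrain import Brain, create_agent

# The example module's name starts with a digit, so a plain `import` statement
# would be a syntax error; load it with importlib instead.
email_example = importlib.import_module("examples.07_email_search_agent")
email_tools = email_example.email_tools
custom_rewards = email_example.custom_rewards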

# 1. Define LoRA Config
LORA_CONFIG = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    # ... remaining LoRA settings, as in Section 1
)

# 2. Create agent with Unsloth enabled
print("Initializing agent with Unsloth...")
agent = create_agent(
    model_id="Qwen/Qwen3-4B-Instruct-2507",
    tools=[email_tools.search_emails, email_tools.read_email],
    use_unsloth=True # Unsloth enabled
)

# 3. Configure Brain with BitsAndBytes and pass LoRA config
print("Configuring Brain with BitsAndBytes and LoRA...")
brain = Brain(
    agent,
    algorithm="GRPO",
    learning_rate=1e-5,
    use_bitsandbytes=True, # BitsAndBytes enabled
    config={"lora_config": LORA_CONFIG} # Pass LoRA config
)

print("✅ Efficient training setup complete!")

# Now, when you call brain.train(), it will run with all optimizations.
# brain.train(training_data, num_iterations=4)
```

By using LoRA, BitsAndBytes, and Unsloth together, you can fine-tune large, powerful models on consumer-grade GPUs in a fraction of the time it would otherwise take, making advanced agent training accessible to more developers and researchers than ever before.
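
As a rough sanity check on that claim, the sketch below compares the training-state memory of full FP32 fine-tuning with Adam against a 4-bit base model plus LoRA adapters. All numbers are assumptions made for illustration: a 4-billion-parameter model, a trainable share of 0.5%, and 16 bytes per trainable parameter (FP32 weight, gradient, and two Adam moments). Activations and framework overhead are ignored, so real usage will be higher in both cases.

```python
BYTES_FP32 = 4
# FP32 weight + gradient + two Adam moment estimates per trainable parameter.
BYTES_PER_TRAINABLE_PARAM = 4 * BYTES_FP32


def full_finetune_gb(n_params: float) -> float:
    """Training-state memory when every parameter is trained in FP32."""
    return n_params * BYTES_PER_TRAINABLE_PARAM / 1e9


def lora_4bit_gb(n_params: float, trainable_fraction: float) -> float:
    """Frozen 4-bit base weights plus FP32 training state for the adapters."""
    base_bytes = n_params * 0.5  # 4 bits = 0.5 bytes per frozen weight
    adapter_bytes = n_params * trainable_fraction * BYTES_PER_TRAINABLE_PARAM
    return (base_bytes + adapter_bytes) / 1e9


# Hypothetical values, for illustration only.
n_params = 4e9
trainable_fraction = 0.005

full_gb = full_finetune_gb(n_params)
efficient_gb = lora_4bit_gb(n_params, trainable_fraction)

print(f"Full FP32 fine-tuning: ~{full_gb:.1f} GB")
print(f"4-bit base + LoRA:     ~{efficient_gb:.2f} GB")
```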