<a href="https://colab.research.google.com/github/nikhil-1e9/Cool-notebooks/blob/main/mcp-rl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To teach a model to use your MCP server, click **Runtime** > **Run all**. Before you do, enable a free Tesla T4 GPU and edit the [configuration](#configuration) cell below!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**MCP•RL: Teach your agent how to use any MCP server**

This notebook shows how to train a Qwen 2.5 3B model to use any MCP server effectively. Provide an MCP server URL and the notebook will:

1. Query the server's tools
2. Generate a set of input tasks that use those tools
3. Train the model on those tasks using automatic RULER evaluation
4. Test the trained model by giving it new tasks to complete

RULER judges response quality purely from the agent's final output - no labeled data required!

*Note: In this notebook we use a local server, but the technique below applies to all MCP servers!*


In [1]:
# @title 💿 Installation
# Portions adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks)
# Copyright (c) Unsloth contributors.
# License: GNU LGPL v3.0.
# Modifications by OpenPipe:
# - switched to uv
# - changed vllm/triton pinning logic
# - added protobuf pins
# - adjusted syntax for pushing to HF
# See /licenses/LGPL-3.0.txt and /licenses/GPL-3.0.txt for full text.

%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !uv pip install openpipe-art[backend]==0.5.9 tenacity fastmcp "mcp>=1.11.0" "gql<4" aiohttp --prerelease allow --no-cache-dir
else:
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except Exception:
        get_numpy = "numpy"
    try:
        import subprocess

        is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except Exception:
        is_t4 = False
    get_vllm, get_triton = (
        ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    )
    !uv pip install --upgrade \
        openpipe-art[backend]==0.4.11 tenacity fastmcp pillow==11.3.0 protobuf==5.29.5 {get_vllm} {get_numpy} --prerelease allow --no-cache-dir
    !uv pip install -qqq {get_triton}

<a name="configuration"></a>

### 🎯 Configuration - Edit These Settings

Add an OpenRouter API key below.

In [2]:
# Required - Used for generating training inputs and RULER evaluation
OPENROUTER_API_KEY = ""  # Put your OpenRouter key here (never commit a real key)

# 🔌 Point to any MCP server
MCP_SERVER_URL = "http://localhost:8900/mcp"

# Optional - Enables metric logging
WANDB_API_KEY = ""

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"  # Options: "Qwen/Qwen2.5-3B-Instruct", "Qwen/Qwen2.5-7B-Instruct", etc.

In [3]:
# @title Advanced Settings

# Model configuration
MODEL_NAME = "sql-agent-3b"  # Name for your trained model
PROJECT_NAME = "mcp-rl"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 16,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 1,  # Number of times through all data
    "rollouts_per_group": 4,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

MAX_TURNS = 10  # Maximum number of turns for the model to generate during one rollout

NUM_TEST_INPUTS = 8  # Number of test inputs to generate
RULER_MODEL = "openrouter/openai/gpt-4o-mini"  # Model for RULER evaluation
INPUT_GENERATION_MODEL = "openai/gpt-5-nano"

# Colab/T4 specific config to avoid OOM errors
MAX_TURNS = 3  # Overrides the MAX_TURNS = 10 set above: fewer turns avoid OOM errors on a T4
MAX_SEQ_LENGTH = 16384  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.7  # GPU memory usage (0.0-1.0)
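
To get a feel for what these settings imply before starting a run, the arithmetic can be sketched as below. This is a rough estimate, not ART's own accounting: `training_budget` is a hypothetical helper, and it assumes each epoch is split into batches of `groups_per_step` inputs, with a possibly smaller final batch.

```python
import math


def training_budget(
    num_training_inputs: int,
    groups_per_step: int,
    num_epochs: int,
    rollouts_per_group: int,
    num_test_inputs: int,
) -> dict:
    """Estimate scenario and rollout counts implied by the config (hypothetical helper)."""
    # Assumption: one training step consumes `groups_per_step` inputs.
    steps_per_epoch = math.ceil(num_training_inputs / groups_per_step)
    training_steps = steps_per_epoch * num_epochs
    rollouts_per_step = groups_per_step * rollouts_per_group
    return {
        # Train and test inputs are generated together, then split.
        "scenarios_to_generate": num_training_inputs + num_test_inputs,
        "training_steps": training_steps,
        "rollouts_per_step": rollouts_per_step,
        # Upper bound: the last batch of an epoch may hold fewer inputs.
        "max_total_rollouts": training_steps * rollouts_per_step,
    }


budget = training_budget(
    num_training_inputs=16,
    groups_per_step=2,
    num_epochs=1,
    rollouts_per_group=4,
    num_test_inputs=8,
)
print(budget)
```

With the defaults above this gives 24 scenarios to generate, 8 training steps, and 8 rollouts per step. Each rollout can take up to `MAX_TURNS` model calls, so lowering `rollouts_per_group` or `MAX_TURNS` is the quickest way to shorten a run.
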

In [4]:
# @title Debug utilities

import json
import time
import traceback
from typing import Any

DEBUG_LOG = True  # flip to False to silence logs
LOG_JSON_MAX = 2000  # cap large JSON prints


def _ts() -> str:
    return time.strftime("%H:%M:%S")


def log(msg: str, **kv):
    if not DEBUG_LOG:
        return
    parts = [f"[{_ts()}] {msg}"]
    if kv:
        kv_str = " ".join(f"{k}={repr(v)}" for k, v in kv.items())
        parts.append("| " + kv_str)
    print(" ".join(parts))


def log_json(title: str, payload: Any, max_len: int = LOG_JSON_MAX):
    if not DEBUG_LOG:
        return
    try:
        s = json.dumps(payload, indent=2, default=str)
    except Exception:
        s = str(payload)
    if len(s) > max_len:
        s = s[:max_len] + "\n... (truncated)"
    print(f"[{_ts()}] {title}:\n{s}")

In [18]:
# @title Create MCP server

%%writefile mcp_server.py
"""
FastMCP SQLite Database Server
A simple MCP server that exposes a company database for text-to-SQL agent training.
"""

import sqlite3

# Initialize in-memory SQLite database
DB = sqlite3.connect(":memory:")
DB.row_factory = sqlite3.Row

DB.executescript("""
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location TEXT NOT NULL,
    budget REAL NOT NULL
);

CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER REFERENCES departments(id),
    role TEXT NOT NULL,
    salary REAL NOT NULL,
    hire_date TEXT NOT NULL
);

CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER REFERENCES departments(id),
    lead_id INTEGER REFERENCES employees(id),
    status TEXT NOT NULL CHECK(status IN ('active', 'completed', 'on_hold')),
    budget REAL NOT NULL
);

-- Departments
INSERT INTO departments VALUES (1, 'Engineering',  'San Francisco', 2500000);
INSERT INTO departments VALUES (2, 'Marketing',    'New York',      1200000);
INSERT INTO departments VALUES (3, 'Data Science', 'London',        1800000);
INSERT INTO departments VALUES (4, 'Sales',        'New York',       900000);
INSERT INTO departments VALUES (5, 'Operations',   'San Francisco',  750000);

-- Employees
INSERT INTO employees VALUES (1,  'Alice Chen',      1, 'Senior Engineer',     145000, '2020-03-15');
INSERT INTO employees VALUES (2,  'Bob Martinez',     1, 'Staff Engineer',      175000, '2018-07-01');
INSERT INTO employees VALUES (3,  'Carol White',      2, 'Marketing Manager',   120000, '2019-11-20');
INSERT INTO employees VALUES (4,  'David Kim',        3, 'Data Scientist',      135000, '2021-01-10');
INSERT INTO employees VALUES (5,  'Eva Johnson',      1, 'Junior Engineer',      95000, '2023-06-01');
INSERT INTO employees VALUES (6,  'Frank Brown',      4, 'Sales Lead',          110000, '2020-09-15');
INSERT INTO employees VALUES (7,  'Grace Liu',        3, 'Senior Data Scientist',155000, '2019-04-22');
INSERT INTO employees VALUES (8,  'Henry Wilson',     2, 'Content Strategist',   98000, '2022-02-14');
INSERT INTO employees VALUES (9,  'Irene Davis',      5, 'Operations Manager',  115000, '2020-08-30');
INSERT INTO employees VALUES (10, 'James Taylor',     1, 'Engineering Manager',  165000, '2017-05-12');
INSERT INTO employees VALUES (11, 'Karen Patel',      3, 'ML Engineer',         140000, '2021-09-05');
INSERT INTO employees VALUES (12, 'Leo Nguyen',       4, 'Account Executive',    92000, '2023-01-18');
INSERT INTO employees VALUES (13, 'Maria Garcia',     5, 'Logistics Coordinator', 78000, '2022-07-25');
INSERT INTO employees VALUES (14, 'Nathan Scott',     2, 'Brand Designer',      105000, '2021-03-11');
INSERT INTO employees VALUES (15, 'Olivia Reed',      1, 'DevOps Engineer',     130000, '2020-12-01');

-- Projects
INSERT INTO projects VALUES (1, 'Cloud Migration',     1, 2,  'active',    500000);
INSERT INTO projects VALUES (2, 'Brand Refresh',       2, 3,  'completed', 200000);
INSERT INTO projects VALUES (3, 'Recommendation Engine',3, 7,  'active',    350000);
INSERT INTO projects VALUES (4, 'Q4 Sales Push',       4, 6,  'active',    150000);
INSERT INTO projects VALUES (5, 'Warehouse Automation', 5, 9,  'on_hold',   280000);
INSERT INTO projects VALUES (6, 'ML Pipeline v2',      3, 11, 'active',    420000);
INSERT INTO projects VALUES (7, 'Mobile App Redesign',  1, 10, 'active',    300000);
INSERT INTO projects VALUES (8, 'SEO Overhaul',        2, 8,  'completed', 120000);
""")

import json
from fastmcp import FastMCP

# Create the MCP server
mcp = FastMCP("company-db", instructions="You are a database assistant. Use the tools to explore the database schema and run SQL queries to answer questions about the company data.")

# Tool 1: List all tables
@mcp.tool()
def list_tables() -> str:
    """List all tables in the database."""
    cursor = DB.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
    tables = [row["name"] for row in cursor.fetchall()]
    return json.dumps(tables)


# Tool 2: Describe a table's schema
@mcp.tool()
def describe_table(table_name: str) -> str:
    """Get the column names, types, and constraints for a specific table.

    Args:
        table_name: Name of the table to describe.
    """
    # Validate table name to prevent injection
    cursor = DB.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table_name,),
    )
    if not cursor.fetchone():
        return json.dumps({"error": f"Table '{table_name}' not found."})

    columns = DB.execute(f"PRAGMA table_info({table_name})").fetchall()
    schema = [
        {
            "name": col["name"],
            "type": col["type"],
            "nullable": not col["notnull"],
            "primary_key": bool(col["pk"]),
        }
        for col in columns
    ]
    return json.dumps(schema, indent=2)


# Tool 3: Run a SQL query
@mcp.tool()
def run_query(sql: str) -> str:
    """Execute a read-only SQL query and return the results.

    Args:
        sql: A SELECT SQL query to run against the database.
    """
    # Block write operations
    stripped = sql.strip().upper()
    if not stripped.startswith("SELECT"):
        return json.dumps({
            "error": "Only SELECT queries are allowed."
        })

    try:
        cursor = DB.execute(sql)
        rows = [dict(row) for row in cursor.fetchall()]
        return json.dumps({"row_count": len(rows), "results": rows}, indent=2)
    except Exception as e:
        return json.dumps({"error": str(e)})


# Run the server
if __name__ == "__main__":
    mcp.run(transport="streamable-http", host="0.0.0.0", port=8900)

Overwriting mcp_server.py
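
The `run_query` tool above guards against writes by checking that the statement starts with `SELECT`. As a complementary safeguard, SQLite itself can be asked to refuse anything that is not a read. The sketch below uses `sqlite3.Connection.set_authorizer` for that; `read_only_authorizer`, `make_read_only`, and the tiny demo table are illustrative names, not part of the server above.

```python
import sqlite3

# Actions needed by plain SELECT statements; everything else is denied.
_READ_ONLY_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow read actions and deny all others (hypothetical helper)."""
    if action in _READ_ONLY_ACTIONS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def make_read_only(conn: sqlite3.Connection) -> None:
    """Install the authorizer. Call this only after the data has been loaded."""
    conn.set_authorizer(read_only_authorizer)


# Demo on a throwaway in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
demo.execute("INSERT INTO departments VALUES (1, 'Engineering')")
demo.commit()  # commit before locking down, since transaction control is denied too
make_read_only(demo)

rows = demo.execute("SELECT name FROM departments").fetchall()

try:
    demo.execute("DELETE FROM departments")
    write_blocked = False
except sqlite3.DatabaseError:
    write_blocked = True

print(rows, write_blocked)
```

If this approach were used in `mcp_server.py`, the authorizer would have to be installed after the `executescript` call that seeds the tables. Note that it also denies `PRAGMA` statements, so `describe_table` would need a separate connection or an extra allowed action.
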


In [19]:
# @title Run the server

import subprocess, time

process = subprocess.Popen(
    ["python", "mcp_server.py"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
time.sleep(3)

In [20]:
print(process.poll())

None
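
`process.poll()` returning `None` only shows that the server process is still alive, not that it is accepting connections yet. A fixed `time.sleep(3)` may be too short on a slow runtime. One possible alternative is to poll the port until it accepts a TCP connection; `wait_for_port` below is a hypothetical helper, and the host and port are assumed to match `MCP_SERVER_URL`.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 15.0, interval: float = 0.25) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Self-contained demo: listen on an ephemeral port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

ready_while_listening = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
listener.close()
ready_after_close = wait_for_port("127.0.0.1", demo_port, timeout=0.5)
print(ready_while_listening, ready_after_close)
```

In this notebook the call would look like `wait_for_port("localhost", 8900)`. If it returns `False`, reading `process.stderr` is a reasonable next step to see why the server did not start.
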


In [22]:
# @title 🔌 MCP helpers

from contextlib import asynccontextmanager

import mcp.types as types
from mcp.client.session import ClientSession
from mcp.client.streamable_http import streamablehttp_client

if not MCP_SERVER_URL:
    raise ValueError("MCP_SERVER_URL is empty. Set it in the Configuration cell.")


@asynccontextmanager
async def mcp_session():
    """
    Connects to the MCP server using the full URL.
    """
    async with streamablehttp_client(MCP_SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            yield session


async def list_tools_and_resources():
    """Return (tools_result, resources_result) from the server."""
    async with mcp_session() as session:
        tools = await session.list_tools()
        try:
            resources = await session.list_resources()
        except Exception:
            # Some servers don't implement resources; keep interface stable
            class _Empty:
                resources = []

            resources = _Empty()
        return tools, resources


async def call_mcp_tool(tool_name: str, arguments: dict):
    """Invoke a tool on the MCP server and return the CallToolResult."""
    async with mcp_session() as session:
        return await session.call_tool(tool_name, arguments)


tools, resources = await list_tools_and_resources()
print("Tools:", [t.name for t in tools.tools])
print(
    "Resources:",
    [getattr(r, "uri", None) for r in getattr(resources, "resources", []) or []],
)

ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
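
The `ExceptionGroup` message above hides the underlying cause (for example, a refused connection if the server is not up yet). A small helper can flatten the group and show the leaf errors. This is a debugging sketch: `leaf_exceptions` is a hypothetical name, and it relies only on the `exceptions` attribute that exception groups expose. The `FakeGroup` class is a stand-in so the demo runs on any Python version.

```python
def leaf_exceptions(exc: BaseException) -> list:
    """Recursively collect the non-group exceptions inside an exception group."""
    children = getattr(exc, "exceptions", None)
    if not children:
        return [exc]
    leaves = []
    for child in children:
        leaves.extend(leaf_exceptions(child))
    return leaves


class FakeGroup(Exception):
    """Stand-in for ExceptionGroup, used only for this demo."""

    def __init__(self, message, exceptions):
        super().__init__(message)
        self.exceptions = tuple(exceptions)


nested = FakeGroup(
    "unhandled errors in a TaskGroup",
    [FakeGroup("inner", [ConnectionRefusedError("connection refused")]), ValueError("bad url")],
)
leaves = leaf_exceptions(nested)
for leaf in leaves:
    print(type(leaf).__name__, "-", leaf)
```

In the notebook, the same helper could wrap the failing call: catch `Exception` around `await list_tools_and_resources()` and print each leaf before re-raising.
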

In [8]:
# @title Let's generate our train and validation scenarios!

import os
import random

from dotenv import load_dotenv

# Import the generate_scenarios function from art.mcp and logging utilities
from art.mcp import generate_scenarios
from art.mcp.generate_scenarios import preview_scenarios
from art.utils.logging import info, ok, step, warn, err

load_dotenv()

# required env/key check
# If OPENROUTER_API_KEY exists as a var, use it; otherwise pull from env
_openrouter_key = os.getenv("OPENROUTER_API_KEY")
try:
    _openrouter_key = _openrouter_key if _openrouter_key else OPENROUTER_API_KEY  # noqa: F821 (defined upstream in your notebook)
except NameError:
    pass

if _openrouter_key:
    os.environ["OPENROUTER_API_KEY"] = _openrouter_key
    ok("OPENROUTER_API_KEY found.")
else:
    err("OPENROUTER_API_KEY is required for data generation and RULER evaluation.")
    raise ValueError(
        "OPENROUTER_API_KEY is required for data generation and RULER evaluation."
    )

def get_content_text(result) -> str:
    # Extract text content from tool call result per MCP content schema
    if isinstance(result, str):
        return result
    if hasattr(result, "content") and result.content:
        out = ""
        for item in result.content:
            if isinstance(item, types.TextContent):
                out += item.text
            else:
                out += str(item)
        return out
    if hasattr(result, "structured") and result.structured is not None:
        try:
            return json.dumps(result.structured)
        except Exception:
            return str(result.structured)
    return str(result)

# Convert MCP tools and resources to the expected format
tools_result, resources_result = await list_tools_and_resources()

# Convert tools to the format expected by generate_scenarios
tools_list = []
for tool in tools_result.tools or []:
    tools_list.append({
        "name": tool.name,
        "description": tool.description,
        "parameters": tool.inputSchema,
    })

# Convert resources to the format expected by generate_scenarios
resources_list = []
for resource in getattr(resources_result, "resources", []) or []:
    resources_list.append({
        "uri": str(resource.uri),
        "name": resource.name,
        "description": resource.description,
        "mimeType": resource.mimeType,
    })

# First, get the actual schema from your database
schema_info = ""
for table in ["departments", "employees", "projects"]:
    result = await call_mcp_tool("describe_table", {"table_name": table})
    schema_info += f"\nTable '{table}': {get_content_text(result)}"

# Sample a few rows so the generator knows what kind of data exists
for table in ["departments", "employees", "projects"]:
    result = await call_mcp_tool("run_query", {"sql": f"SELECT * FROM {table} LIMIT 3"})
    schema_info += f"\nSample data from '{table}': {get_content_text(result)}"

# Now enrich the tool descriptions with this context
enriched_tools_list = []
for tool in tools_list:
    enriched = tool.copy()
    if tool["name"] == "run_query":
        enriched["description"] = (
            tool["description"] +
            f"\n\nAvailable database schema:{schema_info}"
        )
    enriched_tools_list.append(enriched)


# Calculate total scenarios needed
try:
    expected_total = TRAINING_CONFIG["num_training_inputs"] + NUM_TEST_INPUTS  # noqa: F821
except NameError:
    err("TRAINING_CONFIG/NUM_TEST_INPUTS not defined in this notebook.")
    raise

info(f"Target total scenarios: {expected_total}")

# Generate scenarios using the art.mcp function
max_attempts = 10
scenarios = None

for attempt in range(1, max_attempts + 1):
    step(f"Attempt {attempt}/{max_attempts} ...")
    t_attempt = time.perf_counter()
    try:
        scenario_collection = await generate_scenarios(
            tools=enriched_tools_list,
            resources=resources_list,
            num_scenarios=expected_total,
            show_preview=False,  # We'll preview separately for train/val
            generator_model=INPUT_GENERATION_MODEL,
            generator_api_key=_openrouter_key,
        )
        # Convert GeneratedScenarioCollection to list of dicts for compatibility
        scenarios = [{"task": s.task, "difficulty": s.difficulty} for s in scenario_collection.scenarios]
        ok(f"Attempt {attempt} succeeded in {time.perf_counter() - t_attempt:.2f}s.")
        break
    except Exception as e:
        warn(f"Attempt {attempt} failed: {e}")
        if attempt < max_attempts:
            time.sleep(min(1.5 * attempt, 6.0))
        else:
            err("All attempts exhausted.")
            raise

# Split into train/val
ok(f"Generated {len(scenarios)} scenarios total.")
step("Shuffling scenarios and splitting into train/val ...")
random.shuffle(scenarios)

train_n = TRAINING_CONFIG["num_training_inputs"]  # noqa: F821
raw_train_scenarios = scenarios[:train_n]
raw_val_scenarios = scenarios[train_n:]

ok(f"Train: {len(raw_train_scenarios)} | Val: {len(raw_val_scenarios)}")

info("Sample (train) preview:")
preview_scenarios(raw_train_scenarios, n=min(5, len(raw_train_scenarios)))

info("Sample (val) preview:")
preview_scenarios(raw_val_scenarios, n=min(5, len(raw_val_scenarios)))

ok("Done.")

[14:12:06] OK    OPENROUTER_API_KEY found.
[14:12:07] INFO  Target total scenarios: 24
[14:12:07] STEP  Attempt 1/10 ...
[14:12:07] OK    Using model: openai/gpt-5-nano
[14:12:07] INFO  Available: 3 tool(s), 0 resource(s).
[14:12:07] STEP  Preparing prompt & JSON schema &
[14:12:07] STEP  Calling model: openai/gpt-5-nano &
[14:12:57] OK    Model responded in 50.64s.
[14:12:57] INFO  Raw content length: 4912 chars.
[14:12:57] OK    Parsed 24 scenario(s) successfully.
[14:12:57] INFO  Difficulty distribution:
   1/5:   0  
   2/5:   3  ███
   3/5:  14  ██████████████
   4/5:   7  ███████
   5/5:   0  
[14:12:57] OK    Generated 24 scenarios in 50.71s total.
[14:12:57] OK    Attempt 1 succeeded in 50.71s.
[14:12:57] OK    Generated 24 scenarios total.
[14:12:57] STE

In [11]:
raw_train_scenarios[:5]

[{'task': 'Salary distribution: Compute average, minimum, and maximum salary per department and overall totals, then present both per-department and aggregate statistics. Include summary and full analysis/report.',
  'difficulty': 4},
 {'task': 'Stale-records check: Find employees missing critical fields (e.g., hire_date, email) or with nulls in key columns; report and analyze.',
  'difficulty': 3},
 {'task': 'Schema constraints audit: Describe and compare column constraints for the employees table and propose improvements; include summary and analysis/report.',
  'difficulty': 3},
 {'task': 'Hiring trend: Analyze hires by year and provide counts per year; include a summary and thorough analysis/report.',
  'difficulty': 4},
 {'task': "Project status by department: Classify each department's projects by status (active, completed, on-hold) and summarize counts; include summary and analysis/report.",
  'difficulty': 3}]
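
The scenario cell shuffles with the global `random` module before any seed is set, so the train/val split changes from run to run. If a repeatable split is wanted, a sketch like the following could be used instead. `split_scenarios` and its `seed` parameter are hypothetical, and the placeholder tasks stand in for the generated scenarios.

```python
import random


def split_scenarios(scenarios: list, train_n: int, seed: int = 42) -> tuple:
    """Shuffle a copy of the scenarios with a private RNG and split into train/val."""
    if train_n > len(scenarios):
        raise ValueError("train_n exceeds the number of scenarios")
    shuffled = list(scenarios)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_n], shuffled[train_n:]


# Placeholder data shaped like the generated scenarios.
demo_scenarios = [{"task": f"task {i}", "difficulty": 3} for i in range(24)]
train, val = split_scenarios(demo_scenarios, train_n=16)
print(len(train), len(val))
```

Using a private `random.Random(seed)` instance keeps the split stable without affecting other code that draws from the global generator.
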

In [12]:
# @title Run this cell to train your model!

import os
import random
from dataclasses import dataclass

import weave
from dotenv import load_dotenv
from openai import AsyncOpenAI

import art
from art.local import LocalBackend
from art.rewards import ruler_score_group
from art.utils import iterate_dataset

load_dotenv()

# Optional
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
    weave.init(PROJECT_NAME)
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")

random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=MAX_SEQ_LENGTH,
        dtype="float16",
    ),
    engine_args=art.dev.EngineArgs(
        enforce_eager=True,
        gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    ),
)

# Initialize the server
backend = LocalBackend(
    in_process=True,
    path="./.art",
)

# Register the model with the local Backend
await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)


@dataclass
class McpScenario:
    """A scenario for MCP agent evaluation against a server."""

    task_description: str
    max_turns: int = MAX_TURNS


@weave.op()
async def rollout(
    model: art.Model,
    scenario: McpScenario,
    debug: bool = False,
) -> art.Trajectory:
    """Run an MCP agent rollout against the MCP server."""
    traj = art.Trajectory(
        messages_and_choices=[],
        reward=0,
        metadata={"task": scenario.task_description},
        metrics={
            "task_completed": False,
            "success": False,
            "ran_out_of_turns": False,
        },
        scenario=scenario,
    )

    # Discover available tools from the remote server
    tools_result, _resources_result = await list_tools_and_resources()
    tool_names = [t.name for t in tools_result.tools]
    log("rollout: discovered tools", count=len(tool_names), names=tool_names)

    # Convert to OpenAI tool format
    tool_schemas = []
    for tool in tools_result.tools:
        tool_schema = {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description or f"MCP tool: {tool.name}",
                "parameters": tool.inputSchema or {"type": "object", "properties": {}},
            },
        }
        tool_schemas.append(tool_schema)

    # Add completion tool schema
    tool_schemas.append(
        {
            "type": "function",
            "function": {
                "name": "complete_task",
                "description": "Complete the task with a summary",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "summary": {
                            "type": "string",
                            "description": "Summary of accomplishments",
                        }
                    },
                    "required": ["summary"],
                },
            },
        }
    )

    traj.tools = tool_schemas

    # Initialize conversation
    system_prompt = (
        "You are a database agent. Use tools to explore "
        "the schema and run SQL queries to answer questions. "
        "Call 'complete_task' when done. "
        f"You have {scenario.max_turns} turns."
        # NOTE: 'Only use tool calls, do not write any content.' was removed because
        # some models freeze if they think plain text is disallowed. Let them output
        # thoughts; only tool calls are processed below.
    )

    traj.messages_and_choices = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Please complete this task: {scenario.task_description}",
        },
    ]

    num_turns = 0
    task_completed = False

    # Main interaction loop
    while num_turns < scenario.max_turns and not task_completed:
        num_turns += 1

        try:
            # === Log request ===
            last_user = next(
                (m for m in reversed(traj.messages()) if m["role"] == "user"), None
            )
            log(
                "LLM request",
                step=num_turns,
                model=(model.inference_model_name or model.name),
                tools=len(tool_schemas),
                last_user=(last_user["content"][:160] + "..." if last_user else None),
            )

            # Get LLM response
            async with traj.track_duration("llm_completion"):
                openai_client = AsyncOpenAI(
                    api_key=model.inference_api_key,
                    base_url=model.inference_base_url,
                )

                # We also log the request body (without huge params)
                req_preview = {
                    "model": model.inference_model_name
                    if model.inference_model_name
                    else model.name,
                    "messages_len": len(traj.messages()),
                    "tools_len": len(tool_schemas),
                }
                log_json("LLM request (preview)", req_preview)

                response = await openai_client.chat.completions.create(
                    model=model.inference_model_name
                    if model.inference_model_name
                    else model.name,
                    messages=traj.messages(),
                    tools=tool_schemas,
                    max_completion_tokens=8000,
                )

            # === Log response ===
            choice = response.choices[0]

            finish_reason = getattr(choice, "finish_reason", None)
            msg = choice.message
            has_tools = bool(getattr(msg, "tool_calls", None))
            content_preview = (
                (msg.content[:200] + "...")
                if isinstance(msg.content, str) and msg.content
                else str(msg.content)[:200]
            )
            log(
                "LLM response parsed",
                finish_reason=finish_reason,
                has_tool_calls=has_tools,
                content_preview=content_preview,
            )

            traj.messages_and_choices.append(choice)

            # Handle tool calls
            if msg.tool_calls:
                for tool_call in msg.tool_calls:
                    try:
                        log(
                            "Tool call received",
                            name=tool_call.function.name,
                            raw_args=tool_call.function.arguments,
                        )
                        tool_args = json.loads(tool_call.function.arguments or "{}")

                        if tool_call.function.name == "complete_task":
                            traj.metrics["task_completed"] = True
                            task_completed = True
                            traj.logs.append(
                                f"Task completion attempted with summary: {tool_args.get('summary', '')}"
                            )
                            # We still append a tool message for completeness
                            traj.messages_and_choices.append(
                                {
                                    "role": "tool",
                                    "tool_call_id": tool_call.id,
                                    "content": "Task marked complete.",
                                }
                            )
                        else:
                            # 🔧 Call the MCP tool on the server at MCP_SERVER_URL
                            result = await call_mcp_tool(
                                tool_call.function.name, tool_args
                            )

                            content_text = get_content_text(result)
                            log(
                                "Tool result",
                                name=tool_call.function.name,
                                len=len(content_text),
                            )

                            if len(content_text) > 20000:
                                # print(
                                #     f"Tool call result for {tool_call.function.name} is too long: {len(content_text)}"
                                # )
                                # print(f"Args: {tool_args}")
                                # print(content_text[:1000])
                                # print(content_text[-1000:])
                                raise Exception(
                                    f"Tool call result for {tool_call.function.name} is too long: {len(content_text)}"
                                )

                            # Add tool response
                            traj.messages_and_choices.append(
                                {
                                    "role": "tool",
                                    "tool_call_id": tool_call.id,
                                    "content": content_text,
                                }
                            )

                    except Exception as e:
                        traceback.print_exc()
                        traj.logs.append(f"Tool call error: {e}")

                        # Add error response
                        traj.messages_and_choices.append(
                            {
                                "role": "tool",
                                "tool_call_id": tool_call.id,
                                "content": f"Error: {str(e)}",
                            }
                        )
            else:
                # No tool calls: log and continue (RULER will likely give 0)
                log(
                    "LLM returned no tool_calls; skipping tool execution",
                    turn=num_turns,
                )
                # You can consider breaking here or letting it try another turn
                # break

        except Exception as e:
            traceback.print_exc()
            traj.logs.append(f"Error in turn {num_turns}: {e}")
            break

    if not task_completed and num_turns == scenario.max_turns:
        traj.metrics["ran_out_of_turns"] = True

    traj.metrics["num_turns"] = num_turns

    return traj.finish()


# =============== Training code ===============

print(
    f"Using config: max_turns={MAX_TURNS}, rollouts_per_group={TRAINING_CONFIG['rollouts_per_group']}, "
    f"groups_per_step={TRAINING_CONFIG['groups_per_step']}, num_epochs={TRAINING_CONFIG['num_epochs']}, "
    f"learning_rate={TRAINING_CONFIG['learning_rate']}"
)

await model.register(backend)

train_scenarios = [
    McpScenario(
        task_description=scenario["task"],
        max_turns=MAX_TURNS,
    )
    for scenario in raw_train_scenarios
]

# Create dataset iterator using raw scenarios
train_iterator = iterate_dataset(
    train_scenarios,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),  # Resume from checkpoint
)

# Main training loop using iterate_dataset
for batch in train_iterator:
    print("Gathering trajectory groups with RULER scoring...")

    # Use gather_trajectory_groups with ruler_score_group
    groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(model, scenario, False)
                for _ in range(TRAINING_CONFIG["rollouts_per_group"])
            )
            for scenario in batch.items
        ),
        pbar_desc=f"train gather step {batch.step}",
    )

    scored_groups = []
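    for group in groups:
        # Use RULER to assign relative scores to each trajectory
        judged_group = await ruler_score_group(
            group, judge_model=RULER_MODEL, debug=True, swallow_exceptions=True
        )
        # With swallow_exceptions=True a failed judgment is expected to come back as None; skip it
        if judged_group is not None:
            scored_groups.append(judged_group)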

    print("starting train")
    await model.train(
        scored_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
    )



WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.


  * regex for parameter names, must start with `re:`, e.g. `re:language\.layers\..+\.q_proj.weight`.


INFO 02-16 14:14:57 [__init__.py:244] Automatically detected platform cuda.
ERROR 02-16 14:15:00 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8



Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.6: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.9.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 78.23%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.56 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 9.17 GB

INFO 02-16 14:15:46 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-16 14:15:46 [cuda.py:360] Using XFormers backend.
INFO 02-16 14:15:47 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 02-16 14:15:47 [model_runner.py:1171] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 02-16 14:15:48 [bitsandbytes_loader.py:499] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 02-16 14:15:49 [weight_utils.py:292] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 02-16 14:16:03 [weight_utils.py:308] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 14.176608 seconds
INFO 02-16 14:16:04 [weight_utils.py:345] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-16 14:16:06 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 02-16 14:16:07 [model_runner.py:1203] Model loading took 2.2550 GiB and 18.474283 seconds
INFO 02-16 14:16:19 [worker.py:294] Memory profiling takes 10.88 seconds
INFO 02-16 14:16:19 [worker.py:294] the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.70) = 10.19GiB
INFO 02-16 14:16:19 [worker.py:294] model weights take 2.25GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.25GiB; the rest of the memory reserved for KV Cache is 6.66GiB.
INFO 02-16 14:16:20 [executor_base.py:113] # cuda blocks: 12132, # CPU blocks: 0
INFO 02-16 14:16:20 [executor_base.py:118] Maximum concurrency for 8192 tokens per request: 23.70x
INFO 02-16 14:16:20 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 12.24 seconds
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'q_norm', 'k_norm', 'pre_feedforward_layernorm']
Unsloth: 

Unsloth 2025.8.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


Model created!
Base model: Qwen/Qwen2.5-3B-Instruct
Model name: sql-agent-3b
Project name: mcp-rl
Using config: max_turns=3, rollouts_per_group=4, groups_per_step=2, num_epochs=1, learning_rate=1e-05


Iterating dataset:   0%|          | 0/8 [00:00<?, ?batch/s]

Gathering trajectory groups with RULER scoring...


train gather step 0:   0%|          | 0/8 [00:00<?, ?it/s]

 (subsequent messages of this type will be suppressed)


[14:16:51] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:16:51] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Turnover proxy by department: Calculate hires in the last 12 months by department to approximate turnover and highlight departments w...'
[14:16:51] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:16:51] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:16:51] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Turnover proxy by department: Calculate hires in the last 12 months by department to approximate turnover and highlight departments w...'
[14:16:51] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:16:51] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:16:51] LLM reque

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8446 tokens (446 in the messa

starting train


Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 0 to 1 (no training occurred)
Gathering trajectory groups with RULER scoring...


train gather step 1:   0%|          | 0/8 [00:00<?, ?it/s]

[14:17:06] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:06] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Integrity check for departments: Find employees whose department_id does not exist in the departments table and report the count and ...'
[14:17:06] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:06] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:06] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Integrity check for departments: Find employees whose department_id does not exist in the departments table and report the count and ...'
[14:17:06] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:06] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:06] LLM reque

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8443 tokens (443 in the messa

starting train
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 1 to 2 (no training occurred)
Gathering trajectory groups with RULER scoring...


train gather step 2:   0%|          | 0/8 [00:00<?, ?it/s]

[14:17:14] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:14] LLM request | step=1 model='sql-agent-3b' tools=4 last_user="Please complete this task: Department managers mapping: List each department's manager and flag departments without an assigned manager; include summary and ana..."
[14:17:14] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:14] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:14] LLM request | step=1 model='sql-agent-3b' tools=4 last_user="Please complete this task: Department managers mapping: List each department's manager and flag departments without an assigned manager; include summary and ana..."
[14:17:14] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:14] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:14] LLM reque

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8435 tokens (435 in the messa


starting train
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 2 to 3 (no training occurred)
Gathering trajectory groups with RULER scoring...


train gather step 3:   0%|          | 0/8 [00:00<?, ?it/s]

[14:17:22] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:22] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Hiring trend: Analyze hires by year and provide counts per year; include a summary and thorough analysis/report....'
[14:17:22] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:22] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:22] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Hiring trend: Analyze hires by year and provide counts per year; include a summary and thorough analysis/report....'
[14:17:22] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:22] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:22] LLM request | step=1 model='sql-agent-3b' tools=4 l

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8434 tokens (434 in the messa

[14:17:23] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:23] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:23] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Hiring trend: Analyze hires by year and provide counts per year; include a summary and thorough analysis/report....'
[14:17:23] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:23] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:23] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: NULL-prone payroll: Identify NULL salary entries and estimate their impact on payroll totals; include summary and analysis/report....'



starting train
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 3 to 4 (no training occurred)
Gathering trajectory groups with RULER scoring...


train gather step 4:   0%|          | 0/8 [00:00<?, ?it/s]

[14:17:32] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:32] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Headcount by department: Compute how many employees are in each department and produce a per-department headcount report. Include a s...'
[14:17:32] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:32] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:32] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Headcount by department: Compute how many employees are in each department and produce a per-department headcount report. Include a s...'
[14:17:32] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:32] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:32] LLM reque

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8448 tokens (448 in the messa

[14:17:32] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}


starting train
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 4 to 5 (no training occurred)
Gathering trajectory groups with RULER scoring...


train gather step 5:   0%|          | 0/8 [00:00<?, ?it/s]

[14:17:43] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:43] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Department expense leadership: Determine the top 5 departments by total salary expense (sum of salaries for employees in each departm...'
[14:17:43] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:43] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:43] LLM request | step=1 model='sql-agent-3b' tools=4 last_user='Please complete this task: Department expense leadership: Determine the top 5 departments by total salary expense (sum of salaries for employees in each departm...'
[14:17:43] LLM request (preview):
{
  "model": "sql-agent-3b",
  "messages_len": 2,
  "tools_len": 4
}
[14:17:43] rollout: discovered tools | count=3 names=['list_tables', 'describe_table', 'run_query']
[14:17:43] LLM reque

Traceback (most recent call last):
  File "/tmp/ipython-input-2398351722.py", line 185, in rollout
    response = await openai_client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 2028, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1584, in request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 8192 tokens. However, you requested 8442 tokens (442 in the messa

ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
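
Every rollout in the output above is rejected with a 400 error because the prompt plus the requested completion exceeds the model's 8192-token context window, which likely explains why each step reports that no training occurred. The first traceback shows 8446 tokens requested with 446 in the messages; the log line is truncated, so treating the remaining 8000 as the completion budget is an inference from subtraction. Below is a minimal sketch of one way to keep a request inside the window. The helper name `clamp_completion_tokens` and the 64-token safety margin are hypothetical, and the prompt token count would have to come from the tokenizer (tool schemas included).

```python
def clamp_completion_tokens(
    prompt_tokens: int,
    requested: int,
    context_window: int = 8192,  # limit reported in the error message above
    safety_margin: int = 64,     # assumption: headroom for template/tool tokens
) -> int:
    """Return a completion budget so prompt + completion fits in the window."""
    available = context_window - prompt_tokens - safety_margin
    if available <= 0:
        raise ValueError(
            f"Prompt of {prompt_tokens} tokens leaves no room for a completion"
        )
    return min(requested, available)


# Figures from the first traceback: 446 prompt tokens, 8000 inferred completion tokens
budget = clamp_completion_tokens(prompt_tokens=446, requested=8000)
print(budget)  # 7682
```

The clamped value would then be passed as the completion limit (for example `max_completion_tokens`) in the `chat.completions.create` call inside `rollout`.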

In [None]:
# @title Test Your Model!

# Generate test inputs
print("Generating test inputs...")
val_scenarios = [
    McpScenario(
        task_description=scenario["task"],
        max_turns=MAX_TURNS,
    )
    for scenario in raw_val_scenarios
]

print(f"\n🧪 Testing the trained model on {len(val_scenarios)} new inputs:\n")
print("=" * 80)

for i, scenario in enumerate(val_scenarios):
    print(f"\nTest {i + 1}:")
    print(f"Input: {scenario.task_description}")

    # Run the model
    result_trajectory = await rollout(model, scenario)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]["content"] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(
    f"\nYour model '{MODEL_NAME}' has been trained to use the MCP server at:"
)
print(MCP_SERVER_URL)
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print(
    "3. Or continue training with more examples by adjusting the configuration at the top"
)

In [None]:
# @title Upload to Hugging Face 🤗

# Adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks), licensed under GNU LGPL v3.0.
# © Unsloth contributors. Modifications © 2025 OpenPipe, Inc.
# See THIRD-PARTY-NOTICES and licenses/LGPL-3.0.txt for details.

import torch
from unsloth import FastLanguageModel

lora_model_path = (
    f".art/{model.project}/models/{model.name}/checkpoints/{await model.get_step():04d}"
)

peft_model, peft_tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_path,
    max_seq_length=16384,
    dtype=None,  # let Unsloth pick: float16 on the T4 above (Bfloat16 = FALSE), bfloat16 on newer GPUs
    load_in_4bit=True,
)

UPLOAD_MODEL = False  # Set True when you're ready to upload your model to Hugging Face
HF_ACCOUNT = "your_hf_account"
HF_TOKEN = "your_hf_token"

if UPLOAD_MODEL:
    peft_model.push_to_hub_merged(
        f"{HF_ACCOUNT}/{model.name}", peft_tokenizer, token=HF_TOKEN
    )

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A local MCP server
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Generate more varied input examples
2. **Longer training**: Increase the number of training steps, for example by raising `num_epochs` or adding training scenarios
3. **More comparisons**: Increase `rollouts_per_group` so RULER has more trajectories to rank within each group
4. **MCP server refinement**: Add better tools and resources to the server
5. **Hyperparameter tuning**: Adjust `learning_rate`, `groups_per_step`, and similar settings
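
The effect of these settings on run size can be estimated with simple arithmetic. The sketch below assumes each epoch is split into batches of `groups_per_step` scenarios and that every scenario gets `rollouts_per_group` rollouts, which is consistent with the progress bars in the training output (8 batches, 8 rollouts per step). The function name `training_plan` is hypothetical, and the figure of 16 scenarios is inferred from the 8-batch progress bar, not stated in the notebook.

```python
import math


def training_plan(
    num_scenarios: int,
    groups_per_step: int,
    rollouts_per_group: int,
    num_epochs: int,
) -> dict:
    """Estimate training steps and rollout counts for a given configuration."""
    # Assumption: each epoch is chunked into batches of groups_per_step scenarios
    steps_per_epoch = math.ceil(num_scenarios / groups_per_step)
    return {
        "steps": steps_per_epoch * num_epochs,
        "rollouts_per_step": groups_per_step * rollouts_per_group,
        "total_rollouts": num_scenarios * num_epochs * rollouts_per_group,
    }


# Configuration printed in the training output, with an assumed 16 scenarios
print(training_plan(16, groups_per_step=2, rollouts_per_group=4, num_epochs=1))

# Doubling rollouts_per_group and training for 3 epochs
print(training_plan(16, groups_per_step=2, rollouts_per_group=8, num_epochs=3))
```

Larger groups give RULER more trajectories to compare but multiply the number of model calls per step, so the totals are worth checking before a long run.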

Remember: RULER judges each trajectory relative to the others in its group, based on the agent's output - no labeled data required!
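
Relative scoring only helps when the trajectories in a group actually differ. The training log's "Skipping tuning" message says as much: when all rewards in a group are equal, there is no advantage to train on. The sketch below illustrates this with a simplified mean-centered advantage. It is not ART's actual implementation, and `group_advantages` and `has_training_signal` are hypothetical names.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Center each reward on the group mean (simplified group-relative advantage)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_training_signal(rewards: list[float], tol: float = 1e-9) -> bool:
    """A group is useful only if at least one advantage is non-zero."""
    return any(abs(a) > tol for a in group_advantages(rewards))


# Four failed rollouts with identical rewards: nothing to learn from
print(has_training_signal([0.0, 0.0, 0.0, 0.0]))  # False

# Mixed-quality rollouts: better trajectories get positive advantages
print(has_training_signal([0.2, 0.9, 0.5, 0.4]))  # True
```

Under this simplified view, a step where every rollout fails in the same way contributes nothing, regardless of how many rollouts the group contains.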

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).