# The SmolAgents Framework

## Introduction to smolagents

Advantages of `smolagents`:
- **Simplicity**
- **Flexible LLM Support**
- **Code-First Approach**
- **HF Hub Integration**

Unlike other frameworks where agents write actions in JSON, `smolagents` **focuses on tool calls in code**. This simplifies execution: there is no need to parse JSON in order to build the code that calls the tools, because the output can be executed directly.

Agents in `smolagents` operate as **multi-step agents**. In each step, a [`MultiStepAgent`](https://huggingface.co/docs/smolagents/main/en/reference/agents#smolagents.MultiStepAgent) performs:
- one thought
- one tool call and execution


In addition to using [`CodeAgent`](https://huggingface.co/docs/smolagents/main/en/reference/agents#smolagents.CodeAgent) as the primary type of agent, `smolagents` also supports [`ToolCallingAgent`](https://huggingface.co/docs/smolagents/main/en/reference/agents#smolagents.ToolCallingAgent), which writes tool calls in JSON.



`smolagents` supports flexible LLM integration, allowing us to use any callable model that meets certain criteria. The framework provides several predefined classes to simplify model connections:
- `TransformersModel` - implements a local `transformers` pipeline
- `InferenceClientModel` - supports serverless inference calls through Hugging Face's infrastructure
- `LiteLLMModel` - leverages [`LiteLLM`](https://www.litellm.ai/) for lightweight model interactions
- `OpenAIServerModel` - connects to any service that offers an OpenAI API interface
- `AzureOpenAIServerModel` - supports integration with any Azure OpenAI deployment

## Building Agents that Use Code

Code agents are the default agent type in `smolagents`. They generate Python tool calls to perform actions, achieving action representations that are efficient, expressive, and accurate.

In a multi-step agent process, the LLM writes and executes actions, typically involving external tool calls. Traditional approaches use a JSON format to specify tool names and arguments as strings, which **the system must parse to determine which tool to execute.**

However, research shows that **tool-calling LLMs work more effectively with code directly**. This is a core principle of `smolagents`. Writing actions in code rather than JSON offers several advantages:
- **Composability** - easily combine and reuse actions
- **Object management** - work directly with complex structures like images
- **Generality** - express any computationally possible task
- **Natural for LLMs** - high-quality code is already present in LLM training data



A `CodeAgent` performs actions through a cycle of steps, with existing variables and knowledge being incorporated into the agent's context, which is kept in an execution log:
- The system prompt is stored in a `SystemPromptStep`, and the user query is logged in a `TaskStep`.
- Then the following while loop is executed:
  - The `agent.write_memory_to_messages()` method writes the agent's logs into a list of LLM-readable chat messages.
  - These messages are sent to a `Model`, which generates a completion.
  - The completion is parsed to extract the action, which, in this case, should be a code snippet since we work with a `CodeAgent`.
  - The action is executed.
  - The results are logged into memory in an `ActionStep`.

At the end of each step, any functions registered in `agent.step_callback` are executed.
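The cycle described above can be sketched with a toy loop. This is not the `smolagents` implementation: `ToyStep`, `fake_model`, and `run_toy_agent` are hypothetical names, the "model" is a hard-coded stand-in for an LLM, and the snippet is executed with plain `exec` rather than a sandbox.

```python
# Toy sketch of a multi-step agent loop. This is NOT the smolagents source:
# ToyStep, fake_model, and run_toy_agent are hypothetical names.
from dataclasses import dataclass


@dataclass
class ToyStep:
    kind: str      # "system", "task", or "action"
    content: str


ROLE_BY_KIND = {"system": "system", "task": "user", "action": "assistant"}


def write_memory_to_messages(memory):
    """Convert the logged steps into chat messages a model could read."""
    return [{"role": ROLE_BY_KIND[s.kind], "content": s.content} for s in memory]


def fake_model(messages):
    """Stand-in for an LLM: returns a Python snippet as the next action."""
    actions_so_far = sum(1 for m in messages if m["role"] == "assistant")
    if actions_so_far == 0:
        return "total = 30 + 60 + 45 + 45"
    return "final_answer = total"


def run_toy_agent(task, max_steps=5):
    # The system prompt and the task are logged before the loop starts.
    memory = [ToyStep("system", "You write Python actions."), ToyStep("task", task)]
    state = {}  # variables persist between steps
    for _ in range(max_steps):
        messages = write_memory_to_messages(memory)
        action = fake_model(messages)
        exec(action, {}, state)  # unsafe outside a toy; real agents sandbox this
        memory.append(ToyStep("action", action))
        if "final_answer" in state:
            return state["final_answer"], memory
    return None, memory


answer, memory = run_toy_agent("How many minutes of preparation are needed?")
print(answer)
```

The second action can read `total` only because the variables created by the first action are kept in `state`, which mirrors how existing variables are carried between steps.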

### Example

In [None]:
!pip install -qU smolagents

#### Selecting a playlist for the party using smolagents

We can build an agent capable of searching the web using DuckDuckGo. For the model, we will rely on `InferenceClientModel`. The default model is `Qwen/Qwen2.5-Coder-32B-Instruct`.

In [None]:
from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[DuckDuckGoSearchTool()]
)

In [None]:
# test
agent.run(
    "Search for the best music recommendations for a party at the Wayne's mansion."
)

#### Using a custom tool to prepare the menu

We can use the `@tool` decorator to define a custom function that acts as a tool.

In [None]:
from smolagents import CodeAgent, tool, InferenceClientModel

# Tool to suggest a menu based on the occasion
@tool
def suggest_menu(occasion: str) -> str:
    """Suggests a menu based on the occasion.
    Args:
        occasion (str): The type of occasion for the party. Allowed values are:
                        - "casual": Menu for casual party.
                        - "formal": Menu for formal party.
                        - "superhero": Menu for superhero party.
                        - "custom": Custom menu
    """
    if occasion == "casual":
        return "Pizza, snacks, and drinks."
    elif occasion == "formal":
        return "3-course dinner with wine and dessert."
    elif occasion == "superhero":
        return "Buffet with high-energy and healthy food."
    else:
        return "Custom menu for the butler."



agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[suggest_menu]
)

In [None]:
agent.run(
    "Prepare a formal menu for the party."
)

#### Using python imports inside the agent

With the playlist and menu ready, our agent needs to calculate when everything will be ready if Alfred starts preparing now.

`smolagents` specializes in agents that write and execute Python code snippets, offering sandboxed execution for security.

**Code execution has strict security measures** - imports outside a predefined safe list are blocked by default. However, we can authorize additional imports by passing them as strings in `additional_authorized_imports`:

In [None]:
from smolagents import CodeAgent, InferenceClientModel
import numpy as np
import time
import datetime

agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[],
    additional_authorized_imports=['datetime']
)
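To illustrate the idea of an import allow-list, here is a minimal sketch that inspects a code snippet with the standard `ast` module and reports imports that are not authorized. The function name `find_blocked_imports` is hypothetical, and this is not how `smolagents` implements its checks; it only shows the principle.

```python
import ast


def find_blocked_imports(code: str, authorized: set[str]) -> list[str]:
    """Return the imported modules whose top-level package is not authorized."""
    blocked = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. "os" for "os.path".
            if name.split(".")[0] not in authorized:
                blocked.append(name)
    return blocked


snippet = "import datetime\nimport os\nfrom subprocess import run"
print(find_blocked_imports(snippet, {"datetime"}))
```

Here `datetime` passes because it is on the allow-list, while `os` and `subprocess` are reported.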

In [None]:
agent.run(
    """
    Alfred needs to prepare for the party. Here are the tasks:
    1. Prepare the drinks - 30 minutes
    2. Decorate the mansion - 60 minutes
    3. Set up the menu - 45 minutes
    4. Prepare the music and playlist - 45 minutes

    If we start right now, at what time will the party be ready?
    """
)
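As a sanity check on the task above, the calculation the agent is expected to perform can be written by hand with `datetime`. This sketch assumes the four tasks are done one after another; the helper name `ready_time` and the fixed start time are illustrative only.

```python
from datetime import datetime, timedelta

# Durations in minutes, taken from the task description.
task_minutes = {
    "Prepare the drinks": 30,
    "Decorate the mansion": 60,
    "Set up the menu": 45,
    "Prepare the music and playlist": 45,
}

total = timedelta(minutes=sum(task_minutes.values()))


def ready_time(start: datetime) -> datetime:
    """Return the finish time when the tasks run sequentially from `start`."""
    return start + total


# Illustrative fixed start time; the agent would use datetime.now() instead.
print(ready_time(datetime(2025, 1, 1, 18, 0)))
```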

#### Sharing our custom party preparator agent to the Hub

In [None]:
agent.push_to_hub('<your_username>/AlfredAgent')

In [None]:
alfred_agent = agent.from_hub('<your_username>/AlfredAgent', trust_remote_code=True)

alfred_agent.run("Give me the best playlist for a party at Wayne's mansion. The party idea is a 'villain masquerade' theme")

To complete the agent and make it more capable, we use the following implementation:

In [None]:
from smolagents import CodeAgent, DuckDuckGoSearchTool, FinalAnswerTool, InferenceClientModel, Tool, tool, VisitWebpageTool

@tool
def suggest_menu(occasion: str) -> str:
    """
    Suggests a menu based on the occasion.
    Args:
        occasion: The type of occasion for the party.
    """
    if occasion == "casual":
        return "Pizza, snacks, and drinks."
    elif occasion == "formal":
        return "3-course dinner with wine and dessert."
    elif occasion == "superhero":
        return "Buffet with high-energy and healthy food."
    else:
        return "Custom menu for the butler."

@tool
def catering_service_tool(query: str) -> str:
    """
    This tool returns the highest-rated catering service in Gotham City.

    Args:
        query: A search term for finding catering services.
    """
    # Example list of catering services and their ratings
    services = {
        "Gotham Catering Co.": 4.9,
        "Wayne Manor Catering": 4.8,
        "Gotham City Events": 4.7,
    }

    # Find the highest rated catering service (simulating search query filtering)
    best_service = max(services, key=services.get)

    return best_service


class SuperheroPartyThemeTool(Tool):
    name = "superhero_party_theme_generator"
    description = """
    This tool suggests creative superhero-themed party ideas based on a category.
    It returns a unique party theme idea."""

    inputs = {
        "category": {
            "type": "string",
            "description": "The type of superhero party (e.g., 'classic heroes', 'villain masquerade', 'futuristic Gotham').",
        }
    }

    output_type = "string"

    def forward(self, category: str):
        # Keys are lowercase because the lookup below uses category.lower()
        themes = {
            "classic heroes": "Justice League Gala: Guests come dressed as their favorite DC heroes with themed cocktails like 'The Kryptonite Punch'.",
            "villain masquerade": "Gotham Rogues' Ball: A mysterious masquerade where guests dress as classic Batman villains.",
            "futuristic gotham": "Neo-Gotham Night: A cyberpunk-style party inspired by Batman Beyond, with neon decorations and futuristic gadgets."
        }

        return themes.get(category.lower(), "Themed party idea not found. Try 'classic heroes', 'villain masquerade', or 'futuristic Gotham'.")



In [None]:
agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[
        DuckDuckGoSearchTool(),
        VisitWebpageTool(),
        suggest_menu,
        catering_service_tool,
        SuperheroPartyThemeTool(),
        FinalAnswerTool()
    ],
    max_steps=10,
    verbosity_level=2
)

In [None]:
agent.run(
    "Give me the best playlist for a party at the Wayne's mansion. The party idea is a 'villain masquerade' theme"
)

#### Inspecting our party preparator agent with OpenTelemetry and Langfuse

In [None]:
!pip install -qU opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents

As we refine the Party Preparator Agent, its runs become harder to debug: agents are unpredictable by nature and difficult to inspect. We need robust traceability for future monitoring and analysis.

`smolagents` embraces the [OpenTelemetry](https://opentelemetry.io/) standard for instrumenting agent runs, allowing seamless inspection and logging. With the help of [Langfuse](https://langfuse.com/) and the `SmolagentsInstrumentor`, we can track and analyze the agent's behavior.

Make sure to set up API keys in Langfuse.

In [None]:
import os
import base64
from google.colab import userdata

LANGFUSE_PUBLIC_KEY = userdata.get('LANGFUSE_PUBLIC_KEY')
LANGFUSE_SECRET_KEY = userdata.get('LANGFUSE_SECRET_KEY')
LANGFUSE_AUTH = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()

#os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel" # EU data region
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://us.cloud.langfuse.com/api/public/otel" # US data region

os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {LANGFUSE_AUTH}"

Next we will initialize the `SmolagentsInstrumentor` and start tracking the agent's performance.

In [None]:
from opentelemetry.sdk.trace import TracerProvider

from openinference.instrumentation.smolagents import SmolagentsInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

trace_provider = TracerProvider()
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))

SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)

The agent is now connected and the runs from `smolagents` are being logged in Langfuse, giving full visibility into the agent's behavior.

In [None]:
from smolagents import CodeAgent, InferenceClientModel

agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[],
)

In [None]:
alfred_agent = agent.from_hub('sergiopaniego/AlfredAgent', trust_remote_code=True)
alfred_agent.run("Give me the best playlist for a party at Wayne's mansion. The party idea is a 'villain masquerade' theme")

Now we can check the trace in Langfuse cloud.

## Writing Actions as Code Snippets or JSON Blobs

Tool Calling Agents are the second type of agent available in `smolagents`.

Unlike Code Agents that use Python snippets, these agents **use the built-in tool-calling capabilities of LLM providers** to generate tool calls as **JSON structures**. This is the standard approach used by OpenAI, Anthropic, and many other providers.

Suppose that we want to search for catering services and party ideas. A `CodeAgent` would generate and run Python code:

```python
for query in [
    "Best catering services in Gotham City",
    "Party theme ideas for superheroes"
]:
    print(web_search(f"Search for: {query}"))
```

A `ToolCallingAgent` would instead create a JSON structure:
```json
[
    {"name": "web_search", "arguments": "Best catering services in Gotham City"},
    {"name": "web_search", "arguments": "Party theme ideas for superheroes"}
]
```
This JSON blob is then used to execute the tool calls.


Tool Calling Agents follow the same multi-step workflow as Code Agents. The key difference is in **how they structure their actions**: instead of executable code, they **generate JSON objects that specify tool names and arguments**. The system then **parses these instructions** to execute the appropriate tools.
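The parsing step can be sketched as a small dispatcher that reads the JSON blob and calls the matching function. This is an illustration only: `TOOLS`, `execute_tool_calls`, and the stub `web_search` are hypothetical, and the real `ToolCallingAgent` relies on the provider's tool-calling output rather than this code.

```python
import json


def web_search(query: str) -> str:
    """Stub standing in for a real search tool."""
    return f"results for: {query}"


# Registry mapping tool names to callables.
TOOLS = {"web_search": web_search}


def execute_tool_calls(blob: str) -> list[str]:
    """Parse a JSON list of tool calls and run each one in order."""
    results = []
    for call in json.loads(blob):
        tool_fn = TOOLS[call["name"]]
        args = call["arguments"]
        # Accept either a dict of keyword arguments or a single bare value.
        results.append(tool_fn(**args) if isinstance(args, dict) else tool_fn(args))
    return results


blob = """[
    {"name": "web_search", "arguments": "Best catering services in Gotham City"},
    {"name": "web_search", "arguments": {"query": "Party theme ideas for superheroes"}}
]"""
print(execute_tool_calls(blob))
```

Compared with the `CodeAgent` loop, the dispatcher needs explicit lookup and argument handling for every call, which is the parsing overhead that code actions avoid.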

### Example

For the party preparations, instead of using `CodeAgent`, we will use `ToolCallingAgent`:

In [None]:
from smolagents import ToolCallingAgent, DuckDuckGoSearchTool, InferenceClientModel

agent = ToolCallingAgent(
    model=InferenceClientModel(),
    tools=[DuckDuckGoSearchTool()]
)

In [None]:
agent.run(
    "Search for the best music recommendations for a party at the Wayne's mansion."
)

The agent generates a structured tool call that the system processes to produce the output, rather than directly executing code like a `CodeAgent`.

## Tools

In `smolagents`, tools are treated as **functions that an LLM can call within an agent system**.

To interact with a tool, the LLM needs an **interface description** with
- **Name** - what the tool is called
- **Tool description** - what the tool does
- **Input types and descriptions** - what arguments the tool accepts
- **Output type** - what the tool returns

For example, a simple search tool interface may have
- **Name** - `web_search`
- **Tool description** - Searches the web for specific queries
- **Input** - `query (string): the search term to look up`
- **Output** - string containing the search results

### Tool Creation Methods

In `smolagents`, tools can be defined in two ways:
- Using the `@tool` decorator for simple function-based tools
- Creating a subclass of `Tool` for more complex functionality

#### The `@tool` decorator

Using the `@tool` decorator, `smolagents` will parse basic information about the function from Python. Using this approach, we define a function with
- **A clear and descriptive function name** that helps the LLM understand its purpose.
- **Type hints for both inputs and outputs** to ensure proper usage.
- **A detailed description**, including an `Args:` section where each argument is explicitly described. These descriptions provide valuable context for the LLM.
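To see why the name, type hints, and docstring matter, the sketch below extracts them from a function using only the standard library. It is a rough approximation of what a decorator like `@tool` could do, not the actual `smolagents` code; `describe_function` and `greet_guest` are hypothetical names.

```python
import inspect
from typing import get_type_hints


def describe_function(fn):
    """Build a simple interface description from a function's metadata."""
    hints = get_type_hints(fn)
    return_type = hints.pop("return", None)
    doc = inspect.getdoc(fn) or ""
    return {
        "name": fn.__name__,
        # Everything before "Args:" is treated as the tool description.
        "description": doc.split("Args:")[0].strip(),
        "inputs": {arg: getattr(hint, "__name__", str(hint)) for arg, hint in hints.items()},
        "output_type": getattr(return_type, "__name__", None),
    }


def greet_guest(name: str) -> str:
    """Returns a greeting for a party guest.

    Args:
        name: The guest's name.
    """
    return f"Welcome, {name}!"


spec = describe_function(greet_guest)
print(spec)
```

A function without type hints or a docstring would produce an empty description and no input types, leaving the LLM with little to go on.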

We will use the `@tool` decorator to implement a tool to search for the best catering services for a large number of guests.

In [None]:
from smolagents import CodeAgent, InferenceClientModel, tool

# Assume that we have a function that fetches the highest-rated catering services
@tool
def catering_service_tool(query: str) -> str:
    """This tool returns the highest-rated catering service in Gotham City.

    Args:
        query: A search term for finding catering services.
    """

    # Example list of catering services and their ratings
    services = {
        "Gotham Catering Co.": 4.9,
        "Wayne Manor Catering": 4.8,
        "Gotham City Events": 4.7
    }

    # Find the highest rated catering service (simulating search query filtering)
    best_service = max(services, key=services.get)

    return best_service

In [None]:
agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[catering_service_tool]
)

In [None]:
# Run the agent to find the best catering service
result = agent.run(
    "Can you give me the name of the highest-rated catering service in Gotham City?"
)
print(result)

#### `Tool` class

For complex tools, we can implement a class instead of a Python function. The class wraps the function with metadata that helps the LLM understand how to use it effectively.

In this class, we define
- `name` - the tool's name
- `description` - a description used to populate the agent's system prompt
- `inputs` - a dictionary with keys `type` and `description`, providing information to help the Python interpreter process inputs
- `output_type` - specifying the expected output type
- `forward` - the method containing the inference logic to execute
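As an illustration of how an `inputs` dictionary of this shape can be used, the following sketch checks a set of call arguments against it. The helper `validate_arguments` and the `PYTHON_TYPES` mapping are assumptions made for this example; they are not part of `smolagents`.

```python
# Hypothetical mapping from schema type names to Python types.
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(inputs, arguments):
    """Return a list of problems found when matching arguments to the schema."""
    problems = []
    for name, spec in inputs.items():
        if name not in arguments:
            problems.append(f"missing argument: {name}")
        elif not isinstance(arguments[name], PYTHON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    for name in arguments:
        if name not in inputs:
            problems.append(f"unexpected argument: {name}")
    return problems


inputs = {
    "category": {
        "type": "string",
        "description": "The type of superhero party.",
    }
}
print(validate_arguments(inputs, {"category": "villain masquerade"}))
print(validate_arguments(inputs, {"theme": 3}))
```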


As an example, we will implement a tool that generates superhero-themed party ideas based on a given category, and use it in an agent.

In [None]:
from smolagents import Tool, CodeAgent, InferenceClientModel


class SuperheroPartyThemeTool(Tool):
    name = "superhero_party_theme_generator"
    description = """
    This tool suggests creative superhero-themed party ideas based on a category.
    It returns a unique party theme idea."""

    inputs = {
        "category": {
            "type": "string",
            "description": "The type of superhero party (e.g., 'classic heroes', 'villain masquerade', 'futuristic Gotham')."
        }
    }

    output_type = "string"

    def forward(self, category: str):
        # Keys are lowercase because the lookup below uses category.lower()
        themes = {
            "classic heroes": "Justice League Gala: Guests come dressed as their favorite DC heroes with themed cocktails like 'The Kryptonite Punch'.",
            "villain masquerade": "Gotham Rogues' Ball: A mysterious masquerade where guests dress as classic Batman villains.",
            "futuristic gotham": "Neo-Gotham Night: A cyberpunk-style party inspired by Batman Beyond, with neon decorations and futuristic gadgets."
        }

        return themes.get(category.lower(), "Themed party idea not found. Try 'classic heroes', 'villain masquerade', or 'futuristic Gotham'.")

In [None]:
# Instantiate the tool
party_theme_tool = SuperheroPartyThemeTool()

agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[party_theme_tool]
)

In [None]:
result = agent.run(
    "What would be a good superhero party idea for a 'villain masquerade' theme?"
)
print(result)

### Default Toolbox

`smolagents` comes with a set of pre-built tools that can be directly injected into our agent. The default toolbox includes
- `PythonInterpreterTool`
- `FinalAnswerTool`
- `UserInputTool`
- `DuckDuckGoSearchTool`
- `GoogleSearchTool`
- `VisitWebpageTool`

### Sharing and Importing Tools

#### Sharing a Tool to the Hub

We can share a custom tool with the community by pushing it to the Hub:

In [None]:
party_theme_tool.push_to_hub("{your_username}/party_theme_tool", token="<YOUR_HUGGINGFACEHUB_API_TOKEN>")

#### Importing a Tool from the Hub

We can import tools created by other users using the `load_tool()` function.

In [None]:
from smolagents import load_tool, CodeAgent, InferenceClientModel

image_generation_tool = load_tool(
    'm-ric/text-to-image',
    trust_remote_code=True
)

agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[image_generation_tool]
)

In [None]:
agent.run(
    "Generate an image of a luxurious superhero-themed party at Wayne Manor with made-up superheroes."
)

#### Importing a HuggingFace Space as a Tool

We can also import a HF Space as a tool using `Tool.from_space()`. The tool will connect to the Space's Gradio backend using the `gradio_client`.

In [None]:
from smolagents import Tool, CodeAgent, InferenceClientModel

image_generation_tool = Tool.from_space(
    'black-forest-labs/FLUX.1-schnell',
    name='image_generator',
    description="Generate an image from a prompt"
)

model = InferenceClientModel("Qwen/Qwen2.5-Coder-32B-Instruct")

agent = CodeAgent(
    model=model,
    tools=[image_generation_tool]
)

In [None]:
agent.run(
    "Improve this prompt, then generate an image of it.",
    additional_args={
        "user_prompt": "A grand superhero-themed party at Wayne Manor, with Alfred overseeing a luxurious gala."
    }
)

#### Importing a LangChain Tool

We can load LangChain tools using the `Tool.from_langchain()` method.

In [None]:
from langchain.agents import load_tools
from smolagents import CodeAgent, InferenceClientModel, Tool

search_tool = Tool.from_langchain(
    load_tools(['serpapi'])[0]  # load_tools returns a list; take the first tool
)

model = InferenceClientModel("Qwen/Qwen2.5-Coder-32B-Instruct")

agent = CodeAgent(
    model=model,
    tools=[search_tool]
)

In [None]:
agent.run(
    "Search for luxury entertainment ideas for a superhero-themed event, such as live performances and interactive experiences."
)

#### Importing a tool collection from any MCP server

`smolagents` also allows importing tools from the MCP servers available on [glama.ai](https://glama.ai/mcp/servers) and [smithery.ai](https://smithery.ai/).

We first need to install the `mcp` integration for `smolagents`.

In [None]:
!pip install -qU smolagents[mcp]

In [None]:
import os
from smolagents import ToolCollection, CodeAgent, InferenceClientModel
from mcp import StdioServerParameters

model = InferenceClientModel("Qwen/Qwen2.5-Coder-32B-Instruct")

# set up mcp server
server_parameters = StdioServerParameters(
    command='uvx',
    args=['--quiet', 'pubmedmcp@0.1.3'],
    env={"UV_PYTHON": "3.12", **os.environ}
)

In [None]:
with ToolCollection.from_mcp(server_parameters, trust_remote_code=True) as tool_collection:
    agent = CodeAgent(
        model=model,
        tools=[*tool_collection.tools],
        add_base_tools=True
    )

    agent.run("Please find a remedy for hangover.")

## Retrieval Agents

Retrieval Augmented Generation (RAG) systems combine the capabilities of data retrieval and generation models to provide context-aware responses. Agentic RAG extends traditional RAG systems by **combining autonomous agents with dynamic knowledge retrieval**.

While traditional RAG systems use an LLM to answer queries based on retrieved data, **agentic RAG enables intelligent control of both retrieval and generation processes**, improving efficiency and accuracy.

Traditional RAG systems face key limitations, such as relying on a single retrieval step and focusing on direct semantic similarity with the user query. **Agentic RAG addresses these issues** by allowing the agent to autonomously formulate search queries, critique retrieved results, and conduct multiple retrieval steps for a more tailored and comprehensive output.

### Basic Retrieval with DuckDuckGo

We will implement an agent to retrieve information and synthesize responses to answer queries. With Agentic RAG, the agent can
- Search for latest information
- Refine results to include more keywords
- Synthesize information into a complete answer

In [None]:
from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

# Initialize a search tool
search_tool = DuckDuckGoSearchTool()

# Initialize the model
model = InferenceClientModel()

agent = CodeAgent(
    model=model,
    tools=[search_tool]
)

In [None]:
response = agent.run(
    "Search for luxury superhero-themed party ideas, including decorations, entertainment, and catering."
)
print(response)

The agent will
- **Analyze the request** - identify the key elements of the query
- **Perform retrieval** - leverage DuckDuckGo to search for the most relevant and up-to-date information, ensuring it aligns with the user query
- **Synthesize information** - after gathering the results, process them into a cohesive, actionable response
- **Store for future reference** - store the retrieved information for easy access

### Custom Knowledge Base Tool

A vector database stores numerical representations (embeddings) of text or other data, created by ML models. It enables semantic search by identifying similar meanings in a high-dimensional space.

We will create a tool that retrieves relevant information from a custom knowledge base. For simplicity, we will use a BM25 retriever, which ranks documents by keyword overlap rather than by embeddings, to search the knowledge base and return the top results, and a `RecursiveCharacterTextSplitter` to split the documents into smaller chunks for more efficient search.

In [None]:
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever

from smolagents import Tool, CodeAgent, InferenceClientModel


class PartyPlanningRetrieverTool(Tool):
    name = "party_planning_retriever"
    description = "Uses semantic search to retrieve relevant party planning ideas for Alfred’s superhero-themed party at Wayne Manor."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be a query related to party planning or superhero themes.",
        }
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        self.retriever = BM25Retriever.from_documents(
            docs,
            k=5, # retrieve the top 5 documents
        )

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.retriever.invoke(query)

        return "\nRetrieved ideas:\n" + "".join(
            [
                f"\n\n===== Idea {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

In [None]:
# Simulate a knowledge base about party planning
party_ideas = [
    {"text": "A superhero-themed masquerade ball with luxury decor, including gold accents and velvet curtains.", "source": "Party Ideas 1"},
    {"text": "Hire a professional DJ who can play themed music for superheroes like Batman and Wonder Woman.", "source": "Entertainment Ideas"},
    {"text": "For catering, serve dishes named after superheroes, like 'The Hulk's Green Smoothie' and 'Iron Man's Power Steak.'", "source": "Catering Ideas"},
    {"text": "Decorate with iconic superhero logos and projections of Gotham and other superhero cities around the venue.", "source": "Decoration Ideas"},
    {"text": "Interactive experiences with VR where guests can engage in superhero simulations or compete in themed games.", "source": "Entertainment Ideas"}
]

source_docs = [
    Document(page_content=doc['text'], metadata={'source': doc['source']})
    for doc in party_ideas
]

# Split the documents into smaller chunks for more efficient search
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""]
)
docs_processed = text_splitter.split_documents(source_docs)

# Create the retriever tool
party_planning_retriever = PartyPlanningRetrieverTool(docs=docs_processed)
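To make the ranking behind a BM25 retriever more concrete, here is a minimal pure-Python sketch of one common BM25 variant. The tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the exact IDF formula are assumptions for illustration; the `BM25Retriever` used above may differ in these details.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "hire a professional dj who can play themed music",
    "serve dishes named after superheroes",
    "decorate with superhero logos and projections",
]
scores = bm25_scores("dj music", documents)
print(scores)
```

Only the first document shares words with the query, so it is the only one with a non-zero score. A query phrased with synonyms that appear in no document would score zero everywhere, which is the main limitation of keyword ranking compared with embedding search.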

In [None]:
agent = CodeAgent(
    model=InferenceClientModel(),
    tools=[party_planning_retriever]
)

In [None]:
response = agent.run(
    "Find ideas for a luxury superhero-themed party, including entertainment, catering, and decoration options."
)
print(response)

When building agentic RAG systems, the agent can employ sophisticated strategies like
- **Query reformulation** - Instead of using the raw user query, the agent can craft optimized search terms that better match the target documents.
- **Multi-step retrieval** - The agent can perform multiple searches, using initial results to inform subsequent queries.
- **Source integration** - Information can be combined from multiple sources like web search and local documentation.
- **Result validation** - Retrieved content can be analyzed for relevance and accuracy before being included in responses.
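The first two strategies can be sketched with a toy keyword search. Everything here is hypothetical: `CORPUS`, `keyword_search`, and `multi_step_retrieve` are illustrative names, and the reformulated queries are hard-coded where a real agent would generate them with the LLM.

```python
# Toy knowledge base: document id -> text.
CORPUS = {
    "decor": "gold accents and velvet curtains for a masquerade ball",
    "music": "hire a dj to play superhero themed music",
    "food": "serve dishes named after superheroes",
}


def keyword_search(query, corpus):
    """Return the ids of documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items() if words & set(text.split())]


def multi_step_retrieve(queries, corpus):
    """Run several searches and merge the results without duplicates."""
    seen, collected = set(), []
    for query in queries:
        for doc_id in keyword_search(query, corpus):
            if doc_id not in seen:
                seen.add(doc_id)
                collected.append(doc_id)
    return collected


# The raw user query matches nothing in this corpus.
print(keyword_search("luxury party ideas", CORPUS))

# Reformulated, more specific queries (hard-coded stand-ins for LLM output).
reformulated = ["velvet curtains", "dj music", "dishes"]
print(multi_step_retrieve(reformulated, CORPUS))
```

The raw query returns nothing, while the three focused queries together cover the whole corpus, which is the motivation for letting the agent rewrite queries and search more than once.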



Effective agentic RAG systems require careful consideration of several key aspects. The agent should select between available tools based on the query type and context. Memory systems help maintain conversation history and avoid repetitive retrievals. Having fallback strategies ensures the system can still provide value even when primary retrieval methods fail. Additionally, implementing validation steps helps ensure the accuracy and relevance of retrieved information.

## Multi-Agent Systems

*Multi-agent systems* enable **specialized agents to collaborate on complex tasks**, improving modularity, scalability, and robustness. Instead of relying on a single agent, tasks are distributed among agents with distinct capabilities.

In `smolagents`, different agents can be combined to generate Python code, call external tools, perform web searches, and more. A typical orchestration might include
- a **Manager Agent** for task delegation
- a **Code Interpreter Agent** for code execution
- a **Web Search Agent** for information retrieval


A multi-agent system consists of multiple specialized agents working together under the coordination of an **Orchestrator Agent**. This approach enables complex workflows by distributing tasks among agents with distinct roles.

### Solving a Complex Task with a Multi-Agent Hierarchy

In [None]:
!pip install -qU smolagents[litellm] plotly geopandas shapely kaleido

In [None]:
import math
from typing import Optional, Tuple
from smolagents import tool


@tool
def calculate_cargo_travel_time(
        origin_coords: Tuple[float, float],
        destination_coords: Tuple[float, float],
        cruising_speed_kmh: Optional[float] = 750., # average speed for cargo planes
) -> float:
    """Calculate the travel time for a cargo plane between two points on Earth using great-circle distance.

    Args:
        origin_coords: Tuple of (latitude, longitude) for the starting point
        destination_coords: Tuple of (latitude, longitude) for the destination
        cruising_speed_kmh: Optional cruising speed in km/h (defaults to 750 km/h for typical cargo planes)

    Returns:
        float: The estimated travel time in hours

    Example:
        >>> # Chicago (41.8781° N, 87.6298° W) to Sydney (33.8688° S, 151.2093° E)
        >>> result = calculate_cargo_travel_time((41.8781, -87.6298), (-33.8688, 151.2093))
    """

    def to_radians(degrees: float) -> float:
        return degrees * (math.pi / 180)

    # Extract coordinates
    lat1, lon1 = map(to_radians, origin_coords)
    lat2, lon2 = map(to_radians, destination_coords)

    # Earth's radius in kilometers
    EARTH_RADIUS_KM = 6371.

    # Calculate great-circle distance using the haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    c = 2 * math.asin(math.sqrt(a))
    distance = EARTH_RADIUS_KM * c

    # Add 10% to account for non-direct routes and air traffic controls
    actual_distance = distance * 1.1

    # Calculate flight time
    # Add 1 hour for takeoff and landing procedures
    flight_time = (actual_distance / cruising_speed_kmh) + 1.0

    # Format the results
    return round(flight_time, 2)


# test
print(calculate_cargo_travel_time((41.8781, -87.6298), (-33.8688, 151.2093)))

For the model provider, we will use Together AI. The `GoogleSearchTool` uses the Serper API to search the web, so it requires setting the environment variable `SERPER_API_KEY` and passing `provider="serper"`.

If we do not have a `SERPER_API_KEY`, we can use `DuckDuckGoSearchTool` instead, but it is rate-limited.

In [None]:
import os
from PIL import Image
from smolagents import CodeAgent, GoogleSearchTool, VisitWebpageTool, InferenceClientModel

model = InferenceClientModel(
    model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
    provider='together'
)

In [None]:
agent = CodeAgent(
    model=model,
    tools=[
        GoogleSearchTool('serper'),
        VisitWebpageTool(),
        calculate_cargo_travel_time
    ],
    additional_authorized_imports=['pandas'],
    max_steps=20
)

In [None]:
task = """Find all Batman filming locations in the world, calculate the time to transfer via cargo plane to here (we're in Gotham, 40.7128° N, 74.0060° W), and return them to me as a pandas dataframe.
Also give me some supercar factories with the same cargo plane transfer time."""

result = agent.run(task)

In [None]:
print(result)

We could improve this by adding a dedicated planning step and more detailed prompting.

A planning step lets the agent think ahead and plan its next steps, which can be useful for more complex tasks.
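
To make the effect of `planning_interval` concrete, here is a toy schedule. It assumes that planning runs before the first step and then again every `planning_interval` steps. The helper `planning_steps` is hypothetical and only models the cadence; it is not the library's implementation, whose exact trigger may differ between versions.

```python
def planning_steps(max_steps: int, planning_interval: int) -> list[int]:
    """Return the 1-based step numbers at which a planning step would run."""
    if planning_interval < 1:
        raise ValueError("planning_interval must be a positive integer")
    return [step for step in range(1, max_steps + 1) if (step - 1) % planning_interval == 0]


# With max_steps=20 and planning_interval=4, the agent would re-plan five times
print(planning_steps(max_steps=20, planning_interval=4))  # [1, 5, 9, 13, 17]
```

A smaller interval means more frequent re-planning, at the cost of extra model calls.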

In [None]:
agent.planning_interval = 4

detailed_report = agent.run(
f"""
You're an expert analyst. You make comprehensive reports after visiting many websites.
Don't hesitate to search for many queries at once in a for loop.
For each data point that you find, visit the source url to confirm numbers.

{task}
"""
)

In [None]:
print(detailed_report)

Multi-agent structures make it possible to separate memories between different sub-tasks:
- Each agent is more focused on its core task, thus more performant
- Separating memories reduces the count of input tokens at each step, thus reducing latency and cost.
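
The token argument can be illustrated with a back-of-the-envelope model. It assumes that every step re-reads the whole memory accumulated so far, and that the manager only receives a short summary from the web agent. All numbers and the helper `total_input_tokens` are hypothetical and chosen only to show the trend.

```python
def total_input_tokens(num_steps: int, prompt_tokens: int, tokens_per_step: int) -> int:
    """Sum the input tokens over a run where every step re-reads the whole memory."""
    return sum(prompt_tokens + step * tokens_per_step for step in range(num_steps))


# Hypothetical sizes, for illustration only
PROMPT, PER_STEP, SUMMARY = 1_000, 500, 300

# One agent does everything: 10 search steps + 5 reporting steps in a single memory
single = total_input_tokens(15, PROMPT, PER_STEP)

# Team: the web agent keeps its 10 steps to itself; the manager only sees a summary
web_agent_tokens = total_input_tokens(10, PROMPT, PER_STEP)
manager_tokens = total_input_tokens(5, PROMPT + SUMMARY, PER_STEP)

print(single, web_agent_tokens + manager_tokens)  # 67500 44000
```

Because the cost of a single memory grows roughly quadratically with the number of steps, splitting a long run into two shorter memories pays off even after adding the summary overhead.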

We will create a team with a dedicated web search agent, managed by another agent.

In [None]:
# Web search agent

model = InferenceClientModel(
    model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
    provider='together',
    max_tokens=8096
)

web_agent = CodeAgent(
    model=model,
    tools=[
        GoogleSearchTool(provider='serper'),
        VisitWebpageTool(),
        calculate_cargo_travel_time
    ],
    name='web_agent',
    description='Browses the web to find information',
    verbosity_level=0,
    max_steps=10
)

The manager agent should have plotting capabilities to write its final report. Since the manager agent will do the heavy lifting, we will give it a stronger model [`DeepSeek-R1`](https://huggingface.co/deepseek-ai/DeepSeek-R1), and add a `planning_interval`.

In [None]:
from smolagents.utils import encode_image_base64, make_image_url
from smolagents import OpenAIServerModel


def check_reasoning_and_plot(final_answer, agent_memory):
    multimodal_model = OpenAIServerModel(
        'gpt-4o',
        max_tokens=8096
    )
    filepath = "saved_map.png"
    assert os.path.exists(filepath), "Make sure to save the plot under saved_map.png!"

    image = Image.open(filepath)
    prompt = (
        f"Here is a user-given task and the agent steps: {agent_memory.get_succinct_steps()}. Now here is the plot that was made. "
        "Please check that the reasoning process and plot are correct: do they correctly answer the given task? "
        "First list reasons why yes/no, then write your final decision: PASS in caps lock if it is satisfactory, FAIL if it is not. "
        "Don't be harsh: if the plot mostly solves the task, it should pass. "
        "To pass, a plot should be made using px.scatter_map and not any other method (scatter_map looks nicer)."
    )

    messages = [
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': prompt
                },
                {
                    'type': 'image_url',
                    'image_url': {'url': make_image_url(encode_image_base64(image))}
                }
            ]
        }
    ]
    output = multimodal_model(messages).content
    print("Feedback: ", output)
    if 'FAIL' in output:
        raise Exception(output)
    return True
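
The contract of a final-answer check can be sketched without any model call. The toy runner below assumes that a check receives the final answer and the agent memory, and that a raised exception or a falsy return value rejects the answer. `run_final_answer_checks` and `has_enough_points` are hypothetical names and are not part of `smolagents`.

```python
from typing import Any, Callable

Check = Callable[[Any, Any], bool]


def run_final_answer_checks(final_answer: Any, agent_memory: Any, checks: list[Check]) -> bool:
    """Return True only if every check accepts the answer."""
    for check in checks:
        try:
            if not check(final_answer, agent_memory):
                return False
        except Exception as error:
            print(f"Check {check.__name__} failed: {error}")
            return False
    return True


def has_enough_points(final_answer: Any, agent_memory: Any) -> bool:
    # Hypothetical rule: require at least 6 data points in the answer
    if len(final_answer) < 6:
        raise ValueError(f"Only {len(final_answer)} points, expected at least 6")
    return True


print(run_final_answer_checks(["Chicago", "Glasgow"], None, [has_enough_points]))  # False
```

Cheap deterministic checks like this one can run before an expensive model-based review of the plot.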

In [None]:
# Manager Agent
manager_model = InferenceClientModel(
    'deepseek-ai/DeepSeek-R1',
    provider='together',
    max_tokens=8096
)

manager_agent = CodeAgent(
    model=manager_model,
    tools=[calculate_cargo_travel_time],
    managed_agents=[web_agent],
    additional_authorized_imports=[
        'geopandas',
        'plotly',
        'shapely',
        'json',
        'pandas',
        'numpy'
    ],
    planning_interval=5,
    verbosity_level=2,
    final_answer_checks=[check_reasoning_and_plot],
    max_steps=15
)

We can inspect and visualize what this team looks like:

In [None]:
manager_agent.visualize()

In [None]:
manager_agent.run("""
Find all Batman filming locations in the world, calculate the time to transfer via cargo plane to here (we're in Gotham, 40.7128° N, 74.0060° W).
Also give me some supercar factories with the same cargo plane transfer time. You need at least 6 points in total.
Represent this as a spatial map of the world, with the locations represented as scatter points with a color that depends on the travel time, and save it to saved_map.png!

Here's an example of how to plot and return a map:
import plotly.express as px
df = px.data.carshare()
fig = px.scatter_map(df, lat="centroid_lat", lon="centroid_lon", text="name", color="peak_hour", size=100,
     color_continuous_scale=px.colors.sequential.Magma, size_max=15, zoom=1)
fig.show()
fig.write_image("saved_map.png")
final_answer(fig)

Never try to process strings using code: when you have a string to read, just print it and you'll see it.
""")

In [None]:
manager_agent.python_executor.state["fig"]

## Vision and Browser Agents

Empowering agents with visual capabilities is crucial for solving tasks that go beyond text processing.

`smolagents` provides built-in support for Vision-Language Models (VLMs), enabling agents to process and interpret images effectively.

In this approach, images are passed to the agent at the start and stored as `task_images` alongside the task prompt. The agent then processes these images throughout its execution.

In [None]:
from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg", # Joker image
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg" # Joker image
]

images = []
for url in image_urls:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    image = Image.open(BytesIO(response.content)).convert('RGB')
    images.append(image)

Now that we have the images, the agent will process the user query with images.

In [None]:
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel('gpt-4o')

agent = CodeAgent(
    model=model,
    tools=[],
    max_steps=20,
    verbosity_level=2
)

In [None]:
response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

In [None]:
print(response)

#### Providing Images with Dynamic Retrieval

In this approach, images are dynamically added to the agent's memory during execution. Agents in `smolagents` are based on the `MultiStepAgent` class, which is an abstraction of the ReAct framework. This class operates in a structured cycle where various variables and knowledge are logged at different stages:
- **SystemPromptStep** - stores the system prompt
- **TaskStep** - logs the user query and any provided input
- **ActionStep** - captures logs from the agent's actions and results

This structured approach allows agents to incorporate visual information dynamically and respond adaptively to evolving tasks. When browsing a webpage, the agent can take screenshots and save them as `observations_images` in the `ActionStep`.
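
Because every stored screenshot is sent back to the model on later steps, a browsing agent usually keeps only the most recent ones. Below is a minimal sketch of that pruning idea, using a hypothetical `ToyActionStep` stand-in rather than the real `ActionStep` class; the `keep_last` parameter is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyActionStep:
    """Stand-in for an action step; not the smolagents class."""
    step_number: int
    observations_images: Optional[list] = None


def prune_old_screenshots(steps: list, current_step: int, keep_last: int = 2) -> None:
    """Drop screenshots from every step older than the last `keep_last` steps."""
    for step in steps:
        if step.step_number <= current_step - keep_last:
            step.observations_images = None


steps = [ToyActionStep(n, [f"screenshot_{n}.png"]) for n in range(1, 6)]
prune_old_screenshots(steps, current_step=5)
print([step.step_number for step in steps if step.observations_images])  # [4, 5]
```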

In [None]:
!pip install -qU smolagents[all] helium selenium python-dotenv

We will need a set of agent tools specifically designed for browsing, such as `search_item_ctrl_f`, `go_back`, and `close_popups` to act like a person navigating the web.

In [None]:
import helium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from smolagents import tool  # needed for the @tool decorator used below

In [None]:
def initialize_driver():
    """Initialize the Selenium WebDriver."""
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--force-device-scale-factor=1")
    chrome_options.add_argument("--window-size=1000,1350")
    chrome_options.add_argument("--disable-pdf-viewer")
    chrome_options.add_argument("--window-position=0,0")
    return helium.start_chrome(headless=False, options=chrome_options)

driver = initialize_driver()

In [None]:
@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.

    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()


@tool
def close_popups() -> None:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
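
The 1-based "nth occurrence" lookup used by `search_item_ctrl_f` can be tried without a browser. This sketch works on a plain list of strings instead of page elements; `find_nth_match` is a hypothetical helper, and the extra guard against `nth_result < 1` is an addition that the tool above does not have.

```python
def find_nth_match(texts: list[str], query: str, nth_result: int = 1) -> tuple[int, int]:
    """Return (index of the nth matching text, total number of matches); nth_result is 1-based."""
    matches = [index for index, text in enumerate(texts) if query in text]
    if nth_result < 1 or nth_result > len(matches):
        raise LookupError(f"Match n°{nth_result} not found (only {len(matches)} matches found)")
    return matches[nth_result - 1], len(matches)


page = ["Batman Begins", "Filming locations", "The Dark Knight", "Batman v Superman"]
print(find_nth_match(page, "Batman", nth_result=2))  # (3, 2)
```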

We also need a function to save screenshots, as this will be an essential part of what our VLM agent uses to complete the task.

In [None]:
from io import BytesIO
from time import sleep

from PIL import Image
from smolagents import CodeAgent, DuckDuckGoSearchTool, tool
from smolagents.agents import ActionStep

def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    # Let JavaScript animations happen before taking the screenshot
    sleep(1.0)
    driver = helium.get_driver()
    current_step = step_log.step_number
    if driver is None:
        return

    # Remove screenshots older than the last two steps to keep the memory lean
    for previous_step in agent.memory.steps:
        if isinstance(previous_step, ActionStep) and previous_step.step_number <= current_step - 2:
            previous_step.observations_images = None
    png_bytes = driver.get_screenshot_as_png()
    image = Image.open(BytesIO(png_bytes))
    print(f"Captured a browser screenshot: {image.size} pixels")
    step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists, important!

    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_log.observations is None else step_log.observations + "\n" + url_info

This function will be passed to the agent as a step callback (the `step_callbacks` argument), so it is triggered at the end of each step during the agent's execution. This allows the agent to dynamically capture and store screenshots throughout its process.
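
The timing of such callbacks can be pictured with a toy loop. It is a simplified sketch: `ToyStep`, `run_toy_agent`, and `note_url` are hypothetical names, the URL is a placeholder, and the callback here receives only the step, not the agent as well.

```python
from typing import Callable


class ToyStep:
    """Stand-in for one agent step; not a smolagents class."""

    def __init__(self, step_number: int) -> None:
        self.step_number = step_number
        self.observations = None


def run_toy_agent(max_steps: int, step_callbacks: list[Callable]) -> list[ToyStep]:
    """Run a fake loop that calls every callback once a step has finished."""
    memory = []
    for step_number in range(1, max_steps + 1):
        step = ToyStep(step_number)
        # A real agent would think, write code, and execute it here
        memory.append(step)
        for callback in step_callbacks:
            callback(step)
    return memory


def note_url(step: ToyStep) -> None:
    # Hypothetical callback: append a line to the step's observations
    url_info = f"Current url: https://example.com/page{step.step_number}"
    step.observations = url_info if step.observations is None else step.observations + "\n" + url_info


memory = run_toy_agent(max_steps=3, step_callbacks=[note_url])
print([step.observations for step in memory])
```

Since the callback runs only after a step completes, anything it records (such as a screenshot) reflects the final state of that step, not its intermediate states.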

Next, we can generate our vision agent for browsing the web, providing it with the tools we created, along with the `DuckDuckGoSearchTool` to explore the web.

In [None]:
model = OpenAIServerModel('gpt-4o')

agent = CodeAgent(
    model=model,
    tools=[
        DuckDuckGoSearchTool(),
        go_back,
        close_popups,
        search_item_ctrl_f
    ],
    additional_authorized_imports=['helium'],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2
)

In [None]:
helium_instructions = """
Use your web_search tool when you want to get Google search results.
Then you can use helium to access websites. Don't use helium for Google search, only for navigating websites!
Don't bother about the helium driver, it's already managed.
We've already run "from helium import *"
Then you can go to pages!
Code:
```py
go_to('github.com/trending')
```<end_code>
You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>
If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>
If you try to interact with an element and it's not found, you'll get a LookupError.
In general stop your action after each button click to see what happens on your screenshot.
Never try to log in to a page.
To scroll up or down, use scroll_down or scroll_up, passing the number of pixels to scroll as an argument.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>
When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>
You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>
Proceed in several steps rather than trying to solve the task in one shot.
And at the end, only when you have your answer, return your final answer.
Code:
```py
final_answer("YOUR_ANSWER_HERE")
```<end_code>
If pages seem stuck on loading, you might have to wait, for instance `import time` and run `time.sleep(5.0)`. But don't overuse this!
To list elements on page, DO NOT try code-based element searches like 'contributors = find_all(S("ol > li"))': just look at the latest screenshot you have and read it visually, or use your tool search_item_ctrl_f.
Of course, you can act on buttons like a user would do when navigating.
After each code blob you write, you will be automatically provided with an updated screenshot of the browser and the current browser url.
But beware that the screenshot will only be taken at the end of the whole action, it won't see intermediate states.
Don't kill the browser.
When you have modals or cookie banners on screen, you should get rid of them before you can click anything else.
"""

In [None]:
result = agent.run("""
I am Alfred, the butler of Wayne Manor, responsible for verifying the identity of guests at the party. A superhero has arrived at the entrance claiming to be Wonder Woman, but I need to confirm if she is who she says she is.

Please search for images of Wonder Woman and generate a detailed visual description based on those images. Additionally, navigate to Wikipedia to gather key details about her appearance. With this information, I can determine whether to grant her access to the event.
""" + helium_instructions)

In [None]:
print(result)

## Summary

All scripts used in this session are available [here](https://huggingface.co/agents-course/notebooks/tree/main/unit2/smolagents).