# Building Effective Agents

This is based on [this article](https://www.anthropic.com/engineering/building-effective-agents) by [Anthropic](https://www.anthropic.com/). We will explore common patterns for building effective LLM agentic systems using pure Python around LLM APIs. In particular, we use the [Python SDK](https://github.com/openai/openai-python/tree/main) for the [OpenAI API](https://platform.openai.com/docs/api-reference/introduction). 

## OpenAI API client

First, we need to load the API key into an environment variable. The client expects the variable `OPENAI_API_KEY`, which we load from the `.env` file. This is easy to implement:

In [1]:
import inspect
from notebooks.utils import load_dotenv
print(inspect.getsource(load_dotenv))

def load_dotenv(verbose=False):
    with open(".env") as f:
        for line in f.readlines():
            k, v = line.split("=")
            os.environ[k] = v.strip().strip('"')
            if verbose:
                print(f"Loaded env variable: {k}")



In [2]:
load_dotenv(verbose=True)

Loaded env variable: OPENAI_API_KEY
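
The loader above assumes that every line in `.env` is a plain `KEY=VALUE` pair. As a minimal sketch of a more forgiving variant, the following skips blank lines and comments and splits only on the first `=`, so values that themselves contain `=` survive. The names `parse_dotenv` and `load_dotenv_safe` are hypothetical and are not part of `notebooks.utils`.

```python
import os


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments (hypothetical helper)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first "=" only
        env[key.strip()] = value.strip().strip('"')
    return env


def load_dotenv_safe(path: str = ".env", verbose: bool = False) -> None:
    """Load variables from a .env file into os.environ (hypothetical helper)."""
    with open(path) as f:
        for key, value in parse_dotenv(f.read()).items():
            os.environ[key] = value
            if verbose:
                print(f"Loaded env variable: {key}")


# Parsing can be checked on a string without touching the real environment.
sample = 'OPENAI_API_KEY="sk-abc=123"\n\n# a comment\nDEBUG=1\n'
parsed = parse_dotenv(sample)
print(parsed)
```

Keeping the parsing separate from the `os.environ` update makes the parsing step easy to check in isolation.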


Then the API key is automatically read by the **client**:

In [3]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system", 
            "content": "You're a helpful assistant."},
        {
            "role": "user",
            "content": "Tell me about the history of GPT in a single paragraph.",
        },
    ],
)

response_text = completion.choices[0].message.content

:::{.callout-note}
We can request **multiple completions** of the same prompt by setting the `n` parameter. This allows choosing between the responses.
For example, we can set `temperature=0.9` to get more varied outputs, so that choosing becomes nontrivial.
Only one completion is returned by default, hence `[0]`.
:::
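
As a hedged sketch of the note above, the helper below asks for several completions in a single call through the `n` and `temperature` parameters of the chat completions API and returns their texts. The helper name `sample_completions` and the stand-in client are assumptions made so that the snippet runs without an API key; with a real `OpenAI()` client the same function should apply unchanged.

```python
from types import SimpleNamespace


def sample_completions(client, prompt: str, n: int = 3, temperature: float = 0.9) -> list[str]:
    """Request n completions of the same prompt and return their texts."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        n=n,                      # number of completions to generate
        temperature=temperature,  # higher values give more varied outputs
    )
    return [choice.message.content for choice in completion.choices]


class FakeCompletions:
    """Stand-in for client.chat.completions (assumption, for offline use only)."""

    def create(self, **kwargs):
        n = kwargs.get("n", 1)
        choices = [
            SimpleNamespace(index=i, message=SimpleNamespace(content=f"draft {i}"))
            for i in range(n)
        ]
        return SimpleNamespace(choices=choices)


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
drafts = sample_completions(fake_client, "Tell me a joke.", n=3)
print(drafts)
```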

In [4]:
from pprint import pprint
pprint(response_text)

('Generative Pre-trained Transformer (GPT) is a series of language models '
 'developed by OpenAI, beginning with the release of GPT-1 in 2018. GPT-1 '
 'introduced the concept of pre-training a transformer model on a large corpus '
 'of text before fine-tuning it on specific tasks. This approach demonstrated '
 'significant improvements in natural language processing tasks. GPT-2, '
 'released in 2019, scaled up the model significantly, boasting 1.5 billion '
 'parameters and showcasing impressive text generation capabilities. Due to '
 'concerns about misuse, OpenAI initially withheld its full release. GPT-3, '
 'released in 2020, further scaled the model to 175 billion parameters, '
 'becoming notable for its ability to generate human-like text across diverse '
 'tasks with minimal prompt input. The series continued to evolve with '
 'advancements like GPT-3.5 and ChatGPT, which improved interaction '
 'capabilities. In 2023, GPT-4 was released, offering enhanced performance, '
 'be

Entire model response:

In [5]:
from pprint import pprint
pprint(completion.model_dump())

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'message': {'annotations': [],
                          'audio': None,
                          'content': 'Generative Pre-trained Transformer (GPT) '
                                     'is a series of language models developed '
                                     'by OpenAI, beginning with the release of '
                                     'GPT-1 in 2018. GPT-1 introduced the '
                                     'concept of pre-training a transformer '
                                     'model on a large corpus of text before '
                                     'fine-tuning it on specific tasks. This '
                                     'approach demonstrated significant '
                                     'improvements in natural language '
                                     'processing tasks. GPT-2, released in '
                                     '

:::{.callout-note}
We are using the **chat completions** API, which runs an autoregressive process under the hood. Here the prompt to be completed is:

```python
messages=[
    {"role": "system", "content": "You are a poetic but terse assistant."},   # prompt
    {"role": "user", "content": "What is the color of the sky?"}              # prompt
]
```

And the completion is given by the API's output:

```python
{
  "role": "assistant",
  "content": "The sky's color shifts from azure to amber, a canvas for sun's daily journey."
}
```

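**Remark.** The generated response is sampled token by token from the model's predicted distribution, so it is a statistically likely continuation of the prompt text sequence, though not necessarily the single most likely one.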
:::
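
To make the autoregressive idea concrete, here is a toy sketch in which each next token is sampled from a distribution conditioned on what came before. The scores in `TOY_LOGITS` are invented for illustration, and the toy conditions only on the last token, whereas a real LLM conditions on the entire preceding sequence.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token scores, keyed by the previous token.
TOY_LOGITS = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"sky": 2.0, "sea": 1.0},
    "a": {"sky": 1.0, "sea": 1.0},
    "sky": {"<end>": 1.0},
    "sea": {"<end>": 1.0},
}


def generate(temperature=1.0, seed=0, max_tokens=10):
    """Sample one token at a time, feeding each choice back in as context."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_tokens:
        options = TOY_LOGITS[tokens[-1]]
        probs = softmax(list(options.values()), temperature)
        next_token = rng.choices(list(options), weights=probs, k=1)[0]
        tokens.append(next_token)
    return [t for t in tokens if t not in ("<start>", "<end>")]


# A very low temperature makes sampling effectively greedy.
print(generate(temperature=0.01))
```

At higher temperatures the less likely branches are sampled more often, which is why repeated calls with the same prompt can give different responses.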

## Structured outputs

[Structured outputs](https://platform.openai.com/docs/guides/structured-outputs?api-mode=responses) is a feature that ensures a model generates responses that adhere to a supplied **schema** (e.g. a Pydantic model). The output can then be parsed using the same Pydantic model. Structured outputs make prompting significantly simpler: there is no need for strongly worded prompts to achieve consistent formatting, no need to explicitly retry incorrectly formatted responses, and no invalid hallucinated values for fields constrained by **enums**.

In [6]:
# https://platform.openai.com/docs/guides/structured-outputs?api-mode=responses
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {
            "role": "user",
            "content": "Alice and Bob are going to a science fair on Friday.",
        },
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
event

CalendarEvent(name='Science Fair', date='Friday', participants=['Alice', 'Bob'])
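
To illustrate the remark about enums, here is a small sketch that runs offline, assuming Pydantic v2. Pydantic can derive a JSON schema from a model, and the same model rejects values outside the enum when parsing. The model `WeekdayEvent` and its `day` field are hypothetical variations on `CalendarEvent`, not part of the original example.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class WeekdayEvent(BaseModel):
    """Hypothetical variant of CalendarEvent with an enum-constrained field."""

    name: str
    day: Literal["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    participants: list[str]


# JSON schema derived from the model; the allowed values appear as an enum.
schema = WeekdayEvent.model_json_schema()
print(schema["properties"]["day"])

# A conforming JSON string parses into a typed object.
good = WeekdayEvent.model_validate_json(
    '{"name": "Science Fair", "day": "Friday", "participants": ["Alice", "Bob"]}'
)

# A value outside the enum is rejected at parse time.
try:
    WeekdayEvent.model_validate_json(
        '{"name": "Science Fair", "day": "Someday", "participants": []}'
    )
    rejected = False
except ValidationError:
    rejected = True

print(good.day, rejected)
```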

## Building block: The augmented LLM

This is the foundational building block of agentic systems. The **augmented LLM** is a language model enhanced with **retrieval**, **tools**, and **memory**. Current models, thanks to their reasoning and understanding capabilities, can effectively generate their own search queries, select appropriate tools, and determine what information to retain to solve problems or perform tasks.

![**Augmented LLM.** Attaching retrieval, tools, and memory capabilities to the LLM.](../img/augmented-llm.png){#fig-augmentedllm}

### Memory

The following is a naive example that demonstrates an agent deciding whether or not to **write** to a file that represents its **long-term memory**.

In [7]:
import json
from typing import List
from openai import OpenAI
from pydantic import BaseModel


class MemoryItem(BaseModel):
    tag: str
    info: str
    reason: str

class RetainResponse(BaseModel):
    items: List[MemoryItem]


class MemoryFile:
    def __init__(self, path="agent_memory.json"):
        """Load memory from JSON file in local path."""
        self.path = path
        self.data = None
        self.load()

    def load(self):
        try:
            with open(self.path) as f:  # close the file handle after reading
                self.data = json.load(f)
        except FileNotFoundError:
            self.reset()

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.data, f, indent=2)

    def reset(self):
        self.data = []
        self.save()
    
    def add(self, item: dict[str, str]):
        self.data.append(item)

    def __repr__(self):
        return str(self.data)


client = OpenAI()
mem = MemoryFile()

The agent reads the *entire* memory and decides what information to remember from the input.

In [8]:
def agent(user_input):
    system_prompt = (
        f"You have access to a memory store: {mem} for single user. "
        "Decide what new information should be remembered. Atomic facts are good. "
        "NOTE: The `tag` should not be overly specific. It should be widely applicable. Follow <snake_case>. "
        "NOTE: Reuse existing tags when applicable. Minimize number of unique tags. "
        "NOTE: The `info` should compress information from input. Try to be concise. "
        "NOTE: Zero or multiple items can be retained from a single input. "
        "NOTE: Each item is added to a global memory list. There should be minimal dependence between memory items."
    )
    
    completion = client.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        response_format=RetainResponse,
    )

    return completion.choices[0].message.parsed


def action(user_input: str) -> RetainResponse:
    """Store key-value pair obtained from input text to memory."""
    response = agent(user_input)

    for item in response.items:
        mem.add({item.tag: item.info})
    mem.save()  # persist once after all items are added

    return response

:::{.callout-note}
The agent only expresses the intent to write to a file via the structured output. It is up to the main program to perform the actual writing to disk. This allows further [guardrails](https://cookbook.openai.com/examples/how_to_use_guardrails) to be implemented.
:::
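
As one possible illustration of such a guardrail, the sketch below checks each proposed item before it is written. The specific rules (snake_case tags, a length limit, a pattern that looks like an API key) and the names `passes_guardrails` and `guarded_write` are assumptions for illustration, not taken from the linked cookbook.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
MAX_INFO_CHARS = 200  # hypothetical limit on the size of a memory item


def passes_guardrails(tag: str, info: str) -> bool:
    """Return True if a proposed memory item is acceptable to store."""
    if not SNAKE_CASE.fullmatch(tag):
        return False  # tag does not follow snake_case
    if not info.strip() or len(info) > MAX_INFO_CHARS:
        return False  # empty or overly long info
    if re.search(r"sk-[A-Za-z0-9]{8,}", info):
        return False  # looks like a secret key; do not persist it
    return True


def guarded_write(store: list, items: list[dict]) -> list[dict]:
    """Append accepted items to the store and return the rejected ones."""
    rejected = []
    for item in items:
        if passes_guardrails(item["tag"], item["info"]):
            store.append({item["tag"]: item["info"]})
        else:
            rejected.append(item)
    return rejected


store = []
rejected = guarded_write(store, [
    {"tag": "commute_habit", "info": "User listens to podcasts."},
    {"tag": "Bad Tag", "info": "x"},
    {"tag": "secrets", "info": "key is sk-abcdefgh12345678"},
])
print(store, len(rejected))
```

Because the model only proposes items, checks like these can sit in the main program between the structured output and the write to disk.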

Examples:

In [9]:
import pandas as pd

logs = [
    "Woke up, showered, and left for work at the usual time. Had toast for breakfast before heading out.",
    "Traffic was smooth, arrived at work earlier than usual.",
    "Took the train, had to stand the whole ride since it was packed.",
    "Stopped by the bakery on the way and picked up bread for the team.",
    "Opened my email first thing at the office, mostly routine messages.",
    "Woah! Some guy just came out of nowhere and darted into traffic. That was pretty shocking. Crazy.",
    "Listened to a podcast while walking to the subway.",
    "Grabbed a pen from my desk drawer because mine ran out of ink.",
]

memory_items = sum([[d.model_dump() for d in action(input_text).items] for input_text in logs], [])
df_resp = pd.DataFrame(memory_items)
mem.reset()

The agent decides whether to reuse a tag or create a new one based on its memory:

In [10]:
#| code-fold: true
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_colwidth', None)

print("tags:", f"({len(df_resp.tag.unique())}/{len(df_resp)})")
pprint(list(df_resp.tag.unique()))
df_resp

tags: (7/9)
['morning_routine',
 'breakfast_choice',
 'commute_observation',
 'email_morning_routine',
 'unexpected_incident',
 'commute_habit',
 'stationery_action']


|   | tag | info | reason |
|---|-----|------|--------|
| 0 | morning_routine | User usually showers and leaves for work at the same time. | This information highlights a consistent morning routine. |
| 1 | breakfast_choice | User had toast for breakfast. | This offers insight into typical breakfast choices. |
| 2 | commute_observation | User arrived at work earlier than usual due to smooth traffic. | This information provides insight into the User's commute experience and may relate to the User's routine or schedule. |
| 3 | commute_observation | User took the train and had to stand due to packed ride. | Details about the user's commute experience, including the mode of transport and crowd conditions, are useful for understanding daily travel habits and challenges. |
| 4 | commute_observation | User stopped by bakery for bread for the team. | This builds on the user's commuting habits and provides insight into a possible routine or occasional activity. |
| 5 | email_morning_routine | User opens email first thing at office mainly for routine messages. | It captures a consistent behavior regarding email checking at the office. |
| 6 | unexpected_incident | User witnessed someone darting into traffic, found it shocking. | It captures a notable event that might influence the user's perception or behavior in similar situations. |
| 7 | commute_habit | User listens to a podcast while walking to the subway. | This provides insights into how the user spends their commute time and audio preferences. |
| 8 | stationery_action | User grabbed a pen from desk drawer as the current one ran out of ink. | This is a routine action related to desk management and resource use. |
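
Each entry written by `action` is a single-key dict of the form `{tag: info}`. The following sketch shows how such a list could be grouped by tag, for example to inspect how often a tag is reused. The helper `group_by_tag` is hypothetical, and the sample entries are copied from the output above.

```python
from collections import defaultdict


def group_by_tag(memory: list[dict[str, str]]) -> dict[str, list[str]]:
    """Collect the info strings stored under each tag, in insertion order."""
    grouped = defaultdict(list)
    for entry in memory:
        for tag, info in entry.items():
            grouped[tag].append(info)
    return dict(grouped)


memory = [
    {"commute_observation": "User arrived at work earlier than usual due to smooth traffic."},
    {"commute_observation": "User took the train and had to stand due to packed ride."},
    {"breakfast_choice": "User had toast for breakfast."},
]

grouped = group_by_tag(memory)
for tag, infos in grouped.items():
    print(tag, len(infos))
```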


### Tool calling

Also known as **function calling**, this technique provides a powerful and flexible way for models to interface with external systems and access data outside their training data. Function calling gives models access to new functionality and data they can use to follow instructions and respond to prompts. As we saw above, the LLM cannot actually execute functions: the main program listens to what the LLM hallucinates and executes commands based on that (@fig-brainvat).

![(**right**) LLM as brain in a vat that hallucinates outputs from information contained in inputs. It tells us *what* function to execute with *what* arguments. (**left**) The computer listens to the LLM and performs computation based on it.](../img/llm-brain-vat.png){#fig-brainvat width=80%}

To show tool calling, we find the weather in [Quisao](https://www.philatlas.com/luzon/r04a/rizal/pililla/quisao.html). We want the agent to call the following API:

In [11]:
import json
import requests

def get_weather(latitude, longitude):
    """
    Get current weather data for provided coordinates with units:
    temperature (celsius), wind speed (kph), & precipitation (mm).
    """
    response = requests.get((
        f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&"
        "current=temperature_2m,wind_speed_10m,relative_humidity_2m,precipitation,precipitation_probability"
    ))
    data = response.json()
    return data["current"]


get_weather(latitude=14.4779, longitude=121.3214)  # true coordinates

{'time': '2025-08-23T16:15',
 'interval': 900,
 'temperature_2m': 28.4,
 'wind_speed_10m': 2.6,
 'relative_humidity_2m': 79,
 'precipitation': 0.0,
 'precipitation_probability': 1}

**Tool definition.** For the LLM to understand a specific tool, we have to define a schema that informs the model of what the tool does and its expected (required and optional) arguments. The following is the function definition for `get_weather`:

In [12]:
tools = [
    {
        "type": "function", # <1>
        "function": {
            "name": "get_weather",  # <2>
            "description": "Get current weather data for provided coordinates with units: temperature (celsius), wind speed (kph), & precipitation (mm).",    # <3>
            "parameters": {
                "type": "object", # <4>
                "properties": { # <5>
                    "latitude": {
                        "type": "number"
                    },
                    "longitude": {
                        "type": "number"
                    },
                },
                "required": ["latitude", "longitude"],
                "additionalProperties": False,  # <6>
            },
            "strict": True, # <7>
        },
    }
]

1. Should always be `function`.
2. Function's name (i.e. `get_weather`).
3. Usually just the docstring. Should describe when and how to use the function.
4. Parameters for LLMs are naturally JSON objects. Because `parameters` is defined by a JSON schema, you can leverage many of its rich features like property types, enums, descriptions, nested objects, and so on. For example, we can have: 
    ```
    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "unit of measure for temperature"}
    ```
5. Maps each argument name to its schema. Clearly the two arguments are required.
6. Part of JSON schema that determines whether extra fields are valid or not.
7. Not part of JSON schema, but an OpenAI function calling option that *guides* the model to strictly follow the schema (i.e. not improvise). Setting `additionalProperties` to `False` and `strict` to `True` work together to ensure that the model generates arguments that match the parameters schema.
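
Since the description is usually just the docstring and the arguments mirror the function signature, a definition in this shape can be derived from the function itself. The following is a minimal sketch under stated assumptions: `function_to_tool` is a hypothetical helper, it handles only a few primitive type hints, it treats every parameter as required, and it falls back to `"string"` for anything it does not recognize.

```python
import inspect

# Mapping from Python type hints to JSON schema types (deliberately incomplete).
PY_TO_JSON = {float: "number", int: "integer", str: "string", bool: "boolean"}


def function_to_tool(fn) -> dict:
    """Build a tool definition from a function's signature and docstring."""
    signature = inspect.signature(fn)
    properties = {
        name: {"type": PY_TO_JSON.get(param.annotation, "string")}
        for name, param in signature.parameters.items()
    }
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.cleandoc(fn.__doc__ or ""),
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),  # assumption: all parameters required
                "additionalProperties": False,
            },
            "strict": True,
        },
    }


def get_weather(latitude: float, longitude: float):
    """Get current weather data for provided coordinates."""


tool = function_to_tool(get_weather)
print(tool["function"]["parameters"]["properties"])
```

Generating the definition this way keeps the schema and the Python function from drifting apart, at the cost of requiring type hints on the function.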

In [13]:
system_prompt = "You are a helpful weather assistant. Feel free to determine coordinates given location."
user_prompt = "What's the weather like in Quisao, Pililla, Rizal right now?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    tools=tools,
)

pprint(completion.choices[0].message.model_dump())

{'annotations': [],
 'audio': None,
 'content': None,
 'function_call': None,
 'refusal': None,
 'role': 'assistant',
 'tool_calls': [{'function': {'arguments': '{"latitude":14.4777,"longitude":121.3172}',
                              'name': 'get_weather'},
                 'id': 'call_5HxvMhXUuVUHMnr0ApE5XCQS',
                 'type': 'function'}]}


Impressive that it's able to get fairly accurate coordinates without using web search. Also notice that when the model decides to call a tool, only `tool_calls` is nonempty in the response message (e.g. `content` is `None`).
We will iterate over the tool calls and process them separately. Each function call and its result are then logged as part of the sequence of **chat messages**. (This explains why we defined `messages` outside the API call, unlike the usual setup.)

In [14]:
def call_function(name, args):
    fn = {
        "get_weather": get_weather  
    }
    return fn[name](**args)


assistant_message = completion.choices[0].message
messages.append(assistant_message.model_dump())  # <1>

for tool_call in assistant_message.tool_calls or []:  # tool_calls is None when no tool is requested
    args = json.loads(tool_call.function.arguments)
    name = tool_call.function.name
    tool_output = call_function(name, args)
    messages.append({    # <2>
        "role": "tool", 
        "tool_call_id": tool_call.id, 
        "content": json.dumps(tool_output)
    })

1. The tool call is logged with role `assistant`.
2. Next, we log the result of the function call with role `tool`. 

In [15]:
pprint(messages)

[{'content': 'You are a helpful weather assistant. Feel free to determine '
             'coordinates given location.',
  'role': 'system'},
 {'content': "What's the weather like in Quisao, Pililla, Rizal right now?",
  'role': 'user'},
 {'annotations': [],
  'audio': None,
  'content': None,
  'function_call': None,
  'refusal': None,
  'role': 'assistant',
  'tool_calls': [{'function': {'arguments': '{"latitude":14.4777,"longitude":121.3172}',
                               'name': 'get_weather'},
                  'id': 'call_5HxvMhXUuVUHMnr0ApE5XCQS',
                  'type': 'function'}]},
 {'content': '{"time": "2025-08-23T16:15", "interval": 900, "temperature_2m": '
             '28.6, "wind_speed_10m": 2.6, "relative_humidity_2m": 79, '
             '"precipitation": 0.0, "precipitation_probability": 5}',
  'role': 'tool',
  'tool_call_id': 'call_5HxvMhXUuVUHMnr0ApE5XCQS'}]


Next, we pass this thread to another API call (possibly to a different LLM) which will process the outputs of the tool calls along with earlier chat messages. To take advantage of structured outputs we again define a response format. Here we use Pydantic `Field` with a description that helps the LLM.

In [None]:
from pydantic import Field

class WeatherResponse(BaseModel):
    response: str = Field(description="A natural language response to the user's question.")
    temperature: float = Field(description="Current temperature in celsius for the given location.")


completion_weather = client.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    tools=tools,    # <!>
    response_format=WeatherResponse,
)

:::{.callout-caution}
The aggregator also needs access to tools to understand the context of each tool call!
:::

Final output:

In [18]:
parsed_weather = completion_weather.choices[0].message.parsed
print(parsed_weather.temperature)
pprint(parsed_weather.response)

28.6
('The current weather in Quisao, Pililla, Rizal is warm with a temperature of '
 '28.6°C. The wind speed is light at 2.6 kph, and there is no precipitation at '
 'the moment, with a low chance of rain.')
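
The two API calls above (one that requests a tool, one that reads its result) can be generalized into a loop that keeps calling the model until it answers without requesting any tool. The following is a sketch under stated assumptions: `run_with_tools` is a hypothetical helper, and the scripted fake client only mimics the attributes used here so that the snippet runs without an API key.

```python
import json
from types import SimpleNamespace


def run_with_tools(client, messages, tools, registry, model="gpt-4.1", max_turns=5):
    """Call the model repeatedly until it replies without requesting tools."""
    for _ in range(max_turns):
        completion = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        message = completion.choices[0].message
        if not message.tool_calls:
            return message.content  # final answer
        messages.append(message.model_dump())  # log the tool request
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            output = registry[tool_call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(output),
            })
    raise RuntimeError("model kept requesting tools")


class FakeMessage:
    """Minimal stand-in for an SDK assistant message (assumption)."""

    def __init__(self, content=None, tool_calls=None):
        self.content = content
        self.tool_calls = tool_calls

    def model_dump(self):
        return {"role": "assistant", "content": self.content}


class FakeCompletions:
    """Returns scripted replies in order, ignoring the request."""

    def __init__(self, replies):
        self.replies = list(replies)

    def create(self, **kwargs):
        return SimpleNamespace(choices=[SimpleNamespace(message=self.replies.pop(0))])


tool_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"latitude": 14.48, "longitude": 121.32}'),
)
replies = [FakeMessage(tool_calls=[tool_call]), FakeMessage(content="It is 28.6 degrees Celsius.")]
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions(replies)))

thread = [{"role": "user", "content": "Weather in Quisao?"}]
registry = {"get_weather": lambda latitude, longitude: {"temperature_2m": 28.6}}
answer = run_with_tools(fake_client, thread, tools=[], registry=registry)
print(answer)
```

The `max_turns` cap is a simple safeguard against a model that never stops requesting tools.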


### Retrieval