## Production-Ready Agent Engineering: From MCP to RL

### Lecture 2: Production-Grade Agents

**Instructor: Will Brown**

*Date: June 19, 2025*

#### Agents as Software Programs
- Typing
- Testing Practices
- Tools as Functions
- Async Processing
- Parallel Tool Execution

#### System Architecture
- Logging + Observability
- Databases
- Is RAG dead?
- Client-Server (FastAPI, MCP, A2A)

#### Security + Reliability
- Environment Variables
- Tools as Action Whitelists
- Error handling + Retries
- Code sandboxes
    - Docker
    - E2B
    - Morph, Modal, Lambda
- Auth + Permissioning



## Agents as Software Programs

### Typing

- Type hints aren't mandatory in Python, but **strongly recommended**
- `Any` is an escape hatch if needed, but `Union` is preferable
- Ensures reliable composability
- Enable Pylance (Pyright) in IDE
- Linting for code quality (Pylance, Ruff)
    - Great for LLM-assisted development: many IDEs surface linter errors to the LLM, which catches many bugs early

In [19]:
# "sloppy" example

def cumulative_returns(daily_returns):
    """
    Return the running product of daily returns.
    e.g. [1.1, 1.03, 1.04]  ->  [1.1, 1.133, 1.1783]
    """
    acc = 1
    totals = []
    for ret in daily_returns:
        acc *= ret
        totals.append(acc)
    return totals

rets = [1.1, 1.5, 2]
cumulative_returns(rets)

[1.1, 1.6500000000000001, 3.3000000000000003]

In [20]:
from openai import OpenAI
prompt = """
Value yesterday: 100

Value today: 300

Return the growth multiplier (number only).
"""

oai = OpenAI()

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": prompt}],
)

mult = response.choices[0].message.content
print(f"Growth multiplier: {mult}")

Growth multiplier: 3


In [21]:
rets = [1, 3, 2]
rets.append(mult) # type: ignore
cumulative_returns(rets)

[1, 3, 6, '333333']

In [22]:
6 * '3'

'333333'

In [23]:
# fix: add type hints (checked statically by Pylance/Pyright, not enforced at runtime)

def cumulative_returns(daily_returns: list[float]) -> list[float]:
    """
    Return the running product of daily returns.
    e.g. [1.1, 1.03, 1.04]  ->  [1.1, 1.133, 1.1783]
    """
    acc = 1
    totals = []
    for ret in daily_returns:
        acc *= ret
        totals.append(acc)
    return totals

rets = [1.1, 1.03, 1.04]
cumulative_returns(rets)

[1.1, 1.1330000000000002, 1.1783200000000003]
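
Type hints are checked statically, so a string coming back from an LLM at runtime still slips through unless it is converted at the boundary. A minimal sketch of that conversion, assuming a hypothetical `parse_multiplier` helper (not part of the lecture code):

```python
import math


def parse_multiplier(raw: str) -> float:
    """Convert raw LLM text to a float, failing loudly on bad input.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    value = float(raw.strip())  # raises ValueError for non-numeric text
    if not math.isfinite(value):
        raise ValueError(f"Multiplier must be finite, got {raw!r}")
    return value


def cumulative_returns(daily_returns: list[float]) -> list[float]:
    """Return the running product of daily returns."""
    acc = 1.0
    totals: list[float] = []
    for ret in daily_returns:
        acc *= ret
        totals.append(acc)
    return totals


rets = [1.0, 3.0, 2.0]
rets.append(parse_multiplier("3"))  # the LLM reply arrives as the string "3"
print(cumulative_returns(rets))  # [1.0, 3.0, 6.0, 18.0]
```

With the conversion in place, a non-numeric reply raises `ValueError` immediately instead of silently producing `'333333'`.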

In [24]:
# pytest cases for cumulative_returns

import pytest

def test_cumulative_returns_basic():
    """Test basic cumulative returns calculation"""
    result = cumulative_returns([1.1, 1.03, 1.04])
    expected = [1.1, 1.133, 1.17832]
    for actual, expected_val in zip(result, expected):
        assert actual == pytest.approx(expected_val, rel=1e-10)

def test_cumulative_returns_single():
    """Test with single return value"""
    assert cumulative_returns([1.5]) == [1.5]

def test_cumulative_returns_empty():
    """Test with empty list"""
    assert cumulative_returns([]) == []

def test_cumulative_returns_identity():
    """Test with identity returns (1.0)"""
    assert cumulative_returns([1.0, 1.0, 1.0]) == [1.0, 1.0, 1.0]

def test_cumulative_returns_negative():
    """Test with losses (multipliers below 1.0)"""
    result = cumulative_returns([0.9, 0.8])
    expected = [0.9, 0.72]
    for actual, expected_val in zip(result, expected):
        assert actual == pytest.approx(expected_val, rel=1e-10)

def test_cumulative_returns_mixed():
    """Test with mixed gains and losses"""
    result = cumulative_returns([1.1, 0.9, 1.2])
    expected = [1.1, 0.99, 1.188]
    for actual, expected_val in zip(result, expected):
        assert actual == pytest.approx(expected_val, rel=1e-10)

def test_cumulative_returns_type_error():
    """Test with string input"""
    with pytest.raises(TypeError):
        cumulative_returns(["1.1", "1.03", "1.04"]) # type: ignore

def test_cumulative_returns_type_error_mixed():
    """Test with type error"""
    with pytest.raises(TypeError):
        cumulative_returns([1.1, "1.03", 1.04]) # type: ignore

# Notebook-friendly test runner
def run_tests():
    """Run all test functions and report results"""
    test_functions = [
        test_cumulative_returns_basic,
        test_cumulative_returns_single,
        test_cumulative_returns_empty,
        test_cumulative_returns_identity,
        test_cumulative_returns_negative,
        test_cumulative_returns_mixed,
        test_cumulative_returns_type_error,
        test_cumulative_returns_type_error_mixed,
    ]
    passed = 0
    failed = 0
    print("Running tests...\n")

    for test_func in test_functions:
        try:
            test_func()
            print(f"✅ {test_func.__name__}: PASSED")
            passed += 1
        except Exception as e:
            print(f"❌ {test_func.__name__}: FAILED - {str(e)}")
            failed += 1

    print(f"\n{'='*50}")
    print(f"Results: {passed} passed, {failed} failed")

    if failed == 0:
        print("🎉 All tests passed!")

    return failed == 0

# Run the tests
run_tests()

Running tests...

✅ test_cumulative_returns_basic: PASSED
✅ test_cumulative_returns_single: PASSED
✅ test_cumulative_returns_empty: PASSED
✅ test_cumulative_returns_identity: PASSED
✅ test_cumulative_returns_negative: PASSED
✅ test_cumulative_returns_mixed: PASSED
✅ test_cumulative_returns_type_error: PASSED
✅ test_cumulative_returns_type_error_mixed: PASSED

Results: 8 passed, 0 failed
🎉 All tests passed!


True

In [25]:
# structured output
from pydantic import BaseModel
import instructor

"""
alternatives:
- outlines
- openai.beta.chat.completions.parse
- openai.responses.create
- JSON mode
"""

class GrowthMultiplier(BaseModel):
    growth_multiplier: float

prompt = """
Value yesterday: 100

Value today: 300

Return the growth multiplier (number only).
"""

oai = OpenAI()
instructor_oai = instructor.from_openai(oai)

response = instructor_oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": prompt}],
    response_model=GrowthMultiplier, # type: ignore
)

print(f"Growth multiplier: {response.growth_multiplier}")

Growth multiplier: 3.0
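
If you use plain JSON mode (one of the alternatives listed above) instead of a library like instructor, the reply is still a string that you have to parse and validate yourself. A rough sketch using only the standard library; `raw_reply` is a hard-coded stand-in for a model reply, and `parse_growth_multiplier` is a hypothetical helper:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GrowthMultiplier:
    growth_multiplier: float


def parse_growth_multiplier(raw_reply: str) -> GrowthMultiplier:
    """Parse and validate a JSON reply (hypothetical helper for illustration)."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    value = data.get("growth_multiplier")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"growth_multiplier must be a number, got {value!r}")
    return GrowthMultiplier(growth_multiplier=float(value))


# Stand-in for a model reply; a real reply would come from the API.
raw_reply = '{"growth_multiplier": 3}'
parsed = parse_growth_multiplier(raw_reply)
print(parsed.growth_multiplier, type(parsed.growth_multiplier))
```

This is the validation work that Pydantic plus instructor does for you, including re-asking the model when validation fails.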


In [26]:
print(type(response.growth_multiplier))

<class 'float'>


### Tools as Functions

Test your tools! Your agent's reliability is only as good as its tools' reliability.

In [28]:
# openai agents sdk
# uv add openai-agents

from agents import Agent, Runner, function_tool

def calculator(expression: str) -> str:
    """Evaluates a single line of Python math expression. No imports or variables allowed.

    Args:
        expression (str): A mathematical expression using only numbers and basic operators (+,-,*,/,**,())

    Returns:
        The result of the calculation or an error message

    Examples:
        "2 + 2" -> "4"
        "3 * (17 + 4)" -> "63"
        "100 / 5" -> "20.0"
    """
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        return "Error: Invalid characters in expression"

    try:
        # eval is a dangerous function, use with caution
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {str(e)}"

print(calculator("2 + 2"))
print(calculator("3 x (17 + 4)"))
print(calculator("(62565374 + 265345356) / 425634563456"))

prompt = """
What is (62565374 + 265345356) / 425634563456?
"""


@function_tool
async def calculator_tool(expression: str) -> str:
    """Evaluates a single line of Python math expression. No imports or variables allowed.

    Args:
        expression (str): A mathematical expression using only numbers and basic operators (+,-,*,/,**,())

    Returns:
        The result of the calculation or an error message

    Examples:
        "2 + 2" -> "4"
        "3 * (17 + 4)" -> "63"
        "100 / 5" -> "20.0"
    """
    return calculator(expression)

agent = Agent(
    model="gpt-4.1-nano",
    name="calculator_agent",
    tools=[calculator_tool],
)
print("---")

result = Runner.run_sync(agent, prompt)
print(result.final_output)

agent = Agent(
    model="gpt-4.1-nano",
    name="calculator_agent",
    tools=[],
)
print("---")
result = Runner.run_sync(agent, prompt)
print(result.final_output)

4
Error: Invalid characters in expression
0.0007704043753812719
---
The value of \(\frac{62565374 + 265345356}{425634563456}\) is approximately 0.0007704.
---
Let's perform the calculations step-by-step:

First, add the numerators:
62565374 + 265345356 = 327910730

Next, divide this sum by the denominator:
327910730 / 425634563456

Now, compute the division:
≈ 7.693 x 10⁻⁴

So, the answer is approximately **0.0007693**.
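
The character whitelist in `calculator` still hands the expression to `eval`, and an input such as `9**9**9**9` passes the filter while being far too expensive to compute. One possible hardening, sketched here under the assumption that only basic arithmetic is needed, is to walk the parsed AST and allow a fixed set of operators. `safe_calculator` and the exponent cap are hypothetical choices, not the lecture's implementation:

```python
import ast
import operator

# Only these operators are permitted; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_MAX_EXPONENT = 100  # arbitrary cap chosen for this sketch


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a node, rejecting anything outside the whitelist."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left, right = _evaluate(node.left), _evaluate(node.right)
        if isinstance(node.op, ast.Pow) and abs(right) > _MAX_EXPONENT:
            raise ValueError("Exponent too large")
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("Unsupported expression")


def safe_calculator(expression: str) -> str:
    """Evaluate basic arithmetic without calling eval (hypothetical variant)."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except Exception as e:
        return f"Error: {e}"


print(safe_calculator("3 * (17 + 4)"))      # 63
print(safe_calculator("9 ** 9 ** 9"))       # Error: Exponent too large
print(safe_calculator("__import__('os')"))  # Error: Unsupported expression
```

As with the original tool, errors are returned as strings so the model can read them and retry.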


### Async Processing

In [30]:
# naive synchronous calls

prompts = []

keywords = ["basketball", "tennis", "soccer", "baseball", "hockey", "golf", "tennis", "soccer", "baseball", "hockey", "golf"]

for keyword in keywords:
    prompts.append(f"Who are the 10 most famous {keyword} players?")

for prompt in prompts:
    print(prompt)

Who are the 10 most famous basketball players?
Who are the 10 most famous tennis players?
Who are the 10 most famous soccer players?
Who are the 10 most famous baseball players?
Who are the 10 most famous hockey players?
Who are the 10 most famous golf players?
Who are the 10 most famous tennis players?
Who are the 10 most famous soccer players?
Who are the 10 most famous baseball players?
Who are the 10 most famous hockey players?
Who are the 10 most famous golf players?


In [31]:
class Player(BaseModel):
    name: str

class Players(BaseModel):
    players: list[Player]


for prompt in prompts:
    response = instructor_oai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        response_model=Players,
    )
    print(response.players)

[Player(name='Michael Jordan'), Player(name='LeBron James'), Player(name='Kobe Bryant'), Player(name="Shaquille O'Neal"), Player(name='Tim Duncan'), Player(name='Larry Bird'), Player(name='Magic Johnson'), Player(name='Kareem Abdul-Jabbar'), Player(name='Bill Russell'), Player(name='Stephen Curry')]
[Player(name='Roger Federer'), Player(name='Rafael Nadal'), Player(name='Novak Djokovic'), Player(name='Serena Williams'), Player(name='Steffi Graf'), Player(name='Martina Navratilova'), Player(name='Pete Sampras'), Player(name='Bjorn Borg'), Player(name='Andre Agassi'), Player(name='Venus Williams')]
[Player(name='Lionel Messi'), Player(name='Cristiano Ronaldo'), Player(name='Neymar Jr'), Player(name='Kylian Mbappe'), Player(name='Mohamed Salah'), Player(name='Kevin De Bruyne'), Player(name='Robert Lewandowski'), Player(name='Harry Kane'), Player(name='Erling Haaland'), Player(name='Sadio Mane')]
[Player(name='Babe Ruth'), Player(name='Jackie Robinson'), Player(name='Hank Aaron'), Player(n

In [32]:
# async version of the above
from openai import AsyncOpenAI
import asyncio
import nest_asyncio
nest_asyncio.apply() # needed for jupyter notebooks

oai = AsyncOpenAI()
instructor_oai = instructor.from_openai(oai)

async def get_players(keyword: str) -> list[Player]:
    response = await instructor_oai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"Who are the 10 most famous {keyword} players?"}],
        response_model=Players,
    )
    return response.players

# run all in parallel
async def main():
    tasks = [get_players(keyword) for keyword in keywords]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
print(results)

[[Player(name='Michael Jordan'), Player(name='LeBron James'), Player(name='Kobe Bryant'), Player(name="Shaquille O'Neal"), Player(name='Larry Bird'), Player(name='Magic Johnson'), Player(name='Tim Duncan'), Player(name='Kareem Abdul-Jabbar'), Player(name='Bill Russell'), Player(name='Kevin Durant')], [Player(name='Roger Federer'), Player(name='Rafael Nadal'), Player(name='Serena Williams'), Player(name='Novak Djokovic'), Player(name='Pete Sampras'), Player(name='Steffi Graf'), Player(name='Venus Williams'), Player(name='Andy Murray'), Player(name='Maria Sharapova'), Player(name='Bjorn Borg')], [Player(name='Pelé'), Player(name='Diego Maradona'), Player(name='Lionel Messi'), Player(name='Cristiano Ronaldo'), Player(name='Johan Cruyff'), Player(name='Zinedine Zidane'), Player(name='David Beckham'), Player(name='Ronaldinho'), Player(name='Michel Platini'), Player(name='Franz Beckenbauer')], [Player(name='Babe Ruth'), Player(name='Willie Mays'), Player(name='Hank Aaron'), Player(name='Ted 

In [33]:
# semaphore version of the above

async def get_players_sem(keyword: str, semaphore: asyncio.Semaphore) -> list[Player]:
    async with semaphore:
        response = await instructor_oai.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": f"Who are the 10 most famous {keyword} players?"}],
            response_model=Players,
        )
        return response.players

# run all in parallel
async def main():
    semaphore = asyncio.Semaphore(5) # limit concurrent requests
    tasks = [get_players_sem(keyword, semaphore) for keyword in keywords]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
print(results)

[[Player(name='Michael Jordan'), Player(name='LeBron James'), Player(name='Kobe Bryant'), Player(name="Shaquille O'Neal"), Player(name='Tim Duncan'), Player(name='Larry Bird'), Player(name='Magic Johnson'), Player(name='Kareem Abdul-Jabbar'), Player(name='Wilt Chamberlain'), Player(name='Bill Russell')], [Player(name='Roger Federer'), Player(name='Rafael Nadal'), Player(name='Novak Djokovic'), Player(name='Serena Williams'), Player(name='Venus Williams'), Player(name='Pete Sampras'), Player(name='Andre Agassi'), Player(name='Martina Navratilova'), Player(name='Steffi Graf'), Player(name='Bjorn Borg')], [Player(name='Lionel Messi'), Player(name='Cristiano Ronaldo'), Player(name='Neymar Jr.'), Player(name='Kylian Mbappe'), Player(name='Mohamed Salah'), Player(name='Kevin De Bruyne'), Player(name='Robert Lewandowski'), Player(name='Luka Modric'), Player(name='Virgil van Dijk'), Player(name='Karim Benzema')], [Player(name='Babe Ruth'), Player(name='Willie Mays'), Player(name='Hank Aaron'),
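
To see what the semaphore buys you without spending API credits, the same pattern can be run against a fake request. A small sketch, assuming a hypothetical `fake_llm_call` that sleeps for 0.1 s in place of a real model call (in a notebook, `nest_asyncio.apply()` is still needed before `asyncio.run`):

```python
import asyncio
import time


async def fake_llm_call(keyword: str, semaphore: asyncio.Semaphore) -> str:
    """Stand-in for an API request; sleeps instead of calling a model."""
    async with semaphore:
        await asyncio.sleep(0.1)  # simulated network latency
        return f"players for {keyword}"


async def run_all(keywords: list[str], limit: int) -> tuple[list[str], float]:
    """Run every request with at most `limit` in flight; return results and elapsed seconds."""
    semaphore = asyncio.Semaphore(limit)
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_llm_call(k, semaphore) for k in keywords))
    return results, time.perf_counter() - start


keywords = ["basketball", "tennis", "soccer", "baseball", "hockey", "golf"]
for limit in (1, 3, 6):
    results, elapsed = asyncio.run(run_all(keywords, limit))
    print(f"limit={limit}: {len(results)} results in {elapsed:.2f}s")
```

A limit of 1 behaves like the synchronous loop, while larger limits shorten the total time until the limit matches the number of requests. `asyncio.gather` keeps results in the same order as the inputs regardless of which request finishes first.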

In [34]:
# same idea for tool calls

class ToolCall(BaseModel):
    name: str
    args: dict

class ToolCalls(BaseModel):
    thinking: str
    tool_calls: list[ToolCall]


prompt = """
What is (62565374 + 265345356) / 425634563456?

What is ((426336 * 23423563563456) + 55363563) / 3456345636?

What is (62565374 + 265345356) / 425634563456?

What is ((426336 * 23423563563456) + 55363563) / 3456345636?

What is (62565374 + 265345356) / 425634563456?

What is ((426336 * 23423563563456) + 55363563) / 3456345636?

Use the calculator tool to calculate the answers to the questions.
- name: calculator
- args:
    - expression: str
"""

oai = OpenAI()
instructor_oai = instructor.from_openai(oai)

response = instructor_oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": prompt}],
    response_model=ToolCalls,
)

print(response.tool_calls)


[ToolCall(name='calculator', args={'expression': '(62565374 + 265345356) / 425634563456'}), ToolCall(name='calculator', args={'expression': '((426336 * 23423563563456) + 55363563) / 3456345636'}), ToolCall(name='calculator', args={'expression': '(62565374 + 265345356) / 425634563456'}), ToolCall(name='calculator', args={'expression': '((426336 * 23423563563456) + 55363563) / 3456345636'}), ToolCall(name='calculator', args={'expression': '(62565374 + 265345356) / 425634563456'}), ToolCall(name='calculator', args={'expression': '((426336 * 23423563563456) + 55363563) / 3456345636'})]


In [35]:
response.thinking


"""
<thinking>
Let's think step by step.
...
</thinking>
"""

'Calculate the given expressions using the calculator tool with correct arguments.'

In [36]:
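# asynchronously execute tool calls

async def execute_tool(func, args) -> str:
    # calculator is synchronous, so run it in a worker thread to avoid blocking the event loop
    return await asyncio.to_thread(func, **args)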

# run all in parallel
async def main():
    tasks = [execute_tool(calculator, tool_call.args)
             for tool_call in response.tool_calls]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(main())
print(results)

['0.0007704043753812719', '2889267870.5020986', '0.0007704043753812719', '2889267870.5020986', '0.0007704043753812719', '2889267870.5020986']


### Pre-Fetch RAG vs Agentic RAG

#### RAG = Retrieval-Augmented Generation

- MCP-aided search agents = RAG
- Deep Research = RAG

#### Pre-Fetch RAG
- search *before* LLM calls
- example from previous lecture:
    - "helper agent" = pre-fetch RAG
    - common confusion: vector DBs != RAG
- good if:
    - docs aren't too long
    - you want to cache docs for multiple Qs
    - don't need "multi-hop" search, just "easy lookup"

#### Agentic RAG
- retrieval is driven by the model's own tool calls, not by "always on" program logic
- good for:
    - leveraging existing DB indexes/search tools
    - retries / adaptive queries are "native"
    - combining multiple data sources
- Pro tips:
    - Markdown docs are LLM-friendly, OCR and/or markdownify are great
    - Leverage natural file system + link structures
    - LLMs are great at clever plaintext search!
- Generally recommended as default pattern 
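
As an illustration of the plaintext-search tip, here is a rough sketch of a grep-style tool over a folder of Markdown docs. `search_docs` is a hypothetical tool name, and the two-file corpus is made up so the example runs on its own:

```python
import re
import tempfile
from pathlib import Path


def search_docs(root: str, pattern: str, max_hits: int = 5) -> list[str]:
    """Grep-style search over Markdown files; returns 'file:line: text' hits.

    Hypothetical tool an agent could call repeatedly with refined patterns.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[str] = []
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, 1):
            if regex.search(line):
                hits.append(f"{path.name}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits  # cap the output so it stays small in context
    return hits


# Tiny made-up corpus so the sketch runs on its own.
with tempfile.TemporaryDirectory() as docs_dir:
    Path(docs_dir, "python.md").write_text(
        "# Python\nCreated by Guido van Rossum.\n", encoding="utf-8"
    )
    Path(docs_dir, "rust.md").write_text(
        "# Rust\nOriginally designed by Graydon Hoare.\n", encoding="utf-8"
    )
    print(search_docs(docs_dir, r"created|designed"))
```

Because the agent picks the pattern, it can retry with a different query when the first search comes back empty, which is the "native" retry behavior described above.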

## System Architecture

### Logging + Monitoring

Popular options
- PydanticAI Logfire
    - https://ai.pydantic.dev/
- W&B Weave
    - https://github.com/wandb/weave 
- MLflow Tracing
    - https://mlflow.org/docs/latest/genai/tracing
- Arize Phoenix 
    - https://github.com/Arize-ai/phoenix 



In [1]:
import logfire

logfire.configure()
logfire.info("Hello, world!")

17:01:02.355 Hello, world!


Logfire project URL: https://logfire-us.pydantic.dev/williambrown97/starter-project


In [37]:
from pydantic_ai import Agent, RunContext
from pydantic import BaseModel
import asyncio

import nest_asyncio
nest_asyncio.apply()


logfire.instrument_asyncpg()

class CalculatorDependencies(BaseModel):
    expression: str

class CalculatorResponse(BaseModel):
    float_result: float

calculator_agent = Agent(
    "openai:gpt-4.1-mini",
    deps_type=CalculatorDependencies,
    output_type=CalculatorResponse,
    system_prompt="""
    You are a helpful assistant that can calculate the result of a mathematical expression.
    """,
    instrument=True
)

@calculator_agent.tool
async def calculator_tool(context: RunContext[CalculatorDependencies]) -> float:
    return float(calculator(context.deps.expression))

async def main():
    deps = CalculatorDependencies(expression="2 + 2")
    # return the run result so the caller can print it
    return await calculator_agent.run("What is the answer?", deps=deps)

result = asyncio.run(main())
print(result)

Attempting to instrument while already instrumented


19:50:43.860 calculator_agent run
19:50:43.861   chat gpt-4.1-mini
19:50:44.630   running 1 tool
19:50:44.630     running tool: calculator_tool
19:50:44.631   chat gpt-4.1-mini
AgentRunResult(output=CalculatorResponse(float_result=4.0))


### MCP + Client-Server Architectures

#### API Servers

- see `servers/fetch_wiki.py` + `tests/test_fetch_wiki_search.py`

#### MCP as "API Servers for LLMs"
- see https://github.com/modelcontextprotocol/servers/blob/main/src/fetch/src/mcp_server_fetch/server.py

related projects of mine:
- https://github.com/willccbb/claude-code-mcp
- https://github.com/willccbb/claude-deep-research
- https://github.com/willccbb/mcp-client-server

MCP repositories
- official page: https://github.com/modelcontextprotocol/servers 
- https://smithery.ai/ 
- https://mcp.so/

Popular clients:
- Cursor, Windsurf
- Claude Code
- ChatGPT

```bash
# installable js/ts servers
claude mcp add filesystem -- npx -y @modelcontextprotocol/server-filesystem $CLAUDE_FILESYSTEM_PATH
claude mcp add brave-search -e BRAVE_API_KEY=$BRAVE_API_KEY -- npx -y @modelcontextprotocol/server-brave-search
claude mcp add e2b -e E2B_API_KEY=$E2B_API_KEY -- npx -y @e2b/mcp-server 

# installable python servers
claude mcp add fetch uvx mcp-server-fetch

claude
```

Transport protocols:
- stdio: local-friendly
- Streamable HTTP: primary remote server connection protocol (good support for streaming results, long-lived connections, availability, interrupts)
- SSE: original remote protocol, but MCP is moving away from it
- more info: [SSE vs Streamable HTTP](https://brightdata.com/blog/ai/sse-vs-streamable-http)


A2A:
- "Multi-Agent Layer": similar in spirit to MCP, and compatible with it
- Less useful in the "agents as tools" paradigm
- Worth being aware of, but perhaps a bit early to go all-in

#### Databases

- File systems (e.g. Docker)
    - tools: grep, sed, ls, cd, pwd, etc.
- SQL databases
    - tools: SQL query access (or limited wrappers, e.g. for read-only, id lookup, no joins, etc.)
    - e.g. SQLite, Postgres
- Vector databases
    - tools: querying for embedding similarity, ids
    - Chroma, Weaviate, Pinecone, Milvus, MongoDB Atlas, etc.
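
A "limited wrapper" for SQL access might look like the sketch below, which only accepts a single `SELECT` and caps the number of rows returned. `run_readonly_query` is a hypothetical tool name, the `players` table is made up, and the example uses SQLite's `query_only` pragma as a second line of defense:

```python
import sqlite3


def run_readonly_query(conn: sqlite3.Connection, sql: str, max_rows: int = 50) -> list[tuple]:
    """Run a single SELECT statement and return at most max_rows rows.

    Hypothetical wrapper; the keyword check is a coarse first filter, and
    PRAGMA query_only makes SQLite itself reject writes while it is on.
    """
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    conn.execute("PRAGMA query_only = ON")
    try:
        # Connection.execute refuses strings containing more than one statement
        return conn.execute(sql).fetchmany(max_rows)
    finally:
        conn.execute("PRAGMA query_only = OFF")


# Made-up data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, sport TEXT)")
conn.executemany(
    "INSERT INTO players (name, sport) VALUES (?, ?)",
    [("Michael Jordan", "basketball"), ("Roger Federer", "tennis")],
)
conn.commit()

print(run_readonly_query(conn, "SELECT name FROM players WHERE sport = 'tennis'"))
```

For a production database, a dedicated read-only role or connection is a stronger guarantee than string checks in application code.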

In [38]:
import requests
BASE_URL = "http://localhost:8000"


def test_article_endpoint():
    """Test getting an article as markdown."""
    print("=== Testing Article Endpoint (Markdown) ===")
    try:
        response = requests.get(f"{BASE_URL}/article", params={"title": "Python (programming language)"})
        print(f"Status: {response.status_code}")
        content = response.text
        print(f"Content length: {len(content)} characters")
        print("First 2000 characters:")
        print((content[:2000] + "...") if len(content) > 2000 else content)
    except Exception as e:
        print(f"Error: {e}")
    print()

test_article_endpoint()

=== Testing Article Endpoint (Markdown) ===
Status: 200
Content length: 167288 characters
First 2000 characters:
# Python (programming language)

General-purpose programming language

**Python** is a [high-level](/wiki/High-level_programming_language "High-level programming language"), [general-purpose programming language](/wiki/General-purpose_programming_language "General-purpose programming language"). Its design philosophy emphasizes [code readability](/wiki/Code_readability "Code readability") with the use of [significant indentation](/wiki/Significant_indentation "Significant indentation").

Python is [dynamically type-checked](/wiki/Type_system#DYNAMIC "Type system") and [garbage-collected](/wiki/Garbage_collection_(computer_science) "Garbage collection (computer science)"). It supports multiple [programming paradigms](/wiki/Programming_paradigm "Programming paradigm"), including [structured](/wiki/Structured_programming "Structured programming") (particularly [procedural](/wik

## Security

### Tools as Action Whitelists

Tempting:
- Give your agent a terminal

Challenges: 
- Workspace management (excess scripts)
- Bad practices
- Deleting important files

Workaround:
- "Whitelist" certain code paths
- Explicit tools for common actions (e.g. fetching a website and converting to markdown)
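
One way to picture the whitelist: the agent can only name a tool, and the program decides whether that name maps to code. A minimal sketch with two made-up tools (`word_count`, `to_upper`) and a hypothetical `dispatch` function:

```python
from typing import Callable


def word_count(text: str) -> str:
    return str(len(text.split()))


def to_upper(text: str) -> str:
    return text.upper()


# The whitelist: only functions registered here can ever be executed.
ALLOWED_TOOLS: dict[str, Callable[..., str]] = {
    "word_count": word_count,
    "to_upper": to_upper,
}


def dispatch(name: str, args: dict) -> str:
    """Run a whitelisted tool, returning an error string the model can read."""
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool {name!r}. Allowed: {sorted(ALLOWED_TOOLS)}"
    try:
        return tool(**args)
    except TypeError as e:  # wrong or missing arguments
        return f"Error: bad arguments for {name!r}: {e}"


print(dispatch("word_count", {"text": "agents are software programs"}))  # 4
print(dispatch("shell", {"cmd": "rm -rf /"}))  # rejected: not in the whitelist
```

Anything outside the registry, such as an arbitrary shell command, never reaches an interpreter.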

### Error Handling + Retries

- Decide when to warn, retry, hard fail
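
A rough sketch of that three-way decision: warn and retry on failures that are likely temporary, and hard fail on everything else or once retries run out. `TransientError` and `call_with_retries` are hypothetical names, and the delays are kept tiny so the example runs quickly:

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agent")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.01) -> T:
    """Retry transient failures with exponential backoff; re-raise everything else."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_attempts:
                raise  # hard fail: retries exhausted
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.2fs", attempt, e, delay)
            time.sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises


calls = {"count": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


print(call_with_retries(flaky))  # ok
```

Errors that are not `TransientError` (for example a `ValueError` from bad input) propagate on the first attempt, since retrying them would only repeat the same failure.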

### Code Sandboxes
- e2b ([MCP server](https://github.com/e2b-dev/mcp-server))
- Morph, Modal, AWS Lambda, etc.

### Authorization + Permissioning
- MCP is auth-optional by default
- Typical auth best practices still apply
- Default: the agent acts on behalf of a user, and its requests are treated as that user's program requests
- Spec: https://modelcontextprotocol.io/specification/draft/basic/authorization