# NPS Agent with MLflow Tracing & Agent-as-a-Judge

Query the National Park Service (NPS) through LlamaStack and an MCP server, trace each run with MLflow, and evaluate the resulting trace automatically with an Agent-as-a-Judge.

**Prerequisites:**
- LlamaStack server on `localhost:8321`
- NPS MCP server on `localhost:3005`
- `OPENAI_API_KEY` in environment
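
Before running the cells below, it can help to confirm that the three prerequisites are actually in place. The following is a minimal sketch using only the standard library; the helper names `port_is_open` and `check_prerequisites` are hypothetical and not part of LlamaStack or MLflow. It only checks that something is listening on each port, not that the service behind it is healthy.

```python
import os
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_prerequisites() -> dict:
    """Report which of the three prerequisites appear to be satisfied."""
    return {
        "llama_stack": port_is_open("localhost", 8321),  # LlamaStack server
        "nps_mcp": port_is_open("localhost", 3005),      # NPS MCP server
        "openai_api_key": bool(os.environ.get("OPENAI_API_KEY")),
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```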

In [24]:
import mlflow
import os
from mlflow.entities import SpanType, AssessmentSource, AssessmentSourceType
from mlflow.genai.judges import make_judge
from llama_stack_client import LlamaStackClient
from typing import Literal


## Configuration

In [25]:
# Configuration
LLAMA_STACK_URL = "http://localhost:8321/"
NPS_MCP_URL = "http://localhost:3005/sse/"
MODEL_ID = "openai/gpt-4o"
JUDGE_MODEL = "openai:/gpt-4o"

In [26]:

db_path = os.path.join(os.getcwd(), "mlflow.db")
mlflow.set_tracking_uri(f"sqlite:///{db_path}")
mlflow.set_experiment("nps-agent")
print(f"MLflow database: {db_path}")

MLflow database: /Users/nnarendr/Documents/Repos/agent_eval_report/nps/mlflow.db


## Agent Function

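`query_nps` sends the prompt to LlamaStack with the NPS MCP server attached as a tool. The `@mlflow.trace` decorator records each call as an MLflow trace, and the inner `mlflow.start_span` block records the LlamaStack request as a child span.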

In [27]:
@mlflow.trace(name="query_nps", span_type=SpanType.AGENT)
def query_nps(prompt: str, model: str = MODEL_ID) -> str:
    """Query the National Park Service agent."""
    client = LlamaStackClient(base_url=LLAMA_STACK_URL)
    
    with mlflow.start_span(name="mcp_tool_call", span_type=SpanType.LLM) as span:
        span.set_inputs({"model": model, "prompt": prompt})
        response = client.responses.create(
            model=model,
            input=prompt,
            tools=[{"type": "mcp", "server_url": NPS_MCP_URL, "server_label": "NPS tools"}]
        )
        span.set_outputs({"response_id": response.id, "status": response.status})
    
    # Extract text response
    for output in response.output:
        if output.type in ("text", "message") and hasattr(output, 'content') and output.content:
            return output.content[0].text
    return ""
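
`query_nps` returns only the first text part of the first matching output item. If a response carries several text parts, a variant that joins all of them may be preferable. The sketch below shows that extraction step in isolation; `extract_text` is a hypothetical helper, and the stand-in objects (including the `mcp_call` item type) are assumptions about the response shape, not captured LlamaStack output.

```python
from types import SimpleNamespace


def extract_text(output_items) -> str:
    """Join the text of every text-bearing part in a list of output items."""
    chunks = []
    for item in output_items:
        if getattr(item, "type", None) not in ("text", "message"):
            continue  # skip items that are not text or message outputs
        for part in getattr(item, "content", None) or []:
            text = getattr(part, "text", None)
            if text:
                chunks.append(text)
    return "\n".join(chunks)


# Stand-in objects shaped like the items `query_nps` iterates over (assumed shape).
fake_output = [
    SimpleNamespace(type="mcp_call", content=None),
    SimpleNamespace(type="message", content=[SimpleNamespace(text="Two parks found.")]),
]
print(extract_text(fake_output))
```

Like the original loop, this returns an empty string when no text output is present.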

## Agent-as-a-Judge

An Agent-as-a-Judge is an agent that evaluates another agent's trace after execution. Instead of looking only at inputs and outputs, it uses tools to inspect the full execution:
- What spans were created
- What tools were called
- How long each step took

The `{{ trace }}` template variable in the instructions tells MLflow to give the judge these inspection tools.
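
To get a feel for what the judge can see, the same span information can be summarized by hand. This is a rough sketch, not how MLflow's judge tools work internally: `summarize_spans` is a hypothetical helper, and it assumes each span exposes `name`, `span_type`, `start_time_ns`, and `end_time_ns`. The spans and timings below are made-up stand-ins so the block runs without a tracking server.

```python
from types import SimpleNamespace


def summarize_spans(spans):
    """Return (name, span_type, duration_ms) for each span-like object."""
    rows = []
    for span in spans:
        duration_ms = (span.end_time_ns - span.start_time_ns) / 1e6
        rows.append((span.name, span.span_type, duration_ms))
    return rows


# Stand-ins for the two spans `query_nps` creates; the timings are invented.
fake_spans = [
    SimpleNamespace(name="query_nps", span_type="AGENT",
                    start_time_ns=0, end_time_ns=5_000_000_000),
    SimpleNamespace(name="mcp_tool_call", span_type="LLM",
                    start_time_ns=1_000_000, end_time_ns=4_900_000_000),
]
for name, span_type, ms in summarize_spans(fake_spans):
    print(f"{name:<15} {span_type:<6} {ms:>8.1f} ms")
```

With a real trace, the call would presumably be `summarize_spans(trace.data.spans)`, provided every span has finished and has an end time.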

In [28]:
# Agent-as-a-Judge scorer
nps_judge = make_judge(
    name="nps_agent_evaluator",
    instructions=(
        "Evaluate the NPS agent's performance in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Response Quality: Did the agent correctly identify parks and provide accurate information?\n"
        "2. Tool Usage: Were the correct NPS MCP tools used (search_parks, get_park_events, etc.)?\n"
        "3. Completeness: Did the agent answer all parts of the user's question?\n\n"
        "Rate as: 'good', 'acceptable', or 'poor'"
    ),
    feedback_value_type=Literal["good", "acceptable", "poor"],
    model=JUDGE_MODEL,
)
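
Because the judge returns one of three categorical labels, comparing runs usually means aggregating those labels somehow. One possible approach, sketched below, maps each label to a number and averages. The weights in `RATING_SCORES` and the helper `summarize_ratings` are illustrative assumptions, not something MLflow defines.

```python
from collections import Counter
from typing import Iterable

# Hypothetical weights for the three labels the judge can return.
RATING_SCORES = {"good": 1.0, "acceptable": 0.5, "poor": 0.0}


def summarize_ratings(ratings: Iterable[str]) -> dict:
    """Count each label and compute the mean score (None if there are no ratings)."""
    ratings = list(ratings)
    unknown = sorted(set(ratings) - set(RATING_SCORES))
    if unknown:
        raise ValueError(f"Unexpected rating(s): {unknown}")
    mean = sum(RATING_SCORES[r] for r in ratings) / len(ratings) if ratings else None
    return {"counts": dict(Counter(ratings)), "mean_score": mean}


print(summarize_ratings(["good", "good", "poor"]))
```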

In [29]:
def evaluate_trace(trace):
    """Run Agent-as-a-Judge evaluation and log to MLflow."""
    feedback = nps_judge(trace=trace)
    
    trace_id = trace.info.trace_id
    mlflow.log_feedback(
        trace_id=trace_id,
        name="nps_agent_evaluation",
        value=feedback.value,
        rationale=feedback.rationale,
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id=f"agent-as-a-judge/{JUDGE_MODEL}",
        ),
    )
    
    print(f"\nEvaluation: {feedback.value}")
    print(f"Rationale: {feedback.rationale}")
    return feedback


## Run Agent & Evaluate

1. Send a query to the NPS agent
2. Get the MLflow trace from the execution
3. Pass the trace to the judge for evaluation
4. Log the feedback to MLflow (visible in the Assessments panel)

In [30]:
prompt = "Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them."

result = query_nps(prompt)
print(f"Response:\n{result}")

# Evaluate the trace
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)
evaluate_trace(trace)



Response:
Sure, here are some parks in Rhode Island along with any upcoming events at them:

1. **Blackstone River Valley National Historical Park**
   - **Description**: Known for powering America's entry into the Age of Industry with Samuel Slater's cotton spinning mill in Pawtucket, RI.
   - **Website**: [Blackstone River Valley National Historical Park](https://www.nps.gov/blrv/index.htm)
   - **Upcoming Events**:
     - **Revolutionary War Pension Files Transcription Event**: Several events are scheduled at different locations, such as the Carpenter Museum, Sutton Senior Center, Upton Community Center, and the Blackstone River Valley Heritage Center at Worcester. These events allow participants to transcribe historical documents related to Revolutionary War veterans. [Details Here](https://www.nps.gov/blrv/planyourvisit/event-details.htm).

2. **Roger Williams National Memorial**
   - **Description**: Dedicated to religious freedom, the memorial explores the life and legacy of Rog


Evaluation: good
Rationale: The NPS agent performed well based on the following criteria:

1. **Response Quality**: The agent correctly identified several parks in Rhode Island, providing detailed descriptions and accurate information about each park. For example, it mentioned the Blackstone River Valley National Historical Park and Roger Williams National Memorial, along with their historical significance and website links. It also included details about upcoming events, such as the Revolutionary War Pension Files Transcription Event, demonstrating an understanding of both the park themes and event scheduling.

2. **Tool Usage**: The agent effectively utilized the correct NPS MCP tools to gather the required park information and events. While the specific tool calls are encapsulated in the 'mcp_tool_call' span with a successful status code, suggesting that appropriate tool usage was conducted without errors during execution.

3. **Completeness**: The agent addressed both parts of the

Feedback(name='nps_agent_evaluator', source=AssessmentSource(source_type='LLM_JUDGE', source_id='openai:/gpt-4o'), trace_id='tr-12bc23adfd3aeb09a929d3c229ec252b', run_id=None, rationale="The NPS agent performed well based on the following criteria:\n\n1. **Response Quality**: The agent correctly identified several parks in Rhode Island, providing detailed descriptions and accurate information about each park. For example, it mentioned the Blackstone River Valley National Historical Park and Roger Williams National Memorial, along with their historical significance and website links. It also included details about upcoming events, such as the Revolutionary War Pension Files Transcription Event, demonstrating an understanding of both the park themes and event scheduling.\n\n2. **Tool Usage**: The agent effectively utilized the correct NPS MCP tools to gather the required park information and events. While the specific tool calls are encapsulated in the 'mcp_tool_call' span with a success
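
The run above covers a single prompt. The same four steps can be wrapped in a loop to evaluate several prompts in one go. The sketch below takes the agent, the trace lookup, and the judge as plain callables so it runs with stubs; `run_and_evaluate` is a hypothetical helper, and the stub values are placeholders.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_and_evaluate(
    prompts: Iterable[str],
    agent: Callable[[str], str],
    fetch_trace: Callable[[], Any],
    judge: Callable[[Any], Any],
) -> List[Dict[str, Any]]:
    """Run each prompt through the agent, then judge the trace it produced."""
    results = []
    for prompt in prompts:
        answer = agent(prompt)   # step 1: query the agent
        trace = fetch_trace()    # step 2: fetch the trace of that call
        feedback = judge(trace)  # steps 3-4: evaluate and log the feedback
        results.append({"prompt": prompt, "answer": answer, "feedback": feedback})
    return results


# Stub callables so the sketch runs without any servers.
demo = run_and_evaluate(
    ["parks in Rhode Island", "events at Acadia"],
    agent=lambda p: f"answer to: {p}",
    fetch_trace=lambda: "fake-trace",
    judge=lambda t: "good",
)
print(len(demo), demo[0]["feedback"])
```

In this notebook the callables would presumably be `query_nps`, `lambda: mlflow.get_trace(mlflow.get_last_active_trace_id())`, and `evaluate_trace`.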

## View Traces in MLflow UI

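Start the MLflow UI from the directory that contains `mlflow.db`, pointing it at the same SQLite store the notebook writes to:

```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5001
```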

Then open http://localhost:5001 in your browser.

**What you'll see:**
- **Traces tab** - All agent executions with timing and status
- **Trace Details** - Span hierarchy, inputs/outputs for each step
- **Assessments panel** - Agent-as-a-Judge evaluation results (rating + rationale)