<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# **Arize Agent Mastery Course: Evaluating Your Agent**

So far, we have built our agent, added tooling, and implemented a RAG system that gives it access to information. Now we are ready to run the agent and evaluate its outputs. Evaluations can be performed by an LLM judge, by code, or by human annotators, and can be applied at different scopes: span, trace, or session.

In this lab, we will demonstrate how to run evaluations in code and log the results to the Arize UI. We will also show you how to set up and run evaluations directly within the Arize UI.
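
As a small illustration of a code evaluation, the sketch below checks a response against the "under 1000 words" limit that the agent instructions set later in this lab. The function name `word_limit_eval` and the `label`/`score`/`explanation` result shape are illustrative assumptions, not part of the Arize SDK.

```python
def word_limit_eval(output: str, limit: int = 1000) -> dict:
    """Hypothetical code evaluator: pass if the response has fewer than `limit` words."""
    word_count = len(output.split())
    passed = word_count < limit
    return {
        "label": "pass" if passed else "fail",
        "score": 1.0 if passed else 0.0,
        "explanation": f"Response has {word_count} words (limit {limit}).",
    }

print(word_limit_eval("Day 1: explore the old town and the riverside market."))
```

A check like this is deterministic and cheap, which makes it a useful complement to the LLM judge used in the rest of the lab.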



# Set Up

In [None]:
!pip install -qqq arize-otel arize agno openai openinference-instrumentation-agno openinference-instrumentation-openai httpx chromadb sentence-transformers arize-phoenix

In [None]:
import os
from getpass import getpass
from google.colab import userdata

os.environ["ARIZE_SPACE_ID"] = userdata.get("ARIZE_SPACE_ID") or getpass("🔑 Enter your Arize Space ID: ")

os.environ["ARIZE_API_KEY"] = userdata.get("ARIZE_API_KEY") or getpass("🔑 Enter your Arize API Key: ")

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY") or getpass("🔑 Enter your OpenAI API Key: ")

os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY") or getpass("🔑 Enter your Tavily API Key: ")

Note that we are tracing our agent outputs to a different project from previous labs here:

In [None]:
from arize.otel import register, Transport
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.agno import AgnoInstrumentor

model_id = "evaluate-travel-agent"
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name=model_id,
    set_global_tracer_provider=True,
    log_to_console=True,
    endpoint="https://otlp.ca-central-1a.arize.com/v1/traces",
    transport=Transport.HTTP
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
AgnoInstrumentor().instrument(tracer_provider=tracer_provider)

# Define Tools

In [None]:
# --- Helper functions for tools ---
import httpx
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.chain(name="search-api")
def _search_api(query: str) -> str | None:
    """Try Tavily search first, fall back to None."""
    tavily_key = os.getenv("TAVILY_API_KEY")
    if not tavily_key:
        return None
    try:
        resp = httpx.post(
            "https://api.tavily.com/search",
            json={
                "api_key": tavily_key,
                "query": query,
                "max_results": 3,
                "search_depth": "basic",
                "include_answer": True,
            },
            timeout=8,
        )
        data = resp.json()
        answer = data.get("answer") or ""
        snippets = [r.get("content", "") for r in data.get("results", [])]
        combined = " ".join([answer] + snippets).strip()
        return combined[:400] if combined else None
    except Exception:
        return None

def _compact(text: str, limit: int = 200) -> str:
    """Compact text for cleaner outputs."""
    cleaned = " ".join(text.split())
    return cleaned if len(cleaned) <= limit else cleaned[:limit].rsplit(" ", 1)[0]


In [None]:
# --- APIs for Essential Info Tool ---
import httpx
from urllib.parse import quote
from typing import Optional

@tracer.chain(name="wiki-summary-api")
def _wiki_summary(dest: str) -> str:
    if not dest:
        return ""
    encoded_dest = quote(dest)

    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{encoded_dest}"
    HEADERS = { 'User-Agent': 'MyArizeApp/1.0 (ExampleContac@example.com)'}

    try:
        r = httpx.get(url, headers = HEADERS, timeout=5)
        r.raise_for_status()

        data = r.json().get("extract")
        return data if data else ""

    except Exception:
        # HTTP errors (including 404), network failures, and malformed payloads all yield an empty summary.
        return ""

@tracer.chain(name="weather-api")
def _weather(dest: str) -> str:
    # Geocode the destination; `params` URL-encodes names such as "New York".
    g = httpx.get("https://geocoding-api.open-meteo.com/v1/search", params={"name": dest}, timeout=5)
    results = g.json().get("results") if g.status_code == 200 else None
    if not results:
        return ""
    lat, lon = results[0]["latitude"], results[0]["longitude"]
    # Fetch the current weather for the first geocoding match.
    w = httpx.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": lat, "longitude": lon, "current_weather": "true"},
        timeout=5,
    ).json()
    cw = w.get("current_weather", {})
    return f"Weather now: {cw.get('temperature')}°C, wind {cw.get('windspeed')} km/h."

In [None]:
from agno.tools import tool

@tool
def essential_info(destination: str) -> str:
    """Get essential info (summary and weather) using APIs"""
    parts = []
    wiki = _wiki_summary(destination)
    if wiki: parts.append(wiki)
    weather = _weather(destination)
    if weather: parts.append(weather)
    return f"{destination} essentials:\n" + "\n".join(parts)

@tool
def budget_basics(destination: str, duration: str) -> str:
    """Summarize travel cost categories."""
    q = f"{destination} travel budget average daily costs {duration}"
    s = _search_api(q)
    if s:
        return f"{destination} budget ({duration}): {_compact(s)}"
    return f"Budget for {duration} in {destination} depends on lodging, meals, transport, and attractions."

# Create RAG System for Local Flavor Tool

In [None]:
import chromadb
from sentence_transformers import SentenceTransformer

chroma_client = chromadb.Client()
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create collection for local guides
collection = chroma_client.create_collection(
    name="local_guides",
    metadata={"hnsw:space": "cosine"}
)

print("✅ RAG system initialized with ChromaDB and sentence-transformers")

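Upload the `local_flavor.json` file. Note that the loading cell below opens `local_guides.json`, so rename the file or update the path if the names differ.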

In [None]:
from google.colab import files
guide = files.upload()

In [None]:
import json

def load_and_index_guides():

    with open('local_guides.json', 'r') as f:
        guides = json.load(f)

    # Prepare data for ChromaDB
    documents = []
    metadatas = []
    ids = []

    for i, guide in enumerate(guides):
        # Create a rich text representation for embedding
        text = f"City: {guide['city']}. Interests: {', '.join(guide['interests'])}. Experience: {guide['description']}"

        documents.append(text)
        metadatas.append({
          "city": guide["city"],
          "interests": ", ".join(guide["interests"]),
          "source": guide["source"],
          "description": guide["description"]
        })
        ids.append(f"guide_{i}")

    # Add to ChromaDB collection
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )

    print(f"✅ Indexed {len(documents)} experiences in vector database")
    return len(documents)

# Load the data
num_guides = load_and_index_guides()
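
The loader above reads four fields from each record: `city`, `interests` (a list), `source`, and `description`. The sketch below shows one record in that shape and the text representation that gets embedded for it. The sample values are invented for illustration, and `to_embedding_text` is a hypothetical helper that mirrors the f-string in the indexing cell.

```python
import json

# Hypothetical sample record; field names mirror what the loader reads,
# but the values are made up for this example.
sample_guides = [
    {
        "city": "Lisbon",
        "interests": ["food", "music"],
        "source": "example-source",
        "description": "An evening of fado in a small neighborhood tavern.",
    }
]

def to_embedding_text(guide: dict) -> str:
    """Build the same text representation the indexing cell stores as the document."""
    return (
        f"City: {guide['city']}. "
        f"Interests: {', '.join(guide['interests'])}. "
        f"Experience: {guide['description']}"
    )

print(json.dumps(sample_guides, indent=2))
print(to_embedding_text(sample_guides[0]))
```

Putting the city and interests into the embedded text means a query such as "Lisbon music authentic experiences" can match on those words as well as on the description.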


In [None]:
from sentence_transformers import SentenceTransformer
from openinference.semconv.trace import SpanAttributes, DocumentAttributes

# Initialize embedding model (same one you used for indexing)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

@tool
def local_flavor(destination: str, interests: str = "local culture") -> str:
    """Suggest authentic local experiences using vector retrieval from Chroma."""
    with tracer.start_as_current_span(name="RAG", attributes={SpanAttributes.OPENINFERENCE_SPAN_KIND: "RETRIEVER"}) as span:
      # Construct the query text
      query_text = f"{destination} {interests} authentic experiences"
      span.set_attribute(SpanAttributes.INPUT_VALUE, query_text)

      # Embed the query
      query_embedding = embedding_model.encode([query_text]).tolist()

      # Search in Chroma collection
      results = collection.query(
          query_embeddings=query_embedding,
          n_results=3  # how many guides to retrieve
      )

      if not results or not results.get("documents") or not results["documents"][0]:
          return f"Explore {destination}'s unique {interests} through markets, neighborhoods, and local eateries."

      # Extract retrieved guides
      retrieved_docs = results["documents"][0]
      retrieved_meta = results["metadatas"][0]
      for i, doc in enumerate(retrieved_docs):
        span.set_attribute(f"retrieval.documents.{i}.document.id", f"doc_{i}")
        span.set_attribute(f"retrieval.documents.{i}.document.content", doc)

      # Format a summary
      suggestions = []
      for doc, meta in zip(retrieved_docs, retrieved_meta):
          suggestion = f"📍 **{meta['city']}** - {meta['description']} (Interests: {meta['interests']})"
          suggestions.append(suggestion)

      response = f"Here are some authentic {interests} experiences near {destination}:\n\n" + "\n\n".join(suggestions)
      span.set_attribute(SpanAttributes.OUTPUT_VALUE, response)

      return response
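
The tool above records each retrieved document on the span using flattened, indexed attribute keys. As a rough sketch of that naming scheme, the hypothetical helper `retrieval_attributes` below builds the same keys as a plain dictionary, without needing a tracer.

```python
def retrieval_attributes(docs: list[str]) -> dict[str, str]:
    """Build the flattened attribute keys the tool sets for each retrieved document."""
    attrs: dict[str, str] = {}
    for i, doc in enumerate(docs):
        attrs[f"retrieval.documents.{i}.document.id"] = f"doc_{i}"
        attrs[f"retrieval.documents.{i}.document.content"] = doc
    return attrs

for key, value in retrieval_attributes(["Guide A text", "Guide B text"]).items():
    print(key, "=", value)
```

These per-document attributes are what the span-level relevancy evaluation later in this lab reads as the retrieved documents.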


# Define Agent

In [None]:
from agno.agent import Agent
from agno.models.openai import OpenAIChat

# --- Main Agent ---
trip_agent = Agent(
    name="TripPlanner",
    role="AI Travel Assistant",
    model=OpenAIChat(id="gpt-4.1"),
    instructions=(
        "You are a friendly and knowledgeable travel planner. "
        "Combine multiple tools to create a trip plan including essentials, budget, and local flavor. "
        "Keep the tone natural, clear, and under 1000 words."
    ),
    markdown=True,
    tools=[essential_info, budget_basics, local_flavor],
)

# Evaluate the Agent

The first step before evaluating our agent is to generate multiple runs using different query types. This way, our evaluation will cover many different cases. Running the two cells below will send requests to the agent and create traces that will appear in Arize. Unlike previous labs, these traces will be logged under a new project titled “**evaluate-travel-agent**”.

In [None]:
queries = [
    "Plan a 5-day trip to Dubai. Focus on history, wellness. Include essential info, budget breakdown, and local experiences.",
    "Plan a 6-day trip to Dubai. Focus on art, heritage sites, and sustainability. Include recommendations for cultural districts and green hotels.",
    "Plan a 4-day trip to Bangkok. Focus on history, floating markets, and photography spots. Include essential travel info.",
    "Plan a 6-day trip to Bangkok. Focus on art, hidden cafés, and authentic experiences. Include budget options and local insights.",
    "Plan a 5-day trip to Prague. Focus on history, beer culture, and architecture. Include daily breakdown and cultural tips.",
    "Plan a 3-day trip to Prague. Focus on castles, local cuisine, and romantic spots. Include estimated costs and best walking routes.",
    "Plan a 3-day trip to Barcelona. Focus on food tours and Gaudí landmarks. Include costs and top attractions.",
    "Plan a 4-day trip to Barcelona. Focus on wellness, yoga, and beach relaxation. Include daily schedule and spa recommendations.",
    "Plan a 5-day trip to Tokyo. Focus on history, modern tech, and wellness. Include itinerary and budget details.",
    "Plan a 6-day trip to Tokyo. Focus on innovation, culture, and hidden gems. Include budget summary and cultural etiquette.",
    "Plan a 3-day trip to Rome. Focus on ancient ruins, espresso culture, and walking tours. Include itinerary and budget guide.",
    "Plan a 6-day trip to Rome. Focus on spirituality, history, and Italian cuisine. Include detailed breakdown and safety advice.",
    "Plan a 3-day trip to Lisbon. Focus on tram rides, fado music, and street art. Include daily plan and estimated budget.",
    "Plan a 6-day trip to Lisbon. Focus on cuisine, culture, and nightlife. Include recommendations for authentic spots.",
    "Plan a 5-day trip to New York. Focus on museums, food, and nightlife. Include itinerary and average daily costs.",
    "Plan a 6-day trip to New York. Focus on photography, cuisine, and local markets. Include safety tips and budget advice.",
    "Plan a 5-day trip to Marrakech. Focus on history, wellness, and local crafts. Include essential info, budget, and itinerary.",
    "Plan a 3-day trip to Marrakech. Focus on souks, cuisine, and architecture. Include costs and cultural insights."
]

In [None]:
for q in queries:
  response = trip_agent.run(q)

# Span-Level Evaluation via Arize Python SDK

Arize supports evaluations at multiple levels of granularity. You can evaluate individual steps in an agent’s run (spans) or the full workflow (trace).

Here, we’ll perform span-level evaluations on retrieval steps to measure how relevant the retrieved documents are to each query.

First, navigate to your project and click "Export to Notebook". From here, copy the `export_model_to_df` function in the code snippet to export your traces.
![Export Traces](https://storage.googleapis.com/arize-phoenix-assets/assets/images/export-traces-arize.png)

In [None]:
from arize.exporter import ArizeExportClient
from datetime import datetime
from arize.utils.types import Environments

client = ArizeExportClient()
print('#### Exporting your primary dataset into a dataframe.')

# INSERT COPY AND PASTED FUNCTION HERE
primary_df = client.export_model_to_df(
    space_id='U3BhY2U6MTUzMDU6dDJJWg==',
    model_id='evaluate-travel-agent',
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat('2025-11-18T08:00:00.000+00:00'),
    end_time=datetime.fromisoformat('2025-11-26T07:59:59.999+00:00'),
)
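
The `start_time` and `end_time` values in the export call are fixed dates, so they need to cover the period in which your traces were generated. As one possible alternative, the sketch below computes a rolling window ending now. The helper name `export_window` and the 7-day default are assumptions for illustration; the resulting values could be passed as `start_time` and `end_time` in the export call.

```python
from datetime import datetime, timedelta, timezone

def export_window(days_back: int = 7) -> tuple[datetime, datetime]:
    """Return a timezone-aware (start_time, end_time) pair in UTC, ending now."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days_back)
    return start_time, end_time

start_time, end_time = export_window(days_back=7)
print(start_time.isoformat(), end_time.isoformat())
```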


Next, we define the prompt template for our LLM Judge. Feel free to customize this!

In [None]:
RAG_RELEVANCY_PROMPT_TEMPLATE = """
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {{input}}
    ************
    [Reference text]: {{documents}}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can help answer the Question. First, write out in a step by step manner
an EXPLANATION to show how to arrive at the correct answer. Avoid simply stating the correct answer
at the outset. Your response LABEL must be a single word, either "relevant" or "unrelated", and
should not contain any text or characters aside from that word. "unrelated" means that the
reference text does not help answer the Question. "relevant" means the reference text directly
answers the question.

Example response:
LABEL: "relevant" or "unrelated"
************
"""

Then, we grab the relevant columns from our spans dataframe and rename them to match the variables in the LLM Judge prompt.
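
To see why the column names matter, the sketch below fills `{{input}}` and `{{documents}}` placeholders by plain string replacement, so you can preview what the judge would read for one row. This is only a preview aid under that assumption: the evaluation library does its own template rendering, and `preview_prompt`, `demo_template`, and the sample values are hypothetical.

```python
def preview_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill `{{name}}` placeholders by plain string replacement (preview only)."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace("{{" + name + "}}", value)
    return rendered

# Shortened stand-in for the judge template, using the same placeholder names.
demo_template = "[Question]: {{input}}\n[Reference text]: {{documents}}"

sample_row = {
    "input": "Lisbon street art authentic experiences",
    "documents": "City: Lisbon. Interests: art. Experience: A guided mural walk.",
}
print(preview_prompt(demo_template, sample_row))
```

If a dataframe column is missing or named differently from a placeholder, that placeholder is left unfilled in the preview, which is an easy way to catch naming mismatches before running the evaluation.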

In [None]:
spans_df = primary_df[
    [
        "name",
        "context.span_id",
        "attributes.openinference.span.kind",
        "context.trace_id",
        "attributes.input.value",
        "attributes.retrieval.documents",
    ]
]

In [None]:
filtered_df = spans_df[
    (spans_df["attributes.openinference.span.kind"] == "RETRIEVER")
    & (spans_df["attributes.retrieval.documents"].notnull())
]

filtered_df = filtered_df.rename(
    columns={"attributes.input.value": "input", "attributes.retrieval.documents": "documents"}
)

filtered_df

Finally, we define our evaluators and run the evaluation. When the evaluation is done running, we log the results back to Arize.

In [None]:
from openinference.instrumentation import suppress_tracing
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals import create_classifier

llm = LLM(provider="openai", model="gpt-5")

relevancy_evaluator = create_classifier(
    name="RAG Relevancy",
    llm=llm,
    prompt_template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    choices={"relevant": 1.0, "unrelated": 0.0},
)

with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=filtered_df,
        evaluators=[relevancy_evaluator],
    )
results_df.head()
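
Once the labels are available, a quick aggregate helps judge overall retrieval quality. The sketch below summarizes a plain list of labels using the same label-to-score mapping passed to the classifier. It makes no assumption about the column layout of `results_df`; `summarize_labels` and the sample labels are hypothetical.

```python
from collections import Counter

# Same label-to-score mapping passed as `choices` to the classifier.
CHOICES = {"relevant": 1.0, "unrelated": 0.0}

def summarize_labels(labels: list[str]) -> dict:
    """Count each label and average the scores of the labels found in CHOICES."""
    counts = Counter(labels)
    scored = [CHOICES[label] for label in labels if label in CHOICES]
    mean_score = sum(scored) / len(scored) if scored else None
    return {"counts": dict(counts), "mean_score": mean_score}

print(summarize_labels(["relevant", "relevant", "unrelated", "relevant"]))
```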

In [None]:
from arize.pandas.logger import Client
from phoenix.evals.utils import to_annotation_dataframe

client = Client()

rag_eval_df = to_annotation_dataframe(results_df)
rag_eval_df = rag_eval_df.rename(columns={
    "label": "eval.rag.label",
    "score": "eval.rag.score",
    "explanation": "eval.rag.explanation",
    "metadata": "eval.rag.metadata"
})

client.log_evaluations_sync(rag_eval_df, 'evaluate-travel-agent')
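
The rename in the cell above follows an `eval.<name>.<field>` pattern, with `rag` as the evaluation name. If you add another evaluator, a small helper can generate the same mapping for a different name. This is a sketch: `eval_columns` is a hypothetical helper, and the field list is taken from the rename above.

```python
def eval_columns(eval_name: str) -> dict[str, str]:
    """Map generic annotation columns to `eval.<name>.<field>` column names."""
    fields = ("label", "score", "explanation", "metadata")
    return {field: f"eval.{eval_name}.{field}" for field in fields}

print(eval_columns("rag"))
```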

Click on the retriever spans within each trace to view detailed evaluation results. You can also filter by evaluation outcome to quickly identify which queries successfully retrieved the most relevant documents.

![Eval Result](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-course-lab6-1.png)

# Trace-Level Evaluation in the Arize UI

In this section, we walk through how to set up and run evaluations in the Arize UI. Specifically, we run a trace-level evaluation to assess the answer quality of our agent.

<video width="940" height="680" controls>
  <source src="https://storage.googleapis.com/arize-phoenix-assets/assets/videos/trace-level-evals-course.mp4" type="video/mp4">
</video>

1. In the project containing your traces, go to Eval Tasks and select LLM as a Judge.

2. Name your task and schedule it to run on historical data. Each task can include multiple evaluators, but this walkthrough focuses on setting up one.

3. Choose a trace-level evaluation.

4. From the predefined templates, select Q&A or another template of your choice. You can also create a custom evaluation. If you define your own, ensure the variables align with your trace structure and specify the output labels (rails).

5. Click Create Evals. Your evaluations will begin running and will appear on your existing traces. Look for the eval result on the top span for each trace.
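
To illustrate what output labels (rails) do conceptually, the sketch below maps a judge's free-text output onto exactly one allowed label, and falls back to a default when no label or more than one label appears. This is not how the Arize UI is implemented; `snap_to_rails`, the `NOT_PARSABLE` default, and the example rails are assumptions for illustration.

```python
import re

def snap_to_rails(raw_output: str, rails: list[str], default: str = "NOT_PARSABLE") -> str:
    """Return the single rail found in the output as a whole word, else `default`."""
    text = raw_output.lower()
    matches = [
        rail for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    ]
    return matches[0] if len(matches) == 1 else default

rails = ["correct", "incorrect"]
print(snap_to_rails("LABEL: incorrect", rails))
print(snap_to_rails("I am not sure.", rails))
```

Matching on whole words matters here because one rail ("correct") is a substring of the other ("incorrect").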

![Trace Level Eval](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace-level-evals-ui-course.png)