# Hands-On Lab: Building Agent Systems with Databricks

## Part 2 - Agent Evaluation
Now that we've created an agent, how do we evaluate its performance?
For the second part, we're going to create a product support agent so we can focus on evaluation.
This agent will use a retrieval-augmented generation (RAG) approach: it retrieves relevant product documentation and uses it to answer questions about products.

### 2.1 Define our new Agent and retriever tool
- [**agent.py**]($./agent.py): An example agent has already been configured. We'll start by exploring this file to understand its building blocks.
- **Vector Search**: We've created a Vector Search endpoint that can be queried to find related documentation about a specific product.
- **Create Retriever Function**: Define some properties about our retriever and package it so it can be called by our LLM.
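
To make "package it so it can be called by our LLM" concrete, here is a minimal sketch of a retriever wrapped as a tool. It is not the Vector Search retriever used in `agent.py`: it assumes a tiny in-memory corpus and plain keyword overlap in place of vector similarity, and the names `RetrieverTool`, `keyword_search`, and `DOCS` are hypothetical. The point is only the shape of a tool: a name, a description the LLM reads to decide when to call it, and a function that returns matching documents.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory corpus standing in for the indexed product docs.
DOCS = {
    "aria": "The Aria Modern Bookshelf is available in natural oak, black, and white finishes.",
    "aurora": "Clean the Aurora Oak Coffee Table with a soft, slightly damp cloth and avoid abrasive cleaners.",
    "blendmaster": "Rinse the jar of the BlendMaster Elite 4000 with warm water after each use.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, num_results: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    scored = []
    for doc_id, text in DOCS.items():
        overlap = len(query_words & tokenize(text))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; break ties by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [DOCS[doc_id] for _, doc_id in scored[:num_results]]


@dataclass
class RetrieverTool:
    """Hypothetical tool wrapper: metadata for the LLM plus the callable."""
    name: str
    description: str
    func: Callable[[str], list[str]]


product_docs_retriever = RetrieverTool(
    name="product_docs_retriever",
    description="Look up product documentation relevant to a question.",
    func=keyword_search,
)

results = product_docs_retriever.func(
    "What color options are available for the Aria Modern Bookshelf?"
)
print(results)
```

In the lab, the retrieval step is handled by the Vector Search endpoint rather than keyword matching, but the tool metadata plays the same role.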

### 2.2 Create Evaluation Dataset
- We've provided an example evaluation dataset - though you can also generate this [synthetically](https://www.databricks.com/blog/streamline-ai-agent-evaluation-with-new-synthetic-data-capabilities).

### 2.3 Run `mlflow.genai.evaluate()`
- MLflow will take your evaluation dataset and test your agent's responses against it
- LLM judges will score the outputs, and MLflow collects the results in a UI for review
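
As a rough illustration of what "scoring the outputs" produces, the sketch below aggregates per-question judge verdicts into a pass rate per scorer. The `verdicts` rows and the `pass_rates` helper are hypothetical and the values are made up for illustration; MLflow computes and displays comparable aggregates for you, so this only shows the idea.

```python
from collections import defaultdict

# Hypothetical judge verdicts: one row per (question, scorer) pair.
verdicts = [
    {"request": "q1", "scorer": "completeness_and_accuracy", "passed": True},
    {"request": "q2", "scorer": "completeness_and_accuracy", "passed": False},
    {"request": "q1", "scorer": "relevance_and_structure", "passed": True},
    {"request": "q2", "scorer": "relevance_and_structure", "passed": True},
]


def pass_rates(rows: list[dict]) -> dict[str, float]:
    """Return the fraction of passing verdicts for each scorer."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in rows:
        totals[row["scorer"]] += 1
        passes[row["scorer"]] += int(row["passed"])
    return {name: passes[name] / totals[name] for name in totals}


rates = pass_rates(verdicts)
print(rates)
```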

### 2.4 Make Needed Improvements and re-run Evaluations
- Take feedback from our evaluation run and adjust the agent (for example, its prompt or retrieval settings)
- Run evals again and see the improvement!
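
One simple way to "see the improvement" is to compare per-scorer pass rates between two evaluation runs. The sketch below assumes you have already extracted those rates into plain dictionaries; the helper `compare_pass_rates` and the numbers are hypothetical placeholders, not results from this lab.

```python
def compare_pass_rates(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline pass rate for scorers present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 3)
        for name in baseline
        if name in candidate
    }


# Placeholder pass rates for illustration only.
baseline_rates = {"relevance_and_structure": 0.4, "product_specificity": 0.8}
candidate_rates = {"relevance_and_structure": 0.8, "product_specificity": 0.8}

deltas = compare_pass_rates(baseline_rates, candidate_rates)
regressions = [name for name, delta in deltas.items() if delta < 0]
print(deltas, regressions)
```

A positive delta means the scorer passes more often after the change; checking `regressions` guards against a fix for one guideline quietly hurting another.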

In [0]:
%pip install -U -qqqq mlflow-skinny[databricks] langgraph==0.3.4 databricks-langchain databricks-agents uv
dbutils.library.restartPython()

In [0]:
%load_ext autoreload
%autoreload 2

In [0]:
import sys
sys.path.append("..")

from workshop_config import *

In [0]:
from agent import AGENT

AGENT.predict({"messages": [{"role": "user", "content": "Can you give me troubleshooting tips for my Soundwave X5 Pro Headphones?"}]})

### Log the `agent` as an MLflow model
Log the agent as code from the [agent.py]($./agent.py) file. See [MLflow - Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code).

In [0]:
# Determine Databricks resources to specify for automatic auth passthrough at deployment time
import mlflow
from agent import tools, LLM_ENDPOINT_NAME
from databricks_langchain import VectorSearchRetrieverTool
from mlflow.models.resources import DatabricksFunction, DatabricksServingEndpoint
from unitycatalog.ai.langchain.toolkit import UnityCatalogTool

resources = [DatabricksServingEndpoint(endpoint_name=LLM_ENDPOINT_NAME)]
for tool in tools:
    if isinstance(tool, VectorSearchRetrieverTool):
        resources.extend(tool.resources)
    elif isinstance(tool, UnityCatalogTool):
        resources.append(DatabricksFunction(function_name=tool.uc_function_name))

input_example = {
    "messages": [
        {
            "role": "user",
            "content": "What color options are available for the Aria Modern Bookshelf?"
        }
    ]
}

with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",
        input_example=input_example,
        resources=resources,
        extra_pip_requirements=[
            "databricks-connect"
        ]
    )

In [0]:
# Load the model and create a prediction function
logged_model_uri = f"runs:/{logged_agent_info.run_id}/agent"
loaded_model = mlflow.pyfunc.load_model(logged_model_uri)

def predict_wrapper(query):
    # Format for chat-style models
    model_input = {
        "messages": [{"role": "user", "content": query}]
    }
    response = loaded_model.predict(model_input)
    
    messages = response['messages']
    return messages[-1]['content']

## Evaluate the agent with [Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html)

You can edit the requests or expected facts in your evaluation dataset and re-run the evaluation as you iterate on your agent. MLflow tracks the computed quality metrics for each run.
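
Because the dataset is edited by hand, a quick sanity check before each evaluation run can catch mistakes early. The sketch below assumes the column-oriented layout used in this notebook (a `request` list and a parallel `expected_facts` list); the `validate_eval_data` helper is hypothetical and not part of MLflow.

```python
def validate_eval_data(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means the data looks usable."""
    problems = []
    requests = data.get("request", [])
    facts = data.get("expected_facts", [])
    if len(requests) != len(facts):
        problems.append(
            f"{len(requests)} requests but {len(facts)} expected_facts entries"
        )
    seen = set()
    for i, (request, fact_list) in enumerate(zip(requests, facts)):
        if not request.strip():
            problems.append(f"row {i}: empty request")
        if request in seen:
            problems.append(f"row {i}: duplicate request")
        seen.add(request)
        if not fact_list:
            problems.append(f"row {i}: no expected facts")
    return problems


sample = {
    "request": [
        "How many colors is the Flexi-Comfort Office Desk available in?",
        "How many colors is the Flexi-Comfort Office Desk available in?",
    ],
    "expected_facts": [
        ["The Flexi-Comfort Office Desk is available in three colors."],
        [],
    ],
}

issues = validate_eval_data(sample)
print(issues)
```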

In [0]:
import pandas as pd

data = {
    "request": [
        "What color options are available for the Aria Modern Bookshelf?",
        "How should I clean the Aurora Oak Coffee Table to avoid damaging it?",
        "How should I clean the BlendMaster Elite 4000 after each use?",
        "How many colors is the Flexi-Comfort Office Desk available in?",
        "What sizes are available for the StormShield Pro Men's Weatherproof Jacket?"
    ],
    "expected_facts": [
        [
            "The Aria Modern Bookshelf is available in natural oak finish",
            "The Aria Modern Bookshelf is available in black finish",
            "The Aria Modern Bookshelf is available in white finish"
        ],
        [
            "Use a soft, slightly damp cloth for cleaning.",
            "Avoid using abrasive cleaners."
        ],
        [
            "The jar of the BlendMaster Elite 4000 should be rinsed.",
            "Rinse with warm water.",
            "The cleaning should take place after each use."
        ],
        [
            "The Flexi-Comfort Office Desk is available in three colors."
        ],
        [
            "The available sizes for the StormShield Pro Men's Weatherproof Jacket are Small, Medium, Large, XL, and XXL."
        ]
    ]
}

eval_dataset = pd.DataFrame(data)

In [0]:
from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

eval_data = []
for request, facts in zip(data["request"], data["expected_facts"]):
    eval_data.append({
        "inputs": {
            "query": request  # Key must match the `query` parameter of predict_wrapper
        },
        "expected_response": "\n".join(facts)
    })

# Define custom scorers tailored to product information evaluation.
# Each Guidelines entry is a natural-language rubric that an LLM judge applies to every response.
scorers = [
    Guidelines(
        guidelines="""Response must include ALL expected facts:
        - Lists ALL colors/sizes if relevant (not partial lists)
        - States EXACT specs if relevant (e.g., "5 ATM" not "water resistant")
        - Includes ALL cleaning steps if asked
        Fails if ANY fact is missing or wrong.""",
        name="completeness_and_accuracy",
    ),
    Guidelines(
        guidelines="""Response must be clear and direct:
        - Answers the exact question asked
        - Uses lists for options, steps for instructions
        - No marketing fluff or extra background
        - Concise but complete.""",
        name="relevance_and_structure",
    ),
    Guidelines(
        guidelines="""Response must stay on-topic:
        - ONLY the product asked about
        - NO made-up features or colors
        - NO generic advice
        - Uses exact product name from request.""",
        name="product_specificity",
    ),
]

In [0]:
print("Running evaluation...")
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_wrapper, 
        scorers=scorers,
    )

## Let's go back to the [agent.py]($./agent.py) file and change our prompt to reduce marketing fluff

## Register the model to Unity Catalog

In [0]:
from databricks.sdk import WorkspaceClient
import os

UC_MODEL_NAME = f"{WORKSHOP_CATALOG}.{USER_SCHEMA}.product_agent"

# register the model to UC
uc_registered_model_info = mlflow.register_model(model_uri=logged_agent_info.model_uri, name=UC_MODEL_NAME)

In [0]:
from IPython.display import display, HTML

# Retrieve the Databricks host URL
workspace_url = spark.conf.get('spark.databricks.workspaceUrl')

# Create HTML link to created agent
html_link = f'<a href="https://{workspace_url}/explore/data/models/{WORKSHOP_CATALOG}/{USER_SCHEMA}/product_agent" target="_blank">Go to Unity Catalog to see Registered Agent</a>'
display(HTML(html_link))

## Deploy the agent

##### Note: Deployment is disabled for lab users, but this code will work in your own workspace

In [0]:
from databricks import agents

# Deploy the model to the review app and a model serving endpoint

# Disabled for the lab environment but we've deployed the agent already!

# agents.deploy(
#     UC_MODEL_NAME,
#     uc_registered_model_info.version,
#     environment_vars={
#         "WORKSHOP_CATALOG": WORKSHOP_CATALOG,
#         "WORKSHOP_SCHEMA": WORKSHOP_SCHEMA,
#         "USER_SCHEMA": USER_SCHEMA,
#     },
#     tags={"endpointSource": "Agent Lab"},
# )