# AgentCore Online Evaluation with Actor Simulator - Web Search Agent

**Pipeline:** AgentCore Online Eval Setup â†’ DatasetGenerator â†’ ActorSimulator â†’ Agent Invocation â†’ AgentCore Evaluates via CloudWatch

This notebook:
1. Loads configuration from `eval_config.py`
2. Creates a custom LLM-as-a-Judge evaluator for web search quality
3. Sets up AgentCore online evaluation with builtin + custom metrics
4. Generates test cases using DatasetGenerator for web search scenarios
5. Runs actor simulator to invoke agent with multi-turn conversations
6. AgentCore automatically captures and evaluates traces via CloudWatch

## 1. Imports and Configuration

In [None]:
import boto3
import json
import os
import uuid
from strands_evals import ActorSimulator, Case
from strands_evals.generators import DatasetGenerator
from lg_eval_config import *

os.environ['AWS_DEFAULT_REGION'] = AWS_REGION
print("Configuration loaded from eval_config.py")

Configuration loaded from eval_config.py


## 2. Create Custom Evaluator (LLM-as-a-Judge)

Create a custom evaluator to assess web search quality, including:
- Search result relevance
- Information synthesis quality
- Source attribution
- Information freshness

In [None]:
evaluation_client = boto3.client(
    'agentcore-evaluation-controlplane',
    region_name=AWS_REGION,
)

# Create custom evaluator
try:
    custom_evaluator_response = evaluation_client.create_evaluator(
        evaluatorName=CUSTOM_EVALUATOR_NAME,
        level="TRACE",
        evaluatorConfig=CUSTOM_EVALUATOR_CONFIG
    )
    print(f"âœ“ Created custom evaluator: {CUSTOM_EVALUATOR_NAME}")
    print(json.dumps(custom_evaluator_response, indent=2, default=str))
    custom_evaluator_id = custom_evaluator_response['evaluatorId']
except Exception as e:
    if 'ResourceConflictException' in str(e) or 'already exists' in str(e):
        print(f"âš  Custom evaluator '{CUSTOM_EVALUATOR_NAME}' already exists, using existing one")
        # List evaluators to get the existing ID
        list_response = evaluation_client.list_evaluators()
        for evaluator in list_response.get('evaluators', []):
            if evaluator['evaluatorName'] == CUSTOM_EVALUATOR_NAME:
                custom_evaluator_id = evaluator['evaluatorId']
                print(f"âœ“ Found existing evaluator ID: {custom_evaluator_id}")
                break
    else:
        print(f"âœ— Error creating custom evaluator: {e}")
        raise

create_config_response = evaluation_client.create_online_evaluation_config(
    onlineEvaluationConfigName="custom_web_search_quality_evaluator",
    description="Integration test config",
    rule={
        "samplingConfig": {"samplingPercentage": 100.0}
    },
    dataSourceConfig={
        "cloudWatchLogs": {
            "logGroupNames": [LOG_GROUP_NAME],
            "serviceNames": [SERVICE_NAME]
        }
    },
    evaluators=[{"evaluatorId":custom_evaluator_id}],
    evaluationExecutionRoleArn=EVALUATION_ROLE_ARN,
    enableOnCreate=True
)

## 3. AgentCore Online Evaluation Setup

Configure online evaluation with both builtin and custom evaluators.

In [None]:
# Combine builtin evaluators with custom evaluator
all_evaluators = [{"evaluatorId": evaluator_id} for evaluator_id in EVALUATORS]

print(f"Total evaluators: {len(all_evaluators)} (13 builtin + 1 custom)")

create_config_response = evaluation_client.create_online_evaluation_config(
    onlineEvaluationConfigName=EVAL_CONFIG_NAME,
    description=EVAL_DESCRIPTION,
    rule={
        "samplingConfig": {"samplingPercentage": SAMPLING_PERCENTAGE},
        "sessionConfig": {"sessionTimeoutMinutes": SESSION_TIMEOUT_MINUTES}
    },
    dataSourceConfig={
        "cloudWatchLogs": {
            "logGroupNames": [LOG_GROUP_NAME],
            "serviceNames": [SERVICE_NAME]
        }
    },
    evaluators=all_evaluators,
    evaluationExecutionRoleArn=EVALUATION_ROLE_ARN,
    enableOnCreate=True
)

config_id = create_config_response['onlineEvaluationConfigId']
config_details = evaluation_client.get_online_evaluation_config(onlineEvaluationConfigId=config_id)

print(f"\nâœ“ Created config: {config_id}")
print(f"Status: {config_details['status']}")

Total evaluators: 10 (13 builtin + 1 custom)

âœ“ Created config: web_search_agent_online_eval-Kelx5TGHO1
Status: ACTIVE


## 4. AgentCore Runtime Client

In [5]:
agentcore_client = boto3.client('bedrock-agentcore', region_name=AWS_REGION)

def invoke_agentcore(user_message, session_id=None):
    """
    Invoke agent with session management.
    
    Args:
        user_message: The prompt to send to the agent
        session_id: Optional session ID for maintaining conversation context
    
    Returns:
        Response text from the agent
    """
    # Build the request parameters
    request_params = {
        'agentRuntimeArn': AGENT_ARN,
        'qualifier': QUALIFIER,
        'payload': json.dumps({"prompt": user_message})
    }
    
    # Add session_id if provided
    if session_id is not None:
        request_params['runtimeSessionId'] = session_id
    
    boto3_response = agentcore_client.invoke_agent_runtime(**request_params)
    
    content = []
    if "text/event-stream" in boto3_response.get("contentType", ""):
        for line in boto3_response["response"].iter_lines(chunk_size=1):
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    line = line[6:]
                    content.append(line)
    else:
        events = []
        for event in boto3_response.get("response", []):
            events.append(event)
        if events:
            content.append(json.loads(events[0].decode("utf-8")))
    
    return "\n".join(str(c) for c in content)

## 5. Generate Test Cases

In [6]:
generator = DatasetGenerator[str, str](str, str)

task_description = f"""
Task: {AGENT_CAPABILITIES}
Limitations: {AGENT_LIMITATIONS}
Available tools: {', '.join(AGENT_TOOLS)}
Complexity: {AGENT_COMPLEXITY}
"""

dataset = await generator.from_scratch_async(
    topics=AGENT_TOPICS,
    task_description=task_description,
    num_cases=NUM_TEST_CASES
)

print(f"Generated {len(dataset.cases)} test cases")

Generated 10 test cases


## 6. Preview Test Cases

In [7]:
for i, case in enumerate(dataset.cases, 1):
    print(f"\nCase {i}: {case.input}")
    print(f"Expected: {case.expected_output}")


Case 1: What are the main tourist attractions and must-see places in San Francisco?
Expected: List of major San Francisco attractions (Golden Gate Bridge, Alcatraz, Fisherman's Wharf, etc.), brief descriptions, locations, operating hours, and official tourism website information.

Case 2: How tall is the Statue of Liberty and what are the current visiting requirements?
Expected: Height of Statue of Liberty (151 feet/46 meters statue, 305 feet/93 meters including pedestal), current visiting procedures including advance reservations, security requirements, ferry information, and any COVID-related restrictions or changes to normal operations.

Case 3: What were the most significant moments or record-breaking performances from the most recent Summer Olympics? I'm looking for both athletic achievements and any notable stories that made headlines.
Expected: The AI should search for highlights, records, and notable stories from the most recent Summer Olympics, providing specific athletic ach

## 7. Define Task Function

In [8]:
def task_function(case: Case) -> str:
    # Create a new session for this test case
    session_id = str(uuid.uuid4())
    print(f"\nðŸ†• Started new session: {session_id}")
    
    user_sim = ActorSimulator.from_case_for_user_simulator(case=case, max_turns=MAX_TURNS)
    
    user_message = case.input
    final_response = ""
    
    print(f"\n{'='*80}")
    print(f"Test Case: {case.input}")
    print(f"Expected: {case.expected_output}")
    print(f"{'='*80}")
    
    turn = 1
    while user_sim.has_next():
        print(f"\nTurn {turn}: {user_message}")
        agent_response = invoke_agentcore(user_message, session_id=session_id)
        final_response = agent_response
        print(f"Agent: {agent_response[:200]}...")
        
        user_result = user_sim.act(agent_response)
        user_message = str(user_result.structured_output.message)
        turn += 1
    
    print(f"ðŸ”š Ended session: {session_id}")
    
    return final_response

## 8. Run Evaluations

In [9]:
results = []

for i, case in enumerate(dataset.cases, 1):
    print(f"\n\n{'#'*80}")
    print(f"# Running Test Case {i}/{len(dataset.cases)}")
    print(f"{'#'*80}")
    
    try:
        response = task_function(case)
        results.append({
            "case_number": i,
            "input": case.input,
            "expected": case.expected_output,
            "actual": response,
            "status": "success"
        })
    except Exception as e:
        print(f"ERROR: {e}")
        results.append({
            "case_number": i,
            "input": case.input,
            "expected": case.expected_output,
            "actual": str(e),
            "status": "error"
        })

print(f"\n\nCompleted {len(results)} test cases")
print(f"Successful: {sum(1 for r in results if r['status'] == 'success')}")
print(f"Errors: {sum(1 for r in results if r['status'] == 'error')}")



################################################################################
# Running Test Case 1/10
################################################################################

ðŸ†• Started new session: da90c0d6-85b8-44fd-b90a-65d34093b242

Test Case: What are the main tourist attractions and must-see places in San Francisco?
Expected: List of major San Francisco attractions (Golden Gate Bridge, Alcatraz, Fisherman's Wharf, etc.), brief descriptions, locations, operating hours, and official tourism website information.

Turn 1: What are the main tourist attractions and must-see places in San Francisco?
ERROR: Unterminated string starting at: line 1 column 1 (char 0)


################################################################################
# Running Test Case 2/10
################################################################################

ðŸ†• Started new session: f93faa67-79ce-4dad-bbd4-b901e9add58a

Test Case: How tall is the Statue of Liberty and what are 

## 9. View Results

In [10]:
import pandas as pd

df = pd.DataFrame(results)
df

Unnamed: 0,case_number,input,expected,actual,status
0,1,What are the main tourist attractions and must...,List of major San Francisco attractions (Golde...,Unterminated string starting at: line 1 column...,error
1,2,How tall is the Statue of Liberty and what are...,Height of Statue of Liberty (151 feet/46 meter...,Unterminated string starting at: line 1 column...,error
2,3,What were the most significant moments or reco...,"The AI should search for highlights, records, ...",Unterminated string starting at: line 1 column...,error
3,4,Which areas of Yellowstone and Glacier Nationa...,"Current park accessibility status, road closur...","Read timeout on endpoint URL: ""https://bedrock...",error
4,5,How tall is the Eiffel Tower and when was it b...,Basic factual information about the Eiffel Tow...,"Based on the search results, I can provide you...",success
5,6,What are the current travel conditions and saf...,The AI should search for current conditions at...,"Read timeout on endpoint URL: ""https://bedrock...",error
6,7,What are the current COVID-related travel rest...,Current COVID-related travel requirements for ...,Unterminated string starting at: line 1 column...,error
7,8,What are the current trends in sustainable tou...,Overview of 2024 sustainable tourism trends in...,Unterminated string starting at: line 1 column...,error
8,9,I heard about a new archaeological discovery t...,The AI should search for recent archaeological...,"Read timeout on endpoint URL: ""https://bedrock...",error
9,10,I'm preparing for a trivia competition and nee...,The response should provide: Comprehensive lis...,"Read timeout on endpoint URL: ""https://bedrock...",error


## Summary

This evaluation tested your web search agent with:

**14 Total Evaluators:**
- 13 Builtin evaluators (Correctness, Faithfulness, Helpfulness, Relevance, Conciseness, Coherence, InstructionFollowing, Refusal, Harmfulness, Stereotyping, GoalSuccessRate, ToolSelectionAccuracy, ToolParameterAccuracy)
- 1 Custom LLM-as-a-Judge evaluator (Web Search Quality)

All traces and evaluations are captured in CloudWatch and available in the AgentCore dashboard.

# End