# Evaluating Bedrock LLM Solutions using AWS Strands, RAGAS and Langfuse 

This notebook demonstrates how to build an agent with observability and evaluation capabilities. 

We use [Langfuse](https://langfuse.com/) to process the Strands Agent traces and LLM as a judge to evaluate agent performance. The primary focus is on agent evaluation and the quality of responses generated by the agent using traces produced by the SDK.

### What is Observability and Evaluation?

**Observability** means being able to see what your AI agent is doing "behind the scenes" - like watching its thought process. It helps you understand why your agent makes certain decisions or gives particular responses.

**Evaluation** is how we measure if our agent is doing a good job. Instead of just guessing if responses are good, we use specific metrics to score the agent's performance.

### OpenTelemetry Integration

Strands natively integrates with OpenTelemetry, an industry standard for distributed tracing. You can visualize and analyze traces using any OpenTelemetry-compatible tool. This integration provides:

- **Compatibility with existing observability tools:** Send traces to platforms such as Jaeger, Grafana Tempo, AWS X-Ray, Datadog, and more
- **Standardized attribute naming:** Uses OpenTelemetry semantic conventions
- **Flexible export options:** Console output for development, OTLP endpoint for production
- **Auto-instrumentation:** Trace creation is handled automatically when you turn on tracing

### What are we Evaluating?

In this example we evaluate the knowledge base set up in the notebook 01_Setup_S3_Vector_KnowledgeBase.ipynb.

The knowledge base contains 10-K documents that have been added to an S3 Vector knowledge base in Amazon Bedrock. We will build a test agent that runs through the test scenarios in test_cases.json and tries to extract useful insights from the knowledge base.

The agent is configured to send all of its traces to Langfuse. We then pull these traces from Langfuse and evaluate them against metrics we have created using the RAGAS framework. Finally, we send the results back to Langfuse, where they can be reviewed and analysed.

Let's get started.

### Pre-Requisites

- Run the Jupyter notebook 01_Setup_S3_Vector_KnowledgeBase.ipynb and copy the created Knowledge Base ID.

- Create a Langfuse account and project, then copy the secret and public keys: https://langfuse.com/docs/observability/get-started

### Install Required Packages

First, we need to install all the necessary packages for our notebook. The main ones are:

- **langfuse**: Provides observability for our agent
- **boto3**: AWS SDK for Python, used to access AWS services and Amazon Bedrock models
- **strands**: Framework for building AI agents
- **ragas**: Provides the metrics used to evaluate the agent's traces

In [None]:
%pip install -r requirements.txt

### Set Up and Configuration

- Import libraries
- Set environment variables
- Set up the connection to Langfuse for sending OpenTelemetry data
- To get the Knowledge Base ID, you will need to run the steps in 01_Setup_S3_Vector_KnowledgeBase.ipynb
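
The Langfuse connection step relies on OpenTelemetry export settings. One common way to wire this up is through the standard OTLP environment variables. The following is a minimal sketch, assuming Langfuse accepts OTLP traces at `<host>/api/public/otel` with HTTP Basic authentication built from the public and secret keys. The helper name `langfuse_otel_env` and the example keys are hypothetical.

```python
import base64
import os


def langfuse_otel_env(public_key: str, secret_key: str, host: str) -> dict:
    """Build OTLP exporter settings for a Langfuse project (hypothetical helper)."""
    # Basic auth token: base64("<public_key>:<secret_key>")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        # Assumption: Langfuse exposes its OTLP endpoint under /api/public/otel
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder keys for illustration only; use your own project keys.
env = langfuse_otel_env("pk-lf-example", "sk-lf-example", "https://us.cloud.langfuse.com")
os.environ.update(env)

# Assumption: depending on the Strands version, the exporter may also need to be
# enabled explicitly after the variables are set, for example:
#   from strands.telemetry import StrandsTelemetry
#   StrandsTelemetry().setup_otlp_exporter()
```

Setting the variables before the agent is created keeps credentials out of the agent definition itself. In practice the keys would come from a secrets store or the shell environment rather than from literals in the notebook.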

In [None]:
import json
import uuid

import boto3
from langfuse import Langfuse
from strands import Agent
from strands.tools import tool
from strands_tools import calculator

from utils import fetch_traces, process_traces, save_results_to_csv

knowledge_base_id = "YOUR_KNOWLEDGE_BASE_ID"
region_name = "us-east-1"

# Initialize LangFuse client
langfuse = Langfuse(
    secret_key="your-langfuse-secret-key",
    public_key="your-langfuse-public-key",
    #host = "https://cloud.langfuse.com" # 🇪🇺 EU region
    host = "https://us.cloud.langfuse.com" # 🇺🇸 US region
)



### Create a Knowledge Base Search tool to add to the Agent

In [None]:
#Create tool to search knowledge base
@tool
def search_vector_db(query: str, customer_id: str) -> str:
    """
    Handle document-based, narrative, and conceptual queries using the unstructured knowledge base.

    Args:
        query: A question about business strategies, policies, or company information, or one
            requiring document comprehension and qualitative analysis.
        customer_id: Customer identifier (not currently used by the search).

    Returns:
        Formatted string response from the knowledge base.
    """
    kb_id = knowledge_base_id 
    bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=region_name)    
    try:        
        retrieve_response = bedrock_agent_runtime.retrieve(            
            knowledgeBaseId=kb_id,            
            retrievalQuery={"text": query},            
            retrievalConfiguration={                
                "vectorSearchConfiguration": {                    
                    "numberOfResults": 5
                }
            }
        )
        
        # Collect the text of every retrieved chunk, skipping empty ones
        results = []
        for result in retrieve_response.get('retrievalResults', []):
            content = result.get('content', {}).get('text', '')
            if content:
                results.append(content)

        return "\n\n".join(results) if results else "No relevant information found."
    except Exception as e:        
        return f"Error in unstructured data assistant: {str(e)}"
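
The result-formatting step of the tool can be checked without AWS access by separating it into a pure function. The sketch below assumes the response shape the tool already reads (`retrievalResults`, each with `content.text`). The function name `format_retrieval_results` and the sample response are made up for illustration.

```python
def format_retrieval_results(retrieve_response: dict) -> str:
    """Join the text of all non-empty retrieved chunks (hypothetical helper)."""
    results = []
    for result in retrieve_response.get("retrievalResults", []):
        content = result.get("content", {}).get("text", "")
        # The check sits inside the loop so every chunk is considered, not just the last
        if content:
            results.append(content)
    return "\n\n".join(results) if results else "No relevant information found."


# Fabricated response used only to exercise the function
fake_response = {
    "retrievalResults": [
        {"content": {"text": "Revenue grew 8%."}},
        {"content": {}},  # chunk without text is skipped
        {"content": {"text": "Net income fell."}},
    ]
}
formatted = format_retrieval_results(fake_response)
print(formatted)
```

An empty or missing `retrievalResults` list falls through to the "No relevant information found." message, which gives the agent an explicit signal instead of an empty string.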

### Create an Agent that we would like to Evaluate

This agent analyzes 10-K documents and answers users' questions about companies.


In [None]:
# Create the agent under test
test_agent = Agent(
    #model="us.amazon.nova-lite-v1:0",
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    tools=[search_vector_db, calculator],
    system_prompt="""
        You are a Financial Analyst. Your job is to provide detailed analytical responses based on 10-K documents.
        You will look up data from the knowledge base and use the tools to answer questions. 
        If you are not able to answer the question you will say so. 
    """,
    record_direct_tool_call = True,  # Record when tools are used
    trace_attributes={
        "session.id": str(uuid.uuid4()),  # Generate a unique session ID
        "user.id": "henry.j.a.lee@gmail.com",  # Example user ID
        "langfuse.tags": [
            "Agent-SDK-Example",
            "Strands-Project-Demo",
            "Observability-Tutorial"
        ],
    }
)


### Now we can run the Agent against some test cases with expected results built by human analysts

In [None]:
# Create an evaluator agent (LLM as a judge); it uses the same model as the test agent
evaluator = Agent(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    system_prompt="""
    You are an expert AI evaluator. Your job is to assess the quality of AI responses based on:
    1. Accuracy - factual correctness of the response
    2. Relevance - how well the response addresses the query
    3. Completeness - whether all aspects of the query are addressed
    4. Tool usage - appropriate use of available tools

    Score each criterion from 1-5, where 1 is poor and 5 is excellent.
    Provide an overall score and brief explanation for your assessment.
    """
)

# Load test cases
with open("test_cases.json", "r") as f:
    test_cases = json.load(f)

for case in test_cases["questions"]:
    # Get agent response
    agent_response = test_agent(case["query"])
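
The loop above only relies on a top-level `questions` list whose items have a `query` field. The sketch below shows one possible file layout and a small validation step. The `expected` field, the sample questions, and the helper name `load_queries` are assumptions for illustration and are not taken from the actual test_cases.json.

```python
import json

# Hypothetical file content; only "questions" and "query" are used by the loop
SAMPLE = """
{
  "questions": [
    {"query": "What was total revenue in the most recent fiscal year?",
     "expected": "placeholder reference answer"},
    {"query": "Summarize the main risk factors."}
  ]
}
"""


def load_queries(raw: str) -> list:
    """Parse test cases and return their queries, failing early on malformed entries."""
    data = json.loads(raw)
    queries = []
    for i, case in enumerate(data.get("questions", [])):
        query = str(case.get("query", "")).strip()
        if not query:
            raise ValueError(f"test case {i} has no 'query'")
        queries.append(query)
    return queries


queries = load_queries(SAMPLE)
print(f"Loaded {len(queries)} test queries")
```

Validating the file before calling the agent avoids spending model invocations on a run that would fail partway through.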


### Define RAGAS Metrics

We'll define several metrics to evaluate different aspects of our agent's performance. We are going to look at the following criteria:

- **Request Completeness**: Has the LLM fulfilled the user's request?
- **Brand Voice**: Has the LLM responded in a polite and professional manner?
- **Tool Usage**: Did the LLM use the tools available to it?

In [None]:
#Import ragas Libraries and Bedrock Model for Evaluation
from ragas.metrics import AspectCritic
from ragas.llms import LangchainLLMWrapper
from langchain_aws import ChatBedrock

#Set Bedrock Model for evaluation 
model = ChatBedrock(model_id='us.anthropic.claude-sonnet-4-20250514-v1:0', region_name=region_name)

# Set up the evaluator LLM (we'll use the same model as our agent)
evaluator_llm = LangchainLLMWrapper(model)

# Metric to check if the agent fulfills all user requests
request_completeness = AspectCritic(
    name="Request Completeness",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the agent completely fulfills all the user requests with no omissions; "
        "otherwise, return 0."
    ),
)

# Metric to assess if the AI's communication aligns with the desired brand voice
brand_tone = AspectCritic(
    name="Brand Voice Metric",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the AI's communication is friendly, approachable, helpful, clear, and concise; "
        "otherwise, return 0."
    ),
)

# Tool usage effectiveness metric
tool_usage_effectiveness = AspectCritic(
    name="Tool Usage Effectiveness",
    llm=evaluator_llm,
    definition=(
        "Return 1 if the agent appropriately used available tools to fulfill the user's request "
        "(such as using search_vector_db for general questions and calculator for financial calculations). "
        "Return 0 if the agent failed to use appropriate tools or used unnecessary tools."
    ),
)


### Create a rubric to help the evaluation model score the responses

In [None]:
from ragas.metrics import RubricsScore

# Define a rubric for scoring how well the agent's analysis is grounded in the knowledge base
rubrics = {
    "score1_description": (
        "The data required to answer the question is not present in the knowledge base, "
        "but the model provides incorrect, hallucinated, or made-up data to answer the question"
    ),
    "score2_description": (
        "The data required to answer the question is not present in the knowledge base, "
        "and the model explains that it does not have enough information to answer the question accurately"
    ),
    "score3_description": (
        "The model retrieves the correct data from the knowledge base "
        "but is not able to do the calculations needed to provide an accurate answer"
    ),
    "score4_description": (
        "The model retrieves the right data from the knowledge base "
        "and does the calculations needed to provide an accurate answer"
    ),
}

# Create the rubric-based metric; its scores appear under the name "Analysis"
recommendations = RubricsScore(rubrics=rubrics, llm=evaluator_llm, name="Analysis")

In [None]:
import pandas as pd
from ragas.dataset_schema import (
    EvaluationDataset
)

from ragas import evaluate


def evaluate_conversation_samples(multi_turn_samples, trace_sample_mapping):
    """Evaluate conversation-based samples and push scores to Langfuse"""
    if not multi_turn_samples:
        print("No multi-turn samples to evaluate")
        return None
    
    print(f"Evaluating {len(multi_turn_samples)} multi-turn samples with conversation metrics")
    conv_dataset = EvaluationDataset(samples=multi_turn_samples)
    conv_results = evaluate(
        dataset=conv_dataset,
        metrics=[
            request_completeness, 
            recommendations,
            brand_tone,
            tool_usage_effectiveness
        ]
        
    )
    conv_df = conv_results.to_pandas()
    
    # Push conversation scores back to Langfuse
    for mapping in trace_sample_mapping:
        if mapping["type"] == "multi_turn":
            sample_index = mapping["index"]
            trace_id = mapping["trace_id"]
            
            if sample_index < len(conv_df):
                for metric_name in conv_df.columns:
                    if metric_name not in ['user_input']:
                        try:
                            metric_value = float(conv_df.iloc[sample_index][metric_name])
                            if pd.isna(metric_value):
                                metric_value = 0.0
                            langfuse.create_score(
                                trace_id=trace_id,
                                name=metric_name,
                                value=metric_value
                            )
                            print(f"Added score {metric_name}={metric_value} to trace {trace_id}")
                        except Exception as e:
                            print(f"Error adding conversation score: {e}")
    
    return conv_df
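
The score-cleaning step inside the loop above can be illustrated on a plain dictionary. This is a minimal sketch with a hypothetical helper `clean_scores` and made-up row values. Unlike the function above, which prints an error for non-numeric columns, this version skips them silently.

```python
import math


def clean_scores(row: dict, skip=("user_input",)) -> dict:
    """Return numeric scores from one result row, mapping NaN to 0.0 (hypothetical helper)."""
    cleaned = {}
    for name, value in row.items():
        if name in skip:
            continue
        try:
            score = float(value)
        except (TypeError, ValueError):
            continue  # non-numeric column, nothing to push as a score
        cleaned[name] = 0.0 if math.isnan(score) else score
    return cleaned


# Fabricated row for illustration
row = {
    "user_input": "What was revenue?",
    "notes": "n/a",
    "Request Completeness": 1,
    "Analysis": float("nan"),
    "Brand Voice Metric": 0,
}
print(clean_scores(row))
```

Mapping NaN to 0.0 mirrors the behaviour of the function above, but it also means a metric that failed to compute is indistinguishable from a genuine failing score. Skipping NaN values instead is a reasonable alternative if that distinction matters.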

In [None]:
def evaluate_traces(batch_size=10, lookback_hours=24, tags=None, save_csv=False):
    """Main function to fetch traces, evaluate them with RAGAS, and push scores back to Langfuse"""
    # Fetch traces from Langfuse
    traces = fetch_traces(langfuse, batch_size, lookback_hours, tags)
    
    if not traces:
        print("No traces found. Exiting.")
        return
    
    # Process traces into samples
    processed_data = process_traces(langfuse, traces)

    
    conv_df = evaluate_conversation_samples(
        processed_data["multi_turn_samples"], 
        processed_data["trace_sample_mapping"]
    )
    
    # Save results to CSV if requested
    if save_csv:
        save_results_to_csv(conv_df)
    
    return {
        "conversation_results": conv_df
    }

### Now let's pull the traces from Langfuse for evaluation and push the results back

In [None]:
results = evaluate_traces(
    lookback_hours=2,
    batch_size=20,
    tags=["Agent-SDK-Example"],
    save_csv=True
)

# Access results if needed for further analysis
if results:
    if "conversation_results" in results and results["conversation_results"] is not None:
        print("\nConversation Evaluation Summary:")
        print(results["conversation_results"].describe())
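
For the binary `AspectCritic` metrics, the mean reported in a summary is the share of samples that scored 1, that is, the pass rate. The sketch below computes that figure from plain dictionaries. The helper name `pass_rates` and the sample rows are made up for illustration.

```python
def pass_rates(rows: list, metrics: list) -> dict:
    """Compute the fraction of rows scoring 1 for each binary metric (hypothetical helper)."""
    rates = {}
    for metric in metrics:
        values = [row[metric] for row in rows if metric in row]
        # None signals that no row carried this metric
        rates[metric] = sum(values) / len(values) if values else None
    return rates


# Fabricated scores for four evaluated traces
rows = [
    {"Request Completeness": 1, "Brand Voice Metric": 1},
    {"Request Completeness": 0, "Brand Voice Metric": 1},
    {"Request Completeness": 1, "Brand Voice Metric": 1},
    {"Request Completeness": 1, "Brand Voice Metric": 1},
]
rates = pass_rates(rows, ["Request Completeness", "Brand Voice Metric"])
print(rates)
```

The rubric-based Analysis score is not binary, so its mean is an average rubric level rather than a pass rate and should be read separately.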

### Check results and Clean up resources

You should now be able to review the results of the evaluation in Langfuse. The Analysis score should show that not all of the questions were answered correctly, which suggests that the 10-K documents may need additional processing before being added to the knowledge base, or that the financial data would be better sourced from a structured data source.

To clean up the resources, go back to 01_Setup_S3_Vector_KnowledgeBase.ipynb and run the clean-up section with the clean_resources flag changed to True.

Thanks!
