# Evaluating Bedrock LLM Solutions using AWS Strands, RAGAS and Langfuse 

This notebook demonstrates how to build an agent with observability and evaluation capabilities. 

We use [Langfuse](https://langfuse.com/) to collect the Strands agent traces and an LLM as a judge to evaluate agent performance. The focus is on evaluating the quality of the agent's responses, using the traces produced by the SDK.

### What is Observability and Evaluation?

**Observability** means being able to see what your AI agent is doing "behind the scenes" - like watching its thought process. It helps you understand why your agent makes certain decisions or gives particular responses.

**Evaluation** is how we measure if our agent is doing a good job. Instead of just guessing if responses are good, we use specific metrics to score the agent's performance.

### OpenTelemetry Integration

Strands natively integrates with OpenTelemetry, an industry standard for distributed tracing. You can visualize and analyze traces using any OpenTelemetry-compatible tool. This integration provides:

- **Compatibility with existing observability tools:** Send traces to platforms such as Jaeger, Grafana Tempo, AWS X-Ray, Datadog, and more
- **Standardized attribute naming:** Uses OpenTelemetry semantic conventions
- **Flexible export options:** Console output for development, OTLP endpoint for production
- **Auto-instrumentation:** Trace creation is handled automatically when you turn on tracing
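
As an illustration of the OTLP export option, the sketch below builds the environment variables an OTLP exporter would need to send traces to Langfuse. It assumes Langfuse's OTLP endpoint path `/api/public/otel` and HTTP Basic authentication with the public and secret keys. The function name `langfuse_otlp_env` and the placeholder keys are hypothetical, so check the current Langfuse and Strands documentation before relying on it.

```python
import base64


def langfuse_otlp_env(public_key: str, secret_key: str, host: str) -> dict:
    """Build OTLP exporter settings for Langfuse (assumed endpoint and auth scheme)."""
    # Basic auth token: base64("public_key:secret_key")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        # Standard OpenTelemetry environment variable names
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder keys, for illustration only
env = langfuse_otlp_env("pk-lf-example", "sk-lf-example", "https://us.cloud.langfuse.com")
for name in env:
    print(name)
```

These values would typically be copied into `os.environ` before the agent is created and tracing is turned on.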

### What are we Evaluating?

In this example we evaluate the knowledge base set up in the notebook 01_Setup_S3_Vector_KnowledgeBase.ipynb.

The knowledge base contains 10-K documents that have been added to an S3 Vector knowledge base in Amazon Bedrock. We will build a test agent that runs through the test scenarios in test_cases.json and tries to extract useful insights from the knowledge base.

The agent is configured to send all of its traces to Langfuse. We then pull these traces from Langfuse and evaluate them against metrics we have created using the RAGAS framework. Finally, we send the results back to Langfuse, where they can be reviewed and analysed.

Let's get started.

### Pre-Requisites

- Run the Jupyter notebook 01_Setup_S3_Vector_KnowledgeBase.ipynb and copy the ID of the knowledge base it creates.

- Create a Langfuse account and project, and copy the secret and public keys. See https://langfuse.com/docs/observability/get-started

### Install Required Packages

First, we install the packages this notebook needs from requirements.txt. The main ones are:

- **langfuse**: Provides observability for our agent
- **boto3**: AWS SDK for Python, used to access AWS services and Amazon Bedrock models
- **strands**: Framework for building AI agents
- **ragas**: Evaluation framework used to score the agent's responses

In [1]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://plugin.us-east-1.prod.workshops.aws
Note: you may need to restart the kernel to use updated packages.


### Set Up and Configuration

- Import libraries
- Set environment variables
- Set up the connection with Langfuse to send OpenTelemetry data
- To get the Knowledge Base ID, you will need to run the steps in 01_Setup_S3_Vector_KnowledgeBase.ipynb

In [2]:
import json
import os
import uuid

import boto3
from langfuse import Langfuse
from strands import Agent
from strands.tools import tool
from strands_tools import calculator

from utils import fetch_traces, process_traces, save_results_to_csv, run_test_cases_sync

# Replace with the Knowledge Base ID created in 01_Setup_S3_Vector_KnowledgeBase.ipynb
knowledge_base_id = "U8SNYO8ZTM"
region_name = "us-east-1"

# Initialize the Langfuse client.
# The keys are read from environment variables rather than hard-coded in the notebook;
# set LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY before running this cell.
langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    # host="https://cloud.langfuse.com",  # 🇪🇺 EU region
    host="https://us.cloud.langfuse.com",  # 🇺🇸 US region
)



### Create Knowledge Base Search Tool for the Agent

In [3]:
# Create a tool to search the knowledge base
@tool
def search_vector_db(query: str, customer_id: str) -> str:
    """
    Handle document-based, narrative, and conceptual queries using the unstructured knowledge base.

    Args:
        query: A question about business strategies, policies, or company information, or one requiring document comprehension and qualitative analysis
        customer_id: Customer identifier

    Returns:
        Formatted string response from the knowledge base
    """
    kb_id = knowledge_base_id
    bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=region_name)
    try:
        retrieve_response = bedrock_agent_runtime.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={
                "vectorSearchConfiguration": {
                    "numberOfResults": 5
                }
            }
        )

        # Collect the text of every retrieved chunk, skipping empty ones.
        # The check must sit inside the loop; otherwise only the last chunk is kept.
        results = []
        for result in retrieve_response.get("retrievalResults", []):
            content = result.get("content", {}).get("text", "")
            if content:
                results.append(content)

        return "\n\n".join(results) if results else "No relevant information found."
    except Exception as e:
        return f"Error in unstructured data assistant: {str(e)}"
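
To see the shape of the data the tool works with, here is a minimal sketch that formats a hand-written response laid out like the fields the tool reads (`retrievalResults`, then `content`, then `text`). The helper name `format_retrieval_results` and the sample passages are made up for illustration, and no AWS call is made.

```python
def format_retrieval_results(retrieve_response: dict) -> str:
    """Join the text of all non-empty retrieved chunks with blank lines."""
    texts = []
    for result in retrieve_response.get("retrievalResults", []):
        text = result.get("content", {}).get("text", "")
        if text:  # skip chunks that carry no text
            texts.append(text)
    return "\n\n".join(texts) if texts else "No relevant information found."


# Hypothetical response containing only the fields the tool reads
fake_response = {
    "retrievalResults": [
        {"content": {"text": "First passage."}},
        {"content": {"text": ""}},
        {"content": {"text": "Second passage."}},
    ]
}
print(format_retrieval_results(fake_response))
print(format_retrieval_results({}))
```

Keeping the formatting in a pure function like this makes it possible to test the tool's output handling without a knowledge base.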

### Create an Agent that we would like to Evaluate

This agent analyzes 10-K documents and answers questions from users about companies.


In [None]:
# Create the test agent that we want to evaluate
test_agent = Agent(
    #model="us.amazon.nova-lite-v1:0",
    model="anthropic.claude-3-7-sonnet-20250219-v1:0",
    tools=[search_vector_db, calculator],
    system_prompt="""
        You are a Financial Analyst. Your job is to provide detailed analytical responses based on 10-K documents.
        You will look up data from the knowledge base and use the tools to answer questions. 
        If you are not able to answer the question you will say so. 
    """,
    record_direct_tool_call=True,  # Record when tools are used
    trace_attributes={
        "session.id": str(uuid.uuid4()),  # Generate a unique session ID
        "user.id": "henry.j.a.lee@gmail.com",  # Example user ID
        "langfuse.tags": [
            "Agent-SDK-Example",
            "Strands-Project-Demo",
            "Observability-Tutorial"
        ],
    }
)


### Now we can run the Agent against some test cases with expected results built by human analysts

In [5]:

# Load test cases
with open("test_cases.json", "r") as f:
    data = json.load(f)
    test_cases = data["questions"]  # Extract the questions array

run_test_cases_sync(test_agent, test_cases)



Test Case 1/5: What was AWS's total net sales revenue for 2022 fiscal year?

Tool #1: search_vector_db
Based on the information provided, it does not appear that the specific total net sales revenue for AWS in the 2022 fiscal year was included. The passage mentions that AWS sales increased 29% in 2022 compared to the prior year, but does not provide the actual 2022 sales figure. Without that specific data point, I cannot directly answer the question about AWS's total net sales revenue for the 2022 fiscal year.Response: Based on the information provided, it does not appear that the specific total net sales revenue for AWS in the 2022 fiscal year was included. The passage mentions that AWS sales increased 29% in 2022 compared to the prior year, but does not provide the actual 2022 sales figure. Without that specific data point, I cannot directly answer the question about AWS's total net sales revenue for the 2022 fiscal year.


Test Case 2/5: What was AWS's total net sales revenue for 2

[{'query': "What was AWS's total net sales revenue for 2022 fiscal year?",
  'response': AgentResult(stop_reason='end_turn', message={'role': 'assistant', 'content': [{'text': "Based on the information provided, it does not appear that the specific total net sales revenue for AWS in the 2022 fiscal year was included. The passage mentions that AWS sales increased 29% in 2022 compared to the prior year, but does not provide the actual 2022 sales figure. Without that specific data point, I cannot directly answer the question about AWS's total net sales revenue for the 2022 fiscal year."}]}, metrics=EventLoopMetrics(cycle_count=9, tool_metrics={'search_vector_db': ToolMetrics(tool={'toolUseId': 'tooluse_754K_JY9SjSOjFsbOBAIMw', 'name': 'search_vector_db', 'input': {'customer_id': 'financial_analyst', 'query': 'Amazon capital expenditures 2020 2021 2022 primary areas of investment'}}, call_count=4, success_count=4, error_count=0, total_time=3.0147809982299805)}, cycle_durations=[3.268045902
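
The `run_test_cases_sync` helper lives in `utils` and is not shown in this notebook. Judging only from the output above, it appears to call the agent once per test case and collect `query`/`response` pairs. A rough sketch of such a loop follows. It assumes each test case is a dict with a `question` key (a guess at the test_cases.json layout) and uses a stand-in function in place of a real agent; the actual helper may differ.

```python
from typing import Any, Callable, Dict, List


def run_test_cases_sketch(agent: Callable[[str], Any],
                          test_cases: List[Dict[str, Any]],
                          question_key: str = "question") -> List[Dict[str, Any]]:
    """Call the agent once per test case and collect query/response pairs."""
    results = []
    for i, case in enumerate(test_cases, start=1):
        query = case[question_key]  # hypothetical key name
        print(f"Test Case {i}/{len(test_cases)}: {query}")
        response = agent(query)  # the agent is invoked with the question as its prompt
        results.append({"query": query, "response": response})
    return results


# Stand-in agent and made-up test cases, so the sketch runs without AWS access
def echo_agent(prompt: str) -> str:
    return f"(answer to: {prompt})"


demo_cases = [{"question": "Q1?"}, {"question": "Q2?"}]
demo_results = run_test_cases_sketch(echo_agent, demo_cases)
```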

### Define RAGAS Metrics

We define a rubric that scores, on a scale from 1 to 4, how well the agent's response matches the expected answer. An evaluator LLM applies the rubric to each sample:


In [6]:
from ragas.metrics import RubricsScore
from ragas.llms import LangchainLLMWrapper
from langchain_aws import ChatBedrock

model = ChatBedrock(model_id='us.anthropic.claude-haiku-4-5-20251001-v1:0', region_name=region_name)
evaluator_llm = LangchainLLMWrapper(model)

rubrics = {
    "score1_description": (
        "The response contains incorrect or hallucinated data that contradicts the expected answer, "
        "or fails to answer the question entirely."
    ),
    "score2_description": (
        "The response contains some accurate data from the expected answer but also includes "
        "significant inaccuracies or missing key information."
    ),
    "score3_description": (
        "The response contains mostly accurate data that aligns with the expected answer, "
        "with minor formatting differences or additional context that doesn't contradict the core facts."
    ),
    "score4_description": (
        "The response contains accurate data that matches the key facts and figures from the expected answer, "
        "regardless of formatting, wording, or additional explanatory context."
    ),
}

analysis_metric = RubricsScore(rubrics=rubrics, llm=evaluator_llm, name="Analysis")
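
When reading the scores later, it can help to translate a numeric score back into its rubric text. Here is a small sketch, assuming the `scoreN_description` key naming used above. `describe_score` is a hypothetical helper, and the abbreviated rubric stands in for the full one.

```python
def describe_score(score: float, rubrics: dict) -> str:
    """Return the rubric description for a numeric score, or a fallback message."""
    key = f"score{int(round(score))}_description"
    return rubrics.get(key, f"No rubric entry for score {score}")


# Abbreviated stand-in for the rubric defined in the notebook
demo_rubrics = {
    "score1_description": "Incorrect, hallucinated, or no answer.",
    "score2_description": "Some accurate data, significant gaps.",
    "score3_description": "Mostly accurate, minor differences.",
    "score4_description": "Accurate on key facts and figures.",
}
print(describe_score(4.0, demo_rubrics))
print(describe_score(5.0, demo_rubrics))
```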

In [7]:
import pandas as pd
from ragas.dataset_schema import (
    EvaluationDataset
)

from ragas import evaluate


def evaluate_conversation_samples(multi_turn_samples, trace_sample_mapping):
    """Score the samples with RAGAS and push each Analysis score to its Langfuse trace."""
    if not multi_turn_samples:
        print("No samples to evaluate")
        return None
    
    conv_dataset = EvaluationDataset(samples=multi_turn_samples)
    conv_results = evaluate(dataset=conv_dataset, metrics=[analysis_metric])
    conv_df = conv_results.to_pandas()
    
    # Push only Analysis scores back to Langfuse
    for mapping in trace_sample_mapping:
        if mapping["type"] == "multi_turn":
            sample_index = mapping["index"]
            trace_id = mapping["trace_id"]
            
            if sample_index < len(conv_df):
                try:
                    score = float(conv_df.iloc[sample_index]["Analysis"])
                    langfuse.create_score(
                        trace_id=trace_id,
                        name="Analysis",
                        value=score,
                        comment=f"RAG Analysis Score: {score}/4"
                    )
                    print(f"Added Analysis score={score} to trace {trace_id}")
                except Exception as e:
                    print(f"Error adding score: {e}")
    
    return conv_df


In [8]:
def evaluate_traces(batch_size=10, lookback_hours=24, tags=None, save_csv=False):
    """Main function to fetch traces, evaluate them with RAGAS, and push scores back to Langfuse"""
    # Fetch traces from Langfuse
    traces = fetch_traces(langfuse, batch_size, lookback_hours, tags)
    
    if not traces:
        print("No traces found. Exiting.")
        return
    
    # Process traces into samples
    processed_data = process_traces(langfuse, traces)

    
    conv_df = evaluate_conversation_samples(
        processed_data["multi_turn_samples"], 
        processed_data["trace_sample_mapping"]
    )
    
    # Save results to CSV if requested
    if save_csv:
        save_results_to_csv(conv_df)
    
    return {
        "conversation_results": conv_df
    }
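
`fetch_traces` is defined in `utils`, so its internals are not visible here. The `lookback_hours` argument suggests a time window that ends at the current time. A sketch of how such a window could be computed follows; the helper name `lookback_window` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def lookback_window(lookback_hours: float,
                    now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the last `lookback_hours` hours."""
    end = now or datetime.now()
    start = end - timedelta(hours=lookback_hours)
    return start, end


# A fixed "now" keeps the example reproducible
start, end = lookback_window(1, now=datetime(2025, 10, 24, 16, 6, 32))
print(f"Fetching traces from {start} to {end}")
```

A short window such as one hour keeps the evaluation focused on the traces from the current test run, provided the run finished within that time.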

### Now let's pull the traces from Langfuse for evaluation and push the results back

In [10]:
results = evaluate_traces(
    lookback_hours=1,
    batch_size=5,
    tags=["Agent-SDK-Example"],
    save_csv=True
)

# Access results if needed for further analysis
if results:
    if "conversation_results" in results and results["conversation_results"] is not None:
        print("\nConversation Evaluation Summary:")
        print(results["conversation_results"].describe())

Fetching traces from 2025-10-24 15:06:32.665804 to 2025-10-24 16:06:32.665804
Fetched 5 traces
Getting Observation dadf8c692d6c7e5f
Getting Observation 28aa5145ccd0581e
Getting Observation b162f975190dcd90
Getting Observation 635ba7b1274ced83
Getting Observation 9a1408ae22da48c4
Getting Observation 9255929c9212225c
User inputs: ['[{\'role\': \'user\', \'content\': \'[{"text": "How much did Amazon spend on capital expenditures from 2020-2022 and what were the primary areas of investment?"}]\'}]']
Agent responses: ['{\'message\': "The information provided does not directly give the capital expenditure figures for Amazon from 2020-2022. However, it does provide some relevant data points:\\n\\n- The consolidated balance sheet shows that Amazon\'s property and equipment, net increased from $186,715 million in 2022 to $204,177 million in 2023. This suggests significant capital investments were made during this time period.\\n\\n- The cash flow statement shows that Amazon\'s cash used in inve

Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]

Added Analysis score=2.0 to trace 12ab587c0d9ee6a1419770ce2a9a30b0
Added Analysis score=4.0 to trace 844af78b509eb023409459acd02d5bc1
Added Analysis score=1.0 to trace 615f165deef828f2e652796b92b17017
Added Analysis score=1.0 to trace cc56896cefe6e3897c7c871bc718cfcf
Added Analysis score=1.0 to trace 776b64f20812981bfc5e37d6e4e36d6f
RAG evaluation results saved to evaluation_results/rag_evaluation_20251024_160648.csv

Conversation Evaluation Summary:
       Analysis
count   5.00000
mean    1.80000
std     1.30384
min     1.00000
25%     1.00000
50%     1.00000
75%     2.00000
max     4.00000


### Check results and Clean up resources

You should now be able to review the results of the evaluation in Langfuse. The Analysis scores show that not all of the questions were answered correctly, which suggests that the 10-K documents may need additional processing before being added to the knowledge base, or that the financial data would be better sourced from a structured data source.

To clean up the resources, go back to 01_Setup_S3_Vector_KnowledgeBase.ipynb and run the clean-up section with the clean_resources flag set to True.
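
The summary statistics can be reproduced from the five scores logged in the run above. The sketch below also computes a pass rate. The threshold of 3 ("mostly accurate" in the rubric) is an assumption chosen for illustration, not something the notebook defines.

```python
from statistics import mean, stdev

# Analysis scores reported in the run above
scores = [2.0, 4.0, 1.0, 1.0, 1.0]
PASS_THRESHOLD = 3.0  # assumed cut-off: rubric level 3 or better

# Fraction of traces that reach the assumed threshold
pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
print(f"mean={mean(scores):.2f} std={stdev(scores):.5f} pass_rate={pass_rate:.0%}")
```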

Thanks!
