### Building & Evaluating Complex Agents with CrewAI and flotorch-eval

This notebook demonstrates a complete workflow for building a simple agent with CrewAI, capturing its behavior with OpenLit, and evaluating its performance against multiple criteria with flotorch-eval.

We'll walk through these key steps:

**`Setup`**: Install all necessary libraries.  
**`Tracing`**: Configure OpenTelemetry and OpenLit to capture the agent's execution trace.  
**`Agent Construction`**: Build an "AWS Tech Agent" that can research topics online.  
**`Evaluation`**: Score the agent on goal accuracy, tool use, trajectory quality, cost, and latency, using a "golden" reference answer and trajectory.  

#### Setup and dependencies

In [1]:
!pip3 install flotorch-eval crewai plotly opentelemetry-api opentelemetry-sdk pandas ragas duckduckgo-search openlit langchain_aws -q

#### Configure Tracing with OpenLit
To evaluate our agent, we first need to record what it does. We'll use OpenLit to automatically create a detailed trace of the agent's execution, including every LLM call and tool usage.

We'll store these traces in memory for easy access during the evaluation phase.

In [3]:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Create an in-memory span exporter
memory_exporter = InMemorySpanExporter()
span_processor = SimpleSpanProcessor(memory_exporter)

# Set up the tracer provider and add the span processor
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(span_processor)
trace.set_tracer_provider(tracer_provider)

# Initialize OpenLit - this will automatically instrument CrewAI when it's imported
import openlit
openlit.init()

Overriding of current TracerProvider is not allowed


{
    "resource_metrics": [
        {
            "resource": {
                "attributes": {
                    "telemetry.sdk.language": "python",
                    "telemetry.sdk.name": "openlit",
                    "telemetry.sdk.version": "1.34.1",
                    "service.name": "default",
                    "deployment.environment": "default"
                },
                "schema_url": ""
            },
            "scope_metrics": [
                {
                    "scope": {
                        "name": "openlit.otel.metrics",
                        "version": "0.1.0",
                        "schema_url": "",
                        "attributes": null
                    },
                    "metrics": [
                        {
                            "name": "gen_ai.client.token.usage",
                            "description": "Measures number of input and output tokens used",
                            "unit": "{token}",
                 

In [None]:
import sys, os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "../../..")))

# Import required libraries
from typing import List
import asyncio

from crewai import LLM, Agent, Task, Crew
from crewai.tools import tool
from duckduckgo_search import DDGS
from langchain_aws import ChatBedrockConverse
from ragas.llms import LangchainLLMWrapper

from flotorch_eval.agent_eval.core.evaluator import Evaluator
from flotorch_eval.agent_eval.metrics.base import BaseMetric, MetricConfig
from flotorch_eval.agent_eval.metrics.langchain_metrics import (
    TrajectoryEvalWithLLMMetric,
    TrajectoryEvalWithoutLLMMetric,
)
from flotorch_eval.agent_eval.metrics.ragas_metrics import (
    AgentGoalAccuracyMetric,
    ToolCallAccuracyMetric,
)
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric

Reading CSV file from: /home/kiran/Flotorch-fork/flotorch-eval/venv/lib/python3.10/site-packages/flotorch_eval/data/bedrock_limits_small.csv


### Evaluation Setup
Before we build the agent, let's set up the components needed for its evaluation.

#### Configure the LLM Judge
Some of our metrics require an LLM to "judge" the agent's output for quality and correctness.  
For this, we're choosing **`Amazon Nova Micro`**.

In [5]:
region = "us-east-1"
bedrock_model = ChatBedrockConverse(
    region_name=region,
    endpoint_url=f"https://bedrock-runtime.{region}.amazonaws.com",
    model_id="us.amazon.nova-micro-v1:0"
)

llm_judge = LangchainLLMWrapper(bedrock_model)

#### Define the Evaluation Logic
This helper function orchestrates the evaluation process. It takes the captured traces (spans) and the evaluation metrics, runs the evaluator, and displays the results.

In [6]:
from evaluation_utils import create_trajectory, initialize_evaluator, display_evaluation_results

async def evaluate_agent(metrics: List[BaseMetric], spans: List):
    """
    Converts captured spans into a trajectory, evaluates it against the
    given metrics, and displays the results.

    Args:
        metrics: A list of configured metrics for evaluation.
        spans: The finished spans captured while the agent ran.

    Returns:
        The evaluation results, or None if no spans were provided.
    """

    # 1. Make sure there is a trace to evaluate
    if not spans:
        print("\nEvaluation failed: No spans were provided.")
        return None

    # 2. Convert the captured spans into a trajectory
    trajectory = create_trajectory(spans)

    # 3. Initialize and run the evaluator
    evaluator = initialize_evaluator(metrics)
    print("\n--- Running Evaluation ---")
    results = await evaluator.evaluate(trajectory)

    # 4. Display results
    print("\n--- Evaluation Scores ---")
    display_evaluation_results(results)
    return results

#### Use case: AWS Tech Agent

Now, let's build, run, and evaluate our agent.

#### Build the Agent
We'll define the agent's tools, its LLM, its role, and the task it needs to perform.

**`Tools`**: A DuckDuckGoSearch tool for looking up information. A placeholder SalesforceIntegration tool is also defined, but it is not attached to the agent.  
**`LLM`**: The agent is powered by the `us.amazon.nova-pro-v1:0` model on Amazon Bedrock.  
**`Role`**: A Writer that simplifies GenAI concepts on AWS for beginners.

In [7]:
# Define the tools that will be used by the agents
@tool('DuckDuckGoSearch')
def search_tool(search_query: str):
    """Search the web for information on a given topic"""
    return DDGS().text(search_query, max_results=5)

# Placeholder tool: returns a fixed string and is not attached to the agent below
@tool('SalesforceIntegration')
def salesforce_tool(soql_query: str):
    """Call Salesforce API to get data"""
    return "Salesforce Integration"

# Set up the LLM that will power the agent
model = LLM(
    # model="sagemaker/INSERT ENDPOINT NAME",
    model="bedrock/us.amazon.nova-pro-v1:0",
    temperature=0.7, max_tokens=4*1024,
)

writer = Agent(
        role="Writer",
        goal="You make GenAI concepts understandable for newbies exploring GenAI on AWS",
        backstory="You're an expert in writing crisp summaries about GenAI on AWS.",
        tools=[search_tool],
        llm=model
    )

task = Task(description=("What is {topic}?"),
            expected_output=("Compose a short summary that includes the answer."),
            agent=writer)

crew = Crew(
  agents=[writer],
  tasks=[task],
  share_crew=False
)

#### Run the Agent
Let's kick off the crew with a topic. OpenLit automatically captures the entire execution in the background.
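
The `{topic}` placeholder in the task description is filled from the dictionary passed to `kickoff`. As a rough analogy only (CrewAI performs the substitution internally, and its exact rules may differ), plain Python string formatting shows what the task text becomes:

```python
# Analogy only: CrewAI substitutes placeholders itself; this merely shows the result.
description_template = "What is {topic}?"
inputs = {"topic": "AWS Bedrock"}

# Fill every {name} placeholder from the inputs dictionary
rendered = description_template.format(**inputs)
print(rendered)
```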

In [8]:
result = crew.kickoff(inputs={"topic": "AWS Bedrock"})
print(result)

{
    "name": "Crew Created",
    "context": {
        "trace_id": "0x225f3eaa7a56f36d2a7c3bfce6f6bdf6",
        "span_id": "0x2a3f5c3878c99c86",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-06-12T13:40:05.607669Z",
    "end_time": "2025-06-12T13:40:05.609476Z",
    "status": {
        "status_code": "OK"
    },
    "attributes": {
        "crewai_version": "0.130.0",
        "python_version": "3.10.12",
        "crew_key": "a7f38a97312c2a08d7161334f446c413",
        "crew_id": "d1df5087-e637-42e5-b1d4-8dd959d9dce3",
        "crew_process": "sequential",
        "crew_memory": false,
        "crew_number_of_tasks": 1,
        "crew_number_of_agents": 1,
        "crew_fingerprint": "9e879275-c40d-4f48-b346-2cb9c79f2b4d",
        "crew_fingerprint_created_at": "2025-06-12T19:10:05.599948",
        "crew_agents": "[{\"key\": \"18e63413ba6e2f4d81ec74e7660d93d9\", \"id\": \"b0e805e5-267c-4b1e-8031-6e334020ccfa\", \"role\"

#### Define Ground Truth for Evaluation
To score the agent, we need a "golden" reference to compare it against. This includes:

**`A reference answer`**: The ideal final output we expect.  
**`A reference trajectory`**: The ideal sequence of thoughts and tool calls the agent should have taken.

In [9]:
REFERENCE_FINAL_ANSWER="Amazon Bedrock is a fully managed service that makes it easy to use foundation models from third-party providers and Amazon. It allows users to build generative AI applications with a choice of foundation models from different AI companies, using a single API. Users can customize these models with their data, orchestrate multistep tasks, trace reasoning, and apply guardrails for responsible AI. Additionally, Amazon Bedrock enables the creation of generative AI workflows by connecting its features with other AWS services."

REFERENCE_TRAJECTORY_OUTPUTS = [
                {"role": "user", "content": "What is AWS Bedrock?"},
                {
                    "role": "assistant",
                    "content": "To compose a short summary of Amazon Bedrock, I first need to gather information about what Amazon Bedrock is. I will use the available tool to search for this information.",
                    "tool_calls": [
                        {
                            "function": {
                                "name": "Search the web for information on a given topic",
                                "arguments": "{\"search_query\": \"Amazon Bedrock\"}"
                            }
                        }
                    ]
                },
                {"role": "tool", "content": "{\"searchParameters\": {\"q\": \"Amazon Bedrock\", \"type\": \"search\", \"num\": 5, \"engine\": \"google\"}, \"organic\": [{\"title\": \"Amazon Bedrock - Generative AI - AWS\", \"link\": \"https://aws.amazon.com/bedrock/\", \"snippet\": \"Amazon Bedrock Data Automation streamlines the generation of valuable insights from unstructured multimodal content such as documents, images, audio, and videos ...\", \"position\": 1, \"sitelinks\": [{\"title\": \"Amazon Bedrock\", \"link\": \"https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html\"}, {\"title\": \"Amazon Bedrock Pricing\", \"link\": \"https://aws.amazon.com/bedrock/pricing/\"}, {\"title\": \"Amazon Bedrock Documentation\", \"link\": \"https://docs.aws.amazon.com/bedrock/\"}, {\"title\": \"Amazon Bedrock FAQs\", \"link\": \"https://aws.amazon.com/bedrock/faqs/\"}, {\"title\": \"Amazon Bedrock Agents\", \"link\": \"https://aws.amazon.com/bedrock/agents/\"}]}, {\"title\": \"Getting Started with Amazon Bedrock - AWS\", \"link\": \"https://aws.amazon.com/awstv/watch/6ff4cd6fa97/\", \"snippet\": \"So check the region that you're currently in, make sure it's a region that's supported by Bedrock. Then I'm gonna scroll to the bottom of this ...\", \"position\": 2}], \"relatedSearches\": [{\"query\": \"Amazon Bedrock pricing\"}, {\"query\": \"Amazon Bedrock documentation\"}, {\"query\": \"Amazon Bedrock Claude\"}, {\"query\": \"Amazon Bedrock logo\"}, {\"query\": \"Amazon Bedrock DeepSeek\"}], \"credits\": 1}"},
                {"role": "assistant", "content": "Based on the observation, I have learned that Amazon Bedrock is a fully managed service that makes it easy to use foundation models from third-party providers and Amazon. It allows users to build generative AI applications with a choice of foundation models from different AI companies, using a single API. Users can customize these models with their data, orchestrate multistep tasks, trace reasoning, and apply guardrails for responsible AI. Additionally, Amazon Bedrock enables the creation of generative AI workflows by connecting its features with other AWS services."}
            ]
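
A hand-written reference trajectory is easy to get subtly wrong, for example with a malformed `arguments` string. The sketch below is one way to sanity-check the message list before handing it to a metric. The helper name `check_reference_trajectory` and the set of allowed roles are assumptions made for illustration. It checks only the shape visible in the cell above, not whatever schema flotorch-eval may enforce internally.

```python
import json

# Roles seen in the reference trajectory above (assumed to be the full set)
ALLOWED_ROLES = {"user", "assistant", "tool"}

def check_reference_trajectory(messages):
    """Return a list of problems found in a role/content/tool_calls message list."""
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        for call in msg.get("tool_calls", []):
            fn = call.get("function", {})
            if not fn.get("name"):
                problems.append(f"message {i}: tool call without a name")
            try:
                # Tool arguments are stored as a JSON-encoded string
                json.loads(fn.get("arguments", ""))
            except (TypeError, json.JSONDecodeError):
                problems.append(f"message {i}: tool arguments are not valid JSON")
    return problems

# Hypothetical miniature trajectories used only to exercise the helper
sample = [
    {"role": "user", "content": "What is AWS Bedrock?"},
    {"role": "assistant", "content": "I will search first.",
     "tool_calls": [{"function": {"name": "search",
                                  "arguments": "{\"search_query\": \"Amazon Bedrock\"}"}}]},
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "Amazon Bedrock is a fully managed service."},
]
broken = [
    {"role": "assistant", "content": "hi",
     "tool_calls": [{"function": {"name": "search", "arguments": "{oops"}}]},
]

print(check_reference_trajectory(sample))
print(check_reference_trajectory(broken))
```

Calling `check_reference_trajectory(REFERENCE_TRAJECTORY_OUTPUTS)` in the notebook should return an empty list if the reference is well-formed in this sense.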

#### Run the Evaluation
Finally, we retrieve the captured traces (spans), configure our evaluation metrics, and run the `evaluate_agent` function.

In [10]:
spans = memory_exporter.get_finished_spans()
print("Number of spans:", len(spans))

Number of spans: 7
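
Before evaluating, it can help to look at what the spans contain. The sketch below totals the duration per span name. It relies only on the `name`, `start_time`, and `end_time` attributes that OpenTelemetry's finished spans expose, with timestamps in nanoseconds. The helper name `summarize_spans` and the demo data are made up for illustration, using stand-in objects so the example runs without a live trace.

```python
from collections import defaultdict
from types import SimpleNamespace

def summarize_spans(spans):
    """Return the total duration in milliseconds for each span name."""
    totals = defaultdict(float)
    for span in spans:
        # Skip spans that never started or never ended
        if span.start_time is None or span.end_time is None:
            continue
        # Timestamps are in nanoseconds; convert the difference to milliseconds
        totals[span.name] += (span.end_time - span.start_time) / 1e6
    return dict(totals)

# Stand-in spans (hypothetical names and timings), not real trace data
demo = [
    SimpleNamespace(name="llm_call", start_time=0, end_time=2_000_000),
    SimpleNamespace(name="llm_call", start_time=5_000_000, end_time=6_000_000),
    SimpleNamespace(name="tool_call", start_time=2_000_000, end_time=2_500_000),
]

print(summarize_spans(demo))
```

In the notebook, `summarize_spans(spans)` can be called directly on the list returned by `memory_exporter.get_finished_spans()`.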


In [None]:
metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(
        llm=llm_judge,
        config=MetricConfig(
            metric_params={"reference_answer": REFERENCE_FINAL_ANSWER}
        ),
    ),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_model,
        config=MetricConfig(
            metric_params={"reference_outputs": REFERENCE_TRAJECTORY_OUTPUTS}
        ),
    ),
    UsageMetric(config=MetricConfig(metric_params={"aws_region": "us-east-1"})),
    LatencyMetric(),
]

# Run the evaluation. If asyncio.run() raises a RuntimeError about an already
# running event loop (common inside Jupyter), use
# `results = await evaluate_agent(metrics, spans)` instead.
results = asyncio.run(evaluate_agent(metrics, spans))
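
To build intuition for what comparing an agent's tool use against a reference involves, here is a deliberately simplified sketch. It is not how `ToolCallAccuracyMetric` or any flotorch-eval metric computes its score. The helpers `tool_names` and `same_tool_sequence` are hypothetical, and the check looks only at tool names and their order, ignoring arguments.

```python
def tool_names(messages):
    """Collect tool-call names, in order, from a role/content/tool_calls message list."""
    return [
        call["function"]["name"]
        for msg in messages
        for call in msg.get("tool_calls", [])
    ]

def same_tool_sequence(actual, reference):
    """True if both trajectories call the same tools in the same order."""
    return tool_names(actual) == tool_names(reference)

# Hypothetical miniature trajectories used only for illustration
reference = [
    {"role": "user", "content": "What is AWS Bedrock?"},
    {"role": "assistant", "content": "Searching.",
     "tool_calls": [{"function": {"name": "DuckDuckGoSearch", "arguments": "{}"}}]},
]
actual_with_search = [
    {"role": "assistant", "content": "Let me look that up.",
     "tool_calls": [{"function": {"name": "DuckDuckGoSearch", "arguments": "{}"}}]},
]
actual_without_tool = [
    {"role": "assistant", "content": "Answering from memory."},
]

print(same_tool_sequence(actual_with_search, reference))
print(same_tool_sequence(actual_without_tool, reference))
```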