<a href="https://www.nvidia.com/dli"> <img src="images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;"> </a>

# Evaluating, Observing, and Optimizing NeMo Agent Toolkit Workflows

You will walk through how to set up agent evaluation, observability, and optimization in this notebook.

## Evaluating, Observing, and Optimizing Our Agent

A key capability of the NeMo Agent Toolkit is running several well-known evaluations against agentic workflows.

### 1. Why Evaluate?
Proper evaluation helps us:
- **Measure Performance**: Quantify how well our agent performs on specific tasks
- **Identify Weaknesses**: Find edge cases or failure modes
- **Compare Versions**: Track improvements across different iterations
- **Ensure Reliability**: Verify the agent works consistently

### 2. Evaluation Process
1. **Create Test Data**: Define questions with known answers
2. **Configure Evaluators**: Set up metrics to measure performance
3. **Run Evaluation**: Process all test cases
4. **Analyze Results**: Review metrics and identify areas for improvement
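
The four steps above can be sketched in plain Python. This is a toy illustration, not the toolkit's implementation: `toy_agent` and `exact_match` are hypothetical stand-ins for a real workflow and a real evaluator, and the second canned answer is deliberately wrong so that the analysis step has something to report.

```python
from statistics import mean

# Step 1: test data with known answers.
test_cases = [
    {"id": 1, "question": "What is the square root of 49?", "answer": "7"},
    {"id": 2, "question": "Add 10 to 25", "answer": "35"},
]

def toy_agent(question: str) -> str:
    """Hypothetical stand-in for the agent; returns canned answers."""
    canned = {"What is the square root of 49?": "7", "Add 10 to 25": "36"}
    return canned.get(question, "")

def exact_match(expected: str, actual: str) -> float:
    """Step 2: a deliberately simple evaluator that scores 1.0 on an exact match."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

# Step 3: run every test case through the agent and score it.
results = [
    {"id": case["id"], "score": exact_match(case["answer"], toy_agent(case["question"]))}
    for case in test_cases
]

# Step 4: aggregate the scores and list the failures to investigate.
average_score = mean(r["score"] for r in results)
failed_ids = [r["id"] for r in results if r["score"] < 1.0]
print(f"average_score={average_score}, failed_ids={failed_ids}")
```

Real evaluators replace `exact_match` with LLM-judged metrics, but the overall shape (dataset in, per-item scores and an average out) stays the same.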

### 3. Available Metrics
- **Answer Accuracy**: How correct are the agent's responses?
- **Context Relevance**: Is the agent using appropriate context?
- **Response Groundedness**: Are responses based on retrieved information?
- **Trajectory Analysis**: Is the agent's reasoning process sound?

This notebook will show how to run evaluations and observe our agent as it works.

Before we begin, let's load our environment variables.

In [None]:
import os
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

## Evaluation Methods in the NeMo Agent Toolkit

The NeMo Agent Toolkit provides several built-in evaluators to assess the performance of your workflows:

1. **RAGAS Evaluator**: An open-source evaluation framework for RAG (Retrieval-Augmented Generation) workflows. RAGAS provides metrics like Answer Accuracy, Context Relevance, and Response Groundedness.

2. **Trajectory Evaluator**: Uses the intermediate steps generated by the workflow to evaluate the agent's reasoning process and decision-making path.

3. **SWE-Bench Evaluator**: Specifically designed for software engineering tasks, this evaluator tests if the agent can solve programming problems by running tests on the generated code.

In this notebook, we'll primarily use the **RAGAS Evaluator** and **Trajectory Evaluator** to assess our math tools agent.

Let's create a directory to store evaluation data. This will contain test cases with questions and expected answers.

In [None]:
!mkdir -p workflows/math_tools/data

In the next cell, we will create an evaluation JSON file.

It will include both standard and time-aware test cases.

Note that each test case includes the following:
- `id`: A unique identifier
- `question`: The input to send to our agent
- `answer`: The expected correct response (or `"dynamic"` for time-based answers)

In [None]:
%%writefile workflows/math_tools/data/comprehensive_eval.json

[
    {
        "id": 1,
        "question": "What is the square root of 49?",
        "answer": "7"
    },
    {
        "id": 2,
        "question": "Add 10 to 25",
        "answer": "35"
    },
    {
        "id": 3,
        "question": "What is the modulus of 100 divided by 3?",
        "answer": "1"
    },
    {
        "id": 4,
        "question": "What is five to the power of three?",
        "answer": "125"
    },
    {
        "id": 5,
        "question": "Is the current hour even?",
        "answer": "dynamic"
    }
]
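
The literal string `"dynamic"` is only a placeholder, so an answer-matching evaluator has nothing meaningful to compare against for test case 5. One possible approach, sketched below under the assumption that you regenerate the dataset shortly before each run, is to compute the expected answer from the clock. The helpers `expected_hour_parity` and `resolve_dynamic_answers` are hypothetical and not part of the toolkit.

```python
from datetime import datetime

def expected_hour_parity(now: datetime) -> str:
    """Return the expected answer to 'Is the current hour even?' for a given time."""
    return "Yes" if now.hour % 2 == 0 else "No"

def resolve_dynamic_answers(cases: list[dict], now: datetime) -> list[dict]:
    """Return a copy of the test cases with 'dynamic' answers replaced.

    Only the hour-parity question is handled; other dynamic cases are left unchanged.
    """
    resolved = []
    for case in cases:
        case = dict(case)  # shallow copy so the input list is not mutated
        if case["answer"] == "dynamic" and case["question"] == "Is the current hour even?":
            case["answer"] = expected_hour_parity(now)
        resolved.append(case)
    return resolved

cases = [
    {"id": 4, "question": "What is five to the power of three?", "answer": "125"},
    {"id": 5, "question": "Is the current hour even?", "answer": "dynamic"},
]
print(resolve_dynamic_answers(cases, datetime.now()))
```

If the agent's `current_datetime` tool reports a different time zone than the machine that generates the dataset, the expected answer may not match, so it is worth checking which clock the tool uses.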

## Creating a Comprehensive Evaluation Configuration

We will write a single evaluation configuration file that includes all the tools our agent needs to handle both mathematical operations and time-based queries. By now you have seen most of this configuration format. This configuration includes:

1. **General settings**: Where to store results and which dataset to use
2. **Functions**: All the tools our agent will use (math operations and time functions)
3. **LLMs**: Both the agent LLM and a separate evaluation LLM
4. **Evaluators**: The specific metrics we want to measure

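The `eval` section is new. In it, you can list as many evaluators as you want to run, using either built-in evaluators or your own custom evaluation components. Each evaluator gets a name of your choosing (here `math_accuracy` and `math_trajectory_accuracy`), and that name is reused in the result file names we load later.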

In [None]:
%%writefile workflows/math_tools/configs/comprehensive_eval_config.yml

general:
  use_uvloop: true

functions:
  calculator_exponent:
    _type: calculator_exponent
  calculator_modulus:
    _type: calculator_modulus
  calculator_square_root:
    _type: calculator_square_root
  calculator_add:
    _type: calculator_add
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
  eval_llm:
    _type: nim
    model_name: meta/llama-3.1-405b-instruct
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names:
    - calculator_exponent
    - calculator_modulus
    - calculator_square_root
    - calculator_add
    - current_datetime
  llm_name: nim_llm
  verbose: true

eval:
  general:
    output_dir: ./math_tools_eval/
    dataset:
      _type: json
      file_path: workflows/math_tools/data/comprehensive_eval.json
  evaluators:
    math_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    math_trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm

## Running the Comprehensive Evaluation

Now let's run the evaluation using our consolidated configuration. This will:
1. Load our test cases from the JSON file
2. Run each test case through our agent
3. Evaluate the responses using the specified metrics
4. Store the results in the output directory

This single evaluation run will test both standard mathematical operations and time-based queries.

In [None]:
!nat eval --config_file=workflows/math_tools/configs/comprehensive_eval_config.yml
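
Before loading results, it can help to confirm that the run produced the files you expect. The sketch below assumes the file names loaded later in this notebook: one `<evaluator_name>_output.json` per evaluator plus `workflow_output.json`. `missing_outputs` is a hypothetical helper, not a toolkit function.

```python
from pathlib import Path

def missing_outputs(output_dir: str, evaluator_names: list[str]) -> list[str]:
    """Return the expected result files that are absent from output_dir."""
    expected = ["workflow_output.json"] + [f"{name}_output.json" for name in evaluator_names]
    return [name for name in expected if not (Path(output_dir) / name).is_file()]

missing = missing_outputs("./math_tools_eval/", ["math_accuracy", "math_trajectory_accuracy"])
print("Missing files:", missing or "none")
```

An empty list means every expected file is present; anything else usually points to a failed run or a different `output_dir`.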

## Examining Evaluation Results

After running the evaluation, the NeMo Agent Toolkit stores the results in JSON files in our specified `output_dir`. Let's analyze these JSON results with some helper functions.

In [None]:
import json
import pandas as pd
from IPython.display import display

# Simple function to load JSON files
def load_json(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

# Simple summary for accuracy evaluations
def get_accuracy_summary(data):
    items = data['eval_output_items']
    summary = []
    for item in items:
        summary.append({
            'Question': item['reasoning']['user_input'],
            'Score': item['score']
        })
    return pd.DataFrame(summary)

# Simple summary for workflow
def get_workflow_summary(data):
    summary = []
    for item in data:
        tools = []
        total_tokens = 0
        steps = len(item['intermediate_steps'])
        
        # Count tools and tokens
        for step in item['intermediate_steps']:
            if 'payload' in step and 'name' in step['payload']:
                tools.append(step['payload']['name'])
            if 'payload' in step and 'usage_info' in step['payload']:
                total_tokens += step['payload']['usage_info']['token_usage']['total_tokens']
        
        summary.append({
            'Question': item['question'],
            'Steps': steps,
            'Tokens': total_tokens,
            'Tools': ', '.join(sorted(set(tools)))  # Unique tools, sorted for stable output
        })
    return pd.DataFrame(summary)

# Load the three files
accuracy = load_json('./math_tools_eval/math_accuracy_output.json')
trajectory = load_json('./math_tools_eval/math_trajectory_accuracy_output.json')
workflow = load_json('./math_tools_eval/workflow_output.json')

# Create simple DataFrames
accuracy_df = get_accuracy_summary(accuracy)
workflow_df = get_workflow_summary(workflow)

# Show basic metrics
print("Overall Scores:")
print(f"Math Accuracy Score: {accuracy['average_score']}")
print(f"Trajectory Accuracy Score: {trajectory['average_score']}")
print(f"Average Steps: {workflow_df['Steps'].mean()}")
print(f"Average Tokens: {workflow_df['Tokens'].mean()}")

# Show accuracy results
print("\nMath Accuracy Results:")
display(accuracy_df)

# Show workflow results
print("\nWorkflow Summary:")
display(workflow_df)

# Count tool usage
tools_used = {}
for tools in workflow_df['Tools']:
    for tool in tools.split(', '):
        if tool:  # Skip empty strings
            tools_used[tool] = tools_used.get(tool, 0) + 1

print("\nTools Used:")
display(pd.DataFrame(list(tools_used.items()), columns=['Tool', 'Count']))
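
The summary above deduplicates names per question, so the `Count` column reports how many questions used each tool rather than how many times it was called. If you want raw call counts, a `collections.Counter` over the intermediate steps is one option. This sketch assumes the same `workflow_output.json` structure the helper functions above rely on and, like them, counts every named step; the sample data is made up for illustration.

```python
from collections import Counter

def count_named_steps(workflow_items: list[dict]) -> Counter:
    """Count every named intermediate step across all questions."""
    counts = Counter()
    for item in workflow_items:
        for step in item.get("intermediate_steps", []):
            name = step.get("payload", {}).get("name")
            if name:
                counts[name] += 1
    return counts

# Made-up sample with the same shape as workflow_output.json
sample = [
    {"question": "q1", "intermediate_steps": [
        {"payload": {"name": "calculator_add"}},
        {"payload": {"name": "calculator_add"}},
    ]},
    {"question": "q2", "intermediate_steps": [
        {"payload": {"name": "current_datetime"}},
        {"payload": {}},  # a step without a name is skipped
    ]},
]
print(count_named_steps(sample).most_common())
```

To apply it to the real run, pass the `workflow` list loaded above: `count_named_steps(workflow)`.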

## Setting Up Observability

Now that we can run our agent as a service, we need to monitor its performance and behavior. The NeMo Agent Toolkit provides comprehensive observability features that help us understand what's happening inside our agent.

## Observability and Profiling in the NeMo Agent Toolkit

The NeMo Agent Toolkit offers comprehensive observability and profiling capabilities to monitor and optimize your workflows:

1. **Telemetry Options**:
   - **Logging**: Configure logs to console or file with different verbosity levels
   - **Tracing**: Track the flow of requests through your system
   - **Metrics**: Measure performance characteristics of your workflow

2. **Profiling Tools**:
   - **Token Usage Analysis**: Track and forecast token consumption
   - **Latency Analysis**: Identify performance bottlenecks
   - **Concurrency Analysis**: Understand parallel execution patterns

3. **Tracing Providers**:
   - **Phoenix Profiler**: A visualization tool by Arize AI for tracing and profiling
   - **OpenTelemetry Collector**: Standard collector for observability data
   - **Custom Providers**: Extensible system for custom telemetry exporters

In this notebook, we'll use the **Phoenix Profiler** to visualize the execution of our agent and understand its performance characteristics.

### 1. Understanding Observability in the NeMo Agent Toolkit

The NeMo Agent Toolkit supports multiple observability options, including:

- **Logging Providers**: Console logging and file-based logging with configurable verbosity levels
- **Tracing Providers**: Phoenix Profiler, OpenTelemetry Collector, and custom providers
- **Metrics Collection**: Performance measurements for optimization

For this notebook, we'll use the **Phoenix Profiler** for tracing. Phoenix is developed by Arize AI (https://github.com/Arize-ai/phoenix) and provides detailed insights into your agent's execution, including:

- Visual representation of the agent's reasoning process
- Timing information for each step and tool call
- Token usage statistics and bottleneck identification
- Hierarchical view of nested function calls

These features help with debugging, performance optimization, and understanding usage patterns.

### 2. Updating Configuration for Observability

Let's update our configuration file to enable observability features. We'll add a `telemetry` section to the `general` configuration that includes logging and tracing settings.

In [None]:
%%writefile workflows/math_tools/configs/observability_config.yml

general:
  use_uvloop: true
  telemetry:
    logging:
        console:
            _type: console
            level: WARN
    tracing:
        phoenix:
            _type: phoenix
            endpoint: http://localhost:7007/v1/traces
            project: math_tools_example

functions:
  calculator_exponent:
    _type: calculator_exponent
  calculator_modulus:
    _type: calculator_modulus
  calculator_square_root:
    _type: calculator_square_root
  calculator_add:
    _type: calculator_add
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0

workflow:
  _type: react_agent
  tool_names:
    - calculator_exponent
    - calculator_modulus
    - calculator_square_root
    - calculator_add
    - current_datetime
  llm_name: nim_llm
  verbose: true

### 3. Starting the Phoenix Server

Before running the agent, start a local Phoenix server so that the tracing endpoint configured above has something to receive the traces.


In [None]:
import subprocess
import time

# Start the Phoenix server using Popen to gain direct control over the process
# We also suppress the output by redirecting stdout and stderr
phoenix_process = subprocess.Popen(
    ["phoenix", "serve"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

# Give Phoenix a moment to start up
time.sleep(3)

print(f"Phoenix server started with PID: {phoenix_process.pid}")
print("You can access the Phoenix UI at: http://localhost:7007")
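
A fixed `time.sleep(3)` can be too short on a slow machine. As an alternative, you could poll the port until it accepts connections. This sketch uses only the standard library; `wait_for_port` is a hypothetical helper, not part of Phoenix or the toolkit.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts TCP connections, or False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 7007)` could replace the sleep in the cell above, using the port printed there. An open port only shows that the server is listening; it does not guarantee that the UI has finished loading.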

### 4. Running the NeMo Agent Toolkit with Observability Enabled

Now let's run our agent with observability enabled. This will generate logs and traces that we can use to monitor and debug our agent.

In [None]:
!nat run --config_file workflows/math_tools/configs/observability_config.yml --input "What is the square root of the current hour plus 5?"

After running the cell above (feel free to modify the query), open the Phoenix UI at http://localhost:7007. You should see traces arriving under the `math_tools_example` project.

### 5. Stopping the Phoenix Server

Let's clean up for the next notebook by stopping the Phoenix server we started. In a production system, you would not stop Phoenix.

In [None]:
# Check if the process object exists and is running
if 'phoenix_process' in locals() and phoenix_process.poll() is None:
    print(f"Stopping Phoenix server with PID: {phoenix_process.pid}...")
    
    # Send the termination signal to the process
    phoenix_process.terminate()
    
    try:
        # Wait for the process to terminate
        phoenix_process.wait(timeout=5)
        print("Phoenix server stopped successfully.")
    except subprocess.TimeoutExpired:
        # If it doesn't terminate gracefully, force kill it
        print("Server did not terminate gracefully. Forcing kill...")
        phoenix_process.kill()
        phoenix_process.wait()
        print("Phoenix server killed.")
    
    # Clean up the variable
    del phoenix_process
else:
    print("Phoenix server was not running or the process object was not found.")

## Introduction to Optimization

Now that we've evaluated our agent and observed its behavior, the natural next step is to **optimize** it. The NeMo Agent Toolkit provides a built-in optimizer that can automatically tune your workflow's parameters to improve performance.

### The Evaluate → Profile → Optimize Loop

The NeMo Agent Toolkit supports a powerful iterative development cycle:

1. **Evaluate** — Measure your agent's accuracy and quality using `nat eval`
2. **Profile** — Understand where time and tokens are spent (observability)
3. **Optimize** — Automatically search for better parameter settings using `nat optimize`

### How Does `nat optimize` Work?

The optimizer supports two complementary tuning modes:

- **Numeric Optimization (Optuna)**: Searches over numeric hyperparameters like `temperature`, `max_tokens`, and `top_p` using Bayesian optimization. Fast and efficient for continuous parameters.

- **Prompt Optimization (Genetic Algorithm)**: Evolves prompt templates using a genetic algorithm — generating, mutating, and recombining prompts across generations to find more effective phrasings.

Both modes can run together or independently, and support **multi-objective scoring** — optimizing for multiple metrics simultaneously (e.g., maximize accuracy while minimizing token usage).
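
Multi-objective scoring usually has no single best trial. One common way to reason about it is the Pareto front: the set of trials that no other trial beats on every objective. The sketch below is a generic illustration with made-up numbers, not the optimizer's implementation; `dominates`, `pareto_front`, and the trial fields are hypothetical.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if trial a is at least as good as b on both objectives and strictly better on one.

    Accuracy is maximized; tokens are minimized.
    """
    no_worse = a["accuracy"] >= b["accuracy"] and a["tokens"] <= b["tokens"]
    better = a["accuracy"] > b["accuracy"] or a["tokens"] < b["tokens"]
    return no_worse and better

def pareto_front(trials: list[dict]) -> list[dict]:
    """Return the trials that are not dominated by any other trial."""
    return [t for t in trials if not any(dominates(o, t) for o in trials if o is not t)]

# Made-up trial results for illustration
trials = [
    {"name": "A", "accuracy": 0.90, "tokens": 1200},
    {"name": "B", "accuracy": 0.85, "tokens": 800},
    {"name": "C", "accuracy": 0.80, "tokens": 1000},  # dominated by B
    {"name": "D", "accuracy": 0.90, "tokens": 1500},  # dominated by A
]
print([t["name"] for t in pareto_front(trials)])  # ['A', 'B']
```

Trials A and B remain because each wins on one objective: A is more accurate, B is cheaper. Choosing between them is a judgment about which trade-off matters more for your workload.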

### Creating an Optimizer Configuration

Let's create a simple optimizer configuration for our math tools agent. We'll use **numeric-only optimization** with just 3 trials to keep it lightweight. The optimizer will tune `temperature` and `max_tokens` for our LLM while measuring accuracy.

In [None]:
%%writefile workflows/math_tools/configs/optimize_config.yml

general:
  use_uvloop: true

functions:
  calculator_exponent:
    _type: calculator_exponent
  calculator_modulus:
    _type: calculator_modulus
  calculator_square_root:
    _type: calculator_square_root
  calculator_add:
    _type: calculator_add
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
    search_space:
      temperature:
        low: 0.0
        high: 0.5
        step: 0.1
      max_tokens:
        low: 256
        high: 1024
        step: 256
  eval_llm:
    _type: nim
    model_name: meta/llama-3.1-405b-instruct
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names:
    - calculator_exponent
    - calculator_modulus
    - calculator_square_root
    - calculator_add
    - current_datetime
  llm_name: nim_llm
  verbose: true

eval:
  general:
    output_dir: ./math_tools_optimize/
    dataset:
      _type: json
      file_path: workflows/math_tools/data/comprehensive_eval.json
  evaluators:
    math_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm

optimizer:
  output_path: ./math_tools_optimize/
  reps_per_param_set: 1
  eval_metrics:
    accuracy:
      evaluator_name: math_accuracy
      direction: maximize
  numeric:
    enabled: true
    n_trials: 3
  prompt:
    enabled: false
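
To get a feel for how much of the search space three trials can cover, you can enumerate the candidate values implied by each `low`/`high`/`step` triple. This sketch assumes both bounds are inclusive, which is how Optuna's stepped suggestions behave; whether the toolkit passes these values to Optuna unchanged is an assumption here. `grid_values` is a hypothetical helper.

```python
from itertools import product

def grid_values(low: float, high: float, step: float) -> list:
    """List the values low, low + step, ... up to and including high."""
    count = int(round((high - low) / step)) + 1
    # Round to avoid floating-point noise such as 0.30000000000000004
    return [round(low + i * step, 10) for i in range(count)]

temperatures = grid_values(0.0, 0.5, 0.1)
max_tokens = grid_values(256, 1024, 256)
combinations = list(product(temperatures, max_tokens))

print(temperatures)        # [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(max_tokens)          # [256, 512, 768, 1024]
print(len(combinations))   # 24
```

With `n_trials: 3`, the optimizer can try at most 3 of these 24 combinations, so treat the result as a quick demonstration rather than a thorough search.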

### Running the Optimizer

Let's run the optimizer. It will execute 3 trials with different parameter combinations and score each one using our accuracy evaluator.

In [None]:
!nat optimize --config_file=workflows/math_tools/configs/optimize_config.yml

### Inspecting Optimization Results

After the optimizer completes, it writes several artifacts to the output directory. Let's look at the trial results and the best configuration found.

In [None]:
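from pathlib import Path

# Re-imported so this cell also works if the earlier analysis cell was skipped
import pandas as pd
from IPython.display import display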

optimizer_output = Path("./math_tools_optimize/")

# Show trial results if available
trials_csv = optimizer_output / "trials_dataframe_params.csv"
if trials_csv.exists():
    trials_df = pd.read_csv(trials_csv)
    print("Optimization Trials:")
    display(trials_df)
else:
    print("No trials CSV found yet — run the optimizer cell above first.")

# Show the optimized config if available
optimized_config = optimizer_output / "optimized_config.yml"
if optimized_config.exists():
    print("\nOptimized Configuration:")
    print(optimized_config.read_text())

## Summary

In this notebook, we've explored how to evaluate, observe, and optimize our NeMo Agent Toolkit workflow. We've learned how to:

1. Create comprehensive evaluation datasets with test cases for different capabilities
2. Configure and run evaluations using RAGAS and Trajectory evaluators
3. Analyze evaluation results to understand agent performance
4. Handle time-based queries and dynamic responses in evaluations
5. Set up observability features using Phoenix Profiler by Arize AI for monitoring and debugging
6. Run the NAT optimizer to automatically tune LLM hyperparameters
7. Inspect optimization trial results and the best configuration found

These techniques form the **evaluate → observe → optimize** loop that helps ensure your agents perform reliably and efficiently.

### Going Deeper

> This section introduced the basics of NAT's optimization capabilities using a simple numeric search over LLM parameters. For a **production-grade deep dive** — including custom evaluators, prompt genetic algorithm optimization, Pareto trade-offs, and multi-objective tuning on a real-world email phishing workflow — see the **[Deep Dive](deep-dive/)** notebooks in the `deep-dive/` directory.