<a href="https://www.nvidia.com/dli"> <img src="images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;"> </a>

> **Deep Dive**: This notebook is part of the deep-dive series that extends [03_Evaluation_Observability_And_Optimization.ipynb](../03_Evaluation_Observability_And_Optimization.ipynb). In the previous notebooks, you were introduced to evaluation, observability, and optimization using a simple math agent. This deep-dive takes those same concepts to production depth using a real-world email phishing analyzer workflow with custom evaluators, advanced profiling, and multi-objective optimization.

# Email Phishing Analyzer Evaluation Notebook

Welcome! This notebook-style walkthrough is designed to help anyone new to the NeMo Agent Toolkit (NAT) build intuition for how workflow evaluation works. We will explore the Email Phishing Analyzer example, but the principles carry over to any NAT workflow you build.

## Orientation & Learning Goals

**Why evaluate?**

Evaluation is how you verify that agent behaviors are reliable before shipping them to users or chaining them into larger systems. With NAT you can:
- Run the entire workflow against a labeled dataset to surface regressions early.
- Layer multiple evaluators (LLM-based or deterministic) for richer insights than "pass/fail".
- Capture telemetry for model cost, latency, or prompt usage while scoring accuracy.

**In this notebook you will learn how to:**
- Inspect the phishing workflow architecture and understand how its tools cooperate.
- Configure NAT's evaluation stack, combining built-in evaluators with a custom metric.
- Execute `nat eval`, read the produced artifacts, and reason about the results.
- Iterate on prompts, parameters, and evaluators while keeping metrics trustworthy.

> **Mindset Shift:** Treat evaluation configs like tests. Commit them. Run them on every change. That way a "phishing detector" stays trustworthy even as you tweak prompts or swap models.

## System Prerequisites & Environment Check

In [None]:
# Install the phishing workflow package in editable mode
# ! uv pip install -e .
# ! uv pip install -U langchain
# Confirm the CLI entry point is available
! nat --version

**Before moving on:** To run every example end-to-end, you need an NVIDIA Build API key. Create an NVIDIA Build account and generate a key at https://build.nvidia.com/settings/api-keys, then run the cell below:

In [None]:
import getpass
import os

if "NVIDIA_API_KEY" not in os.environ:
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## Anchor Key Paths for the Session

In [None]:
from pathlib import Path
root = Path.cwd()
workflow_dir = root
config_path = workflow_dir / "configs" / "config.yml"
data_path = workflow_dir / "data" / "smaller_test.csv"
config_path, data_path

**Why this matters:** Treating these paths as variables avoids chasing typos later. It also reminds you where the evaluation artifacts will land: the file at `config_path` sets the output directory through `eval.general.output_dir`.

## Understand the Workflow Components

The phishing analyzer is built as a `react_agent` that coordinates several tools:

- `sensitive_info_detector`: asks an LLM whether the email requests sensitive data.
- `intent_classifier`: classifies likely attacker intent (credential theft, fraud, malware, etc.).
- `link_and_domain_analyzer`: a deterministic helper that spots suspicious URLs locally.
- `phishing_risk_aggregator`: aggregates the tool outputs into a structured verdict.
- `email_phishing_analyzer`: a convenience tool for single-step direct calls.

In [None]:
import yaml
config = yaml.safe_load(config_path.read_text())
config["workflow"], list(config["functions"].keys())

**Key takeaway:** Evaluation is only meaningful if you understand the workflow outputs. Here, the expected output is JSON with fields such as `is_likely_phishing`, `risk_score`, and `factors`. Keep that schema in mind when interpreting evaluator results.

## Mental Model for NAT Evaluation

NAT's evaluation pipeline is opinionated but flexible. Think of it as three building blocks that you plug together in `config.yml`:

1. **General settings (`eval.general`)** — Where to write outputs, which dataset to load, whether to stream verbose logs, and optional profiler/telemetry toggles.
2. **Evaluators (`eval.evaluators`)** — A map of evaluator names to evaluator configs (built-in or custom) that will score every workflow run. Multiple evaluators can score the same run.
3. **Shared resources** — Datasets, tool registry entries, and judge LLMs referenced by the evaluators. These typically live in other sections of the same config file (`dataset`, `llms`, `functions`).

```yaml
# Excerpt: eval.general block from config.yml
eval:
  general:
    output_dir: ./.tmp/eval/examples/evaluation_and_profiling/email_phishing_analyzer/original
    verbose: true
    dataset:
      _type: csv
      file_path: examples/evaluation_and_profiling/email_phishing_analyzer/data/smaller_test.csv
      id_key: "subject"
      structure:
        question_key: body
        answer_key: label
```

**Design notes:**
- `id_key` provides a stable identifier so every evaluator output can be joined back to the dataset row.
- `question_key`/`answer_key` tell NAT how to feed the workflow (`body`) and where to find ground truth (`label`).
- The nested `profiler` dictionary (scroll further down in `config.yml`) allows runtime forecasts, token accounting, and concurrency analysis to run alongside evaluation.
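
To make the key mapping concrete, here is a minimal sketch of how one CSV row splits into the three roles named in the `dataset` block. The `to_eval_item` helper, its output field names, and the sample rows are illustrative assumptions, not NAT internals; NAT performs an equivalent step when it loads the dataset.

```python
import csv
import io

# Hypothetical rows that mirror the column names used in the config excerpt.
SAMPLE_CSV = """subject,body,label
Account notice,Please confirm your password at http://example.test,phish
Lunch plans,Are we still on for noon?,benign
"""


def to_eval_item(row, id_key="subject", question_key="body", answer_key="label"):
    """Split one dataset row into an id, a workflow input, and a ground truth."""
    return {
        "id": row[id_key],
        "question": row[question_key],
        "expected_answer": row[answer_key],
        # Keep the whole row so extra columns remain available for debugging.
        "full_dataset_entry": dict(row),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
items = [to_eval_item(row) for row in rows]

# Duplicate ids would make evaluator outputs ambiguous to join back.
ids = [item["id"] for item in items]
ids_are_unique = len(set(ids)) == len(ids)

for item in items:
    print(item["id"], "->", item["expected_answer"])
```

Because `id_key` is `subject` here, two emails that share a subject line would collide, so a uniqueness check like the one above is worth running on any new dataset.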

## Meet the Built-in Evaluators

The workflow reuses several evaluators that ship with NAT:

- **RAGAS-backed metrics** (`rag_accuracy`, `rag_groundedness`, `rag_relevance`): Judge the generated answer against the ground truth and context. They rely on a judge LLM defined in `llms` (`nim_rag_eval_llm`).
- **Trajectory evaluator** (`trajectory_accuracy`): Asks a judge LLM to grade the entire agent tool-call sequence.

```yaml
# Excerpt: built-in evaluators in config.yml
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    rag_groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: nim_rag_eval_llm
    rag_relevance:
      _type: ragas
      metric: ContextRelevance
      llm_name: nim_rag_eval_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: nim_trajectory_eval_llm
```

**How to reason about the scores:**
- All of these metrics return floats in `[0, 1]`. Higher is better.
- If you receive a low `ResponseGroundedness`, inspect the retrieved evidence — the workflow might be hallucinating explanations.
- Low `trajectory_accuracy` scores usually mean the agent took suboptimal tool actions even if the final answer looks correct.

## Deep Dive on the Custom Phishing Evaluator

Built-in metrics are great, but this workflow also ships a purpose-built evaluator: `phishing_accuracy`. It translates the workflow's JSON verdict into classic binary accuracy against the dataset labels.

```python
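# Excerpt from src/nat_email_phishing_analyzer/evaluator_register.py
import json

# The three registration imports below are not part of the original excerpt.
# Their paths are assumed from the NAT custom-evaluator documentation; verify
# them against your installed version.
from nat.builder.evaluator import EvaluatorInfo
from nat.cli.register_workflow import register_evaluator
from nat.data_models.evaluator import EvaluatorBaseConfig
from nat.eval.evaluator.base_evaluator import BaseEvaluator
from nat.eval.evaluator.evaluator_model import EvalOutputItem


class PhishingAccuracyEvaluatorConfig(EvaluatorBaseConfig, name="phishing_accuracy"):
    metric_name: str = "accuracy"


@register_evaluator(config_type=PhishingAccuracyEvaluatorConfig)
async def register_phishing_accuracy_evaluator(config, _builder):
    class PhishingAccuracy(BaseEvaluator):
        async def evaluate_item(self, item):
            # Ground truth comes from the dataset row, normalized for comparison.
            label = str(item.full_dataset_entry.get("label", "")).strip().lower()
            expected_is_phish = label in {"phish", "phishing", "spam", "malicious"}

            # Prefer the structured JSON verdict; fall back to a substring
            # match when the output is not a JSON object.
            output = item.output_obj
            try:
                parsed = json.loads(output) if isinstance(output, str) else output
            except Exception:
                parsed = None
            if isinstance(parsed, dict):
                is_phish_pred = bool(parsed.get("is_likely_phishing", False))
            else:
                is_phish_pred = isinstance(output, str) and "likely a phishing" in output.lower()

            score = 1.0 if (is_phish_pred == expected_is_phish) else 0.0
            reasoning = {"expected_label": label, "predicted_is_phish": is_phish_pred}
            return EvalOutputItem(id=item.id, score=score, reasoning=reasoning)

    # Registration tail, not shown in the original excerpt. This is a sketch
    # that assumes BaseEvaluator's default constructor arguments.
    evaluator = PhishingAccuracy()
    yield EvaluatorInfo(config=config, evaluate_fn=evaluator.evaluate,
                        description="Phishing accuracy evaluator")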
```

**Key points:**
- Evaluators receive both the workflow output (`item.output_obj`) and the full dataset row (`item.full_dataset_entry`). You can use extra columns for richer debugging.
- It is safe to implement fallback heuristics (like substring matches) if you expect occasional non-JSON outputs.
- Registering an evaluator is as simple as decorating it with `@register_evaluator` and yielding an `EvaluatorInfo`.
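
The verdict-parsing logic is easier to trust once it can be tested without running NAT at all. The sketch below lifts it into two plain functions; `predict_is_phish` and `label_is_phish` are hypothetical names introduced here for illustration and do not exist in the workflow package.

```python
import json

PHISHING_LABELS = {"phish", "phishing", "spam", "malicious"}


def predict_is_phish(output):
    """Return True when a workflow output signals phishing.

    Mirrors the evaluator excerpt: use the JSON field when the output is a
    JSON object, otherwise fall back to a substring match on raw text.
    """
    try:
        parsed = json.loads(output) if isinstance(output, str) else output
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None
    if isinstance(parsed, dict):
        return bool(parsed.get("is_likely_phishing", False))
    return isinstance(output, str) and "likely a phishing" in output.lower()


def label_is_phish(label):
    """Normalize a dataset label and map it to a boolean ground truth."""
    return str(label).strip().lower() in PHISHING_LABELS


examples = [
    '{"is_likely_phishing": true, "risk_score": 0.9}',
    {"is_likely_phishing": False},
    "This email is likely a phishing attempt.",
    "Nothing suspicious here.",
]
predictions = [predict_is_phish(example) for example in examples]
print(predictions)
```

One caveat this makes visible: `bool(...)` treats any non-empty string as true, so a verdict of `{"is_likely_phishing": "false"}` (a string rather than a JSON boolean) would count as a phishing prediction. If your model sometimes emits quoted booleans, normalize them before scoring.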
  
## Explore the Evaluation Dataset

In [None]:
import pandas as pd
preview = pd.read_csv(data_path)
preview[["subject", "body", "label"]].head(5)

**Dataset schema refresher:**
- `subject` (string): used as the evaluation item ID.
- `body` (string): the email body fed into the agent.
- `label` (string): ground truth (`phish` or `benign`).
- Additional columns (`intents`, `source`, etc.) travel with the evaluation item and can inform debugging or future metrics.

> **Quality tip:** Start with a small dataset like `smaller_test.csv` to smoke-test evaluators. As the workflow stabilizes, grow the dataset and keep it version-controlled so improvements are measurable.

Before moving on, take a minute to inspect class balance and metadata coverage (for example, `preview['label'].value_counts()` or `preview['sender'].nunique()`). Knowing whether your dataset leans heavily toward "phish" or "benign" helps you pick metrics that matter (recall vs precision), and spotting sparse columns early prevents confusion when you later rely on them in evaluators.
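
As a starting point for the empty cell below, here is a small sketch of a class-balance check that also reports the majority-class baseline, meaning the accuracy a detector would get by always predicting the most common label. The `class_balance` helper and the sample labels are assumptions for illustration; in the notebook you would pass `preview["label"]` instead.

```python
from collections import Counter


def class_balance(labels):
    """Return each normalized label's share of the dataset."""
    counts = Counter(str(label).strip().lower() for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels with inconsistent casing and whitespace.
sample_labels = ["phish", "benign", "phish", "Phish ", "benign"]
shares = class_balance(sample_labels)

# Accuracy achieved by always guessing the most common class.
majority_baseline = max(shares.values()) if shares else 0.0

print(shares)
print(majority_baseline)
```

If the baseline is already high, a strong `phishing_accuracy` score says little on its own, which is a good reason to track recall on the `phish` class as well.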

In [None]:
# Your code here

## Sanity-Check the Evaluation Plan

Before hitting "run", double-check the pieces we rely on:

In [None]:
assert config_path.exists(), "Missing config file"
assert data_path.exists(), "Missing evaluation dataset"
assert "phishing_accuracy" in config["eval"]["evaluators"], "Custom evaluator not wired"
config["llms"].keys()

**Why pause here?**
- Failing fast on missing assets saves time, especially when collaborating with teammates.
- Listing the available LLMs reminds you which judge models will be billed during evaluation.

Consider setting `eval.general.output_dir` to a timestamped folder during experimentation so you keep historical runs side-by-side for comparison.
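
One way to get timestamped folders without editing the committed config is to derive a per-run copy of it. The sketch below works on the parsed config dictionary; `with_timestamped_output_dir` is a hypothetical helper, and the example dictionary is a stand-in for the `config` loaded earlier.

```python
import copy
from datetime import datetime
from pathlib import Path


def with_timestamped_output_dir(config, now=None):
    """Return a copy of config whose eval.general.output_dir ends in a timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    general = updated["eval"]["general"]
    general["output_dir"] = str(Path(general["output_dir"]) / stamp)
    return updated


# Stand-in for the parsed config.yml; only the relevant keys are shown.
example = {"eval": {"general": {"output_dir": "./.tmp/eval/phishing"}}}
stamped = with_timestamped_output_dir(example, now=datetime(2025, 1, 2, 3, 4, 5))
print(stamped["eval"]["general"]["output_dir"])
```

To use the result, write `stamped` to a separate file with `yaml.safe_dump` and pass that file to `nat eval --config_file`, so each run lands in its own folder and the original config stays unchanged.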

## Run the Evaluation Pipeline

In [None]:
! nat eval --config_file configs/config.yml

**What you should see:**
- NAT prints the workflow summary, then streams progress as each dataset row is processed.
- For each evaluator you'll get an aggregate score at the end. With `verbose: true`, per-item logs appear too.
- Outputs are written under `eval.general.output_dir`. If the directory doesn't exist NAT creates it.

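> **Troubleshooting:** If you encounter `[429] Too Many Requests`, lower `eval.general.max_concurrency` so fewer LLM calls run in parallel. After a network hiccup, keep the output directory intact and check `nat eval --help` for resume options (for example `--skip_completed_entries`, which is meant to reuse entries that already finished) rather than assuming a plain rerun picks up where it left off.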

Need to change datasets or tweak runtime behaviour without editing YAML? Run `nat eval --help` to discover overrides such as `--dataset`, `--endpoint`, `--reps`, or `--override`. These flags make it easy to spin up quick experiments while keeping the committed config untouched.

## Inspect Evaluation Artifacts

In [None]:
import json
output_dir = Path(config["eval"]["general"]["output_dir"])
phishing_report = json.loads((output_dir / "phishing_accuracy_output.json").read_text())
rag_accuracy_report = json.loads((output_dir / "rag_accuracy_output.json").read_text())

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"The average phishing report score was {phishing_report['average_score']} and the average RAGAS accuracy was {rag_accuracy_report['average_score']}")

Dive deeper into individual examples:

In [None]:
first_item = phishing_report["eval_output_items"][0]
first_item["id"], first_item["score"], first_item["reasoning"]

**Artifacts generated:**
- `workflow_output.json`: every workflow run plus intermediate steps — perfect for understanding *why* an evaluator scored an item poorly.
- `<evaluator_name>_output.json`: evaluator-specific metrics. Many built-in evaluators include `judgment` or `explanation` fields; custom ones can add any debugging payload you need.
- Optional profiler outputs: CSV/JSON snapshots that estimate runtime, latency bottlenecks, or token uniqueness if those toggles are enabled in the config.

Each evaluator JSON shares a consistent structure (`average_score`, `eval_output_items`, optional `metadata`), which means you can parse them with a single helper and aggregate across runs. Consider enriching `reasoning` with extra context (for example, GUIDs or remediation links) so triage engineers can jump straight to action items when a score dips.
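
The single-helper idea can be sketched as follows. `summarize_reports` is a hypothetical function name, and the demo writes made-up scores to a temporary directory so that it runs without a real evaluation; point it at `output_dir` to summarize an actual run.

```python
import json
import tempfile
from pathlib import Path


def summarize_reports(output_dir):
    """Map evaluator name to average_score for every *_output.json report."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*_output.json")):
        if path.name == "workflow_output.json":
            continue  # workflow trace, not an evaluator report
        report = json.loads(path.read_text())
        if isinstance(report, dict) and "average_score" in report:
            name = path.name.removesuffix("_output.json")  # Python 3.9+
            summary[name] = report["average_score"]
    return summary


# Demo with hypothetical scores; these numbers are not real results.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "phishing_accuracy_output.json").write_text(
        json.dumps({"average_score": 0.8, "eval_output_items": []}))
    (tmp_path / "rag_accuracy_output.json").write_text(
        json.dumps({"average_score": 0.6, "eval_output_items": []}))
    (tmp_path / "workflow_output.json").write_text(json.dumps([]))
    demo_summary = summarize_reports(tmp_path)

print(demo_summary)
```

Running the same helper over several timestamped output folders gives you a compact table of scores per run, which makes regressions easy to spot.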

In [None]:
# Explore more evaluation output here

## Wrap-Up & Next Steps

Congratulations! You now understand how to:
- Wire up datasets, built-in evaluators, and custom metrics in NAT.
- Execute `nat eval` and interpret the results with confidence.
- Iterate thoughtfully, keeping evaluation at the center of your workflow development cycle.

**Where to go from here:**
1. Expand the dataset with real phishing/benign emails your team cares about.
2. Add evaluators that capture business-specific risks (false negatives may cost more than false positives).
3. Automate: run this evaluation in CI or scheduled jobs so you always know when performance drifts.
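
Steps 2 and 3 can be combined into a small gate that a CI job runs after `nat eval`. The sketch below derives recall on the phishing class from the `reasoning` payload that `phishing_accuracy` writes. It assumes `reasoning` is stored as a JSON object with the `expected_label` and `predicted_is_phish` keys shown in the evaluator excerpt; the function names, the sample report, and the threshold are illustrative.

```python
PHISHING_LABELS = {"phish", "phishing", "spam", "malicious"}


def confusion_counts(report):
    """Count TP/FP/FN/TN from the reasoning payload of each eval item."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for item in report["eval_output_items"]:
        reasoning = item["reasoning"]
        expected = reasoning["expected_label"] in PHISHING_LABELS
        predicted = bool(reasoning["predicted_is_phish"])
        # "t"/"f" records whether the prediction was right, "p"/"n" what it was.
        key = ("t" if expected == predicted else "f") + ("p" if predicted else "n")
        counts[key] += 1
    return counts


def phishing_recall(report):
    """Share of true phishing emails that the workflow flagged."""
    counts = confusion_counts(report)
    actual_phish = counts["tp"] + counts["fn"]
    return counts["tp"] / actual_phish if actual_phish else 0.0


# Hypothetical report: one missed phish, everything else correct.
sample_report = {
    "average_score": 0.75,
    "eval_output_items": [
        {"id": "a", "score": 1.0,
         "reasoning": {"expected_label": "phish", "predicted_is_phish": True}},
        {"id": "b", "score": 0.0,
         "reasoning": {"expected_label": "phish", "predicted_is_phish": False}},
        {"id": "c", "score": 1.0,
         "reasoning": {"expected_label": "benign", "predicted_is_phish": False}},
        {"id": "d", "score": 1.0,
         "reasoning": {"expected_label": "benign", "predicted_is_phish": False}},
    ],
}

counts = confusion_counts(sample_report)
recall = phishing_recall(sample_report)
MIN_RECALL = 0.9  # illustrative threshold; choose one that fits your risk tolerance
gate_passed = recall >= MIN_RECALL
print(counts, recall, gate_passed)
```

In this made-up report the accuracy is 0.75 while recall is only 0.5, which is exactly the kind of gap a false-negative-sensitive gate is meant to catch. In CI, exit with a non-zero status when `gate_passed` is false.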

Happy evaluating! Share your discoveries with the team so everyone benefits from the improvements you make.