<a href="https://www.nvidia.com/dli"> <img src="images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;"> </a>

> **Deep Dive**: This notebook is part of the deep-dive series that extends [03_Evaluation_Observability_And_Optimization.ipynb](../03_Evaluation_Observability_And_Optimization.ipynb). In the previous notebooks, you were introduced to evaluation, observability, and optimization using a simple math agent. This deep-dive takes those same concepts to production depth using a real-world email phishing analyzer workflow with custom evaluators, advanced profiling, and multi-objective optimization.

# Email Phishing Analyzer Optimizer Notebook

Welcome to the third leg of the Email Phishing Analyzer journey. After learning how to evaluate and profile the workflow, this notebook-style guide shows you how to *improve* it using the NeMo Agent Toolkit Optimizer. We will follow the reference guide in [`docs/source/reference/optimizer.md`](../../../docs/source/reference/optimizer.md) and make it concrete for this workflow.

## Orientation & What You Will Learn

By the end of this notebook you will know how to:
- Discover which workflow components are tunable and how their search spaces are defined.
- Understand the optimizer configuration in `config_optimizer.yml`, including multi-objective scoring and prompt GA settings.
- Run the optimizer with both numeric and prompt modes engaged.
- Inspect optimizer artifacts (`optimized_config.yml`, `ga_history_prompts.csv`, Pareto plots, etc.) and decide what to ship.
- Close the loop by validating tuned configs with evaluation and profiling.

> **Mindset:** Treat the optimizer as your lab partner. It explores broad parameter spaces quickly, but you direct the experiment by choosing metrics, search spaces, and stopping criteria.

## Prerequisites & Environment Warm-Up

In [None]:
# Install the phishing workflow package in editable mode
! uv pip install -e .

# Confirm the CLI entry point is available
! nat --version

**Before moving on:** To run every example end-to-end, you need an NVIDIA Build API key. Create an NVIDIA Build account, generate a key at https://build.nvidia.com/settings/api-keys, and then run the cell below.

In [None]:
import getpass
import os

if "NVIDIA_API_KEY" not in os.environ:
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## Establish Paths & Helpers

In [None]:
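from pathlib import Path

# The notebook is assumed to run from the workflow's root directory.
root = Path.cwd()
workflow_dir = root
config_opt_path = workflow_dir / "configs" / "config_optimizer.yml"
# Must match `optimizer.output_path` in config_optimizer.yml.
optimizer_output_dir = Path("eval_output_with_optimizer")
config_opt_path, optimizer_output_dir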

## Meet the Optimizable Fields in Code

The optimizer can only tune parameters marked with `OptimizableField`. Let’s confirm which knobs the phishing workflow exposes.

In [None]:
import inspect
from nat_email_phishing_analyzer import register as phishing_register

for name, cls in inspect.getmembers(phishing_register, inspect.isclass):
    if not hasattr(cls, "model_fields"):
        continue
    optimizable = []
    for field_name, field_info in cls.model_fields.items():
        extras = getattr(field_info, "json_schema_extra", {}) or {}
        if extras.get("optimizable"):
            search_space = extras.get("search_space")
            optimizable.append((field_name, search_space))
    if optimizable:
        print(name)
        for field_name, search_space in optimizable:
            print(f"  - {field_name}: search_space={search_space}")

Alternatively, read the source directly:
- `sensitive_info_detector`, `intent_classifier`: tunable `llm` and `prompt`.
- `link_and_domain_analyzer`: tunable `min_suspicious_score`.
- `phishing_risk_aggregator`: tunable explanation LLM, weights, and decision threshold.
- LLM configs (`llama_3_405`, `llama_3_70`) expose temperature/top_p/max_tokens.

**Key takeaway:** Every config class with tunable parameters inherits from `OptimizableMixin`, and the YAML names the parameters to tune under `optimizable_params`. Read the code and the config together to see the whole search space.

While you are exploring the printed search spaces, note which ones are categorical (`values`) versus numeric (`low`/`high`). Categorical knobs typically converge faster but may require broader coverage of options, whereas numeric knobs benefit from more trials so Optuna can narrow in on precise values.
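
To make the categorical-versus-numeric distinction concrete, here is a minimal sketch that labels search-space entries by kind. It assumes each search space is available as a plain dict with either a `values` key or `low`/`high` keys; if the toolkit hands you an object instead, convert it to a dict first. The helper name `classify_search_space` and the sample entries are hypothetical and not part of the toolkit.

```python
def classify_search_space(space):
    """Label a search-space dict as 'categorical', 'numeric', or 'unknown'."""
    if not isinstance(space, dict):
        return "unknown"
    if space.get("values") is not None:
        return "categorical"
    if "low" in space and "high" in space:
        return "numeric"
    return "unknown"


# Hypothetical entries for illustration; substitute the ones printed above.
example_spaces = {
    "llm": {"values": ["llama_3_405", "llama_3_70"]},
    "temperature": {"low": 0.0, "high": 0.8, "step": 0.1},
}

kinds = {name: classify_search_space(space) for name, space in example_spaces.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

A quick tally like this can help you decide how many numeric trials to budget: the more numeric knobs you expose, the more trials the search is likely to need.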

## Decode the Optimizer Configuration

Open `config_optimizer.yml` and focus on three sections: `functions`, `llms`, and `optimizer`.

```yaml
# Snippet: optimizer section
optimizer:
  output_path: eval_output_with_optimizer
  reps_per_param_set: 1
  eval_metrics:
    rag_accuracy:
      evaluator_name: rag_accuracy
      direction: maximize
    rag_groundedness:
      evaluator_name: rag_groundedness
      direction: maximize
    token_efficiency:
      evaluator_name: token_efficiency
      direction: minimize
    latency:
      evaluator_name: llm_latency
      direction: minimize

  numeric:
    enabled: true
    n_trials: 5

  prompt:
    enabled: true
    prompt_population_init_function: prompt_init
    prompt_recombination_function: prompt_recombination
    ga_generations: 3
    ga_population_size: 3
    ga_diversity_lambda: 0.3
    ga_parallel_evaluations: 1
```

**What this tells us:**
- Each trial is scored on four metrics at once. The optimizer normalizes each metric according to its direction (maximize or minimize) and combines them into a single score (`harmonic` mode by default unless overridden).
- Numeric mode runs Optuna for five trials. Prompt mode runs a three-generation GA with a small population (great for quick demos).
- Prompt initialization and recombination functions are defined in the config’s `functions` section so the optimizer knows which LLM (`prompt_optimizer`) to call for mutations.

Consider experimenting with `multi_objective_combination_mode` (`harmonic`, `sum`, or `chebyshev`) and metric weights when your priorities shift. Emphasizing latency, for example, can push the optimizer toward lighter-weight models even if accuracy dips slightly.
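
To build intuition for how the three combination modes behave, the sketch below applies textbook versions of each to two normalized scores. It is an illustration under stated assumptions, not the toolkit's implementation: the helpers `normalize` and `combine`, the worst/best bounds, and the sample numbers are all hypothetical, and the optimizer's exact formulas may differ.

```python
def normalize(value, worst, best):
    """Map a raw metric onto [0, 1], where 1 is best.

    Passing best < worst handles 'minimize' metrics such as latency.
    """
    if best == worst:
        return 1.0
    score = (value - worst) / (best - worst)
    return min(1.0, max(0.0, score))


def combine(scores, weights, mode="harmonic"):
    """Collapse normalized scores (higher is better) into one number."""
    total_w = sum(weights[name] for name in scores)
    if mode == "sum":
        # Weighted arithmetic mean.
        return sum(weights[n] * s for n, s in scores.items()) / total_w
    if mode == "harmonic":
        # Weighted harmonic mean; a single poor score drags the total down.
        eps = 1e-9
        return total_w / sum(weights[n] / max(s, eps) for n, s in scores.items())
    if mode == "chebyshev":
        # Judge by the worst weighted shortfall from the ideal value of 1.0.
        return 1.0 - max(weights[n] * (1.0 - s) for n, s in scores.items())
    raise ValueError(f"unknown mode: {mode}")


# Made-up numbers: 80% accuracy, and 6 s latency on an assumed 2-10 s scale.
scores = {
    "rag_accuracy": normalize(0.80, worst=0.0, best=1.0),
    "latency": normalize(6.0, worst=10.0, best=2.0),
}
weights = {"rag_accuracy": 1.0, "latency": 1.0}
for mode in ("harmonic", "sum", "chebyshev"):
    print(mode, round(combine(scores, weights, mode), 3))
```

In this sketch the harmonic and Chebyshev modes punish the weaker metric more than the plain sum does, which is the usual reason to prefer them when no single objective should be sacrificed.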

## Visualize Search Spaces at Runtime

Override search spaces dynamically to experiment without editing source files.

In [None]:
import yaml
cfg = yaml.safe_load(config_opt_path.read_text())

# Example: tighten decision threshold range and expand temperature
cfg['functions']['phishing_risk_aggregator'].setdefault('search_space', {})['decision_threshold'] = {
    'low': 0.4,
    'high': 0.7,
    'step': 0.05,
}
cfg['llms']['llama_3_70'].setdefault('search_space', {})['temperature'] = {
    'low': 0.0,
    'high': 0.8,
    'step': 0.1,
}

config_opt_experiment = workflow_dir / 'configs' / 'config_optimizer_experiment.yml'
config_opt_experiment.write_text(yaml.safe_dump(cfg))
config_opt_experiment

**Reminder:** Any parameter marked as optimizable but lacking a search space in both code and config leads to a runtime error, so provide overrides where needed.
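
A pre-flight check can surface missing search spaces before a long run starts. The sketch below assumes the YAML layout used in this notebook, where each entry under `functions` or `llms` may carry an `optimizable_params` list and a `search_space` mapping. The helper `missing_search_spaces` and the sample config are hypothetical. Because it only inspects the config, a parameter it flags may still have a search space declared in code, so treat the result as a list of candidates to double-check.

```python
def missing_search_spaces(cfg, sections=("functions", "llms")):
    """Return (section, component, param) triples with no search space in the config."""
    missing = []
    for section in sections:
        for component, settings in (cfg.get(section) or {}).items():
            if not isinstance(settings, dict):
                continue
            params = settings.get("optimizable_params") or []
            spaces = settings.get("search_space") or {}
            for param in params:
                if param not in spaces:
                    missing.append((section, component, param))
    return missing


# Hypothetical config fragment for illustration only.
sample_cfg = {
    "functions": {
        "phishing_risk_aggregator": {
            "optimizable_params": ["decision_threshold"],
            "search_space": {
                "decision_threshold": {"low": 0.4, "high": 0.7, "step": 0.05}
            },
        }
    },
    "llms": {"llama_3_70": {"optimizable_params": ["temperature"]}},
}

gaps = missing_search_spaces(sample_cfg)
print(gaps)
```

In the notebook you could pass the `cfg` dict loaded earlier instead of `sample_cfg`.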

Version these experiment configs alongside your codebase. When you discover tighter bounds that work well (for example, a narrower `decision_threshold` window), you can promote them into the main configuration and keep the history of how you got there.

## Understand the Optimizer’s Evaluation Loop

The optimizer reuses the `eval` block in the config. That means:
- Each trial runs the full phishing workflow on `smaller_test.csv`.
- Evaluators (`rag_accuracy`, `rag_groundedness`, `token_efficiency`, `llm_latency`, `phishing_accuracy`) score the outputs. Only the evaluators referenced under `optimizer.eval_metrics` feed the optimization objective.

Once the quick demo works, raise `reps_per_param_set` above 1 so each trial averages multiple runs. Averaging stabilizes the metrics when evaluators rely on nondeterministic judges or when the workflow itself has inherent variability.
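
The effect of repeating a parameter set can be sketched in a few lines: average the repeated scores and look at their spread. The helper `summarize_reps` and the sample scores are made up for illustration, and the optimizer may aggregate repetitions differently.

```python
from statistics import mean, stdev


def summarize_reps(scores):
    """Average repeated scores for one parameter set and report their spread."""
    if not scores:
        raise ValueError("need at least one score")
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "stdev": spread, "n": len(scores)}


# Made-up accuracy scores from three repetitions of the same parameter set.
summary = summarize_reps([0.70, 0.80, 0.90])
print(summary)
```

A large spread relative to the gap between competing trials suggests that a single repetition is not enough to rank them reliably.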

## Launch the Optimizer - This will take a while to run!

In [None]:
! nat optimize --config_file configs/config_optimizer.yml

**What to watch in the log:**
- Optuna trial summaries showing parameter suggestions and objective scores.
- GA generation summaries with fitness rankings and diversity scores.
- References to evaluator outputs stored per trial.
- Checkpoints for the best numeric trial and best prompt set so far.

If the run stops unexpectedly (for instance, due to a transient rate limit), rerun the command once the issue clears. At the moment the study is in-memory, so the optimizer restarts; however, any per-trial configs already written under the output directory give you a breadcrumb trail to resume analysis manually.
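
If transient failures are common in your environment, a small wrapper can rerun the command for you. This is a generic sketch built on the standard library; `run_with_retries` is a hypothetical helper, not a toolkit feature, and every retry restarts the study from scratch as described above.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_seconds=60.0):
    """Run cmd until it exits with code 0 or attempts run out; return the last exit code."""
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            print(
                f"Attempt {attempt} failed (exit {returncode}); "
                f"retrying in {delay_seconds}s",
                file=sys.stderr,
            )
            time.sleep(delay_seconds)
    return returncode


# Usage, left commented out so the cell does not launch a long run by accident:
# run_with_retries(["nat", "optimize", "--config_file", "configs/config_optimizer.yml"])
```

A fixed delay keeps the sketch short; for rate limits, a delay that grows with each attempt is often a better fit.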

## Explore the Optimizer Output Directory

After the run completes, list the output directory defined in `optimizer.output_path`.

Expect to find (names may vary slightly):
- `optimized_config.yml`: Ready-to-run configuration with the best overall parameters.
- `best_params.json`: Raw parameter dictionary for the selected trial.
- `trials_dataframe_params.csv`: Flat table of numeric trials and their scores.
- `pareto_front_2d.png`, `pareto_parallel_coordinates.png`, `pareto_pairwise_matrix.png`: Visual summaries.
- `ga_history_prompts.csv`, `optimized_prompts.json`, `optimized_prompts_gen*.json`: Prompt GA artifacts.
- `prompt_population_generation_<N>.jsonl`: Optional per-generation dumps.

Because checkpoints are written incrementally, you can stop a run after numeric trials finish and still inspect the resulting study, plots, and prompt generations without rerunning everything from scratch.
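
Because file names and subdirectories may vary, it can help to search for the expected artifacts rather than assume their exact location. The sketch below looks for each file name anywhere under the output directory. The helper `locate_artifacts` is hypothetical, and the directory name is taken from the `output_path` value in the configuration snippet shown earlier.

```python
from pathlib import Path


def locate_artifacts(output_dir, names):
    """Map each expected file name to the first matching path under output_dir, or None."""
    output_dir = Path(output_dir)
    found = {}
    for name in names:
        matches = sorted(output_dir.rglob(name)) if output_dir.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


expected = [
    "optimized_config.yml",
    "best_params.json",
    "trials_dataframe_params.csv",
    "ga_history_prompts.csv",
    "optimized_prompts.json",
]

for name, path in locate_artifacts("eval_output_with_optimizer", expected).items():
    print(f"{name}: {path if path else 'missing'}")
```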

In [None]:
sorted(optimizer_output_dir.glob("**/*"))[:20]

## Inspect Top Trials in Pandas

In [None]:
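import pandas as pd

trials_csv = optimizer_output_dir / "trials_dataframe_params.csv"
if trials_csv.exists():
    trials_df = pd.read_csv(trials_csv)
else:
    # Fall back to an empty frame so the cell does not raise NameError.
    trials_df = pd.DataFrame()
    print(f"{trials_csv} not found; run the optimizer first.")

# Show the first five rows (file order, not ranked by score).
trials_df.head()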

## Dive into Prompt Optimization Results

The cell at the end of this section loads the GA history and the final prompts so you can see how prompts evolved generation by generation. The GA history file also includes diversity penalties and selection metadata when `ga_diversity_lambda > 0`.

Prompt optimization tends to be the longer-running phase because every mutation issues one or more LLM calls. If progress plateaus, enlarge the population (`ga_population_size`), increase `ga_generations`, or tweak mutation rates so fresh ideas continue entering the pool.
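
One simple way to decide whether progress has plateaued is to compare the best score of the most recent generations with the best score seen before them. The sketch below assumes you have already extracted one best-fitness value per generation (higher is better). The helper `has_plateaued`, its `patience` and `min_delta` defaults, and the sample values are hypothetical.

```python
def has_plateaued(best_by_generation, patience=2, min_delta=0.01):
    """True if the last `patience` generations improved the best score by less than min_delta."""
    if len(best_by_generation) <= patience:
        return False  # Not enough history to judge.
    earlier_best = max(best_by_generation[:-patience])
    recent_best = max(best_by_generation[-patience:])
    return recent_best - earlier_best < min_delta


# Made-up best-fitness values, one per generation.
stalled = has_plateaued([0.61, 0.70, 0.702, 0.703])
still_improving = has_plateaued([0.61, 0.70, 0.78])
print(stalled, still_improving)
```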

In [None]:
import json

ga_history = optimizer_output_dir / "ga_history_prompts.csv"
final_prompts = optimizer_output_dir / "optimized_prompts.json"

if ga_history.exists():
    hist_df = pd.read_csv(ga_history)
    # Print explicitly: a bare expression inside an `if` block is not displayed.
    print(hist_df.head())

if final_prompts.exists():
    prompts = json.loads(final_prompts.read_text())
    for name, prompt_text in prompts.items():
        # str() guards against non-string values in the JSON.
        print(f"Prompt '{name}' preview:\n{str(prompt_text)[:200]}...\n")

## Explore Pareto Trade-Offs

When optimizing multiple metrics, there may not be a single "best" trial. Review Pareto plots to select a configuration that matches your risk appetite.

In [None]:
pareto_plot = optimizer_output_dir / "plots" / "pareto_parallel_coordinates.png"
pareto_plot.exists()

In [None]:
from IPython.display import Image
# Convert the Path to str for compatibility across IPython versions.
Image(filename=str(pareto_plot))

For deeper analysis, read the `trials_dataframe_params.csv` file into pandas and recreate Pareto filters interactively. This is handy when you want to annotate candidate configurations with business-specific thresholds or visualize them directly inside a notebook.
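
A Pareto filter is short enough to write by hand. The sketch below keeps every trial that no other trial beats on all metrics at once. It works on plain dicts so it stays self-contained; with pandas, rows could come from `trials_df.to_dict("records")`. The metric names `accuracy` and `latency` and the sample trials are made up, so substitute the column names found in your CSV.

```python
def dominates(a, b, directions):
    """True if trial a is at least as good as b on every metric and strictly better on one."""
    at_least_as_good = True
    strictly_better = False
    for metric, direction in directions.items():
        sign = 1.0 if direction == "maximize" else -1.0
        diff = sign * (a[metric] - b[metric])
        if diff < 0:
            at_least_as_good = False
        elif diff > 0:
            strictly_better = True
    return at_least_as_good and strictly_better


def pareto_front(trials, directions):
    """Keep only the trials that no other trial dominates."""
    return [
        t for t in trials
        if not any(dominates(other, t, directions) for other in trials if other is not t)
    ]


# Made-up trials for illustration.
directions = {"accuracy": "maximize", "latency": "minimize"}
trials = [
    {"trial": 0, "accuracy": 0.90, "latency": 8.0},
    {"trial": 1, "accuracy": 0.85, "latency": 4.0},
    {"trial": 2, "accuracy": 0.80, "latency": 6.0},  # dominated by trial 1
]
front = pareto_front(trials, directions)
print([t["trial"] for t in front])
```

Once you have the front, applying a business threshold (for example, a maximum acceptable latency) is a simple extra filter over the surviving trials.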

## Next Experiments

1. **Experiment with weights:** Adjust `eval_metrics.<metric>.weight` and `multi_objective_combination_mode` to emphasize the trade-offs you care about.
2. **Increase robustness:** Set `reps_per_param_set` above 1 for noisier datasets. Averaging repeated runs stabilizes scores and rankings.
3. **Expand datasets:** Swap `smaller_test.csv` for a larger evaluation set to tune against production-like data.
4. **Hybrid search:** Start with numeric tuning, lock in the best LLM hyperparameters, then rerun prompt GA only.
5. **Automate:** Add `nat optimize` to CI (nightly or weekly) and alert on improvements or regressions in best trial scores.
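
For the automation idea, the alerting logic can be as simple as comparing the latest best score with a stored baseline. The sketch below is one possible shape for that check. The helper `compare_to_baseline`, the tolerance values, and the sample scores are hypothetical, and how you load the scores from the optimizer output is left to your pipeline.

```python
def compare_to_baseline(current, baseline, direction="maximize", tolerance=0.02):
    """Classify the current best score as 'improved', 'regressed', or 'unchanged'."""
    delta = current - baseline
    if direction == "minimize":
        delta = -delta  # For minimized metrics, a lower value is an improvement.
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "unchanged"


# Made-up scores for illustration.
accuracy_status = compare_to_baseline(0.84, 0.80)
latency_status = compare_to_baseline(7.5, 6.0, direction="minimize", tolerance=0.5)
print(accuracy_status, latency_status)
```

The tolerance keeps run-to-run noise from triggering alerts; size it from the spread you observe across repeated runs.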

Document the context for each optimizer run (dataset snapshot, metric weights, notable prompts) in a short `README` inside the output directory. That breadcrumb trail makes it far easier to defend decisions and revisit successful experiments later.

Congratulations! You now have a full loop: *evaluate*, *profile*, and *optimize* the Email Phishing Analyzer. Keep iterating and share your best configs with the team so everyone benefits.

Happy optimizing!