# Integrating Existing LangGraph Agents with NeMo Agent Toolkit

In this notebook, you'll learn how to integrate any existing LangGraph agent with NeMo Agent Toolkit using the `langgraph_wrapper` workflow type.

We'll use LangGraph's Deep Research agent as a comprehensive example to demonstrate how you can wrap existing LangGraph agents so they work seamlessly with NeMo Agent Toolkit features like configurable LLMs, telemetry and observability with Phoenix, and comprehensive evaluation frameworks, all without refactoring the original agent code.

The techniques shown here apply to any LangGraph agent, making it easy to add powerful capabilities provided by NeMo Agent Toolkit to your existing LangGraph applications.

**Note:** The Deep Research agent is a complex multi-agent system that performs extensive web searches, planning, and synthesis. As a result, workflow execution may take several minutes per query. This is expected behavior due to the agent's thorough research methodology.

# Table of Contents

**Note:** This notebook runs from the NeMo Agent Toolkit repository root directory. All file paths are relative to the repo root.

- [0.0) Setup](#setup)
  - [0.1) Prerequisites](#prereqs)
  - [0.2) API Keys](#api-keys)
  - [0.3) Installing Dependencies](#installing-deps)
- [1.0) About Our Example: The Deep Research Agent](#understanding-agent)
- [2.0) Running the Agent with NeMo Agent Toolkit](#running-basic)
  - [2.1) The Configuration File](#config-file)
  - [2.2) Running Your First Query](#first-query)
- [3.0) Making the Agent Configurable](#configurable-llms)
  - [3.1) Understanding the Configurable Agent](#understanding-config)
  - [3.2) Running with Different LLMs](#running-different-llms)
- [4.0) Adding Telemetry with Phoenix](#telemetry)
  - [4.1) Starting Phoenix](#starting-phoenix)
  - [4.2) Running with Telemetry](#running-telemetry)
  - [4.3) Viewing Traces in Phoenix](#viewing-traces)
- [5.0) Evaluating Agent Performance](#evaluation)
  - [5.1) Setting Up Evaluation](#setup-eval)
  - [5.2) Running Evaluation](#running-eval)
  - [5.3) Analyzing Results](#analyzing-results)
- [6.0) Next Steps](#next-steps)

<span style="color:rgb(0, 31, 153); font-style: italic;">Note: In Google Colab use the Table of Contents tab to navigate.</span>

## Important: Working Directory

**This notebook is designed to run from the NeMo Agent Toolkit repository root directory.**

All paths in this notebook are relative to the repository root. If you're running this notebook from a different location, the setup cells will automatically change to the repository root directory for you.


<a id="setup"></a>
# 0.0) Setup

<a id="prereqs"></a>
## 0.1) Prerequisites

- **Platform:** Linux, macOS, or Windows
- **Python:** version 3.11, 3.12, or 3.13
- **Python Packages:** `uv` (for package management)
- **Docker:** (optional, for running Phoenix locally)

<a id="api-keys"></a>
## 0.2) API Keys

For this notebook, you will need the following API keys:

- **NVIDIA Build:** Obtain an NVIDIA Build API Key by creating an [NVIDIA Build](https://build.nvidia.com) account and generating a key at https://build.nvidia.com/settings/api-keys
- **Tavily:** Obtain a Tavily API Key by creating a [Tavily](https://www.tavily.com/) account and generating a key at https://app.tavily.com/home (generous free tier available)
- **Anthropic API Key** (optional): Required only for Section 2.0, which runs the original Deep Research agent with its default Claude model. You can skip Section 2.0 and start directly from Section 3.0 if you don't have an Anthropic API key.

Then run the cell below to set your API keys:

In [None]:
import getpass
import os

import dotenv

dotenv.load_dotenv(override=True)

if "NVIDIA_API_KEY" not in os.environ:
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

if "TAVILY_API_KEY" not in os.environ:
    tavily_api_key = getpass.getpass("Enter your Tavily API key: ")
    os.environ["TAVILY_API_KEY"] = tavily_api_key

<a id="installing-deps"></a>
## 0.3) Installing Dependencies

First, we need to install `uv`, which offers parallel downloads and faster dependency resolution:

In [None]:
!pip install uv

Now install NeMo Agent Toolkit with the LangChain subpackage:

In [None]:
%%bash
uv pip show -q "nvidia-nat-langchain"
if [ $? -ne 0 ]; then
    uv pip install "nvidia-nat[langchain]"
else
    echo "nvidia-nat[langchain] is already installed"
fi

Next, we need to install the Deep Research agent dependencies. The Deep Research agent comes from LangChain's <!-- vale off -->[`Deepagent Quickstarts`](https://github.com/langchain-ai/deepagents-quickstarts) repository.<!-- vale on -->

**Note:** This notebook is designed to run from the NeMo Agent Toolkit repository root. The cell below will ensure we're in the correct directory and install dependencies with paths relative to the repo root.

In [None]:
import os
import subprocess

# Get the repository root directory
repo_root = subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode('utf-8').strip()

# Change to the repository root
os.chdir(repo_root)

print(f"Working directory set to: {os.getcwd()}")
print(f"Verifying path exists: {os.path.exists('external/lc-deepagents-quickstarts/deep_research')}")

In [None]:
%%bash
# Install the deep_research dependencies
# All paths are relative to the repo root
uv pip install -e external/lc-deepagents-quickstarts/deep_research

<a id="understanding-agent"></a>
# 1.0) About Our Example: The Deep Research Agent

For this tutorial, we'll use LangGraph's Deep Research agent as our example. It's a sophisticated multi-agent system that showcases many advanced LangGraph patterns, making it an excellent demonstration of how to integrate complex LangGraph applications with NeMo Agent Toolkit.

**Why This Example?** The Deep Research agent is feature-rich and demonstrates:
- Multi-step workflows with planning and execution
- Sub-agent coordination and parallel processing
- Custom tool integration (Tavily search, strategic thinking)
- File system operations and context management
- State management across multiple agents

These patterns are common in many LangGraph applications, so the integration techniques you'll learn here are widely applicable.

### Deep Research Agent Features

**Multi-Step Research Workflow:**
1. Saves the research request
2. Creates a structured plan with `TODO` items
3. Delegates subtasks to specialized research sub-agents
4. Synthesizes findings across multiple sources
5. Responds with comprehensive analysis

**Built-in DeepAgent Tools:**
- `write_todos` and `read_todos`: Task planning and progress tracking
- `ls`, `read_file`, `write_file`, `edit_file`: File system operations
- `glob` and `grep`: File search and pattern matching
- `task`: Sub-agent delegation for isolated context windows

**Custom Research Tools:**
- `tavily_search`: Web search that fetches full webpage content
- `think_tool`: Strategic reflection mechanism for planning next steps

**Sub-Agent Architecture:**
The agent can spin up parallel research sub-agents (up to three concurrent) to investigate different aspects of a query simultaneously, with each sub-agent having its own isolated context window.
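
The "up to three concurrent" limit is a bounded-concurrency pattern. The sketch below illustrates that pattern with `asyncio.Semaphore`; it is not the Deep Research agent's actual implementation, and `run_subagent` and `research` are hypothetical stand-ins whose "work" is a short sleep.

```python
import asyncio

MAX_CONCURRENT = 3  # mirrors the "up to three concurrent" limit described above


async def run_subagent(topic: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for one research sub-agent."""
    async with limiter:  # at most MAX_CONCURRENT coroutines are inside this block
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for real search and synthesis work
        stats["active"] -= 1
        return f"findings for {topic}"


async def research(topics: list[str]) -> tuple[list[str], int]:
    """Run one sub-agent per topic and report the peak concurrency observed."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(run_subagent(t, limiter, stats) for t in topics))
    return list(results), stats["peak"]


if __name__ == "__main__":
    findings, peak = asyncio.run(research(["ReAct", "ReWOO", "RAG", "fine-tuning", "evaluation"]))
    print(findings)
    print(f"peak concurrency: {peak}")
```

Each sub-agent returns only its findings string, which loosely mirrors the idea of an isolated context window: the coordinator sees the result, not the intermediate work.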

### Original LangGraph Implementation

The original Deep Research agent is defined in `external/lc-deepagents-quickstarts/deep_research/agent.py` and can be run using LangGraph's CLI through the `langgraph.json` [configuration file](https://docs.langchain.com/langsmith/cli#configuration-file):

```json
{
  "dependencies": ["."],
  "graphs": {
    "research": "./agent.py:agent"
  },
  "env": ".env"
}
```

This configuration tells LangGraph's CLI:
- Where to find dependencies (current directory)
- Where to find the agent graph (`agent.py:agent`)
- Where to load environment variables (`.env` file)

**The key insight:** The `langgraph_wrapper` workflow type provided by NeMo Agent Toolkit mimics this configuration pattern, allowing you to run most LangGraph agents that work with LangGraph CLI through NeMo Agent Toolkit, while adding powerful new capabilities like telemetry, evaluation, and configurable LLMs.

In the next section, we'll see exactly how this integration works.

<a id="running-basic"></a>
# 2.0) Running the Agent with NeMo Agent Toolkit

<a id="config-file"></a>
## 2.1) The Configuration File

NeMo Agent Toolkit provides a `langgraph_wrapper` workflow type that allows you to integrate any existing LangGraph agent without modifying its code. Let's examine the basic configuration file:

```yaml
workflow:
  _type: langgraph_wrapper
  dependencies:
    - external/lc-deepagents-quickstarts/deep_research
  graph: external/lc-deepagents-quickstarts/deep_research/agent.py:agent
  env: .env
```

This configuration closely mirrors LangGraph's `langgraph.json` format:

| **LangGraph CLI** | **NeMo Agent Toolkit** | **Purpose** |
|---|---|---|
| `dependencies: ["."]` | `dependencies: ["external/lc-deepagents-quickstarts/deep_research"]` | Specifies Python packages to install |
| `graphs.research: "./agent.py:agent"` | `graph: "external/lc-deepagents-quickstarts/deep_research/agent.py:agent"` | Points to the agent graph object |
| `env: ".env"` | `env: ".env"` | Environment variables file |
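
The mapping in the table is mechanical enough to sketch in code. The helper below, `to_nat_workflow`, is a hypothetical illustration (not a toolkit API): it rebases the paths from a `langgraph.json` dictionary onto the repository root and emits the same keys used in the YAML above.

```python
import json
import posixpath


def to_nat_workflow(langgraph_config: dict, graph_name: str, project_dir: str) -> dict:
    """Hypothetical helper: map one langgraph.json graph entry to a langgraph_wrapper workflow.

    project_dir is the directory that holds langgraph.json, relative to the repo root.
    """

    def rebase(path: str) -> str:
        # langgraph.json paths are relative to the project directory, while the
        # toolkit config in this notebook uses paths relative to the repo root.
        return posixpath.normpath(posixpath.join(project_dir, path))

    file_path, _, attr = langgraph_config["graphs"][graph_name].rpartition(":")
    return {
        "workflow": {
            "_type": "langgraph_wrapper",
            "dependencies": [rebase(dep) for dep in langgraph_config["dependencies"]],
            "graph": f"{rebase(file_path)}:{attr}",
            # The table above maps `.env` to `.env`, so the value is copied unchanged.
            "env": langgraph_config["env"],
        }
    }


langgraph_config = {
    "dependencies": ["."],
    "graphs": {"research": "./agent.py:agent"},
    "env": ".env",
}
workflow = to_nat_workflow(
    langgraph_config, "research", "external/lc-deepagents-quickstarts/deep_research"
)
print(json.dumps(workflow, indent=2))
```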

### Key Differences

**NeMo Agent Toolkit advantages:**
- Single unified configuration for workflows, LLMs, tools, and telemetry
- Built-in support for evaluation and profiling
- Automatic telemetry integration
- Configurable LLM backends without code changes
- Works seamlessly with other NeMo Agent Toolkit features

Let's view the actual configuration file:

In [None]:
%load examples/frameworks/auto_wrapper/langchain_deep_research/configs/config.yml


<a id="first-query"></a>
## 2.2) Running Your First Query

Now let's run the Deep Research agent using NeMo Agent Toolkit. We'll start with a simple question to verify everything works correctly, then try a more complex research query.

**Note about the LLM:** The Deep Research agent uses Anthropic's Claude model by default (hardcoded in the original `agent.py`). If you don't have access to an Anthropic API key or prefer to use a different model (such as Gemini or GPT-4), you can skip ahead to [Section 3.0: Making the Agent Configurable](#configurable-llms) where we show how to configure any LLM without modifying the agent code.

### Set Your Anthropic API Key

This section runs the original agent with its default Claude model, so set your Anthropic API key first:

In [None]:
import getpass
import os

if "ANTHROPIC_API_KEY" not in os.environ:
    anthropic_api_key = getpass.getpass("Enter your Anthropic API key: ")
    os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

### Quick Verification Query

First, let's test with a simple question that should return quickly:

In [None]:
!nat run --config_file examples/frameworks/auto_wrapper/langchain_deep_research/configs/config.yml \
  --input "What is the capital of France?"

### What Just Happened?

Behind the scenes, NeMo Agent Toolkit:
1. Loaded the LangGraph agent from the specified Python module
2. Installed the required dependencies automatically
3. Set up the environment variables from the `.env` file
4. Wrapped the agent to work within the NeMo Agent Toolkit execution framework
5. Executed the query and streamed results back

All of this happened **without modifying a single line of the original LangGraph agent code**!
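
For intuition about the first step, the sketch below shows one generic way to load an object from a `file.py:attribute` string with Python's standard `importlib`. This is an assumption-laden illustration of the idea, not the toolkit's actual loading code; `load_graph` and the toy module are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_graph(spec: str):
    """Load the object named after ':' from the Python file named before it."""
    file_path, _, attr = spec.rpartition(":")
    module_spec = importlib.util.spec_from_file_location(Path(file_path).stem, file_path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot import {file_path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the file, defining its globals
    return getattr(module, attr)


with tempfile.TemporaryDirectory() as tmp:
    # A stand-in module; a real agent.py would define a compiled LangGraph graph here.
    agent_file = Path(tmp) / "toy_agent.py"
    agent_file.write_text("agent = {'name': 'toy graph'}\n")
    loaded = load_graph(f"{agent_file}:agent")
    print(loaded)
```

Because the file is executed as-is, the agent source stays untouched; anything the wrapper adds happens around the loaded object.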

### Complex Research Query

Now that we've verified the setup works, let's try a more complex research question. This query will demonstrate the agent's full capabilities:

**Note:** This query involves web searches and synthesis, so it may take several minutes to complete. The agent will:
1. Create a research plan with `TODO` items
2. Delegate subtasks to research sub-agents
3. Perform multiple web searches using Tavily
4. Synthesize findings into a comprehensive report

In [None]:
!nat run --config_file examples/frameworks/auto_wrapper/langchain_deep_research/configs/config.yml \
  --input "What are the key differences between ReAct and ReWOO agent architectures?"


<a id="configurable-llms"></a>
# 3.0) Making the Agent Configurable

The original Deep Research agent hardcodes its LLM choice in Python. NeMo Agent Toolkit lets us make the LLM configurable without modifying the core agent logic. This makes it easy to experiment with different models, and it lets the hyperparameter optimizer included in NeMo Agent Toolkit help choose the right model and settings (see the [optimizer documentation](./../../../../docs/source/improve-workflows/optimizer.md)).


<a id="understanding-config"></a>
## 3.1) Understanding the Configurable Agent

Let's examine the modified agent file that uses the Builder provided by NeMo Agent Toolkit to retrieve a configurable LLM:

In [None]:
%load examples/frameworks/auto_wrapper/langchain_deep_research/src/configurable_agent.py


### Key Changes from Original:

**Original hardcoded LLM:**
```python
model = init_chat_model(model="anthropic:claude-sonnet-4-5-20250929", temperature=0.0)
```

**Configurable version:**
```python
from nat.builder.sync_builder import SyncBuilder
from nat.builder.framework_enum import LLMFrameworkEnum

model = SyncBuilder.current().get_llm("agent", wrapper_type=LLMFrameworkEnum.LANGCHAIN)
```

The `SyncBuilder.current().get_llm()` method:
- Accesses the current builder instance via `SyncBuilder.current()`
- Retrieves the LLM configuration named "agent" from the config file
- Returns a LangChain-compatible model instance
- Allows switching models without code changes

Now let's look at the configuration file with LLM definitions:

In [None]:
%load examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_llms.yml


Notice the `llms` section:

```yaml
llms:
  agent:
    _type: nim
    model: nvidia/nemotron-3-nano-30b-a3b
    max_tokens: 16384
    chat_template_kwargs:
      reasoning_budget: 1024
```

And the workflow now points to the configurable agent:

```yaml
workflow:
  _type: langgraph_wrapper
  dependencies:
    - external/lc-deepagents-quickstarts/deep_research
  graph: examples/frameworks/auto_wrapper/langchain_deep_research/src/configurable_agent.py:agent
```

<a id="running-different-llms"></a>
## 3.2) Running with Different LLMs

Now we can easily experiment with different models by just changing the configuration. Let's try running with `nemotron-3-nano-30b-a3b`:

In [None]:
!nat run --config_file examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_llms.yml \
  --input "What are the trade-offs between using embeddings versus keywords for document retrieval?"

To try a different model, you can easily modify the config file or create a new one. For example, to use the `llama-3.3-nemotron-super-49b-v1` model:

In [None]:
%%writefile /tmp/config_llama.yml

llms:
  agent:
    _type: nim
    model: nvidia/llama-3.3-nemotron-super-49b-v1
    max_tokens: 16384
workflow:
  _type: langgraph_wrapper
  dependencies:
    - external/lc-deepagents-quickstarts/deep_research
  graph: examples/frameworks/auto_wrapper/langchain_deep_research/src/configurable_agent.py:agent
  env: .env

In [None]:
!nat run --config_file /tmp/config_llama.yml \
  --input "What are the trade-offs between using embeddings versus keywords for document retrieval?"

### Benefits of Configurable LLMs:

1. **Easy Experimentation:** Test different models without code changes
2. **A/B Testing:** Compare model performance on the same queries
3. **Cost Optimization:** Switch between models based on cost and performance needs
4. **Environment-Specific Models:** Use different models for dev, staging, and production
5. **Unified Configuration:** All infrastructure choices in one place
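
To A/B test several models on the same query, you could generate one config file per model instead of writing each by hand. The sketch below is one way to do that; `CONFIG_TEMPLATE`, `MODELS`, and `write_configs` are illustrative names, and the template omits the optional `chat_template_kwargs` used earlier for `nemotron-3-nano-30b-a3b`.

```python
import tempfile
from pathlib import Path

# Same keys as the /tmp/config_llama.yml example above; only the model changes.
CONFIG_TEMPLATE = """\
llms:
  agent:
    _type: nim
    model: {model}
    max_tokens: 16384
workflow:
  _type: langgraph_wrapper
  dependencies:
    - external/lc-deepagents-quickstarts/deep_research
  graph: examples/frameworks/auto_wrapper/langchain_deep_research/src/configurable_agent.py:agent
  env: .env
"""

MODELS = [
    "nvidia/nemotron-3-nano-30b-a3b",
    "nvidia/llama-3.3-nemotron-super-49b-v1",
]


def write_configs(models: list[str], out_dir: Path) -> list[Path]:
    """Write one config file per model and return the file paths."""
    paths = []
    for model in models:
        # Derive the file name from the model id, dropping the "nvidia/" prefix.
        path = out_dir / f"config_{model.split('/')[-1]}.yml"
        path.write_text(CONFIG_TEMPLATE.format(model=model))
        paths.append(path)
    return paths


out_dir = Path(tempfile.mkdtemp(prefix="nat_configs_"))
for path in write_configs(MODELS, out_dir):
    print(f'nat run --config_file {path} --input "<your question>"')
```

The printed commands can be pasted into a notebook cell (prefixed with `!`) to run each model on the same input.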

In the next section, we'll add telemetry to the agent to see how it performs with different models!

<a id="telemetry"></a>
# 4.0) Adding Telemetry with Phoenix

One of the key benefits of using NeMo Agent Toolkit is the ability to add comprehensive instrumentation to any agent with just configuration changes. Let's add telemetry using Arize Phoenix, an open-source observability platform for LLM applications.

<a id="starting-phoenix"></a>
## 4.1) Starting Phoenix

First, we need to start the Phoenix server. Phoenix provides a web UI for viewing traces, spans, and metrics from your agent executions.

**Option 1: Using Docker (Recommended)**

In [None]:
%%bash
# Start Phoenix in the background using Docker
docker run -d \
  --name phoenix \
  -p 6006:6006 \
  arizephoenix/phoenix:latest

echo "Phoenix is starting... It will be available at http://localhost:6006"
echo "Give it a few seconds to fully initialize"

**Option 2: Using Phoenix CLI**

In [None]:
%%bash
# Install Phoenix
uv pip install arize-phoenix

# Start Phoenix server in the background
# This will start the server on http://localhost:6006
nohup phoenix serve > /dev/null 2>&1 &

echo "Phoenix server is starting at http://localhost:6006"
echo "Give it a few seconds to fully initialize"

### Accessing the Phoenix UI

Once Phoenix is running, open your browser and navigate to:
- **URL:** http://localhost:6006

You should see the Phoenix dashboard. Initially, it will be empty since we haven't sent any traces yet.
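
If you prefer to check from the notebook rather than the browser, a small TCP probe can confirm that something is listening on port 6006. This is a generic sketch using only the standard library; `is_port_open` is a hypothetical helper, and an open port does not prove the listener is Phoenix.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_port_open("localhost", 6006):
    print("Something is listening on port 6006; Phoenix is probably up.")
else:
    print("Nothing is listening on port 6006 yet; wait a few seconds and retry.")
```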

<a id="running-telemetry"></a>
## 4.2) Running with Telemetry

Now let's examine the configuration file that adds Phoenix telemetry. The key addition is the `general.telemetry` section:

In [None]:
%load examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_telemetry.yml


The telemetry configuration is straightforward:

```yaml
general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: lc_deepagents
```

This configuration:
- Enables Phoenix tracing
- Points to the local Phoenix server
- Creates a project named `lc_deepagents` to organize traces

Now let's run the agent with telemetry enabled:

In [None]:
!nat run --config_file examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_telemetry.yml \
  --input "Compare the performance characteristics of RAG versus fine-tuning for domain adaptation"

<a id="viewing-traces"></a>
## 4.3) Viewing Traces in Phoenix

After the query completes, switch to your Phoenix UI (http://localhost:6006) and explore the telemetry data:

### What You'll See in Phoenix:

**1. Traces View:**
- Complete execution trace of your agent run
- Hierarchical view of all function calls and LLM interactions
- Timing information for each step

**2. Spans:**
- Individual operations (LLM calls, tool calls, sub-agent delegations)
- Input and output data for each operation
- Latency and token usage metrics

**3. Projects:**
- All traces organized under the `lc_deepagents` project
- Easy filtering and comparison of different runs

**4. LLM Metrics:**
- Token usage (prompt and completion tokens)
- Cost estimates
- Model performance statistics

### Key Benefits of Telemetry:

- **Debugging:** Trace exactly what your agent did at each step
- **Performance Optimization:** Identify slow operations and bottlenecks
- **Cost Monitoring:** Track token usage and API costs
- **Quality Assurance:** Review agent decisions and tool usage patterns

**Important:** Adding observability required **zero code changes** to the agent. The only code change so far is the one-line swap in Section 3.0 that makes the LLM configurable; telemetry itself is enabled entirely through configuration.

<a id="evaluation"></a>
# 5.0) Evaluating Agent Performance

One of the most powerful features of NeMo Agent Toolkit is its built-in evaluation framework. Let's set up systematic evaluation of our Deep Research agent using a dataset and automated metrics.

<a id="setup-eval"></a>
## 5.1) Setting Up Evaluation

Let's examine the evaluation configuration file:

In [None]:
%load examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_eval.yml


### Understanding the Evaluation Configuration:

**1. LLM Definitions:**
```yaml
llms:
  agent:  # The LLM used by the research agent
    _type: nim
    model: nvidia/nemotron-3-nano-30b-a3b
    max_tokens: 16384
    chat_template_kwargs:
      reasoning_budget: 1024
    
  judge:  # A separate LLM used to evaluate outputs
    _type: nim
    model: nvidia/nvidia-nemotron-nano-9b-v2
```

**2. Evaluation Dataset:**
```yaml
eval:
  general:
    dataset:
      _type: csv
      file_path: examples/frameworks/auto_wrapper/langchain_deep_research/data/DeepConsult_top1.csv
      structure:
        answer_key: baseline_answer
```

The dataset contains:
- `question`: Research questions to answer
- `baseline_answer`: Reference answers for comparison
- `candidate_answer`: (populated during eval) Agent's responses
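
Before a long evaluation run, it can help to confirm that a dataset has the columns the config expects. The sketch below checks for `question` and `baseline_answer` with the standard `csv` module; `check_dataset` is a hypothetical helper, and the in-memory rows are made up for illustration rather than taken from `DeepConsult_top1.csv`.

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "baseline_answer"}


def check_dataset(csv_file) -> int:
    """Validate the header of an open CSV file and return the number of usable rows."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"dataset is missing columns: {sorted(missing)}")
    # Count rows that have both a question and a reference answer.
    return sum(1 for row in reader if row["question"] and row["baseline_answer"])


# A tiny in-memory stand-in for the real dataset (made-up rows).
sample = io.StringIO(
    "question,baseline_answer\n"
    "What is RAG?,Retrieval-augmented generation combines search with an LLM.\n"
    "What is ReAct?,\n"
)
print(check_dataset(sample))
```

To run the check on the real file, open it with `open(path, newline="")` and pass the file object to `check_dataset`.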

**3. Evaluator Configuration:**
```yaml
evaluators:
  judge:
    _type: ragas
    metric: AnswerAccuracy
    llm_name: judge
    input_obj_field: ground_truth
```

This uses RAGAS (Retrieval Augmented Generation Assessment) to evaluate:
- **AnswerAccuracy:** How well the agent's answer matches the ground truth
- Uses the "judge" LLM to score answers
- Compares against the `ground_truth` field from the dataset

Let's peek at the evaluation dataset:

In [None]:
import pandas as pd
from IPython.display import Markdown
from IPython.display import display

df = pd.read_csv('examples/frameworks/auto_wrapper/langchain_deep_research/data/DeepConsult_top1.csv')
n_questions = len(df)
sample_q = df['question'].iloc[0][:300] + "..." if len(df['question'].iloc[0]) > 300 else df['question'].iloc[0]
sample_a = df['baseline_answer'].iloc[0][:500] + "..." if len(
    df['baseline_answer'].iloc[0]) > 500 else df['baseline_answer'].iloc[0]

display(
    Markdown(f"""
**Dataset contains:** `{n_questions}` **questions**

---

**Sample question:**
```
{sample_q}
```
<br>

**Ground truth answer:**
{sample_a}
"""))

<a id="running-eval"></a>
## 5.2) Running Evaluation

Now let's run the evaluation using the `nat eval` command. This will:
1. Load all questions from the dataset
2. Run the Deep Research agent on each question
3. Collect the agent's responses
4. Use the judge LLM to evaluate answer quality
5. Generate a comprehensive evaluation report

**Note:** This may take a considerable amount of time depending on the dataset size, as each question involves:
- Research planning
- Multiple web searches
- Sub-agent coordination
- Synthesis and reporting

In [None]:
!nat eval --config_file examples/frameworks/auto_wrapper/langchain_deep_research/configs/config_with_eval.yml

<a id="analyzing-results"></a>
## 5.3) Analyzing Results

After the evaluation completes, NeMo Agent Toolkit generates several outputs in the configured `output_dir` (`.tmp/deepagents_eval`):

### Output Files:

**1. Judge Output (`judge_output.json`):**
- Average evaluation score across all questions
- Per-question scores and detailed reasoning
- User input, agent response, and reference answer for each question
- Structure:
  ```json
  {
    "average_score": 0.5,
    "eval_output_items": [
      {
        "id": "",
        "score": 0.5,
        "reasoning": {
          "user_input": "...",
          "response": "...",
          "reference": "..."
        }
      }
    ]
  }
  ```

**2. Workflow Output (`workflow_output.json`):**
- Full agent responses for each question
- Complete execution details
- Raw agent output before evaluation

Let's load and examine the results:

In [None]:
import json
import os

# Load evaluation results from judge_output.json
results_path = '.tmp/deepagents_eval/judge_output.json'
if os.path.exists(results_path):
    with open(results_path) as f:
        results = json.load(f)

    print("Evaluation Summary:")
    print("=" * 60)
    print(f"Average Score: {results.get('average_score', 'N/A')}")
    print(f"Total Questions Evaluated: {len(results.get('eval_output_items', []))}")

    # Show per-question results
    print("\nPer-Question Results:")
    print("=" * 60)
    for i, item in enumerate(results.get('eval_output_items', [])):
        print(f"\nQuestion {i+1}:")
        print(f"  Score: {item.get('score', 'N/A')}")

        # Show reasoning details
        reasoning = item.get('reasoning', {})
        if reasoning:
            user_input = reasoning.get('user_input', 'N/A')
            print(f"  User Input: {user_input[:100]}..." if len(user_input) > 100 else f"  User Input: {user_input}")

            # Show a snippet of the response and reference if available
            if 'response' in reasoning:
                response = str(reasoning['response'])[:200]
                print(f"  Agent Response (snippet): {response}...")

            if 'reference' in reasoning:
                reference = str(reasoning['reference'])[:200]
                print(f"  Reference Answer (snippet): {reference}...")
else:
    print(f"Results file not found at {results_path}")
    print("Please ensure the evaluation has completed successfully.")
    print("\nNote: Evaluation output is saved to:")
    print("  - judge_output.json: Evaluation scores and reasoning")
    print("  - workflow_output.json: Full agent responses")


### Comparing Different Models

You can easily compare how different LLMs perform on the same evaluation dataset. Simply modify the `agent` LLM in the config and run evaluation again:

In [None]:
%%writefile /tmp/config_eval_nemotron.yml

llms:
  agent:
    _type: nim
    model: nvidia/llama-3.3-nemotron-super-49b-v1
    max_tokens: 16384
  judge:
    _type: nim
    model: nvidia/nvidia-nemotron-nano-9b-v2
    max_tokens: 16384

workflow:
  _type: langgraph_wrapper
  dependencies:
    - external/lc-deepagents-quickstarts/deep_research
  graph: examples/frameworks/auto_wrapper/langchain_deep_research/src/configurable_agent.py:agent
  env: .env

eval:
  general:
    output_dir: .tmp/deepagents_eval_nemotron
    workflow_alias: deepagents_eval_nemotron
    dataset:
      _type: csv
      file_path: examples/frameworks/auto_wrapper/langchain_deep_research/data/DeepConsult_top1.csv
      structure:
        answer_key: baseline_answer
    profiler:
      base_metrics: true

  evaluators:
    judge:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: judge
      input_obj_field: ground_truth

In [None]:
# Run evaluation with llama-3.3-nemotron-super-49b-v1 as the agent LLM
!nat eval --config_file /tmp/config_eval_nemotron.yml

Now you can compare the results between different models:
- Check the respective output directories
- Analyze cost versus quality trade-offs

### Key Benefits of NeMo Agent Toolkit Evaluation:

1. **Systematic Testing:** Evaluate on consistent datasets
2. **Automated Metrics:** Use LLM judges for quality assessment
3. **Performance Tracking:** Monitor latency, tokens, and costs
4. **Model Comparison:** Easily A/B test different LLMs
5. **Regression Detection:** Catch quality degradation over time

All achieved with **zero modifications** to the original LangGraph agent code!

<a id="next-steps"></a>
# 6.0) Next Steps

Congratulations! You've learned how to integrate a LangGraph Deep Research agent with NeMo Agent Toolkit and unlock powerful capabilities:

### What You Accomplished:

1. ✅ Set up and ran a complex LangGraph agent using NeMo Agent Toolkit
2. ✅ Added comprehensive telemetry with Phoenix
3. ✅ Made the agent configurable for different LLMs
4. ✅ Evaluated agent performance systematically

### Advanced Topics to Explore:

**1. Additional Telemetry Backends:**
- Try OpenTelemetry, Weave, or LangSmith
- Configure multiple telemetry backends simultaneously
- Set up alerting and monitoring

**2. Advanced Evaluation:**
- Add custom metrics beyond AnswerAccuracy
- Use multiple judge LLMs for consensus scoring
- Implement human-in-the-loop evaluation
- Create evaluation reports with visualization

**3. Performance Optimization:**
- Use profiling to identify bottlenecks
- Experiment with different model sizes
- Optimize sub-agent delegation strategies
- Implement caching for common queries

**4. Production Deployment:**
- Deploy the agent as a REST API using `nat serve`
- Set up continuous evaluation pipelines
- Implement version control for configurations

**5. Custom Agent Development:**
- Build your own agents using NeMo Agent Toolkit primitives
- Integrate custom tools and functions
- Implement domain-specific agent workflows
- Create reusable agent templates

### Learn More:

- **NeMo Agent Toolkit Documentation:** https://docs.nvidia.com/nemo-agent-toolkit
- **LangGraph Documentation:** https://langchain-ai.github.io/langgraph/
- **Phoenix Documentation:** https://docs.arize.com/phoenix

Happy agent building! 🚀