<div style="background-color:#e6f7ff; padding:10px; border-radius:6px;">

# Advanced Evaluation of LangGraph Agents with **watsonx.governance**

This notebook demonstrates advanced evaluation capabilities of IBM watsonx.governance for monitoring and governing production-grade LangGraph agentic systems. These evaluators provide enterprise-ready governance for complex agentic architectures, moving beyond basic LLM assessments.

We first create a question-answering agent that can use local documents or web search to answer a question. The agent uses the context relevance score of the retrieved local documents to decide which tool to use. We then use the Agentic AI evaluators from the IBM watsonx.governance Python SDK to evaluate this agent on metrics including:
- Retrieval context relevance
- Web search context relevance
- Retrieval precision
- Web search precision
- PII
- HAP
- HARM
- Jailbreak
- Sexual content
- Latency
- Cost



# Agent Architecture

<div>
  <img 
    src="https://raw.githubusercontent.com/ibm-self-serve-assets/building-blocks/main/trusted-ai/design-time-evaluations/agents-evaluations/images/Advanced_Agent.png" 
    alt="Advanced Agent" 
    width="15%">
</div>


## Important Note:

If you are using this watsonx instance for the first time, you must first associate your instance with a runtime. See the step-by-step PDF guide for detailed instructions.

</div>

## 1. Prerequisites

### Install the dependencies

**Note:** Ignore any dependency errors when running the next two cells. Your code will run without problems.

In [None]:
!pip install --quiet "ibm-watsonx-gov[agentic,visualization]" "langchain-chroma<0.3.0" "chromadb>=1.0.13,<2.0.0" "langchain-openai<=0.3.0"
!pip uninstall --quiet -y torch
!pip install --quiet ddgs
# !pip install --quiet "ddgs[all]"
!pip install --quiet langchain-ibm
!pip install --quiet sentence-transformers
!pip install --quiet ibm_agent_analytics==0.5.4 

In [None]:
!pip uninstall --quiet protobuf -y
!pip install --quiet protobuf==4.25.3

### 🔑 Configure Authentication


Below is a brief description of the required environment variables.  
For detailed instructions on how to obtain them, please see the step-by-step PDF guide.  

- Only **WATSONX_APIKEY** and **WATSONX_PROJECT_ID** are required to run this notebook.  

### First-time setup
When you run the code snippet for the first time:  
1. A pop-up input bar will appear asking for each variable.  
2. Paste your **API key** and press **Enter**.  
3. Next, you will be prompted for the **Project ID**. Paste it and press **Enter**.  

In [None]:
import os, getpass


def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")


# For watsonx.governance Cloud
_set_env("WATSONX_APIKEY")
# _set_env("WATSONX_REGION")
# _set_env("WXG_SERVICE_INSTANCE_ID")

# set project ID for experiment tracking
_set_env("WATSONX_PROJECT_ID")

print("✅ Environment configured successfully!")

### Set URLs for data
The first URL points to the source PDF document for RAG.

The second URL points to the set of test questions and ground truths used to evaluate the RAG app.

The test question CSV should have two columns: `input_text` and `ground_truth`.

`input_text` contains the questions for the RAG app; `ground_truth` contains examples of "good" or expected answers to each question.
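
The two-column layout described above can be checked before any evaluation is run. The following is a minimal sketch using only the standard library; the sample rows and the `check_columns` helper are hypothetical and are not part of the notebook or the SDK.

```python
import csv
import io

REQUIRED_COLUMNS = {"input_text", "ground_truth"}

# Hypothetical sample standing in for the downloaded test-question CSV.
SAMPLE_CSV = (
    "input_text,ground_truth\n"
    "What is an auto loan?,A loan used to buy a vehicle.\n"
)


def check_columns(csv_text: str) -> list[dict]:
    """Parse CSV text and fail early if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return list(reader)


rows = check_columns(SAMPLE_CSV)
print(len(rows), "test question(s) loaded")
```

Failing early on a missing column is cheaper than discovering it after the agent has already been invoked on every row.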

In [None]:
RAG_DATA_PDF_URL_1 = (
    "https://ibm.box.com/shared/static/a9o9dsyzwcuwneugha769cnkf931or6f.pdf"
)
TEST_QUESTIONS_CSV_URL_1 = (
    "https://ibm.box.com/shared/static/yyjyg4xw5ix6xag3p8wsyqc2cobrz655.csv"
)

### Get rid of any persistent files from prior runs

In [None]:
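import os
import shutil

# Remove the persisted vector store and the downloaded PDF, if present
shutil.rmtree("vector_store", ignore_errors=True)
if os.path.exists("loan_doc.pdf"):
    os.remove("loan_doc.pdf")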


In [None]:
# import os

for root, _, files in os.walk("."):
    for name in files:
        path = os.path.join(root, name)
        size = os.path.getsize(path)
        print(f"📄 {path} — {size / 1024:.2f} KB")

## 2. Create RAG vector db

### Go grab a pdf as a RAG source

In [None]:
import requests

url = RAG_DATA_PDF_URL_1
r = requests.get(url)
r.raise_for_status()
with open("loan_doc.pdf", "wb") as f:
    f.write(r.content)

### Set up the local vector store

Chunk the PDF and store it in the vector DB to be used by the RAG app.
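
To illustrate what chunking with overlap means, here is a simplified sketch that slides a fixed-size character window over the text. It is an assumption-laden stand-in: the splitter used in the next cell also tries to break on natural separators such as paragraphs and sentences, which this sketch does not do. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# Synthetic 1000-character text so that the overlap is easy to inspect.
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.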


In [None]:
# import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_ibm.embeddings import WatsonxEmbeddings

# Load PDF
loader = PyPDFLoader("loan_doc.pdf")
documents = loader.load()

# Chunk documents to stay under token limit
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=50  # sizes are in characters; 400 characters stays safely below the 512-token max
)
chunked_docs = splitter.split_documents(documents)

# Initialize IBM embedding model
embedding_model = WatsonxEmbeddings(
    model_id="ibm/slate-30m-english-rtrvr",
    url="https://us-south.ml.cloud.ibm.com",
    apikey=os.environ["WATSONX_APIKEY"],
    project_id=os.environ["WATSONX_PROJECT_ID"],
)

# Persist directory for Chroma
persist_dir = "vector_store"

# Create Chroma vector store
vector_store = Chroma.from_documents(
    documents=chunked_docs, embedding=embedding_model, persist_directory=persist_dir
)

# Save to disk
vector_store.persist()

# Create retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Test retrieval
results = retriever.invoke("Auto loans versus personal loans")
for doc in results:
    print("🔍", doc.page_content)

## 3. Create LangGraph Agent Application

### Set up the State

In [None]:
from typing_extensions import TypedDict


class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        input_text (str):
            The user's raw input query or question.
        ground_truth (Optional[str]):
            Reference/correct answer (if available). Used for evaluation.
        local_context (List[str]):
            Context retrieved from local knowledge base or vector store.
        web_context (List[str]):
            Context fetched from web searches (if used).
        generated_text (Optional[str]):
            The final output generated by the LLM after processing all contexts.
    """

    input_text: str  # The user's raw input query or question
    ground_truth: str  # Reference/correct answer (if available). Used for evaluation
    local_context: list[str]  # Context retrieved from vector store
    web_context: list[str]  # Context fetched from web searches (if used)
    generated_text: (
        str  # The final output generated by the LLM after processing all contexts
    )

### Set up the evaluator

For evaluating your Agentic AI applications, you need to first instantiate the `AgenticEvaluator` class. This class defines evaluators to compute the different metrics at the node and interaction level.

We are going to use the following evaluator groups in this notebook:
1. `evaluate_retrieval_quality`: Computes retrieval quality metrics for your content retrieval tool. Retrieval quality metrics include Context Relevance, Retrieval Precision, Average Precision, Hit Rate, Reciprocal Rank, and NDCG.
2. `evaluate_answer_quality`: Computes answer quality metrics for your answer generation tool. Answer quality metrics include Answer Relevance, Faithfulness, Answer Similarity, and Unsuccessful Requests. The Answer Similarity metric requires ground truth.
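
For intuition about the rank-based retrieval metrics named above, the sketch below computes textbook versions of Hit Rate, Reciprocal Rank, Average Precision, and NDCG from a ranked list of binary relevance labels. This is an illustration only: how the SDK derives relevance labels and normalizes each metric is not shown in this notebook, so its values may differ. All function names here are hypothetical.

```python
import math


def hit_rate(relevance: list[int]) -> float:
    """1.0 if at least one retrieved item is relevant, else 0.0."""
    return 1.0 if any(relevance) else 0.0


def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant item (ranks start at 1)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: list[int]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg(relevance: list[int]) -> float:
    """Discounted cumulative gain divided by the gain of the ideal ordering."""

    def dcg(labels: list[int]) -> float:
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels, start=1))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0


# Ranked results: first document irrelevant, second and third relevant.
relevance = [0, 1, 1]
for metric in (hit_rate, reciprocal_rank, average_precision, ndcg):
    print(metric.__name__, round(metric(relevance), 4))
```

All four metrics reward relevant documents appearing early in the ranking, which is why a retriever that returns the right chunk in third place scores lower than one that returns it first.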

#### Configuring Evaluations

Use the `AgenticApp` class to define the interaction-level and node-level metrics to be computed.

- Metrics computed at the agentic app (interaction) level are specified in `metrics_configuration` in `AgenticApp`.
- Node-level metrics computed during the graph invocation are specified with decorators on the node functions.
- Node-level metrics computed after the graph invocation are specified either with decorators whose `compute_real_time` flag is set to `False`, or in the `nodes` attribute of `AgenticApp`. Use only one of these two approaches to avoid conflicts or redundancy.

You can define evaluation configurations by specifying relevant fields for different evaluation types, such as retrieval quality and answer quality.

For example, a configuration for evaluating retrieval quality might look like this:  

```python
retrieval_quality_config = {
    "input_fields": ["input_text"], 
    "context_fields": ["local_context"]
}
```

This configuration is then used to create an `AgenticAIConfiguration` instance.
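
Because these configurations refer to state attributes by name, a typo in a field name is easy to miss. The sketch below shows one way to check a configuration dictionary against the keys of a state `TypedDict`. The `ExampleState` class and the `unknown_fields` helper are hypothetical and are not part of the SDK.

```python
from typing import TypedDict, get_type_hints


class ExampleState(TypedDict):
    """Hypothetical stand-in for the graph state used in this notebook."""

    input_text: str
    local_context: list[str]
    generated_text: str


retrieval_quality_config = {
    "input_fields": ["input_text"],
    "context_fields": ["local_context"],
}


def unknown_fields(config: dict[str, list[str]], state_type: type) -> list[str]:
    """Return the configured field names that are not keys of the state type."""
    state_keys = set(get_type_hints(state_type))
    return sorted(
        name for fields in config.values() for name in fields if name not in state_keys
    )


# An empty list means every configured field exists in the state.
print(unknown_fields(retrieval_quality_config, ExampleState))
```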

In [None]:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator
from ibm_watsonx_gov.config import AgenticAIConfiguration
from ibm_watsonx_gov.entities.agentic_app import AgenticApp, MetricsConfiguration, Node
from ibm_watsonx_gov.metrics import AnswerRelevanceMetric, ContextRelevanceMetric
from ibm_watsonx_gov.entities.enums import MetricGroup
from ibm_watsonx_gov.config.agentic_ai_configuration import TracingConfiguration


# Metrics computed at the agentic app (interaction) level are defined in
# metrics_configuration under AgenticApp; they use the agent input and output fields.
# Node-level metrics to be computed after the graph invocation are specified
# in the `nodes` list below.

retrieval_quality_config_web_search_node = {
    "input_fields": ["input_text"],
    "context_fields": ["web_context"],
}

nodes = [
    Node(
        name="Web \nSearch \nNode",
        metrics_configurations=[
            MetricsConfiguration(
                configuration=AgenticAIConfiguration(
                    **retrieval_quality_config_web_search_node
                ),
                metrics=[ContextRelevanceMetric()],
            )
        ],
    )
]

agent_app = AgenticApp(
    name="Rag agent",
    metrics_configuration=MetricsConfiguration(
        metrics=[AnswerRelevanceMetric()], metric_groups=[MetricGroup.CONTENT_SAFETY]
    ),
    nodes=nodes,
)


evaluator = AgenticEvaluator(
    agentic_app=agent_app,
    tracing_configuration=TracingConfiguration(
        project_id=os.getenv("WATSONX_PROJECT_ID")
    ),
)


print(evaluator)

<div style="background-color:#f9fdff; padding:10px; border-radius:6px;">

### [Optional] Set up LLM Judge for metrics evaluation

To use an LLM as a judge when evaluating metrics, define the model provider and the model name along with the required credentials. The metrics that support evaluation with an LLM judge are 
`context_relevance`, `faithfulness`, `answer_relevance` and `answer_similarity`.

To evaluate a metric with an LLM judge, pass the `llm_judge` details when creating the metric object. For example:

```python
# Define LLM Judge using watsonx.ai
llm_judge = LLMJudge(
    model=WxAIFoundationModel(
        model_id="ibm/granite-3-3-8b-instruct",
        project_id=os.environ["WATSONX_PROJECT_ID"],
    )
)

# Defining LLM Judge using OpenAI
llm_judge = LLMJudge(
    model=OpenAIFoundationModel(
        model_id="gpt-4o-mini",
    )
)

# Specify the LLM judge when initializing the metric
@evaluator.evaluate_context_relevance(
    configuration=AgenticAIConfiguration(**context_relevance_config_local_search_node),
    metrics=[ContextRelevanceMetric(llm_judge=llm_judge)],
)
def local_search_node(state: GraphState, config: RunnableConfig) -> dict:
    ...
```

</div>

In [None]:
from ibm_watsonx_gov.entities.foundation_model import WxAIFoundationModel
from ibm_watsonx_gov.entities.llm_judge import LLMJudge

# _set_env("PROJECT_ID")
PROJECT_ID = os.environ["WATSONX_PROJECT_ID"]

llm_judge = LLMJudge(
    model=WxAIFoundationModel(
        model_id="ibm/granite-3-3-8b-instruct",
        project_id=PROJECT_ID,
    )
)

### Build your LangGraph application

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# from langchain_openai import ChatOpenAI
from langgraph.config import RunnableConfig

#### Define vector database retrieval node

We use a similarity-score-threshold retrieval strategy: the retriever returns up to 3 documents that best match the query, keeping only those whose similarity score is above 0.1.
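
The idea behind this strategy can be sketched in plain Python: drop candidates below the score threshold, then keep the best `k` of what remains. The `top_k_above_threshold` helper and the sample scores are hypothetical. Whether the real retriever treats a score exactly equal to the threshold as a match is an assumption here; the sketch uses a strict comparison.

```python
def top_k_above_threshold(
    scored_docs: list[tuple[str, float]], k: int = 3, score_threshold: float = 0.1
) -> list[str]:
    """Keep documents scoring above the threshold, best first, at most k of them."""
    # Assumption: strict comparison; the actual retriever may use >= instead.
    kept = [(doc, score) for doc, score in scored_docs if score > score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]


# Hypothetical (document, similarity score) pairs.
candidates = [("a", 0.9), ("b", 0.05), ("c", 0.4), ("d", 0.2), ("e", 0.15)]
print(top_k_above_threshold(candidates))
```

With a threshold, the retriever can return fewer than `k` documents, or none at all, when nothing in the vector store is close to the query.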

The `local_search_node` function defined below is decorated with the IBM watsonx.governance evaluator group `evaluate_retrieval_quality`, which computes the retrieval quality metrics: Context Relevance, Retrieval Precision, Average Precision, Hit Rate, Reciprocal Rank, and NDCG. To compute individual metrics instead, decorate the function with the individual evaluators (`evaluate_retrieval_precision`, `evaluate_average_precision`, `evaluate_hit_rate`, `evaluate_reciprocal_rank`, `evaluate_ndcg`, `evaluate_context_relevance`). This node reads the user query from the `input_text` attribute of the application state and writes the result to the `local_context` attribute of the application state.

In [None]:
from ibm_watsonx_gov.metrics import ContextRelevanceMetric


def get_local_search_node(evaluator):
    retrieval_quality_config_local_search_node = {
        "input_fields": ["input_text"],
        "context_fields": ["local_context"],
    }

    @evaluator.evaluate_retrieval_quality(
        configuration=AgenticAIConfiguration(
            **retrieval_quality_config_local_search_node
        ),
        # Uncomment the following line to evaluate context relevance using LLM as judge
        # metrics=[ContextRelevanceMetric(llm_judge=llm_judge)],
    )
    # @evaluator.evaluate_retrieval_precision(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    # @evaluator.evaluate_average_precision(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    # @evaluator.evaluate_hit_rate(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    # @evaluator.evaluate_reciprocal_rank(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    # @evaluator.evaluate_ndcg(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    # @evaluator.evaluate_context_relevance(configuration=AgenticAIConfiguration(**retrieval_quality_config_local_search_node))
    def local_search_node(state: GraphState, config: RunnableConfig) -> dict:
        similarity_threshold_retriever = vector_store.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 3, "score_threshold": 0.1},
        )
        context = similarity_threshold_retriever.invoke(state["input_text"])
        print("\n##########Going to the vector search node#############")
        print("\n########## Retrieved Context: #############")
        for i, doc in enumerate(context):
            print(f"\n--- Document {i} ---\n{doc.page_content}\n")

        return {"local_context": [doc.page_content for doc in context]}

    return local_search_node


local_search_node = get_local_search_node(evaluator)

#### Define web search retrieval node

We use DuckDuckGo for the web search.

The `web_search_node` function defined below is decorated with the IBM watsonx.governance evaluator group `evaluate_retrieval_quality`, which computes the retrieval quality metrics: Context Relevance, Retrieval Precision, Average Precision, Hit Rate, Reciprocal Rank, and NDCG. To compute individual metrics instead, decorate the function with the individual evaluators mentioned above. This node reads the user query from the `input_text` attribute of the application state and writes the result to the `web_context` attribute of the application state.

In [None]:
from ddgs import DDGS


def get_web_search_node(evaluator):
    retrieval_quality_config_web_search_node = {
        "input_fields": ["input_text"],
        "context_fields": ["web_context"],
    }

    @evaluator.evaluate_retrieval_quality(
        configuration=AgenticAIConfiguration(
            **retrieval_quality_config_web_search_node
        ),
        # Uncomment the following line to evaluate context relevance using LLM as judge
        # metrics=[ContextRelevanceMetric(llm_judge=llm_judge)],
    )
    def web_search_node(state: GraphState, max_results: int = 5) -> dict:
        """Perform web search using DuckDuckGo and return relevant content snippets"""
        query = state["input_text"]
        results = []
        print("\n##########Going to the web search node#############")
        with DDGS() as ddgs:
            search_results = ddgs.text(query, max_results=max_results)

            for result in search_results:
                url = result.get("href")
                snippet = result.get("body") or result.get("title")
                if snippet and url:
                    results.append(f"From {url}: {snippet}")
                    print(
                        "\n###HERE IS A WEB SEARCH SNIPPET####\n",
                        snippet,
                        "\n###END OF WEB SNIPPET###\n",
                    )

        return {"web_context": results[:max_results]}

    return web_search_node


web_search_node = get_web_search_node(evaluator)

### Create a prompt template to get the response from the LLM

We are using a Llama model from watsonx to generate an answer for our query.

In [None]:
def generate_response(input_text: str, context_text: list[str]):
    from langchain_ibm import WatsonxLLM
    from langchain.prompts import ChatPromptTemplate

    generate_prompt = ChatPromptTemplate.from_template(
        "Answer the query in 1 sentence using only the provided context:\n"
        "Context: {context}\n"
        "Question: {input_text}\n"
        "Answer:"
    )
    formatted_prompt = generate_prompt.invoke(
        {"input_text": input_text, "context": "\n".join(context_text)}
    )

    # Initialize WatsonX LLM
    llm = WatsonxLLM(
        model_id="meta-llama/llama-3-3-70b-instruct",  # You can change this to other available models
        url="https://us-south.ml.cloud.ibm.com",  # Update with your WatsonX URL
        project_id=os.getenv(
            "WATSONX_PROJECT_ID"
        ),  # Replace with your actual project ID
        params={
            "max_new_tokens": 500,
            "decoding_method": "greedy",
            "repetition_penalty": 1.1,
            "stop_sequences": ["."],
        },
    )

    result = llm.invoke(formatted_prompt)
    return result

#### Define vector-database based answer generation tool

We are using the watsonx call in `generate_response` to generate an answer for our query.


The `generate_local_context_node` function defined below is decorated with the evaluator group `evaluate_answer_quality`, which computes the answer quality metrics: Answer Relevance, Faithfulness, Answer Similarity, and Unsuccessful Requests. This node reads the user query from the `input_text` attribute of the application state and the context chunks from the `local_context` attribute. After generating the answer, the node writes the result to the `generated_text` attribute of the application state. To compute individual metrics instead, decorate the function with the individual evaluators, such as `evaluate_faithfulness` or `evaluate_answer_relevance`. As with the retrieval quality metrics, answer quality metrics can be evaluated using an LLM as a judge.

In [None]:
from ibm_watsonx_gov.metrics import FaithfulnessMetric


def get_local_context_node(evaluator):
    answer_quality_config_local_context_node = {
        "input_fields": ["input_text"],
        "context_fields": ["local_context"],
        "output_fields": ["generated_text"],
        "reference_fields": ["ground_truth"],
    }

    @evaluator.evaluate_answer_quality(
        configuration=AgenticAIConfiguration(
            **answer_quality_config_local_context_node
        ),
        # Uncomment the following line to evaluate the faithfulness metric using an LLM judge
        # metrics=[FaithfulnessMetric(llm_judge=llm_judge)],
    )
    # @evaluator.evaluate_faithfulness(configuration=AgenticAIConfiguration(**answer_quality_config_local_context_node))
    # @evaluator.evaluate_answer_similarity(configuration=AgenticAIConfiguration(**answer_quality_config_local_context_node))
    # @evaluator.evaluate_answer_relevance(configuration=AgenticAIConfiguration(**answer_quality_config_local_context_node))
    # @evaluator.evaluate_unsuccessful_requests(configuration=AgenticAIConfiguration(**answer_quality_config_local_context_node))
    def generate_local_context_node(state: GraphState, config: RunnableConfig) -> dict:

        result = generate_response(state["input_text"], state["local_context"])

        # print ("\n####################\n I am inside the vectordb tool\n####################\n")
        # print (result)
        return {**state, "generated_text": result}

    return generate_local_context_node


generate_local_context_node = get_local_context_node(evaluator)

#### Define web-search based answer generation tool

We are using the watsonx call in `generate_response` to generate an answer for our query.

The `generate_web_context_node` function defined below is decorated with the evaluator group `evaluate_answer_quality`, which computes the answer quality metrics: Answer Relevance, Faithfulness, Answer Similarity, and Unsuccessful Requests. This node reads the user query from the `input_text` attribute of the application state and the context chunks from the `web_context` attribute. After generating the answer, the node writes the result to the `generated_text` attribute of the application state. To compute individual metrics instead, decorate the function with the individual evaluators mentioned above.

In [None]:
def get_web_context_node(evaluator):
    answer_quality_config_web_context_node = {
        "input_fields": ["input_text"],
        "context_fields": ["web_context"],
        "output_fields": ["generated_text"],
        #    "reference_fields":["ground_truth"]
    }

    @evaluator.evaluate_answer_quality(
        configuration=AgenticAIConfiguration(**answer_quality_config_web_context_node)
    )
    def generate_web_context_node(state: GraphState, config: RunnableConfig) -> dict:

        result = generate_response(state["input_text"], state["web_context"])
        # print ("\n####################\n I am inside the websearch tool\n####################\n")
        # print (result)
        return {**state, "generated_text": result}

    return generate_web_context_node


generate_web_context_node = get_web_context_node(evaluator)

#### Adding a Router Function to Check Context Relevance

- The `check_context_relevance` function evaluates the retrieved context and assigns a **Context Relevance Score**.  
- If the score meets the required threshold, the workflow proceeds to the **Vector DB Answer Generation** node.  
- If the score is below the threshold, the workflow reroutes to the **Web Search Node** for additional information.  
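
The routing rule in the bullets above reduces to a small pure function. The sketch below is standalone and does not call the evaluator; the `route_on_score` helper is hypothetical, while the route labels and the 0.35 threshold mirror the ones used in this notebook. A missing score is treated as a bad score so that the agent falls back to web search.

```python
from typing import Optional

GOOD = "Context Relevance \nScore is Good"
BAD = "Context Relevance \nScore is Bad"


def route_on_score(score: Optional[float], threshold: float = 0.35) -> str:
    """Return the label of the edge to follow for a context relevance score."""
    if score is None:
        # No metric available: fall back to the web search route.
        return BAD
    return GOOD if score > threshold else BAD


for score in (0.8, 0.35, None):
    print(repr(score), "->", repr(route_on_score(score)))
```

Keeping the decision in a pure function makes the threshold easy to unit test and tune without invoking the graph.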

In [None]:
def check_context_relevance(state: GraphState, config: RunnableConfig) -> str:

    # Fetch the latest "context_relevance" result for the "local_search_node"
    latest_metric = evaluator.get_metric_result(
        metric_name="context_relevance", node_name="local_search_node"
    )
    if not latest_metric:
        print("\n######## NO Context Relevance Metric ###########\n")
        # Default to the web search route if no metric is found
        return "Context Relevance \nScore is Bad"

    print(
        "\n######## Context Relevance Metric: ###########\n",
        latest_metric.value,
    )

    # Route to the vector DB answer node only if the score is above the threshold
    if latest_metric.value > 0.35:
        print(
            "\n######## GOOD Context Relevance Metric: ###########\n",
            latest_metric.value,
        )
        return "Context Relevance \nScore is Good"
    else:
        print(
            "\n######## BAD Context Relevance Metric: ###########\n",
            latest_metric.value,
        )
        return "Context Relevance \nScore is Bad"

#### Assemble your application

In [None]:
from langgraph.graph import START, END, StateGraph


def build_llm_agent():
    graph = StateGraph(GraphState)

    # Add nodes
    graph.add_node("Vector DB \nRetrieval \nNode", local_search_node)
    graph.add_node("Web \nSearch \nNode", web_search_node)
    graph.add_node("Vector DB \nAnswer \nGeneration", generate_local_context_node)
    graph.add_node("Web Search \nAnswer \nGeneration", generate_web_context_node)

    # Add edges
    graph.add_edge(START, "Vector DB \nRetrieval \nNode")
    graph.add_conditional_edges(
        "Vector DB \nRetrieval \nNode",
        check_context_relevance,
        {
            "Context Relevance \nScore is Good": "Vector DB \nAnswer \nGeneration",
            "Context Relevance \nScore is Bad": "Web \nSearch \nNode",
        },
    )
    graph.add_edge("Web \nSearch \nNode", "Web Search \nAnswer \nGeneration")
    graph.add_edge("Vector DB \nAnswer \nGeneration", END)
    graph.add_edge("Web Search \nAnswer \nGeneration", END)

    # Compile the graph
    rag_app = graph.compile()

    return rag_app

In [None]:
rag_app = build_llm_agent()

#### Display the graph

**Note:** To see the graph structure, uncomment and run the cell below, which prints the graph's Mermaid syntax. Then follow these steps:

- Copy the entire printed text.

- Open https://mermaid.live

- Paste it into the editor.

The diagram will render instantly.


In [None]:
# # Get the raw Mermaid graph syntax
# mermaid_code = rag_app.get_graph().draw_mermaid()

# # Print it so you can copy-paste into mermaid.ink or mermaid.live
# print(mermaid_code)

 <div style="background-color:#dff6dd; padding:10px; border-radius:6px;">
  <h3 style="margin:0;">
  
  🤔 Discussion Point: 
  
Here we used a fixed threshold rather than an LLM as a judge. For which use cases would a fixed threshold not be a good fit?

Think about a scenario that would require LLM as a judge to route the agent.  </h3>
</div>


## 4. Run Agent With Test Data

### Do a single invocation

Now the application is invoked for a single row of data. You will see a new key as the input:

`ground_truth`: As the name suggests, this attribute holds the ground truth for your input text. It is needed for the answer similarity metric, which is a reference-based metric. It is not required for the other metrics.

In [None]:
import warnings

warnings.filterwarnings("ignore", message="Task exception was never retrieved")

evaluator.start_run()
result = rag_app.invoke(
    {
        "input_text": "how can I minimize my taxes?",
        "ground_truth": "To minimize taxes, consider the following strategies based on the provided context: 1.Adjust withholding: Review your withholding and update it if necessary to avoid underpayment or overpayment. This can help prevent surprise tax bills later. 2.Maximize tax-advantaged accounts: Contribute to accounts like Health Savings Accounts (HSAs), Flexible Spending Accounts (FSAs), or 529 plans. These accounts offer tax advantages that can add up over time. 3.Organize tax documents: Keep records of deductions, receipts, and other important financial paperwork. This can help you identify potential tax deductions and ensure you're not missing out on any tax-saving opportunities. 4.Automate saving and investing: An automated approach can help take the emotion out of investing and create a disciplined saving process, reducing the chance of impulsive changes that might lead to higher taxes. 5.Continue financial education: Stay informed about changes in tax rules and policies that could impact your ta situation. This can help you make proactive decisions to minimize your tax liability.",
        "record_id": "764545",
    }
)

evaluator.end_run()

### Prepare the app results

In [None]:
result.keys()
for key, val in result.items():
    print("-" * 20)
    print(f"{key=}")
    print(val)

In [None]:
eval_result = evaluator.get_result()

metric_result = eval_result.to_df()

### Display the metric results

In [None]:
from IPython.display import display

display(metric_result)

#### Get Metric Results for a specific node

In [None]:
eval_result.get_aggregated_metrics_results(node_name="Web Search \nAnswer \nGeneration")

<div style="background-color:#e6f7ff; padding:20px; border-radius:10px;
            border: 2px solid #3399ff; text-align:left; 
            display:inline-block;">

  <h1 style="margin-top:0;">🎉 🏆 🥳 Congratulations!</h1>

  <p style="font-size:18px;">
You have completed design-time evaluations for an advanced LangGraph question-answering agent that uses local documents and web search to answer a given question.
  </p>

  <p style="font-size:16px; color:#444;">
  If you would like to continue, the following sections show 
  <b>how to evaluate your agent using batch invocation</b> and 
  <b>how to compare the results of different experiments in watsonx Evaluation Studio</b>.
  </p>

</div>


### Invoke the graph on multiple rows

IBM watsonx.governance can also evaluate agentic applications with batch invocation. Here, a dataframe with questions and the ground truths for those questions is loaded.

#### Get a set of test questions for the RAG application

In [None]:
import pandas as pd

# supply a CSV with two columns: input_text, ground_truth
# input_text should be questions for the RAG app, ground_truth is examples of "good" or expected answers to the RAG query

question_bank_df = pd.read_csv(
    TEST_QUESTIONS_CSV_URL_1,
    encoding="ISO-8859-1",  # if the CSV fails to decode, try "cp1252"
)

question_bank_df.head()

## 5. Enable Experiment Tracking And Compare Agent Runs


In [None]:
ai_experiment_id = evaluator.track_experiment(
    name="Agentic Evaluation", use_existing=False
)

### Start the Experiment Run 1
Now the application is invoked on the first three rows of data. The `input_text` of each row is the user's raw input query or question.

In [None]:
import warnings

warnings.filterwarnings("ignore")

from ibm_watsonx_gov.entities.ai_experiment import AIExperimentRunRequest

name = "1st Agent Run - First 3 Questions"
# [OPTIONAL] Specify custom tags for the experiment run
custom_tags = []
"""
custom_tags = [
    {
        "key": "LLM",
        "value": "garbage"
    },
    {
        "key": "temperature",
        "value": 0.1
    }
]
"""
run_request = AIExperimentRunRequest(name=name, custom_tags=custom_tags)
evaluator.start_run(run_request)

results = []
for record in question_bank_df.head(3).to_dict("records"):
    result = rag_app.batch(inputs=[record])  # Wrap single record in a list
    results.append(result)

In [None]:
evaluator.end_run()

### Prepare the App results and Display the metrics
By default, the metric result will only include the interaction_id column.
If you want to include additional data like input, output or ground_truth, you can specify them in the input_data parameter.

In [None]:
if isinstance(results, list):
    for result in results:
        print("#" * 20)
        for key, val in result[0].items():
            print("-" * 20)
            print(f"{key=}")
            print(val)

In [None]:
from IPython.display import display

# Build one input row per metric row.
# Note: `result` holds the output of the last record processed in the loop above.
result_df = pd.DataFrame([result[0]])

input_data = result_df[["input_text", "generated_text"]]

eval_result = evaluator.get_result()
metric_result = eval_result.to_df()

# Repeat input_data for each row of metric_result
input_repeated = input_data.loc[
    input_data.index.repeat(len(metric_result))
].reset_index(drop=True)

# Combine
result_with_data = pd.concat(
    [input_repeated, metric_result.reset_index(drop=True)], axis=1
)

display(result_with_data)

### Start the Experiment Run 2
Now the application is invoked to evaluate the second run of the experiment.

In [None]:
import warnings

warnings.filterwarnings("ignore")

name = "2nd Agent Run - Last 3 Questions"
# [OPTIONAL] Specify custom tags for the experiment run
custom_tags = []
"""
custom_tags = [
    {
        "key": "LLM",
        "value": "Granite 20B"
    },
    {
        "key": "temperature",
        "value": 0.5
    }
]
"""
run_request = AIExperimentRunRequest(name=name, custom_tags=custom_tags)
evaluator.start_run(run_request)

results = []
for record in question_bank_df.tail(3).to_dict("records"):
    result = rag_app.batch(inputs=[record])  # Wrap single record in a list
    results.append(result)

In [None]:
evaluator.end_run()

In [None]:
from IPython.display import display

result_df = pd.DataFrame(result)
input_data = result_df[
    ["input_text", "generated_text"]
]  # Add the columns which should be part of the application metric results: "local_context", "web_context"
eval_result = evaluator.get_result()
metric_result = eval_result.to_df()
result_with_data = pd.concat([input_data, metric_result], axis=1)

display(result_with_data)

### Compare the AI experiment runs in Evaluation Studio UI 
You can use the Evaluation Studio UI to compare the AI experiment runs, using the URL printed by the cell below.

**Important Note:** When you click the link, you will be directed to a page that requires you to complete a setup before you can view the comparison plot. Please refer to the instructions in the provided PDF for a few simple steps to complete the setup.

In [None]:
from ibm_watsonx_gov.entities.ai_experiment import AIExperiment

ai_experiment = AIExperiment(asset_id=ai_experiment_id)

evaluator.compare_ai_experiments(ai_experiments=[ai_experiment])