<div style="background-color:#e6f7ff; padding:10px; border-radius:6px;">

# Design-Time Evaluation of LangGraph-Based RAG Agents with IBM watsonx.governance Python SDK


This notebook demonstrates how to use the Agentic AI evaluators from IBM watsonx.governance for governing your agentic applications right in your development environment.

First, we create a LangGraph RAG agent, then use the watsonx Agentic AI evaluator to evaluate the agent’s performance. The agent for this lab has the following architecture and uses local documents to perform a RAG task. 
We evaluate this agent on a range of metrics:

- answer similarity
- context relevance
- faithfulness
- retrieval latency
- generation latency
- interaction cost
- interaction duration
- input token count
- output token count

# Agent Architecture

<div>
  <img 
    src="https://raw.githubusercontent.com/ibm-self-serve-assets/building-blocks/main/trusted-ai/design-time-evaluations/agents-evaluations/images/Basic_Agent.png" 
    alt="Basic Agent" 
    width="10%">
</div>

## Important Note:

If you are using this watsonx instance for the first time, you must first associate your instance with a runtime. See the step-by-step PDF guide for detailed instructions.

</div>

In [None]:
# from ibm_agent_analytics.instrumentation.utils import record_span_attributes

### Install the dependencies

**Note:** When running the cell below, you can ignore the dependency mismatch warning; it won't affect the rest of the notebook.

In [None]:
!pip install --quiet "ibm-watsonx-gov[agentic,visualization,metrics]" "langchain-chroma<0.3.0" "chromadb>=1.0.13,<2.0.0" "langchain-openai<=0.3.0"
!pip install --quiet ibm_agent_analytics==0.5.4
!pip uninstall -y -qqq torch
!pip uninstall --quiet protobuf -y
!pip install --quiet --no-deps protobuf==4.25.3

### 🔑 Configure Authentication


Below is a brief description of the required environment variables.  
For detailed instructions on how to obtain them, please see the step-by-step PDF guide.  

- Only **WATSONX_APIKEY** and **WATSONX_PROJECT_ID** are required to run this notebook.  

### First-time setup
When you run the code snippet for the first time:  
1. A pop-up input bar will appear asking for each variable.  
2. Paste your **API key** and press **Enter**.  
3. Next, you will be prompted for the **Project ID**. Paste it and press **Enter**.

In [None]:
import os, getpass


def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")


_set_env("WATSONX_APIKEY")
_set_env("WATSONX_PROJECT_ID")

print("✅ Environment configured successfully!")
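
In a non-interactive run (for example, a scheduled job) nobody is there to answer the prompt above. The following is a minimal sketch, assuming you would rather fail fast in that situation. The helper name `missing_env_vars` is hypothetical and not part of any SDK; it only checks the two variables this notebook requires.

```python
import os
from typing import Iterable, List, Mapping, Optional

# The two variables this notebook needs.
REQUIRED_VARS = ("WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping so the check does not touch the real environment.
example_env = {"WATSONX_APIKEY": "dummy-key", "WATSONX_PROJECT_ID": ""}
print(missing_env_vars(env=example_env))
```

Calling `missing_env_vars()` with no arguments checks `os.environ` instead, so it could run before the rest of the notebook to report which variables still need to be set.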

### Create a Vector Store using WX embedding

We selected a few Medium posts by [Manish Bhide](https://medium.com/@manish.bhide) and [Ravi Chamarthy](https://medium.com/@ravi-chamarthy) that cover various capabilities of IBM watsonx.governance (and the erstwhile IBM Watson OpenScale). Our RAG queries therefore focus on the capabilities covered in these posts. 

The cell below downloads a JSON file containing the Medium posts from a shared URL and saves it locally as `medium_articles.json`.

In [None]:
import requests

url = "https://ibm.box.com/shared/static/o6jp3gfl3smyegjmjvsri1ll6zpq5jcv.json"
r = requests.get(url)
r.raise_for_status()

with open("medium_articles.json", "wb") as f:
    f.write(r.content)
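
A shared link can occasionally return an HTML error or login page instead of the expected file, and that would only surface later when the JSON is parsed. As a hedged sketch, the hypothetical helper `save_json_bytes` below parses the payload before writing it. It assumes the file is a JSON array of article entries, which matches how the notebook iterates over it later.

```python
import json
import tempfile
from pathlib import Path


def save_json_bytes(content: bytes, path: str) -> int:
    """Parse `content` as JSON, write it to `path`, and return the entry count."""
    # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
    data = json.loads(content.decode("utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of article entries")
    Path(path).write_bytes(content)
    return len(data)


# Demo with an in-memory payload written to a temporary directory.
sample = b'[{"id": 1, "document": "first post"}]'
target = Path(tempfile.mkdtemp()) / "medium_articles.json"
entry_count = save_json_bytes(sample, str(target))
print(entry_count)
```

In the notebook this could be called as `save_json_bytes(r.content, "medium_articles.json")` in place of the plain file write.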

Load the Medium articles from JSON, embed them with a watsonx embedding model, and store them in a Chroma vector database.

In [None]:
import json
from langchain.vectorstores import (
    Chroma,
)  # Chroma: a vector store for saving/retrieving embeddings.
from langchain.schema import (
    Document,
)  # Document: LangChain schema wrapper for text + metadata
from langchain_ibm import WatsonxEmbeddings

# Load JSON
with open("medium_articles.json", "r") as f:
    data = json.load(f)

# Extract valid documents
docs = []
for i, item in enumerate(data):
    if "id" in item and "document" in item and item["document"].strip():
        docs.append(
            Document(page_content=item["document"], metadata={"id": str(item["id"])})
        )
    else:
        print(f"⚠️ Skipping entry {i}: missing 'id' or 'document'")

# create an embedding model instance
embedding_model = WatsonxEmbeddings(
    model_id="ibm/slate-30m-english-rtrvr",
    url="https://us-south.ml.cloud.ibm.com",
    apikey=os.environ["WATSONX_APIKEY"],
    project_id=os.environ["WATSONX_PROJECT_ID"],
)

# Persist directory for Chroma (save vector database locally)
persist_dir = "vector_store"

# Create Chroma vector store
vector_store = Chroma.from_documents(
    documents=docs, embedding=embedding_model, persist_directory=persist_dir
)

# Save to disk
vector_store.persist()

# Optional: create retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

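# Test retrieval (invoke() is the current retriever entry point;
# `doc` avoids shadowing the `r` response object from the download cell)
results = retriever.invoke("example query text")
for doc in results:
    print("🔍", doc.page_content)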

### Set up the State

The `ibm-watsonx-gov` library provides a Pydantic-based state class, `EvaluationState`. It offers several ready-made attributes: for example, `input_text` stores the application input and `context` stores the context documents. For simple applications, developers can extend this class for their own use. 

In [None]:
from ibm_watsonx_gov.entities.state import EvaluationState


class AppState(EvaluationState):
    pass
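
The `AppState` above adds nothing to the base class. To illustrate what extending a state class with an application-specific attribute looks like, here is a sketch that uses a plain dataclass as a stand-in. `BaseState` is not the real `EvaluationState` (which is a Pydantic model); it only mirrors the attributes named in this notebook, and `retrieved_ids` is a hypothetical extra field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BaseState:
    """Stand-in for EvaluationState; only attributes named in this notebook."""

    input_text: str = ""
    context: List[str] = field(default_factory=list)
    generated_text: Optional[str] = None
    ground_truth: Optional[str] = None
    interaction_id: Optional[str] = None


@dataclass
class ExtendedAppState(BaseState):
    # Hypothetical extra attribute an application might want to track.
    retrieved_ids: List[str] = field(default_factory=list)


state = ExtendedAppState(input_text="What is drift?", retrieved_ids=["12"])
print(state.input_text, state.retrieved_ids, state.context)
```

With the real SDK class, the same idea would be expressed by adding annotated fields to the `AppState` subclass instead of `pass`.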

### Set up the evaluator

For evaluating your Agentic AI applications, you need to first instantiate the `AgenticEvaluator` class. This class defines a few evaluators to compute the different metrics.

We are going to use the following evaluators in this notebook:
1. `evaluate_context_relevance`: Computes the context relevance metric for your content retrieval tool.
2. `evaluate_faithfulness`: Computes the faithfulness metric for your answer generation tool. This metric does not require ground truth.
3. `evaluate_answer_similarity`: Computes the answer similarity metric for your answer generation tool. This metric requires ground truth.

You can defer metric computation until after the graph invocation by setting the `compute_real_time` flag to `False` (e.g., `evaluate_context_relevance(compute_real_time=False)`).

See this documentation for the full set of evaluation metrics in watsonx.governance: https://www.ibm.com/docs/en/watsonx/w-and-w/2.2.0?topic=models-evaluation-metrics

In [None]:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator

evaluator = AgenticEvaluator()
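
To build intuition for deferred computation, the toy sketch below shows one way a decorator can capture a node's inputs and outputs during the run and score them afterwards. This is not how `AgenticEvaluator` is implemented; `DeferredRecorder`, `record`, `compute`, and the `context_count` metric are all hypothetical names made up for illustration.

```python
import functools
from typing import Any, Callable, Dict, List


class DeferredRecorder:
    """Toy recorder: captures node inputs/outputs now, scores them later."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def record(self, metric_name: str) -> Callable:
        def decorator(func: Callable) -> Callable:
            @functools.wraps(func)
            def wrapper(state: Dict[str, Any]) -> Dict[str, Any]:
                output = func(state)
                # Store what a metric would need; no scoring happens here.
                self.records.append(
                    {
                        "metric": metric_name,
                        "node": func.__name__,
                        "state": dict(state),
                        "output": output,
                    }
                )
                return output

            return wrapper

        return decorator

    def compute(self, scorers: Dict[str, Callable]) -> List[Dict[str, Any]]:
        """Score every captured record after the run has finished."""
        return [
            {
                "metric": rec["metric"],
                "node": rec["node"],
                "value": scorers[rec["metric"]](rec["state"], rec["output"]),
            }
            for rec in self.records
        ]


recorder = DeferredRecorder()


@recorder.record("context_count")
def toy_retrieval(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"context": ["doc a", "doc b"]}


toy_retrieval({"input_text": "q"})
results = recorder.compute({"context_count": lambda state, out: len(out["context"])})
print(results)
```

The point of the pattern is that the decorated node runs at normal speed, and the potentially expensive scoring happens only once the run is over.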

## Build your LangGraph Agentic Application

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langgraph.config import RunnableConfig

#### Define content retrieval node

We use a similarity-with-threshold retrieval strategy: the retriever returns up to 3 documents that match the query, keeping only those whose relevance score meets the 0.1 threshold.

The `retrieval_node` defined below is decorated with the IBM watsonx.governance `evaluate_context_relevance` evaluator, which computes the context relevance metric. The node reads the user query from the `input_text` attribute of the application state and writes the retrieved documents to the `context` attribute of the application state.

In [None]:
@evaluator.evaluate_context_relevance(compute_real_time=False)
def retrieval_node(state: AppState, config: RunnableConfig) -> dict:
    similarity_threshold_retriever = vector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 3, "score_threshold": 0.1},
    )
    context = similarity_threshold_retriever.invoke(state.input_text)

    return {"context": [doc.page_content for doc in context]}
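
The selection rule behind this retrieval strategy can be sketched in plain Python. The example assumes relevance scores have already been computed, and it treats the threshold as inclusive; the function name `top_k_above_threshold` and the scores are hypothetical and only illustrate the filtering step, not the vector search itself.

```python
from typing import List, Tuple


def top_k_above_threshold(
    scored_docs: List[Tuple[str, float]],
    k: int = 3,
    score_threshold: float = 0.1,
) -> List[str]:
    """Keep documents scoring at least `score_threshold`, best first, at most `k`."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]


# Made-up scores for five candidate documents.
candidates = [("a", 0.8), ("b", 0.05), ("c", 0.3), ("d", 0.2), ("e", 0.15)]
print(top_k_above_threshold(candidates))
```

Note that a query with no sufficiently similar documents yields an empty list, so the generation node may receive an empty `context`.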

#### Define answer generation node

We are using `llama-3-3-70b-instruct` to generate an answer for our query.

The `generate_node` defined below is decorated with two evaluators, `evaluate_faithfulness` and `evaluate_answer_similarity`, which compute answer quality metrics. As with the retrieval node, it reads the user query from the `input_text` attribute of the application state, and it reads the context chunks from the `context` attribute. After generating the answer, the node writes the result to the `generated_text` attribute of the application state.

In [None]:
from langchain_ibm import WatsonxLLM

In [None]:
@evaluator.evaluate_faithfulness(compute_real_time=False)
@evaluator.evaluate_answer_similarity(compute_real_time=False)
def generate_node(state: AppState, config: RunnableConfig) -> dict:
    generate_prompt = ChatPromptTemplate.from_template(
        "Answer the following question based on the given context:\n"
        "Context: {context}\n"
        "Question: {input_text}\n"
        "Answer:"
    )

    formatted_prompt = generate_prompt.invoke(
        {"input_text": state.input_text, "context": "\n".join(state.context)}
    )

    # Initialize WatsonX LLM
    llm = WatsonxLLM(
        model_id="meta-llama/llama-3-3-70b-instruct",
        url="https://us-south.ml.cloud.ibm.com",
        project_id=os.getenv("WATSONX_PROJECT_ID"),
        params={
            "max_new_tokens": 500,
            "decoding_method": "greedy",
            "repetition_penalty": 1.1,
            "stop_sequences": ["."],
        },
    )

    result = llm.invoke(formatted_prompt)

    # Normalize result
    if hasattr(result, "content"):  # AIMessage
        output_text = result.content
    elif isinstance(result, str):  # plain string
        output_text = result
    else:
        raise TypeError(f"Unexpected llm.invoke return type: {type(result)}")

    return {"generated_text": output_text}
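
To see roughly what text the model receives, the template can be rendered with plain string formatting. This is only an approximation: `ChatPromptTemplate` wraps the text in a chat message, so the exact string sent to the model may differ. `render_prompt` is a hypothetical helper and the context snippets are made up.

```python
from typing import List

# Same template text as in generate_node.
PROMPT_TEMPLATE = (
    "Answer the following question based on the given context:\n"
    "Context: {context}\n"
    "Question: {input_text}\n"
    "Answer:"
)


def render_prompt(input_text: str, context: List[str]) -> str:
    """Join the context chunks with newlines and fill in the template."""
    return PROMPT_TEMPLATE.format(context="\n".join(context), input_text=input_text)


prompt_text = render_prompt(
    "What is counterfactual fairness?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt_text)
```

Printing the rendered prompt is a quick way to check how much context is being passed and whether the chunks are separated clearly.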

#### Assemble your application

Build and compile a LangGraph workflow connecting the retrieval and generation nodes for the RAG application.

In [None]:
from langgraph.graph import START, END, StateGraph

graph = StateGraph(AppState)

# Add nodes
graph.add_node("Retrieval \nNode", retrieval_node)
graph.add_node("Generation \nNode", generate_node)

# Add edges
graph.add_edge(START, "Retrieval \nNode")
graph.add_edge("Retrieval \nNode", "Generation \nNode")
graph.add_edge("Generation \nNode", END)

# Compile the graph
rag_app = graph.compile()
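
Conceptually, this graph is a straight line: each node receives the current state and returns a partial update that is merged back in. The sketch below mimics that flow with plain dictionaries. It assumes simple overwrite-on-merge behaviour and uses hypothetical toy nodes; it is not a substitute for LangGraph.

```python
from typing import Any, Callable, Dict, List

Node = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_linear(nodes: List[Node], initial_state: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state


def toy_retrieve(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"context": [f"snippet about {state['input_text']}"]}


def toy_generate(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"generated_text": f"Based on {len(state['context'])} snippet(s)."}


final_state = run_linear([toy_retrieve, toy_generate], {"input_text": "drift"})
print(final_state)
```

The final state contains the original input plus every key the nodes wrote, which is why the invocation result later in the notebook includes `context` and `generated_text` alongside `input_text`.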

#### Display the graph

**Note:** To view the graph structure, uncomment and run the cell below to print the graph's Mermaid syntax, then follow these steps: 

- Copy the entire printed text.

- Open https://mermaid.live

- Paste it in the editor.

The diagram will render instantly.


In [None]:
# # Get the raw Mermaid graph syntax
# mermaid_code = rag_app.get_graph().draw_mermaid()

# # Print it so you can copy-paste into mermaid.ink or mermaid.live
# print(mermaid_code)
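
For reference, Mermaid flowchart syntax is simple enough to write by hand. The sketch below builds a flowchart for a three-edge graph like this one. The helper `edges_to_mermaid` and the simplified node ids are assumptions for illustration; the text printed by `draw_mermaid()` will look different and include extra styling.

```python
from typing import Iterable, Tuple


def edges_to_mermaid(edges: Iterable[Tuple[str, str]]) -> str:
    """Render directed edges as a top-down Mermaid flowchart."""
    lines = ["graph TD;"]
    for source, target in edges:
        lines.append(f"    {source} --> {target};")
    return "\n".join(lines)


# Simplified node ids standing in for the graph's start, two nodes, and end.
rag_edges = [
    ("__start__", "retrieval"),
    ("retrieval", "generation"),
    ("generation", "__end__"),
]
mermaid_text = edges_to_mermaid(rag_edges)
print(mermaid_text)
```

The printed text can be pasted into https://mermaid.live in the same way as the output of the cell above.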

 <div style="background-color:#dff6dd; padding:10px; border-radius:6px;">
  <h3 style="margin:0;">
  
  🤔 Discussion Point: 
  
In this agentic example, what metrics do you think are important for evaluating the agent’s performance? </h3>
</div>

### Do a single invocation

The application is now invoked on a single row of data. The input contains two new keys:
1. `ground_truth`: As the name suggests, this attribute holds the ground truth for your input text. It is needed for the answer similarity metric, which is a reference-based metric. The other metrics do not require it.
2. `interaction_id`: This is required so that IBM watsonx.governance can keep track of individual rows and associate metrics with each row. This will become evident in the batch invocation later in the notebook.

In [None]:
evaluator.start_run()
result = rag_app.invoke(
    {
        "input_text": "What is counterfactual fairness?",
        "ground_truth": "Counterfactual fairness is a fairness criterion used in machine learning to ensure that a model’s decisions remain unchanged if a protected attribute (e.g., race, gender) were different, while keeping everything else the same.",
        "interaction_id": "1",
    }
)
evaluator.end_run()

In [None]:
# Print each key in the final state along with its value
for key, val in result.items():
    print("-" * 20)
    print(f"{key=}")
    print(val)

### Prepare the app results

In [None]:
import pandas as pd

eval_result = evaluator.get_result()
metric_result = eval_result.to_df()

### Display the metric results

In [None]:
from IPython.display import display

display(metric_result)

#### Get Metric Results for a specific node

In [None]:
eval_result.get_aggregated_metrics_results(node_name="Generation \nNode")

## Batch processing

### Invoke the graph on multiple rows

IBM watsonx.governance can also evaluate agentic applications through batch invocation. Here, a dataframe of questions and their ground truths is loaded. The dataframe index is used as the `interaction_id` to uniquely identify each row. 

In [None]:
import pandas as pd

question_bank_df = pd.read_csv(
    "https://raw.githubusercontent.com/IBM/ibm-watsonx-gov/refs/heads/samples/notebooks/data/agentic/medium_question_bank.csv"
)

question_bank_df["interaction_id"] = question_bank_df.index.astype(str)
question_bank_df
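
The same record preparation can be sketched without pandas. The example assumes the question bank has `input_text` and `ground_truth` columns, which is inferred from the records being passed directly to the graph; the sample rows are made up and `load_records` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List

# Made-up rows standing in for the real question bank.
SAMPLE_CSV = (
    "input_text,ground_truth\n"
    "What is drift?,A change in data over time.\n"
    "What is bias?,Systematic unfairness in outcomes.\n"
)


def load_records(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV rows and tag each one with its row position as interaction_id."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for index, row in enumerate(rows):
        row["interaction_id"] = str(index)
    return rows


records = load_records(SAMPLE_CSV)
print(records)
```

Each record then carries a unique `interaction_id`, which is what lets per-row metric values be matched back to the question that produced them.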

### Run the batch and display the metric results

In [None]:
evaluator.start_run()
result = rag_app.batch(inputs=question_bank_df.to_dict("records"))
evaluator.end_run()

In [None]:
eval_result = evaluator.get_result()
metric_result = eval_result.to_df()
display(metric_result)

<div style="background-color:#e6f7ff; padding:20px; border-radius:10px;
            border: 2px solid #3399ff; text-align:left; 
            display:inline-block;">

  <h1 style="margin-top:0;">🎉 🏆 🥳 Congratulations!</h1>

  <p style="font-size:18px;">
You have completed basic design-time evaluations of a LangGraph RAG agent that answers questions using local documents. 
</p>

</div>
