<div style="background-color:#e6f7ff; padding:10px; border-radius:6px;">

# Design-Time Evaluation of Tool Calling in a LangGraph Agent Using the IBM watsonx.governance Python SDK

This notebook demonstrates how to use the Tool call Syntactic Accuracy evaluator from IBM watsonx.governance for governing your applications right in your development environment.


First, we create a question-answering agent equipped with two custom tools, **"convert_currency"** and **"assess_loan_risk"**, to respond to user queries. Given the user's query, an LLM routes it to the relevant tool. If no tool is relevant to the question, the agent responds without calling a tool. We then use the Agentic AI evaluators from the IBM watsonx.governance Python SDK to evaluate the agent's tool-calling functionality in this lab on metrics such as:

- Tool call accuracy
- Tool call relevance
- Tool call latency
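
To make "syntactic accuracy" concrete, here is a minimal hand-rolled sketch of the kind of check involved: does the call name a known tool, and are the arguments present and of the expected type? This is an illustration only, not the SDK's implementation. The `TOOL_SPECS` table and the `check_tool_call` helper are hypothetical names introduced for this example; the `{"name": ..., "args": ...}` shape mirrors how LangChain represents tool calls.

```python
# Illustrative only: a hand-rolled syntactic check, not the SDK's implementation.
from typing import Any

# Hypothetical specs: tool name -> {parameter name: accepted Python types}
TOOL_SPECS: dict[str, dict[str, tuple[type, ...]]] = {
    "convert_currency": {
        "amount": (int, float),
        "from_currency": (str,),
        "to_currency": (str,),
    },
    "assess_loan_risk": {
        "credit_score": (int,),
        "income": (int, float),
        "loan_amount": (int, float),
    },
}


def check_tool_call(call: dict[str, Any]) -> list[str]:
    """Return a list of syntactic problems; an empty list means well-formed."""
    name = call.get("name")
    if name not in TOOL_SPECS:
        return [f"unknown tool: {name!r}"]

    spec = TOOL_SPECS[name]
    args = call.get("args", {})
    problems = []
    for param, expected in spec.items():
        if param not in args:
            problems.append(f"missing parameter: {param}")
        # bool is a subclass of int, so reject it explicitly
        elif isinstance(args[param], bool) or not isinstance(args[param], expected):
            problems.append(f"wrong type for {param}: {type(args[param]).__name__}")
    for param in args:
        if param not in spec:
            problems.append(f"unexpected parameter: {param}")
    return problems


good_call = {
    "name": "assess_loan_risk",
    "args": {"credit_score": 780, "income": 120000, "loan_amount": 30000},
}
bad_call = {
    "name": "assess_loan_risk",
    "args": {"credit_score": "780", "income": 120000},
}

print(check_tool_call(good_call))  # []
print(check_tool_call(bad_call))
```

The second call is flagged twice: `credit_score` arrives as a string, and `loan_amount` is missing. Relevance and parameter accuracy go beyond this kind of structural check, which is why they rely on an LLM judge.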

# Agent Architecture

<div>
  <img 
    src="https://raw.githubusercontent.com/ibm-self-serve-assets/building-blocks/main/trusted-ai/design-time-evaluations/agents-evaluations/images/Tool Calling_Agent.png" 
    alt="Advanced Agent" 
    width="15%">
</div>

### Install the dependencies

### Note: 

Ignore the dependency error warning after running the cell below. Your notebook will still run without problems.

In [None]:
# Install Watsonx + LangChain stack
!pip install --quiet \
    "ibm-watsonx-gov[agentic,visualization]" \
    "ibm-watsonx-ai>=1.3.1,<1.4.0" \
    "langchain-ibm>=0.3.10,<0.4.0" \
    "langchain-community<=0.3.3"

# Re-pin conflicting dependencies
!pip install --quiet --force-reinstall --no-deps \
    "protobuf==4.21.12" \
    "scikit-learn==1.3.2" \
    "jsonschema<=4.20.0" \
    "grpcio<=1.67.1"

!pip install --quiet ibm_agent_analytics==0.5.4

**Note**: If you encounter any Torch-related attribute errors while setting up the evaluator, try resolving them by running the cell below to uninstall Torch. This step is required when running in Watson Studio.

In [None]:
!pip uninstall -y -qqq torch

In [None]:
!pip uninstall -y -qqq transformers
!pip install -qqq "transformers[tf]<4.38"

### 🔑 Configure Authentication


Below is a brief description of the required environment variables.  
For detailed instructions on how to obtain them, please see the step-by-step PDF guide.  

- Only **WATSONX_APIKEY** and **WATSONX_PROJECT_ID** are required to run this notebook.  

### First-time setup
When you run the code snippet for the first time:  
1. A pop-up input bar will appear asking for each variable.  
2. Paste your **API key** and press **Enter**.  
3. Next, you will be prompted for the **Project ID**. Paste it and press **Enter**.  

In [None]:
import os, getpass


def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")


# For watsonx.governance Cloud
_set_env("WATSONX_APIKEY")
# _set_env("WATSONX_REGION")
# _set_env("WXG_SERVICE_INSTANCE_ID")

# set project ID for experiment tracking
_set_env("WATSONX_PROJECT_ID")

print("✅ Environment configured successfully!")

### Set up the Watsonx LLM model
In this section, we initialize `ChatWatsonx` from the `langchain_ibm` package to use IBM's Granite language model (`ibm/granite-3-3-8b-instruct`) for text generation.
Environment variables are used to securely access the IBM watsonx platform:

- WATSONX_URL: Base URL for the watsonx service (defaults to `https://us-south.ml.cloud.ibm.com` if unset).
- WATSONX_APIKEY: API key for authentication.
- WATSONX_PROJECT_ID: ID of the watsonx project to scope resources.

In [None]:
from langchain_ibm import ChatWatsonx
import os

llm = ChatWatsonx(
    model_id="ibm/granite-3-3-8b-instruct",
    url=os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
    # Uncomment the following lines if using watsonx.governance on-prem
    # username=os.getenv("WATSONX_USERNAME"),
    # password=os.getenv("WATSONX_PASSWORD"), # Only one of api_key or password is needed
    # instance_id="openshift",
    # version=os.getenv("WATSONX_VERSION"),
    apikey=os.getenv("WATSONX_APIKEY"),
    project_id=os.getenv("WATSONX_PROJECT_ID"),
    params={
        "decoding_method": "greedy",
        "temperature": 0,
        "min_new_tokens": 5,
        "max_new_tokens": 250,
        "stop_sequences": ["Human:", "Observation:"],
    },
)

### Defining Tools for the Agent
This cell sets up the custom tools that the agent can call during execution. Each tool wraps a specific Python function and provides a structured interface for interaction within the graph.

In [None]:
from langchain.tools import tool


@tool
def convert_currency(amount: float, from_currency: str, to_currency: str) -> dict:
    """Converts an amount of money from one currency to another (mock rates).

    Args:
        amount: The amount of money to convert.
        from_currency: The currency code of the amount (e.g., "USD").
        to_currency: The target currency code (e.g., "EUR").

    Returns:
        A dictionary with the converted value and details.
    """
    # Mock exchange rates (for demo purposes)
    rates = {
        "USD": {"EUR": 0.92, "JPY": 148.3, "GBP": 0.79},
        "EUR": {"USD": 1.09, "JPY": 161.2, "GBP": 0.86},
        "JPY": {"USD": 0.0067, "EUR": 0.0062, "GBP": 0.0053},
    }

    if from_currency not in rates or to_currency not in rates[from_currency]:
        return {
            "error": f"Conversion from {from_currency} to {to_currency} is not supported."
        }

    converted = round(amount * rates[from_currency][to_currency], 2)

    return {
        "from": f"{amount} {from_currency}",
        "to": f"{converted} {to_currency}",
        "rate_used": rates[from_currency][to_currency],
    }


@tool
def assess_loan_risk(credit_score: int, income: float, loan_amount: float) -> dict:
    """Assesses loan application risk based on credit score, income, and loan amount.

    Args:
        credit_score: Applicant’s credit score (300–850).
        income: Applicant’s annual income in USD.
        loan_amount: Requested loan amount in USD.

    Returns:
        A dictionary with risk classification, debt-to-income ratio, and recommendation.
    """

    # Debt-to-Income ratio
    dti = loan_amount / max(income, 1)  # avoid division by zero
    dti_percent = round(dti * 100, 2)

    # Rule-based risk classification
    if credit_score >= 750 and dti < 0.3:
        risk = "Low"
        recommendation = "Approve"
    elif credit_score >= 650 and dti < 0.5:
        risk = "Medium"
        recommendation = "Review manually"
    else:
        risk = "High"
        recommendation = "Reject"

    return {
        "credit_score": credit_score,
        "income_usd": income,
        "loan_amount_usd": loan_amount,
        "debt_to_income_ratio_percent": dti_percent,
        "risk_level": risk,
        "recommendation": recommendation,
    }


tools = [convert_currency, assess_loan_risk]
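
As a quick sanity check of the mock rules above, the following standalone sketch copies the loan rule into plain Python (no LangChain dependency) and works through a few applicants by hand. The `classify_loan` helper is a hypothetical name used only here; the first applicant matches the sample query used in the single invocation later in this notebook.

```python
def classify_loan(
    credit_score: int, income: float, loan_amount: float
) -> tuple[float, str, str]:
    """Plain-Python copy of the rule in assess_loan_risk, for hand-checking."""
    dti = loan_amount / max(income, 1)  # debt-to-income ratio
    dti_percent = round(dti * 100, 2)
    if credit_score >= 750 and dti < 0.3:
        return dti_percent, "Low", "Approve"
    if credit_score >= 650 and dti < 0.5:
        return dti_percent, "Medium", "Review manually"
    return dti_percent, "High", "Reject"


# 30000 / 120000 = 0.25, and 780 >= 750, so the first rule applies
print(classify_loan(780, 120_000, 30_000))  # (25.0, 'Low', 'Approve')
# 20000 / 50000 = 0.40, and 700 >= 650, so the second rule applies
print(classify_loan(700, 50_000, 20_000))   # (40.0, 'Medium', 'Review manually')
# A credit score below 650 always falls through to the last branch
print(classify_loan(600, 50_000, 10_000))   # (20.0, 'High', 'Reject')
```

Knowing the expected tool output for a given input makes it easier to tell a wrong tool call from a wrong tool result when reading the evaluation table.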

### Binding the Language Model with Tool Specifications

In [None]:
llm_with_tools = llm.bind_tools(tools)

### Set up the State

In [None]:
from typing_extensions import TypedDict
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage


class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        input_text (str):
            The user's raw input query or question.
        record_id (Optional[str]):
            Unique identifier for the record.
        generated_text (Optional[str]):
            The final output generated by the LLM after processing all contexts.
        tool_calls (list):
            The list of tools invoked for a user query
        messages (list):
            List of messages required for the LLM
    """

    messages: list  # List of messages required for the LLM
    input_text: str  # The user's raw input query or question
    tool_calls: list  # The list of tools invoked for a user query
    record_id: str  # Unique identifier for the record.
    generated_text: str  # The final output generated by the LLM after processing

#### Initialise the evaluator


In [None]:
from ibm_watsonx_gov.evaluators.agentic_evaluator import AgenticEvaluator

evaluator = AgenticEvaluator()

### Set up LLM Judge for metrics evaluation

The Tool Call Relevance and Tool Call Parameter Accuracy metrics require an LLM as the judge. To enable this, specify the model provider and model name, and provide the necessary credentials.

To evaluate a metric with an LLM judge, pass the `llm_judge` when creating the metric object. For example:

```python
# Define LLM Judge using watsonx.ai
llm_judge = LLMJudge(
    model=WxAIFoundationModel(
        model_id="meta-llama/llama-3-3-70b-instruct",
        project_id="<PROJECT_ID>",
    )
)

# Defining LLM Judge using OpenAI
llm_judge = LLMJudge(
    model=OpenAIFoundationModel(
        model_id="gpt-4o-mini",
    )
)

# Specify the LLM judge when initializing the metric
# (`evaluator` and `tool_call_metric_config` are defined elsewhere in this notebook)
@evaluator.evaluate_tool_call_parameter_accuracy(
    configuration=AgenticAIConfiguration(**tool_call_metric_config),
    metrics=[ToolCallParameterAccuracyMetric(llm_judge=llm_judge)],
)
def llm_agent(state, config):
    ...  # node body is defined later in this notebook
```

### Build your langgraph application

#### Define LLM Agent Node

This node defines the logic for invoking the agent within LangGraph. It takes the current `GraphState` and a `RunnableConfig` as input and invokes `llm_with_tools` with the user's input text.

It returns the raw response message in `messages` and passes the same response to `tool_calls`, the field from which the evaluators read the tool calls requested by the LLM.

The `llm_agent` node defined below is decorated with three IBM watsonx.governance evaluators:

- `evaluate_tool_call_accuracy` measures the syntactic correctness of the tool call.
- `evaluate_tool_call_relevance` determines whether the tool call made by the LLM agent correctly addresses the user's immediate request as the appropriate next step in the conversation.
- `evaluate_tool_call_parameter_accuracy` assesses whether all parameter values in a tool call are directly supported by the conversation history or the tool specifications.

The accuracy score ranges from `0` to `1`, with values closer to `1` indicating higher accuracy. The node reads the user query from the `input_text` attribute of the application state, then writes the LLM response to the `tool_calls` attribute and, as an `AIMessage`, to `messages`.

To compute the metrics after the graph invocation rather than during it, set the `compute_real_time` flag to `False` on the evaluator decorator.
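
The difference between real-time and deferred computation can be pictured with a toy decorator. This is a conceptual sketch only and makes no claim about how the SDK works internally. `DeferredRecorder`, `track`, `finalize`, and `has_tool_call` are hypothetical names invented for this illustration.

```python
import functools
from typing import Any, Callable


class DeferredRecorder:
    """Toy stand-in for an evaluator: records node I/O and scores it later."""

    def __init__(self) -> None:
        self.records: list[dict[str, Any]] = []

    def track(self, metric: Callable[[dict, dict], float], compute_real_time: bool = True):
        def decorator(node: Callable[[dict], dict]):
            @functools.wraps(node)
            def wrapper(state: dict) -> dict:
                output = node(state)
                record = {"input": state, "output": output, "metric": metric, "score": None}
                if compute_real_time:
                    # Scored immediately, while the node call is still in flight
                    record["score"] = metric(state, output)
                self.records.append(record)
                return output

            return wrapper

        return decorator

    def finalize(self) -> list[float]:
        """Score anything that was deferred, then return all scores."""
        for record in self.records:
            if record["score"] is None:
                record["score"] = record["metric"](record["input"], record["output"])
        return [record["score"] for record in self.records]


recorder = DeferredRecorder()


def has_tool_call(state: dict, output: dict) -> float:
    """Hypothetical metric: 1.0 if the node produced at least one tool call."""
    return 1.0 if output.get("tool_calls") else 0.0


@recorder.track(has_tool_call, compute_real_time=False)
def toy_node(state: dict) -> dict:
    return {"tool_calls": [{"name": "assess_loan_risk", "args": {}}]}


toy_node({"input_text": "Is this loan risky?"})
pending = [r["score"] for r in recorder.records]  # nothing scored yet
scores = recorder.finalize()                      # scoring happens here
print(pending, scores)  # [None] [1.0]
```

Deferring the scoring keeps slow work, such as LLM-judge calls, out of the node's execution path.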

In [None]:
from ibm_watsonx_gov.config import AgenticAIConfiguration
from langchain_core.runnables import RunnableConfig

In [None]:
from ibm_watsonx_gov.entities.llm_judge import LLMJudge
from ibm_watsonx_gov.entities.foundation_model import WxAIFoundationModel

llm_judge = LLMJudge(
    model=WxAIFoundationModel(
        model_id="meta-llama/llama-3-3-70b-instruct",
        project_id=os.getenv("WATSONX_PROJECT_ID"),
    )
)

In [None]:
import json
from ibm_watsonx_gov.metrics import (
    ToolCallRelevanceMetric,
    ToolCallParameterAccuracyMetric,
)

tool_call_metric_config = {
    "question_field": "input_text",
    "tool_calls_field": "tool_calls",
    "tools": tools,
}


@evaluator.evaluate_tool_call_relevance(
    configuration=AgenticAIConfiguration(**tool_call_metric_config),
    metrics=[ToolCallRelevanceMetric(llm_judge=llm_judge)],
    compute_real_time=False,
)
@evaluator.evaluate_tool_call_parameter_accuracy(
    configuration=AgenticAIConfiguration(**tool_call_metric_config),
    metrics=[ToolCallParameterAccuracyMetric(llm_judge=llm_judge)],
    compute_real_time=False,
)
@evaluator.evaluate_tool_call_accuracy(
    configuration=AgenticAIConfiguration(**tool_call_metric_config),
    compute_real_time=False,
)
def llm_agent(state: GraphState, config: RunnableConfig):

    user_query = state["input_text"]
    response = llm_with_tools.invoke([HumanMessage(user_query)])

    return {"messages": [response], "tool_calls": response}

### Tool Condition Function
- The tools_condition function evaluates the agent's response and determines the next step in the LangGraph workflow based on whether the LLM agent's most recent response includes a tool call.

In [None]:
def tools_condition(state: GraphState, config: RunnableConfig) -> str:
    last_message = state["messages"][-1]
    # If the agent wants to call a tool, go to tools node
    if isinstance(last_message, AIMessage) and last_message.tool_calls:
        return "Has Tools"
    # Otherwise, we're done
    else:
        return "No Tools"

### Define Answer generation node
This node generates the final answer based on the tool execution results, if available; otherwise, it returns a default response.

In [None]:
def generate_response(state: GraphState, config: RunnableConfig) -> dict:
    tool_results = "\n".join(
        [msg.content for msg in state["messages"] if isinstance(msg, ToolMessage)]
    )
    print("\n########## tool results: ", tool_results)
    if not tool_results:
        tool_results = "I'm sorry, but the agent couldn't process your query. Please try rephrasing or providing more details so I can better assist you."
    return {
        "messages": [AIMessage(content=f"Final Answer:\n{tool_results}")],
        "generated_text": tool_results,
    }

#### Assemble your application

**Note:** If you encounter an issue while importing ToolNode from langgraph, please force install langgraph using the command `pip install --upgrade --force-reinstall "langgraph>=0.3.34,<0.4.0"`

In [None]:
!pip install -qqq --upgrade --force-reinstall "langgraph>=0.3.34,<0.4.0" 2>/dev/null

In [None]:
from langgraph.graph import START, END, StateGraph

# from langgraph.prebuilt import ToolNode
from langgraph.prebuilt.tool_node import ToolNode


graph = StateGraph(GraphState)

graph.add_node("LLM Agent", llm_agent)
graph.add_node("Generate Response", generate_response)
graph.add_node("Tools", ToolNode(tools))

graph.set_entry_point("LLM Agent")
graph.add_conditional_edges(
    "LLM Agent",
    tools_condition,
    {"Has Tools": "Tools", "No Tools": "Generate Response"},
)
graph.add_edge("Tools", "Generate Response")
graph.add_edge("Generate Response", END)

# Compile the graph
rag_app = graph.compile()

#### Display the graph

**Note:** The cell below prints the graph's Mermaid syntax. To see the graph image, follow these steps:

- Copy the entire printed text.

- Open https://mermaid.live

- Paste it in the editor.

The diagram will render instantly.


In [None]:
# Get the raw Mermaid graph syntax
mermaid_code = rag_app.get_graph().draw_mermaid()

# Print it so you can copy-paste into mermaid.ink or mermaid.live
print(mermaid_code)

 <div style="background-color:#dff6dd; padding:10px; border-radius:6px;">
  <h3 style="margin:0;">
  
  🤔 Discussion Point: 

Consider an agent equipped with multiple tools. How should the agent decide which tool to invoke?   

What criteria would you use (e.g., accuracy, latency, reliability, interpretability)?

How might you design the graph to handle fallbacks or errors if the first tool call fails?

Should the agent always try to reason first, or immediately delegate to a tool? Why might reasoning first not be beneficial?
  </h3>
</div>

### Do a single invocation

Now the application is invoked for a single row of data.

#### Evaluator Integration
1. `evaluator.start_run()`: Begins tracking this specific invocation. 

2. `evaluator.end_run()`: Marks the end of the evaluation run. All metrics will now be captured and logged.
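
Because `start_run()` and `end_run()` always come as a pair, one option is to wrap them in a context manager so the run is closed even if the invocation raises. The sketch below demonstrates the pattern with a stand-in object. `evaluation_run` and `FakeEvaluator` are hypothetical names, and whether closing a run after a failed invocation is the right behavior with the real SDK is an assumption to verify.

```python
from contextlib import contextmanager


@contextmanager
def evaluation_run(evaluator):
    """Call start_run() on entry and end_run() on exit, even if the body raises."""
    evaluator.start_run()
    try:
        yield evaluator
    finally:
        evaluator.end_run()


class FakeEvaluator:
    """Stand-in used only to demonstrate the pattern without the SDK."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def start_run(self) -> None:
        self.events.append("start")

    def end_run(self) -> None:
        self.events.append("end")


fake = FakeEvaluator()
try:
    with evaluation_run(fake):
        raise RuntimeError("graph invocation failed")
except RuntimeError:
    pass

print(fake.events)  # ['start', 'end']
```

In this notebook the same helper would be used as `with evaluation_run(evaluator): rag_app.invoke(...)`, replacing the three separate cells.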

In [None]:
evaluator.start_run()

In [None]:
result = rag_app.invoke(
    {
        "input_text": "what is the risk score for a 30000 loan with credit score 780 and income 120000?"
    }
)

In [None]:
evaluator.end_run()

### Prepare the app results

In [None]:
result = evaluator.get_result()

In [None]:
display(result.to_df())

In [None]:
result.get_aggregated_metrics_results(node_name="LLM Agent")

### Invoke the graph on multiple rows

IBM watsonx.governance also supports evaluation of Agentic Applications using batch invocation. The following DataFrame contains sample questions that will be used to demonstrate this capability.

In [None]:
import pandas as pd


question_bank_df = pd.read_csv(
    "https://ibm.box.com/shared/static/khgg3aeo157z97yha1f3dfml6lliomqj.csv"
)

question_bank_df

#### Execute Batch Invocation

In [None]:
evaluator.start_run()
result = rag_app.batch(inputs=question_bank_df.to_dict("records"))
evaluator.end_run()

In [None]:
result = evaluator.get_result()
display(result.to_df())

In [None]:
result.get_aggregated_metrics_results(node_name="LLM Agent")
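
For intuition about what an aggregated view summarizes, the sketch below reduces per-record scores to count, min, max, and mean. The scores and metric labels are made up for illustration; the SDK's actual output fields and names may differ.

```python
from statistics import mean

# Made-up per-record scores for illustration; real values come from the evaluator.
scores_by_metric = {
    "tool_call_relevance": [1.0, 0.5, 1.0, 0.0],
    "tool_call_parameter_accuracy": [1.0, 1.0, 0.5, 0.5],
}


def summarize(scores: list[float]) -> dict[str, float]:
    """Reduce a list of per-record scores to simple summary statistics."""
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
    }


summary = {name: summarize(values) for name, values in scores_by_metric.items()}
for name, stats in summary.items():
    print(name, stats)
```

A low minimum alongside a high mean usually points to a handful of failing questions, which are worth inspecting individually in the per-record table.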

<div style="background-color:#e6f7ff; padding:20px; border-radius:10px;
            border: 2px solid #3399ff; text-align:left; 
            display:inline-block;">

  <h1 style="margin-top:0;">🎉 🏆 🥳 Congratulations!</h1>

  <p style="font-size:18px;">
You have completed design-time evaluations of the tool-calling functionality in a LangGraph agent designed to answer questions using custom tools.
</p>

</div>
