# Evaluating Agent Trajectories via Tool-Use Consistency

This notebook implements an evaluation framework for agentic systems that explicitly assesses **intermediate decision-making**, rather than relying solely on final answers. The focus is on diagnosing cases where agents appear correct at the output level while following unintended reasoning paths internally.

## Motivation: Evaluating process, not just outcome

Agentic systems built on large language models make decisions through sequences of tool calls and intermediate reasoning steps. In practice, I observed that evaluating these systems purely on final answer correctness often masks important failure modes: agents may arrive at correct answers through spurious searches, unnecessary tool usage, or coincidental reasoning paths that do not generalize.

This notebook focuses on **trajectory-level evaluation**: assessing whether an agent’s *sequence of actions* matches an expected decision process. The goal is not to enforce a single “correct” reasoning path, but to detect cases where agents rely on shortcuts or exhibit unstable behaviour that would not be apparent from output-level metrics alone.


## Experimental Setup

The experiments below use a small, controlled evaluation dataset in which each example includes:
- a user query,
- a reference final answer,
- and an expected sequence of tool calls representing a minimal, intended decision path.

This structure allows us to compare the agent’s observed trajectory against a known baseline and isolate deviations in tool use independently of answer correctness.

In [None]:
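# The '%pip install' magic installs Python packages from within the notebook.
# The -U flag upgrades each package to its latest version.
# The cells below also import langchain_openai, langchain_community, langsmith and dateutil,
# and the DuckDuckGo search tool depends on the duckduckgo-search package.
%pip install -U langchain langchain-openai langchain-community langsmith openai duckduckgo-search python-dateutil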

Next, we configure our environment variables. Reading API keys from the environment keeps them out of the agent code itself, but avoid committing or sharing a notebook that contains real keys.

- **`LANGCHAIN_API_KEY`**: Your secret key for authenticating with LangSmith.
- **`OPENAI_API_KEY`**: Your secret key for the OpenAI API, required for the agent's LLM.
- **`LANGCHAIN_ENDPOINT`**: This URL directs all LangChain tracing data to the LangSmith platform.

**Action Required**: You must replace the placeholder values with your actual keys.

In [None]:
import os # Import the 'os' module to interact with the operating system.

os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY" # Set your OpenAI API key as an environment variable.
# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.

## Dataset Construction

Each evaluation example contains an `expected_steps` field specifying the ordered list of tools the agent is expected to invoke. This serves as a lightweight form of ground truth for the agent’s decision process.

For queries that do not require external tools, the expected trajectory is intentionally empty. This enables us to detect unnecessary tool usage, which is a common failure mode in practice.
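
Because the trajectory comparison is done on literal tool names, a misspelled entry in `expected_steps` would cause every run on that example to be scored as incorrect. The following is a minimal sketch of a pre-upload sanity check. It assumes the tool names `duck_duck_go` and `check_calendar` used by the agent in this notebook; `KNOWN_TOOLS` and `validate_expected_steps` are hypothetical helpers and not part of LangSmith or LangChain.

```python
# Hypothetical helper for illustration; not part of LangSmith or LangChain.
KNOWN_TOOLS = {"duck_duck_go", "check_calendar"}  # assumed to match the agent's tool names


def validate_expected_steps(examples: list[tuple[str, dict]]) -> list[str]:
    """Return one human-readable problem per malformed example."""
    problems = []
    for question, outputs in examples:
        steps = outputs.get("expected_steps")
        if not isinstance(steps, list):
            problems.append(f"{question!r}: expected_steps must be a list")
            continue
        unknown = [step for step in steps if step not in KNOWN_TOOLS]
        if unknown:
            problems.append(f"{question!r}: unknown tools {unknown}")
    return problems


# Illustrative examples only; the second one is misspelled on purpose.
sample = [
    ("hi", {"reference": "Hello, how can I assist you?", "expected_steps": []}),
    ("Who is X?", {"reference": "...", "expected_steps": ["duckduckgo"]}),
]
problems = validate_expected_steps(sample)
print(problems)
```

An empty `expected_steps` list passes the check, since it is the intended way to mark queries that should be answered without tools.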

In [None]:
import uuid # Import the uuid library to generate unique identifiers.

from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

# Define the list of questions and their corresponding outputs.
questions = [
    (
        "Why was a $10 calculator app one of the best-rated Nintendo Switch games?",
        {
            "reference": "It became an internet meme due to its high price point.", # The ground-truth final answer.
            "expected_steps": ["duck_duck_go"], # The expected sequence of tool calls.
        },
    ),
    (
        "hi",
        {
            "reference": "Hello, how can I assist you?", # The expected direct response.
            "expected_steps": [],  # Expect a direct response with no tools used.
        },
    ),
    (
        "Who is Dejan Trajkov?",
        {
            "reference": "Macedonian Professor, Immunologist and Physician",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "Who won the 2023 U23 world wrestling champs (men's freestyle 92 kg)",
        {
            "reference": "Muhammed Gimri from Turkey",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "What's my first meeting on Friday?",
        {
            "reference": 'Your first meeting is 8:30 AM for "Team Standup"',
            "expected_steps": ["check_calendar"],  # Only expect the calendar tool to be used.
        },
    ),
]

uid = uuid.uuid4() # Generate a new unique identifier.
dataset_name = f"Agent Eval Example {uid}" # Create a unique name for the dataset.
# Create the dataset on the LangSmith platform.
ds = client.create_dataset(
    dataset_name=dataset_name,
    description="An example agent evals dataset using search and calendar checks.",
)
# Create the examples in the dataset.
client.create_examples(
    inputs=[{"question": q[0]} for q in questions], # The inputs are a list of question dictionaries.
    outputs=[q[1] for q in questions], # The outputs are a list of the corresponding output dictionaries.
    dataset_id=ds.id, # Link these examples to the dataset we just created.
)

## Agent Definition

The agent used here is intentionally simple and constrained. It has access to:
- a web search tool,
- and a mock calendar lookup function.

The purpose is not to optimise task performance, but to create a controlled environment where deviations in tool usage are easy to interpret. The agent executor is configured to return intermediate steps so that trajectories can be analysed post hoc.

In [None]:
from dateutil.parser import parse # A utility to parse date strings into datetime objects.
from langchain.agents import AgentExecutor, create_openai_tools_agent # Import core agent components.
from langchain_openai import ChatOpenAI # The OpenAI chat model wrapper.
from langchain_community.tools import DuckDuckGoSearchResults # The DuckDuckGo search tool.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder # Prompting utilities.
from langchain_core.tools import tool # The decorator for creating custom tools.


# The '@tool' decorator easily turns a Python function into a LangChain tool.
@tool
def check_calendar(date: str) -> list:
    """Check the user's calendar for meetings on the specified datetime (in ISO format).""" # The docstring is used as the tool's description for the agent.
    date_time = parse(date) # Parse the input date string.
    # This is a mock implementation to demonstrate the concept.
    if date_time.weekday() == 4: # 4 corresponds to Friday.
        return [
            "8:30 : Team Standup",
            "9:00 : 1 on 1",
            "9:45 : Design review",
        ]
    return ["Focus time"] # Return a default for other days.


# Define the main function that creates and runs our agent.
def agent(inputs: dict):
    # Initialize the LLM. We use a model that's good at function calling.
    llm = ChatOpenAI(
        model="gpt-3.5-turbo-16k",
        temperature=0, # Set temperature to 0 for more deterministic, repeatable outputs.
    )
    # Define the list of tools the agent has access to.
    tools = [
        DuckDuckGoSearchResults(
            name="duck_duck_go" # Give the tool a specific name.
        ),
        check_calendar, # Our custom calendar tool.
    ]
    # Define the prompt template for the agent.
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful assistant."),
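            ("user", "{question}"), # Placeholder for the user's input question.
            # The scratchpad follows the question so that tool calls and results appear after the request they answer.
            MessagesPlaceholder(variable_name="agent_scratchpad"), # Placeholder for intermediate steps.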
        ]
    )
    # Create the runnable agent component.
    runnable_agent = create_openai_tools_agent(llm, tools, prompt)

    # Create the Agent Executor, which orchestrates the agent's runs.
    executor = AgentExecutor(
        agent=runnable_agent,
        tools=tools,
        handle_parsing_errors=True, # Gracefully handle any parsing errors.
        return_intermediate_steps=True, # CRITICAL: This must be True to get the trajectory for evaluation.
    )
    # Invoke the executor with the inputs.
    return executor.invoke(inputs)

## Trajectory Consistency Evaluator

We define a custom evaluator that compares the agent’s observed tool-use trajectory to the expected sequence specified in the dataset.

This evaluator is deliberately strict: it scores a run as correct only when the tool sequence matches exactly. While this does not capture all valid reasoning paths, it provides a clear signal for identifying shortcut behaviour, unnecessary tool calls, or deviations caused by prompt sensitivity.
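
To make the strictness concrete, the sketch below applies the same exact-match comparison to a few hand-written trajectories. `FakeAction` is a hypothetical stand-in for the action objects returned by the agent executor (only its `tool` attribute is used), and the trajectories are illustrative rather than recorded agent runs.

```python
from dataclasses import dataclass


@dataclass
class FakeAction:
    """Hypothetical stand-in for an agent action; only `tool` matters here."""

    tool: str


def tool_trajectory(intermediate_steps: list) -> list[str]:
    """Extract the ordered tool names from (action, observation) pairs."""
    return [action.tool for action, _ in intermediate_steps]


def exact_match(intermediate_steps: list, expected_steps: list[str]) -> int:
    """Return 1 only if the observed tool sequence equals the expected one."""
    return int(tool_trajectory(intermediate_steps) == expected_steps)


one_search = [(FakeAction("duck_duck_go"), "search results")]
two_searches = one_search * 2  # the same tool called twice in a row

scores = {
    "expected single search": exact_match(one_search, ["duck_duck_go"]),
    "redundant second search": exact_match(two_searches, ["duck_duck_go"]),
    "unnecessary tool call": exact_match(one_search, []),
    "direct answer, no tools": exact_match([], []),
}
print(scores)
```

Under this rule a redundant second search and an unnecessary tool call both score 0, even if the final answer is right, which is the behaviour the strict evaluator is meant to flag.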

In [None]:
from typing import Optional # Import typing hints.

from langsmith.schemas import Example, Run # Import the Run and Example schemas from LangSmith.


# Define the custom evaluator function.
def intermediate_step_correctness(run: Run, example: Optional[Example] = None) -> dict:
    if run.outputs is None: # A safety check to ensure the run has completed and has outputs.
        raise ValueError("Run outputs cannot be None")
    # Get the 'intermediate_steps' from the agent's output, defaulting to an empty list if not found.
    intermediate_steps = run.outputs.get("intermediate_steps") or []
    # The intermediate_steps list contains tuples of (AgentAction, observation).
    # We only care about the action's 'tool' attribute.
    # This list comprehension extracts the tool name for each step.
    trajectory = [action.tool for action, _ in intermediate_steps]
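    # Retrieve the ground-truth trajectory from our dataset example.
    # The signature allows example=None, so fail with a clear message instead of an AttributeError.
    if example is None or example.outputs is None:
        raise ValueError("An example with 'expected_steps' is required")
    expected_trajectory = example.outputs["expected_steps"]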
    # Perform a simple equality check between the actual and expected trajectories.
    score = int(trajectory == expected_trajectory)
    # Return the result in the format required by LangSmith.
    return {"key": "Intermediate steps correctness", "score": score}

## Running the Evaluation

We run two complementary evaluators:
1. A standard QA evaluator that checks final answer correctness.
2. A custom trajectory evaluator that checks consistency of tool usage.

Separating these signals allows us to distinguish between:
- failures of knowledge or retrieval, and
- failures of decision-making or control flow.

In practice, I found that these two metrics often diverge, highlighting cases where agents appear correct while behaving unreliably internally.
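
Combining the two binary scores gives four possible outcomes per example. The sketch below shows one way to tabulate them. The `records` list holds illustrative values only, not results from this notebook, and the keys `answer_correct` and `trajectory_correct` are hypothetical names; extracting per-example scores from LangSmith is not shown.

```python
from collections import Counter

# Illustrative values only; these are not results from this notebook.
records = [
    {"answer_correct": 1, "trajectory_correct": 1},
    {"answer_correct": 1, "trajectory_correct": 0},
    {"answer_correct": 0, "trajectory_correct": 1},
    {"answer_correct": 1, "trajectory_correct": 0},
]

# Map each (answer, trajectory) score pair to a readable outcome label.
LABELS = {
    (1, 1): "correct answer, expected path",
    (1, 0): "correct answer, unexpected path",
    (0, 1): "wrong answer, expected path",
    (0, 0): "wrong answer, unexpected path",
}


def cross_tabulate(rows: list[dict]) -> Counter:
    """Count how many examples fall into each of the four outcomes."""
    return Counter(
        LABELS[(row["answer_correct"], row["trajectory_correct"])] for row in rows
    )


table = cross_tabulate(records)
for label in LABELS.values():
    print(f"{label}: {table[label]}")
```

The "correct answer, unexpected path" cell is the one output-level metrics hide, while "wrong answer, expected path" points towards knowledge or retrieval failures rather than control flow.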

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate # Import the evaluation functions.


# Define a data preparation function for the standard QA evaluator.
def prepare_data(run: Run, example: Example) -> dict:
    # This function creates the specific dictionary that the 'qa' evaluator expects.
    return {
        "input": example.inputs["question"], # The original question.
        "prediction": run.outputs["output"], # The agent's final answer.
        "reference": example.outputs["reference"], # The ground-truth final answer.
    }


# Create an instance of the standard QA evaluator, passing our data preparation function.
qa_evaluator = LangChainStringEvaluator("qa", prepare_data=prepare_data)

# Run the full evaluation.
chain_results = evaluate(
    agent, # The agent function to be tested.
    data=dataset_name, # The name of our dataset in LangSmith.
    # A list containing both our custom evaluator and the standard QA evaluator.
    evaluators=[intermediate_step_correctness, qa_evaluator],
    experiment_prefix="Agent Eval Example", # A prefix for the experiment name in LangSmith.
    max_concurrency=1, # Run sequentially as some agents/tools may not be thread-safe.
)

Error running target function: _get_url() https://links.duckduckgo.com/d.js DuckDuckGoSearchException: Ratelimit

## Discussion

This evaluation setup demonstrates how trajectory-level analysis can surface failure modes that are invisible to output-based metrics alone. Even in simple settings, agents frequently arrive at correct answers through unintended or unstable sequences of actions.

While the evaluator implemented here is intentionally simple, it provides a useful baseline for diagnosing agent behaviour and motivates more flexible approaches (e.g. partial matches, semantic trajectory grading) in settings where multiple decision paths may be acceptable.
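
As one possible form of partial matching, the sketch below scores a trajectory by the longest common subsequence (LCS) it shares with the expected sequence, normalised by the combined length of both. This is an assumption about what a partial-match evaluator could look like, not a method used in the experiments above; `lcs_length` and `partial_trajectory_score` are hypothetical helpers.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def partial_trajectory_score(observed: Sequence[str], expected: Sequence[str]) -> float:
    """Score in [0, 1]: 1.0 for identical sequences, lower for extra or missing steps."""
    if not observed and not expected:
        return 1.0  # no tools expected and none used
    return 2 * lcs_length(observed, expected) / (len(observed) + len(expected))


exact = partial_trajectory_score(["duck_duck_go"], ["duck_duck_go"])
redundant = partial_trajectory_score(["duck_duck_go", "duck_duck_go"], ["duck_duck_go"])
wrong_tool = partial_trajectory_score(["check_calendar"], ["duck_duck_go"])
print(exact, redundant, wrong_tool)
```

Unlike the exact-match evaluator, a redundant second search receives partial credit (2/3 here) rather than 0, while calling the wrong tool still scores 0. Whether that leniency is desirable depends on how costly extra tool calls are in the target setting.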
