<a href="https://colab.research.google.com/github/nov05/Google-Colaboratory/blob/master/generative_ai_with_langchain/08_02_advanced_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Notebook modified by nov05 on 2025-06-13  

In [1]:
%%capture
!pip install langchain langsmith langchain_community langchain_openai
## ⚠️ Restart session after installation

*Successfully installed dataclasses-json-0.6.7 httpx-sse-0.4.0 langchain-core-0.3.65 langchain_community-0.3.25 langchain_openai-0.3.23 langsmith-0.3.45 marshmallow-3.26.1 mypy-extensions-1.1.0 pydantic-settings-2.9.1 python-dotenv-1.1.0 typing-inspect-0.9.0*  

**Make sure you load the API keys for cloud providers!**

You can set your environment keys yourself or use a script. Please note that since keys are private, they are not included in the repository.

In [None]:
# # setting the environment variables, the keys
# import sys
# import os
# sys.path.insert(0, os.path.abspath('..'))
# from config import set_environment
# # for the keys - as explained early in chapter 2
# set_environment()

In [1]:
import os
from google.colab import userdata
# os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = userdata.get("LANGSMITH_API_KEY")
# os.environ["LANGSMITH_PROJECT"] = "generative_ai_with_langchain"
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
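
Outside Colab, `google.colab.userdata` is not available. As a minimal sketch of an alternative, assuming the keys have already been exported in the shell (or loaded by a script of your own), the helpers below check that they are present. `require_env` and `check_keys` are hypothetical names, not part of LangChain or LangSmith.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value


def check_keys(names=("LANGSMITH_API_KEY", "OPENAI_API_KEY")) -> dict:
    """Report which keys are set (True/False) without revealing their values."""
    return {name: bool(os.environ.get(name, "").strip()) for name in names}


# Prints a dict of booleans, never the key values themselves.
print(check_keys())
```

Calling `require_env("OPENAI_API_KEY")` at the top of the notebook makes a missing key fail early instead of in the middle of an evaluation run.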

# 🟢 **Advanced Evaluation**  

## 👉 **Chain-of-Thought Evaluation**  

* The following code uses an OpenAI chat model by default (`load_evaluator` falls back to `gpt-4` unless you pass your own model via `llm=`), so you'll need to set `os.environ['OPENAI_API_KEY']`.  

  * Check the documentation at https://python.langchain.com/docs/integrations/chat/openai/.  

* `class langchain.evaluation.qa.eval_chain.CotQAEvalChain`: LLM Chain for evaluating QA using chain of thought reasoning.    

  * Check the documentation at https://python.langchain.com/api_reference/langchain/evaluation/langchain.evaluation.qa.eval_chain.CotQAEvalChain.html  

  * **ChatGPT**: LangChain's meta-evaluation studies reportedly show that CoT-style evaluation generally aligns better with human judgments than a direct verdict, at the cost of slightly higher token use and runtime.    

| Task Type       | Evaluator    | Behavior                                             |
| --------------- | ------------ | ---------------------------------------------------- |
| QA              | `cot_qa`     | Uses chain-of-thought reasoning to judge correctness |
| QA              | `qa`         | Direct correctness verdict without CoT               |
| QA with context | `context_qa` | Uses reference context in straightforward judgment   |
  

  

In [None]:
%%time
from langchain.evaluation import load_evaluator
from pprint import pprint

# Simulated chain-of-thought reasoning provided by the agent:
agent_reasoning = (
    "The current interest rate is 0.25%. I determined this by recalling that recent monetary policies have aimed "
    "to stimulate economic growth by keeping borrowing costs low. A rate of 0.25% is consistent with the ongoing "
    "trend of low rates, which encourages consumer spending and business investment."
)
# Expected reasoning reference:
expected_reasoning = (
    "An ideal reasoning should mention that the Federal Reserve has maintained a low interest rate—around 0.25%—to "
    "support economic growth, and it should briefly explain the implications for borrowing costs and consumer spending."
)
# Load the chain-of-thought evaluator.
cot_evaluator = load_evaluator("cot_qa")  ## ⚠️ Calls the OpenAI API, so OPENAI_API_KEY must be set

result_reasoning = cot_evaluator.evaluate_strings(
    input="What is the current Federal Reserve interest rate and why does it matter?",
    prediction=agent_reasoning,
    reference=expected_reasoning,
)
print("Chain-of-Thought Reasoning Evaluation:")
pprint(result_reasoning)

Chain-of-Thought Reasoning Evaluation:
{'reasoning': 'The student correctly identifies the current Federal Reserve '
              'interest rate as 0.25%. They also correctly explain that this '
              'low rate is intended to stimulate economic growth by keeping '
              'borrowing costs low. They further explain that this encourages '
              'consumer spending and business investment, which aligns with '
              "the context provided. Therefore, the student's answer is "
              'factually accurate and addresses all parts of the question.\n'
              'GRADE: CORRECT',
 'score': 1,
 'value': 'CORRECT'}
CPU times: user 129 ms, sys: 11.2 ms, total: 141 ms
Wall time: 2.73 s
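
In the output above, `value` and `score` mirror the `GRADE:` line that closes the `reasoning` text. As an illustration only (a simplified sketch, not LangChain's actual parsing code; `parse_grade` is a hypothetical helper), the mapping can be pictured like this:

```python
import re


def parse_grade(reasoning: str) -> dict:
    """Extract a CORRECT/INCORRECT verdict from text ending in 'GRADE: <verdict>'."""
    match = re.search(r"GRADE:\s*(CORRECT|INCORRECT)\s*$", reasoning.strip(), re.IGNORECASE)
    if match is None:
        # No recognizable verdict: leave value and score undefined.
        return {"reasoning": reasoning, "value": None, "score": None}
    value = match.group(1).upper()
    return {"reasoning": reasoning, "value": value, "score": 1 if value == "CORRECT" else 0}


sample = "The student's answer is factually accurate.\nGRADE: CORRECT"
result = parse_grade(sample)
print(result["value"], result["score"])  # CORRECT 1
```

Keeping the free-text reasoning next to the binary score is what makes CoT evaluation easy to audit: when a verdict looks wrong, the explanation is there to inspect.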


## 👉 **Agent Trajectory Evaluation**  

* You're testing an agent that is expected to go through:  
  `["intent_classifier", "healthcare_agent", "MedicalDatabaseSearch", "format_response"]`

  The function `run_graph_with_trajectory` below is a stub that returns exactly this trajectory, so you should expect a perfect score of **1.0**.

* In real cases, your agent might instead do:  
  `["intent_classifier", "MedicalDatabaseSearch", "format_response"]`  

  `trajectory_subsequence` as written below scores this as **0.0**, not 0.75: the agent's trajectory is shorter than the reference, so the length check returns early. Even without that check, the in-order matching would stall at the missing `healthcare_agent` step and give 1/4 = **0.25**.
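
To check such cases locally before sending anything to LangSmith, the evaluator can be applied to hand-written trajectories. The sketch below repeats `trajectory_subsequence` (defined in the next cell) so that it runs on its own, and adds `lcs_fraction`, a hypothetical partial-credit variant based on the longest common subsequence. `lcs_fraction` is not part of LangSmith, and `WebSearch` is a made-up extra step used only for illustration.

```python
def trajectory_subsequence(outputs: dict, reference_outputs: dict) -> float:
    """Same logic as the evaluator defined in this notebook."""
    ref, out = reference_outputs["trajectory"], outputs["trajectory"]
    if len(ref) > len(out):
        return 0.0
    i = j = 0
    while i < len(ref) and j < len(out):
        if ref[i] == out[j]:
            i += 1
        j += 1
    return i / len(ref)


def lcs_fraction(outputs: dict, reference_outputs: dict) -> float:
    """Hypothetical variant: LCS length / reference length (assumes a non-empty reference)."""
    ref, out = reference_outputs["trajectory"], outputs["trajectory"]
    # Classic dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(out) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, o in enumerate(out, 1):
            if r == o:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1] / len(ref)


reference = {"trajectory": ["intent_classifier", "healthcare_agent",
                            "MedicalDatabaseSearch", "format_response"]}
full = {"trajectory": list(reference["trajectory"])}
skipped = {"trajectory": ["intent_classifier", "MedicalDatabaseSearch", "format_response"]}
extra = {"trajectory": ["intent_classifier", "healthcare_agent", "WebSearch",
                        "MedicalDatabaseSearch", "format_response"]}

for name, run in [("full", full), ("skipped", skipped), ("extra", extra)]:
    print(name, trajectory_subsequence(run, reference), lcs_fraction(run, reference))
```

This prints `full 1.0 1.0`, `skipped 0.0 0.75`, and `extra 1.0 1.0`. With the `healthcare_agent` step missing, the notebook's evaluator returns 0.0 because of its length check, while the LCS-based variant gives partial credit of 0.75. Both accept extra steps as long as the reference steps appear in order.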

In [3]:
from langsmith import Client

def trajectory_subsequence(outputs: dict, reference_outputs: dict) -> float:
    """Return the fraction of reference steps matched, in order, in the agent's trajectory."""
    # A trajectory shorter than the reference cannot contain every step: score 0.
    if len(reference_outputs['trajectory']) > len(outputs['trajectory']):
        return 0.0
    # Two-pointer scan: i walks the reference, j walks the agent's trajectory.
    i, j = 0, 0
    while i < len(reference_outputs['trajectory']) and j < len(outputs['trajectory']):
        if reference_outputs['trajectory'][i] == outputs['trajectory'][j]:
            i += 1
        j += 1
    return i / len(reference_outputs['trajectory'])

# Create example dataset with expected trajectories
client = Client(
    api_key=userdata.get("LANGSMITH_API_KEY"), ## ✅
)
trajectory_dataset = client.create_dataset(
    "Healthcare Agent Trajectory Evaluation",
    description="Evaluates agent trajectory for medication queries",
)
# Add example with expected trajectory
client.create_example(
    inputs={
        "question": "What is the recommended dosage of ibuprofen for an adult?"
    },
    outputs={
        "trajectory": [
            "intent_classifier",
            "healthcare_agent",
            "MedicalDatabaseSearch",
            "format_response"
        ],
        "response": "Typically, 200-400 mg every 4-6 hours, not exceeding 3200 mg per day."
    },
    dataset_id=trajectory_dataset.id
)

<class 'langsmith.schemas.Example'>(id=31ec1825-a96c-4ebe-923c-7b5a4c7560ce, dataset_id=e8f944f9-58ff-4fe1-88f6-fee256b89a73, link='https://smith.langchain.com/o/bb2fe87a-6912-4ab2-9937-88deb93a5479/datasets/e8f944f9-58ff-4fe1-88f6-fee256b89a73/e/31ec1825-a96c-4ebe-923c-7b5a4c7560ce')

### **Run evaluation with our custom trajectory evaluator**

In [4]:
# Function to run graph with trajectory tracking (example implementation)
async def run_graph_with_trajectory(inputs: dict) -> dict:
    """Run graph and track the trajectory it takes along with the final response."""
    trajectory, final_response = [], ""
    # Here you would implement your actual graph execution
    # For the example, we'll just return a sample result
    trajectory = ["intent_classifier", "healthcare_agent", "MedicalDatabaseSearch", "format_response"]
    final_response = "Typically, 200-400mg every 4-6 hours, not exceeding 3200mg per day."
    return {
        "trajectory": trajectory,
        "response": final_response,
    }

# Note: `aevaluate` is a coroutine. Notebooks allow top-level `await`; in a script, wrap the call in `asyncio.run(...)`.
experiment_results = await client.aevaluate(
    run_graph_with_trajectory,
    data=trajectory_dataset.id,
    evaluators=[trajectory_subsequence],
    experiment_prefix="healthcare-agent-trajectory",
    num_repetitions=1,
    max_concurrency=4,
)
## To summarize the results (feedback columns are prefixed with "feedback."):
# results_df = experiment_results.to_pandas()
# print(f"Average trajectory match score: {results_df['feedback.trajectory_subsequence'].mean()}")

View the evaluation results for experiment: 'healthcare-agent-trajectory-0540aa6a' at:
https://smith.langchain.com/o/bb2fe87a-6912-4ab2-9937-88deb93a5479/datasets/e8f944f9-58ff-4fe1-88f6-fee256b89a73/compare?selectedSessions=e3b00894-e4ab-462b-aa4e-1e35160b00f0




0it [00:00, ?it/s]
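
The `run_graph_with_trajectory` stub above returns a hard-coded trajectory. How a real trajectory is collected depends on your agent framework. As a framework-agnostic sketch (plain Python, not LangGraph; the `tracked` decorator, the step functions, and their placeholder return values are all hypothetical), each step can append its name to a shared list as it runs:

```python
import asyncio
from typing import Callable


def tracked(name: str, trajectory: list) -> Callable:
    """Wrap a step so that its name is recorded each time it is called."""
    def decorator(func: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            trajectory.append(name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


async def run_pipeline_with_trajectory(inputs: dict) -> dict:
    """Toy stand-in for an agent graph: run four steps and record the path taken."""
    trajectory: list = []

    @tracked("intent_classifier", trajectory)
    def classify(question: str) -> str:
        return "healthcare"  # placeholder: a real classifier would inspect the question

    @tracked("healthcare_agent", trajectory)
    def plan(intent: str) -> str:
        return "MedicalDatabaseSearch"  # placeholder: the tool the agent decides to call

    @tracked("MedicalDatabaseSearch", trajectory)
    def search(question: str) -> str:
        return "placeholder search result"

    @tracked("format_response", trajectory)
    def format_response(raw: str) -> str:
        return f"Answer: {raw}"

    intent = classify(inputs["question"])
    plan(intent)
    response = format_response(search(inputs["question"]))
    return {"trajectory": trajectory, "response": response}


# In a notebook, use `await run_pipeline_with_trajectory(...)` instead of asyncio.run.
result = asyncio.run(run_pipeline_with_trajectory(
    {"question": "What is the recommended dosage of ibuprofen for an adult?"}
))
print(result["trajectory"])
```

Because the returned dict has the same `trajectory` and `response` keys as the stub, a function of this shape could be passed to `client.aevaluate` in its place.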

In [5]:
results_df = experiment_results.to_pandas()
print(f"Average trajectory match score: {results_df['feedback.trajectory_subsequence'].mean()}")

Average trajectory match score: 1.0
