<a href="https://colab.research.google.com/github/nov05/Google-Colaboratory/blob/master/generative_ai_with_langchain/08_03_langsmith_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Notebook modified by nov05 on 2025-06-13  

In [None]:
%%capture
%pip install -U langchain langchain-community langchain-openai langsmith langchain_core datasets evaluate
## Successfully installed datasets-3.6.0 fsspec-2025.3.0 langchain-openai-0.3.23 langchain_core-0.3.65 langsmith-0.3.45
## Successfully installed dataclasses-json-0.6.7 httpx-sse-0.4.0 langchain-community-0.3.25 marshmallow-3.26.1
## mypy-extensions-1.1.0 pydantic-settings-2.9.1 python-dotenv-1.1.0 typing-inspect-0.9.0
## Successfully installed evaluate-0.4.3

**Make sure you load the API keys for cloud providers!**

You can set the environment variables yourself or load them with a script. Because API keys are private, they are not included in the repository.

In [None]:
# # setting the environment variables, the keys
# import sys
# import os
# sys.path.insert(0, os.path.abspath('..'))
# from config import set_environment
# # for the keys - as explained early in chapter 2
# set_environment()

In [3]:
# Example configuration for LangSmith:
import os
from google.colab import userdata
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = userdata.get("LANGSMITH_API_KEY")
os.environ["LANGSMITH_PROJECT"] = "generative_ai_with_langchain"
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
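
Outside Colab, `google.colab.userdata` is not available, so the keys have to come from the environment some other way. The following is a minimal sketch of a pre-flight check, not part of the original notebook: the helper name `missing_keys` and the `REQUIRED_KEYS` tuple are illustrative assumptions, and the variable names are the ones used in the cell above.

```python
import os

# Variable names taken from the configuration cell; extend the tuple as needed.
REQUIRED_KEYS = ("LANGSMITH_API_KEY", "OPENAI_API_KEY")

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

absent = missing_keys()
if absent:
    print("Set these environment variables before continuing:", ", ".join(absent))
```

Running the check before any traced call makes a missing key show up as a readable message instead of an authentication error later on.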

# 🟢 **LangSmith Evaluation**     

In [4]:
from langchain_openai import ChatOpenAI
# Create a simple LLM call that will be traced in LangSmith
llm = ChatOpenAI()
response = llm.invoke("Hello, world!")
print(f"Model response: {response.content}\n")
print("This run has been logged to LangSmith.")
print("You can view it in the LangSmith UI: https://smith.langchain.com")

Model response: Hello! How can I assist you today?

This run has been logged to LangSmith.
You can view it in the LangSmith UI: https://smith.langchain.com


## 👉 **Creating an evaluation dataset**

In [18]:
from langsmith import Client
client = Client()
# Sample financial examples
financial_examples = [
    {
        "inputs": {
            "question": "What are the tax implications of early 401(k) withdrawal?",
            "context_needed": ["retirement", "taxation", "penalties"]
        },
        "outputs": {
            "answer": ("Early withdrawals from a 401(k) typically incur "
                "a 10% penalty if you're under 59½ years old, in addition to "
                "regular income taxes. However, certain hardship withdrawals "
                "may qualify for penalty exemptions."),
            "key_points": ["10% penalty", "income tax", "hardship exemptions"],
            "documents": ["IRS publication 575", "Retirement plan guidelines"]
        }
    },
    {
        "inputs": {
            "question": "How does dollar-cost averaging compare to lump-sum investing?",
            "context_needed": ["investment strategy", "risk management", "market timing"]
        },
        "outputs": {
            "answer": ("Dollar-cost averaging spreads investments over "
                "time to reduce timing risk, while lump-sum investing "
                "typically outperforms in rising markets due to longer "
                "market exposure. DCA may provide psychological benefits "
                "through reduced volatility exposure."),
            "key_points": ["timing risk", "market exposure", "psychological benefits"],
            "documents": ["Investment strategy comparisons", "Market timing research"]
        }
    }
]
# Create dataset in LangSmith
dataset_name = "Financial Advisory RAG Evaluation"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description=(
        "Evaluation dataset for financial advisory RAG systems "
        "covering retirement, investments, and tax planning"
    ),
)
# Add examples to the dataset
for example in financial_examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id
    )
print(f"Created evaluation dataset with {len(financial_examples)} examples")

Created evaluation dataset with 2 examples
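
Re-running the dataset cell will usually fail because a dataset with that name already exists in LangSmith. One way around this, sketched below, is to look the dataset up first and create it only when it is absent. The helper `get_or_create_dataset` is a hypothetical name; it assumes the client exposes `has_dataset`, `read_dataset`, and `create_dataset` with a `dataset_name` keyword, which should be checked against the installed `langsmith` version.

```python
def get_or_create_dataset(client, name, description=""):
    """Return the dataset called `name`, creating it only if it does not exist.

    `client` is assumed to behave like `langsmith.Client`.
    """
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name)
    return client.create_dataset(dataset_name=name, description=description)
```

With this helper, `dataset = get_or_create_dataset(client, dataset_name)` could stand in for the direct `create_dataset` call. Note that the examples would still be added again on each run unless that step is guarded as well.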


In [None]:
# # Define evaluation configuration
# from langchain.smith import RunEvalConfig
# # Define evaluation criteria specific to RAG systems
# evaluation_config = RunEvalConfig(
#     evaluators=[
#         # Correctness: Compare response to reference answer
#         RunEvalConfig.LLM(
#             criteria={
#                 "factual_accuracy": ("Does the response contain only "
#                     "factually correct information consistent with the reference answer?")
#             }
#         ),
#         # Groundedness: Ensure response is supported by retrieved context
#         RunEvalConfig.LLM(
#             criteria={
#                 "groundedness": ("Is the response fully supported by the "
#                     "retrieved documents without introducing unsupported information?")
#             }
#         ),
#         # Retrieval quality: Assess relevance of retrieved documents
#         RunEvalConfig.LLM(
#             criteria={
#                 "retrieval_relevance": ("Are the retrieved documents "
#                     "relevant to answering the question?")
#             }
#         )
#     ]
# )

In [19]:
# Define evaluation configuration
from langchain.smith import RunEvalConfig
# Define evaluation criteria specific to RAG systems
evaluation_config = RunEvalConfig(
    evaluators=[
        # Correctness: Compare response to reference answer
        RunEvalConfig.Criteria({
            "factual_accuracy": ("Does the response contain only "
                "factually correct information consistent with the reference answer?")
        }),
        # Groundedness: Ensure response is supported by retrieved context
        RunEvalConfig.Criteria({
            "groundedness": ("Is the response fully supported by the "
                "retrieved documents without introducing unsupported information?")
        }),
        # Retrieval quality: Assess relevance of retrieved documents
        RunEvalConfig.Criteria({
            "retrieval_relevance": ("Are the retrieved documents "
                "relevant to answering the question?")
        })
    ]
)
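
The criteria above are judged by an LLM. The dataset also stores `key_points` for each reference answer, which allows a cheap deterministic check alongside them. The function below is a minimal sketch of such a check, not something the notebook defines: `key_point_coverage` is a hypothetical name, and it only does case-insensitive substring matching.

```python
def key_point_coverage(answer: str, key_points: list[str]) -> float:
    """Fraction of key points that appear verbatim (case-insensitive) in the answer."""
    if not key_points:
        return 0.0
    text = answer.lower()
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)

answer = "Early withdrawals incur a 10% penalty plus regular income tax."
key_points = ["10% penalty", "income tax", "hardship exemptions"]
score = key_point_coverage(answer, key_points)
print(f"Key point coverage: {score:.2f}")
```

Verbatim matching is strict: a paraphrase such as "hardship withdrawals may qualify for penalty exemptions" does not count as containing "hardship exemptions", so a score like this is best read as a lower bound next to the LLM-based criteria.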

### **Function to construct your RAG chain (placeholder)**

In [20]:
def construct_chain(*args, **kwargs):
    # This would be your actual RAG implementation
    # For example: return RAGChain(...)
    pass
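
Until a real RAG chain exists, a stand-in with the right input and output shape can help exercise the evaluation plumbing. The sketch below is purely illustrative: `stub_rag_answer` is a hypothetical function that performs no retrieval or generation, and whether `run_on_dataset` accepts a plain function like this directly should be confirmed against the documentation for your installed version.

```python
def stub_rag_answer(inputs: dict) -> dict:
    """Hypothetical stand-in: echoes the question instead of retrieving and generating."""
    question = inputs["question"]
    return {
        "answer": f"[stub] No retrieval performed for: {question}",
        "documents": [],  # a real chain would list the retrieved sources here
    }

print(stub_rag_answer({"question": "What is dollar-cost averaging?"}))
```

The input key `question` mirrors the dataset examples created earlier; scores obtained with a stub like this are meaningless and only confirm that runs are logged.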

### **Run evaluation on dataset**

In [21]:
from langchain.smith import run_on_dataset
results = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=construct_chain,
    evaluation=evaluation_config,
)

View the evaluation results for project 'cooked-flower-41' at:
https://smith.langchain.com/datasets/0655c0f9-46d4-4c2d-b280-26b00a1ae141/compare?selectedSessions=39002f5b-ac71-42e0-8d61-b2a8e750f251

View all tests for Dataset Financial Advisory RAG Evaluation at:
https://smith.langchain.com/datasets/0655c0f9-46d4-4c2d-b280-26b00a1ae141
[------------------------------------------------->] 2/2

## 👉 **Insurance Claim Extraction Evaluation Example**

* Warnings  

  * PydanticDeprecatedSince20: The `schema_json` method is deprecated; use `model_json_schema` and json.dumps instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/

  * LangChainDeprecationWarning: The class `ChatOpenAI` was deprecated in LangChain 0.0.10 and will be removed in 1.0. An updated version of the class exists in the `langchain-openai` package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
  
  * PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/

  * LangChainDeprecationWarning: The method `BaseChatOpenAI.bind_functions` was deprecated in langchain-openai 0.2.1 and will be removed in 1.0.0. Use `langchain_openai.chat_models.base.ChatOpenAI.bind_tools` instead.  

In [33]:
from langsmith import Client
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
# from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
import json
from pprint import pprint

# Define a list of synthetic insurance claim examples
example_inputs = [
    (
        "I was involved in a car accident on 2023-08-15. "
        "My name is Jane Smith, Claim ID INS78910, "
        "Policy Number POL12345, and the damage is estimated at $3500.",
        {
            "claimant_name": "Jane Smith",
            "claim_id": "INS78910",
            "policy_number": "POL12345",
            "claim_amount": "$3500",
            "accident_date": "2023-08-15",
            "accident_description": "Car accident causing damage",
            "status": "pending"
        },
    ),
    (
        "My motorcycle was hit in a minor collision on 2023-07-20. "
        "I am John Doe, with Claim ID INS112233 "
        "and Policy Number POL99887. The estimated damage is $1500.",
        {
            "claimant_name": "John Doe",
            "claim_id": "INS112233",
            "policy_number": "POL99887",
            "claim_amount": "$1500",
            "accident_date": "2023-07-20",
            "accident_description": "Minor motorcycle collision",
            "status": "pending"
        },
    ),
]
client = Client()
dataset_name = "Insurance Claims"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic dataset for insurance claim extraction tasks",
)
# Store examples in the dataset
for input_text, expected_output in example_inputs:
    client.create_example(
        inputs={"input": input_text},
        outputs={"output": expected_output},
        metadata={"source": "Synthetic"},
        dataset_id=dataset.id,
    )

In [74]:
%%time
# Define the extraction schema
class InsuranceClaim(BaseModel):
    claimant_name: str = Field(..., description="The name of the claimant")
    claim_id: str = Field(..., description="The unique insurance claim identifier")
    policy_number: str = Field(..., description="The policy number associated with the claim")
    claim_amount: str = Field(..., description="The claimed amount (e.g., '$5000')")
    accident_date: str = Field(..., description="The date of the accident (YYYY-MM-DD)")
    accident_description: str = Field(..., description="A brief description of the accident")
    status: str = Field("pending", description="The current status of the claim")

## Create extraction chain
instructions = (
    "Extract the following structured information from the insurance claim text: "
    "claimant_name, claim_id, policy_number, claim_amount, accident_date, "
    "accident_description, and status. Return the result as a JSON object following "
    "this schema: " +
    # InsuranceClaim.schema_json()
    json.dumps(InsuranceClaim.model_json_schema()).replace("{", "{{").replace("}", "}}")
)
prompt = ChatPromptTemplate.from_messages([
    ("system", instructions),
    ("user", "{input}")
])
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0
).bind_functions(
    functions=[
        # InsuranceClaim.schema(),
        InsuranceClaim.model_json_schema(),
    ],
    function_call="InsuranceClaim",
)
output_parser = JsonOutputFunctionsParser()
# extraction_chain = instructions | llm | output_parser | (lambda x: {"output": x})
extraction_chain = prompt | llm | output_parser | (lambda x: {"output": x})

# Test the extraction chain
sample_claim_text = (
    "I was involved in a car accident on 2023-08-15. My name is Jane Smith, "
    "Claim ID INS78910, Policy Number POL12345, and the damage is estimated at $3500. "
    "Please process my claim."
)
result = extraction_chain.invoke({"input": sample_claim_text})

CPU times: user 103 ms, sys: 11.9 ms, total: 115 ms
Wall time: 1.92 s
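
The `.replace("{", "{{").replace("}", "}}")` step in the cell above is needed because the prompt template treats single braces as variable placeholders, and a JSON schema is full of braces. The sketch below shows the same idea with plain `str.format` as a stand-in for the prompt template. The helper name `escape_braces` and the tiny schema are illustrative assumptions.

```python
import json

def escape_braces(text: str) -> str:
    """Double every brace so format-style templating treats it as a literal."""
    return text.replace("{", "{{").replace("}", "}}")

# A small stand-in schema; the notebook uses InsuranceClaim.model_json_schema().
schema_text = json.dumps({"type": "object", "properties": {"claim_id": {"type": "string"}}})
template = "Schema: " + escape_braces(schema_text) + " Claim: {input}"

# Only {input} is substituted; the doubled braces collapse back to single ones.
rendered = template.format(input="Claim ID INS78910")
print(rendered)
```

Without the escaping, formatting would fail or try to look up JSON keys as template variables.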


In [75]:
print("Extraction Result:")
pprint(result)
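## Note: `status` (default "pending") is missing from the result. The model did not
## return it, and JsonOutputFunctionsParser only parses JSON; it does not apply Pydantic defaults.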

Extraction Result:
{'output': {'accident_date': '2023-08-15',
            'accident_description': 'I was involved in a car accident',
            'claim_amount': '$3500',
            'claim_id': 'INS78910',
            'claimant_name': 'Jane Smith',
            'policy_number': 'POL12345'}}
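
To see how the extraction compares with the expected output stored in the dataset, a field-by-field check is often enough. The sketch below is one possible way to do it; `compare_fields` is a hypothetical helper, and the two dictionaries are copied from the first dataset example and the extraction result printed above.

```python
def compare_fields(expected: dict, extracted: dict) -> dict:
    """Classify each expected field as matching, mismatched, or missing."""
    report = {"match": [], "mismatch": [], "missing": []}
    for field, value in expected.items():
        if field not in extracted:
            report["missing"].append(field)
        elif extracted[field] == value:
            report["match"].append(field)
        else:
            report["mismatch"].append(field)
    return report

expected = {
    "claimant_name": "Jane Smith",
    "claim_id": "INS78910",
    "policy_number": "POL12345",
    "claim_amount": "$3500",
    "accident_date": "2023-08-15",
    "accident_description": "Car accident causing damage",
    "status": "pending",
}
extracted = {
    "accident_date": "2023-08-15",
    "accident_description": "I was involved in a car accident",
    "claim_amount": "$3500",
    "claim_id": "INS78910",
    "claimant_name": "Jane Smith",
    "policy_number": "POL12345",
}
report = compare_fields(expected, extracted)
print(report)
```

Exact string comparison flags free-text fields such as `accident_description` as mismatches even when the meaning is close, so those fields are better judged by an LLM-based or similarity-based evaluator.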


In [29]:
## Inspect the JSON schema generated from the InsuranceClaim model
pprint(InsuranceClaim.model_json_schema())

{'properties': {'accident_date': {'description': 'The date of the accident '
                                                 '(YYYY-MM-DD)',
                                  'title': 'Accident Date',
                                  'type': 'string'},
                'accident_description': {'description': 'A brief description '
                                                        'of the accident',
                                         'title': 'Accident Description',
                                         'type': 'string'},
                'claim_amount': {'description': 'The claimed amount (e.g., '
                                                "'$5000')",
                                 'title': 'Claim Amount',
                                 'type': 'string'},
                'claim_id': {'description': 'The unique insurance claim '
                                            'identifier',
                             'title': 'Claim Id',
                            

* Tracing and evaluating with `LangSmith`  

  <img src="https://raw.githubusercontent.com/nov05/pictures/refs/heads/master/generative_ai_with_langchain/2025-06-14%2002_56_11-08_03_langsmith_eval_insurance_claim_text_extraction_.jpg" width=800>  

## 👉 **HuggingFace evaluation for code generation**  

In [76]:
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

In [79]:
%%time
from datasets import load_dataset
from evaluate import load
from langchain_core.messages import HumanMessage

human_eval = load_dataset("openai_humaneval", split="test")
code_eval_metric = load("code_eval")

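# One reference test per problem; here a single problem with one assertion
test_cases = ["assert add(2,3)==5"]
# Two candidate solutions for that problem: the first is wrong, the second is correct
candidates = [[
    "def add(a,b): return a*b",
    "def add(a, b): return a+b"
]]

# pass@k: estimated probability that at least one of k sampled candidates passes
pass_at_k, results = code_eval_metric.compute(
    references=test_cases,
    predictions=candidates,
    k=[1, 2])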
print(pass_at_k)

README.md:   0%|          | 0.00/6.52k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/164 [00:00<?, ? examples/s]

Downloading builder script:   0%|          | 0.00/9.18k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/6.10k [00:00<?, ?B/s]

{'pass@1': np.float64(0.5), 'pass@2': np.float64(1.0)}
CPU times: user 18.3 s, sys: 4.19 s, total: 22.5 s
Wall time: 39 s
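
The reported numbers can be reproduced by hand. The sketch below uses the unbiased pass@k estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of candidates and c the number that pass. To my understanding this is what the `code_eval` metric computes, but treat that as an assumption. The function name `pass_at_k_estimate` is hypothetical.

```python
from math import comb

def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples of which c pass: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Two candidates, one of which passes, as in the cell above.
print(pass_at_k_estimate(n=2, c=1, k=1))  # 0.5
print(pass_at_k_estimate(n=2, c=1, k=2))  # 1.0
```

With one correct candidate out of two, a single draw succeeds half the time, while drawing both always includes the correct one, which matches the `pass@1` and `pass@2` values printed above.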
