# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2? Well, sort of. Pick diffierent sets of data and re-run this notebook. Maybe parts of the dataset you used in your last project week. The point is for you to understand all steps involve and the many different ways one can and should evaluate LLM applications using LangSmith.

What did you learn? - Let's discuss that in class

## LangSmith - LangChain evaluation

In [2]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [3]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://eu.api.smith.langchain.com"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [4]:
#Importing Client from Langsmith
from langsmith import Client
import os

client = Client(
    api_url=os.environ["LANGSMITH_ENDPOINT"],
    api_key=os.getenv("LANGSMITH_API_KEY") or os.getenv("LANGCHAIN_API_KEY")
)

print("Endpoint now:", os.environ["LANGSMITH_ENDPOINT"])

Endpoint now: https://eu.api.smith.langchain.com


### Create Dataset


In [5]:
from datasets import load_dataset

# Use the Parquet mirror repo (not CCDV)
cnn_dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")
print(cnn_dataset)


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})


In [6]:
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['article']}"
    }

cnn_dataset = cnn_dataset.map(add_prefix)

In [7]:
cnn_dataset['train'][0]

{'article': 'Summarize this news:\nLONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on 

In [8]:
#Get just a few news to test
MAX_NEWS=10
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS)).map(add_prefix)

print(sample_cnn[0])

{'article': 'Summarize this news:\nSummarize this news:\n(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign M

The dataset contains three columns: article, highlights, and id. To use LangSmith, we need to create a dataset in LangSmith format.

LangSmith expects a prompt and a result. To achieve this, we will transform the article into a prompt by adding the prefix: "Summarize this news." As a result, we will use the content of highlights, which represents the summaries created by humans.

In [9]:
print(sample_cnn[0])

{'article': 'Summarize this news:\nSummarize this news:\n(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign M

Now We have the Dataset with the prompt and the Reference Summary, it is time to create a Dataset in LangSmith with this information.
### Create the Dataset in Langsmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.

In [10]:
import datetime

In [11]:
import uuid
input_key=['article']
output_key=['highlights']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

client = Client(
    api_url=os.environ["LANGSMITH_ENDPOINT"],
    api_key=os.getenv("LANGSMITH_API_KEY") or os.getenv("LANGCHAIN_API_KEY")
)

In [12]:
import os
print("ENV endpoint:", os.getenv("LANGSMITH_ENDPOINT"))
os.environ.pop("LANGSMITH_TENANT_ID", None)
os.environ.pop("LANGCHAIN_TENANT_ID", None)


ENV endpoint: https://eu.api.smith.langchain.com


In [13]:
#This creates the dataset in LangSmith with the content in sample_cnn - If you run this more than once you will get POST errors
dataset = client.upload_dataframe(
    df=sample_cnn.to_pandas(),
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

In this image, we can see an example from the dataset once it's been registered in LangSmith.

In the Input column, there is the prompt to be sent, while in the Output column, the expected output is stored.

When performing the comparison, the model will be given the prompt, and the Cosine distance between its response and the one stored in the sample dataset will be calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">

### Recovering Models From Hugging Face
Let's retrieve both models from HuggingFace. A base T5 model and a model that has been fine-tuned using the training portion of this same dataset to generate summaries.

In [14]:
from langchain_community.llms import HuggingFaceEndpoint

In [15]:
summarizer_base = HuggingFaceEndpoint(
    repo_id="t5-base",
    temperature=0,
    max_new_tokens=180,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)

  summarizer_base = HuggingFaceEndpoint(


In [16]:
summarizer_finetuned = HuggingFaceEndpoint(
    repo_id="flax-community/t5-base-cnn-dm",
    temperature=0,
    max_new_tokens=180,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the variables we want to evaluate. In our case, I have chosen to measure only the "embedding_distance."

I've left the "string_distance" as a comment in case you want to conduct a test with two evaluations instead of one.


In [None]:
# Ensure rapidfuzz is installed
%pip install -q rapidfuzz==3.6.1

In [17]:
#We are using just one of the multiple evaluator avaiable on LangSmith.
from langchain.smith import run_on_dataset, RunEvalConfig
evaluation_config = RunEvalConfig(
    evaluators=[
        {
            "evaluator_type": "embedding_distance",
            "prediction_key": "output",
            "input_key": "article",        # optional, but nice to log
            "prediction_key": "highlights",
            "reference_key": "highlights",
            "distance_metric": "cosine",
            # uses OpenAI by default; ensure OPENAI_API_KEY is set
            "embedding_model": "openai:text-embedding-3-small",
        }
    ],
)

### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset. One for each of the chosen models.

In [18]:
import os
from langchain_community.llms import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

# T5 likes the "summarize:" prefix
prompt = PromptTemplate.from_template("summarize: {article}")

def make_chain(repo_id: str):
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=0,
        max_new_tokens=180,
        huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
    )
    # Convert LLM string output -> {"highlights": "..."} for the evaluator
    to_kv = RunnableLambda(lambda s: {"highlights": s.strip()})
    return prompt | llm | StrOutputParser() | to_kv

# Base and finetuned
summarizer_base_chain = make_chain("t5-base")
summarizer_finetuned_chain = make_chain("flax-community/t5-base-cnn-dm")

In [None]:
#project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

#base_t5_results = run_on_dataset(
#    client=client,
#    project_name=project_name,
#    dataset_name=NAME_DATASET,
#    llm_or_chain_factory=lambda: summarizer_base_chain,   # <-- updated
#    evaluation=evaluation_config,
#)

View the evaluation results for project 'T5-BASE 2025-10-25 21:53:35' at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61/compare?selectedSessions=16dca68c-2546-4ecf-ba66-8ec8bb49012b

View all tests for Dataset Summarize_dataset_2025-10-25 21:53:08 at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61


Chain failed for example 71e05189-9011-448b-bcd5-a5db8a1eff4d with inputs {'article': 'Summarize this news:\nSummarize this news:\n(CNN)For the first time in eight years, a TV legend returned to doing what he does best. Contestants told to "come on down!" on the April 1 edition of "The Price Is Right" encountered not host Drew Carey but another familiar face in charge of the proceedings. Instead, there was Bob Barker, who hosted the TV game show for 35 years before stepping down in 2007. Looking spry at 91, Barker handled the first price-guessing game of the show, the classic "Lucky Seven," before turning hosting duties over to Carey, who finished up. Despite being away from the show for most of the past eight years, Barker didn\'t seem to miss a beat.'}
Error Type: AttributeError, Message: 'InferenceClient' object has no attribute 'post'
Chain failed for example 0296f48f-87cd-4ba7-aa98-325a7d69c084 with inputs {'article': 'Summarize this news:\nSummarize this news:\n(CNN)The Palestini

[------------------------------------------------->] 10/10


In [None]:
#project_name = f"T5-FINETUNED {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
#finetuned_results = run_on_dataset(
#    client=client,
#    project_name=project_name,
#    dataset_name=NAME_DATASET,
#    llm_or_chain_factory=lambda: summarizer_finetuned_chain,
#    evaluation=evaluation_config,
#)

View the evaluation results for project 'T5-FINETUNED 2025-10-25 21:53:43' at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61/compare?selectedSessions=d9c4b1b5-ade4-40f2-b489-db4cb8aebdca

View all tests for Dataset Summarize_dataset_2025-10-25 21:53:08 at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61


Chain failed for example 0296f48f-87cd-4ba7-aa98-325a7d69c084 with inputs {'article': 'Summarize this news:\nSummarize this news:\n(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, oppo

[------------------------------------------------->] 10/10


<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparision between two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [21]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [22]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2025-10-25 21:55:28' at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61/compare?selectedSessions=a602fdbe-aa12-4dd2-969b-3dd598f8cb48

View all tests for Dataset Summarize_dataset_2025-10-25 21:53:08 at:
https://eu.smith.langchain.com/o/ec30a7f8-f351-487e-8b8c-12cd87c174f3/datasets/7b23e299-6446-4b97-8298-a9e048822b61
[------------------------------------------------->] 10/10


<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model has yielded the best results. But, be aware! As we can see, there is a cost involved since we are using an API, and it needs to be paid for.

Another crucial piece of information is that we can view performance data for the models. This data could also be useful for minimally evaluating our inference server.