# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2? Well, sort of. Pick different sets of data and re-run this notebook, for example parts of the dataset you used in your last project week. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications using LangSmith.

What did you learn? - Let's discuss that in class

## LangSmith - LangChain evaluation

In [19]:
!pip install python-dotenv




In [20]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [21]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [22]:
# Import the Client from LangSmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

### Create Dataset


In [23]:
!pip install datasets




In [6]:
from datasets import load_dataset
cnn_dataset = load_dataset(
    "ccdv/cnn_dailymail",
    version="3.0.0",
    trust_remote_code=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [24]:
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['article']}"
    }

#cnn_dataset = cnn_dataset.map(add_prefix)

In [25]:
cnn_dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [26]:
cnn_dataset['train'][0]

{'article': 'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It\'s a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but bec

In [27]:
# Take just a few news articles to test
MAX_NEWS=10
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS)).map(add_prefix)

sample_cnn

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 10
})

The dataset contains three columns: article, highlights, and id. To use it with LangSmith, we need to create a dataset in LangSmith's format.

LangSmith expects a prompt and an expected result. We turn each article into a prompt by adding the prefix "Summarize this news:". As the expected result, we use the content of highlights, which holds the summaries written by humans.
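As a minimal sketch of this transformation, assuming each record is a plain dict with the article and highlights fields shown above. The helper name `to_prompt_and_reference` and the sample record are made up for illustration; they are not part of LangSmith or the datasets library.

```python
PREFIX = "Summarize this news:\n"


def to_prompt_and_reference(record, prefix=PREFIX):
    """Split one record into (inputs, outputs) dicts: the prompt and the reference summary."""
    inputs = {"article": prefix + record["article"]}   # what the model will receive
    outputs = {"highlights": record["highlights"]}     # what we expect it to return
    return inputs, outputs


# Hypothetical record, shaped like one row of the CNN/DailyMail dataset
record = {
    "article": "(CNN) Example article text.",
    "highlights": "Example human-written summary.",
    "id": "hypothetical-id",
}

inputs, outputs = to_prompt_and_reference(record)
print(inputs)
print(outputs)
```

The id column is left out on purpose, since only the prompt and the reference summary take part in the evaluation.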

In [28]:
print(sample_cnn[0])

{'article': 'Summarize this news:\n(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle

Now that we have the dataset with the prompt and the reference summary, it is time to create a dataset in LangSmith with this information.

### Create the Dataset in LangSmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.

In [29]:
import datetime

In [30]:
import uuid
input_key=['article']
output_key=['highlights']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

In [31]:
pip install numpy==1.24.3




In [32]:
# Create the dataset in LangSmith with the content of sample_cnn.
# If you run this more than once with the same name, you will get POST errors.
dataset = client.upload_dataframe(
    df=sample_cnn,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]
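The comment in the cell above warns about POST errors when the upload is repeated. One possible guard, sketched here under the assumption that the error comes from the dataset name already being taken, is to look the name up first and only create the dataset when it is missing. The helper `get_or_create_dataset` and the `InMemoryClient` stand-in are hypothetical; the helper relies on the `list_datasets` and `create_dataset` methods of the LangSmith Client, and it does not upload any examples itself.

```python
def get_or_create_dataset(client, name, description=""):
    """Return the dataset called `name`, creating it only if it does not exist yet."""
    # list_datasets yields the datasets matching the given name; take the first, if any
    existing = next(iter(client.list_datasets(dataset_name=name)), None)
    if existing is not None:
        return existing
    return client.create_dataset(dataset_name=name, description=description)


class InMemoryClient:
    """Hypothetical stand-in for langsmith.Client so the sketch runs offline."""

    def __init__(self):
        self.datasets = {}
        self.create_calls = 0

    def list_datasets(self, dataset_name=None):
        found = [self.datasets[dataset_name]] if dataset_name in self.datasets else []
        return iter(found)

    def create_dataset(self, dataset_name, description=""):
        self.create_calls += 1
        self.datasets[dataset_name] = {"name": dataset_name, "description": description}
        return self.datasets[dataset_name]


fake_client = InMemoryClient()
first = get_or_create_dataset(fake_client, "Summarize_dataset_demo")
second = get_or_create_dataset(fake_client, "Summarize_dataset_demo")
print(first is second, fake_client.create_calls)
```

With the real client you would pass `client` and `NAME_DATASET` instead of the stand-in, and then add the examples in a separate step.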

The image below shows an example from the dataset once it has been registered in LangSmith.

The Input column holds the prompt to be sent, and the Output column stores the expected output.

During the comparison, the model is given the prompt, and the cosine distance between its response and the reference stored in the dataset is calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">
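To make the cosine distance concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up numbers standing in for embeddings; real embeddings come from an embedding model and have many more dimensions. The function name `cosine_distance` is my own, not a LangSmith API.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, not real model output)
reference = [0.2, 0.7, 0.1]
close_summary = [0.25, 0.65, 0.1]
unrelated_summary = [0.9, 0.05, 0.05]

print(cosine_distance(reference, close_summary))
print(cosine_distance(reference, unrelated_summary))
```

A smaller distance means the two vectors point in a more similar direction, so a summary whose embedding is closer to the reference gets a lower score.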

### Retrieving Models From Hugging Face
Let's retrieve two models from Hugging Face: a base T5 model and a model that has been fine-tuned on the training portion of this same dataset to generate summaries.

In [34]:
!pip install langchain-community


Collecting langchain-community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.51 (from langchain-community)
  Downloading langchain_core-0.3.51-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.23 (from langchain-community)
  Downloading langchain-0.3.23-py3-none-any.whl.metadata (7.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting numpy<3,>=1.26.2 (from langchain-community)
  Downloading numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)

In [35]:
from langchain_community.llms import HuggingFaceHub

In [36]:
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)


In [37]:
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the metrics we want to measure. In our case, I have chosen to measure only "embedding_distance".

I've left "string_distance" as a comment in case you want to run a test with two evaluators instead of one.
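To get a feel for how a string-based metric differs from an embedding-based one, here is a minimal sketch of a normalized edit (Levenshtein) distance. This is an illustration only: LangChain's "string_distance" evaluator relies on the rapidfuzz package and may use a different metric, and the function names below are my own.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_string_distance(a, b):
    """Edit distance scaled to the range 0..1 by the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(normalized_string_distance("He was 88.", "He died aged 88."))
```

A paraphrase like the one above gets a nonzero string distance even though the meaning is close, which is one reason an embedding distance is often a better fit for comparing summaries.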


In [38]:
from langchain.smith import run_on_dataset, RunEvalConfig
# !pip install -q rapidfuzz==3.6.1

In [39]:
# We are using just one of the multiple evaluators available in LangSmith.

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        #"string_distance"
    ],
)



### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset. One for each of the chosen models.

In [41]:
pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [42]:
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-BASE 2025-04-09 08:15:34' at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa/compare?selectedSessions=c5825fad-6f17-4ee4-a5db-dac29e9879db

View all tests for Dataset Summarize_dataset_2025-04-09 08:04:41 at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa
[>                                                 ] 0/10

Error Type: HfHubHTTPError, Message: (Request ID: Root=1-67f62ca8-00f0594821c2bef11927386e;7099e9e4-a66d-4130-a052-4960f1503f33)

403 Forbidden: This authentication method does not have sufficient permissions to call Inference Providers on behalf of user NeoKatGen.
Cannot access content at: https://router.huggingface.co/hf-inference/models/t5-base.
Make sure your token has the correct permissions.

[------------------------------------------------->] 10/10

In [43]:
# Ignore the errors shown below
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-FineTuned 2025-04-09 08:16:07' at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa/compare?selectedSessions=585feca1-fed7-4e07-88da-6a18932e925a

View all tests for Dataset Summarize_dataset_2025-04-09 08:04:41 at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa
[>                                                 ] 0/10

Error Type: HfHubHTTPError, Message: (Request ID: Root=1-67f62cc8-2f31dfb13ad317ed1b2809d1;e09e5258-e3a9-464b-a3cc-d55d953fe477)

403 Forbidden: This authentication method does not have sufficient permissions to call Inference Providers on behalf of user NeoKatGen.
Cannot access content at: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm.
Make sure your token has the correct permissions.

[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparison between the two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [45]:
pip install langchain-openai


Collecting langchain-openai
  Downloading langchain_openai-0.3.12-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.12-py3-none-any.whl (61 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.3.12


In [46]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [47]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2025-04-09 08:18:19' at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa/compare?selectedSessions=aa7bcf43-7cf6-4f11-a44f-a12ff332bf88

View all tests for Dataset Summarize_dataset_2025-04-09 08:04:41 at:
https://smith.langchain.com/o/6e016ce1-7efa-4f31-88c4-0a5b316207f1/datasets/5c557e89-f6c2-4325-a855-f52da3b763fa
[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model yielded the best results. Be aware, though: as we can see, there is a cost involved, since we are calling a paid API.

Another crucial piece of information is that we can view performance data for the models. This data can also serve as a basic check of our inference server.
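As a rough local counterpart to that performance data, the sketch below times each call to a model and summarizes the latencies. It assumes latency per call is the figure of interest; `fake_model` is a made-up stand-in for a real summarizer, and the helper names are my own. This is not how LangSmith collects its numbers.

```python
import statistics
import time


def measure_latency(fn, prompts):
    """Call fn on each prompt and return the per-call latencies in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        fn(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize_latencies(latencies):
    """Collapse a list of latencies into a few headline numbers."""
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def fake_model(prompt):
    # Hypothetical stand-in for a real model call; it just waits briefly
    time.sleep(0.01)
    return prompt[:20]


stats = summarize_latencies(measure_latency(fake_model, ["a", "b", "c"]))
print(stats)
```

Swapping `fake_model` for a function that calls your own endpoint gives a quick first impression of how the server behaves under a handful of requests.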