# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2, sort of. Pick different sets of data and re-run this notebook, for example parts of the dataset you used in your last project week. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications using LangSmith.

What did you learn? Let's discuss that in class.

## LangSmith - LangChain evaluation

In [3]:
!pip install -q python-dotenv  # for paperspace

In [4]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv('environment.env'))

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [5]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [6]:
!pip install -q langsmith

In [7]:
#Importing Client from Langsmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

### Create Dataset


In [8]:
from datasets import load_dataset
cnn_dataset = load_dataset(
    "ccdv/cnn_dailymail",
    version="3.0.0",
    # trust_remote_code=True
)



In [9]:
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['article']}"
    }

#cnn_dataset = cnn_dataset.map(add_prefix)

In [10]:
cnn_dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [11]:
cnn_dataset['train'][1]

{'article': '(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men\'s 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I\'m proud of myself and I\'ll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory 

In [12]:
# Take just a few news articles for testing
MAX_NEWS=10
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS)).map(add_prefix)

sample_cnn

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 10
})

The dataset contains three columns: article, highlights, and id. To use LangSmith, we need to create a dataset in LangSmith format.

LangSmith expects an input (the prompt) and an output (the expected result). To build the prompt, we add the prefix "Summarize this news:" to each article. As the expected result, we use the content of highlights, which holds the summaries written by humans.

In [117]:
print(sample_cnn[0]['article'])

Summarize this news:
(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he'd been a busy actor for decades in theater and in Hollywood, Best didn't become famous until 1979, when "The Dukes of Hazzard's" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catc

Now that we have the dataset with the prompt and the reference summary, it is time to create a dataset in LangSmith with this information.

### Create the Dataset in LangSmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.
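
As a minimal sketch of that input/output structure, the snippet below builds the pairs by hand from plain dictionaries. The helper name `to_langsmith_pairs` and the sample record are hypothetical, made up for illustration; the notebook itself relies on `client.upload_dataframe` for this step.

```python
PREFIX = "Summarize this news:\n"

def to_langsmith_pairs(records, text_key="article", summary_key="highlights"):
    """Split records into parallel lists of inputs (prompts) and outputs (references)."""
    inputs, outputs = [], []
    for record in records:
        # The input is the prompt that will be sent to the model.
        inputs.append({text_key: PREFIX + record[text_key]})
        # The output is the reference summary the model is compared against.
        outputs.append({summary_key: record[summary_key]})
    return inputs, outputs

# Hypothetical record, not taken from the CNN/DailyMail dataset.
toy_records = [{"article": "A short story.", "highlights": "Story.", "id": "0"}]
inputs, outputs = to_langsmith_pairs(toy_records)
print(inputs[0]["article"])
print(outputs[0]["highlights"])
```

Keeping inputs and outputs as separate lists of dictionaries mirrors the `input_keys` and `output_keys` split used in the upload cell.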

In [14]:
import datetime

In [15]:
import uuid
input_key=['article']
output_key=['highlights']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

In [16]:
# Creates the dataset in LangSmith from the content of sample_cnn. Running this more than once with the same name causes POST errors.
dataset = client.upload_dataframe(
    df=sample_cnn,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

The image below shows an example from the dataset once it has been registered in LangSmith.

In the Input column, there is the prompt to be sent, while in the Output column, the expected output is stored.

When performing the comparison, the model is given the prompt, and the cosine distance between the embedding of its response and the embedding of the reference stored in the dataset is calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">
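
To make the cosine distance concrete, here is a minimal sketch in plain Python. It assumes both texts have already been turned into embedding vectors; the toy vectors below are invented for illustration, and which embedding model produces the real vectors depends on how the evaluator is configured.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embedding vectors have many more dimensions.
prediction_vec = [1.0, 0.0, 0.0]
reference_vec = [0.0, 1.0, 0.0]
print(cosine_distance(prediction_vec, prediction_vec))  # same direction
print(cosine_distance(prediction_vec, reference_vec))   # orthogonal vectors
```

A distance close to 0 means the two vectors point in nearly the same direction, so lower values indicate a generated summary that is semantically closer to the reference.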

### Recovering Models From Hugging Face
Let's retrieve both models from Hugging Face: a base T5 model and a model that has been fine-tuned on the training portion of this same dataset to generate summaries.

In [17]:
%pip install -q langchain==0.3.4 langchain_community pydantic==2.9.2

Note: you may need to restart the kernel to use updated packages.


In [18]:
# Importing HuggingFaceHub from the langchain root package is deprecated
from langchain_community.llms import HuggingFaceHub

In [19]:
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)



In [20]:
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the metrics we want to measure. The configuration below enables two of them: "embedding_distance" and "string_distance".

If you want to run a single evaluation instead of two, comment out one of the entries.
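
While "embedding_distance" compares meaning, "string_distance" compares the characters of the two texts. The sketch below uses a normalized Levenshtein distance purely as an illustration of the idea; this is an assumption for teaching purposes and not necessarily the metric the LangChain evaluator applies.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Scale the edit distance to the range [0, 1] using the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest

print(normalized_distance("two buses burned", "two buses burnt"))
```

Character-level metrics like this one penalize paraphrases heavily, which is why they are usually read together with an embedding-based score rather than on their own.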


In [None]:
!pip install -q rapidfuzz==3.6.1 openai tiktoken
from langchain.smith import run_on_dataset, RunEvalConfig

In [33]:
# We are using two of the multiple evaluators available in LangSmith.

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        "string_distance"
    ],
)



### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset, one for each of the chosen models.

In [34]:
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-BASE 2024-10-29 15:31:46' at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d/compare?selectedSessions=72e3d642-900f-4f07-96a4-2f6fa6e0bcef

View all tests for Dataset Summarize_dataset_2024-10-29 15:09:49 at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d
[------------------------------------------------->] 10/10

In [35]:
#Ignore the error shown below
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-FineTuned 2024-10-29 15:32:30' at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d/compare?selectedSessions=1932ae74-4edf-42da-829b-71094b972193

View all tests for Dataset Summarize_dataset_2024-10-29 15:09:49 at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d
[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparison between the two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">
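
The same comparison can be done numerically once the per-example distances are available. The sketch below is a hypothetical helper that averages the scores per model and picks the lowest mean; the numbers are placeholders for illustration only and are not results from this lab.

```python
from statistics import mean

def summarize_scores(scores_by_model):
    """Return the mean distance per model and the name of the model with the lowest mean."""
    means = {name: mean(scores) for name, scores in scores_by_model.items()}
    best = min(means, key=means.get)
    return means, best

# Placeholder numbers for illustration only; they are NOT results from this lab.
toy_scores = {
    "model_a": [0.30, 0.20, 0.40],
    "model_b": [0.10, 0.20, 0.15],
}
means, best = summarize_scores(toy_scores)
print(best)
```

Because both evaluators report distances, a lower mean is better; with only 10 examples per run, small differences between means should be read with caution.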

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [None]:
%pip install -q langchain_openai

In [38]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [39]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2024-10-29 15:35:53' at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d/compare?selectedSessions=de5ee21b-611b-43fb-9977-4692d415b15f

View all tests for Dataset Summarize_dataset_2024-10-29 15:09:49 at:
https://smith.langchain.com/o/a78c39d8-d67b-5df4-ac33-c58f3235e5bf/datasets/9e46798d-0a8f-4f58-9741-97a079c9c87d
[------------------------------------------------->] 10/10

Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('400 Client Error: Bad Request for url: https://api.smith.langchain.com/runs/multipart', '{"detail":"Empty request"}')


<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model yielded the best results. Be aware, though: since we are calling a paid API, there is a cost involved.

Another crucial piece of information is that we can view performance data for the models. This data could also be useful for a basic evaluation of our inference server.
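
For a rough local check of inference speed, independent of what LangSmith records, a call can simply be timed. This is a minimal sketch: `time_call` and `fake_summarizer` are hypothetical names, and the stand-in function only truncates text instead of calling a real model.

```python
import time

def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_summarizer(text):
    # Hypothetical stand-in for a model call; it only truncates the text.
    return text[:20]

summary, seconds = time_call(fake_summarizer, "Summarize this news:\nSome article text")
print(f"{seconds:.6f}s")
```

Replacing `fake_summarizer` with a real model call and repeating the measurement over several prompts gives a first impression of latency before looking at the traces.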

## My Experiments
Let's try with a different dataset

In [90]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv('environment.env'))

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [91]:
#Importing Client from Langsmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

In [92]:
from datasets import load_dataset
xsum_dataset = load_dataset("xsum")

In [93]:
xsum_dataset['train'][1]


{'document': 'A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.\nAs they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.\nOne of the tour groups is from Germany, the other from China and Taiwan. It was their first night in Northern Ireland.\nThe driver of one of the buses said many of the passengers had left personal belongings on board and these had been destroyed.\nBoth groups have organised replacement coaches and will begin their tour of the north coast later than they had planned.\nPolice have appealed for information about the attack.\nInsp David Gibson said: "It appears as though the fire started under one of the buses before spreading to the second.\n"While the exact cause is still under investigation, it is thought that the fire was started deliberately."',
 'summary': 'Two tourist buses have been destroyed by fire in a suspected arson attack in Belf

In [94]:
def add_prefix(example):
    return {
        **example,
        "article": f"Summarize this news:\n{example['document']}"
    }

# Take just a few news articles for testing
MAX_NEWS=10
sample_xsum = xsum_dataset["test"].select(range(MAX_NEWS)).map(add_prefix)

sample_xsum

Dataset({
    features: ['document', 'summary', 'id', 'article'],
    num_rows: 10
})

In [None]:
print(sample_xsum[0]['article'])

In [81]:
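# Note: indexing a Dataset returns a fresh dict, so this loop does not modify sample_xsum;
# the 'article' column stays a single string (use Dataset.map to transform a column).
for i in range(len(sample_xsum)):
    sample_xsum[i]['article'] = sample_xsum[i]['article'].split('\n')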

In [123]:
import datetime
import pandas as pd
import uuid
input_key=['article']
output_key=['summary']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

sample_xsum_df = pd.DataFrame({
    'article': [entry['article'] for entry in sample_xsum],
    'summary': [entry['summary'] for entry in sample_xsum]
})

# Upload the DataFrame to LangSmith
dataset = client.upload_dataframe(
    df=sample_xsum_df,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

In [124]:
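from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# run_on_dataset works with LangChain models, chains, or runnables, so each
# transformers pipeline is wrapped in HuggingFacePipeline before evaluation.

# Base model
summarizer_base = HuggingFacePipeline(
    pipeline=pipeline("summarization", model="facebook/bart-large")
)

# Fine-tuned model (facebook/bart-large-cnn is fine-tuned on CNN/DailyMail)
summarizer_finetuned = HuggingFacePipeline(
    pipeline=pipeline("summarization", model="facebook/bart-large-cnn")
)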

In [122]:
from langchain.smith import run_on_dataset, RunEvalConfig

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        "string_distance"
    ],
)

In [None]:
project_name = f"BART_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_bart_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

In [None]:
#Ignore the error shown below
project_name = f"BART_FINETUNED_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_bart_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

In [None]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

project_name = f"OpenAI_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)