# Lab | Langchain Evaluation

## Intro

Pick different datasets and re-run this notebook. The goal is for you to understand all the steps involved and the many different ways one can, and should, evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [3]:
# check python and GPU
import sys, os
print("python:", sys.version)
!nvidia-smi || echo "No GPU"

python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
/bin/bash: line 1: nvidia-smi: command not found
No GPU
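
The `!nvidia-smi || echo "No GPU"` fallback only works inside a notebook shell. A minimal sketch of the same check in plain Python, assuming that finding `nvidia-smi` on the `PATH` is a reasonable proxy for a usable GPU (the helper name `has_nvidia_smi` is hypothetical):

```python
import shutil


def has_nvidia_smi() -> bool:
    """Return True if the nvidia-smi executable is found on PATH."""
    return shutil.which("nvidia-smi") is not None


print("GPU tooling found" if has_nvidia_smi() else "No GPU")
```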


In [9]:
# install the minimal set you'll need (adjust versions if you pinned them)
!pip install -q --upgrade pip setuptools
!pip install -q langchain==1.1.0 langchain-openai==1.1.0 langchain-huggingface==1.1.0 ragas==0.3.9 \
    sentence-transformers==5.1.2 datasets docarray python-dotenv pandas scikit-learn numpy openai tqdm nest-asyncio
# install transformers 4.x if sentence-transformers complains about transformers v5:
!pip install -q "transformers==4.57.3"

In [1]:
# Quick checks (run in Colab cell or terminal)
# 1) Verify core imports work
!python -c "import langchain, ragas, sentence_transformers, transformers, torch, requests; print('ok')"

# 2) If you see an ImportError for missing jedi, install it:
!pip install --upgrade jedi

# 3) If a package truly needs a specific requests version and you must change it:
#    Be cautious: downgrading requests may break Google Colab runtime helpers.
#    Only do this if you reproducibly need it (and after restarting the runtime):
!pip install requests==2.32.4

2025-12-02 10:46:46.165071: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764672406.207471   13728 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764672406.220165   13728 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1764672406.255153   13728 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1764672406.255217   13728 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1764672406.255227   13728 computation_placer.cc:177] computation placer alr

In [7]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
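
`os.getenv` silently returns `None` when the variable is missing, so it can help to confirm the key was loaded without ever printing it. A small sketch, assuming that reporting only the length is acceptable (the helper `key_status` is hypothetical):

```python
import os


def key_status(name: str, env=os.environ) -> str:
    """Describe whether an environment variable is set, without revealing it."""
    value = env.get(name)
    if not value:
        return f"{name} is not set"
    # Report only the length so the secret never reaches the notebook output.
    return f"{name} is set ({len(value)} characters)"


print(key_status("OPENAI_API_KEY"))
```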

In [3]:
import langchain, ragas, sentence_transformers, transformers, torch
print("langchain", langchain.__version__)
print("ragas", ragas.__version__)
print("sentence-transformers", sentence_transformers.__version__)
print("transformers", transformers.__version__)
print("torch", getattr(torch, "__version__", "not installed"))

langchain 1.1.0
ragas 0.3.9
sentence-transformers 5.1.2
transformers 4.57.3
torch 2.9.1+cu128
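
Importing heavy packages only to read `__version__` is slow and can trigger unrelated start-up logs. A possible alternative, sketched with the standard library's `importlib.metadata`, reads the installed distribution metadata instead (the helper `installed_version` is hypothetical, and the distribution names are assumed to match the `pip install` names used earlier):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


for name in ["langchain", "ragas", "sentence-transformers", "transformers", "torch"]:
    print(name, installed_version(name))
```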


In [4]:
# Create vectorstore directly
from langchain_classic.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_huggingface import HuggingFaceEmbeddings

# Load one Document per CSV row
file = "/content/OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file)
data = loader.load()

# Embed the documents on CPU and index them in memory
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'}
    )
).from_documents(data)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
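
Conceptually, the index embeds each document and, at query time, returns the documents whose vectors lie closest to the query vector. The sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration, not the actual `DocArrayInMemorySearch` implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document ids most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real sentence embeddings.
doc_vecs = {
    "pullover": [0.9, 0.1, 0.0],
    "jacket": [0.1, 0.9, 0.0],
    "dog mat": [0.0, 0.1, 0.9],
}
print(top_k([0.8, 0.2, 0.0], doc_vecs))
```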



In [5]:
from langchain_openai import ChatOpenAI
from langchain_classic.chains import RetrievalQA

In [9]:
import os

# get the key from colab secrets
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [10]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
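
With `chain_type="stuff"`, the retrieved documents are concatenated into a single context string that is placed in the prompt, and `document_separator` is the string put between them. A rough sketch of that joining step, assuming a simplified prompt layout (the `build_stuff_prompt` helper and the prompt wording are hypothetical, not LangChain's actual template):

```python
def build_stuff_prompt(question: str, docs: list[str], separator: str = "<<<<>>>>>") -> str:
    """Join all retrieved documents with the separator and append the question."""
    context = separator.join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_stuff_prompt(
    "Does the set have side pockets?",
    ["name: Cozy Comfort Pullover Set", "name: Ultra-Lofty 850 Stretch Down Hooded Jacket"],
)
print(prompt)
```

A distinctive separator makes document boundaries easy to spot when inspecting the prompt during debugging.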


In [11]:
data[10]

Document(metadata={'source': '/content/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [12]:
data[11]

Document(metadata={'source': '/content/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

In [13]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import BaseOutputParser
from pydantic import BaseModel, Field
from langchain_classic.chains import LLMChain
from langchain_openai import ChatOpenAI



examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.run({"query": query})

# Print the result
print(result)



answer='Yes, the Cozy Comfort Pullover Set is available in multiple colors such as gray, navy, and black.'
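
The parser keeps whatever follows the last `Answer:` marker, or the whole text when the marker is absent. The same string logic in isolation, without LangChain or pydantic (the function name `extract_answer` is hypothetical):

```python
def extract_answer(text: str) -> str:
    """Return the text after the last 'Answer:' marker, stripped of whitespace."""
    return text.strip().split("Answer:")[-1].strip()


print(extract_answer("Query: Any pockets?\nAnswer: Yes"))
print(extract_answer("  Yes, in several colors.  "))
```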


In [14]:
from langchain_classic.evaluation.qa import QAGenerateChain

In [15]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [16]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [17]:
new_examples = example_gen_chain.batch(
    [{"doc": t} for t in data[:5]]
)

In [18]:
new_examples[0]

{'doc': Document(metadata={'source': '/content/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries."),
 'qa_pairs': {'query': "How should customers determine their size when ordering the Women's Campside Oxfords

In [19]:
data[0]

Document(metadata={'source': '/content/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [20]:
d_flattened = [example['qa_pairs'] for example in new_examples]
d_flattened

[{'query': "How should customers determine their size when ordering the Women's Campside Oxfords?",
  'answer': 'Customers should order their regular shoe size. For half sizes not offered, they should order up to the next whole size.'},
 {'query': 'What are the dimensions of the Small and Medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The Small size has dimensions of 18" x 28" and the Medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What features does the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece offer according to the document?",
  'answer': "The toddler's two-piece swimsuit offers bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage. It is recommended to machine wash and line dry for best results."},
 {'query': 'What is the fabric makeup of the Refresh Swimwea
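
Generated question/answer pairs are model output, so it can be worth checking their shape before adding them to the evaluation set. A minimal sketch, assuming each item should be a dict with non-empty `query` and `answer` strings (`is_valid_pair` is a hypothetical helper, and the sample data is made up):

```python
def is_valid_pair(pair) -> bool:
    """True if pair is a dict with non-empty string 'query' and 'answer' values."""
    if not isinstance(pair, dict):
        return False
    return all(
        isinstance(pair.get(key), str) and pair[key].strip()
        for key in ("query", "answer")
    )


candidates = [
    {"query": "What material is the shoe made of?", "answer": "Soft canvas."},
    {"query": "Missing answer?", "answer": ""},
    {"question": "Wrong key name", "answer": "n/a"},
]
valid = [p for p in candidates if is_valid_pair(p)]
print(len(valid), "of", len(candidates), "pairs kept")
```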

In [21]:
# examples += new_example
examples += d_flattened

In [22]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'answer': 'Yes'}
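
The run of spaces in the query comes from the backslash continuation inside the string literal: the indentation of the next source line becomes part of the string. One possible clean-up, sketched here as a standalone helper (`normalize_query` is a hypothetical name), collapses any whitespace run into a single space:

```python
def normalize_query(query: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(query.split())


raw = "Do the Cozy Comfort Pullover Set        have side pockets?"
print(normalize_query(raw))
```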

In [23]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

In [24]:
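import langchain
# Legacy switch: the output of the next cell shows no extra trace, so on recent
# versions consider `from langchain_core.globals import set_debug; set_debug(True)`.
langchain.debug = True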

In [25]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side pockets.'}

In [26]:
# Turn off the debug mode
langchain.debug = False

In [27]:
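# Note: d_flattened was already appended in an earlier cell, so each generated
# example now appears twice (2 hand-written + 5 + 5 = 12 examples).
examples += d_flattened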
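
Since `d_flattened` was already appended to `examples` in an earlier cell, appending it again stores every generated pair twice. A minimal sketch of removing such repeats while keeping the original order, assuming the `query` text is a good enough identity for an example (`dedupe_examples` is a hypothetical helper, and the sample list is made up):

```python
def dedupe_examples(examples):
    """Keep the first example for each distinct query, preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        if ex["query"] not in seen:
            seen.add(ex["query"])
            unique.append(ex)
    return unique


sample = [
    {"query": "Q1", "answer": "A1"},
    {"query": "Q2", "answer": "A2"},
    {"query": "Q1", "answer": "A1"},
]
print(len(dedupe_examples(sample)))
```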

In [29]:
predictions = qa.batch(examples)




> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Entering new RetrievalQA chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.


In [30]:
predictions

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes',
  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "How should customers determine their size when ordering the Women's Campside Oxfords?",
  'answer': 'Customers should order their regular shoe size. For half sizes not offered, they should order up to the next whole size.',
  'result': "To determine the size when ordering the Women's Campside Oxfords, customers should refer to the specific size and fit information provided for each shirt. The size and fit details typically include information on how the shirt is designed to fit, such as whether it is a relaxed fit, slightly fitted, or tailored fit. Additionally, measurements like chest, 

In [31]:
from langchain_classic.evaluation.qa import QAEvalChain

In [32]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [33]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [34]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
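
To turn the list of grades into a single number, the verdicts can be tallied. A small sketch using the twelve grades from the output above (the `results` key is taken from that output; `accuracy` is a hypothetical helper):

```python
from collections import Counter


def accuracy(graded_outputs) -> float:
    """Fraction of graded outputs whose verdict is exactly 'CORRECT'."""
    counts = Counter(g["results"].strip() for g in graded_outputs)
    return counts["CORRECT"] / len(graded_outputs)


# Same order as the output: two correct, one incorrect, nine correct.
graded = (
    [{"results": "CORRECT"}] * 2
    + [{"results": "INCORRECT"}]
    + [{"results": "CORRECT"}] * 9
)
print(f"{accuracy(graded):.3f}")
```

Keep in mind that the grader is itself an LLM, so a single accuracy figure is best read alongside a manual look at the failures.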

In [35]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.

Example 2:
Question: How should customers determine their size when ordering the Women's Campside Oxfords?
Real Answer: Customers should order their regular shoe size. For half sizes not offered, they should order up to the next whole size.
Predicted Answer: To determine the size when ordering the Women's Campside Oxfords, customers should refer to the specific size and fit information provided for each shirt. The size and fit details typically include information on how the shirt is designed to fit, such as whether it is a relaxed fit, slightly fitted, or tailored fit. Additionally,

In [36]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
loader = TextLoader("/content/nyc_text.txt")
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'},
    )
).from_loaders([loader])


llm = ChatOpenAI(temperature= 0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)


In [37]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.'

In [38]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.',
 'source_documents': [Document(id='169d23dc-a4fd-47c0-a1b6-6bc2366f0fd1', metadata={'source': '/content/nyc_text.txt'}, page_content='The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York City is home to more than 3.2 million residents born outside the U.S., the largest foreign-born population of any city in the world as of 2016.New York City traces its origins to a trading post founded on the southern tip of Manhattan Island by Dutch colonists in approxi

In [39]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [40]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City
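
Indexing `eval_answers` by position works only while both lists stay the same length and order. A slightly more defensive variant, sketched with shortened data (`build_examples` is a hypothetical helper), pairs the lists with `zip` and fails loudly on a length mismatch:

```python
def build_examples(questions, answers):
    """Pair each question with its ground-truth answer in the expected dict shape."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    return [{"query": q, "ground_truths": [a]} for q, a in zip(questions, answers)]


paired = build_examples(
    ["Which borough of New York City has the highest population?"],
    ["Brooklyn"],
)
print(paired)
```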

In [42]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

'Manhattan (New York County) has the highest population density of any borough in New York City.'
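
The chain answered a different question (population density) from the one asked (population), and the expected answer in `eval_answers` is "Brooklyn". Before reaching for LLM-based metrics, a cheap string check can already flag such misses. A minimal sketch, assuming a case-insensitive substring match is an acceptable first filter for short ground truths (`mentions_ground_truth` is a hypothetical helper):

```python
def mentions_ground_truth(answer: str, ground_truths: list[str]) -> bool:
    """True if any ground-truth string occurs in the answer, ignoring case."""
    lowered = answer.lower()
    return any(gt.lower() in lowered for gt in ground_truths)


answer = (
    "Manhattan (New York County) has the highest population density "
    "of any borough in New York City."
)
print(mentions_ground_truth(answer, ["Brooklyn"]))
```

This kind of check is only meaningful for short factual answers; long free-text ground truths still need a semantic comparison.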

In [43]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]
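
The loop copies the chain output into the key names used later for the Ragas dataset. The same step as a reusable function, with a stand-in `Doc` class so the sketch runs without LangChain (both `rename_keys` and `Doc` are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


def rename_keys(record: dict, mapping: dict) -> dict:
    """Return a new dict holding only the mapped keys, under their new names."""
    return {new: record[old] for old, new in mapping.items() if old in record}


chain_output = {"query": "Q?", "result": "A.", "source_documents": [Doc("some context")]}
mapping = {"query": "question", "result": "answer", "source_documents": "contexts"}
updated = rename_keys(chain_output, mapping)

# Metrics generally expect plain strings, so unwrap the document text.
contexts = [d.page_content for d in updated["contexts"]]
print(updated["question"], updated["answer"], contexts)
```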


In [44]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'contexts': [Document(id='73d94f0c-cc7b-445f-ae6f-67b2b93041b0', metadata={'source': '/content/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York City's 

In [45]:
import ragas.metrics as metrics
print(dir(metrics))

['AgentGoalAccuracyWithReference', 'AgentGoalAccuracyWithoutReference', 'AnswerAccuracy', 'AnswerCorrectness', 'AnswerRelevancy', 'AnswerSimilarity', 'AspectCritic', 'BaseMetric', 'BleuScore', 'ChrfScore', 'ContextEntityRecall', 'ContextPrecision', 'ContextRecall', 'ContextRelevance', 'ContextUtilization', 'DataCompyScore', 'DiscreteMetric', 'DistanceMeasure', 'ExactMatch', 'FactualCorrectness', 'Faithfulness', 'FaithfulnesswithHHEM', 'IDBasedContextPrecision', 'IDBasedContextRecall', 'InstanceRubrics', 'LLMContextPrecisionWithReference', 'LLMContextPrecisionWithoutReference', 'LLMContextRecall', 'LLMMetric', 'LLMSQLEquivalence', 'Metric', 'MetricOutputType', 'MetricResult', 'MetricType', 'MetricWithEmbeddings', 'MetricWithLLM', 'MultiModalFaithfulness', 'MultiModalRelevance', 'MultiTurnMetric', 'NoiseSensitivity', 'NonLLMContextPrecisionWithReference', 'NonLLMContextRecall', 'NonLLMStringSimilarity', 'NumericMetric', 'RankingMetric', 'ResponseGroundedness', 'ResponseRelevancy', 'Rouge

In [46]:
from ragas.evaluation import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextRelevance,      # ✅ Correct name
    ContextRecall,
)

print("✅ All imports successful!")

✅ All imports successful!


In [47]:
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextRecall,
)
from langchain_openai import ChatOpenAI
from datasets import Dataset

load_dotenv()

# ✅ Create LLM for RAGAS
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Convert Document objects to strings
contexts_str = [doc.page_content for doc in result_updated["contexts"]]

# Prepare dataset
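data = {
    "question": [result_updated["question"]],
    "answer": [result_updated["answer"]],
    "contexts": [contexts_str],
    # Note: the reference is the model's own answer here, not the ground truth
    # ("Brooklyn") from `examples`, so context_recall is not measured against
    # the expected answer.
    "reference": [result_updated["answer"]]
}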

dataset = Dataset.from_dict(data)

# ✅ Pass LLM to metrics
metrics = [
    Faithfulness(llm=llm),
    AnswerRelevancy(llm=llm),
    ContextRecall(llm=llm),
]

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm
)

# Print results
print(results)
print(f"\nFaithfulness: {results['faithfulness']}")
print(f"Answer Relevancy: {results['answer_relevancy']}")
print(f"Context Recall: {results['context_recall']}")

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]

{'faithfulness': 0.5000, 'answer_relevancy': 0.9717, 'context_recall': 0.0000}

Faithfulness: [0.5]
Answer Relevancy: [0.9716865280717575]
Context Recall: [0.0]
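
As the Ragas documentation describes it, faithfulness is the share of claims in the answer that can be inferred from the retrieved contexts. The arithmetic alone can be sketched as below. The claim verdicts are made-up placeholders (in Ragas they come from LLM calls), and returning 0.0 for an empty list is a choice made for this sketch. A score of 0.5 is consistent with, for example, one supported claim out of two, although the number of claims actually extracted is not shown in the output.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Supported claims divided by total claims; 0.0 when there are no claims."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Placeholder verdicts: one claim supported by the context, one not.
print(faithfulness_score([True, False]))
```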


In [48]:
# Recheck the result that we are going to validate.
result

{'query': 'Which borough of New York City has the highest population?',
 'result': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'source_documents': [Document(id='73d94f0c-cc7b-445f-ae6f-67b2b93041b0', metadata={'source': '/content/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York Ci

In [49]:
print(result_updated.keys())  # See what keys are in the dict
for k, v in result_updated.items():
    print(k, ":", str(v)[:500])  # Print first 500 chars or so of each value



dict_keys(['question', 'answer', 'contexts'])
question : Which borough of New York City has the highest population?
answer : Manhattan (New York County) has the highest population density of any borough in New York City.
contexts : [Document(id='73d94f0c-cc7b-445f-ae6f-67b2b93041b0', metadata={'source': '/content/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most po


In [50]:
import nest_asyncio
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
from langchain_openai import ChatOpenAI
from datasets import Dataset

nest_asyncio.apply()
load_dotenv()

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

contexts_str = [
    doc.page_content if hasattr(doc, "page_content") else str(doc)
    for doc in result_updated["contexts"]
]

data = {
    "question": [result_updated["question"]],
    "answer": [result_updated["answer"]],
    "contexts": [contexts_str],
    "reference": [result_updated["answer"]]
}

dataset = Dataset.from_dict(data)

results = evaluate(
    dataset=dataset,
    metrics=[Faithfulness(llm=llm), AnswerRelevancy(llm=llm), ContextRecall(llm=llm)],
    llm=llm
)

print(f"✅ Faithfulness: {results['faithfulness'][0]:.4f}")
print(f"✅ Answer Relevancy: {results['answer_relevancy'][0]:.4f}")
print(f"✅ Context Recall: {results['context_recall'][0]:.4f}")

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Faithfulness: 0.5000
✅ Answer Relevancy: 0.9717
✅ Context Recall: 0.0000


In [51]:
import nest_asyncio
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
from langchain_openai import ChatOpenAI
from datasets import Dataset

nest_asyncio.apply()
load_dotenv()

# Create the LLM used by the Ragas metrics
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Convert Documents to plain strings
contexts_str = [
    doc.page_content if hasattr(doc, "page_content") else str(doc)
    for doc in result_updated["contexts"]
]

# Prepare a one-row dataset for Ragas
data = {
    "question": [result_updated["question"]],
    "answer": [result_updated["answer"]],
    "contexts": [contexts_str],
    "reference": [result_updated["answer"]]
}

dataset = Dataset.from_dict(data)

# ✅ NEW: Use evaluate() instead of faithfulness_chain()
results = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=llm),
        AnswerRelevancy(llm=llm),
        ContextRecall(llm=llm),
    ],
    llm=llm
)

# Each metric maps to a list of per-row scores; take the first (only) row
print(f"Faithfulness: {results['faithfulness'][0]}")
print(f"Answer Relevancy: {results['answer_relevancy'][0]}")
print(f"Context Recall: {results['context_recall'][0]}")

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]

Faithfulness: 0.5
Answer Relevancy: 0.9716718586593478
Context Recall: 0.0
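
Answer relevancy came out as 0.9716865... in the first evaluation and 0.9716718... here, so these scores are not exactly reproducible between runs. When comparing runs it is safer to use a tolerance than exact equality. A minimal sketch (the tolerance value is an arbitrary assumption):

```python
import math

run_a = 0.9716865280717575  # answer relevancy from the first evaluation
run_b = 0.9716718586593478  # answer relevancy from this evaluation

# Exact comparison fails; comparison within a small absolute tolerance passes.
print(run_a == run_b)
print(math.isclose(run_a, run_b, abs_tol=1e-3))
```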


In [52]:
import nest_asyncio
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
from langchain_openai import ChatOpenAI
from datasets import Dataset

nest_asyncio.apply()
load_dotenv()

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# ✅ Helper function to convert documents to strings
def get_contexts_str(contexts):
    return [
        doc.page_content if hasattr(doc, "page_content") else str(doc)
        for doc in contexts
    ]

# ✅ Create metrics ONCE (reuse for all evaluations)
metrics = [
    Faithfulness(llm=llm),
    AnswerRelevancy(llm=llm),
    ContextRecall(llm=llm),
]

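# ✅ Helper function to evaluate any answer
def evaluate_answer(question, answer, contexts, test_name=""):
    """Score one (question, answer, contexts) triple with RAGAS and print the scores."""
    contexts_str = get_contexts_str(contexts)

    data = {
        "question": [question],
        "answer": [answer],
        "contexts": [contexts_str],
        # The answer is reused as the reference, so Context Recall checks the
        # answer's own statements against the contexts, not a ground truth.
        "reference": [answer]
    }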

    dataset = Dataset.from_dict(data)
    results = evaluate(dataset=dataset, metrics=metrics, llm=llm)

    print(f"\n{'='*60}")
    print(f"TEST: {test_name}")
    print(f"{'='*60}")
    print(f"Question: {question}")
    print(f"Answer: {answer[:100]}..." if len(answer) > 100 else f"Answer: {answer}")
    print(f"Faithfulness:    {results['faithfulness'][0]:.4f}")
    print(f"Answer Relevancy: {results['answer_relevancy'][0]:.4f}")
    print(f"Context Recall:   {results['context_recall'][0]:.4f}")

    return results

# ============================================
# ORIGINAL: Real answer from your RAG
# ============================================
real_results = evaluate_answer(
    question=result_updated["question"],
    answer=result_updated["answer"],
    contexts=result_updated["contexts"],
    test_name="✅ Real Answer"
)

# ============================================
# TEST 1: Completely fake answer (hallucination)
# ============================================
fake_results = evaluate_answer(
    question=result_updated["question"],
    answer="we are the champions",
    contexts=result_updated["contexts"],
    test_name="❌ Fake Answer (Hallucination)"
)

# ============================================
# TEST 2: Partially wrong (mix real + false)
# ============================================
partial_results = evaluate_answer(
    question=result_updated["question"],
    answer=result_updated["answer"][:50] + " [FAKE INFO ADDED HERE]",
    contexts=result_updated["contexts"],
    test_name="⚠️ Partially Wrong Answer"
)

# ============================================
# TEST 3: Right answer, wrong question
# ============================================
wrong_question_results = evaluate_answer(
    question="What is completely unrelated?",
    answer=result_updated["answer"],
    contexts=result_updated["contexts"],
    test_name="❌ Wrong Question"
)

# ============================================
# COMPARISON
# ============================================
print(f"\n{'='*60}")
print("FINAL COMPARISON")
print(f"{'='*60}")
print(f"{'Test Name':<30} | {'Faithfulness':<12} | {'Relevancy':<12} | {'Recall':<12}")
print(f"{'-'*70}")
print(f"{'Real Answer':<30} | {real_results['faithfulness'][0]:<12.4f} | {real_results['answer_relevancy'][0]:<12.4f} | {real_results['context_recall'][0]:<12.4f}")
print(f"{'Fake Answer':<30} | {fake_results['faithfulness'][0]:<12.4f} | {fake_results['answer_relevancy'][0]:<12.4f} | {fake_results['context_recall'][0]:<12.4f}")
print(f"{'Partial Wrong':<30} | {partial_results['faithfulness'][0]:<12.4f} | {partial_results['answer_relevancy'][0]:<12.4f} | {partial_results['context_recall'][0]:<12.4f}")
print(f"{'Wrong Question':<30} | {wrong_question_results['faithfulness'][0]:<12.4f} | {wrong_question_results['answer_relevancy'][0]:<12.4f} | {wrong_question_results['context_recall'][0]:<12.4f}")

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: ✅ Real Answer
Question: Which borough of New York City has the highest population?
Answer: Manhattan (New York County) has the highest population density of any borough in New York City.
Faithfulness:    0.5000
Answer Relevancy: 0.9717
Context Recall:   0.0000


Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: ❌ Fake Answer (Hallucination)
Question: Which borough of New York City has the highest population?
Answer: we are the champions
Faithfulness:    0.0000
Answer Relevancy: 0.7337
Context Recall:   0.0000


Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: ⚠️ Partially Wrong Answer
Question: Which borough of New York City has the highest population?
Answer: Manhattan (New York County) has the highest popula [FAKE INFO ADDED HERE]
Faithfulness:    0.6667
Answer Relevancy: 0.0000
Context Recall:   0.0000


Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: ❌ Wrong Question
Question: What is completely unrelated?
Answer: Manhattan (New York County) has the highest population density of any borough in New York City.
Faithfulness:    0.5000
Answer Relevancy: 0.7160
Context Recall:   1.0000

FINAL COMPARISON
Test Name                      | Faithfulness | Relevancy    | Recall      
----------------------------------------------------------------------
Real Answer                    | 0.5000       | 0.9717       | 0.0000      
Fake Answer                    | 0.0000       | 0.7337       | 0.0000      
Partial Wrong                  | 0.6667       | 0.0000       | 0.0000      
Wrong Question                 | 0.5000       | 0.7160       | 1.0000      
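
The four near-identical `print` calls that build this table can be collapsed into a loop. The following is a minimal sketch, assuming each result has first been reduced to a plain dict of floats. `format_comparison` is a hypothetical helper, and the numbers are copied from the table printed above.

```python
def format_comparison(rows):
    """Build an aligned text table from (name, scores) pairs."""
    header = (
        f"{'Test Name':<30} | {'Faithfulness':<12} | "
        f"{'Relevancy':<12} | {'Recall':<12}"
    )
    lines = [header, "-" * 70]
    for name, s in rows:
        lines.append(
            f"{name:<30} | {s['faithfulness']:<12.4f} | "
            f"{s['answer_relevancy']:<12.4f} | {s['context_recall']:<12.4f}"
        )
    return "\n".join(lines)


# Scores copied from the comparison output of this notebook.
rows = [
    ("Real Answer", {"faithfulness": 0.5000, "answer_relevancy": 0.9717, "context_recall": 0.0}),
    ("Fake Answer", {"faithfulness": 0.0000, "answer_relevancy": 0.7337, "context_recall": 0.0}),
    ("Partial Wrong", {"faithfulness": 0.6667, "answer_relevancy": 0.0000, "context_recall": 0.0}),
    ("Wrong Question", {"faithfulness": 0.5000, "answer_relevancy": 0.7160, "context_recall": 1.0}),
]
print(format_comparison(rows))
```

Adding a further test case then only requires appending one tuple to `rows`.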


In [53]:
# OLD: fake_result["result"] = "we are the champions"
# NEW:
evaluate_answer(
    question=result_updated["question"],
    answer="we are the champions",  # ← This replaces the old code
    contexts=result_updated["contexts"],
    test_name="Fake Answer"
)

# OLD: fake_result["contexts"] = ["I love christmas"]
# NEW:
evaluate_answer(
    question=result_updated["question"],
    answer=result_updated["answer"],
    contexts=["I love christmas"],  # ← This replaces the old code
    test_name="Wrong Context"
)

Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: Fake Answer
Question: Which borough of New York City has the highest population?
Answer: we are the champions
Faithfulness:    0.0000
Answer Relevancy: 0.7337
Context Recall:   0.0000


Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]


TEST: Wrong Context
Question: Which borough of New York City has the highest population?
Answer: Manhattan (New York County) has the highest population density of any borough in New York City.
Faithfulness:    0.0000
Answer Relevancy: 0.9717
Context Recall:   0.0000


{'faithfulness': 0.0000, 'answer_relevancy': 0.9717, 'context_recall': 0.0000}
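
Values such as 0.0000, 0.5000 and 0.6667 are consistent with how RAGAS describes Faithfulness: the fraction of claims extracted from the answer that the retrieved context supports. The sketch below reproduces only that arithmetic. `claim_ratio` is a hypothetical helper, and the verdict lists are illustrative stand-ins, not the judgments the LLM actually produced in these runs.

```python
import math


def claim_ratio(verdicts):
    """Fraction of supported claims; verdicts is a list of booleans."""
    if not verdicts:
        return math.nan  # no claims extracted, so the ratio is undefined
    return sum(verdicts) / len(verdicts)


print(f"{claim_ratio([False, False]):.4f}")       # no claim supported, e.g. an unrelated context
print(f"{claim_ratio([True, False]):.4f}")        # 1 of 2 claims supported
print(f"{claim_ratio([True, True, False]):.4f}")  # 2 of 3 claims supported
```

Because the denominator is the number of extracted claims, a short answer can move between 0, 0.5 and 1 on a single verdict, which is worth keeping in mind when comparing single-row scores.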

In [54]:
import nest_asyncio
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
from langchain_openai import ChatOpenAI
from datasets import Dataset
import numpy as np

nest_asyncio.apply()
load_dotenv()

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# ✅ Run predictions and SAVE them
predictions = qa_chain.batch(examples)

# Extract ground truths
reference_answers = [ex['ground_truths'][0] for ex in examples]

# Prepare batch data
batch_data = {
    "question": [ex['query'] for ex in examples],
    "answer": [pred.get('result') or pred.get('answer') for pred in predictions],
    "contexts": [
        [doc.page_content if hasattr(doc, "page_content") else str(doc)
         for doc in pred.get('source_documents', [])]
        for pred in predictions
    ],
    "reference": reference_answers
}

dataset = Dataset.from_dict(batch_data)

# Evaluate
print("Evaluating batch...")
results = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=llm),
        AnswerRelevancy(llm=llm),
        ContextRecall(llm=llm),
    ],
    llm=llm
)

# ✅ DISPLAY PREDICTIONS AND EVALUATIONS
print("\n" + "="*70)
print("PREDICTIONS AND EVALUATION RESULTS")
print("="*70)

for i, ex in enumerate(examples):
    pred = predictions[i]

    print(f"\n📌 Example {i+1}")
    print(f"Question: {ex['query']}")
    print(f"\nGround Truth: {ex['ground_truths'][0][:100]}...")
    print(f"\n🤖 Predicted Answer: {pred.get('result') or pred.get('answer')}")

    # Show retrieved documents
    if pred.get('source_documents'):
        print(f"\n📄 Retrieved {len(pred['source_documents'])} documents:")
        for j, doc in enumerate(pred['source_documents'][:2]):  # Show first 2
            content = doc.page_content[:80] if hasattr(doc, 'page_content') else str(doc)[:80]
            print(f"   [{j+1}] {content}...")

    # Show scores
    print(f"\n✅ Scores:")
    print(f"   Faithfulness: {results['faithfulness'][i]:.4f}")
    print(f"   Answer Relevancy: {results['answer_relevancy'][i]:.4f}")
    print(f"   Context Recall: {results['context_recall'][i]:.4f}")
    print("-" * 70)

# Average scores
print("\n" + "="*70)
print("AVERAGE SCORES")
print("="*70)
print(f"Average Faithfulness: {np.mean(results['faithfulness']):.4f}")
print(f"Average Answer Relevancy: {np.mean(results['answer_relevancy']):.4f}")
print(f"Average Context Recall: {np.mean(results['context_recall']):.4f}")

Evaluating batch...


Evaluating:   0%|          | 0/15 [00:00<?, ?it/s]


PREDICTIONS AND EVALUATION RESULTS

📌 Example 1
Question: What is the population of New York City as of 2020?

Ground Truth: 8,804,190...

🤖 Predicted Answer: The population of New York City as of 2020 was 8,804,190 residents.

📄 Retrieved 4 documents:
   [1] New York City is the most populous city in the United States, with 8,804,190 res...
   [2] New York, often called New York City or NYC, is the most populous city in the Un...

✅ Scores:
   Faithfulness: 1.0000
   Answer Relevancy: 0.9933
   Context Recall: 1.0000
----------------------------------------------------------------------

📌 Example 2
Question: Which borough of New York City has the highest population?

Ground Truth: Brooklyn...

🤖 Predicted Answer: Manhattan (New York County) has the highest population density of any borough in New York City.

📄 Retrieved 4 documents:
   [1] New York City is the most populous city in the United States, with 8,804,190 res...
   [2] New York, often called New York City or NYC, is the mo
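
`np.mean` returns `nan` as soon as a single per-example score is `nan`. RAGAS may emit `nan` when a metric call fails or its output cannot be parsed; this is an assumption about failure behaviour worth checking against the installed version. A minimal standard-library sketch that averages only the valid scores and reports how many were skipped follows. `safe_mean` is a hypothetical helper and the score list is illustrative, not taken from this run.

```python
import math


def safe_mean(scores):
    """Average the non-NaN scores; return (mean, number_skipped)."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    skipped = len(scores) - len(valid)
    mean = sum(valid) / len(valid) if valid else math.nan
    return mean, skipped


scores = [1.0, 1.0, math.nan, 0.5]  # illustrative values only
mean, skipped = safe_mean(scores)
print(f"mean={mean:.4f} skipped={skipped}")
```

Reporting the skipped count alongside the mean keeps a partially failed evaluation from looking like a clean one.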

In [55]:
import nest_asyncio
from dotenv import load_dotenv
import os
from ragas.evaluation import evaluate
from ragas.metrics import ContextRecall
from langchain_openai import ChatOpenAI
from datasets import Dataset

nest_asyncio.apply()
load_dotenv()

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# ✅ Run predictions
predictions = qa_chain.batch(examples)

# Extract ground truths
reference_answers = [ex['ground_truths'][0] for ex in examples]

# Prepare batch data
batch_data = {
    "question": [ex['query'] for ex in examples],
    "answer": [pred.get('result') or pred.get('answer') for pred in predictions],
    "contexts": [
        [doc.page_content if hasattr(doc, "page_content") else str(doc)
         for doc in pred.get('source_documents', [])]
        for pred in predictions
    ],
    "reference": reference_answers
}

dataset = Dataset.from_dict(batch_data)

# ✅ Evaluate ONLY context recall
print("Evaluating context recall...")
r = evaluate(
    dataset=dataset,
    metrics=[ContextRecall(llm=llm)],
    llm=llm
)

# Print results
print(r)
print(f"\nContext Recall scores: {r['context_recall']}")

# Show detailed results
print("\n" + "="*70)
print("CONTEXT RECALL EVALUATION")
print("="*70)

for i, ex in enumerate(examples):
    pred = predictions[i]

    print(f"\nExample {i+1}: {ex['query'][:50]}...")
    print(f"Context Recall Score: {r['context_recall'][i]:.4f}")
    print(f"Ground Truth: {ex['ground_truths'][0][:80]}...")
    print(f"Retrieved {len(pred.get('source_documents', []))} documents")
    print("-" * 70)

# Average
import numpy as np
print(f"\n✅ Average Context Recall: {np.mean(r['context_recall']):.4f}")

Evaluating context recall...


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]

{'context_recall': 0.8333}

Context Recall scores: [1.0, 1.0, 1.0, 0.5, 0.6666666666666666]

CONTEXT RECALL EVALUATION

Example 1: What is the population of New York City as of 2020...
Context Recall Score: 1.0000
Ground Truth: 8,804,190...
Retrieved 4 documents
----------------------------------------------------------------------

Example 2: Which borough of New York City has the highest pop...
Context Recall Score: 1.0000
Ground Truth: Brooklyn...
Retrieved 4 documents
----------------------------------------------------------------------

Example 3: What is the economic significance of New York City...
Context Recall Score: 1.0000
Ground Truth: New York City's economic significance is vast, as it serves as the global financ...
Retrieved 4 documents
----------------------------------------------------------------------

Example 4: How did New York City get its name?...
Context Recall Score: 0.5000
Ground Truth: New York City got its name when it came under British control in 1664. K