# Lab | LangChain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? Let's discuss that in class.

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 

### Example 1

#### Create our QandA application

In [18]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [19]:
file = '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [14]:
from importlib.metadata import version, PackageNotFoundError
import sys
print(f"Python path: {sys.executable}")

# Report the installed version of each package this notebook relies on
packages = ['langchain', 'ragas', 'pydantic', 'langsmith', 'openai']
for package in packages:
    try:
        print(f"{package}: {version(package)}")
    except PackageNotFoundError:
        print(f"{package}: Not found")

Python path: /opt/anaconda3/envs/rag_env/bin/python
langchain: 0.3.11
ragas: 0.2.12
pydantic: 2.10.5
langsmith: 0.2.11
openai: 1.56.0


In [9]:
# !pip install --upgrade --force-reinstall sentence-transformers

In [21]:
# Initialize the embeddings
embeddings = OpenAIEmbeddings()

# Create the index
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings
).from_loaders([loader])



In [23]:
# Initialize the LLM
llm = ChatOpenAI(temperature=0.0)

# Create the QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

#### Coming up with test datapoints

In [24]:
data[10]

Document(metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [25]:
data[11]

Document(metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

#### Hard-coded examples

In [26]:
from langchain.prompts import PromptTemplate

In [27]:
from langchain.chains import LLMChain  # LLMChain is used further down in this cell
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.run({"query": query})

# Print the result
print(result)




answer='Yes, the Cozy Comfort Pullover Set is available in multiple colors such as gray, navy, and burgundy.'
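
The split-on-`Answer:` logic used by `AnswerOutputParser.parse` can be checked without calling the model. A minimal sketch, assuming a hypothetical helper named `extract_answer` that mirrors the body of `parse` (the sample strings are made up for illustration):

```python
def extract_answer(text: str) -> str:
    """Return the text after the last 'Answer:' marker, or the whole text if the marker is absent."""
    return text.strip().split("Answer:")[-1].strip()

# The model may or may not echo the "Answer:" prefix; both cases are handled.
with_prefix = extract_answer("Query: Any pockets?\nAnswer: Yes")
without_prefix = extract_answer("  Yes, it has side pockets.  ")
print(with_prefix)
print(without_prefix)
```

Because the split keeps only the last segment, a response that repeats the few-shot examples would still yield just the final answer.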


#### LLM-Generated examples

In [28]:
from langchain.evaluation.qa import QAGenerateChain

In [29]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [30]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [31]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [32]:
new_examples[0]

{'qa_pairs': {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."}}

In [33]:
data[0]

Document(metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [34]:
# Use a loop variable that does not shadow the `data` list loaded earlier
d_flattened = [example['qa_pairs'] for example in new_examples]
d_flattened

[{'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the small and medium Recycled Waterhog dog mats?',
  'answer': 'The small Recycled Waterhog dog mat has dimensions of 18" x 28", while the medium Recycled Waterhog dog mat has dimensions of 22.5" x 34.5".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?",
  'answer': "The key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage, and the recommendation to machine wash and line dry for best results."},
 {'query': 'What is the fabric c

#### Combine examples

In [35]:
# examples += new_example
examples += d_flattened

In [36]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'answer': 'Yes'}
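
The run of spaces inside the query above comes from the backslash line continuation within the string literal: the indentation of the continued line becomes part of the string. A short sketch of the effect and two possible ways around it (the variable names are illustrative, not part of the lab):

```python
# A backslash inside a string literal joins the lines but keeps the next
# line's leading spaces as part of the string.
query_with_gap = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

# Adjacent string literals are concatenated with no hidden whitespace.
query_clean = (
    "Do the Cozy Comfort Pullover Set "
    "have side pockets?"
)

# Collapsing runs of whitespace repairs a string that was already built.
normalized = " ".join(query_with_gap.split())
print(normalized == query_clean)
```

The retriever and the LLM in this notebook tolerate the extra spaces, but normalizing queries keeps printed results easier to read.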

In [37]:
qa.invoke(examples[0]["query"])

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

### Manual Evaluation - Fun part

In [38]:
import langchain
langchain.debug = True

In [39]:
qa.invoke(examples[0]["query"])

[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditiona

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

In [40]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [41]:
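# Note: d_flattened was already appended in the "Combine examples" step,
# so this adds the generated examples a second time (12 examples in total).
examples += d_flattened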

In [42]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the small and medium Recycled Waterhog dog mats?',
  'answer': 'The small Recycled Waterhog dog mat has dimensions of 18" x 28", while the medium Recycled Waterhog dog mat has dimensions of 22.5" x 34.5".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?",
  'answer': "The key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UP

In [43]:
predictions = qa.batch(examples)

> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...

> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.


In [44]:
predictions

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes',
  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.",
  'result': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the small and medium Recycled Waterhog dog mats?',
  'answer': 'The small Recycled Waterhog dog mat has dimensions of 18" x 28", while the medium Recycled Waterhog dog mat has dimensions of 22.5" x 34.5".',
  'result': 'The dimensions of the small Recycled Waterhog dog mat are 18" x 28", an

In [45]:
from langchain.evaluation.qa import QAEvalChain

In [46]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [47]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [48]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
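
With more than a handful of examples, a summary is easier to read than the raw list of grades. A minimal sketch, assuming grader output shaped like `graded_outputs` above; the helper name `summarize_grades` and the sample grades are hypothetical:

```python
from collections import Counter

# Sample grader output in the same shape as `graded_outputs` (values are illustrative).
sample_grades = [{"results": "CORRECT"}, {"results": "INCORRECT"}, {"results": "CORRECT"}]

def summarize_grades(graded):
    """Count each grade label and compute the fraction graded CORRECT."""
    counts = Counter(g["results"].strip().upper() for g in graded)
    accuracy = counts["CORRECT"] / len(graded) if graded else 0.0
    return counts, accuracy

counts, accuracy = summarize_grades(sample_grades)
print(dict(counts), round(accuracy, 2))
```

Keep in mind that these grades come from an LLM judge, so a perfect score is worth spot-checking by hand.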

In [49]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Grade:", graded_outputs[i]['results'])
    # print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog dog mats?
Real Answer: The small Recycled Waterhog dog mat has dimensions of 18" x 28", while the medium Recycled Waterhog dog mat has dimensions of 22.5" x 34.5".
Predicted 

### Example 2
You can also evaluate your QA chains with the metrics offered by ragas.

In [51]:
from langchain_community.embeddings import OpenAIEmbeddings
loader = TextLoader("/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/nyc_text.txt")

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Create index
index = VectorstoreIndexCreator(
    embedding=embeddings
).from_loaders([loader])

# Initialize LLM and create QA chain
llm = ChatOpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)



In [52]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

'New York City was named in honor of the Duke of York, who later became King James II of England. The city was originally called New Amsterdam when it was founded by Dutch colonists in 1624, but it was renamed New York in 1664 when the English took control of the area from the Dutch.'

In [53]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was named in honor of the Duke of York, who later became King James II of England. The city was originally called New Amsterdam when it was founded by Dutch colonists in 1624, but it was renamed New York in 1664 when the English took control of the area from the Dutch.',
 'source_documents': [Document(id='a890966d-c07d-46bc-9790-6b30bf775860', metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/nyc_text.txt'}, page_content="== Etymology ==\n\nIn 1664, New York was named in honor of the Duke of York, who would become King James II of England. James's elder brother, King Charles II, appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.\n\n\n== History =="),
  Document(id='f05455c6-c758-4738-ac49-243742330035', metadata={'source': '/Users/michailkoskinas/Des

To evaluate the QA system, we need a set of relevant questions with reference answers. We've generated a few questions for you, but feel free to add your own.

In [54]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [55]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` (imported below as `EvaluatorChain` from `ragas.integrations.langchain`) wraps the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with langchain and langsmith.

The evaluator chain has the following APIs:

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: called by langsmith evaluators to evaluate langsmith datasets.

Let's see each of them in action to learn more.

In [56]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

'Manhattan (New York County) is the most densely populated borough in New York City.'

In [57]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [58]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) is the most densely populated borough in New York City.',
 'contexts': [Document(id='91dd0aab-31e8-42f9-b29b-7b0fed34034d', metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/nyc_text.txt'}, page_content="Manhattan (New York County) is the geographically smallest and most densely populated borough. It is home to Central Park and most of the city's skyscrapers, and is sometimes locally known as The City. Manhattan's population density of 72,033 people per square mile (27,812/km2) in 2015 makes it the highest of any county in the United States and higher than the density of any individual American city.Manhattan is the cultural, administrative, and financial center of New York City and contains the headquarters of many major multinational corporations, the United Nations headquarters, Wall Street, and a number of imp
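
The key renaming above can be wrapped in a small helper that also turns the retrieved documents into plain strings, which is the shape the `evaluate()` cell later in this notebook builds by hand. A sketch under stated assumptions: `FakeDocument` is a hypothetical stand-in for a LangChain `Document` (only `page_content` is modelled), and `to_ragas_row` is an illustrative name, not a ragas function.

```python
from dataclasses import dataclass

@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str

KEY_MAPPING = {"query": "question", "result": "answer", "source_documents": "contexts"}

def to_ragas_row(qa_result: dict) -> dict:
    """Rename QA-chain keys and replace document objects with their text."""
    row = {new: qa_result[old] for old, new in KEY_MAPPING.items() if old in qa_result}
    if "contexts" in row:
        row["contexts"] = [doc.page_content for doc in row["contexts"]]
    return row

sample = {
    "query": "Which borough has the highest population?",
    "result": "Manhattan is the most densely populated borough.",
    "source_documents": [FakeDocument("Manhattan is the most densely populated borough.")],
}
row = to_ragas_row(sample)
print(row)
```

Whether a given ragas version accepts `Document` objects or requires strings for `contexts` is worth checking against the version you have installed.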

In [17]:
# !pip install --no-cache-dir recordclass

In [158]:
# !pip install ragas==0.1.9

In [61]:
from ragas.integrations.langchain import EvaluatorChain 
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall
)  # Removed context_relevancy as it's not available in this version

# create evaluation chains
faithfulness_chain = EvaluatorChain(metric=faithfulness)
answer_rel_chain = EvaluatorChain(metric=answer_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

1. `__call__()`

Run the evaluation chain directly on the result from the QA chain. Note that metrics like faithfulness require the `source_documents` to be present.

In [62]:
# Recheck the result that we are going to validate.
result

{'query': 'Which borough of New York City has the highest population?',
 'result': 'Manhattan (New York County) is the most densely populated borough in New York City.',
 'source_documents': [Document(id='91dd0aab-31e8-42f9-b29b-7b0fed34034d', metadata={'source': '/Users/michailkoskinas/Desktop/Github/Ironhack/week19/day3/lab-langchain-evaluation/data/nyc_text.txt'}, page_content="Manhattan (New York County) is the geographically smallest and most densely populated borough. It is home to Central Park and most of the city's skyscrapers, and is sometimes locally known as The City. Manhattan's population density of 72,033 people per square mile (27,812/km2) in 2015 makes it the highest of any county in the United States and higher than the density of any individual American city.Manhattan is the cultural, administrative, and financial center of New York City and contains the headquarters of many major multinational corporations, the United Nations headquarters, Wall Street, and a number o

**Faithfulness**

In [67]:
eval_result = faithfulness_chain(result_updated)
score = eval_result['faithfulness']  # Changed from 'faithfulness_score' to 'faithfulness'
print("Faithfulness Score:", score)


print("\nFull evaluation result:")
print(f"Question: {eval_result['question']}")
print(f"Answer: {eval_result['answer']}")
print(f"Faithfulness Score: {eval_result['faithfulness']}")

Faithfulness Score: 1.0

Full evaluation result:
Question: Which borough of New York City has the highest population?
Answer: Manhattan (New York County) is the most densely populated borough in New York City.
Faithfulness Score: 1.0


A high faithfulness score means that the answer is consistent with the source documents.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.
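
To build intuition for the 1.0 and 0.0 scores in this section, here is a toy version of the idea. The ragas documentation describes faithfulness as the fraction of claims in the answer that can be inferred from the contexts; ragas uses an LLM to extract and judge those claims, whereas this sketch swaps in naive substring matching purely for illustration. `toy_faithfulness` is a hypothetical name, not a ragas API.

```python
def toy_faithfulness(claims, contexts):
    """Fraction of claims that appear verbatim (case-insensitive) in the contexts."""
    if not claims:
        return 0.0
    joined = " ".join(contexts).lower()
    supported = sum(1 for claim in claims if claim.lower() in joined)
    return supported / len(claims)

contexts = ["Manhattan is the geographically smallest and most densely populated borough."]
grounded = toy_faithfulness(["most densely populated borough"], contexts)
ungrounded = toy_faithfulness(["we are the champions"], contexts)
print(grounded, ungrounded)
```

Note that faithfulness says nothing about correctness: the Manhattan answer is fully supported by the retrieved context yet does not match the ground truth "Brooklyn".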

In [73]:
fake_result = {
    "question": result_updated["question"],  # Keep the original question
    "answer": "we are the champions",        # Our fake answer
    "contexts": result_updated["contexts"]    # Keep the original contexts
}

# Now evaluate
eval_result = faithfulness_chain(fake_result)
score = eval_result['faithfulness']
print("Faithfulness Score:", score)

print("\nFull evaluation result:")
print(f"Question: {eval_result['question']}")
print(f"Answer: {eval_result['answer']}")
print(f"Faithfulness Score: {eval_result['faithfulness']}")


Faithfulness Score: 0.0

Full evaluation result:
Question: Which borough of New York City has the highest population?
Answer: we are the champions
Faithfulness Score: 0.0


**Context Recall**

In [79]:
# Find the matching index for our question
question = result["query"]
matching_index = eval_questions.index(question)
recall_input = {
    "question": question,
    "contexts": result["source_documents"],
    "ground_truth": eval_answers[matching_index]  # Now it matches the question
}

# Now evaluate
eval_result = context_recall_chain(recall_input)
print("Context Recall Score:", eval_result.get('score') or eval_result.get('context_recall'))

# Full example with your existing eval_questions and eval_answers.
# Note: every iteration reuses the contexts retrieved for the borough question above.
for i, question in enumerate(eval_questions):
    recall_input = {
        "question": question,
        "contexts": result["source_documents"],
        "ground_truth": eval_answers[i]
    }
    eval_result = context_recall_chain(recall_input)
    score = eval_result.get('score') or eval_result.get('context_recall')
    print(f"\nQuestion {i+1}:")
    print(f"Question: {question}")
    print(f"Ground Truth: {eval_answers[i]}")
    print(f"Context Recall Score: {score}")

Context Recall Score: 1.0

Question 1:
Question: What is the population of New York City as of 2020?
Ground Truth: 8,804,190
Context Recall Score: 1.0

Question 2:
Question: Which borough of New York City has the highest population?
Ground Truth: Brooklyn
Context Recall Score: 1.0

Question 3:
Question: What is the economic significance of New York City?
Ground Truth: New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital e

In [80]:
# First standalone evaluation
first_input = {
    "question": result["query"],  # Original query about borough population
    "contexts": result["source_documents"],  # Context about Manhattan
    "ground_truth": eval_answers[3]  # Answer about NYC's name! (index 3)
}

print("First Evaluation:")
print("Question:", first_input["question"])
print("Ground Truth:", first_input["ground_truth"])
print("\n")

# Loop evaluation (first iteration)
loop_input = {
    "question": eval_questions[0],  # First question from eval_questions
    "contexts": result["source_documents"],
    "ground_truth": eval_answers[0]  # Matching answer for first question
}

print("Loop Evaluation:")
print("Question:", loop_input["question"])
print("Ground Truth:", loop_input["ground_truth"])

First Evaluation:
Question: Which borough of New York City has the highest population?
Ground Truth: New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.


Loop Evaluation:
Question: What is the population of New York City as of 2020?
Ground Truth: 8,804,190


A high context recall score means that the ground truth is supported by the source documents.

You can check lower context recall scores by changing the source_documents to something else.
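
As a rough mental model, context recall can be thought of as the share of ground-truth sentences that the retrieved contexts support. ragas makes that judgement with an LLM; the sketch below substitutes verbatim sentence matching, so it is only an illustration of the ratio, and `toy_context_recall` is a hypothetical name.

```python
import re

def toy_context_recall(ground_truth, contexts):
    """Fraction of ground-truth sentences found verbatim (case-insensitive) in the contexts."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", ground_truth) if s.strip()]
    if not sentences:
        return 0.0
    joined = " ".join(contexts).lower()
    found = sum(1 for s in sentences if s.lower().rstrip(".!?") in joined)
    return found / len(sentences)

ground_truth = "The city was renamed in 1664. It is named after the Duke of York."
partial = toy_context_recall(ground_truth, ["The city was renamed in 1664 when England seized it."])
unrelated = toy_context_recall(ground_truth, ["I love christmas"])
print(partial, unrelated)
```

An LLM judge is far more lenient than exact matching, so real scores will not line up with this toy.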

In [82]:
from langchain.schema import Document
fake_recall_input = {
    "question": result["query"],  # Keep original question
    "contexts": [Document(page_content="I love christmas")],  # Our fake context
    "ground_truth": eval_answers[1]  # Ground truth for the borough population question
}

# Now evaluate
eval_result = context_recall_chain(fake_recall_input)
score = eval_result.get('score') or eval_result.get('context_recall')
print("Context Recall Score:", score)

Context Recall Score: 0.5


In [83]:
# Debug our inputs
fake_recall_input = {
    "question": result["query"],  # What was the original query?
    "contexts": [Document(page_content="I love christmas")],
    "ground_truth": eval_answers[1]  # What's in eval_answers[1]?
}

print("Question:", fake_recall_input["question"])
print("Context:", fake_recall_input["contexts"][0].page_content)
print("Ground Truth:", fake_recall_input["ground_truth"])

# Now evaluate
eval_result = context_recall_chain(fake_recall_input)
score = eval_result.get('score') or eval_result.get('context_recall')
print("\nContext Recall Score:", score)

Question: Which borough of New York City has the highest population?
Context: I love christmas
Ground Truth: Brooklyn

Context Recall Score: 0.5
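
One caveat about the `eval_result.get('score') or eval_result.get('context_recall')` pattern used in these cells: a legitimate score of `0.0` is falsy, so `or` skips it and falls through to the second key. A small sketch with a made-up dictionary (the keys and values are illustrative, and `first_present` is a hypothetical helper):

```python
def first_present(mapping, *keys):
    """Return the value of the first key that exists, even if that value is 0.0."""
    for key in keys:
        if key in mapping:
            return mapping[key]
    return None

sample = {"score": 0.0, "context_recall": 0.7}  # illustrative values only
via_or = sample.get("score") or sample.get("context_recall")
via_helper = first_present(sample, "score", "context_recall")
print(via_or, via_helper)
```

Checking key membership instead of truthiness keeps a zero score from being silently replaced.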


2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [87]:
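# Evaluate each prediction individually.
# Assumes `predictions` was regenerated for the NYC examples with `qa_chain`
# (for instance via qa_chain.batch(examples)), so each one carries "source_documents".
print("evaluating...")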
results = []
for example, prediction in zip(examples, predictions):
    # Format input for faithfulness chain
    eval_input = {
        "question": prediction["query"],
        "answer": prediction["result"],
        "contexts": prediction["source_documents"]
    }
    
    # Run evaluation
    result = faithfulness_chain(eval_input)
    results.append(result)
    
    # Print results
    print(f"\nQuestion: {eval_input['question']}")
    print(f"Answer: {eval_input['answer']}")
    print(f"Faithfulness Score: {result['faithfulness']}")

# Aggregate results if needed
average_score = sum(r['faithfulness'] for r in results) / len(results)
print(f"\nAverage Faithfulness Score: {average_score}")

evaluating...

Question: What is the population of New York City as of 2020?
Answer: The population of New York City as of 2020 was 8,804,190 residents.
Faithfulness Score: 1.0

Question: Which borough of New York City has the highest population?
Answer: Manhattan (New York County) is the most densely populated borough of New York City.
Faithfulness Score: 1.0

Question: What is the economic significance of New York City?
Answer: New York City is a global hub of business and commerce, with a diverse economy that includes key sectors such as finance, technology, healthcare, retail, tourism, real estate, media, fashion, and the arts. The city is home to Wall Street, the center of the U.S. financial industry, and hosts many Fortune 500 and multinational corporations. Additionally, New York City is a major player in global trade, attracting capital, business, and tourists from around the world. The city's fashion industry, advertising industry, and non-profit institutions also contribute s
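
The plain average in the cell above assumes every score is a usable number. As a defensive sketch, assuming for illustration that a metric could occasionally yield `None` or NaN (for example when a judge response cannot be parsed), a mean that skips such entries might look like this; `mean_ignoring_nan` is a hypothetical helper:

```python
import math

def mean_ignoring_nan(scores):
    """Average the usable scores; return NaN when none are usable."""
    usable = [s for s in scores if s is not None and not math.isnan(s)]
    return sum(usable) / len(usable) if usable else float("nan")

sample_scores = [1.0, 1.0, float("nan"), 0.5]  # illustrative values only
average = mean_ignoring_nan(sample_scores)
print(round(average, 4))
```

Reporting how many scores were skipped alongside the mean would make the aggregate easier to trust.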

In [91]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall
)
from datasets import Dataset  # Using huggingface datasets instead

# Format the data for evaluation
evaluation_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

# Fill the data
for example, prediction in zip(examples, predictions):
    evaluation_data["question"].append(example["query"])
    evaluation_data["answer"].append(prediction["result"])
    evaluation_data["contexts"].append([doc.page_content for doc in prediction["source_documents"]])
    evaluation_data["ground_truth"].append(
        example["ground_truths"][0] if isinstance(example.get("ground_truths"), list) else example.get("answer")
    )

# Create dataset
dataset = Dataset.from_dict(evaluation_data)

# Run evaluation
print("evaluating...")
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_recall]
)

print("\nEvaluation Results:")
print(results)

evaluating...



Evaluation Results:
{'faithfulness': 0.9000, 'answer_relevancy': 0.9339, 'context_recall': 1.0000}
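
Before handing a column dictionary like `evaluation_data` to `Dataset.from_dict`, a quick shape check can catch mismatches early. A minimal sketch; `check_columns` and the sample data are hypothetical, and the rule that each `contexts` entry is a list of strings reflects how the cell above builds that column:

```python
def check_columns(columns: dict) -> int:
    """Verify all columns have equal length and each contexts entry is a list of strings."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for ctx in columns.get("contexts", []):
        if not (isinstance(ctx, list) and all(isinstance(c, str) for c in ctx)):
            raise TypeError("Each contexts entry must be a list of strings")
    return next(iter(lengths.values()), 0)

sample_columns = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1"], ["c2", "c3"]],
    "ground_truth": ["g1", "g2"],
}
n_rows = check_columns(sample_columns)
print(n_rows)
```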
