# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0
  4. Task 4: Synthetic Dataset Generation for Evaluation using Ragas (Optional)

- 🤝 Breakout Room #2
  1. Task 1: Evaluating our Pipeline with Ragas
  2. Task 2: Testing OpenAI's Claim
  3. Task 3: Selecting an Advanced Retriever and Evaluating

> NOTE: Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage.

> NOTE: This notebook *does* contain a bonus challenge, outlined at the bottom of the notebook, which you can complete instead of the notebook for full marks on the assignment.

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [5]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0

Building on what we've been learning, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
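
The three steps above can be sketched end to end in plain Python. This is a toy illustration, not the LangChain pipeline built in this notebook: the sample chunks are made up, word overlap stands in for embedding similarity, and the last step only builds the prompt that a real pipeline would send to an LLM. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers.

```python
import re

def tokenize(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

# 1. Create an index: store each chunk alongside its representation.
chunks = [
    "Pick an industry where the forces of change can succeed.",
    "Get right to the center of the industry you picked.",
    "Structured procrastination turns avoidance into progress.",
]
index = [(chunk, tokenize(chunk)) for chunk in chunks]

# 2. Retrieve: rank chunks by how many words they share with the query.
def retrieve(query: str, k: int = 2) -> list[str]:
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: a real pipeline would send this prompt to an LLM.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Question:\n{query}\n\nContext:\n{context}"

print(build_prompt("How do I pick an industry?"))
```

Swapping `tokenize` for a real embedding model and the overlap count for a vector similarity gives the shape of the pipeline built in the rest of this task.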

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

- [`PyMuPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html)

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"


loader = PyMuPDFLoader(PDF_LINK)
documents = loader.load()


In [7]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

- [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    is_separator_regex=False
)
documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [9]:
len(documents)

1864
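
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); it only shows fixed-size character windows in which each chunk repeats the tail of the previous one. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size windows; each starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of storing some text twice.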

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

- [`OpenAIEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain-openai-embeddings-base-openaiembeddings)

> NOTE: We are purposefully using an older embedding model to try and answer the guiding question: Is TE3 better than Ada-002?

In [10]:
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
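
Comparing a document vector to a query vector usually means computing a similarity score between them. As a minimal sketch, assuming cosine similarity (the distance configured for the Qdrant collection in this notebook) and made-up three-dimensional vectors rather than real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, different length
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because cosine similarity ignores vector length, `doc_a` scores highest even though its components are twice as large as the query's.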



#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

- [`QdrantVectorStore`](https://api.python.langchain.com/en/latest/qdrant/langchain_qdrant.qdrant.QdrantVectorStore.html#langchain_qdrant.qdrant.QdrantVectorStore)

> NOTE: You'll need to provide the embedding dimension for Ada-002!
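
The collection's vector size has to match the length of the vectors the embedding model returns. A small lookup like the following can make that explicit; it is a sketch, the helper names are hypothetical, and the sizes are the default output dimensions of these OpenAI models (no `dimensions` override).

```python
# Default output dimensions of OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def vector_size_for(model_name: str) -> int:
    """Look up the vector size a collection needs for a given embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None

def dimensions_match(vector: list[float], model_name: str) -> bool:
    """Check that a vector has the length expected for the model."""
    return len(vector) == vector_size_for(model_name)

print(vector_size_for("text-embedding-ada-002"))
```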

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

LOCATION = ":memory:"
COLLECTION_NAME = "PMarca Blogs"
VECTOR_SIZE = 1536

In [12]:
qdrant_client = QdrantClient(LOCATION)
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
)
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
)

qdrant_vector_store.add_documents(documents)

['6e221be3932a461db8ea45d213dbaadb',
 'a7880414521a40f6b0fb0cde5304bdd8',
 '044fcbfdc4aa4edfb6ca5b5a3daeffd2',
 '02c0fba49d2448478a828b5257f7c9ea',
 'bc0cf8714b004dc183ca64b4659a9a48',
 'ff30903d849a48188dfea09180119538',
 '9500e93702b743acacd20341a2c0e77b',
 'f6942bd2e8bd4442a3033de5cee76410',
 'bb14baa451b74e16bfee87894ccd4d5b',
 'efd0ad2f5eaa430897c9b95e06785f5a',
 'effaccc90349434aa4e82a060660f14e',
 '3e47b06de2dc4a48b736ca0b5a8ce110',
 '56f7e98b6bb54a519a04de5c07c1bfaa',
 '5bc10a802d8d4b918e228d9a71f5ba27',
 '516d5f95982941dcbfb5e85381416a35',
 '86be984d61484c0fa5d08147b6a52128',
 '2170aab814e440f18ed79698726ac7f4',
 '9f2806ee264c4329951088e6264da6d4',
 '952ca4db2727456eb94348d93efaaaee',
 '472564cbbcca46eb9e68b9f5719f3167',
 'd6a5cecb9a2b4f7598a58e6063616db9',
 '333a1a9f9b4f4272a11e021041e68105',
 'e2dc154ceb6b4359ab5a50b2acd69692',
 '0adcdf523a1243de926cea8ada3ff3e1',
 '68c2a5c31dfd4da9b96a76cfcbb2faa8',
 '7b439464d780438183a7a3507fac8242',
 '371d55b061d042208eff99b40ee77f5e',
 

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!


- HNSW (Hierarchical Navigable Small World): This is a graph-based indexing technique that allows for fast Approximate Nearest Neighbors (ANN) searches. It significantly speeds up the search process in high-dimensional spaces by organizing the data in a way that minimizes the time it takes to find the nearest vectors.

- Cosine Similarity, Dot Product, and Euclidean Distance: Qdrant supports these common distance metrics, which are crucial for determining how "close" vectors are in the vector space. These metrics are highly optimized for rapid comparison during vector searches, enabling the engine to perform quick similarity and relevance checks.

- On-Disk Vector Storage: To handle large datasets, Qdrant can store vector data directly on disk rather than in RAM. This allows it to efficiently manage memory usage without compromising on performance, especially when working with vast amounts of data.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [13]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [14]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [15]:
for doc in retrieved_documents:
  print(doc)

page_content='the existing order — and make sure that those forces of change
have a reasonable chance at succeeding.
Second rule of thumb:
Once you have picked an industry, get right to the center of it' metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '2846f25be6b640d8bcef9826efb349d0', '_collection_name': 'PMarca Blogs'}
page_content='Third rule:
In a rapidly changing Held like technology, the best place to
get experience when you’re starting out is in younger, high-
growth companies.' metadata={'source': 'https://d1lamhf6l6yk6d.cloud

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [16]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [17]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [18]:
from langchain.prompts import ChatPromptTemplate

template = """
Find the answer to the user prompt within the provided context only. If you don't know then just say 'I don't know'

Question:
{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through - which is critical for Ragas.

In [19]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes every key from the previous step through unchanged
    # and (re)assigns "context" to the value it already holds, so both keys reach the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
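
To make the data flow through the chain easier to follow, here is a plain-Python analogue of the three steps. It does not use LCEL; `fake_retriever` and `fake_llm` are hypothetical stand-ins for the retriever and the chat model, and `TEMPLATE` is a shortened stand-in for the prompt.

```python
TEMPLATE = "Question:\n{question}\n\nContext:\n{context}\n"

def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever: always returns the same two chunks."""
    return ["first retrieved chunk", "second retrieved chunk"]

def fake_llm(prompt_text: str) -> str:
    """Hypothetical stand-in for the chat model: reports what it was given."""
    return f"(answer generated from a {len(prompt_text)}-character prompt)"

def run_chain(inputs: dict) -> dict:
    # Step 1: build both keys from the incoming question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: like RunnablePassthrough.assign, keep every existing key
    # and (re)assign "context" to the value it already holds.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and pass the context through
    # so it is still available for evaluation.
    prompt_text = TEMPLATE.format(
        question=state["question"], context="\n".join(state["context"])
    )
    return {"response": fake_llm(prompt_text), "context": state["context"]}

result = run_chain({"question": "What is a rule of thumb?"})
print(result["response"])
print(result["context"])
```

The point to notice is the return value: the output carries both the generated response and the retrieved context, which is what lets us score retrieval and generation separately later on.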

Let's test it out!

In [20]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

A rule of thumb for selecting an industry to invest in is to ensure that the forces of change within that industry have a reasonable chance at succeeding. Additionally, once you have picked an industry, it is advised to get right to the center of it, where the great opportunities can be found.


In [21]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '3e3c467776024aa0906fda380cee428c', '_collection_name': 'PMarca Blogs'}, page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.'), Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [22]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

We use different values for parameters like chunk size and chunk overlap so that the generated synthetic data reflects real-world conditions, where data isn't always neatly structured. Using different chunking for evaluation also keeps the test set from mirroring the evaluated RAG pipeline so closely that we overfit the system to specific chunking conditions.

In [23]:
len(eval_documents)

624

> NOTE: 🛑 Running this cell as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step. **YOU CAN SKIP THIS STEP BY LOADING THE `.csv` DIRECTLY FROM OUR REPOSITORY.** 🛑

#### Optional: SDG for Evaluation

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 5 # Reduce this number if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
testset.to_pandas()

Let's look at the output and see what we can learn about it!

In [None]:
testset.test_data[0]

In [34]:
testset_df = testset.to_pandas()
testset_df.to_csv("testset.csv")

#### PREFERRED: Download `.csv` from DataRepository

In [24]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 90, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 90 (delta 24), reused 29 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (90/90), 70.26 MiB | 40.90 MiB/s, done.
Resolving deltas: 100% (24/24), done.


In [25]:
!mv DataRepository/testset.csv .

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

First, we'll load the test set `.csv` into a pandas DataFrame.

In [26]:
import pandas as pd

test_df = pd.read_csv("testset.csv")

In [27]:
test_df

Unnamed: 0.1,Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,0,How does the tendency to avoid inconsistency c...,['Five: Inconsistency-Avoidance Tendency\n[Peo...,The tendency to avoid inconsistency contribute...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,1,What are some of the challenges faced by start...,['structure that any established company has.\...,"In a startup, it is easy for the code not to g...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,2,What factors should be considered when decidin...,['Part 2: Skills and education\n[Please read m...,The answer to given question is not present in...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,3,What should be valued when evaluating candidat...,"[""How to hire the best people you've\never wor...",The answer to given question is not present in...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,4,What are the consequences of not raising enoug...,['Here’s why you shouldn’t do that:\nWhat are ...,Not raising enough money risks the survival of...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,5,How does Structured Procrastination suggest us...,['like?\nStructured procrastination\nThis is a...,Structured Procrastination suggests that inste...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,6,What analogy is used to describe the layers of...,['as if it’s an onion. Just like you peel an o...,The analogy used to describe the layers of ris...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,7,How can Structured Procrastination be used to ...,['like?\nStructured procrastination\nThis is a...,Structured Procrastination suggests that inste...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,8,How is the quality of a startup's product defi...,['Let’s start by deXning terms.\nThe caliber o...,The quality of a startup's product in the tech...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,9,What role can a campus computer lab play in he...,"['undergrads to do some of the work, and being...",A campus computer lab can play a role in helpi...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


In [28]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll run the test questions through our RAG pipeline to generate responses - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [29]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [30]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
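
Before evaluating, it can help to sanity-check the shape of the data going into the dataset. The sketch below is a hypothetical helper, not part of Ragas or `datasets`: it assumes the four column names used in this notebook and checks that the columns have equal length and that each `contexts` entry is a list of strings.

```python
REQUIRED_COLUMNS = ["question", "answer", "contexts", "ground_truth"]

def find_problems(columns: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")
    for row, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {row}: contexts must be a list of strings")
    return problems

good = {
    "question": ["q1"],
    "answer": ["a1"],
    "contexts": [["chunk one", "chunk two"]],
    "ground_truth": ["gt1"],
}
bad = {
    "question": ["q1"],
    "answer": ["a1"],
    "contexts": ["chunk one"],  # a bare string instead of a list of strings
    "ground_truth": [],         # wrong length
}
print(find_problems(good))
print(find_problems(bad))
```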

Let's take a peek and see what that looks like!

In [31]:
response_dataset[0]

{'question': 'How does the tendency to avoid inconsistency contribute to people being reluctant to change?',
 'answer': 'The tendency to avoid inconsistency contributes to people being reluctant to change because it manifests as a form of inconsistency avoidance. This tendency leads individuals to cling to their previous conclusions, loyalties, and commitments, making them resistant to new ideas or changes in their identity. As a result, they may not be ready to embrace new concepts, and attempts to force change are often ineffective until they are personally ready or a new generation emerges.',
 'contexts': ['Five: Inconsistency-Avoidance Tendency\n[People are] reluctant to change, which is a form of inconsistency\navoidance. We see this in all human habits, constructive and',
  'less brain-blocked by its previous conclusions…\nOne corollary of Inconsistency-Avoidance Tendency is that a per-\nson making big sacriXces in the course of assuming a new identity',
  '[T]ending to be mainta

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
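
As a rough sketch of the arithmetic behind three of these metrics, as we understand them from the Ragas documentation: in Ragas an LLM produces the per-claim and per-chunk verdicts, whereas here they are hand-written booleans, and the function names are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_recall_score(sentence_attributed: list[bool]) -> float:
    """Fraction of ground-truth sentences that can be attributed to the retrieved context."""
    return sum(sentence_attributed) / len(sentence_attributed) if sentence_attributed else 0.0

def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# Hypothetical verdicts for a single question.
print(faithfulness_score([True, True, False]))
print(context_recall_score([True, False]))
print(context_precision_score([True, False, True, True]))
```

Context precision rewards putting relevant chunks first: four retrieved chunks with verdicts relevant, not relevant, relevant, relevant score (1/1 + 2/3 + 3/4) / 3, or roughly 0.806, while the same three relevant chunks ranked ahead of the irrelevant one would score 1.0.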

In [32]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [33]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

In [34]:
results

{'faithfulness': 0.6509, 'answer_relevancy': 0.7726, 'context_recall': 0.6360, 'context_precision': 0.6550, 'answer_correctness': 0.5541}

In [35]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,contexts,answer,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,How does the tendency to avoid inconsistency c...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The tendency to avoid inconsistency contribute...,The tendency to avoid inconsistency contribute...,1.0,0.950452,0.5,0.805556,0.387226
1,What are some of the challenges faced by start...,[ied and determined. Sales calls get made. The...,Some challenges faced by startups in establish...,"In a startup, it is easy for the code not to g...",1.0,0.991209,1.0,1.0,0.460525
2,What factors should be considered when decidin...,[including your formal education. So I will st...,I don't know.,The answer to given question is not present in...,0.0,0.0,1.0,0.0,0.195204
3,What should be valued when evaluating candidat...,[priate for your particular startup.\nWith a w...,"When evaluating candidates for a startup, it i...",The answer to given question is not present in...,0.5,0.999999,1.0,0.0,0.179588
4,What are the consequences of not raising enoug...,[Here’s why you shouldn’t do that:\nWhat are t...,Not raising enough money risks the survival of...,Not raising enough money risks the survival of...,0.666667,0.966323,0.333333,0.833333,0.502049
5,How does Structured Procrastination suggest us...,[standing.)\nThe gist of Structured Procrastin...,Structured Procrastination suggests that inste...,Structured Procrastination suggests that inste...,1.0,0.957872,1.0,0.916667,0.99443
6,What analogy is used to describe the layers of...,[as if it’s an onion. Just like you peel an on...,The analogy used to describe the layers of ris...,The analogy used to describe the layers of ris...,1.0,1.0,1.0,0.75,0.891681
7,How can Structured Procrastination be used to ...,[standing.)\nThe gist of Structured Procrastin...,Structured Procrastination can be used to one'...,Structured Procrastination suggests that inste...,1.0,0.987979,0.5,0.805556,0.639978
8,How is the quality of a startup's product defi...,[The quality of a startup’s pr\nproduct\noduct...,The quality of a startup's product in the tech...,The quality of a startup's product in the tech...,1.0,0.993645,0.0,1.0,0.514392
9,What role can a campus computer lab play in he...,[What should I do while I’m in school?\nI’m a ...,A campus computer lab can provide undergraduat...,A campus computer lab can play a role in helpi...,0.8,0.97273,0.333333,0.75,0.996633


## Task 2: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #1:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

## ANSWER
In this task we will test OpenAI's claim that their new embedding model (`text-embedding-3-small`) performs better than the older `text-embedding-ada-002` model we have been using so far. We will test this claim by swapping `text-embedding-ada-002` for the newer `text-embedding-3-small` in our RAG pipeline and comparing performance using the same set of metrics as before.

In [36]:

# Loading the text-embedding-3-small model (to compare against ADA later on).
te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [37]:
# We create a new collection and VectorStore (our index) using the TE3 embedding model.

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME+"TE3",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME+"TE3",
    embedding=te3_embeddings,
)

qdrant_vector_store.add_documents(documents)

['bc321ada119645f78a7e77ac28453e7a',
 'a623010a351e4dd19117177db72c8f78',
 'ef3e8ff71d9a4d80ab1f6a7fc6172460',
 'cfc96547e44745c6a8adf08a311332d9',
 '3d63e37189224db88e21aa61a3d08b6f',
 '9c9953d896954c918dfd39365f90a184',
 '03e9a92605ec45b6b01b629c0609d09e',
 '5fd3b8b570f04ef2931acbd918013e67',
 '21f66d49134444d484072fee31891f0c',
 '745f3e417d954ef78a7ea34388d3629d',
 'ca9b5bf6aa074d4d9b78bda7a78e2c3c',
 'aa84f78d7d5f4e4abf8ccbf631ca97b9',
 'd3bf07c188dd4c5eba2d5ff4e0405eed',
 '52ea6c4082534e02b00cc4fe1156cc9a',
 'b81a8d90eff3423882cf68c96d9d1c22',
 '0bded3ee48264233ae68938add9ca514',
 '89a2b04e573f4597a361504cc43d02fd',
 '14359e986c43496bbcbe1183926a6f1c',
 'c4c2fc091f2f43d588b82c358e4c511d',
 '8be22cdd5bfb4698b871de268d8db102',
 '44f221445c9343b1b67cfdb8dcf743a8',
 '022a43733c7a4f46bfd001671ada608e',
 '91a0b8b9cfaa41609be5096392e0a029',
 '88a87741759b4796a16d2f0f39a7efb4',
 'bd1a6d8621434e0cb21c4003d9a20cba',
 '0b0a7382125b45af946c1c65da6af094',
 '0e5f102301174a7880c87317f93b7a5e',
 

In [38]:
# We set up the VectorStore as a Retriever (again, this one is using the TE3 model instead of ADA).

te3_retriever = qdrant_vector_store.as_retriever()

In [39]:
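# We set up a chain that "stuffs" the documents returned by the retriever into the prompt and sends it to the specified language model to generate an answer.
# Note that this uses the hub prompt (`retrieval_qa_prompt`) rather than the custom prompt from our baseline chain, so the two pipelines differ in more than just the embedding model.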

from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

In [40]:
# We recreate the retrieval chain using the new VectorStore powered by TE3 so that when we query the RAG pipeline, it will retrieve context from this new index.

from langchain.chains import create_retrieval_chain

te3_retrieval_chain = create_retrieval_chain(te3_retriever, document_chain)

In [41]:
# We generate responses for the same set of questions used in the previous evaluation with the ADA model (but now using TE3).
# Also, we store the answers in the answers list and the associated contexts (extracted from the documents) in the contexts list.

answers = []
contexts = []

for question in test_questions:
  response = te3_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [42]:
# We create a dataset using Dataset.from_dict() from the Hugging Face `datasets` library. It holds everything needed for the evaluation (questions, answers, retrieved chunks, and expected answers).

te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [43]:
# We evaluate them using the same set of Ragas metrics.

te3_advanced_retrieval_results = evaluate(te3_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

Exception raised in Job[18]: TimeoutError()
Exception raised in Job[83]: TimeoutError()
Exception raised in Job[48]: TimeoutError()
Exception raised in Job[52]: TimeoutError()
Exception raised in Job[50]: TimeoutError()
Exception raised in Job[13]: TimeoutError()
Exception raised in Job[49]: TimeoutError()
Exception raised in Job[80]: TimeoutError()
Exception raised in Job[84]: TimeoutError()
Exception raised in Job[87]: TimeoutError()
Exception raised in Job[85]: TimeoutError()
Exception raised in Job[15]: TimeoutError()
Exception raised in Job[14]: TimeoutError()
Exception raised in Job[19]: TimeoutError()


In [44]:
# We print the results.

te3_advanced_retrieval_results

{'faithfulness': 0.8861, 'answer_relevancy': 0.9717, 'context_recall': 0.5980, 'context_precision': 0.6574, 'answer_correctness': 0.6680}

In [45]:
# We arrange all results so that the TE3 evaluation data is easier to compare against the ADA data.

df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']

df_merged

| | Metric | ADA | TE3 | Baseline -> TE3 |
|---|---|---|---|---|
| 0 | faithfulness | 0.650877 | 0.886133 | 0.235256 |
| 1 | answer_relevancy | 0.772631 | 0.971696 | 0.199065 |
| 2 | context_recall | 0.635965 | 0.598039 | -0.037926 |
| 3 | context_precision | 0.654971 | 0.657407 | 0.002437 |
| 4 | answer_correctness | 0.554098 | 0.668013 | 0.113915 |


#### ❓ Question #3:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

Yes, based on the evaluation results, `text-embedding-3-small` looks better than `ada` on most metrics. It outperformed `ada` on faithfulness (+0.235), answer relevancy (+0.199), and answer correctness (+0.114), meaning the pipeline produced answers that were better grounded, more relevant, and more accurate. The retrieval metrics barely moved: context recall dropped slightly (-0.038) and context precision was essentially unchanged (+0.002). Because 14 of the 95 evaluation jobs timed out in the TE3 run, these scores may rest on incomplete data, so they support "better on this dataset" more firmly than a general claim of significance. Still, for this pipeline TE3 is the better choice.
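
Whether a gap like this is "significant" depends on how much the per-question scores vary, which the averaged table does not show. The following is a minimal sketch of a paired percentile bootstrap over per-question differences. The score lists are hypothetical placeholders, not values from this notebook's runs, and `paired_bootstrap_ci` is an illustrative helper rather than part of Ragas. Real per-question scores would need to be pulled from the evaluation results first (for example via the result object's `to_pandas()` method, assuming the installed Ragas version provides it).

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (candidate - baseline)."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be paired, one entry per question")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high


# HYPOTHETICAL per-question faithfulness scores, for illustration only.
ada_scores = [0.5, 0.7, 0.6, 0.8, 0.4, 0.9, 0.6, 0.7]
te3_scores = [0.8, 0.9, 0.7, 0.9, 0.8, 1.0, 0.9, 0.8]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(ada_scores, te3_scores)
print(f"mean difference: {mean_diff:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the improvement is unlikely to be an artifact of which questions happened to be in the test set. With only a handful of questions the interval will be wide, so it is best read as a rough check.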

## Task 5: Selecting an Advanced Retriever and Evaluating

#### 🏗️ Activity #2

While the changes that occurred due to modifying the embedding model were desirable, you're now tasked with improving `context_recall` or `context_precision` (or both!).

You'll follow these steps:

1. Reason about this list of Advanced Retrieval methods:
  - [Contextual Compression (Reranker)](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/)
  - [MultiQueryRetriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/)
  - [Parent Document Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/)
2. Select the method you think will be the most performant.
3. Implement that method.
4. Create an LCEL chain that utilizes the new Retriever method.
5. Evaluate this LCEL chain and compare it to the TE3 results.

> NOTE: We will spend more time in Session 14 diving into advanced retrieval methods; this activity only serves as a basic introduction to the idea of component-wise improvements and how they might impact metrics.

### Answer
- Contextual Compression (Reranker): This method compresses context by removing irrelevant information and focusing on the most important parts. It reranks the documents to surface the most useful ones.

- MultiQueryRetriever: This method generates multiple queries based on the user's initial query. This improves recall by diversifying the search and ensuring we capture different angles of the same question.

- Parent Document Retriever: This method searches over small child chunks but returns the larger parent documents (or larger chunks) they came from. It’s beneficial when we want precise matching while still giving the LLM the broader surrounding context.

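DECISION: Given the goal of improving context precision (how much of the retrieved context is actually relevant to the question), we’ll implement Contextual Compression because it filters and narrows down the context to only what's necessary. The implementation below uses `LLMChainExtractor`, which extracts the relevant passages from each retrieved document rather than reranking the documents.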
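
To make the trade-off concrete, here is a toy sketch of what a compression step does to retrieved documents. It is not how `LLMChainExtractor` works internally: plain keyword overlap stands in for the LLM's relevance judgment, and `toy_compress`, `STOPWORDS`, and the sample documents are all hypothetical.

```python
import re

# Small hand-picked stopword list, an assumption made for this toy example.
STOPWORDS = {"what", "kind", "of", "should", "be", "about", "the", "a", "for", "is"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_compress(query, documents, min_overlap=1):
    """Keep only sentences sharing >= min_overlap query terms; drop emptied documents."""
    query_terms = tokenize(query) - STOPWORDS
    compressed = []
    for doc in documents:
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        kept = [s for s in sentences if len(tokenize(s) & query_terms) >= min_overlap]
        if kept:
            compressed.append(" ".join(kept))
    return compressed


query = "What kind of biases should founders be careful about?"
retrieved = [
    "Founders should watch for confirmation bias. The office had a nice view.",
    "Pricing pages should be simple.",
    "Survivorship bias distorts what founders learn from success stories.",
    "Anchoring on early feedback is another common trap.",
]

compressed = toy_compress(query, retrieved)
for doc in compressed:
    print(doc)
```

The off-topic sentence and the pricing document are removed, which is the precision benefit. The last document is also removed even though it is on topic, because it shares no terms with the query. A filter that drops useful context in this way is one reason context recall can fall after compression.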

In [51]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# This is the user's question or input to be processed by the retrieval system.
the_user_input = "What kind of biases should founders be careful about?"

# We retrieve documents based on the user's input using the regular retriever (te3_retriever).
# This will return relevant documents without any compression.
regular_output = te3_retriever.invoke(the_user_input)

# Creating an LLM-based document compressor. The `LLMChainExtractor` is used to extract relevant
# portions of the retrieved documents using the primary language model (primary_qa_llm).
compressor = LLMChainExtractor.from_llm(primary_qa_llm)

# Wrapping the base retriever (te3_retriever) inside the ContextualCompressionRetriever, which
# applies the compression step to filter and compress the results before returning them.
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=te3_retriever
)

# We retrieve documents based on the same user input but now applying the compression mechanism.
# This will return a more concise and focused set of relevant documents.
compressed_output = compressed_retriever.invoke(the_user_input)


In [52]:
# Create a retrieval chain that combines the compressed retriever and the document generation chain.
compressed_retrieval_chain = create_retrieval_chain(compressed_retriever, document_chain)

# Initializing empty lists to store answers and contexts after running the chain.
answers_comp = []
contexts_comp = []

# For each question in the test set, retrieve answers and context using the compressed retrieval chain.
for question in test_questions:
    # Invoke the compressed retrieval chain for the current question.
    response = compressed_retrieval_chain.invoke({"input" : question})
    
    # Append the generated answer to the 'answers_comp' list.
    answers_comp.append(response["answer"])
    
    # Extract the context (i.e., the content from the retrieved documents) and append it to the 'contexts_comp' list.
    contexts_comp.append([context.page_content for context in response["context"]])

# Creating a new dataset that contains the questions, their corresponding answers, contexts, and ground truth answers.
te3_compression_response_dataset = Dataset.from_dict({
    "question" : test_questions,    # The list of test questions.
    "answer" : answers_comp,        # The list of answers generated using the compressed retrieval chain.
    "contexts" : contexts_comp,     # The context from which the answers were generated.
    "ground_truth" : test_groundtruths  # The actual answers (ground truth) for evaluation.
})


In [53]:
# Evaluate the performance of the compressed retrieval dataset against the defined metrics.
compressed_results = evaluate(te3_compression_response_dataset, metrics)

# Create a DataFrame from the original (non-compressed) results using ADA embedding. 
# This DataFrame will have two columns: 'Metric' and 'ADA', where 'Metric' represents the evaluation metric, and 'ADA' stores its score.
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])

# Create a DataFrame from the compressed retrieval results. This will store 'Metric' and 'TE3-Compressed' scores.
df_comparison = pd.DataFrame(list(compressed_results.items()), columns=['Metric', 'TE3-Compressed'])

# Merge both DataFrames (df_baseline and df_comparison) on the 'Metric' column to easily compare the performance of ADA and compressed TE3.
df_merged2 = pd.merge(df_baseline, df_comparison, on='Metric')

# Calculate the difference between the compressed TE3 results and the baseline ADA results, and store the difference in a new column 'Baseline Compressed'.
df_merged2['Baseline Compressed'] = df_merged2['TE3-Compressed'] - df_merged2['ADA']

# Display the merged DataFrame, which now shows a comparison of ADA vs. TE3-Compressed performance across different metrics, and the difference between them.
df_merged2


Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

Exception raised in Job[34]: TimeoutError()


| | Metric | ADA | TE3-Compressed | Baseline Compressed |
|---|---|---|---|---|
| 0 | faithfulness | 0.650877 | 0.659223 | 0.008346 |
| 1 | answer_relevancy | 0.772631 | 0.916378 | 0.143747 |
| 2 | context_recall | 0.635965 | 0.464912 | -0.171053 |
| 3 | context_precision | 0.654971 | 0.678363 | 0.023392 |
| 4 | answer_correctness | 0.554098 | 0.541101 | -0.012997 |


- Faithfulness: Nearly unchanged, from 0.650877 (ADA) to 0.659223 (TE3-Compressed). The 0.008346 increase is too small to conclude that compression made responses more faithful to the provided context.

- Answer Relevancy: This metric improves from 0.772631 to 0.916378, showing that the compressed pipeline is better at generating responses that are directly relevant to the question being asked. The increase of 0.143747 is the largest improvement across the metrics.

- Context Recall and Context Precision: These move in opposite directions. Context Recall falls from 0.635965 to 0.464912 (-0.171053), indicating that the extraction step discards some of the context needed to support the ground-truth answers. Context Precision rises slightly from 0.654971 to 0.678363 (+0.023392), meaning the retained context is slightly more focused.

- Answer Correctness: There is a small drop, from 0.554098 to 0.541101 (-0.012997), so compression did not make the answers more correct.

Against the ADA baseline, compression trades a large loss in context recall for a gain in answer relevancy and a small gain in context precision. Against the uncompressed TE3 results, however, the compressed retriever is lower on faithfulness (0.659223 vs. 0.886133), answer relevancy (0.916378 vs. 0.971696), context recall (0.464912 vs. 0.598039), and answer correctness (0.541101 vs. 0.668013). Only context precision is marginally higher (0.678363 vs. 0.657407). Compression therefore met the narrow goal of improving context precision, but on this test set the uncompressed TE3 retriever performs better overall.
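
Step 5 of the activity asks for a comparison against the TE3 results, while the table above is computed against the ADA baseline. The sketch below does the TE3 comparison directly, using the scores printed in the two comparison tables in this notebook. The helper name `metric_deltas` is illustrative, and plain dictionaries are used instead of pandas so the snippet runs on its own.

```python
# Scores copied from the two comparison tables in this notebook.
ada = {
    "faithfulness": 0.650877,
    "answer_relevancy": 0.772631,
    "context_recall": 0.635965,
    "context_precision": 0.654971,
    "answer_correctness": 0.554098,
}
te3 = {
    "faithfulness": 0.886133,
    "answer_relevancy": 0.971696,
    "context_recall": 0.598039,
    "context_precision": 0.657407,
    "answer_correctness": 0.668013,
}
te3_compressed = {
    "faithfulness": 0.659223,
    "answer_relevancy": 0.916378,
    "context_recall": 0.464912,
    "context_precision": 0.678363,
    "answer_correctness": 0.541101,
}


def metric_deltas(candidate, reference):
    """Return candidate - reference for every metric in `reference`."""
    return {name: candidate[name] - reference[name] for name in reference}


vs_ada = metric_deltas(te3_compressed, ada)
vs_te3 = metric_deltas(te3_compressed, te3)
improved_over_te3 = [name for name, delta in vs_te3.items() if delta > 0]

print(f"{'metric':<20}{'vs ADA':>12}{'vs TE3':>12}")
for name in ada:
    print(f"{name:<20}{vs_ada[name]:>+12.6f}{vs_te3[name]:>+12.6f}")
print("improved over TE3:", improved_over_te3)
```

Putting both deltas side by side shows whether an advanced retriever improves on the best configuration so far, rather than only on the original baseline.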

#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of whether the rest of the notebook is completed. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
  - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
  - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
  - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
  - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
  - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Chunk semantically similar sentences (based on a threshold you design), and then paragraphs, greedily, up to a maximum chunk size. The minimum chunk size is a single sentence.
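
As a starting point, the greedy rule can be sketched at the sentence level as follows. This is one possible reading of the requirement, not a reference solution. A bag-of-words cosine similarity stands in for a real embedding model, the names `embed`, `cosine`, and `greedy_semantic_chunks` and the default `threshold` and `max_chars` values are assumptions, and the paragraph-level pass is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_semantic_chunks(text, threshold=0.2, max_chars=200):
    """Greedily append each sentence to the current chunk while it is similar enough and fits."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current:
            candidate = " ".join(current + [sentence])
            similar = cosine(embed(" ".join(current)), embed(sentence)) >= threshold
            if similar and len(candidate) <= max_chars:
                current.append(sentence)
                continue
            chunks.append(" ".join(current))  # close the chunk and start a new one
        current = [sentence]  # a single sentence is always allowed as a chunk
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Ragas scores RAG pipelines. Ragas metrics include faithfulness. "
    "Qdrant stores vectors. Qdrant vectors power retrieval."
)
chunks = greedy_semantic_chunks(text)
for chunk in chunks:
    print(repr(chunk))
```

Lowering `max_chars` forces splits even between similar sentences, while a sentence longer than `max_chars` still becomes its own chunk, which matches the single-sentence minimum.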

Have fun!