# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

The only way to get started is to get started, so we'll begin by grabbing our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage.

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0
  4. Task 4: Synthetic Dataset Generation for Evaluation using Ragas (Optional)

- 🤝 Breakout Room #2
  1. Task 1: Evaluating our Pipeline with Ragas
  2. Task 2: Testing OpenAI's Claim
  3. Task 3: Selecting an Advanced Retriever and Evaluating

> NOTE: This Notebook *does* contain a bonus challenge, outlined at the bottom of the notebook, which you can complete instead of the notebook for full marks on the assignment.

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [2]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant

We'll also get the "star of the show" today, which is Ragas!

In [3]:
!pip install -qU ragas

We'll be leveraging [QDrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [4]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [5]:
import os
import openai
from getpass import getpass

# openai.api_key = getpass("Please provide your OpenAI Key: ")
# os.environ["OPENAI_API_KEY"] = openai.api_key
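
The key-setting lines above are commented out, which works only if `OPENAI_API_KEY` is already present in the environment. A minimal sketch of a helper that prompts only when the variable is missing is shown below; `ensure_env` is a hypothetical name introduced here for illustration, not part of any library.

```python
import os
from getpass import getpass


def ensure_env(name, prompt_fn=getpass):
    """Return the value of environment variable `name`, prompting only if it is unset.

    `ensure_env` is a hypothetical helper for this notebook, not a library function.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"Please provide your {name}: ")
    return os.environ[name]


# Usage in the notebook (prompts interactively only when the key is missing):
# ensure_env("OPENAI_API_KEY")
```

Passing `prompt_fn` as a parameter keeps the helper easy to exercise without an interactive prompt.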

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0

Building on what we've been learning, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
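
The three steps above can be sketched without any external services. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the list-based index stands in for a vector store, and the final step only assembles the prompt an LLM would receive. All function names and sample chunks are made up for this sketch.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Step 1: create an index of (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Step 2: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, contexts):
    """Step 3: assemble the prompt an LLM would answer from."""
    return "Question:\n" + query + "\n\nContext:\n" + "\n".join(contexts)


# Made-up sample chunks, for illustration only.
chunks = [
    "Inclusion criteria describe who may enroll in a study.",
    "The vector store keeps embeddings for each chunk.",
    "Exclusion criteria describe who may not enroll.",
]
index = build_index(chunks)
top = retrieve(index, "What are the inclusion criteria?", k=1)
print(build_prompt("What are the inclusion criteria?", top))
```

The rest of the notebook swaps each toy piece for a real component: OpenAI embeddings, a Qdrant collection, and a chat model.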

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

- [`PyMuPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html)

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

# PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"
PDF_LINK = "PRECISE_StudyProtocol.pdf"
loader = PyMuPDFLoader(PDF_LINK)

documents = loader.load()

In [7]:
documents[0].metadata

{'source': 'PRECISE_StudyProtocol.pdf',
 'file_path': 'PRECISE_StudyProtocol.pdf',
 'page': 0,
 'total_pages': 76,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'LaTeX with hyperref',
 'producer': 'pdfTeX-1.40.24',
 'creationDate': 'D:20230616195132Z',
 'modDate': 'D:20230616195132Z',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

- [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1500
CHUNK_OVERLAP = 50

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)
documents = text_splitter.split_documents(documents)  

Let's confirm we've split our document.

In [9]:
len(documents)

163
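
To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs, lines, and words); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that overlap by `chunk_overlap`.

    Simplified illustration only; real splitters also respect separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk begins with the last two characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.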

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

- [`OpenAIEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain-openai-embeddings-base-openaiembeddings)

> NOTE: We are purposefully using an older embedding model to try and answer the guiding question: Is TE3 better than Ada-002?

In [10]:
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)

#### Creating a QDrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

- [`Qdrant`](https://api.python.langchain.com/en/latest/qdrant/langchain_qdrant.qdrant.QdrantVectorStore.html#langchain_qdrant.qdrant.QdrantVectorStore)

> NOTE: You'll need to provide the embedding dimension for Ada-002!

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

# LOCATION = ":memory:"
LOCATION = "localhost:55001"
COLLECTION_NAME = "PRECISE-1500-50-ADA"
VECTOR_SIZE = 1536
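
Rather than hard-coding the vector size, one option is a small lookup keyed by model name. The dimensions below are the default output sizes of the OpenAI embedding models; the helper name `vector_size_for` is an assumption made for this sketch.

```python
# Default output dimensions of OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def vector_size_for(model_name):
    """Return the default embedding dimension for a known model name (hypothetical helper)."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


print(vector_size_for("text-embedding-ada-002"))
```

Because `text-embedding-3-small` defaults to the same dimension as Ada-002, the same `VECTOR_SIZE` can be reused when the embedding model is swapped later, although each model still needs its own collection.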

In [12]:
qdrant_client = QdrantClient(LOCATION)

# Create the collection only if it does not exist yet. Calling create_collection
# on an existing collection fails with "409 (Conflict) ... already exists".
# `collection_exists` is available in recent qdrant-client releases.
if not qdrant_client.collection_exists(COLLECTION_NAME):
    qdrant_client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )

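In [46]:
# Run this once per collection to embed and store the chunks. Once the
# collection is populated, keep it commented out to avoid inserting duplicates.
# qdrant_vector_store = QdrantVectorStore(
#     client=qdrant_client,
#     collection_name=COLLECTION_NAME,
#     embedding=embeddings,
# )
#
# qdrant_vector_store.add_documents(documents)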

['60524fd666ab44819f3398f4e33f2080',
 'b8604284d67342558df895cfc1cfa6b6',
 '73371c6887ba4c18b0674bb45f8c06b3',
 'baa80eb0816b41bf8098b6a4618737dc',
 '305390fe01404ad59a4659a212dcfe3e',
 '2461029ab4a949c9b421720d985ff3f2',
 'd9b3577e492f4ce8987dde879437380a',
 '7f28f96318b34f35a4237660e456163b',
 'f306a3bfce784b7dbe0bc7d83c5433bb',
 '84177eb011f64e3a849cf3afeb3def88',
 'bde880a503c44a2abae0ed3091bb7a0b',
 '6909ec63f3814e0f8b952251b1d25f90',
 '4f37c7069dca47318f87524a9c05681f',
 '54d95cb6c9b143009e80c74a050286a8',
 'fa19132fd3b842ffa68a05ed9e4ea94f',
 '142018dafa90480eaa255cbc62f05f04',
 '570b290f30df491dbcbe20cf3d724828',
 '7a8eb9f75ed841aeac07afa937620336',
 '2aba5d9f92564a778fc5d9d40301f39f',
 '7aa0a02dd46a410c8b37da852a412df6',
 '206ab3229cdf45eba7719d4b02fce217',
 '4b6e8be85e6d4825b34dcee5f3638465',
 'd5fa02f82f5048df95d5982f659f6222',
 '7f64716b96d34b1aaac474afd4dd28d6',
 'b82dd8d374b0401aa2c4e5415e878796',
 '460ee2eb0be04f849aa12c65e2184d17',
 '75eaab5104e34ac996ab4f423403e280',
 

In [16]:
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="PRECISE-200-50-ADA",
    embedding=embeddings
)
retriever_200_50 = qdrant_vector_store.as_retriever(search_kwargs={"k": 100})
retrieved_documents = retriever_200_50.invoke("What are the inclusion criteria for PRECISE?")
print(len(retrieved_documents))
for doc in retrieved_documents:
  print(doc)

100
page_content='Short Title: PRECISE
CPCCRN Protocol Number: 90
Lead Investigators and Authors:
Mark W. Hall, Athena F. Zuppa and Peter Mourani' metadata={'source': 'PRECISE_StudyProtocol.pdf', 'file_path': 'PRECISE_StudyProtocol.pdf', 'page': 2, 'total_pages': 76, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.24', 'creationDate': 'D:20230616195132Z', 'modDate': 'D:20230616195132Z', 'trapped': '', '_id': 'fb2a745f-e592-41b4-9975-7e187eb70c02', '_collection_name': 'PRECISE-200-50-ADA'}
page_content='familiarity with clinical and laboratory procedures required for immune phenotyping. In the
PRECISE Protocol Version 1.07
Protocol Version Date: June 16, 2023' metadata={'source': 'PRECISE_StudyProtocol.pdf', 'file_path': 'PRECISE_StudyProtocol.pdf', 'page': 62, 'total_pages': 76, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'produc

In [14]:
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="PRECISE-500-50-ADA",
    embedding=embeddings
)
retriever_500_50 = qdrant_vector_store.as_retriever(search_kwargs={"k":25})
retrieved_documents = retriever_500_50.invoke("What are the inclusion criteria for PRECISE?")
for doc in retrieved_documents:
  print(doc)

page_content='subjects into the TRIPS or GRACE-2 trials depending on their immune phenotyping results,
which will not be known at the time of enrollment. Patients who meet inclusion criteria will be
entered into the data capture system and exclusion criteria (if present) will be recorded in that
PRECISE Protocol Version 1.07
Protocol Version Date: June 16, 2023
The Collaborative Pediatric Critical Care Research Network' metadata={'source': 'PRECISE_StudyProtocol.pdf', 'file_path': 'PRECISE_StudyProtocol.pdf', 'page': 26, 'total_pages': 76, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.24', 'creationDate': 'D:20230616195132Z', 'modDate': 'D:20230616195132Z', 'trapped': '', '_id': '0c4cf7a3-34db-4fe7-a73d-faaf7871e1c0', '_collection_name': 'PRECISE-500-50-ADA'}
page_content='familiarity with clinical and laboratory procedures required for immune phenotyping. In the
PRECISE Protocol Version 1.07
Pr

In [15]:
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="PRECISE-1500-50-ADA",
    embedding=embeddings
)
retriever_1500_50 = qdrant_vector_store.as_retriever(search_kwargs={"k":10})
retrieved_documents = retriever_1500_50.invoke("What are the inclusion criteria for PRECISE?")
for doc in retrieved_documents:
  print(doc)

page_content='guardians to conduct research using biorepository specimens, to potentially include DNA se-
quencing data, that is beyond the original scope of the PRECISE study. We will use a layered
PRECISE Protocol Version 1.07
Protocol Version Date: June 16, 2023
The Collaborative Pediatric Critical Care Research Network' metadata={'source': 'PRECISE_StudyProtocol.pdf', 'file_path': 'PRECISE_StudyProtocol.pdf', 'page': 52, 'total_pages': 76, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.24', 'creationDate': 'D:20230616195132Z', 'modDate': 'D:20230616195132Z', 'trapped': '', '_id': '87829488-7da1-48d8-ac7d-af43f3a62e41', '_collection_name': 'PRECISE-1500-50-ADA'}
page_content='Protocol 90 (Hall, Zuppa and Mourani)
Page 3 of 76
PROTOCOL TITLE:
PeRsonalizEd immunomodulation in pediatriC sepsIS-inducEd MODS
Short Title: PRECISE
CPCCRN Protocol Number: 90
Lead Investigators and Authors:
Mark W. Hal
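
The three retrievers use very different `k` values, so it helps to estimate how much text each one can hand to the LLM. The sketch below multiplies chunk size by `k` to get an upper bound in characters. It assumes the collection names encode the chunk size and overlap used to build them (confirmed above only for `PRECISE-1500-50-ADA`) and that chunk size is measured in characters, which is the splitter's default.

```python
# Chunk sizes are inferred from the collection names (an assumption for the
# 200 and 500 collections); k values come from the retriever cells above.
RETRIEVER_SETTINGS = {
    "PRECISE-200-50-ADA": {"chunk_size": 200, "k": 100},
    "PRECISE-500-50-ADA": {"chunk_size": 500, "k": 25},
    "PRECISE-1500-50-ADA": {"chunk_size": 1500, "k": 10},
}


def max_context_chars(chunk_size, k):
    """Upper bound on retrieved context length: k chunks of at most chunk_size characters."""
    return chunk_size * k


for name, settings in RETRIEVER_SETTINGS.items():
    print(name, max_context_chars(**settings))
```

Under these assumptions the 200-character collection can supply the largest context (up to 20,000 characters) despite having the smallest chunks, so differences between the chains reflect both chunk size and total context budget.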

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """
You are a helpful assistant who is an expert on clinical trials.  Answer the question based on information in the context.  If you cannot answer the
question then say you do not know.

Question:
{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(template)   ### YOUR CODE HERE

#### Setting Up our Basic QA Chain

We'll set up three RAG chains, one for each retriever.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retrieval_augmented_qa_chain_200 = (
    {"context": itemgetter("question") | retriever_200_50, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
retrieval_augmented_qa_chain_500 = (
    {"context": itemgetter("question") | retriever_500_50, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
retrieval_augmented_qa_chain_1500 = (
    {"context": itemgetter("question") | retriever_1500_50, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
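
The three chain definitions above differ only in the retriever, so they could be produced by one factory. The sketch below shows the pattern with plain Python callables standing in for the LangChain retriever and LLM, so that it runs without API calls. `make_rag_chain`, `fake_retriever`, and `echo_llm` are hypothetical names; with LangChain, the factory body would instead return the LCEL expression used above with the given retriever substituted in.

```python
def make_rag_chain(retriever, llm, template):
    """Return a function mapping {"question": ...} to {"response": ..., "context": ...}."""
    def chain(inputs):
        question = inputs["question"]
        context = retriever(question)
        prompt = template.format(question=question, context="\n".join(context))
        return {"response": llm(prompt), "context": context}
    return chain


def fake_retriever(k):
    """Hypothetical stand-in: always returns the first k chunks of a dummy corpus."""
    corpus = [f"chunk {i}" for i in range(200)]
    return lambda question: corpus[:k]


def echo_llm(prompt):
    """Hypothetical stand-in for a chat model."""
    return f"(answer based on {len(prompt)} prompt characters)"


TEMPLATE = "Question:\n{question}\n\nContext:\n{context}\n"

# One chain per chunk-size label, built from the same factory.
chains = {
    name: make_rag_chain(fake_retriever(k), echo_llm, TEMPLATE)
    for name, k in {"200": 100, "500": 25, "1500": 10}.items()
}
result = chains["500"]({"question": "What are the inclusion criteria?"})
print(len(result["context"]))
```

Keeping the chains in a dictionary also makes it easy to loop over them when generating answers for evaluation.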

Let's test it out!

In [20]:
question = "What are the inclusion and exclusion criteria for PRECISE?"

result = retrieval_augmented_qa_chain_200.invoke({"question" : question})
print("Chunk size 200\n")
print(result["response"].content)
print("--------------\n")
result = retrieval_augmented_qa_chain_500.invoke({"question" : question})
print("Chunk size 500\n")
print(result["response"].content)
print("--------------\n")

result = retrieval_augmented_qa_chain_1500.invoke({"question" : question})
print("Chunk size 1500\n")
print(result["response"].content)
print("--------------\n")


Chunk size 200

The inclusion criteria for the PRECISE study are:
- Age between ≥40 weeks corrected gestational age to < 18 years; AND
- Admission to the PICU or CICU; AND
- Onset of ≥2 new organ dysfunctions within the last 3 calendar days (compared to pre-sepsis baseline) as measured by the modified Proulx criteria.

The exclusion criteria for the PRECISE study are:
- Weight <3kg; OR
- Limitation of care order at the time of screening; OR
- Known pregnancy; OR
- Lactating females; OR
- Receipt of anakinra or GM-CSF within the previous 28 days; OR
- Resolution of MODS by MODS Day 2; OR
- Previous enrollment in the PRECISE study.
--------------

Chunk size 500

The inclusion and exclusion criteria for the PRECISE study are as follows:

**Inclusion Criteria:**
- Participants must be aged ≥40 weeks corrected gestational age to < 18 years.
- Participants must have sepsis-induced multiple organ dysfunction syndrome (MODS).

**Exclusion Criteria:**
- Weight <3 kg.
- Limitation of care order

In [32]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I do not know.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '6e43ee8782ff4e54adab84a4da670930', '_collection_name': 'PMarca Blogs'}, page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.'), Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to gain insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

In [33]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

In [34]:
len(eval_documents)

624

> NOTE: 🛑 Running this cell as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step. **YOU CAN SKIP THIS STEP BY LOADING THE `.csv` DIRECTLY FROM OUR REPOSITORY.** 🛑

#### Optional: SDG for Evaluation

In [35]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 20 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
testset.to_pandas()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What are the four major forms of chance and th... | [that chance is immune from human intervention... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | What is the purpose of temporary subfolders in... | [you can reply to a lot of messages with “I’m ... | Temporary subfolders in email management are u... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | What impact does a severe personality disorder... | [added humor value.]\nWhile I enjoyed Marc’s p... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | What is the relationship between precocious in... | [These three components are conspicuously link... | The context suggests that precocious individua... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | What are some strategies for raising angel mon... | [This obviously raises the issue of how you’re... | The context suggests several strategies for ra... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | What is the Influence-from-Mere-Association Te... | [One very practical consequence of Liking/Lovi... | The Influence-from-Mere-Association Tendency r... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | What should individuals focus on to succeed in... | [becomes irrelevant to determining the success... | According to Dr. Simonton, individuals should ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | What is the significance of being an individua... | [from scratch. This is a sharp diWerence from ... | Being an individual contributor is significant... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | What makes engineering degrees considered usef... | [workforce in a high-impact way when you gradu... | Engineering degrees are considered useful in t... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | What services are people abandoning in favor o... | [billion people online now. That is a very lar... | People are abandoning older services like news... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


Let's look at the output and see what we can learn about it!

In [36]:
testset.test_data[0]

DataRow(question='What are the four major forms of chance and their distinct roles in human interactions?', contexts=['that chance is immune from human interventions. However, one\nmust be careful not to read any unconsciously purposeful intent\ninto “interventions”… [which] are to be viewed as accidental,\nunwilled, inadvertent, and unforseeable.\nIndeed, chance plays several distinct roles when humans react cre-\natively with one another and with their environment…\nWe can observe chance arriving in four major forms and for four\ndiWerent reasons. The principles involved aWect everyone.\nHere’s where it helps to be a neurologist writing on this topic:\nThe four kinds of chance each have a diWerent kind of motor'], ground_truth='The answer to given question is not present in context', evolution_type='simple', metadata=[{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The
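
The `evolution_type` on each row comes from the `distributions` mapping passed to the generator. As a rough sanity check, the sketch below computes how many questions of each type the requested shares would imply for 20 pairs. This is plain arithmetic under the assumption that shares translate proportionally into counts; Ragas' actual allocation may differ, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(total, distribution):
    """Proportional number of questions per evolution type (illustrative arithmetic only)."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution shares should sum to 1")
    return {name: round(total * share) for name, share in distribution.items()}


# Shares copied from the `distributions` mapping in the generation cell.
counts = expected_counts(20, {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1})
print(counts)
```

Comparing these counts against `testset_df["evolution_type"].value_counts()` is a quick way to check that the generated set has the intended mix.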

In [37]:
testset_df = testset.to_pandas()
testset_df.to_csv("testset.csv")

#### PREFERRED: Download `.csv` from DataRepository

I have to consume $85 before the end of September because credits I bought a year ago are going to expire!

In [None]:
# !git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (79/79), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 87 (delta 24), reused 28 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (87/87), 70.09 MiB | 33.73 MiB/s, done.
Resolving deltas: 100% (24/24), done.


In [None]:
# !mv DataRepository/testset.csv .

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We'll first load the saved test set into a Pandas DataFrame.

In [38]:
import pandas as pd

test_df = pd.read_csv("testset.csv")

In [39]:
test_df

| | Unnamed: 0 | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|---|
| 0 | 0 | What are the four major forms of chance and th... | ['that chance is immune from human interventio... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | 1 | What is the purpose of temporary subfolders in... | ['you can reply to a lot of messages with “I’m... | Temporary subfolders in email management are u... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | 2 | What impact does a severe personality disorder... | ['added humor value.]\nWhile I enjoyed Marc’s ... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | 3 | What is the relationship between precocious in... | ['These three components are conspicuously lin... | The context suggests that precocious individua... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | 4 | What are some strategies for raising angel mon... | ['This obviously raises the issue of how you’r... | The context suggests several strategies for ra... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | 5 | What is the Influence-from-Mere-Association Te... | ['One very practical consequence of Liking/Lov... | The Influence-from-Mere-Association Tendency r... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | 6 | What should individuals focus on to succeed in... | ['becomes irrelevant to determining the succes... | According to Dr. Simonton, individuals should ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | 7 | What is the significance of being an individua... | ['from scratch. This is a sharp diWerence from... | Being an individual contributor is significant... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | 8 | What makes engineering degrees considered usef... | ['workforce in a high-impact way when you grad... | Engineering degrees are considered useful in t... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | 9 | What services are people abandoning in favor o... | ['billion people online now. That is a very la... | People are abandoning older services like news... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


In [40]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [41]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [42]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
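
Since the four columns are built in separate steps, a quick alignment check before constructing the dataset can catch a skipped question or a malformed context list early. The helper below is a hypothetical sketch; the column names mirror the keys used in the cell above, and the sample values are made up.

```python
def check_ragas_columns(question, answer, contexts, ground_truth):
    """Raise ValueError if the columns are misaligned or mistyped; return the row count."""
    lengths = {
        "question": len(question),
        "answer": len(answer),
        "contexts": len(contexts),
        "ground_truth": len(ground_truth),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for i, ctx in enumerate(contexts):
        # Each row's contexts entry should be a list of retrieved chunk strings.
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] should be a list of strings")
    return lengths["question"]


# Made-up sample values, for illustration only.
n_rows = check_ragas_columns(
    question=["q1", "q2"],
    answer=["a1", "a2"],
    contexts=[["c1", "c2"], ["c3"]],
    ground_truth=["g1", "g2"],
)
print(n_rows)
```

In the notebook this would be called as `check_ragas_columns(test_questions, answers, contexts, test_groundtruths)`.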

Let's take a peek and see what that looks like!

In [43]:
response_dataset[0]

{'question': 'What are the four major forms of chance and their distinct roles in human interactions?',
 'answer': 'The four major forms of chance mentioned in the context are unwilled, inadvertent, and unforeseeable. Each of these forms plays distinct roles in human interactions, involving different kinds of exploratory activity and sensory receptivity. However, the specific details about the four forms of chance and their distinct roles are not provided in the context. Therefore, I cannot provide a complete answer regarding the four major forms of chance and their distinct roles.',
 'contexts': ['We can observe chance arriving in four major forms and for four\ndiWerent reasons. The principles involved aWect everyone.\nHere’s where it helps to be a neurologist writing on this topic:',
  'unwilled, inadvertent, and unforseeable.\nIndeed, chance plays several distinct roles when humans react cre-\natively with one another and with their environment…',
  'The four kinds of chance each ha

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
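
To build intuition for how some of these scores are formed, the sketch below reproduces the arithmetic behind faithfulness, context recall, and context precision as the Ragas documentation describes them. In Ragas the per-claim and per-chunk verdicts come from an LLM judge; here they are supplied by hand as booleans, and the function names are made up for this illustration.

```python
def faithfulness_score(claim_verdicts):
    """Share of claims in the answer judged to be supported by the retrieved context."""
    return sum(claim_verdicts) / len(claim_verdicts) if claim_verdicts else 0.0


def context_recall_score(sentence_verdicts):
    """Share of ground-truth sentences judged attributable to the retrieved context."""
    return sum(sentence_verdicts) / len(sentence_verdicts) if sentence_verdicts else 0.0


def context_precision_score(relevance):
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision@rank, counted only at relevant ranks
    return total / hits if hits else 0.0


# Hand-written verdicts, for illustration only.
print(faithfulness_score([True, True, False, True, False]))
print(context_recall_score([True, False]))
print(context_precision_score([True, False, True]))
```

Context precision rewards placing relevant chunks early: a single relevant chunk scores 1.0 at rank 1 but only 0.5 at rank 2.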

In [44]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [45]:
results = evaluate(response_dataset, metrics)

In [46]:
results

{'faithfulness': 0.6709, 'answer_relevancy': 0.7698, 'context_recall': 0.7500, 'context_precision': 0.6972, 'answer_correctness': 0.6469}
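
To test the claim from the Motivation section, the same evaluation would be run once per embedding model and the aggregate scores compared metric by metric. The sketch below shows one way to compute the differences from two plain dictionaries of scores. The numbers are placeholders chosen for easy arithmetic, not results from this notebook, and `metric_deltas` is a hypothetical helper.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both score dicts."""
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in candidate
    }


# Placeholder scores for illustration only (not measured values).
baseline_scores = {"faithfulness": 0.5, "context_recall": 0.75}
candidate_scores = {"faithfulness": 0.75, "context_recall": 0.5}

deltas = metric_deltas(baseline_scores, candidate_scores)
print(deltas)
```

A positive delta means the candidate configuration scored higher on that metric. With only a handful of test questions, small differences should be read cautiously.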

In [47]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the four major forms of chance and th... | The four major forms of chance mentioned in th... | [We can observe chance arriving in four major ... | The answer to given question is not present in... | 0.6 | 0.0 | 1.0 | 0.0 | 0.188577 |
| 1 | What is the purpose of temporary subfolders in... | The purpose of temporary subfolders in email m... | [the normal course of your day.\nFourth, aside... | Temporary subfolders in email management are u... | 0.5 | 1.0 | 1.0 | 1.0 | 0.994009 |
| 2 | What impact does a severe personality disorder... | A severe personality disorder can lead a manag... | [severe personality disorder who micromanages ... | The answer to given question is not present in... | 0.6 | 1.0 | 1.0 | 0.0 | 0.179351 |
| 3 | What is the relationship between precocious in... | The relationship between precocious individual... | [early, end late, and produce at above-average... | The context suggests that precocious individua... | 0.857143 | 0.983589 | 1.0 | 1.0 | 0.765158 |
| 4 | What are some strategies for raising angel mon... | Some strategies for raising angel money before... | [This obviously raises the issue of how you’re... | The context suggests several strategies for ra... | 0.7 | 1.0 | 0.5 | 0.805556 | 0.862922 |
| 5 | What is the Influence-from-Mere-Association Te... | I do not know. | [The Psychology of Entrepreneurial Misjudgment... | The Influence-from-Mere-Association Tendency r... | 0.0 | 0.0 | 0.0 | 0.0 | 0.182111 |
| 6 | What should individuals focus on to succeed in... | According to Dr. Simonton, individuals should ... | [progress through a creative career. Instead y... | According to Dr. Simonton, individuals should ... | 1.0 | 0.981602 | 1.0 | 1.0 | 0.804426 |
| 7 | What is the significance of being an individua... | The significance of being an individual contri... | [destiny — you get to succeed or fail on your ... | Being an individual contributor is significant... | 0.444444 | 1.0 | 0.666667 | 1.0 | 0.843277 |
| 8 | What makes engineering degrees considered usef... | Engineering degrees are considered useful in t... | [Which undergraduate degrees are useful in\nth... | Engineering degrees are considered useful in t... | 0.857143 | 0.987121 | 1.0 | 1.0 | 0.539575 |
| 9 | What services are people abandoning in favor o... | The context does not specify the exact service... | [billion people online now. That is a very lar... | People are abandoning older services like news... | 0.75 | 0.0 | 1.0 | 1.0 | 0.407606 |
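
The per-question rows are most useful for spotting where the pipeline struggles. As a minimal sketch, the hypothetical helper below sorts records by a chosen metric and returns the weakest ones. It works on plain dictionaries so it runs on its own; in the notebook, `results_df.to_dict("records")` would give records of this shape. The sample scores are copied from rows 0 to 5 of the output above.

```python
def weakest_rows(records, metric, n=3):
    """Return the n records with the lowest score for the given metric."""
    return sorted(records, key=lambda row: row[metric])[:n]


# Scores copied from rows 0-5 of the per-question output above.
records = [
    {"row": 0, "faithfulness": 0.6, "answer_correctness": 0.188577},
    {"row": 1, "faithfulness": 0.5, "answer_correctness": 0.994009},
    {"row": 2, "faithfulness": 0.6, "answer_correctness": 0.179351},
    {"row": 3, "faithfulness": 0.857143, "answer_correctness": 0.765158},
    {"row": 4, "faithfulness": 0.7, "answer_correctness": 0.862922},
    {"row": 5, "faithfulness": 0.0, "answer_correctness": 0.182111},
]

# Print the three questions whose answers were judged least correct.
for row in weakest_rows(records, "answer_correctness"):
    print(row["row"], row["answer_correctness"])
```

Reading the question, answer, and contexts for the weakest rows usually says more about a failure mode than the aggregate score does.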


## Task 2: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #1:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [48]:
te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [49]:
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME+"TE3",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME+"TE3",
    embedding=te3_embeddings,
)

qdrant_vector_store.add_documents(documents)

['b47e0fc002b849518c4ccc41626165a9',
 'b40a34304ab54f279f4b513793cbfdee',
 'f72c0d8a660e4b7ba394aa495a161b06',
 'cbe4570cc6bb479597addfadb552c244',
 '0d2f7d76517d47449b7e366a8d8ff82c',
 '95cb90ea5486428e85461c0fc5e420c7',
 'eba9bffe8765446cad2edf9a557fef84',
 'e5921ac63e474bc7b4675789d89862a9',
 'ed4fa26402f84c4587b569a48a7b8438',
 '9e4d2c09965747e0bf3b7c9d310171b3',
 '593ac9ee1c3a4bdebbe552797b858efa',
 'fa13350c34b640119d3c1393de20ff3f',
 'dc4b549d0d164929a3da5066434178c0',
 'c7615c4dabc04306a6ac2440d0126ef3',
 '35d02a055b2d4780b77fbf58adc9758e',
 '970b7e63bcf44d75bb1c8e47dd51ec20',
 '3d5ee7c86c0542ea8ff03d96a57b827b',
 '3138d56c78564a2194145e9a513832f0',
 'dfa89adfc0cb400195b5f98d6cb9b05c',
 'f12a8834338f45719c8ddce5ab9c48bf',
 '967e846771544605800719d2c74026a1',
 'a04681cbf6ed4f0d964146b20d8c79ad',
 '591d2bda03d5411a8bb388e55487fcfc',
 '837e4ba2e2324cdb9c12eb96f1bc7613',
 '2d614208225e4e5dae8d4f98e4361288',
 '06ccf0d3127a40d5928a1451444d61e4',
 'af48a9d50ab84406a087906a3d672cc3',
 

In [50]:
te3_retriever = qdrant_vector_store.as_retriever()
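
The `size=1536` passed to `VectorParams` when creating the collection has to match the length of the vectors the embedding model returns, otherwise inserts into the collection will fail. A minimal sketch of a lookup that makes that dependency explicit is shown below. The helper name is hypothetical, and the table lists the default output dimensions of the OpenAI models only; the `text-embedding-3` models can also be asked for shorter vectors, which this sketch does not cover.

```python
# Default output dimensions of OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def vector_size_for(model_name):
    """Look up the vector size a collection needs for a given embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None


# Both models compared in this notebook share a size, so size=1536 works for each.
print(vector_size_for("text-embedding-ada-002"))
print(vector_size_for("text-embedding-3-small"))
```

Swapping in `text-embedding-3-large` would require a collection created with a different `size`, which is one reason each embedding model gets its own collection.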

In [51]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

In [52]:
from langchain.chains import create_retrieval_chain

te3_retrieval_chain = create_retrieval_chain(te3_retriever, document_chain)

In [53]:
answers = []
contexts = []

for question in test_questions:
  response = te3_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [54]:
te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [55]:
te3_advanced_retrieval_results = evaluate(te3_response_dataset_advanced_retrieval, metrics)

In [57]:
te3_advanced_retrieval_results

{'faithfulness': 0.8338, 'answer_relevancy': 0.7684, 'context_recall': 0.7000, 'context_precision': 0.7458, 'answer_correctness': 0.6128}

In [58]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']

df_merged

| | Metric | ADA | TE3 | Baseline -> TE3 |
|---|---|---|---|---|
| 0 | faithfulness | 0.670853 | 0.83377 | 0.162917 |
| 1 | answer_relevancy | 0.769766 | 0.768356 | -0.00141 |
| 2 | context_recall | 0.75 | 0.7 | -0.05 |
| 3 | context_precision | 0.697222 | 0.745833 | 0.048611 |
| 4 | answer_correctness | 0.646875 | 0.612831 | -0.034044 |
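
The differences above are averages over a modest number of test questions, so some of them may be noise rather than a real effect of the embedding model. One way to probe this, sketched below, is a paired bootstrap over per-question scores: resample the per-question differences with replacement and check whether the resulting interval excludes zero. The function name and the example scores are hypothetical; in the notebook, the per-question scores could be taken from the `to_pandas()` output of each evaluation run.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-question score difference.

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(candidate):
        raise ValueError("Score lists must cover the same questions.")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-question faithfulness scores for two pipelines.
ada_scores = [0.6, 0.5, 0.6, 0.9, 0.7, 0.0, 1.0, 0.4]
te3_scores = [0.8, 0.7, 0.5, 1.0, 0.9, 0.3, 1.0, 0.6]

mean_diff, lower, upper = paired_bootstrap_ci(ada_scores, te3_scores)
print(f"mean difference {mean_diff:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the interval comfortably excludes zero, the change is less likely to be an artifact of which questions happened to be sampled. LLM-judged metrics also vary between runs, so repeating the evaluation is another useful check.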


#### ❓ Question #3:

In your opinion, is `text-embedding-3-small` significantly better than `text-embedding-ada-002`?

## Task 3: Selecting an Advanced Retriever and Evaluating

#### 🏗️ Activity #2

While the changes that occurred due to modifying the embedding model were desirable - you're now tasked with improving `context_recall`, or `context_precision` (or both!).

You'll follow these steps:

1. Reason about this list of Advanced Retrieval methods:
   - [Contextual Compression (Reranker)](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/)
   - [MultiQueryRetriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/)
   - [Parent Document Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/)
2. Select the method you think will be the most performant.
3. Implement that method.
4. Create an LCEL chain that utilizes the new Retriever method.
5. Evaluate this LCEL chain and compare it to the TE3 results.

> NOTE: We will spend more time in Session 14 diving into advanced retrieval methods, this activity is only to serve as a basic introduction to the idea of component-wise improvements and how they might impact metrics.
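
To make the idea behind one of these methods concrete, here is a minimal, library-free sketch of multi-query retrieval: run several phrasings of the same question through a retriever and keep the de-duplicated union of the results. Everything in it is a stand-in. The keyword retriever replaces a vector store, the query variants are written by hand (LangChain's `MultiQueryRetriever` generates them with an LLM), and the function names are hypothetical.

```python
def keyword_retriever(query, corpus, k=2):
    """Toy stand-in for a vector-store retriever: rank documents by shared words."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().split())), idx)
        for idx, doc in enumerate(corpus)
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [corpus[idx] for score, idx in ranked[:k] if score > 0]


def multi_query_retrieve(query_variants, corpus, k=2):
    """Retrieve for each query variant and merge results, dropping duplicates."""
    seen = set()
    merged = []
    for variant in query_variants:
        for doc in keyword_retriever(variant, corpus, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


corpus = [
    "angel investors fund early startups",
    "email folders keep your inbox tidy",
    "raising seed money requires a pitch",
    "engineering degrees teach problem solving",
]
variants = ["how to find angel investors", "raising money for a startup"]

# The second phrasing surfaces a document the first phrasing misses.
print(multi_query_retrieve(variants, corpus))
```

Pulling in documents that a single phrasing misses tends to help `context_recall`, while the extra documents can lower `context_precision` unless a reranking step follows.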

In [None]:
### YOUR CODE HERE

#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of the completion of the notebook. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
   - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
   - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
   - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
   - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
   - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Greedily chunk semantically similar sentences (based on a designed threshold), and then paragraphs, up to a maximum chunk size. The minimum chunk size is a single sentence.
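
As a starting point, the sentence-level part of this requirement can be sketched as below. This is a rough illustration under stated assumptions, not a reference solution: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the `threshold` and `max_chars` values are arbitrary, all function names are hypothetical, and the paragraph-level merging pass is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.2, max_chars=200):
    """Greedily merge consecutive sentences that are similar and still fit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = []
    for sentence in sentences:
        if current:
            candidate = " ".join(current + [sentence])
            similar = cosine(embed(" ".join(current)), embed(sentence)) >= threshold
            if similar and len(candidate) <= max_chars:
                current.append(sentence)
                continue
            chunks.append(" ".join(current))  # close the chunk and start a new one
        current = [sentence]
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Cats purr when happy. Cats sleep all day. "
    "Stock markets fell sharply. Markets recovered on Friday."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

A single sentence longer than `max_chars` still becomes its own chunk, which matches the minimum chunk size stated above. Comparing each new sentence against the whole running chunk is one design choice; comparing against only the previous sentence is another worth trying.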

Have fun!