# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0
  4. Task 4: Synthetic Dataset Generation for Evaluation using Ragas (Optional)

- 🤝 Breakout Room #2
  1. Task 1: Evaluating our Pipeline with Ragas
  2. Task 2: Testing OpenAI's Claim
  3. Task 3: Selecting an Advanced Retriever and Evaluating

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3USD from OpenAI usage.
> NOTE: This Notebook *does* contain a bonus challenge, outlined at the bottom of the notebook, which you can complete instead of the notebook for full marks on the assignment.

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [5]:
import os
import openai
from getpass import getpass

# openai.api_key = getpass("Please provide your OpenAI Key: ")
# os.environ["OPENAI_API_KEY"] = openai.api_key
os.environ["LANGCHAIN_TRACING_V2"] = "false"    ### Not using it here and I started getting rate limit errors from tracing

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0

Building on what we've been learning, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.
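
The three steps above can be sketched in plain Python before we bring in any libraries. This is a toy illustration only: word overlap stands in for embedding similarity, and the names `words`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of LangChain or Ragas.

```python
def words(text):
    """Toy stand-in for an embedding: the set of lowercase words in the text."""
    return set(text.lower().split())

# 1. Create an index: store each chunk next to its (toy) representation.
chunks = [
    "pick an industry where change is happening",
    "great teams move faster with fewer people",
    "the funding window may not be open later",
]
index = [(chunk, words(chunk)) for chunk in chunks]

# 2. Retrieve: rank chunks by similarity to the query (here, shared word count).
def retrieve(query, k=2):
    q = words(query)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: hand the retrieved context to an LLM. Here we only build the prompt.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Question:\n{query}\n\nContext:\n{context}"

top = retrieve("which industry should I pick", k=1)
print(top)
print(build_prompt("which industry should I pick"))
```

The real pipeline below swaps the word sets for OpenAI embeddings, the list for a Qdrant collection, and the prompt string for a call to a chat model.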

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

- [`PyMuPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html)

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"

loader = PyMuPDFLoader(
    PDF_LINK                                        ### INSERT CODE
)

documents = loader.load()                           ### YOUR CODE HERE

In [7]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

- [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 200
CHUNK_OVERLAP = 50

text_splitter = RecursiveCharacterTextSplitter(
                                                      ### YOUR CODE HERE
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

documents = text_splitter.split_documents(documents)  ### YOUR CODE HERE

Let's confirm we've split our document.

In [9]:
len(documents)

1864
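
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is a sketch under a stated assumption: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, which this version ignores, so it only illustrates how the two parameters interact. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghij" * 50  # 500 characters of sample text
chunks = sliding_window_chunks(text, chunk_size=200, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

With the values used in this notebook (200 and 50), each window advances 150 characters, so the last 50 characters of one chunk reappear at the start of the next.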

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors that we can compare against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

- [`OpenAIEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain-openai-embeddings-base-openaiembeddings)

> NOTE: We are purposefully using an older embedding model to try and answer the guiding question: Is TE3 better than Ada-002?

In [10]:
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(
    model = EMBEDDING_MODEL                            ### YOUR CODE HERE
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

- [`QdrantVectorStore`](https://api.python.langchain.com/en/latest/qdrant/langchain_qdrant.qdrant.QdrantVectorStore.html#langchain_qdrant.qdrant.QdrantVectorStore)

> NOTE: You'll need to provide the embedding dimension for Ada-002!
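
One way to avoid hard-coding the vector size is a small lookup keyed by model name. The sizes below are the default output dimensions OpenAI documents for these models; the dictionary and the helper `vector_size_for` are hypothetical names used only for this sketch.

```python
# Default output dimensions for OpenAI embedding models (per OpenAI's documentation).
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def vector_size_for(model_name):
    """Return the vector size a Qdrant collection needs for the given embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model_name!r}") from None

print(vector_size_for("text-embedding-ada-002"))
```

Note that `text-embedding-ada-002` and `text-embedding-3-small` share the same default size, which is why the same `VectorParams` size works for both collections in this notebook.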

## CHANGED Qdrant TO LOCAL SERVER!

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

# LOCATION = ":memory:"
LOCATION = "localhost:55002"
COLLECTION_NAME = "PMarca Blogs"
VECTOR_SIZE = 1536

In [12]:
qdrant_client = QdrantClient(LOCATION)                  ### YOUR CODE HERE

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE,distance=Distance.COSINE)
    
                                                        ### YOUR CODE HERE
)

True

In [13]:


qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings
                                                         ### YOUR CODE HERE
)

qdrant_vector_store.add_documents(documents);           ### Added semicolon to suppress the irritating printout of each vector

#### I suppressed the output from the cell above, but here is a screenshot from my server dashboard:
!['Qdrant server'](server1.png)

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

---

#### ANSWER #1:
Qdrant is a vector database; it calls its records "points". Unlike a traditional database with tables (rows and columns), it is organized as individual points (which might be thought of as rows), each consisting of an ID, the vector values, and a payload. The payload is a JSON object that can contain anything; in our use case it includes the text of our chunk plus metadata such as the page number from our PDF document.

Similarity is measured by one of several distance metrics; we are using cosine similarity (others include dot product and Euclidean distance).

A named set of points is a collection. I have implemented a local Qdrant server and it contains multiple named collections. When a collection is set up, Qdrant requires the dimensionality of the vectors and the distance metric, as these affect the way it will index the collection.

Search is sped up by using a Hierarchical Navigable Small World (HNSW) index to perform approximate nearest-neighbor search; this is important because exhaustive k-NN would have to compare the query against every stored vector, which requires significant compute in large databases. For our toy applications, memory is used as storage, but I have changed the code in this notebook to use a local server running in a Docker container so I can create different collections and compare the text embedding models easily. When using a server with on-disk persistence, Qdrant can memory-map (memmap) part or all of the stored files to speed up access.

---
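
To see what HNSW is approximating, here is a sketch of exhaustive nearest-neighbor search with cosine similarity over a few points shaped like Qdrant's (ID, vector, payload). It is an illustration in plain Python, not Qdrant's implementation; the function names and sample data are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exhaustive_knn(query, points, k):
    """Score every stored vector against the query: one comparison per point."""
    scored = [(cosine_similarity(query, p["vector"]), p["id"]) for p in points]
    scored.sort(reverse=True)  # highest similarity first
    return [point_id for _, point_id in scored[:k]]

# Hypothetical points: an ID, a vector, and a JSON-like payload.
points = [
    {"id": "a", "vector": [1.0, 0.0], "payload": {"page": 1}},
    {"id": "b", "vector": [0.7, 0.7], "payload": {"page": 2}},
    {"id": "c", "vector": [0.0, 1.0], "payload": {"page": 3}},
]
nearest = exhaustive_knn([1.0, 0.1], points, k=2)
print(nearest)
```

The cost of this loop grows linearly with the number of points, which is what an approximate index avoids by visiting only a small part of the collection per query.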

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [14]:
retriever = qdrant_vector_store.as_retriever()      ### YOUR CODE HERE

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [15]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [26]:
for doc in retrieved_documents:
  print(doc)

page_content='the existing order — and make sure that those forces of change
have a reasonable chance at succeeding.
Second rule of thumb:
Once you have picked an industry, get right to the center of it' metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'e20ac675f61249eeb2f5a17939072f9b', '_collection_name': 'PMarca Blogs'}
page_content='Third rule:
In a rapidly changing Held like technology, the best place to
get experience when you’re starting out is in younger, high-
growth companies.' metadata={'source': 'https://d1lamhf6l6yk6d.cloud

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [16]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [17]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [18]:
from langchain.prompts import ChatPromptTemplate

template = """
You are a helpful assistant.  Answer the question based on information in the context.  If you cannot answer the
question then say you do not know.

Question:
{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(template)   ### YOUR CODE HERE

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our retrieved context through to the output - which is critical for RAGAS.

### I am changing all the models to GPT-4o 
I also deleted the comments in the next cell.

In [19]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
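
The data flow through that chain can be mimicked in plain Python, which may help when reading LCEL for the first time. This is a sketch only: `fake_retriever` and `fake_llm` are hypothetical stand-ins for `retriever` and `prompt | primary_qa_llm`, and `run_chain` is not a LangChain function.

```python
from operator import itemgetter

def fake_retriever(question):
    # Hypothetical stand-in for `retriever`: returns a list of context strings.
    return [f"context for: {question}"]

def fake_llm(prompt_text):
    # Hypothetical stand-in for `prompt | primary_qa_llm`.
    return f"answer based on -> {prompt_text}"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming {"question": ...} dict.
    question = itemgetter("question")(inputs)
    state = {"context": fake_retriever(question), "question": question}
    # Step 2: the assign step keeps every key and sets "context" to itself.
    state = {**state, "context": itemgetter("context")(state)}
    # Step 3: produce the response and carry the retrieved context to the output.
    prompt_text = f"Question:\n{state['question']}\n\nContext:\n{state['context']}"
    return {"response": fake_llm(prompt_text), "context": state["context"]}

result = run_chain({"question": "What is RAG?"})
print(sorted(result))
```

The point to notice is the final dictionary: it returns the context alongside the response, which is what lets us collect retrieved contexts for evaluation later.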

Let's test it out!

In [23]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

A rule of thumb for selecting an industry to invest in, based on the provided context, is to choose an industry where significant changes are happening and where those changes have a reasonable chance of succeeding.


In [24]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I do not know. The provided context does not include any information about what Pink Floyd had to say about investing in a new industry.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '09a7efae-a9c7-4943-b167-25b114b1ee0c', '_collection_name': 'PMarca Blogs'}, page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.'), Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfr

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [25]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

---

#### ANSWER #2
If we split our documents in exactly the same way we did when we created our RAG pipeline, then the system would be "rigged" in favor of the subsequent tests appearing positive. It is somewhat analogous to making sure that our training and test data are separated in fine-tuning (or pretraining) situations.

---

In [26]:
len(eval_documents)

624

> NOTE: 🛑 Running this cell as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step. **YOU CAN SKIP THIS STEP BY LOADING THE `.csv` DIRECTLY FROM OUR REPOSITORY.** 🛑

#### Optional: SDG for Evaluation
---

### I chose to run this cell, and increased the number of examples.
Last year I bought $100 credit from OpenAI without realizing that it automatically expires in 12 months whether you use it or not.  I have a balance of $86 that I don't want to leave on the table!

I also changed the generator because during our breakout I was disappointed with the variability, so I am using gpt-4o for everything.

In addition, the original code set the embeddings to OpenAIEmbeddings() and I am not sure the default was really ADA so I changed the code to be explicit.

---

In [29]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 50 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
testset.to_pandas()

embedding nodes:   0%|          | 0/1248 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/50 [00:00<?, ?it/s]

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | Why is it generally a good idea to end up with... | [redesign is that you want to tolerate overlap... | By reducing the size of a team, and increasing... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | How has evolution influenced the tendency of a... | [Four: Doubt-Avoidance Tendency\nThe brain of ... | Evolution has influenced the tendency of anima... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | What is the risk associated with the funding w... | [Here’s why you shouldn’t do that:\nWhat are t... | The risk associated with the funding window no... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | How can the culture of a startup be affected b... | [(In case you were wondering, by the way, the ... | The culture of a startup can be affected by th... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | What are the intangibles that keep great peopl... | [Closing thought\nIn general, the intangibles ... | The intangibles that keep great people are the... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | How can an entrepreneur reduce risk to secure ... | [as if it’s an onion. Just like you peel an on... | An entrepreneur can reduce risk to secure vent... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | Why should you focus on more at-bats rather th... | [becomes irrelevant to determining the success... | You should focus on more at-bats rather than i... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | Why might radical change be necessary in a sit... | [and iteration will ultimately prove it out, v... | Radical change might be necessary in a situati... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | What usually triggers Doubt-Avoidance Tendency? | [ing is forced. And one is required to so comp... | What usually triggers Doubt-Avoidance Tendency... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | What is the Inconsistency-Avoidance Tendency a... | [Five: Inconsistency-Avoidance Tendency\n[Peop... | The Inconsistency-Avoidance Tendency refers to... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


Let's look at the output and see what we can learn about it!

In [30]:
testset.test_data[0]

DataRow(question='Why is it generally a good idea to end up with smaller team sizes after restructuring?', contexts=['redesign is that you want to tolerate overlap. So each product\ndivision has its own QA team — so what? Your division heads —\nwho are now your best people — will be able to move so much\nfaster that way that it’s worth it. Plus, you saved so much money\ntaking out the VIPs, summertime soldiers, and mediocre people\nthat you’re still ahead on headcount expense.\nRemember, it’s generally a good idea, once you do all of this\nrestructuring, to end up with smaller team sizes than you had\nbefore. By reducing the size of a team, and increasing the aver-'], ground_truth='By reducing the size of a team, and increasing the average quality of team members, divisions will be able to move much faster.', evolution_type='simple', metadata=[{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront

In [31]:
testset_df = testset.to_pandas()
testset_df.to_csv("testset.csv")

#### PREFERRED: Download `.csv` from DataRepository

I have to consume $85 before the end of September because credits I bought a year ago are going to expire!

In [32]:
# !git clone https://github.com/AI-Maker-Space/DataRepository.git

In [33]:
# !mv DataRepository/testset.csv .

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [34]:
import pandas as pd

test_df = pd.read_csv("testset.csv")

In [35]:
test_df[:5]

|  | Unnamed: 0 | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Why is it generally a good idea to end up with... | ['redesign is that you want to tolerate overla... | By reducing the size of a team, and increasing... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | 1 | How has evolution influenced the tendency of a... | ['Four: Doubt-Avoidance Tendency\nThe brain of... | Evolution has influenced the tendency of anima... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | 2 | What is the risk associated with the funding w... | ['Here’s why you shouldn’t do that:\nWhat are ... | The risk associated with the funding window no... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | 3 | How can the culture of a startup be affected b... | ['(In case you were wondering, by the way, the... | The culture of a startup can be affected by th... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | 4 | What are the intangibles that keep great peopl... | ['Closing thought\nIn general, the intangibles... | The intangibles that keep great people are the... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


In [36]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
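
One thing worth knowing if you ever reuse the `contexts` column from the CSV: a list written to CSV comes back as its string representation, not as a list. The sketch below shows the effect and one way to undo it using only the standard library (the same idea applies to a column loaded with `pd.read_csv`); the sample row is made up for illustration.

```python
import ast
import csv
import io

# A made-up row whose "contexts" value is a list, as in the generated test set.
rows = [{"question": "q1", "contexts": ["chunk a", "chunk b"]}]

# Write to an in-memory CSV; the list is stored via its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["contexts"]

# ast.literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)
```

In this notebook we only read `question` and `ground_truth` from the CSV and collect fresh contexts from the retriever, so no conversion is needed here.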

Now we'll run our RAG pipeline on the questions we've generated - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [110]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [111]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
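
Because the four lists are matched up purely by position, a quick sanity check before building the dataset can catch misaligned or malformed columns. This is a hypothetical helper, `check_ragas_columns`, written for this notebook's four columns; it is not part of Ragas or `datasets`.

```python
def check_ragas_columns(data):
    """Check the four columns used in this notebook and return the number of rows."""
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [name for name in required if name not in data]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Each row of "contexts" should be a list of strings, one per retrieved chunk.
    for row in data["contexts"]:
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise ValueError("Each 'contexts' entry must be a list of strings")
    return lengths["question"]

sample = {
    "question": ["q"],
    "answer": ["a"],
    "contexts": [["chunk 1", "chunk 2"]],
    "ground_truth": ["g"],
}
print(check_ragas_columns(sample))
```

Calling it on the same dictionary passed to `Dataset.from_dict` should return the number of test questions.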

Let's take a peek and see what that looks like!

In [112]:
response_dataset[0]

{'question': 'Why is it generally a good idea to end up with smaller team sizes after restructuring?',
 'answer': 'It is generally a good idea to end up with smaller team sizes after restructuring because by reducing the size of a team and increasing the average quality level within the team, you will usually speed things up while saving money.',
 'contexts': ['that you’re still ahead on headcount expense.\nRemember, it’s generally a good idea, once you do all of this\nrestructuring, to end up with smaller team sizes than you had',
  'before. By reducing the size of a team, and increasing the aver-\nage quality level within the team, you will usually speed things\nup, while saving money.',
  'Well, Hrst question: Since team is the thing you have the most\ncontrol over at the start, and everyone wants to have a great\nteam, what does a great team actually get you?',
  'answer, in part because in the beginning of a startup, you know\na lot more about the team than you do the product, whi

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
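
As a rough intuition for two of these metrics: the Ragas documentation describes faithfulness as the share of claims in the answer that can be inferred from the retrieved context, and context recall as the share of ground-truth statements that can be attributed to the context. In Ragas an LLM produces those per-claim verdicts; the sketch below simply assumes the verdicts are already available as booleans, and `ratio_of_supported` is a hypothetical helper.

```python
def ratio_of_supported(verdicts):
    """Fraction of True verdicts; returns 0.0 for an empty list (a choice made for this sketch)."""
    if not verdicts:
        return 0.0
    return sum(1 for verdict in verdicts if verdict) / len(verdicts)

# Hypothetical verdicts: 3 of 4 claims in the answer are supported by the context.
faithfulness_score = ratio_of_supported([True, True, True, False])

# Hypothetical verdicts: 1 of 2 ground-truth statements is found in the context.
context_recall_score = ratio_of_supported([True, False])

print(faithfulness_score, context_recall_score)
```

Both scores fall between 0 and 1, with higher being better, which is the scale you will see in the results below.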

In [113]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [114]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/250 [00:00<?, ?it/s]

In [115]:
results

{'faithfulness': 0.7754, 'answer_relevancy': 0.6923, 'context_recall': 0.7400, 'context_precision': 0.6811, 'answer_correctness': 0.6022}

In [116]:
results_df = results.to_pandas()
results_df[:5]

|  | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Why is it generally a good idea to end up with... | It is generally a good idea to end up with sma... | [that you’re still ahead on headcount expense.... | By reducing the size of a team, and increasing... | 1.0 | 0.931375 | 1.0 | 1.0 | 0.978382 |
| 1 | How has evolution influenced the tendency of a... | Evolution has influenced the tendency of anima... | [Four: Doubt-Avoidance Tendency\nThe brain of ... | Evolution has influenced the tendency of anima... | 0.75 | 0.866072 | 1.0 | 1.0 | 0.744145 |
| 2 | What is the risk associated with the funding w... | The risk associated with the funding window no... | [Second, the funding window may not be open wh... | The risk associated with the funding window no... | 1.0 | 0.996141 | 1.0 | 1.0 | 0.88376 |
| 3 | How can the culture of a startup be affected b... | The culture of a startup can be significantly ... | [(In case you were wondering, by the way, the ... | The culture of a startup can be affected by th... | 1.0 | 0.965583 | 0.5 | 1.0 | 0.774824 |
| 4 | What are the intangibles that keep great peopl... | I do not know. The provided context does not i... | [feel like they’re the B team) and the great p... | The intangibles that keep great people are the... | 0.5 | 0.0 | 1.0 | 0.833333 | 0.212907 |


## Task 2: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #1:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

---
#### Activity 1 Markdown
The next cell changes the embedding model from ADA to `text-embedding-3-small` so we can test the claim about superiority.
Notice above, by the way, that my result statistics were very good already, so it will be interesting to see if this changes
any results.

---

In [117]:
te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

---
### Activity 1 Markdown (continued)
Now we create a new collection (same name with appended TE3 so we can tell the difference).
We also create a vector store, and add the same documents as in the previous situation.  The
key difference is that the embedding model is changed.  I also suppressed the output of the cell.

---

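### Collection creation and document upload are commented out because I already ran them and the collection already exists in Qdrant.

In [119]:
# Already run once; the TE3 collection now exists on the Qdrant server.
# qdrant_client.create_collection(
#     collection_name=COLLECTION_NAME+"TE3",
#     vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
# )

# Re-attach to the existing TE3 collection; without this, `qdrant_vector_store`
# would still point at the ADA collection when we build the retriever below.
qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME+"TE3",
    embedding=te3_embeddings,
)

# Already run once; the documents are stored in the collection.
# qdrant_vector_store.add_documents(documents);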

### Now make a retriever

In [120]:
te3_retriever = qdrant_vector_store.as_retriever()

### Creating a new abstraction.
The original code for the `retrieval_augmented_qa_chain` from earlier in the notebook is:

```python
retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
```
`create_stuff_documents_chain` allows us to pass a list of Documents to a model; its first two parameters are the model and the prompt. So `document_chain` is a Runnable with the model and the prompt already set up. Then we use `create_retrieval_chain` to attach our retriever to our `document_chain`. This abstracts away the details of the `document_chain`, so it makes the actions more understandable. Maybe.

In [121]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

In [122]:
from langchain.chains import create_retrieval_chain

te3_retrieval_chain = create_retrieval_chain(te3_retriever, document_chain)
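
To make the data flow concrete, here is a minimal plain-Python sketch of what the two helpers do conceptually. Everything in it (`fake_retriever`, `fake_llm`, `stuff_documents`, `retrieval_chain`, and the toy corpus) is a hypothetical stand-in, not LangChain code; it only mimics the `input` / `context` / `answer` shape that the later cells rely on.

```python
# Hypothetical stand-ins (not LangChain classes) that mimic the data flow of
# create_stuff_documents_chain followed by create_retrieval_chain.

def fake_retriever(query):
    """Return toy documents that share at least one word with the query."""
    corpus = [
        "Founders who are still active signal a young industry.",
        "Silicon Valley VCs recruit on the Stanford campus.",
    ]
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def fake_llm(prompt):
    """Stand-in for the model call: just report the prompt size."""
    return f"(model output for a {len(prompt)}-character prompt)"

def stuff_documents(docs, question):
    """'Stuff' every retrieved document into a single prompt string."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

def retrieval_chain(inputs):
    """Return the original input plus 'context' and 'answer' keys."""
    docs = fake_retriever(inputs["input"])
    answer = fake_llm(stuff_documents(docs, inputs["input"]))
    return {"input": inputs["input"], "context": docs, "answer": answer}

result = retrieval_chain({"input": "Which founders signal a young industry?"})
print(sorted(result))
```

The point of the sketch is only the shape of the result: the retrieved documents travel alongside the answer, which is what lets us collect contexts for RAGAS later.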

### Now we create our data for RAGAS.
We set up `answers` and `contexts` as empty lists, which we populate by invoking the `te3_retrieval_chain` created in the last cell. We loop through all 50 of our questions, take the answer and context from each chain response, and append them to the lists. Since the lists are in the same order as the questions in `test_questions`, this lets us build the (question, answer, contexts, ground truth) quadruples needed for the RAGAS evaluation.

In [123]:
answers = []
contexts = []

for question in test_questions:
  response = te3_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])
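
Because the evaluation depends on the four lists staying aligned, a small guard can catch a mismatch before any paid evaluation calls are made. This is a sketch only: `build_ragas_rows` is a hypothetical helper of my own, and the toy strings stand in for the real questions and answers.

```python
def build_ragas_rows(questions, answers, contexts, ground_truths):
    """Assemble the four aligned columns, failing loudly on a mismatch."""
    columns = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Each contexts entry must be a list of strings, one list per question.
    for i, ctx in enumerate(contexts):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return columns

# Toy data standing in for the real test set.
rows = build_ragas_rows(
    questions=["Q1", "Q2"],
    answers=["A1", "A2"],
    contexts=[["chunk a", "chunk b"], ["chunk c"]],
    ground_truths=["G1", "G2"],
)
print(list(rows))
```

The returned dictionary has the same keys that are passed to `Dataset.from_dict` in the next cell.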

### Which we now do, creating a Hugging Face Dataset:

In [50]:
te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

### Now we do the RAGAS evaluation
This scores the components of our dataset (question, answer, contexts, and ground truth) against the metrics that we imported earlier.

In [51]:
te3_advanced_retrieval_results = evaluate(te3_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/250 [00:00<?, ?it/s]

### Now look at the results of this run

In [124]:
te3_advanced_retrieval_results

{'faithfulness': 0.7271, 'answer_relevancy': 0.8679, 'context_recall': 0.8300, 'context_precision': 0.7167, 'answer_correctness': 0.6875}

### Read the results into a pandas DataFrame and compare

In [125]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']

df_merged

| | Metric | ADA | TE3 | Baseline -> TE3 |
|---|---|---|---|---|
| 0 | faithfulness | 0.775389 | 0.727133 | -0.048256 |
| 1 | answer_relevancy | 0.692285 | 0.86788 | 0.175596 |
| 2 | context_recall | 0.74 | 0.83 | 0.09 |
| 3 | context_precision | 0.681111 | 0.716667 | 0.035556 |
| 4 | answer_correctness | 0.602239 | 0.687484 | 0.085245 |


#### ❓ Question #3:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

---
### ANSWER #3:

Faithfulness took a small hit (0.775 to 0.727). Recall that this metric evaluates whether the generated answers are supported by the retrieved context. This matters because we would like to prevent hallucinations, but the meaning of a change of this size is uncertain, since it holds only in the "context" of the questions that were generated earlier. Importantly, all other metrics improved, especially answer relevancy (0.692 to 0.868), which measures how well the answer addresses the question, penalizing irrelevant or redundant information. Answer correctness always seems useful to me, and it improved as well (0.602 to 0.687).

I would agree that the new OpenAI embedding model is superior to the older ADA model.
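
One way to put a number on the "uncertain meaning" of a change like the faithfulness dip is a paired bootstrap over per-question scores. The sketch below assumes per-question scores are available for both runs; the two score lists are made-up toy values, not results from this notebook, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference, lower, upper) for a 95% bootstrap interval
    of mean(scores_b) - mean(scores_a), paired by question."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Scores must be paired per question")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper

# Toy per-question scores (illustrative only).
scores_ada = [0.6, 0.7, 0.5, 0.8, 0.65, 0.7, 0.55, 0.75]
scores_te3 = [0.8, 0.9, 0.7, 0.85, 0.9, 0.8, 0.75, 0.95]

mean_diff, lower, upper = paired_bootstrap_ci(scores_ada, scores_te3)
print(f"mean difference {mean_diff:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the interval for a metric excludes zero, the change is less likely to be an artifact of which 50 questions happened to be generated; if it straddles zero, the difference should be read with caution.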

---

## Task 5: Selecting an Advanced Retriever and Evaluating

#### 🏗️ Activity #2

While the changes that occurred due to modifying the embedding model were desirable - you're now tasked with improving `context_recall`, or `context_precision` (or both!).

You'll follow these steps:

1. Reason about this list of Advanced Retrieval methods:
  - [Contextual Compression (Reranker)](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/)
  - [MultiQueryRetriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/)
  - [Parent Document Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/)
2. Select the method you think will be the most performant.
3. Implement that method.
4. Create an LCEL chain that utilizes the new Retriever method.
5. Evaluate this LCEL and compare to the TE3 results.

> NOTE: We will spend more time in Session 14 diving into advanced retrieval methods, this activity is only to serve as a basic introduction to the idea of component-wise improvements and how they might impact metrics.

In [126]:
### Let's see if we can implement the parent-child system!  Seems most valuable for my main project ideas.

from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

# Make sure we are starting where we started:
PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"
documents = PyMuPDFLoader(PDF_LINK).load()

# Set up two splitters - parent and child
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    add_start_index=True
)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    add_start_index=True
)

te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


### The following code is commented because I already ran it and the collection already exists in Qdrant

In [64]:

# # vectorstore for child chunks
# child_qdrant_client = QdrantClient("localhost:55002")
# CHILD_COLLECTION_NAME = "CHILD-200-TE3"
# child_qdrant_client.create_collection(
#     collection_name=CHILD_COLLECTION_NAME,
#     vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
# )
# child_vector_store = QdrantVectorStore(
#     client=child_qdrant_client,
#     collection_name=CHILD_COLLECTION_NAME,
#     embedding=te3_embeddings
# )


In [127]:
# Set up store for parent documents
store = InMemoryStore()

# Now create the parent child retriever
pc_retriever = ParentDocumentRetriever(
    vectorstore=child_vector_store,
    docstore = store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
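
For intuition, the parent-child idea can be sketched in plain Python: search over small child chunks, then return the larger parent chunk each hit came from, with each parent returned only once. This is a simplified illustration, not how `ParentDocumentRetriever` is implemented; the fixed-width splitter, the word-overlap score, and the toy documents are all assumptions made for the sketch.

```python
def split_text(text, chunk_size):
    """Naive fixed-width splitter (stand-in for RecursiveCharacterTextSplitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_index(documents, parent_size, child_size):
    """Return (parent_store, child_index); each child remembers its parent id."""
    parent_store = {}
    child_index = []  # list of (child_text, parent_id)
    for doc in documents:
        for parent in split_text(doc, parent_size):
            parent_id = len(parent_store)
            parent_store[parent_id] = parent
            for child in split_text(parent, child_size):
                child_index.append((child, parent_id))
    return parent_store, child_index

def overlap(query, text):
    """Toy relevance score: count of shared lowercase words (not an embedding)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, parent_store, child_index, k=4):
    """Rank the children, then return each matching parent once, in rank order."""
    ranked = sorted(child_index, key=lambda item: overlap(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_store[pid] for pid in parent_ids]

# Toy documents (illustrative only).
documents = [
    "Silicon Valley startups raise money from venture funds. " * 4,
    "Tomatoes and basil grow well together in a sunny garden. " * 4,
]
parent_store, child_index = build_index(documents, parent_size=120, child_size=40)
parents = retrieve_parents("silicon valley startups", parent_store, child_index)
print(len(child_index), "children,", len(parent_store), "parents,", len(parents), "returned")
```

Because several top-ranked children can share a parent, the number of parents returned can be smaller than `k`.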


In [128]:
# Add documents
pc_retriever.add_documents(documents)


In [130]:
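# How many child docs are in the vector store?  Expected 4588 (from local server).
# Note: a count of 9176 (= 2 x 4588) suggests the documents were added to the existing collection a second time.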
collection_info = child_qdrant_client.get_collection(CHILD_COLLECTION_NAME)
num_vectors = collection_info.points_count
print(num_vectors)

9176


In [131]:
# Parent docs that are in memory
len(list(store.yield_keys()))

225

In [132]:
# Now let's do a similarity search from the vector store, and then try to retrieve the parent
sub_docs = qdrant_vector_store.similarity_search("Silicon Valley startups")
print(len(sub_docs))
for sub_doc in sub_docs:
    print(sub_doc.page_content)

4
section of Silicon Valley startups — so don’t think that anything
I am talking about is referring to one of my own companies:
most likely when I talk about a scenario I have seen or some-
companies as Sun, Cisco, Yahoo, and Google, so needless to say,
Silicon Valley VCs are continually on the prowl on the Stanford
engineering campus for the next Jerry Yang or Larry Page.
start new companies when they could just park it on a beach and
suck down mai tais?
First, in my experience, Silicon Valley entrepreneurs are all over
white! In Silicon Valley, for example, it can still make a lot of
sense for a young parent to take a risk on a hot new startup
because it will usually be easy to get another job if the startup


In [133]:
# Now let's use the pc_retriever to get parent documents
retrieved_parent_documents = pc_retriever.invoke("Silicon Valley startups")
print(len(retrieved_parent_documents))
print(retrieved_parent_documents[0].page_content)

2
Part 1: Why not to do a startup
In this series of posts I will walk through some of my accumu-
lated knowledge and experience in building high-tech startups.
My speciXc experience is from three companies I have co-
founded: Netscape, sold to America Online in 1998 for $4.2
billion; Opsware (formerly Loudcloud), a public soaware com-
pany with an approximately $1 billion market cap; and now
Ning, a new, private consumer Internet company.
But more generally, I’ve been fortunate enough to be involved
in and exposed to a broad range of other startups — maybe 40
or 50 in enough detail to know what I’m talking about — since
arriving in Silicon Valley in 1994: as a board member, as an angel
investor, as an advisor, as a friend of various founders, and as a
participant in various venture capital funds.
This series will focus on lessons learned from this entire cross-
section of Silicon Valley startups — so don’t think that anything
I am talking about is referring to one of my own companies:


In [134]:
# So this appears to work.  Now I need to build a RAG chain.  This is easy because of the earlier abstraction!!

pc_retrieval_chain = create_retrieval_chain(pc_retriever, document_chain)

In [135]:
# Now see if it works
question = "What is a rule of thumb for selecting an industry to invest in?"
print(pc_retrieval_chain.invoke({"input": question})["answer"])


The rule of thumb for selecting an industry to invest in is to pick an industry where the founders of the important companies are still alive and actively involved. This can be determined by looking at the CEO, chairman or chairwoman, and board of directors for the major companies in the industry. If the founders are currently serving in these roles, it indicates that the industry is likely still young, vital, and full of opportunities.


In [136]:
# Now I need to do the RAGAS evaluation with this retrieval chain
answers = []
contexts = []

for question in test_questions:
  response = pc_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])


response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})


In [137]:
pc_results = evaluate(response_dataset, metrics)
pc_results

Evaluating:   0%|          | 0/250 [00:00<?, ?it/s]

{'faithfulness': 0.7757, 'answer_relevancy': 0.8468, 'context_recall': 0.9900, 'context_precision': 0.9000, 'answer_correctness': 0.6594}

In [141]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])
df_parents = pd.DataFrame(list(pc_results.items()),columns=['Metric', 'PC'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')
df_merged = pd.merge(df_merged, df_parents, on = 'Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']
df_merged['% TE3'] = 100*(df_merged['TE3'] - df_merged['ADA'])/df_merged['ADA']
df_merged['Baseline -> PC'] = df_merged['PC'] - df_merged['ADA']
df_merged['% PC'] = 100*(df_merged['PC'] - df_merged['ADA'])/df_merged['ADA']

df_merged

| | Metric | ADA | TE3 | PC | Baseline -> TE3 | % TE3 | Baseline -> PC | % PC |
|---|---|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.775389 | 0.727133 | 0.775739 | -0.048256 | -6.223436 | 0.00035 | 0.045189 |
| 1 | answer_relevancy | 0.692285 | 0.86788 | 0.846816 | 0.175596 | 25.364697 | 0.154531 | 22.321949 |
| 2 | context_recall | 0.74 | 0.83 | 0.99 | 0.09 | 12.162162 | 0.25 | 33.783784 |
| 3 | context_precision | 0.681111 | 0.716667 | 0.9 | 0.035556 | 5.220228 | 0.218889 | 32.137031 |
| 4 | answer_correctness | 0.602239 | 0.687484 | 0.659416 | 0.085245 | 14.154662 | 0.057178 | 9.494193 |
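
The table above measures both runs against the ADA baseline, while step 5 of the activity asks for a comparison to the TE3 results. A small sketch of that comparison, using the TE3 and PC scores copied from the table (the `percent_change` helper and the variable names are my own):

```python
# Scores copied from the TE3 and PC columns of the comparison table.
te3 = {
    "faithfulness": 0.727133,
    "answer_relevancy": 0.86788,
    "context_recall": 0.83,
    "context_precision": 0.716667,
    "answer_correctness": 0.687484,
}
pc = {
    "faithfulness": 0.775739,
    "answer_relevancy": 0.846816,
    "context_recall": 0.99,
    "context_precision": 0.9,
    "answer_correctness": 0.659416,
}

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100 * (new - old) / old

# Absolute change from the TE3 run to the parent-child run, per metric.
deltas = {metric: pc[metric] - te3[metric] for metric in te3}

for metric, delta in deltas.items():
    pct = percent_change(te3[metric], pc[metric])
    print(f"{metric:20s} {delta:+.4f} ({pct:+.1f}%)")
```

Measured against TE3, the parent-child run raises faithfulness, context_recall, and context_precision, while answer_relevancy and answer_correctness dip slightly.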


---
### ACTIVITY 2 COMMENTS
I chose the parent-child approach because it is a very desirable thing to try in many of my clinical applications.

The results show that this advanced retrieval method improved context_recall and context_precision rather drastically: context_precision rose to 0.90 and context_recall reached 0.99. Answer correctness, however, came in slightly below the TE3 run (0.659 versus 0.687).

---

#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of the completion of the notebook. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
  - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
  - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
  - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
  - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
  - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Chunk semantically similar (based on designed threshold) sentences, and then paragraphs, greedily, up to a maximum chunk size. Minimum chunk size is a single sentence.
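
As a starting point, the sentence-level part of this requirement might be sketched as follows. This is only one possible reading: it uses a toy bag-of-words cosine as a stand-in for real embedding similarity, the `threshold` and `max_chars` values are arbitrary, and it does not implement the paragraph-level pass. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Toy similarity: cosine between bag-of-words counts (stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_semantic_chunks(text, threshold=0.2, max_chars=200, similarity=bow_cosine):
    """Greedily merge consecutive sentences while they are similar enough
    to the current chunk and the merged chunk stays within max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for sentence in sentences:
        if chunks:
            candidate = chunks[-1] + " " + sentence
            if len(candidate) <= max_chars and similarity(chunks[-1], sentence) >= threshold:
                chunks[-1] = candidate
                continue
        chunks.append(sentence)  # a single sentence is the minimum chunk
    return chunks

text = (
    "Startups need funding. "
    "Funding for startups comes from venture funds. "
    "Tomatoes grow in sunny gardens."
)
chunks = greedy_semantic_chunks(text)
for chunk in chunks:
    print(repr(chunk))
```

Swapping `similarity` for a function that compares embedding vectors would bring the sketch closer to the intended strategy.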

Have fun!