# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.2.0](https://python.langchain.com/v0.2/docs/versions/v0_2/)
  4. Synthetic Dataset Generation for Evaluation using the [Ragas](https://github.com/explodinggradients/ragas) framework.
  

- 🤝 Breakout Room Part #2:
  1. Evaluating our pipeline with Ragas
  2. Making Adjustments to our RAG Pipeline
  3. Evaluating our Adjusted pipeline against our baseline
  4. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage. Most of this cost comes from the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`!

## Motivation

OpenAI claims that their `text-embedding-3-small` model is (generally) better than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [4]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [5]:
!pip install -qU ragas 

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [6]:
!pip install -qU qdrant-client pymupdf pandas  

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [7]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key
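
As an optional variation, here is a minimal sketch that only prompts for a key when the environment variable is not already set, which is handy when re-running the notebook. The helper name `ensure_env_var` and the demo variable name are hypothetical, and the prompt function is injectable so the example runs without a terminal.

```python
import os
from getpass import getpass

def ensure_env_var(name, prompt_fn=getpass):
    """Return the value of `name`, prompting for it only when it is not already set."""
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Please provide {name}: ")
        os.environ[name] = value
    return value

# Demo with a made-up variable name and a fake prompt, so no real secret is involved.
os.environ.pop("DEMO_API_KEY_FOR_SKETCH", None)
demo_value = ensure_env_var("DEMO_API_KEY_FOR_SKETCH", prompt_fn=lambda _: "sk-demo")
print(demo_value)
```

In the notebook itself you would call `ensure_env_var("OPENAI_API_KEY")` and let the default `getpass` prompt appear.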

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [8]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

documents = loader.load()

In [9]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [11]:
len(documents)

1864
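
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification: `RecursiveCharacterTextSplitter` tries to split on separators (such as paragraph and line breaks) before falling back to smaller units, so its chunks will differ. `sliding_window_chunks` is a hypothetical helper, not a LangChain API.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.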

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [12]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
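
Once text is embedded, "comparing to our query vector" usually means computing a similarity score such as cosine similarity and ranking by it. The sketch below uses tiny hand-made 3-dimensional vectors as stand-ins for real embeddings (the labels and numbers are invented for illustration), and `cosine_similarity` is a plain-Python helper rather than anything from the OpenAI or LangChain libraries.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy, hand-made vectors standing in for real embeddings.
toy_index = {
    "investing rule of thumb": [0.9, 0.1, 0.0],
    "hiring executives": [0.1, 0.9, 0.1],
    "pink floyd lyrics": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Rank the stored items from most to least similar to the query.
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query_vector, toy_index[name]),
    reverse=True,
)
print(ranked)
```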

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [13]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="PMarca Blogs",
)

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

**ANSWER**
Qdrant is a vector database designed to be performant and scalable. A few of the techniques it uses:
- Approximate nearest-neighbor indexing: Qdrant builds an HNSW (Hierarchical Navigable Small World) graph index, so a query is compared against a small part of the collection rather than every vector.
- Quantization: vectors can be compressed (scalar, product, or binary quantization) to reduce memory use and speed up similarity computation.
- Payload indexing: indexes on payload (metadata) fields let filters be applied efficiently during vector search.
- Storage options: vectors can be held in memory for fast access or memory-mapped from disk for larger collections, and the engine itself is written in Rust.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [14]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [15]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [16]:
for doc in retrieved_documents:
  print(doc)

page_content='the existing order — and make sure that those forces of change
have a reasonable chance at succeeding.
Second rule of thumb:
Once you have picked an industry, get right to the center of it' metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '807ef480cdea477bb6803a72971d9136', '_collection_name': 'PMarca Blogs'}
page_content='Third rule:
In a rapidly changing Held like technology, the best place to
get experience when you’re starting out is in younger, high-
growth companies.' metadata={'source': 'https://d1lamhf6l6yk6d.cloud

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [17]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [18]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [19]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas.

In [20]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

The chain above takes a question, retrieves relevant context, formats a prompt with the question and context, sends it to OpenAI/LLM, and returns the response including the used context.
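
The same data flow can be sketched in plain Python, which makes it easier to see why the context survives to the final output. This is only an illustration: `fake_retriever` and `fake_llm` are hypothetical stand-ins for the Qdrant retriever and the chat model, and no LangChain objects are involved.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for the vector store retriever: returns canned 'documents'."""
    return ["Second rule of thumb: once you have picked an industry, get right to the center of it."]

def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model: reports the prompt size instead of calling an API."""
    return f"(model reply to a {len(prompt)}-character prompt)"

TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n"

def run_chain(inputs: dict) -> dict:
    # Step 1: fan the question out into a retrieved context and a copy of itself.
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: format the prompt and call the model, keeping the context next to the response.
    prompt = TEMPLATE.format(context="\n".join(state["context"]), question=state["question"])
    return {"response": fake_llm(prompt), "context": state["context"]}

result = run_chain({"question": "What is a rule of thumb for selecting an industry to invest in?"})
print(result["response"])
print(result["context"])
```

Returning `context` alongside `response` is the important design choice here: the evaluation later needs the retrieved chunks, not just the generated answer.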

Let's test it out!

In [21]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

The rule of thumb for selecting an industry to invest in, as mentioned in the context, is to pick an industry and get right to the center of it.


In [22]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'a1948dc5f56d4925bbc3dd4ed4176dec', '_collection_name': 'PMarca Blogs'}, page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.'), Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [23]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

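**Answer**
We split with different parameters so that the synthetic test set is not built from the exact chunks stored in our index. If the generator saw the same 200-character chunks that the retriever returns, each question would line up with a single stored chunk and retrieval would look better than it really is. Using larger chunks (600 characters here) also gives the generator enough context to write meaningful questions and ground truths, including ones whose answers span several of the pipeline's chunks, which is closer to how real users ask questions.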

In [24]:
len(eval_documents)

624


> NOTE: 🛑 Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage. Most of this cost comes from the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage GPT-3.5-Turbo as the critic_llm. If you're attempting to create a lot of samples, please be aware of cost as well as rate limits. 🛑

In [26]:
import nest_asyncio
nest_asyncio.apply()
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

eval_documents = eval_documents[0:500]
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
# critic_llm = ChatOpenAI(model="gpt-3.5-turbo") <--- If you don't have GPT-4 access, or to reduce cost/rate limiting issues.
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(eval_documents, 20, distributions, is_async = False)
testset.to_pandas() 


Filename and doc_id are the same for all nodes.



|   | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the importance of putting yourself in ... | [— put yourself in situations where you will s... | Putting yourself in situations where you will ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | How do VC bloggers encourage entrepreneurs to ... | [VCs who blog are interested in which kinds of... | VC bloggers may encourage entrepreneurs to com... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | Why is rigorous thinking important in the real... | [•\nPlus, technical degrees teach you how thin... | Rigorous thinking is important in the real wor... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | What is one reason why managing executives can... | [Third, whenever possible, promote from within... | Managing executives can be challenging because... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | What is the importance of maintaining a clear ... | [asked to do it.\nOr, better yet, just say you... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | Why is it important to seek out and create opp... | [sented with an opportunity like one of the ab... | Seeking out and creating opportunities is impo... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | Why should people seize opportunities when the... | [do, just sit down and tease apart the risks —... | People should seize opportunities when they ar... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | How does shifting from stock options to restri... | [favor of restricted stock at Microsoa. [Gates... | Shifting from stock options to restricted stoc... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | Why is it important to focus on strength rathe... | [manage executives. It turns out that just abo... | When managing executives, it is important to f... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | How does winning impact employee retention in ... | [Part 2: Retaining great people\nThis post is ... | This post is not about retention, but about wi... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


#### ❓ Question #3:

`{simple: 0.5, multi_context: 0.4, reasoning: 0.1}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

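**ANSWER**
The mapping sets the proportion of generated questions that come from each evolution type, so the values should sum to 1.
- `simple`: straightforward questions that can be answered from a single piece of context
- `reasoning`: more complex questions that require reasoning over the context to answer
- `multi_context`: questions that require information from multiple context items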
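
To make the proportions concrete, here is a minimal sketch that converts the distribution used in the generation cell above into approximate question counts for a test set of 20. The proportional rounding is an assumption made for illustration; Ragas performs its own sampling, so the actual mix it produces may differ. `expected_counts` is a hypothetical helper.

```python
def expected_counts(distribution: dict[str, float], test_size: int) -> dict[str, int]:
    """Approximate number of questions per evolution type for a given test set size."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"proportions should sum to 1, got {total}")
    return {name: round(share * test_size) for name, share in distribution.items()}

# Plain string keys stand in for the Ragas evolution objects.
distribution = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
counts = expected_counts(distribution, test_size=20)
print(counts)
```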

Let's look at the output and see what we can learn about it!

In [27]:
testset.test_data[0]

DataRow(question='What is the importance of putting yourself in situations where you will succeed or fail based on your own decisions and actions?', contexts=['— put yourself in situations where you will succeed or fail by\nyour own decisions and actions, and where that success or fail-\nure will be highly visible.\nBy failure I don’t mean getting a B or even a C, but rather: having\nyour boss yell at you in front of your peers for screwing up a\nproject, launching a product and seeing it tank, being unable to\nmeet a ship date, missing a critical piece of information in a\nXnancial report, or getting Xred.\nWhy? If you’re going to be a high achiever, you’re going to be\nin lots of situations where you’re going to be quickly making deci-'], ground_truth='Putting yourself in situations where you will succeed or fail based on your own decisions and actions is important because it allows you to learn from both success and failure. It provides opportunities for growth, development, and imp

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [28]:
test_df = testset.to_pandas()

In [29]:
test_df

|   | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the importance of putting yourself in ... | [— put yourself in situations where you will s... | Putting yourself in situations where you will ... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 1 | How do VC bloggers encourage entrepreneurs to ... | [VCs who blog are interested in which kinds of... | VC bloggers may encourage entrepreneurs to com... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 2 | Why is rigorous thinking important in the real... | [•\nPlus, technical degrees teach you how thin... | Rigorous thinking is important in the real wor... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 3 | What is one reason why managing executives can... | [Third, whenever possible, promote from within... | Managing executives can be challenging because... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 4 | What is the importance of maintaining a clear ... | [asked to do it.\nOr, better yet, just say you... | The answer to given question is not present in... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 5 | Why is it important to seek out and create opp... | [sented with an opportunity like one of the ab... | Seeking out and creating opportunities is impo... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 6 | Why should people seize opportunities when the... | [do, just sit down and tease apart the risks —... | People should seize opportunities when they ar... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 7 | How does shifting from stock options to restri... | [favor of restricted stock at Microsoa. [Gates... | Shifting from stock options to restricted stoc... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 8 | Why is it important to focus on strength rathe... | [manage executives. It turns out that just abo... | When managing executives, it is important to f... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |
| 9 | How does winning impact employee retention in ... | [Part 2: Retaining great people\nThis post is ... | This post is not about retention, but about wi... | simple | [{'source': 'https://d1lamhf6l6yk6d.cloudfront... | True |


In [30]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll run the generated questions through our RAG pipeline to produce responses - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [41]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [42]:
from datasets import Dataset

TRUNC_SIZE = 10 # Truncating to avoid OpenAI rate limits
response_dataset = Dataset.from_dict({
    "question" : test_questions[0:TRUNC_SIZE],
    "answer" : answers[0:TRUNC_SIZE],
    "contexts" : contexts[0:TRUNC_SIZE],
    "ground_truth" : test_groundtruths[0:TRUNC_SIZE]
})

Let's take a peek and see what that looks like!

In [44]:
response_dataset[0]

{'question': 'What is the importance of putting yourself in situations where you will succeed or fail based on your own decisions and actions?',
 'answer': "The importance of putting yourself in situations where you will succeed or fail based on your own decisions and actions, as described in the provided context, is to expose yourself to real-world challenges and risks. This approach allows for personal growth and learning from direct experiences. It emphasizes the value of taking responsibility for one's own outcomes, which can be particularly appealing and beneficial for individuals who prefer autonomy and wish to avoid being directed by others. This can lead to a deeper understanding of one's capabilities and limitations, fostering self-reliance and decision-making skills.",
 'contexts': ['— put yourself in situations where you will succeed or fail by\nyour own decisions and actions, and where that success or fail-\nure will be highly visible.',
  'destiny — you get to succeed or f

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
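
As a rough intuition (the linked documentation is the authoritative source), several of these metrics reduce to simple ratios once a judge has labeled each claim or chunk. The sketch below assumes those judgments are already available as hand-written booleans; in Ragas they come from LLM calls. The function names are hypothetical and are not Ragas APIs.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported)

def context_recall_score(gt_sentence_attributed: list[bool]) -> float:
    """Fraction of ground-truth sentences that can be attributed to the retrieved context."""
    return sum(gt_sentence_attributed) / len(gt_sentence_attributed)

def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hand-labeled toy judgments, invented for illustration.
print(faithfulness_score([True, True, True, False, False]))    # 3 of 5 claims supported
print(context_recall_score([True, False, False]))              # 1 of 3 sentences attributed
print(context_precision_score([False, False, True, False]))    # only the 3rd chunk is relevant
```

Note how context precision rewards ranking: the same single relevant chunk scores 1.0 when it is retrieved first and roughly 0.33 when it appears third.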

In [46]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [47]:
results = evaluate(response_dataset, metrics)


In [48]:
results

{'faithfulness': 0.7892, 'answer_relevancy': 0.9845, 'context_recall': 0.7500, 'context_precision': 0.7556, 'answer_correctness': 0.5010}

In [49]:
results_df = results.to_pandas()
results_df

|   | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the importance of putting yourself in ... | The importance of putting yourself in situatio... | [— put yourself in situations where you will s... | Putting yourself in situations where you will ... | 0.6 | 1.0 | 0.333333 | 1.0 | 0.772864 |
| 1 | How do VC bloggers encourage entrepreneurs to ... | VC bloggers encourage entrepreneurs to communi... | [At best, a VC blogger may encourage her reade... | VC bloggers may encourage entrepreneurs to com... | 1.0 | 1.0 | 1.0 | 0.805556 | 0.619721 |
| 2 | Why is rigorous thinking important in the real... | Rigorous thinking is important in the real wor... | [reason, logic, and data. This is incredibly u... | Rigorous thinking is important in the real wor... | 0.75 | 1.0 | 0.5 | 1.0 | 0.493018 |
| 3 | What is one reason why managing executives can... | One reason why managing executives can be chal... | [Managing\nFirst, manage your executives.\nIt’... | Managing executives can be challenging because... | 0.666667 | 0.984871 | 1.0 | 1.0 | 0.230253 |
| 4 | What is the importance of maintaining a clear ... | The importance of maintaining a clear gaze in ... | [says, “You know you’re a good leader when peo... | The answer to given question is not present in... | 0.5 | 1.0 | 1.0 | 0.0 | 0.183744 |
| 5 | Why is it important to seek out and create opp... | Based on the context provided, it is important... | [One of the single best ways you can maximize ... | Seeking out and creating opportunities is impo... | 1.0 | 1.0 | 1.0 | 0.916667 | 0.535655 |
| 6 | Why should people seize opportunities when the... | People should seize opportunities when they ar... | [of a great opportunity. They oaen won’t.\nOne... | People should seize opportunities when they ar... | 0.625 | 1.0 | 0.666667 | 0.333333 | 0.353097 |
| 7 | How does shifting from stock options to restri... | Shifting from stock options to restricted stoc... | [shiaing from stock options to restricted stoc... | Shifting from stock options to restricted stoc... | 1.0 | 0.947798 | 1.0 | 1.0 | 0.618233 |
| 8 | Why is it important to focus on strength rathe... | The context suggests that it is important to f... | [back — perhaps abstractly — about all of her ... | When managing executives, it is important to f... | 1.0 | 0.912358 | 0.25 | 0.5 | 0.680213 |
| 9 | How does winning impact employee retention in ... | Winning impacts employee retention in companie... | [The only way a company in that situation can ... | This post is not about retention, but about wi... | 0.75 | 1.0 | 0.75 | 1.0 | 0.52325 |
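
The summary scores printed earlier appear to be the mean of the per-question scores. As a quick sanity check, this sketch averages the `faithfulness` and `context_recall` columns copied from the table above; treating the aggregate as a plain mean is an inference from these numbers rather than something the notebook states.

```python
from statistics import mean

# Per-question scores copied from the results table above.
faithfulness_rows = [0.6, 1.0, 0.75, 0.666667, 0.5, 1.0, 0.625, 1.0, 1.0, 0.75]
context_recall_rows = [0.333333, 1.0, 0.5, 1.0, 1.0, 1.0, 0.666667, 1.0, 0.25, 0.75]

summary = {
    "faithfulness": round(mean(faithfulness_rows), 4),
    "context_recall": round(mean(context_recall_rows), 4),
}
print(summary)
```

Both values line up with the `faithfulness` (0.7892) and `context_recall` (0.7500) entries in the summary output. With only ten questions, a single row moves a mean by up to 0.1, so small differences between pipelines should be read with care.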


## Task 2: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!

In [50]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
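
Conceptually, multi-query retrieval rephrases the question several ways, retrieves for each phrasing, and keeps the unique union of the results. The sketch below shows that idea with a toy keyword retriever; in `MultiQueryRetriever` the rephrasings are written by the LLM, whereas here `rephrase` is a hard-coded, hypothetical stand-in, as are `keyword_retrieve` and the three-sentence corpus.

```python
import re

# Tiny stand-in corpus; the real pipeline searches the Qdrant index instead.
CORPUS = [
    "Pick an industry where forces of change have a chance of succeeding.",
    "Once you have picked an industry, get right to the center of it.",
    "Promote from within whenever possible.",
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(CORPUS, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def rephrase(question: str) -> list[str]:
    """Hard-coded stand-in for the LLM that writes alternative phrasings."""
    return [
        "how to pick an industry with forces of change",
        "where to position yourself once you picked an industry center",
    ]

def multi_query_retrieve(question: str, k: int = 1) -> list[str]:
    merged = []
    for query in rephrase(question):
        for doc in keyword_retrieve(query, k):
            if doc not in merged:  # keep only the first copy of each document
                merged.append(doc)
    return merged

question = "What is a rule of thumb for selecting an industry?"
single = keyword_retrieve(question, k=1)
multi = multi_query_retrieve(question, k=1)
print(len(single), len(multi))
```

The trade-off is cost: each extra phrasing means another retrieval call (and, in the real retriever, an LLM call to write the phrasings) in exchange for a broader set of candidate contexts.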

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [51]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [52]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [53]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift fueding with?"})

In [54]:
print(response["answer"])

The provided text does not contain any information about Taylor Swift or any feuds involving her. If you have any other questions or need information on a different topic, feel free to ask!


In [55]:
response = retrieval_chain.invoke({"input": "Why are they fueding?"})

In [56]:
print(response["answer"])

The text does not provide specific details about any individuals or entities feuding. It discusses various topics such as human nature, business practices, and historical patterns of behavior, but it does not mention a specific feud or conflict between particular parties. If you are referring to a specific feud mentioned in a different part of the text or context, please provide more details so I can help you better understand the situation.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [57]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [58]:
REDUCED_SIZE = TRUNC_SIZE // 2 # reduced size to fit OpenAI rate limits
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions[0:REDUCED_SIZE],
    "answer" : answers[0:REDUCED_SIZE],
    "contexts" : contexts[0:REDUCED_SIZE],
    "ground_truth" : test_groundtruths[0:REDUCED_SIZE]
})
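
Since each column is sliced independently, it can help to confirm that the columns still line up row for row before building the dataset. The helper below is a hypothetical sketch (`build_eval_columns` is not part of Ragas or `datasets`) and runs on toy data only.

```python
def build_eval_columns(questions, answers, contexts, ground_truths, size):
    """Truncate each column to `size` rows and check that they line up."""
    columns = {
        "question": questions[:size],
        "answer": answers[:size],
        "contexts": contexts[:size],
        "ground_truth": ground_truths[:size],
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return columns


# Toy data only; the real lists come from the collection loop in this notebook.
toy_columns = build_eval_columns(
    questions=["q1", "q2", "q3", "q4"],
    answers=["a1", "a2", "a3", "a4"],
    contexts=[["c1"], ["c2"], ["c3"], ["c4"]],
    ground_truths=["g1", "g2", "g3", "g4"],
    size=2,
)
print({name: len(values) for name, values in toy_columns.items()})
```

A check like this catches the case where one list ends up shorter than the requested size, for example if the collection loop stopped early.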

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [59]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)


In [None]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the potential risks and challenges fa... | Not raising enough money risks the survival of... | [Here’s why you shouldn’t do that:\nWhat are t... | Not raising enough money risks the survival of... | 0.666667 | 0.921043 | 1.0 | 0.804167 | 0.346032 |
| 1 | How can a transformative deal inject energy ba... | A transformative deal can inject energy back i... | [ossiXed that the clear opportunity exists to ... | | 0.5 | 0.985384 | 0.0 | 0.0 | 0.180063 |
| 2 | Who is quoted as saying "This fucking job ain'... | The quote "This fucking job ain't that fucking... | [then doesn’t know what the friend is currentl... | The great Tommy Lasorda | 1.0 | 0.979978 | 1.0 | 0.5 | 0.717253 |
| 3 | What is Los Angeles known for in terms of oppo... | Los Angeles is known for opportunities and ind... | [the city where all the action is happening.\n... | Los Angeles is known for opportunities and ind... | 1.0 | 0.999999 | 1.0 | 1.0 | 0.883862 |
| 4 | How can blogging about their point of view hel... | Blogging about their point of view can help an... | [esting things going on, about their point of ... | Blogging about their point of view can help an... | 1.0 | 0.975714 | 1.0 | 1.0 | 0.747243 |
| 5 | Are the anticipated career peaks for creators ... | The anticipated career peaks for creators who ... | [early, end late, and produce at above-average... | | 0.666667 | 0.982765 | 0.0 | 0.75 | 0.184018 |
| 6 | What is the impact of accelerated vesting on c... | Accelerated vesting on change of control can i... | [Give her a stock-option grant with accelerate... | Give her a stock-option grant with accelerated... | 0.25 | 0.927667 | 1.0 | 1.0 | 0.897261 |
| 7 | What is the benefit of combining a useful grad... | Combining a useful graduate degree with a subs... | [degree, you are much better oW combining it w... | You become a "double threat" by combining a us... | 1.0 | 0.907999 | 1.0 | 0.833333 | 0.538305 |
| 8 | What are some ways to bootstrap a business bef... | Some ways to bootstrap a business before raisi... | [This obviously raises the issue of how you’re... | Try to raise angel money, or bootstrap oW of i... | 1.0 | 1.0 | 1.0 | 1.0 | 0.559488 |
| 9 | What is Richard Morgan's opinion of Peter Hami... | Richard Morgan views Peter Hamilton as the cle... | [ders and executions around the clock. Haven’t... | Richard Morgan blurbed Peter Hamilton's storyt... | 0.0 | 0.939448 | 1.0 | 0.691667 | 0.614487 |


## Task 3: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [60]:
results

{'faithfulness': 0.7892, 'answer_relevancy': 0.9845, 'context_recall': 0.7500, 'context_precision': 0.7556, 'answer_correctness': 0.5010}

Now let's see how our advanced retrieval changed those metrics!

In [61]:
advanced_retrieval_results

{'faithfulness': 0.6800, 'answer_relevancy': 0.9887, 'context_recall': 0.6000, 'context_precision': 0.6842, 'answer_correctness': 0.4769}

In [62]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.789167 | 0.68 | -0.109167 |
| 1 | answer_relevancy | 0.984503 | 0.9887 | 0.004197 |
| 2 | context_recall | 0.75 | 0.6 | -0.15 |
| 3 | context_precision | 0.755556 | 0.684218 | -0.071338 |
| 4 | answer_correctness | 0.501005 | 0.476857 | -0.024147 |
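
Absolute deltas can be hard to compare across metrics that sit at different levels, so a relative change is sometimes a useful companion. The sketch below recomputes the comparison in plain Python from the rounded aggregate scores printed earlier in this task. `compare_metrics` is a hypothetical helper, and the percentage column is an addition for illustration rather than part of the original analysis.

```python
# Rounded aggregate scores as printed by `results` and `advanced_retrieval_results`.
baseline = {
    "faithfulness": 0.7892,
    "answer_relevancy": 0.9845,
    "context_recall": 0.7500,
    "context_precision": 0.7556,
    "answer_correctness": 0.5010,
}
advanced = {
    "faithfulness": 0.6800,
    "answer_relevancy": 0.9887,
    "context_recall": 0.6000,
    "context_precision": 0.6842,
    "answer_correctness": 0.4769,
}


def compare_metrics(before, after):
    """Return (metric, before, after, delta, percent_change) for each metric."""
    rows = []
    for metric, old in before.items():
        new = after[metric]
        delta = new - old
        # Assumes `old` is non-zero, which holds for the scores above.
        rows.append((metric, old, new, delta, 100 * delta / old))
    return rows


comparison_rows = compare_metrics(baseline, advanced)
for metric, old, new, delta, pct in comparison_rows:
    print(f"{metric:<20} {old:.4f} -> {new:.4f}  delta={delta:+.4f}  ({pct:+.1f}%)")
```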


## Task 4: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [87]:
# Instantiate the text-embedding-3-small embedding model
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [88]:
# Recreate the vector store using the new text-embedding-3-small embeddings
vector_store = Qdrant.from_documents(
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="PMarca Blogs - TE3 - MQR",
)

In [89]:
# Cast vector store as retriever
new_retriever = vector_store.as_retriever()

In [90]:
# Wrap the new retriever in a MultiQueryRetriever, which uses the LLM to generate query variations
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [91]:
# Rebuild the retrieval chain around the new retriever, reusing the existing document chain
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [95]:
import asyncio

# Generate responses again through the new retrieval chain above
answers = []
contexts = []

# Generate the necessary lists to align answer and context to enter into a dataset
async def process_questions():
    for question in test_questions:
        try:
            response = await asyncio.wait_for(new_retrieval_chain.ainvoke({"input": question}), timeout=60)  # 60 second timeout
            answers.append(response["answer"])
            contexts.append([context.page_content for context in response["context"]])
        except asyncio.TimeoutError:
            print(f"Timeout occurred for question: {question}")
            # Handle the timeout (e.g., skip this question or use a default answer)
            answers.append("Timeout occurred")
            contexts.append([])
        except Exception as e:
            print(f"Error occurred for question: {question}. Error: {str(e)}")
            answers.append("Error occurred")
            contexts.append([])

# Run the asynchronous function
await process_questions()
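
The loop above awaits one question at a time. If rate limits allow some parallelism, one option is to run a bounded number of requests concurrently while keeping the same per-question timeout. The sketch below shows that pattern with the standard library only. `fake_chain_ainvoke` is a hypothetical stand-in for the real chain, and the concurrency and timeout values are arbitrary.

```python
import asyncio


async def fake_chain_ainvoke(payload):
    """Hypothetical stand-in for an async chain call such as `ainvoke`."""
    await asyncio.sleep(1.0 if "slow" in payload["input"] else 0.01)
    return {"answer": f"answer to {payload['input']}", "context": ["chunk"]}


async def ask_one(question, semaphore, timeout):
    """Ask one question, returning (answer, contexts) or a timeout placeholder."""
    async with semaphore:
        try:
            response = await asyncio.wait_for(
                fake_chain_ainvoke({"input": question}), timeout=timeout
            )
            return response["answer"], response["context"]
        except asyncio.TimeoutError:
            return "Timeout occurred", []


async def ask_all(questions, max_concurrency=2, timeout=0.25):
    """Run all questions with bounded concurrency; results keep input order."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [ask_one(question, semaphore, timeout) for question in questions]
    return await asyncio.gather(*tasks)


# In a notebook cell, use `await ask_all(...)` instead of `asyncio.run(...)`.
sketch_results = asyncio.run(ask_all(["fast one", "slow one", "fast two"]))
sketch_answers = [answer for answer, _ in sketch_results]
sketch_contexts = [contexts for _, contexts in sketch_results]
print(sketch_answers)
```

Because `asyncio.gather` returns results in input order, the answers and contexts stay aligned with the question list. Note that placeholder rows such as "Timeout occurred" would still be scored like real answers, so it may be worth filtering them out before evaluation.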

In [97]:
# Build the evaluation dataset from the new answers and contexts
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions[0:TRUNC_SIZE],
    "answer" : answers[0:TRUNC_SIZE],
    "contexts" : contexts[0:TRUNC_SIZE],
    "ground_truth" : test_groundtruths[0:TRUNC_SIZE]
})

In [None]:
# Score the new pipeline on the same Ragas metrics as before
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

In [None]:
new_advanced_retrieval_results

In [None]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']

df_merged

| | Metric | ADA + Baseline | ADA + MQR | TE3 + MQR | ADA + MQR -> TE3 + MQR | Baseline -> TE3 + MQR |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.62549 | 0.712024 | 0.865833 | 0.15381 | 0.240343 |
| 1 | answer_relevancy | 0.852605 | 0.945499 | 0.945435 | -6.4e-05 | 0.092831 |
| 2 | context_recall | 0.65 | 0.741667 | 0.679167 | -0.0625 | 0.029167 |
| 3 | context_precision | 0.758333 | 0.711185 | 0.847174 | 0.13599 | 0.088841 |
| 4 | answer_correctness | 0.551982 | 0.552914 | 0.505525 | -0.047388 | -0.046457 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

**ANSWER**
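Based on this limited data, `text-embedding-3-small` may be better than `ada`, but the evidence is mixed: with the `MultiQueryRetriever`, faithfulness (+0.154) and context precision (+0.136) improved, while context recall (-0.063) and answer correctness (-0.047) dropped. To be sure, we should run this experiment again with more data, which means working within the OpenAI rate limits.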
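
One way to put "significantly" on firmer footing is a paired test over per-question scores, since both pipelines answer the same questions. The sketch below runs an exact sign-flip permutation test with the standard library. The scores are made up for illustration and are NOT results from this notebook, and `paired_permutation_p_value` is a hypothetical helper. In practice the inputs would be one metric column from each results dataframe, restricted to the questions both runs share.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired score differences."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * diff for sign, diff in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up per-question scores, for illustration only.
hypothetical_ada = [0.60, 0.75, 0.50, 0.80, 0.65, 0.70, 0.55, 0.90]
hypothetical_te3 = [0.70, 0.70, 0.65, 0.85, 0.60, 0.80, 0.70, 0.85]

p_value = paired_permutation_p_value(hypothetical_ada, hypothetical_te3)
print(f"p-value: {p_value:.3f}")
```

Exact enumeration doubles in cost with every added question, so it only suits small test sets like this one. For larger sets a random sample of sign flips would be the usual substitute.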

## BONUS ACTIVITY: Using a Better Generator

Now that we've seen how much more effective a better Retrieval pipeline is, let's look at what impact a better(?) Generator has!

Adapt the above `TE3 + MQR` pipeline to use `GPT-4o` and compare the results below!

In [None]:
### YOUR CODE HERE