# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In the following notebook we'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.2.0](https://python.langchain.com/v0.2/docs/versions/v0_2/)
  4. Synthetic Dataset Generation for Evaluation using the [Ragas](https://github.com/explodinggradients/ragas) framework.
  

- 🤝 Breakout Room Part #2:
  1. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will occur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here's some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [16]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [17]:
!pip install -qU ragas

We'll be leveraging [QDrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [18]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [19]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are creating using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [20]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

documents = loader.load()

In [21]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [22]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [23]:
len(documents)

1864

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [24]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a QDrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [35]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="PMarca Blogs",
)

####❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

Answer: 
- Qdrant can handle large scale datasets. 
- Reduces latency by including caching disk data in RAM and preloading data from disk to RAM. 
- Has effective storgae and indexing of high-dimensional data. 
- Highly scalable. 
- Can handle complex data types like images, videos. 
- Reduced development and deployment time. 
- Support for real-time analytics and queries.
- Commonly used distance metrics like Euclidean Distance, Cosine Similarity, and Dot Product, are fully supported.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous version of LangChain!

In [36]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [38]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [39]:
for doc in retrieved_documents:
  print(doc)

page_content='the existing order — and make sure that those forces of change\nhave a reasonable chance at succeeding.\nSecond rule of thumb:\nOnce you have picked an industry, get right to the center of it' metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'e92511d8e66842a39ef028003fe194f0', '_collection_name': 'PMarca Blogs'}
page_content='Third rule:\nIn a rapidly changing Held like technology, the best place to\nget experience when you’re starting out is in younger, high-\ngrowth companies.' metadata={'source': 'https://d1lamhf6l6yk6d

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [40]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [41]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [43]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also ensure to pass-through our context - which is critical for RAGAS.

In [44]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

####🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

1. The chain is invoked by providing the `question` item. 
2. "question"` is chained into the `retriever`! (This is what ultimately collects the context through Qdrant.) 
3. Collected context is passed to a `RunnablePassthrough()` from the previous object.
4. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm` which is `primary_qa_llm`. 
5. We also, collect the `"context"` again so we can output it in the final response object.

Let's test it out!

In [45]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Get right to the center of the industry.


In [46]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.', metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '8ab4aafd154a418887e8dacf2fbf7b3b', '_collection_name': 'PMarca Blogs'}), Document(page_content='watching carefully — if everyone agrees right up front that\nwhatever you are doing makes total sense, it probably isn’t a new\nand radical enough idea to justify a

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

In [47]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

####❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

Improved data diversity: By using various splitting parameters, you can generate a more diverse and representative synthetic dataset. This helps capture a wider range of patterns and relationships present in the original data, leading to better model training and generalization.

Balanced representation: Splitting documents with different parameters allows you to create synthetic data that maintains a balanced representation of various classes or categories. This is particularly useful when dealing with imbalanced datasets, as it can help improve model performance on underrepresented classes

Mitigation of biases: Careful splitting of documents can help address biases present in the original data. By using different parameters, you can potentially reduce or eliminate unwanted biases that might otherwise be propagated in the synthetic dataset

Scalability and flexibility: By employing various splitting techniques, you can generate synthetic datasets of different sizes and characteristics, making it easier to scale your data generation process and adapt to different use cases


In [48]:
len(eval_documents)

624


> NOTE: 🛑 Using this notebook as presented will occur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage GPT-3.5-Turbo as the critic_llm. If you're attempting to create a lot of samples please be aware of cost, as well as rate limits. 🛑

In [49]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-3.5-turbo") #<--- If you don't have GPT-4 access, or to reduce cost/rate limiting issues.
#critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(eval_documents[:50], 20, distributions, is_async = False)
testset.to_pandas()

Filename and doc_id are the same for all nodes.                  
Generating:  65%|██████▌   | 13/20 [01:12<01:10, 10.10s/it]max retries exceeded for SimpleEvolution(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=60, max_retries=15, max_wait=90, max_workers=16, exception_types=<class 'openai.RateLimitError'>, log_tenacity=False)), docstore=InMemoryDocumentStore(splitter=<langchain_text_splitters.base.TokenTextSplitter object at 0x7fc74a5b43d0>, nodes=[Node(page_content='The Pmarca Blog Archives\n(select posts from 2007-2009)\nMarc Andreessen', metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 1, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:2015011002041

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,Why is it important for the culture of a start...,"[(In case you were wondering, by the way, the ...",It is important for the culture of a startup t...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,What is the importance of having credible foun...,[You may have to swap out one or more founders...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,"What is discussed in Part 1 of ""The PMARCA Gui...",[Contents\nTHE PMARCA GUIDE TO STARTUPS\nPart ...,Turnaround!,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,What can a startup do to minimize or eliminate...,"[full onion, you realize it’s amazing that any...",Take a hard-headed look at each of the risks s...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,Why does the initial business plan of a startu...,"[The Pmarca Guide to\nStartups, Contents\nTHE ...",The initial business plan of a startup does no...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,What are some challenges and frustrations asso...,[Over and over and over.\nAnd when you do get ...,Hiring is a huge pain in the ass. You will be ...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,What is the impact of euphoric or funereal tim...,[I’ll proceed under the assumption that we’re ...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,How should entrepreneurs interpret VCs saying ...,[Meeting with more VCs aaer a bunch have said ...,"VCs saying ""maybe"" during a pitch typically me...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,"What should you do when VCs say ""no"" to ventur...","[Part 2: When the VCs say ""no""\nThis post is a...","When VCs say ""no"" to venture financing for you...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,What are some differences between a startup an...,"[Second, in a startup, absolutely nothing happ...","A startup lacks the established systems, rhyth...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


####❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

Answer: 
50% of the questions generated is expcted to be simple, 25%  from reasoning and 25% from multi context:

Reasoning: Rewrite the question in a way that enhances the need for reasoning to answer it effectively.

Multi-Context: Rephrase the question in a manner that necessitates information from multiple related sections or chunks to formulate an answer.


What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

Let's look at the output and see what we can learn about it!

In [50]:
testset.test_data[0]

DataRow(question='Why is it important for the culture of a startup to be established and aligned with the values and mindset of the team?', contexts=['(In case you were wondering, by the way, the hours do com-\npound the stress.)\nSeventh, it’s really easy for the culture of a startup to go sideways.\nThis combines the Xrst and second items above.\nThis is the emotional rollercoaster wreaking havoc on not just\nyou but your whole company.\nIt takes time for the culture of any company to become “set” —\nfor the team of people who have come together for the Xrst time\nto decide collectively what they’re all about, what they value —\nand how they look at challenge and adversity.\nIn the best case, you get an amazing dynamic of people really', 'for the better.\nHowever, there are many more reasons to no\nnot\nt do a startup.\nFirst, and most importantly, realize that a startup puts you on\nan emotional rollercoaster unlike anything you have ever experi-\nenced.\nYou will Yip rapidly from a

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our create testset.

We can start by converting our test dataset into a Pandas DataFrame.

In [51]:
test_df = testset.to_pandas()

In [52]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,Why is it important for the culture of a start...,"[(In case you were wondering, by the way, the ...",It is important for the culture of a startup t...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,What is the importance of having credible foun...,[You may have to swap out one or more founders...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,"What is discussed in Part 1 of ""The PMARCA Gui...",[Contents\nTHE PMARCA GUIDE TO STARTUPS\nPart ...,Turnaround!,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,What can a startup do to minimize or eliminate...,"[full onion, you realize it’s amazing that any...",Take a hard-headed look at each of the risks s...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,Why does the initial business plan of a startu...,"[The Pmarca Guide to\nStartups, Contents\nTHE ...",The initial business plan of a startup does no...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,What are some challenges and frustrations asso...,[Over and over and over.\nAnd when you do get ...,Hiring is a huge pain in the ass. You will be ...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,What is the impact of euphoric or funereal tim...,[I’ll proceed under the assumption that we’re ...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,How should entrepreneurs interpret VCs saying ...,[Meeting with more VCs aaer a bunch have said ...,"VCs saying ""maybe"" during a pitch typically me...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,"What should you do when VCs say ""no"" to ventur...","[Part 2: When the VCs say ""no""\nThis post is a...","When VCs say ""no"" to venture financing for you...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,What are some differences between a startup an...,"[Second, in a startup, absolutely nothing happ...","A startup lacks the established systems, rhyth...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


In [53]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [54]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [55]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [56]:
response_dataset[0]

{'question': 'Why is it important for the culture of a startup to be established and aligned with the values and mindset of the team?',
 'answer': 'It is important for the culture of a startup to be established and aligned with the values and mindset of the team because it leads to greater commitment on the part of the entrepreneur and the team.',
 'contexts': ['answer, in part because in the beginning of a startup, you know\na lot more about the team than you do the product, which hasn’t\nbeen built yet, or the market, which hasn’t been explored yet.',
  '(In case you were wondering, by the way, the hours do com-\npound the stress.)\nSeventh, it’s really easy for the culture of a startup to go sideways.\nThis combines the Xrst and second items above.',
  'you but your whole company.\nIt takes time for the culture of any company to become “set” —\nfor the team of people who have come together for the Xrst time',
  'usually a plus for a startup, since it leads to greater commitment\non 

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanied presentation for more in-depth explanations about each of the metrics!

In [57]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [58]:
results = evaluate(response_dataset, metrics)

Evaluating:  29%|██▉       | 22/75 [00:08<00:14,  3.72it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 75/75 [02:17<00:00,  1.83s/it]


In [59]:
results

{'faithfulness': 0.7262, 'answer_relevancy': 0.7594, 'context_recall': 0.3878, 'context_precision': 0.6963, 'answer_correctness': 0.4696}

In [60]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,Why is it important for the culture of a start...,It is important for the culture of a startup t...,"[answer, in part because in the beginning of a...",It is important for the culture of a startup t...,1.0,1.0,0.5,0.805556,0.543744
1,What is the importance of having credible foun...,The importance of having credible founders in ...,[priate for your particular startup.\nWith a w...,,0.666667,0.0,0.0,0.0,0.18419
2,"What is discussed in Part 1 of ""The PMARCA Gui...",Turnaround!,"[The Pmarca Guide to Big\nCompanies, The Pmarc...",Turnaround!,,0.986024,1.0,0.75,1.0
3,What can a startup do to minimize or eliminate...,A startup can minimize or eliminate risks by r...,[risks — and any others that are speciXc to yo...,Take a hard-headed look at each of the risks s...,1.0,0.964233,0.5,0.805556,0.436712
4,Why does the initial business plan of a startu...,The initial business plan of a startup does no...,[Part 7: Why a startup's initial business plan...,The initial business plan of a startup does no...,1.0,0.906462,1.0,1.0,0.732605
5,What are some challenges and frustrations asso...,Some challenges and frustrations associated wi...,"[and the hours.\nBut, it’s really diZcult.\nTh...",Hiring is a huge pain in the ass. You will be ...,0.75,1.0,0.0,0.25,0.384787
6,What is the impact of euphoric or funereal tim...,The impact of euphoric or funereal times on th...,"[Third, retool your plan.\nThis is the hard pa...",,0.5,0.0,0.0,0.0,0.184186
7,How should entrepreneurs interpret VCs saying ...,I don't know.,"[Part 2: When the VCs say ""no""\nThis post is a...","VCs saying ""maybe"" during a pitch typically me...",0.0,0.0,0.0,0.333333,0.189866
8,"What should you do when VCs say ""no"" to ventur...",Retool your plan and lay the groundwork to go ...,"[Part 2: When the VCs say ""no""\nThis post is a...","When VCs say ""no"" to venture financing for you...",0.5,0.86267,0.25,1.0,0.187812
9,What are some differences between a startup an...,"A startup has none of the established systems,...",[ied and determined. Sales calls get made. The...,"A startup lacks the established systems, rhyth...",1.0,0.925847,0.333333,1.0,0.527882


## Task 2: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!

In [61]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [62]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [63]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [64]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift fueding with?"})

In [65]:
print(response["answer"])

I'm sorry, but based on the context provided, I cannot determine who Taylor Swift is feuding with.


In [None]:
response = retrieval_chain.invoke({"input": "Why are they fueding?"})

In [66]:
print(response["answer"])

I'm sorry, but based on the context provided, I cannot determine who Taylor Swift is feuding with.


Well, just from those responses this chain *feels* better - but lets see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [67]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [68]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [72]:
%pip install nest_asyncio

Note: you may need to restart the kernel to use updated packages.


In [73]:
# imports
from ragas import evaluate
from ragas.evaluation import Result
import random
import time
import openai
import nest_asyncio
nest_asyncio.apply()

# define a retry decorator
def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response or max_retries is hit or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specified errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper

# Calculate the delay based on your rate limit
rate_limit_per_minute = 60
delay = 60.0 / rate_limit_per_minute

def evaluate_dataset_row_by_row(dataset, metrics):
    combined_scores = []

    for i in range(dataset.num_rows):
        # Extract a single row
        single_row_dataset = Dataset.from_dict({feature: [dataset[feature][i]] for feature in dataset.features})
        
        # Evaluate the single row dataset with metrics
        result = evaluate(single_row_dataset, metrics, is_async=True)
        
        # Assuming result is in the form of a dictionary or similar structure
        combined_scores.append(result)
        time.sleep(delay) # Sleep for the delay

    # Convert the list of results into a Dataset object
    scores_dataset = Dataset.from_list(combined_scores)

    # Construct the final Result object
    final_result = Result(scores=scores_dataset, dataset=dataset)

    return final_result

@retry_with_exponential_backoff
def evaluate_with_retry(response_dataset, metrics):
    return evaluate(response_dataset, metrics, is_async=True)

In [74]:
#advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)


advanced_retrieval_results = evaluate_with_retry(response_dataset, metrics)

Evaluating:  75%|███████▍  | 56/75 [01:07<00:40,  2.13s/it]No statements were generated from the answer.
Evaluating: 100%|██████████| 75/75 [02:33<00:00,  2.05s/it]


In [75]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,Why is it important for the culture of a start...,It is important for the culture of a startup t...,"[answer, in part because in the beginning of a...",It is important for the culture of a startup t...,1.0,1.0,0.5,0.805556,0.543744
1,What is the importance of having credible foun...,The importance of having credible founders in ...,[priate for your particular startup.\nWith a w...,,0.0,1.0,0.25,0.0,0.18419
2,"What is discussed in Part 1 of ""The PMARCA Gui...",Turnaround!,"[The Pmarca Guide to Big\nCompanies, The Pmarc...",Turnaround!,,0.986024,1.0,0.75,1.0
3,What can a startup do to minimize or eliminate...,A startup can minimize or eliminate risks by r...,[risks — and any others that are speciXc to yo...,Take a hard-headed look at each of the risks s...,0.5,0.964233,1.0,0.805556,0.722426
4,Why does the initial business plan of a startu...,The initial business plan of a startup does no...,[Part 7: Why a startup's initial business plan...,The initial business plan of a startup does no...,1.0,0.906462,1.0,1.0,0.732605
5,What are some challenges and frustrations asso...,Some challenges and frustrations associated wi...,"[and the hours.\nBut, it’s really diZcult.\nTh...",Hiring is a huge pain in the ass. You will be ...,1.0,1.0,0.0,0.25,0.218103
6,What is the impact of euphoric or funereal tim...,The impact of euphoric or funereal times on th...,"[Third, retool your plan.\nThis is the hard pa...",,0.5,0.0,0.0,0.0,0.184186
7,How should entrepreneurs interpret VCs saying ...,I don't know.,"[Part 2: When the VCs say ""no""\nThis post is a...","VCs saying ""maybe"" during a pitch typically me...",0.0,0.0,0.0,0.583333,0.189866
8,"What should you do when VCs say ""no"" to ventur...",Retool your plan and lay the groundwork to go ...,"[Part 2: When the VCs say ""no""\nThis post is a...","When VCs say ""no"" to venture financing for you...",0.5,0.808075,0.25,1.0,0.187812
9,What are some differences between a startup an...,"A startup has none of the established systems,...",[ied and determined. Sales calls get made. The...,"A startup lacks the established systems, rhyth...",1.0,0.92649,0.333333,0.916667,0.527882


## Task 3: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occured!

Let's refresh with our initial metrics.

In [76]:
results

{'faithfulness': 0.7262, 'answer_relevancy': 0.7594, 'context_recall': 0.3878, 'context_precision': 0.6963, 'answer_correctness': 0.4696}

And see how our advanced retrieval modified our chain!

In [77]:
advanced_retrieval_results

{'faithfulness': 0.6786, 'answer_relevancy': 0.8225, 'context_recall': 0.4600, 'context_precision': 0.7074, 'answer_correctness': 0.4638}

In [78]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,0.72619,0.678571,-0.047619
1,answer_relevancy,0.759366,0.822467,0.0631
2,context_recall,0.387778,0.46,0.072222
3,context_precision,0.696296,0.707407,0.011111
4,answer_correctness,0.46965,0.46382,-0.00583


## Task 4: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

####🏗️ Activity #2:

Please provide markdown, or code comments, to explain which each of the following steps are doing!

In [79]:
#Create Open API embedding model from text-embedding-3-small
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [80]:
# Create In Memory Qdrant vector store
vector_store = Qdrant.from_documents(
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="PMarca Blogs - TE3 - MQR",
)

In [81]:
#Get Retriever to retrieve the documents from vector store.
new_retriever = vector_store.as_retriever()

In [82]:
#Create a muti query retriever which uses LLM to write sets of queries
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [83]:
#Create a retrieval chain using the MQR and the document chain
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [84]:
# Get answers and contexts from the chain created using the advanced retriever
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [85]:
#Create a dataset from the contexts and answers from above
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [89]:
# Evaluate the response with retry option to combat the Tokens per limit error from 3.5 turbo llm.
from ragas.run_config import RunConfig

new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics, run_config=RunConfig(max_workers=5))

Evaluating: 100%|██████████| 75/75 [03:15<00:00,  2.61s/it]


In [90]:
new_advanced_retrieval_results

{'faithfulness': 0.8277, 'answer_relevancy': 0.9331, 'context_recall': 0.4278, 'context_precision': 0.7179, 'answer_correctness': 0.4964}

In [92]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']

df_merged

Unnamed: 0,Metric,ADA + Baseline,ADA + MQR,TE3 + MQR,ADA + MQR -> TE3 + MQR,Baseline -> TE3 + MQR
0,faithfulness,0.72619,0.678571,0.827662,0.149091,0.101472
1,answer_relevancy,0.759366,0.822467,0.933053,0.110586,0.173686
2,context_recall,0.387778,0.46,0.427778,-0.032222,0.04
3,context_precision,0.696296,0.707407,0.717876,0.010468,0.021579
4,answer_correctness,0.46965,0.46382,0.496395,0.032575,0.026745


####❓ Question #4:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

## BONUS ACTIVITY: Using a Better Generator

Now that we've seen how much more effective a better Retrieval pipeline is, let's look at what impact a better(?) Generator is!

Adapt the above `TE3 + MQR` pipeline to use `GPT-4o` and compare the results below!

In [94]:
### YOUR CODE HERE

latest_qa_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

latest_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=latest_qa_llm)

latest_retrieval_chain = create_retrieval_chain(latest_retriever, document_chain)

answers = []
contexts = []

for question in test_questions:
  response = latest_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

latest_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

latest_results = evaluate_dataset_row_by_row(response_dataset_advanced_retrieval, metrics)

latest_results

Evaluating: 100%|██████████| 5/5 [00:08<00:00,  1.67s/it]
Evaluating: 100%|██████████| 5/5 [00:07<00:00,  1.57s/it]
Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]Task exception was never retrieved
future: <Task finished name='Task-5115' coro=<AsyncClient.aclose() done, defined at /home/gkbalu/anaconda3/envs/llmops-course/lib/python3.11/site-packages/httpx/_client.py:1967> exception=RuntimeError('Event loop is closed')>
Traceback (most recent call last):
  File "/home/gkbalu/anaconda3/envs/llmops-course/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/home/gkbalu/anaconda3/envs/llmops-course/lib/python3.11/site-packages/httpx/_client.py", line 1974, in aclose
    await self._transport.aclose()
  File "/home/gkbalu/anaconda3/envs/llmops-course/lib/python3.11/site-packages/httpx/_transports/default.py", line 365, in aclose
    await self._pool.aclose()
  File "/home/gkbalu/anaconda3/envs/llmops-course/lib/python3.1

{'faithfulness': 0.8941, 'answer_relevancy': 0.9155, 'context_recall': 0.5100, 'context_precision': 0.6789, 'answer_correctness': 0.4678}

In [95]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']

df_merged

Unnamed: 0,Metric,ADA + Baseline,ADA + MQR,TE3 + MQR,ADA + MQR -> TE3 + MQR,Baseline -> TE3 + MQR
0,faithfulness,0.72619,0.678571,0.827662,0.149091,0.101472
1,answer_relevancy,0.759366,0.822467,0.933053,0.110586,0.173686
2,context_recall,0.387778,0.46,0.427778,-0.032222,0.04
3,context_precision,0.696296,0.707407,0.717876,0.010468,0.021579
4,answer_correctness,0.46965,0.46382,0.496395,0.032575,0.026745
