# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In the following notebook we'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.2.0](https://python.langchain.com/v0.2/docs/versions/v0_2/)
  4. Synthetic Dataset Generation for Evaluation using the [Ragas](https://github.com/explodinggradients/ragas) framework.
  

- 🤝 Breakout Room Part #2:
  1. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will occur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here's some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
from rich import print
%load_ext rich

In [2]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [3]:
!pip install -qU ragas

We'll be leveraging [QDrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [4]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [5]:
import os
import openai

from dotenv import load_dotenv

load_dotenv()

[3;92mTrue[0m

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are creating using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

documents = loader.load()

In [7]:
documents[0].metadata


[1m{[0m
    [32m'source'[0m: [32m'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf'[0m,
    [32m'file_path'[0m: [32m'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf'[0m,
    [32m'page'[0m: [1;36m0[0m,
    [32m'total_pages'[0m: [1;36m195[0m,
    [32m'format'[0m: [32m'PDF 1.3'[0m,
    [32m'title'[0m: [32m'The Pmarca Blog Archives'[0m,
    [32m'author'[0m: [32m''[0m,
    [32m'subject'[0m: [32m''[0m,
    [32m'keywords'[0m: [32m''[0m,
    [32m'creator'[0m: [32m''[0m,
    [32m'producer'[0m: [32m'Mac OS X 10.10 Quartz PDFContext'[0m,
    [32m'creationDate'[0m: [32m"D:20150110020418Z00'00'"[0m,
    [32m'modDate'[0m: [32m"D:20150110020418Z00'00'"[0m,
    [32m'trapped'[0m: [32m''[0m
[1m}[0m

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [9]:
len(documents)

[1;36m1864[0m

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

#### Creating a QDrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [11]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="PMarca Blogs",
)

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about QDrant!

#### Answer #1:

Vector databases are optimized for storing and querying these high-dimensional vectors efficiently, and they often using specialized data structures and indexing techniques such as Hierarchical Navigable Small World (HNSW) – which is used to implement Approximate Nearest Neighbors

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous version of LangChain!

In [12]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke(
    "What is a rule of thumb for selecting an industry to invest in?"
)

In [14]:
for doc in retrieved_documents:
  print(doc)

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also ensure to pass-through our context - which is critical for RAGAS.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

#### Answer:

It takes the question, retrieves documents and marks it as context, then passes the context and question to the QA LLM to generate an answer, along with the context.

Let's test it out!

In [19]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question": question})

print(result["response"].content)

In [20]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question": question})

print(result["response"].content)
print(result["context"])

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

In [21]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

#### Answer #2:

It is important to split our documents using different parameters when creating our synthetic data because it allows us to test the performance of our pipeline under different conditions, similar to the train / test split in machine learning.

In [22]:
len(eval_documents)

[1;36m624[0m


> NOTE: 🛑 Using this notebook as presented will occur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage GPT-3.5-Turbo as the critic_llm. If you're attempting to create a lot of samples please be aware of cost, as well as rate limits. 🛑

In [23]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
# critic_llm = ChatOpenAI(model="gpt-3.5-turbo") <--- If you don't have GPT-4 access, or to reduce cost/rate limiting issues.
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

distributions = {simple: 0.5, multi_context: 0.4, reasoning: 0.1}

In [24]:
testset = generator.generate_with_langchain_docs(
    eval_documents, 20, distributions, is_async=True, 
)
testset.to_pandas()


embedding nodes:   0%|          | 0/1248 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/20 [00:00<?, ?it/s]

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What are the criteria for evaluating candidate...,[How to hire the best people you've\never work...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,How does the reluctance to change early-formed...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The reluctance to change early-formed habits c...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,How have modern programming technologies contr...,"[late 90’s — due to commodity hardware, open s...",Modern programming technologies have contribut...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,What are some potential unanticipated setbacks...,[Here’s why you shouldn’t do that:\nWhat are t...,You may have unanticipated setbacks within you...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,Why is geographic locality still important in ...,[is a mistake. A lot of people — those who don...,Geographic locality is still important in acce...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,What factors determine the peak age for creati...,"[ods when productivity is highest, the peak ag...",The expected age optimum for quantity and qual...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,What is the negative impact of a hyper-control...,[added humor value.]\nWhile I enjoyed Marc’s p...,The negative impact of a hyper-controlling man...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,How important is it for a startup to focus on ...,"[developing a large market, as opposed to Xght...",Focusing on developing a large market is impor...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,What is the best way to develop contacts with ...,[and identifying those VCs and screening out a...,"The best way to develop contacts with VCs, in ...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,What are the benefits of restructuring in term...,[redesign is that you want to tolerate overlap...,"By reducing the size of a team, restructuring ...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


#### ❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

#### Answer #3:

This mapping refers to the distribution of the different types of questions that will be generated in the synthetic test set. 

Let's look at the output and see what we can learn about it!

In [25]:
testset.test_data[0]


[1;35mDataRow[0m[1m([0m
    [33mquestion[0m=[32m'What are the criteria for evaluating candidates when hiring the best people?'[0m,
    [33mcontexts[0m=[1m[[0m
        [32m"How to hire the best people you've\never worked with\nThere are many aspects to hiring great people, and various peo-\nple smarter than me have written extensively on the topic.\nSo I’m not going to try to be comprehensive.\nBut I am going to relay some lessons learned through hard\nexperience on how to hire the best people you’ve ever worked\nwith — particularly for a startup.\nI’m going to cover two key areas in this post:\n•\nCriteria: what to value when evaluating candidates.\n•\nAnd process: how to actually run the hiring process, and if\nnecessary the aaermath of making a mistake.\nCriteria 7rst"[0m
    [1m][0m,
    [33mground_truth[0m=[32m'nan'[0m,
    [33mevolution_type[0m=[32m'simple'[0m,
    [33mmetadata[0m=[1m[[0m
        [1m{[0m
            [32m'source'[0m: [32m'https://d

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our create testset.

We can start by converting our test dataset into a Pandas DataFrame.

In [26]:
test_df = testset.to_pandas()

In [27]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What are the criteria for evaluating candidate...,[How to hire the best people you've\never work...,,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,How does the reluctance to change early-formed...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The reluctance to change early-formed habits c...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,How have modern programming technologies contr...,"[late 90’s — due to commodity hardware, open s...",Modern programming technologies have contribut...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,What are some potential unanticipated setbacks...,[Here’s why you shouldn’t do that:\nWhat are t...,You may have unanticipated setbacks within you...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,Why is geographic locality still important in ...,[is a mistake. A lot of people — those who don...,Geographic locality is still important in acce...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,What factors determine the peak age for creati...,"[ods when productivity is highest, the peak ag...",The expected age optimum for quantity and qual...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,What is the negative impact of a hyper-control...,[added humor value.]\nWhile I enjoyed Marc’s p...,The negative impact of a hyper-controlling man...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,How important is it for a startup to focus on ...,"[developing a large market, as opposed to Xght...",Focusing on developing a large market is impor...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,What is the best way to develop contacts with ...,[and identifying those VCs and screening out a...,"The best way to develop contacts with VCs, in ...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,What are the benefits of restructuring in term...,[redesign is that you want to tolerate overlap...,"By reducing the size of a team, restructuring ...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


In [28]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [29]:
answers = []
contexts = []

for question in test_questions:
    response = retrieval_augmented_qa_chain.invoke({"question": question})
    answers.append(response["response"].content)
    contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [30]:
from datasets import Dataset

response_dataset = Dataset.from_dict(
    {
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": test_groundtruths,
    }
)

Let's take a peek and see what that looks like!

In [31]:
response_dataset[0]


[1m{[0m
    [32m'question'[0m: [32m'What are the criteria for evaluating candidates when hiring the best people?'[0m,
    [32m'answer'[0m: [32m'Criteria for evaluating candidates when hiring the best people include hiring for intelligence.'[0m,
    [32m'contexts'[0m: [1m[[0m
        [32m"How to hire the best people you've\never worked with\nThere are many aspects to hiring great people, and various peo-\nple smarter than me have written extensively on the topic."[0m,
        [32m'with your team as you interview candidates for the position.\nThis is one of the best ways for an organization to become really\ngood at hiring: by iterating the questions, you’re reXning what'[0m,
        [32m'for everything.\nNotably, for the really critical open jobs, go out and recruit the\nright person yourself, or better yet promote from within.'[0m,
        [32m'necessary the aaermath of making a mistake.\nCriteria 7rst\nLots of people will tell you to hire for intelligence.\nEspec

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanied presentation for more in-depth explanations about each of the metrics!

In [32]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

In [33]:
metrics[0]


[1;35mFaithfulness[0m[1m([0m
    [33mllm[0m=[3;35mNone[0m,
    [33mname[0m=[32m'faithfulness'[0m,
    [33mevaluation_mode[0m=[1m<[0m[1;95mEvaluationMode.qac:[0m[39m [0m[1;36m1[0m[39m>,[0m
[39m    [0m[33mnli_statements_message[0m[39m=[0m[1;35mPrompt[0m[1;39m([0m
[39m        [0m[33mname[0m[39m=[0m[32m'nli_statements'[0m[39m,[0m
[39m        [0m[33minstruction[0m[39m=[0m[32m'Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context.'[0m[39m,[0m
[39m        [0m[33moutput_format_instruction[0m[39m=[0m[32m'The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema [0m[32m{[0m[32m"properties": [0m[32m{[0m[32m"foo": [0m[32m{[0m[32m"title": "Foo"

All that's left to do is call "evaluate" and away we go!

In [34]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]

In [35]:
results

[1m{[0m[32m'faithfulness'[0m: [1;36m0.7833[0m, [32m'answer_relevancy'[0m: [1;36m0.9510[0m, [32m'context_recall'[0m: [1;36m0.5857[0m, [32m'context_precision'[0m: [1;36m0.7667[0m, [32m'answer_correctness'[0m: [1;36m0.4815[0m[1m}[0m

In [36]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What are the criteria for evaluating candidate...,Criteria for evaluating candidates when hiring...,[How to hire the best people you've\never work...,,1.0,0.987486,0.0,0.0,0.184406
1,How does the reluctance to change early-formed...,The reluctance to change early-formed habits c...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The reluctance to change early-formed habits c...,1.0,0.890371,0.5,0.805556,0.490907
2,How have modern programming technologies contr...,Modern programming technologies have contribut...,"[late 90’s — due to commodity hardware, open s...",Modern programming technologies have contribut...,1.0,0.969925,0.5,0.833333,0.675435
3,What are some potential unanticipated setbacks...,Not raising enough money risks the survival of...,[Here’s why you shouldn’t do that:\nWhat are t...,You may have unanticipated setbacks within you...,1.0,0.942212,0.0,1.0,0.205057
4,Why is geographic locality still important in ...,Geographic locality is still important in acce...,[— don’t want to hear it. But it’s true. Geogr...,Geographic locality is still important in acce...,1.0,1.0,1.0,1.0,0.742881
5,What factors determine the peak age for creati...,Productivity and total lifetime output are fac...,[creator’s most distinguished work will appear...,The expected age optimum for quantity and qual...,0.5,1.0,0.0,0.916667,0.224084
6,What is the negative impact of a hyper-control...,The negative impact of a hyper-controlling man...,[severe personality disorder who micromanages ...,The negative impact of a hyper-controlling man...,1.0,0.992703,1.0,0.583333,0.612689
7,How important is it for a startup to focus on ...,It is important for a startup to focus on deve...,"[competitor, be sure to take a step back and s...",Focusing on developing a large market is impor...,0.5,0.993499,1.0,1.0,0.547455
8,What is the best way to develop contacts with ...,The best way to develop contacts with VCs is t...,"[absolutely key.\nNow, on to developing contac...","The best way to develop contacts with VCs, in ...",1.0,1.0,1.0,0.0,0.429086
9,What are the benefits of restructuring in term...,The benefits of restructuring in terms of team...,"[before. By reducing the size of a team, and i...","By reducing the size of a team, restructuring ...",1.0,1.0,1.0,0.805556,0.664642


## Task 2: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!

In [37]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever, llm=primary_qa_llm
)

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [38]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [39]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [40]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift fueding with?"})

In [41]:
print(response["answer"])

In [42]:
response = retrieval_chain.invoke({"input": "Why are they fueding?"})

In [43]:
print(response["answer"])

Well, just from those responses this chain *feels* better - but lets see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [44]:
answers = []
contexts = []

for question in test_questions:
    response = retrieval_chain.invoke({"input": question})
    answers.append(response["answer"])
    contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [45]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [46]:
response_dataset_advanced_retrieval.to_pandas()

Unnamed: 0,question,answer,contexts,ground_truth
0,What are the criteria for evaluating candidate...,"When hiring the best people, it is important t...",[with your team as you interview candidates fo...,
1,How does the reluctance to change early-formed...,The reluctance to change early-formed habits c...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The reluctance to change early-formed habits c...
2,How have modern programming technologies contr...,Modern programming technologies have significa...,"[late 90’s — due to commodity hardware, open s...",Modern programming technologies have contribut...
3,What are some potential unanticipated setbacks...,Some potential unanticipated setbacks that can...,[Here’s why you shouldn’t do that:\nWhat are t...,You may have unanticipated setbacks within you...
4,Why is geographic locality still important in ...,Geographic locality is still important in acce...,[— don’t want to hear it. But it’s true. Geogr...,Geographic locality is still important in acce...
5,What factors determine the peak age for creati...,The text mentions that the peak age for creati...,[creator’s most distinguished work will appear...,The expected age optimum for quantity and qual...
6,What is the negative impact of a hyper-control...,The negative impact of a hyper-controlling man...,[severe personality disorder who micromanages ...,The negative impact of a hyper-controlling man...
7,How important is it for a startup to focus on ...,It is very important for a startup to focus on...,"[answer, in part because in the beginning of a...",Focusing on developing a large market is impor...
8,What is the best way to develop contacts with ...,"The best way to develop contacts with VCs, acc...",[If\nyou\nengage\nin\na\nset\nof\nthese\ntechn...,"The best way to develop contacts with VCs, in ..."
9,What are the benefits of restructuring in term...,Restructuring by reducing the size of a team a...,"[before. By reducing the size of a team, and i...","By reducing the size of a team, restructuring ..."


Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [47]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]

In [48]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What are the criteria for evaluating candidate...,"When hiring the best people, it is important t...",[with your team as you interview candidates fo...,,1.0,0.937369,1.0,0.0,0.179971
1,How does the reluctance to change early-formed...,The reluctance to change early-formed habits c...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The reluctance to change early-formed habits c...,1.0,0.919766,1.0,0.833333,0.425984
2,How have modern programming technologies contr...,Modern programming technologies have significa...,"[late 90’s — due to commodity hardware, open s...",Modern programming technologies have contribut...,1.0,1.0,0.5,0.833333,0.845747
3,What are some potential unanticipated setbacks...,Some potential unanticipated setbacks that can...,[Here’s why you shouldn’t do that:\nWhat are t...,You may have unanticipated setbacks within you...,1.0,1.0,0.0,1.0,0.215343
4,Why is geographic locality still important in ...,Geographic locality is still important in acce...,[— don’t want to hear it. But it’s true. Geogr...,Geographic locality is still important in acce...,1.0,1.0,1.0,1.0,0.689856
5,What factors determine the peak age for creati...,The text mentions that the peak age for creati...,[creator’s most distinguished work will appear...,The expected age optimum for quantity and qual...,1.0,0.924469,0.0,0.8875,0.218227
6,What is the negative impact of a hyper-control...,The negative impact of a hyper-controlling man...,[severe personality disorder who micromanages ...,The negative impact of a hyper-controlling man...,1.0,0.990508,1.0,0.583333,0.537469
7,How important is it for a startup to focus on ...,It is very important for a startup to focus on...,"[answer, in part because in the beginning of a...",Focusing on developing a large market is impor...,0.6,0.97967,1.0,1.0,0.562863
8,What is the best way to develop contacts with ...,"The best way to develop contacts with VCs, acc...",[If\nyou\nengage\nin\na\nset\nof\nthese\ntechn...,"The best way to develop contacts with VCs, in ...",0.25,0.976858,1.0,0.5,0.61521
9,What are the benefits of restructuring in term...,Restructuring by reducing the size of a team a...,"[before. By reducing the size of a team, and i...","By reducing the size of a team, restructuring ...",1.0,0.826892,1.0,0.7,0.610893


## Task 3: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occured!

Let's refresh with our initial metrics.

In [49]:
results

[1m{[0m[32m'faithfulness'[0m: [1;36m0.7833[0m, [32m'answer_relevancy'[0m: [1;36m0.9510[0m, [32m'context_recall'[0m: [1;36m0.5857[0m, [32m'context_precision'[0m: [1;36m0.7667[0m, [32m'answer_correctness'[0m: [1;36m0.4815[0m[1m}[0m

And see how our advanced retrieval modified our chain!

In [50]:
advanced_retrieval_results

[1m{[0m[32m'faithfulness'[0m: [1;36m0.7543[0m, [32m'answer_relevancy'[0m: [1;36m0.9532[0m, [32m'context_recall'[0m: [1;36m0.6524[0m, [32m'context_precision'[0m: [1;36m0.7943[0m, [32m'answer_correctness'[0m: [1;36m0.5194[0m[1m}[0m

In [51]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,0.783333,0.754345,-0.028988
1,answer_relevancy,0.950992,0.953215,0.002223
2,context_recall,0.585714,0.652381,0.066667
3,context_precision,0.766667,0.794319,0.027653
4,answer_correctness,0.481508,0.51944,0.037933


## Task 4: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain which each of the following steps are doing!

##### Answer

1. Create a new embedding model using TE3
2. Create a new vector store in memory
3. Build the retriever using MQR, and the generator using GPT-3.5-Turbo
4. Generate the responses using the pipeline for all test questions
5. Convert the responses into a dataset for Ragas
6. Evaluate the pipeline using Ragas
7. Print the results and compare them to the baseline

In [52]:
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [53]:
vector_store = Qdrant.from_documents(
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="PMarca Blogs - TE3 - MQR",
)

In [54]:
new_retriever = vector_store.as_retriever()

In [55]:
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [56]:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [57]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [58]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [59]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]

In [60]:
new_advanced_retrieval_results

[1m{[0m[32m'faithfulness'[0m: [1;36m0.8754[0m, [32m'answer_relevancy'[0m: [1;36m0.9540[0m, [32m'context_recall'[0m: [1;36m0.5613[0m, [32m'context_precision'[0m: [1;36m0.7241[0m, [32m'answer_correctness'[0m: [1;36m0.5118[0m[1m}[0m

In [61]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']

df_merged

Unnamed: 0,Metric,ADA + Baseline,ADA + MQR,TE3 + MQR,ADA + MQR -> TE3 + MQR,Baseline -> TE3 + MQR
0,faithfulness,0.783333,0.754345,0.875366,0.121021,0.092033
1,answer_relevancy,0.950992,0.953215,0.95396,0.000745,0.002968
2,context_recall,0.585714,0.652381,0.56131,-0.091071,-0.024405
3,context_precision,0.766667,0.794319,0.724075,-0.070244,-0.042591
4,answer_correctness,0.481508,0.51944,0.511753,-0.007687,0.030246


#### ❓ Question #4:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

#### Answer #4:

Moving to TE3 increases the faithfulness of the pipeline, but sacrifices on recall and precision. It's a trade-off that may be worth it depending on the use case.

## BONUS ACTIVITY: Using a Better Generator

Now that we've seen how much more effective a better Retrieval pipeline is, let's look at what impact a better(?) Generator is!

Adapt the above `TE3 + MQR` pipeline to use `GPT-4o` and compare the results below!

In [62]:
### YOUR CODE HERE

advanced_qa_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [63]:
new_response_dataset_advanced_retrieval_gpt4o = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

new_advanced_retrieval_results_gpt4o = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]

Failed to parse output. Returning None.


In [64]:
new_advanced_retrieval_results_gpt4o

[1m{[0m[32m'faithfulness'[0m: [1;36m0.8394[0m, [32m'answer_relevancy'[0m: [1;36m0.9530[0m, [32m'context_recall'[0m: [1;36m0.6104[0m, [32m'context_precision'[0m: [1;36m0.7092[0m, [32m'answer_correctness'[0m: [1;36m0.5331[0m[1m}[0m

In [65]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA + Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA + MQR'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'TE3 + MQR'])
df_gpt4o = pd.DataFrame(list(new_advanced_retrieval_results_gpt4o.items()), columns=['Metric', 'TE3 + MQR + GPT-4o'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")
df_merged = pd.merge(df_gpt4o, df_merged, on="Metric")

df_merged['ADA + MQR -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + MQR']
df_merged['Baseline -> TE3 + MQR'] = df_merged['TE3 + MQR'] - df_merged['ADA + Baseline']
df_merged['TE3 + MQR -> TE3 + MQR + GPT-4o'] = df_merged['TE3 + MQR + GPT-4o'] - df_merged['TE3 + MQR']
df_merged['Baseline -> TE3 + MQR + GPT-4o'] = df_merged['TE3 + MQR + GPT-4o'] - df_merged['ADA + Baseline']

df_merged

Unnamed: 0,Metric,TE3 + MQR + GPT-4o,ADA + Baseline,ADA + MQR,TE3 + MQR,ADA + MQR -> TE3 + MQR,Baseline -> TE3 + MQR,TE3 + MQR -> TE3 + MQR + GPT-4o,Baseline -> TE3 + MQR + GPT-4o
0,faithfulness,0.839377,0.783333,0.754345,0.875366,0.121021,0.092033,-0.035989,0.056044
1,answer_relevancy,0.952959,0.950992,0.953215,0.95396,0.000745,0.002968,-0.001001,0.001966
2,context_recall,0.610417,0.585714,0.652381,0.56131,-0.091071,-0.024405,0.049107,0.024702
3,context_precision,0.709214,0.766667,0.794319,0.724075,-0.070244,-0.042591,-0.014861,-0.057452
4,answer_correctness,0.533075,0.481508,0.51944,0.511753,-0.007687,0.030246,0.021322,0.051568
