# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

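In this notebook we'll complete the following tasks:

- Install the required libraries
- Set environment variables
- Collect data for synthetic data generation
- Generate a synthetic dataset for evaluation using Ragas
- Evaluate our pipeline with Ragas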

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`!

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [None]:
!pip install -qU langchain langchain-openai langchain-huggingface langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [None]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [None]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [None]:
import os
import openai
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Please provide your OpenAI API Key: ")

Please provide your OpenAI API Key: ··········


In [None]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith AIMS@TMLS- {unique_id}"

In [None]:
os.environ["LANGCHAIN_API_KEY"] = getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········


## Task 3: Collecting Data for Synthetic Data Generation

Building on what we learned last week, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf",
)

documents = loader.load()

In [None]:
documents[0].metadata

{'source': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf',
 'file_path': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

eval_documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [None]:
len(eval_documents)

232
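
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a plain sliding-window splitter. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm (which first tries to split on separators such as paragraph and line breaks), and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
print(sample_chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap = 50` plays for our 500-character chunks: content that straddles a boundary still appears whole in at least one chunk.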

#### Loading HF Inference Endpoint Embedding Model

We'll need a way to convert our text into vectors that we can compare against our query vector.
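
As a rough illustration of what "comparing" vectors means, the sketch below scores a query vector against a few chunk vectors with cosine similarity, one common choice for embedding search. The three-dimensional vectors and the names `query_vector` and `chunk_vectors` are made up for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed values).
query_vector = [1.0, 0.0, 1.0]
chunk_vectors = {
    "chunk_a": [1.0, 0.0, 1.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [1.0, 1.0, 0.0],
}

scores = {name: cosine_similarity(query_vector, vec) for name, vec in chunk_vectors.items()}
best_chunk = max(scores, key=scores.get)
print(best_chunk, scores)
```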



In [None]:
HF_EMBED_URL = "YOUR_URL_HERE"

In [None]:
os.environ["HF_TOKEN"] = getpass("Please provide your Hugging Face API Key: ")

Please provide your Hugging Face API Key: ··········


In [None]:
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

embedding_model = HuggingFaceEndpointEmbeddings(
    model=HF_EMBED_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [None]:
from langchain_community.vectorstores import Qdrant

# Index every chunk in batches of 32 (iterate over the chunk list, not the page list)
for i in range(0, len(eval_documents), 32):
  if i == 0:
    vectorstore = Qdrant.from_documents(
        eval_documents[i:i+32],
        embedding_model,
        location=":memory:",
        collection_name="Elon's Complaint")
    continue
  vectorstore.add_documents(eval_documents[i:i+32])
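
The batching pattern used here can be sketched on its own. The helper `iter_batches` and the placeholder list `fake_chunks` are hypothetical; the point is only that the range must span the list being sliced so that every item lands in exactly one batch.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 232 placeholder strings, matching the number of chunks reported above.
fake_chunks = [f"chunk-{i}" for i in range(232)]
batch_sizes = [len(batch) for batch in iter_batches(fake_chunks, 32)]
print(batch_sizes)
```

Seven full batches plus a final batch of eight account for all 232 chunks.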

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [None]:
retriever = vectorstore.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [None]:
retrieved_documents = retriever.invoke("What is this complaint about?")

In [None]:
for doc in retrieved_documents:
  print(doc)

page_content='In 2015, Mr. Altman wrote that the “[d]evelopment of superhuman machine intelligence (SMI) is' metadata={'source': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf', 'file_path': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf', 'page': 4, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': '', '_id': 'b779fea9522b494985aeeba1b56ea971', '_collection_name': "Elon's Complaint"}
page_content='bragging about performance. On information and belief, this secrecy is primarily driven by 
commercial considerations, not safety. Although developed by OpenAI using contributions from' metadata={'source': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf', 'file_path': 'https://www.courthousenews.com/wp-content/uploads/2024

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!


> NOTE: 🛑 Using this notebook as presented will incur a charge of ~$3 USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step - if you want to reduce costs, please use the provided commented code to leverage `GPT-3.5-Turbo` as the `critic_llm`. If you're attempting to create a lot of samples, please be aware of cost as well as rate limits. 🛑

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

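generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o")
# To reduce costs, use GPT-3.5-Turbo as the critic instead:
# critic_llm = ChatOpenAI(model="gpt-3.5-turbo")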
embeddings = embedding_model

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 10 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(eval_documents[:50], num_qa_pairs, distributions)
testset.to_pandas()
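
To make the `distributions` mapping concrete, the sketch below turns a set of shares into a target number of questions per evolution type. This is only the arithmetic behind the shares, not Ragas' internal allocation logic; the string keys stand in for the `simple`, `multi_context`, and `reasoning` objects, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(distribution: dict[str, float], total: int) -> dict[str, int]:
    """Convert shares that sum to 1 into a rounded target count per category."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution shares must sum to 1")
    return {name: round(share * total) for name, share in distribution.items()}


shares = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
counts = expected_counts(shares, total=10)
print(counts)
```

The generated test set may contain fewer rows than requested, and its mix of evolution types may not match these targets exactly.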

Let's look at the output and see what we can learn about it!

In [None]:
testset.test_data[0]

DataRow(question="What rights does Microsoft have to OpenAI's pre-AGI technology?", contexts=['regard, Microsoft’s own researchers have publicly stated that, “[g]iven the breadth and depth of \nGPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) \nversion of an artificial general intelligence (AGI) system.” Moreover, on information and belief, \nOpenAI is currently developing a model known as Q* (Q star) that has an even stronger claim to \nAGI. As noted, Microsoft only has rights to certain of OpenAI’s pre-AGI technology. But for'], ground_truth='Microsoft only has rights to certain of OpenAI’s pre-AGI technology.', evolution_type='simple', metadata=[{'source': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf', 'file_path': 'https://www.courthousenews.com/wp-content/uploads/2024/02/musk-v-altman-openai-complaint-sf.pdf', 'page': 7, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'au

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

First, we'll convert our test dataset into a Pandas DataFrame.

In [None]:
test_df = testset.to_pandas()

In [None]:
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What rights does Microsoft have to OpenAI's pr... | `[regard, Microsoft’s own researchers have publ...` | Microsoft only has rights to certain of OpenAI... | simple | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 1 | Where is the principal place of business for O... | `[19, 2018. OpenAI OpCo, LLC is registered as a...` | 1960 Bryant Street, San Francisco, CA 94110 | simple | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 2 | What tasks were early AI programs capable of o... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | The answer to given question is not present in... | simple | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 3 | What measures have been publicly called for to... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | A variety of measures have been publicly calle... | simple | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 4 | How did AlphaZero become the strongest chess p... | `[repeats. \n21. \nAlphaZero rapidly became th...` | AlphaZero became the strongest chess playing s... | simple | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 5 | What principle was upheld in OpenAI, Inc.'s Ce... | `[25. \nThe Founding Agreement was also memoria...` | The Certificate of Incorporation affirmed that... | multi_context | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 6 | What breach occurred with OpenAI and GPT-4 in ... | `[The 2023 Breach Of The Founding Agreement \n2...` | The 2023 Breach Of The Founding Agreement | multi_context | `[{'source': 'https://www.courthousenews.com/wp...` | True |
| 7 | How might Strong AGI affect human economics, a... | `[luminaries like Stephen Hawking and Sun Micro...` | Strong AGI, as noted by luminaries like Stephe... | reasoning | `[{'source': 'https://www.courthousenews.com/wp...` | True |


In [None]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

### Setting Up RAG Chain to Test

We'll quickly set up a familiar RAG pipeline that we can evaluate against our created dataset.

In [None]:
HF_LLM_URL = "YOUR_LLM_URL_HERE" + "/v1/"

In [None]:
hf_llm = ChatOpenAI(
    model="tgi",
    openai_api_base=HF_LLM_URL,
    openai_api_key=os.environ["HF_TOKEN"]
)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(documents)

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model_oai = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_base="https://api.openai.com/v1/",
)

# Index every chunk in batches of 32 (iterate over the chunk list, not the page list)
for i in range(0, len(rag_documents), 32):
  if i == 0:
    rag_vectorstore = Qdrant.from_documents(
        rag_documents[i:i+32],
        embedding_model_oai,
        location=":memory:",
        collection_name="Elon's Complaint - RAG")
    continue
  rag_vectorstore.add_documents(rag_documents[i:i+32])

In [None]:
rag_retriever = rag_vectorstore.as_retriever()

In [None]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

base_rag_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into rag_retriever
    {"context": itemgetter("question") | rag_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign keeps every existing key and re-assigns "context" to its current value,
    # so the retrieved documents pass through this step unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | hf_llm | StrOutputParser(), "context": itemgetter("context")}
)
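
The data flow through this chain can be mimicked with plain functions. The sketch below is not LCEL; `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins that show which keys exist at each step and why the final output carries both a `response` and the retrieved `context`.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def fake_retriever(question: str) -> list[str]:
    """Stand-in for a retriever: returns a list of passages for the question."""
    return [f"passage relevant to: {question}"]


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call: returns a canned string."""
    return f"answer based on a prompt of {len(prompt)} characters"


def run_chain(inputs: dict) -> dict:
    # Step 1: build a dict with the retrieved context and the original question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the pass-through step leaves "context" and "question" as they are.
    # Step 3: format the prompt, call the model, and keep the context alongside it.
    prompt = PROMPT_TEMPLATE.format(**state)
    return {"response": fake_llm(prompt), "context": state["context"]}


result = run_chain({"question": "What is this complaint about?"})
print(result)
```

Returning the context next to the response is what lets us collect the retrieved passages for each question when we evaluate.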

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = base_rag_chain.invoke({"question" : question})
  answers.append(response["response"])
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [None]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
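
Before handing columns like these to an evaluator, it can help to check that they line up. The sketch below is a hypothetical helper, `validate_eval_columns`, that verifies the four columns are present, have equal length, and that each `contexts` entry is a list of strings; the sample row is invented for illustration.

```python
def validate_eval_columns(columns: dict) -> int:
    """Check the column dict is well-formed and return the number of rows."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [name for name in required if name not in columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    for row_contexts in columns["contexts"]:
        # Each row needs its own list of retrieved passages.
        if not isinstance(row_contexts, list) or not all(isinstance(c, str) for c in row_contexts):
            raise TypeError("each 'contexts' entry must be a list of strings")
    return lengths["question"]


sample_columns = {
    "question": ["Example question?"],
    "answer": ["Example answer."],
    "contexts": [["first retrieved passage", "second retrieved passage"]],
    "ground_truth": ["Example reference answer."],
}
row_count = validate_eval_columns(sample_columns)
print(row_count)
```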

Let's take a peek and see what that looks like!

In [None]:
response_dataset[0]

{'question': "What rights does Microsoft have to OpenAI's pre-AGI technology?",
 'answer': "Microsoft has the exclusive right to license certain of OpenAI's pre-AGI technology. The Microsoft license only applied to OpenAI's pre-AGI technology, and Microsoft obtained no rights to AGI. The determination of when OpenAI attains AGI is up to OpenAI, Inc.'s non-profit Board, not Microsoft.",
 'contexts': ['69. \nOn September 22, 2020, OpenAI announced that it exclusively licensed certain of its \npre-AGI technology to Microsoft. Consistent with the Founding Agreement, OpenAI’s website \nstates that AGI, which it describes as “a highly autonomous system that outperforms humans at most \neconomically valuable work” “is excluded from IP licenses and other commercial terms with',
  'commercial entities alike.  \n28. \nMr. Altman became OpenAI, Inc.’s CEO in 2019. On September 22, 2020, OpenAI \nentered into an agreement with Microsoft, exclusively licensing to Microsoft its Generative Pre-\nTrai

## Task 5: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
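
As one worked example, Context Precision can be read (as I understand the Ragas documentation) as the average of precision@k taken at each rank where a relevant context appears. The sketch below computes that quantity from a list of binary relevance verdicts. In Ragas those verdicts come from an LLM judge; here they are hand-written, illustrative inputs, and `context_precision_from_verdicts` is a hypothetical helper rather than a Ragas function.

```python
def context_precision_from_verdicts(verdicts: list[int]) -> float:
    """Average precision@k over the ranks k where the context was judged relevant (1)."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    hits_so_far = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits_so_far += 1
            weighted_sum += hits_so_far / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / relevant_total


all_relevant = context_precision_from_verdicts([1, 1, 1, 1])
one_miss_at_rank_3 = context_precision_from_verdicts([1, 1, 0, 1])
misses_at_ranks_2_and_4 = context_precision_from_verdicts([1, 0, 1, 0])
print(all_relevant, round(one_miss_at_rank_3, 6), round(misses_at_ranks_2_and_4, 6))
```

The metric rewards retrievers that place relevant chunks ahead of irrelevant ones, so an irrelevant chunk near the top of the list costs more than one at the bottom.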

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings

eval_llm = ChatOpenAI(model="gpt-4o")

results = evaluate(
    response_dataset,
    metrics,
    llm=eval_llm
)

In [None]:
results

{'faithfulness': 0.9750, 'answer_relevancy': 0.9719, 'context_recall': 1.0000, 'context_precision': 0.9479, 'answer_correctness': 0.4596}

In [None]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What rights does Microsoft have to OpenAI's pr... | Microsoft has the exclusive right to license c... | `[69. \nOn September 22, 2020, OpenAI announced...` | Microsoft only has rights to certain of OpenAI... | 1.0 | 0.966469 | 1.0 | 1.0 | 0.450698 |
| 1 | Where is the principal place of business for O... | The principal place of business for OpenAI OpC... | `[7. \nOpenAI GP, L.L.C. is a limited liability...` | 1960 Bryant Street, San Francisco, CA 94110 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.20814 |
| 2 | What tasks were early AI programs capable of o... | Early AI programs were capable of outperformin... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | The answer to given question is not present in... | 1.0 | 0.984539 | 1.0 | 1.0 | 0.180842 |
| 3 | What measures have been publicly called for to... | publicly called for a variety of measures to a... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | A variety of measures have been publicly calle... | 1.0 | 0.936755 | 1.0 | 1.0 | 0.840398 |
| 4 | How did AlphaZero become the strongest chess p... | AlphaZero became the strongest chess playing s... | `[learns to play chess by playing itself with d...` | AlphaZero became the strongest chess playing s... | 1.0 | 1.0 | 1.0 | 1.0 | 0.451788 |
| 5 | What principle was upheld in OpenAI, Inc.'s Ce... | The principle upheld in OpenAI, Inc.'s Certifi... | `[Inc.’s December 8, 2015 Certificate of Incorp...` | The Certificate of Incorporation affirmed that... | 1.0 | 0.990421 | 1.0 | 0.833333 | 0.795088 |
| 6 | What breach occurred with OpenAI and GPT-4 in ... | In 2023, a breach occurred with OpenAI and GPT... | `[The 2023 Breach Of The Founding Agreement \n2...` | The 2023 Breach Of The Founding Agreement | 0.8 | 0.943875 | 1.0 | 0.833333 | 0.206821 |
| 7 | How might Strong AGI affect human economics, a... | Strong AGI might affect human economics by mak... | `[the greatest existential threat we face today...` | Strong AGI, as noted by luminaries like Stephe... | 1.0 | 0.953091 | 1.0 | 1.0 | 0.54339 |
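
The aggregate scores reported by `results` appear to be the plain mean of the per-question scores shown above. A quick check, with three of the columns copied by hand from this table (the variable names are just for this sketch):

```python
from statistics import fmean

# Per-question scores copied from the results table above.
faithfulness_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0]
context_precision_scores = [1.0, 0.916667, 1.0, 1.0, 1.0, 0.833333, 0.833333, 1.0]
answer_correctness_scores = [0.450698, 0.20814, 0.180842, 0.840398,
                             0.451788, 0.795088, 0.206821, 0.54339]

mean_faithfulness = fmean(faithfulness_scores)
mean_context_precision = fmean(context_precision_scores)
mean_answer_correctness = fmean(answer_correctness_scores)

print(round(mean_faithfulness, 4))
print(round(mean_context_precision, 4))
print(round(mean_answer_correctness, 4))
```

With only eight questions, a single low row moves the average noticeably: the one faithfulness score of 0.8 is the whole gap between 1.0 and 0.975.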


In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model_oai_te3 = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_base="https://api.openai.com/v1/",
)

# Index every chunk in batches of 32 (iterate over the chunk list, not the page list)
for i in range(0, len(rag_documents), 32):
  if i == 0:
    rag_vectorstore = Qdrant.from_documents(
        rag_documents[i:i+32],
        embedding_model_oai_te3,
        location=":memory:",
        collection_name="Elon's Complaint - RAG - TE3")
    continue
  rag_vectorstore.add_documents(rag_documents[i:i+32])

In [None]:
rag_retriever_te3 = rag_vectorstore.as_retriever()

In [None]:
te3_rag_chain = (
    {"context": itemgetter("question") | rag_retriever_te3, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | hf_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [None]:
answers = []
contexts = []

for question in test_questions:
  # Use the chain built on the text-embedding-3-small retriever, not the baseline chain
  response = te3_rag_chain.invoke({"question" : question})
  answers.append(response["response"])
  contexts.append([context.page_content for context in response["context"]])

In [None]:
from datasets import Dataset

response_dataset_te3 = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [None]:
results_te3 = evaluate(
    response_dataset_te3,
    metrics,
    llm=eval_llm
)

In [None]:
results_te3

{'faithfulness': 0.9583, 'answer_relevancy': 0.9685, 'context_recall': 1.0000, 'context_precision': 0.9687, 'answer_correctness': 0.4655}

In [None]:
results_te3_df = results_te3.to_pandas()
results_te3_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What rights does Microsoft have to OpenAI's pr... | Microsoft has exclusive rights to OpenAI's pre... | `[69. \nOn September 22, 2020, OpenAI announced...` | Microsoft only has rights to certain of OpenAI... | 1.0 | 0.940534 | 1.0 | 1.0 | 0.23436 |
| 1 | Where is the principal place of business for O... | The principal place of business for OpenAI OpC... | `[7. \nOpenAI GP, L.L.C. is a limited liability...` | 1960 Bryant Street, San Francisco, CA 94110 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.208269 |
| 2 | What tasks were early AI programs capable of o... | Early AI programs were capable of outperformin... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | The answer to given question is not present in... | 1.0 | 0.984554 | 1.0 | 1.0 | 0.180881 |
| 3 | What measures have been publicly called for to... | Publicly called for a variety of measures to a... | `[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...` | A variety of measures have been publicly calle... | 1.0 | 0.936755 | 1.0 | 1.0 | 0.841684 |
| 4 | How did AlphaZero become the strongest chess p... | AlphaZero became the strongest chess playing s... | `[learns to play chess by playing itself with d...` | AlphaZero became the strongest chess playing s... | 1.0 | 1.0 | 1.0 | 1.0 | 0.451788 |
| 5 | What principle was upheld in OpenAI, Inc.'s Ce... | In OpenAI, Inc.'s Certificate of Incorporation... | `[Inc.’s December 8, 2015 Certificate of Incorp...` | The Certificate of Incorporation affirmed that... | 1.0 | 0.990418 | 1.0 | 1.0 | 0.795677 |
| 6 | What breach occurred with OpenAI and GPT-4 in ... | In 2023, the breach occurred with OpenAI and G... | `[The 2023 Breach Of The Founding Agreement \n2...` | The 2023 Breach Of The Founding Agreement | 0.666667 | 0.935795 | 1.0 | 0.833333 | 0.204894 |
| 7 | How might Strong AGI affect human economics, a... | Strong AGI might affect human economics by mak... | `[the greatest existential threat we face today...` | Strong AGI, as noted by luminaries like Stephe... | 1.0 | 0.959691 | 1.0 | 1.0 | 0.806172 |


In [None]:
results

{'faithfulness': 0.9750, 'answer_relevancy': 0.9719, 'context_recall': 1.0000, 'context_precision': 0.9479, 'answer_correctness': 0.4596}

In [None]:
results_te3

{'faithfulness': 0.9583, 'answer_relevancy': 0.9685, 'context_recall': 1.0000, 'context_precision': 0.9687, 'answer_correctness': 0.4655}
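
To line the two runs up side by side, a small sketch like the following computes the per-metric difference. The numbers are copied from the two outputs above, and `metric_deltas` is a hypothetical helper.

```python
baseline_scores = {
    "faithfulness": 0.9750,
    "answer_relevancy": 0.9719,
    "context_recall": 1.0000,
    "context_precision": 0.9479,
    "answer_correctness": 0.4596,
}
te3_scores = {
    "faithfulness": 0.9583,
    "answer_relevancy": 0.9685,
    "context_recall": 1.0000,
    "context_precision": 0.9687,
    "answer_correctness": 0.4655,
}


def metric_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return after - before for every metric present in both runs, rounded to 4 places."""
    return {name: round(after[name] - before[name], 4) for name in before if name in after}


deltas = metric_deltas(baseline_scores, te3_scores)
for name, delta in deltas.items():
    print(f"{name:>20}: {delta:+.4f}")
```

Every difference here is within about two hundredths. With only eight questions and LLM-judged metrics, gaps this small may be run-to-run variation rather than a real effect of the embedding model.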