# The Art of RAG Evaluation

In this notebook we'll explore the following:

- Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
- Evaluating our pipeline with the [Ragas](https://github.com/explodinggradients/ragas) library
- Making an adjustment to our RAG pipeline
- Evaluating our adjusted pipeline against our baseline

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: You'll notice we're including a number of `pip install` commands relating to LangChain now - this is part of their v0.1.0 release! Keep in mind that not all of these are critical to building a LangChain pipeline - we're only using them to show the plethora of options we have with the LangChain package!

In [None]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu pypdf langchain-nomic

In [None]:
import langchain
print(f"LangChain Version: {langchain.__version__}")

LangChain Version: 0.1.11


Since we'll be using OpenAI to power our RAG pipeline and part of the functionality of the RAGAS library - we'll need an OpenAI API key!

In [None]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Building our RAG pipeline

While the version may have changed - the process of creating our RAG pipeline remains largely the same:

- Create an Index
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data - we'll be using the DoD Data Strategy PDF (`DoD_Data_Strategy.pdf`) to keep things simple.

> NOTE: You'll notice that some specific loaders, LLMs, etc., are in their own libraries now. This allows you to stay as lightweight as you'd like while using LangChain!

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/DoD_Data_Strategy.pdf")
documents = loader.load()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
documents[0].metadata

{'source': '/content/DoD_Data_Strategy.pdf', 'page': 0}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [None]:
len(documents)

75
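
To make `chunk_size` and `chunk_overlap` a bit more concrete, here's a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Notice that the last three characters of each chunk reappear at the start of the next one - that's the overlap, and it helps keep sentences that straddle a boundary retrievable from either side.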

#### Loading an Embeddings Model

We'll need a process by which we can convert our text into vectors that we can compare against our query vector.

Let's use Nomic's `nomic-embed-text-v1.5` for this task, via the `langchain-nomic` package!

In [None]:
from langchain_nomic import NomicEmbeddings

embeddings = NomicEmbeddings(
    model="nomic-embed-text-v1.5"
)

In [None]:
! nomic login YOUR_NOMIC_API_KEY

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

We'll be leveraging Meta's FAISS for this task.

In [None]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [None]:
retriever = vector_store.as_retriever()
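
Conceptually, a retriever embeds the query and returns the `k` closest chunks. Here's a toy sketch of that idea using hand-written three-dimensional vectors and cosine similarity. Everything here (`toy_index`, `top_k`, the vectors themselves) is hypothetical, and cosine similarity is used only for illustration - the FAISS store may rank by a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical "embeddings" for four chunks.
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}

query = [1.0, 0.2, 0.0]
nearest = top_k(query, toy_index, k=2)
print(nearest)
# ['chunk_b', 'chunk_a']
```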

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [None]:
retrieved_documents = retriever.invoke("Why did they change to version 0.1.0?")

In [None]:
for doc in retrieved_documents:
  print(doc)

page_content='data to capitalize on strategic and tactical opportunities that are currently unavailable. We have a \nresponsibility to gain full value from DoD capabilities and investments, thereby earning the trust \nof the operational warfighter, the U.S. Congress, and the American people. Embracing new data\xad\ndriven concepts and leveraging commercial-sector innovations will improve military operations \nand increase lethality. \nTo enable this change, the Department is adopting new technologies as part of its Digital \nModernization program -from automation to Artificial Intelligence (Al) to 5G-enabled edge' metadata={'source': '/content/DoD_Data_Strategy.pdf', 'page': 3}
page_content='provide  real-world  outcomes  that will aid in prioritizing  data gaps,  as will lessons  from  the Army’s \nwork on data design principles and similar efforts by the other  MILDEPs.  \nWhen new data gaps are identified, the data governance community must work with mission area \nmanagers  to dete

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [None]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [None]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple - but we'll create our own to be a bit more specific!

In [None]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
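
To see what the template does once it's filled in, here's a rough plain-Python equivalent using `str.format`. This is only a sketch of the substitution step - `ChatPromptTemplate` additionally wraps the result in a chat message - and the context and question strings are made-up examples.

```python
# Same wording as the template above, kept under a separate name so the
# notebook's `template` variable is not overwritten.
template_text = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Hypothetical values standing in for retrieved context and a user question.
filled = template_text.format(
    context="Data governance provides the principles, policies, and processes to manage data.",
    question="What does data governance provide?",
)
print(filled)
```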

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for RAGAS.

In [None]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
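
If the LCEL syntax feels dense, the data flow can be sketched in plain Python. The stand-ins below (`fake_retriever`, `fake_llm`, `run_chain`) are hypothetical and call no real model. As far as I can tell, the middle `RunnablePassthrough.assign(context=...)` step just re-assigns `context` to its own value, so the sketch folds it away.

```python
def fake_retriever(question):
    # Hypothetical stand-in for `retriever`: always returns one fixed chunk.
    return ["Data governance provides principles, policies, and processes."]


def fake_llm(prompt_text):
    # Hypothetical stand-in for `primary_qa_llm`: reports the prompt size.
    return f"(answer generated from a {len(prompt_text)}-character prompt)"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: format the prompt and call the model, while also passing the
    # retrieved context through so it is available for evaluation later.
    prompt_text = f"Context:\n{step1['context']}\n\nQuestion:\n{step1['question']}\n"
    return {"response": fake_llm(prompt_text), "context": step1["context"]}


result_sketch = run_chain({"question": "What does data governance provide?"})
print(sorted(result_sketch.keys()))
# ['context', 'response']
```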

Let's test it out!

In [None]:
question = "What are the major changes in v0.1.0?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

I don't know


In [None]:
question = "What is LangGraph?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content='for successful decision -making and joint military operations. Achieving semantic as well as  \nsyntactic interoperability  using  common  data formats  and machine -to-machine  communications  \naccelerates advanced algorithm development and provides a strategic advantage to the  Department.  \nDoD will know it has made progress toward making data interoperable when:  \nObjective 1: DoD documents and implements data exch ange specifications for all systems , \nincluding those of coalition partners.', metadata={'source': '/content/DoD_Data_Strategy.pdf', 'page': 12}), Document(page_content='vocabularies, includi ng enterprise standards.', metadata={'source': '/content/DoD_Data_Strategy.pdf', 'page': 11}), Document(page_content='known.  \n3.3. Governan ce \nData governance provides the principles, policies, processes, frameworks, tools, metrics, and \noversight required to effectively manage data at all levels, from creation to disposition. Data \ng

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Ragas Evaluation

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

#### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

Let's create a new set of documents to ensure we're not accidentally creating a sample test set that favours our base model too much!

In [None]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
documents = text_splitter.split_documents(documents)

In [None]:
len(documents)

59

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

embedding nodes:   0%|          | 0/118 [00:00<?, ?it/s]



Generating:   0%|          | 0/10 [00:00<?, ?it/s]

Let's look at the output and see what we can learn about it!

In [None]:
testset.test_data[0]

DataRow(question='How does data-driven decision-making rely on making data trustworthy?', contexts=['Page 8 DoD Data Strategy   Objective 6: Adaptive, intelligent systems monitor data streams and identify opportunities \nto transform, combine, or derive new data providing increased insights.  \n \n4.4. Goal:  Make Data  Linked  \nData -driven decision -making requires DoD data to be linked such that relationships and \ndependencies  can be uncovered  and maintained.  Adhering  to industry  best-practices  for open  data \nstandards, data catalogs, and metadata tagging, the Department ensures that connections across \ndispar ate sources can be made and leveraged for  analytics.  \nDoD will know it has made progress on making data linked when:  \nObjective 1: DoD implements globally unique identifiers so data can be easily discovered, \nlinked, retrieved, and referenced.  \nObjective 2: DoD utilizes common metadata standards that allow data to be joined and \nintegrated.  \n \n4.5. Goal:

#### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [None]:
test_df = testset.to_pandas()

In [None]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | How does data-driven decision-making rely on m... | [Page 8 DoD Data Strategy Objective 6: Adapt... | Data-driven decision-making relies on making d... | simple | True |
| 1 | What is the role of data stewards in the manag... | [These advantages will be reflected in more ra... | Data stewards establish policies governing dat... | simple | True |
| 2 | How are access and handling restriction metada... | [record retention rules are developed and impl... | Access and handling restriction metadata are b... | simple | True |
| 3 | What is the goal of Objective 5 in the DoD Dat... | [Page 7 DoD Data Strategy Objective 3: All ... | The goal of Objective 5 in the DoD Data Strate... | simple | True |
| 4 | How does the implementation of evidence-based ... | [Contractors at every echelon) will be increas... | | simple | True |
| 5 | How does the Department of Defense ensure trus... | [Page 8 DoD Data Strategy Objective 6: Adapt... | The Department of Defense ensures trustworthy ... | reasoning | True |
| 6 | How does the Department of Defense ensure acco... | [These advantages will be reflected in more ra... | DoD is defining roles and responsibilities for... | reasoning | True |
| 7 | How does the Department of Defense ensure data... | [Page 8 DoD Data Strategy Objective 6: Adapt... | The Department of Defense ensures data trustwo... | multi_context | True |
| 8 | "Why is data ethics important in decision-maki... | [These advantages will be reflected in more ra... | Data ethics is important in decision-making an... | multi_context | True |
| 9 | What is the challenge in data collection and h... | [used, and shared. As the Secretary of Defense... | The challenge in data collection is to discove... | simple | True |
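
It's worth checking how the generated mix compares to the distribution we asked for (`simple: 0.5`, `reasoning: 0.25`, `multi_context: 0.25`). Here's a small sketch that tallies the `evolution_type` values copied by hand from the table above. In the notebook itself, `test_df["evolution_type"].value_counts()` should give the same counts.

```python
from collections import Counter

# evolution_type column, copied by hand from the generated test set above.
evolution_types = [
    "simple", "simple", "simple", "simple", "simple",
    "reasoning", "reasoning",
    "multi_context", "multi_context",
    "simple",
]

counts = Counter(evolution_types)
shares = {name: count / len(evolution_types) for name, count in counts.items()}
print(shares)
# {'simple': 0.6, 'reasoning': 0.2, 'multi_context': 0.2}
```

For this run the requested distribution was only approximated: 6/2/2 rather than 5/2.5/2.5, which is unsurprising with just ten samples.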


In [None]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [None]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [None]:
response_dataset[0]

{'question': 'How does data-driven decision-making rely on making data trustworthy?',
 'answer': 'Data-driven decision-making relies on making data trustworthy by ensuring that consumers can be confident in all aspects of the data they are using for decision-making.',
 'contexts': ['relationships.  \n5.) Make Data Trustworthy  – Consumers can be confident in all aspects of data for \ndecision -making.  \n6.) Make Data Interoperable  – Consumers have a common representation/  \ncomprehension of data.  \n7.) Make Data Secure  – Consumers know that data is protected from unauthorized \nuse/manipulation.  \nWay Ahead : To implement this Strategy, Components will develop measurable Data \nStrategy Implementation Plans , overseen by t he DoD CDO and  DoD Data  Council. The \ndata governance community and user communities will continue to partner to identify \nchallenges, develop solutions, and share best practices for all data  stakeholders.',
  'high quality, accurate, complete, timely, pro

#### Evaluating with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
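
To build some intuition for two of these numbers, here's a minimal sketch of the arithmetic as I understand it from the Ragas docs. Context precision averages precision@k over the ranks that hold a relevant chunk, and faithfulness is the share of answer claims supported by the context. In Ragas an LLM produces the relevance and support verdicts; the 0/1 lists and claim counts below are hypothetical inputs, and both function names are made up for this illustration.

```python
def context_precision_at_k(relevance):
    """relevance: 0/1 verdicts for the retrieved chunks, in ranked order."""
    hits = 0
    weighted_precision = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_precision += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_precision / hits if hits else 0.0


def faithfulness_score(supported_claims, total_claims):
    """Share of the answer's claims that the retrieved context supports."""
    return supported_claims / total_claims


print(round(context_precision_at_k([0, 1, 1, 0]), 6))  # 0.583333
print(round(context_precision_at_k([1, 0, 1, 1]), 6))  # 0.805556
print(round(faithfulness_score(2, 3), 6))              # 0.666667
```

The takeaway: context precision rewards putting relevant chunks near the top of the ranking, not just retrieving them somewhere in the list.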

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [None]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
results

{'faithfulness': 0.9630, 'answer_relevancy': 0.9646, 'context_recall': 0.8250, 'context_precision': 0.7111, 'answer_correctness': 0.5782}

In [None]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How does data-driven decision-making rely on m... | Data-driven decision-making relies on making d... | [relationships. \n5.) Make Data Trustworthy ... | Data-driven decision-making relies on making d... | 1.0 | 0.932329 | 1.0 | 0.583333 | 0.530865 |
| 1 | What is the role of data stewards in the manag... | Data stewards are responsible for defining pol... | [both immediate and lasting military advant... | Data stewards establish policies governing dat... | 1.0 | 0.9731 | 1.0 | 1.0 | 0.73437 |
| 2 | How are access and handling restriction metada... | Access and handling restriction metadata are b... | [in use (within applications, with analytics, ... | Access and handling restriction metadata are b... | 0.666667 | 0.963243 | 1.0 | 0.333333 | 0.740014 |
| 3 | What is the goal of Objective 5 in the DoD Dat... | The goal of Objective 5 in the DoD Data Strate... | [Page 9 DoD Data Strategy Objective 2: Excha... | The goal of Objective 5 in the DoD Data Strate... | | 1.0 | 1.0 | 1.0 | 0.7337 |
| 4 | How does the implementation of evidence-based ... | The implementation of evidence-based policies ... | [levels, from creation to di sposition. \n4.)... | | 1.0 | 0.907364 | 0.0 | 0.0 | 0.179706 |
| 5 | How does the Department of Defense ensure trus... | The Department of Defense ensures trustworthy ... | [Objective 1: DoD implements globally unique i... | The Department of Defense ensures trustworthy ... | 1.0 | 0.986675 | 0.25 | 0.805556 | 0.41775 |
| 6 | How does the Department of Defense ensure acco... | The Department of Defense ensures accountabili... | [evidence and Learning Agendas (see P.L. 115 -... | DoD is defining roles and responsibilities for... | 1.0 | 0.98909 | 1.0 | 0.805556 | 0.425209 |
| 7 | How does the Department of Defense ensure data... | The Department of Defense ensures data trustwo... | [FOREWORD \nThe Department of Defense's (DoD) ... | The Department of Defense ensures data trustwo... | 1.0 | 0.960242 | 1.0 | 0.583333 | 0.742027 |
| 8 | "Why is data ethics important in decision-maki... | Data ethics is important in decision-making an... | [analytics, ethical principles regarding th... | Data ethics is important in decision-making an... | 1.0 | 0.961924 | 1.0 | 1.0 | 0.745788 |
| 9 | What is the challenge in data collection and h... | The challenge in data collection is that the D... | [Page 2 DoD Data Strategy \n1.1. Problem S... | The challenge in data collection is to discove... | 1.0 | 0.971566 | 1.0 | 1.0 | 0.532263 |
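
The aggregate scores printed earlier look like plain averages of these per-question values, with missing entries skipped. Here's a quick sketch that checks this for two of the columns. The values are copied by hand from the table above, and treating the blank faithfulness entry for question 3 as NaN is an assumption.

```python
import math

# Per-question scores copied from the table above; the blank faithfulness
# entry (question 3) is assumed to be NaN.
faithfulness_scores = [1.0, 1.0, 0.666667, math.nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
context_recall_scores = [1.0, 1.0, 1.0, 1.0, 0.0, 0.25, 1.0, 1.0, 1.0, 1.0]


def nan_mean(values):
    """Average of the values, skipping NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)


print(round(nan_mean(faithfulness_scores), 4))    # 0.963
print(round(nan_mean(context_recall_scores), 4))  # 0.825
```

Both results line up with the `faithfulness` and `context_recall` values in the aggregate output above, which also shows how much a single low-scoring question (like question 4) can move a ten-sample average.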


## Testing a More Performant Retriever

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [None]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
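
The idea behind multi-query retrieval is to rephrase the question several ways, retrieve for each phrasing, and keep the unique union of the results. Here's a toy sketch of that flow. `generate_query_variants` and `toy_retrieve` are hypothetical stand-ins for the LLM rewrite step and the base retriever, and the tiny keyword "corpus" is made up.

```python
def generate_query_variants(question):
    # Hypothetical stand-in for the LLM call that rewrites the question.
    return [question, f"Explain: {question}", f"Background on: {question}"]


def toy_retrieve(query):
    # Hypothetical stand-in for the base retriever: keyword lookup in a tiny corpus.
    corpus = {
        "governance": "Data governance provides principles and policies.",
        "explain": "Data stewards establish policies governing data access.",
        "background": "Data governance provides principles and policies.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]


def multi_query_retrieve(question):
    """Retrieve for every query variant and keep each document only once."""
    seen, merged = set(), []
    for variant in generate_query_variants(question):
        for doc in toy_retrieve(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


docs = multi_query_retrieve("What is data governance?")
print(len(docs))
# 2
```

Because more phrasings are tried, we'd expect this approach to help recall, at the cost of extra LLM calls and potentially more (and noisier) context passed to the generator.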

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [None]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [None]:
response = retrieval_chain.invoke({"input": "What are the major changes in v0.1.0?"})

In [None]:
print(response["answer"])

The context provided does not mention any specific version numbers or changes in v0.1.0. It primarily discusses the importance of data governance, data interoperability, software upgradability, cloud readiness, and the adoption of new technologies like Artificial Intelligence and 5G-enabled edge computing in the Department of Defense. If you have any other questions or need clarification on a different topic within the context provided, feel free to ask.


In [None]:
response = retrieval_chain.invoke({"input": "What is LangGraph?"})

In [None]:
print(response["answer"])

The context provided does not mention any specific version numbers or changes in v0.1.0. It primarily discusses the importance of data governance, data interoperability, software upgradability, cloud readiness, and the adoption of new technologies like Artificial Intelligence and 5G-enabled edge computing in the Department of Defense. If you have any other questions or need clarification on a different topic within the context provided, feel free to ask.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [None]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [None]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

### Comparing Results

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [None]:
results

{'faithfulness': 0.9688, 'answer_relevancy': 0.4502, 'context_recall': 0.6000, 'context_precision': 0.3917, 'answer_correctness': 0.5574}

And see how our advanced retrieval modified our chain!

In [None]:
advanced_retrieval_results

{'faithfulness': 0.8750, 'answer_relevancy': 0.4772, 'context_recall': 0.7000, 'context_precision': 0.3918, 'answer_correctness': 0.5425}

In [None]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.96875 | 0.875 | -0.09375 |
| 1 | answer_relevancy | 0.450168 | 0.477159 | 0.026991 |
| 2 | context_recall | 0.6 | 0.7 | 0.1 |
| 3 | context_precision | 0.391667 | 0.391786 | 0.000119 |
| 4 | answer_correctness | 0.557359 | 0.542514 | -0.014845 |
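
As a cross-check that doesn't need pandas, the direction of each change can be recomputed from the two printed result dictionaries. This is just a sketch using the rounded values shown above, so the deltas differ slightly from the table in the last decimal places.

```python
# Rounded aggregate scores, copied from the two printed outputs above.
baseline = {
    "faithfulness": 0.9688,
    "answer_relevancy": 0.4502,
    "context_recall": 0.6000,
    "context_precision": 0.3917,
    "answer_correctness": 0.5574,
}
advanced = {
    "faithfulness": 0.8750,
    "answer_relevancy": 0.4772,
    "context_recall": 0.7000,
    "context_precision": 0.3918,
    "answer_correctness": 0.5425,
}

# Positive delta means the advanced retriever scored higher than the baseline.
deltas = {metric: round(advanced[metric] - baseline[metric], 4) for metric in baseline}
improved = [metric for metric, delta in deltas.items() if delta > 0]
regressed = [metric for metric, delta in deltas.items() if delta < 0]

print("improved: ", improved)
print("regressed:", regressed)
# improved:  ['answer_relevancy', 'context_recall', 'context_precision']
# regressed: ['faithfulness', 'answer_correctness']
```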


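We can see that our context recall has improved (+0.10) and our answer relevancy rose slightly (+0.03), while context precision is essentially unchanged. On the other hand, faithfulness dropped (-0.09) and answer correctness dipped slightly (-0.01).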

We'd need to do some more experimentation to determine how to improve our pipeline!