# The Art of RAG Evaluation with Unstructured.io

In this notebook we'll explore the following:

- Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
- Evaluating our pipeline with the [Ragas](https://github.com/explodinggradients/ragas) library

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu langchain-nomic


In [2]:
!pip install -r /content/requirements.txt

Collecting unstructured[all-docs] (from -r /content/requirements.txt (line 3))
  Downloading unstructured-0.12.6-py3-none-any.whl (1.8 MB)
Collecting jupyter (from -r /content/requirements.txt (line 5))
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting backoff==2.2.1 (from unstructured[all-docs]->-r /content/requirements.txt (line 3))
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting dataclasses-json-speakeasy==0.5.11 (from unstructured[all-docs]->-r /content/requirements.txt (line 3))
  Downloading dataclasses_json_speakeasy-0.5.11-py3-none-any.whl (28 kB)
Collecting emoji==2.10.1 (from unstructured[all-docs]->-r /content/requirements.txt (line 3))
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)

In [1]:
import langchain
print(f"LangChain Version: {langchain.__version__}")

LangChain Version: 0.1.13


In [2]:
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "/content/DoD_Data_Strategy.pdf", mode="elements"
)
docs = loader.load()
docs[:5]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[Document(page_content='Executive Summary: DoD Data Strategy Unleashing Data to Advance the National Defense Strategy', metadata={'source': '/content/DoD_Data_Strategy.pdf', 'coordinates': {'points': ((134.66, 74.58800000000008), (134.66, 110.45263999999997), (484.33, 110.45263999999997), (484.33, 74.58800000000008)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': '/content', 'filename': 'DoD_Data_Strategy.pdf', 'languages': ['eng'], 'last_modified': '2024-03-20T23:43:45', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),
 Document(page_content='BLUF: The DoD Data Strategy supports the National Defense Strategy and Digital Modernization by providing the overarching vision, focus areas, guiding principles, essential capabilities, and goals necessary to transform the Department into a data-centric enterprise. Success cannot be taken for granted…it is the responsibility of all DoD leaders to treat data as a weapon sy

Since OpenAI models will power both our RAG pipeline and parts of the Ragas library - we'll need an OpenAI API key!

In [3]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········
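If you re-run the notebook often, you may not want to be prompted every time. Here's a minimal sketch, assuming the key lives in the `OPENAI_API_KEY` environment variable; the helper name `ensure_api_key` and its parameters are hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", environ=None, prompt=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `environ` and `prompt` are injectable purely so the helper is easy to test;
    both are assumptions of this sketch.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        # Fall back to an interactive, non-echoing prompt and cache the result.
        key = prompt(f"Please provide your {env_var}: ")
        environ[env_var] = key
    return key


# Usage in a notebook cell (prompts only when the variable is not set):
# ensure_api_key()
```

Keeping the key in the environment, rather than in a cell, also keeps it out of saved notebook outputs.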


## Building our RAG pipeline

While the version may have changed - the process of creating our RAG pipeline remains largely the same:

- Create an Index
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data - we've already done this above, where `UnstructuredFileLoader` parsed the DoD Data Strategy PDF in `elements` mode, so `docs` holds one `Document` per extracted element.

> NOTE: You'll notice that some specific loaders, LLMs, etc., are in their own libraries now. This allows you to stay as lightweight as you'd like while using LangChain!

#### Transforming Data

Now that we've got our documents - let's split them into smaller pieces so we can more effectively leverage them with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(docs)

Let's confirm we've split our documents.

In [5]:
len(documents)

349
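To build some intuition for what `chunk_size` and `chunk_overlap` mean, here's a deliberately simplified sketch: a fixed-width sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally (it tries to split on separators such as paragraph and line breaks first); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size=700, chunk_overlap=50):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so content that
    straddles a boundary still appears intact in at least one chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [700, 700, 200]
```

Larger chunks give the LLM more surrounding context per hit, while smaller chunks make each retrieved piece more focused - the 700/50 values used above are just one reasonable starting point.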

#### Loading an Embeddings Model

We'll need a way to convert our text into vectors that we can compare against our query vector.

Let's use Nomic's `nomic-embed-text-v1.5`, through the `langchain-nomic` integration, for this task!
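As a quick illustration of what "comparing to our query vector" means, here's a minimal sketch using cosine similarity on tiny hand-written vectors. The vectors and their labels are made up for the example - real embeddings have hundreds of dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
query_vector = [0.9, 0.1]
chunk_about_data_catalogs = [1.0, 0.0]
chunk_about_workforce = [0.0, 1.0]

print(cosine_similarity(query_vector, chunk_about_data_catalogs))
print(cosine_similarity(query_vector, chunk_about_workforce))
```

The chunk whose vector points in nearly the same direction as the query scores close to 1, which is what makes it a good retrieval candidate.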

In [6]:
from langchain_nomic import NomicEmbeddings

embeddings = NomicEmbeddings(
    model="nomic-embed-text-v1.5"
)

In [7]:
! nomic login YOUR_NOMIC_API_KEY

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

We'll be leveraging Meta's FAISS for this task.

In [8]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [9]:
retriever = vector_store.as_retriever()
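Conceptually, the vector store plus retriever boils down to "keep text next to its vector, then return the `k` nearest texts for a query vector". Here's a brute-force sketch of that idea; `ToyVectorStore` is hypothetical, the Euclidean distance and `k=4` default are assumptions chosen to mirror what I understand LangChain's FAISS wrapper does by default, and real FAISS indexes are far more efficient than this.

```python
import math


class ToyVectorStore:
    """In-memory store that ranks texts by Euclidean distance to a query vector."""

    def __init__(self):
        self._rows = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._rows.append((text, list(vector)))

    def search(self, query_vector, k=4):
        # Exhaustive scan: compute the distance to every stored vector, keep the k closest.
        ranked = sorted(self._rows, key=lambda row: math.dist(row[1], query_vector))
        return [text for text, _vector in ranked[:k]]


store = ToyVectorStore()
store.add("data catalogs", [1.0, 0.0])
store.add("workforce culture", [0.0, 1.0])
store.add("data governance", [0.9, 0.1])

print(store.search([1.0, 0.0], k=2))  # ['data catalogs', 'data governance']
```

With the real retriever, the number of returned documents can be adjusted when calling `as_retriever`, for example through `search_kwargs={"k": 2}`.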

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [10]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [11]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple - but we'll create our own to be a bit more specific!

In [12]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
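To see what the model will actually receive, it can help to fill the template by hand. The sketch below uses plain `str.format` as a stand-in for what `ChatPromptTemplate` does with the `{context}` and `{question}` placeholders; `build_prompt` is a hypothetical helper and the example context is made up.

```python
TEMPLATE = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""


def build_prompt(context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, question=question)


filled = build_prompt(
    ["Data is a strategic asset.", "Data governance provides principles and policies."],
    "Why does data governance matter?",
)
print(filled)
```

Printing the filled prompt like this is a cheap way to spot problems such as empty context or duplicated chunks before spending tokens on the LLM call.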

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass through our context - which is critical for Ragas.

In [13]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
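If the LCEL syntax is new to you, the data flow of the chain above can be sketched in plain Python: retrieve, format the prompt, generate, and return both the response and the context. Everything here is a stand-in - `run_rag` and the three `fake_*` functions are hypothetical stubs, not LangChain APIs.

```python
def run_rag(question, retrieve, build_prompt, generate):
    """Mirror the chain's shape: returns {"response": ..., "context": ...}."""
    context = retrieve(question)                   # step 1: fetch relevant chunks
    prompt_text = build_prompt(context, question)  # step 2: fill the prompt template
    response = generate(prompt_text)               # step 3: call the LLM
    # The context is passed through alongside the response so it can be evaluated later.
    return {"response": response, "context": context}


# Stubs so the sketch runs without a vector store or an API key.
def fake_retrieve(question):
    return ["Data is a strategic asset."]


def fake_build_prompt(context, question):
    return "Context:\n" + "\n".join(context) + "\n\nQuestion:\n" + question


def fake_generate(prompt_text):
    return f"(stub answer for a prompt of {len(prompt_text)} characters)"


result = run_rag("Why is data important?", fake_retrieve, fake_build_prompt, fake_generate)
print(result["context"])
```

Returning the context next to the response is the important design choice: Ragas needs to see what was retrieved, not just what was answered.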

## Ragas Evaluation

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

#### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

Let's create a new set of documents to ensure we're not accidentally creating a sample test set that favours our base model too much!

In [14]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
documents = text_splitter.split_documents(documents)

In [15]:
len(documents)

339

In [16]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})


Let's look at the output and see what we can learn about it!

In [17]:
testset.test_data[0]

DataRow(question='How do data catalogs contribute to data-driven decision-making in the Department of Defense?', contexts=['Data-driven decision-making requires DoD data to be linked such that relationships and dependencies can be uncovered and maintained. Adhering to industry best-practices for open data standards, data catalogs, and metadata tagging, the Department ensures that connections across disparate sources can be made and leveraged for analytics.'], ground_truth='Data catalogs contribute to data-driven decision-making in the Department of Defense by adhering to industry best-practices for open data standards, data catalogs, and metadata tagging. This ensures that connections across disparate sources can be made and leveraged for analytics.', evolution_type='simple', metadata=[{'source': '/content/DoD_Data_Strategy.pdf', 'coordinates': {'points': ((72.024, 138.452), (72.024, 191.85199999999998), (543.3159999999999, 191.85199999999998), (543.3159999999999, 138.452)), 'system': 

#### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [18]:
test_df = testset.to_pandas()

In [19]:
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | How do data catalogs contribute to data-driven... | [Data-driven decision-making requires DoD data... | Data catalogs contribute to data-driven decisi... | simple | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 1 | How does superior situational awareness contri... | [Warfighters at all echelons require tested, s... | Superior situational awareness contributes to ... | simple | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 2 | How does DoD research contribute to the transi... | [When new data gaps are identified, the data g... | | simple | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 3 | How does the Department of Defense plan to cul... | [Moving the Department to a data-centric organ... | The Department of Defense plans to cultivate d... | simple | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 4 | What is the purpose of developing measurable D... | [Way Ahead: To implement this Strategy, Compon... | The purpose of developing measurable Data Stra... | simple | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 5 | How do algorithmic models contribute to AI int... | [Artificial Intelligence (AI) is a long-term d... | Algorithmic models contribute to AI integratio... | reasoning | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 6 | What characteristics make data a strategic ass... | [A core tenet of the DoD Data Strategy is the ... | Data is considered a strategic asset for the D... | reasoning | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 7 | How do principles in data governance contribut... | [Data governance provides the principles, poli... | | multi_context | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 8 | How can user feedback and ongoing initiatives ... | [Data policies and standards alone cannot stre... | User feedback and ongoing initiatives can impr... | multi_context | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |
| 9 | What is the role of data governance in managin... | [Data governance provides the principles, poli... | Data governance provides the principles, polic... | reasoning | [{'source': '/content/DoD_Data_Strategy.pdf', ... | True |


In [20]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
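The DataFrame output above shows blank `ground_truth` cells for rows 2 and 7. Assuming those are missing values (for example `NaN`), you may want to drop such pairs before evaluating, since metrics that compare against the ground truth have nothing to compare to. Here's a minimal sketch; `has_ground_truth` and `filter_pairs` are hypothetical helpers, and the notebook itself does not perform this step.

```python
import math


def has_ground_truth(value):
    """True when `value` is a non-empty string rather than None, NaN, or blank."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return bool(str(value).strip())


def filter_pairs(questions, ground_truths):
    """Keep only the (question, ground_truth) pairs that have a usable ground truth."""
    kept = [(q, g) for q, g in zip(questions, ground_truths) if has_ground_truth(g)]
    return [q for q, _ in kept], [g for _, g in kept]


# Toy data standing in for test_questions / test_groundtruths.
questions = ["Q1", "Q2", "Q3"]
ground_truths = ["Answer 1", float("nan"), "Answer 3"]

kept_questions, kept_ground_truths = filter_pairs(questions, ground_truths)
print(kept_questions)  # ['Q1', 'Q3']
```

Whether to drop these rows or regenerate them is a judgment call - with only 10 samples, each missing ground truth noticeably shifts the averages.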

Now we'll generate responses with our RAG pipeline for the questions we've generated - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [21]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [22]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
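Before handing the data to Ragas, a quick shape check can save a failed (and paid-for) evaluation run. The sketch below assumes the four column names used above and that `contexts` must be a list of strings per row; `validate_ragas_columns` is a hypothetical helper, not a Ragas function.

```python
def validate_ragas_columns(data):
    """Check column presence, equal lengths, and the list-of-strings shape of contexts.

    Returns the number of rows when everything lines up.
    """
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [name for name in required if name not in data]
    if missing:
        raise KeyError(f"missing columns: {missing}")

    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")

    for row, row_contexts in enumerate(data["contexts"]):
        if not isinstance(row_contexts, list) or not all(isinstance(c, str) for c in row_contexts):
            raise TypeError(f"contexts[{row}] must be a list of strings")

    return lengths["question"]


toy = {
    "question": ["Q1"],
    "answer": ["A1"],
    "contexts": [["chunk one", "chunk two"]],
    "ground_truth": ["GT1"],
}
print(validate_ragas_columns(toy))  # 1
```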

Let's take a peek and see what that looks like!

In [23]:
response_dataset[0]

{'question': 'How do data catalogs contribute to data-driven decision-making in the Department of Defense?',
 'answer': 'Data catalogs contribute to data-driven decision-making in the Department of Defense by ensuring that connections across disparate sources can be made and leveraged for analytics.',
 'contexts': ['Data-driven decision-making requires DoD data to be linked such that relationships and dependencies can be uncovered and maintained. Adhering to industry best-practices for open data standards, data catalogs, and metadata tagging, the Department ensures that connections across disparate sources can be made and leveraged for analytics.',
  'Moving the Department to a data-centric organization requires a cultural transformation with the DoD workforce at its heart. DoD will continue to evolve its decision-making culture to one soundly based upon data and analytics enabled by technology. A modern, agile, information-advantaged DoD workforce (leaders, service members, civilians,

#### Evaluating with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
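As a rough sketch of the arithmetic behind two of these metrics, based on my reading of the definitions in the Ragas documentation: faithfulness is the share of claims in the answer that the context supports, and context precision averages precision@k over the ranks where a relevant chunk appears. In Ragas the per-claim and per-chunk verdicts come from LLM judgments - here they are hard-coded assumptions, and both function names are hypothetical.

```python
def faithfulness_score(supported_claims, total_claims):
    """Fraction of the answer's claims that can be inferred from the retrieved context."""
    return supported_claims / total_claims if total_claims else 0.0


def context_precision_score(verdicts):
    """Mean of precision@k taken at every rank k that holds a relevant chunk.

    `verdicts` lists the retrieved chunks in rank order: 1 = relevant, 0 = not relevant.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, relevant in enumerate(verdicts, start=1):
        if relevant:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


print(faithfulness_score(3, 4))                              # 0.75
print(round(context_precision_score([1, 1, 0, 1]), 6))       # 0.916667
print(round(context_precision_score([0, 1, 1, 0]), 6))       # 0.583333
```

Notice that context precision rewards ranking: the same two relevant chunks score higher when they sit at the top of the retrieved list than when an irrelevant chunk comes first.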

In [24]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [25]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [26]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.8704, 'context_recall': 1.0000, 'context_precision': 0.7750, 'answer_correctness': 0.5451}

In [27]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How do data catalogs contribute to data-driven... | Data catalogs contribute to data-driven decisi... | [Data-driven decision-making requires DoD data... | Data catalogs contribute to data-driven decisi... | 1.0 | 1.0 | 1.0 | 0.75 | 0.742591 |
| 1 | How does superior situational awareness contri... | Superior situational awareness contributes to ... | [Warfighters at all echelons require tested, s... | Superior situational awareness contributes to ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.674351 |
| 2 | How does DoD research contribute to the transi... | I don't know. | [Moving the Department to a data-centric organ... | | | 0.0 | 1.0 | 0.0 | 0.198202 |
| 3 | How does the Department of Defense plan to cul... | The Department of Defense plans to cultivate d... | [Moving the Department to a data-centric organ... | The Department of Defense plans to cultivate d... | 1.0 | 0.967799 | 1.0 | 1.0 | 0.621262 |
| 4 | What is the purpose of developing measurable D... | The purpose of developing measurable Data Stra... | [Way Ahead: To implement this Strategy, Compon... | The purpose of developing measurable Data Stra... | 1.0 | 0.968797 | 1.0 | 0.916667 | 0.532234 |
| 5 | How do algorithmic models contribute to AI int... | Algorithmic models contribute to AI integratio... | [algorithmic models will increasingly become t... | Algorithmic models contribute to AI integratio... | 1.0 | 0.940402 | 1.0 | 0.583333 | 0.432678 |
| 6 | What characteristics make data a strategic ass... | Data is considered a strategic asset for the D... | [A core tenet of the DoD Data Strategy is the ... | Data is considered a strategic asset for the D... | 1.0 | 0.959596 | 1.0 | 1.0 | 0.998678 |
| 7 | How do principles in data governance contribut... | Principles in data governance contribute to th... | [Data policies and standards alone cannot stre... | | 1.0 | 0.952556 | 1.0 | 0.5 | 0.175878 |
| 8 | How can user feedback and ongoing initiatives ... | User feedback and ongoing initiatives can impr... | [Data policies and standards alone cannot stre... | User feedback and ongoing initiatives can impr... | 1.0 | 1.0 | 1.0 | 1.0 | 0.460651 |
| 9 | What is the role of data governance in managin... | The role of data governance is to provide the ... | [Data governance provides the principles, poli... | Data governance provides the principles, polic... | 1.0 | 0.915233 | 1.0 | 1.0 | 0.614657 |
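The aggregate scores reported earlier appear to be plain column means of these per-row values. Here's a small sketch that reproduces three of them from the numbers in the table; it assumes the blank `faithfulness` cell in row 2 is `NaN` and that missing values are skipped when averaging, which is consistent with the reported `faithfulness` of 1.0000. `mean_ignoring_nan` is a hypothetical helper.

```python
import math


def mean_ignoring_nan(values):
    """Average the values that are present, skipping NaN entries."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


# Per-row scores copied from the table above (row 2's faithfulness is assumed to be NaN).
faithfulness_scores = [1.0, 1.0, float("nan"), 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
context_precision_scores = [0.75, 1.0, 0.0, 1.0, 0.916667, 0.583333, 1.0, 0.5, 1.0, 1.0]
answer_correctness_scores = [
    0.742591, 0.674351, 0.198202, 0.621262, 0.532234,
    0.432678, 0.998678, 0.175878, 0.460651, 0.614657,
]

print(f"faithfulness:       {mean_ignoring_nan(faithfulness_scores):.4f}")
print(f"context_precision:  {mean_ignoring_nan(context_precision_scores):.4f}")
print(f"answer_correctness: {mean_ignoring_nan(answer_correctness_scores):.4f}")
```

Looking at the rows rather than only the averages is worthwhile here: the two lowest `answer_correctness` scores (rows 2 and 7) are exactly the rows with no ground truth, so they pull the aggregate down for reasons unrelated to the pipeline's quality.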
