
# Benchmarking a RAG Pipeline with Llama3 and LangChain

Retrieval Augmented Generation (RAG) pipelines are one of the most popular use cases of Generative AI and Large Language Models (LLMs).  Evaluating these pipelines is crucial for understanding their performance, limitations, and potential for improvement. 

This notebook demonstrates how to benchmark a RAG pipeline using Llama3-8B as the generation model inside the RAG chain and Llama3-70B as both the dataset generator and the evaluation judge.

This tutorial will walk you through how to:

1. Set up the environment and dependencies.
2. Load and index the data into the vectorstore.
3. Set up the RAG pipeline.
4. Generate an evaluation dataset.
5. Evaluate the RAG pipeline.
6. Display and save the evaluation metrics.

For fast LLM inference, we use Llama3-8B and Llama3-70B served by Groq. You can create a free developer account at https://console.groq.com/ and generate a free API key to follow this tutorial.

## 1. Set up the Environment and Dependencies

In this section, we install and import the libraries required for our benchmarking task.

In [1]:
!pip install langchain -q
!pip install langchain_chroma -q
!pip install langchain_community -q
!pip install langchain_groq -q
!pip install grandalf -q
!pip install numpy -q
!pip install pandas -q
!pip install sentence-transformers -q

In [2]:
import nest_asyncio
nest_asyncio.apply()

### Set Up Environment Variables

To use [Groq](https://groq.com), you need to make sure that `GROQ_API_KEY` is specified as an environment variable.
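
Rather than relying on a key pasted into the notebook, one option is to read it from the environment and fail early when it is missing. The sketch below shows one way to do that; `require_env` is a hypothetical helper introduced for illustration and is not part of Groq's or LangChain's API.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("GROQ_API_KEY")` before instantiating the models surfaces a missing key immediately rather than at the first API call.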

In [19]:
import os

os.environ["GROQ_API_KEY"] = "gsk_..."
os.environ["TOKENIZERS_PARALLELISM"] = "false" # To suppress huggingface warnings


### Instantiate LLM and Embeddings Model

We then instantiate the `ChatGroq` class to use Llama3 through Groq and `HuggingFaceEmbeddings` to use the embeddings model.

NOTE: The `BAAI/bge-small-en-v1.5` HuggingFace embeddings model is downloaded to your computer the first time it is used. The model is ~135MB.

You can swap out any of the Chat and Embeddings models with any of LangChain's [Chat Model integrations](https://python.langchain.com/v0.2/docs/integrations/chat/) and [Embeddings Model integrations](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).

In [4]:
from langchain_groq import ChatGroq
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
rag_llm = ChatGroq(model="llama3-8b-8192") # Used for RAG
qa_llm = ChatGroq(model="llama3-70b-8192", temperature=0.1) # Used to create eval dataset
benchmark_llm = ChatGroq(model="llama3-70b-8192") # Used to evaluate (Judge)



## 2. Load and Index Data into the Vectorstore

We will be using one of Paul Graham's essays as the data to query for the RAG pipeline.

In [5]:
# Download Paul Graham's Essay
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-06 11:50:27--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-06-06 11:50:27 (3.87 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



### Load Data, Split into Chunks, and Index into Vectorstore

We use LangChain's `DirectoryLoader` to load the data from the directory we just downloaded it to.

To split the data, we use `RecursiveCharacterTextSplitter`, which tries each separator in order (paragraph breaks, line breaks, spaces, then individual characters) until the chunks fit within `chunk_size`.

Finally, we use Chroma as our vectorstore through LangChain's `Chroma` class.

Again, all of this is swappable based on your needs through LangChain's integrations and modules:
- [Document Loaders](https://python.langchain.com/v0.1/docs/integrations/document_loaders/)
- [Text Splitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)
- [Vector stores](https://python.langchain.com/v0.1/docs/integrations/vectorstores/)
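
To make the `chunk_size` and `chunk_overlap` settings used in the next cell concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that class also prefers to break on the listed separators); `chunk_with_overlap` is a hypothetical helper.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` becomes `abcd`, `defg`, `ghij`: each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.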

In [6]:
from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every file in the data directory as plain text
loader = DirectoryLoader("./data/paul_graham/", use_multithreading=True, loader_cls=TextLoader)
text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n", 
        "\n", 
        " ",
        "",
        # For multilingual/non-english text
        # "\u200b",  # Zero-width space
        # "\uff0c",  # Fullwidth comma
        # "\u3001",  # Ideographic comma
        # "\uff0e",  # Fullwidth full stop
        # "\u3002",  # Ideographic full stop
    ],
    chunk_size=3000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
documents = loader.load_and_split(text_splitter=text_splitter) # Load text
vectorstore = Chroma.from_documents(documents, embedding=embed_model, collection_name="groq_rag")
retriever = vectorstore.as_retriever()
print(f"Documents indexed: {len(documents)}")

Documents indexed: 27


In [7]:
await retriever.ainvoke("What did paul graham do growing up?")

[Document(page_content="Asterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans. You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993. It's called Yorkville, and that was my new home. Now I was a New York artist — in the strictly technical sense of making paintings and living in New York.\n\nI was nervous about money, because I could sense that Interleaf was on the way down. Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky. So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined myself living frugally off the royalties and spending all my time painting. (The painting on the cover of this 

## 3. Set Up the RAG Pipeline

Now that we have the vectorstore with the documents indexed, we can create the RAG pipeline.

In [8]:
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from typing import List, Dict

RAG_SYSTEM_PROMPT = """\
You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context given within delimiters to answer the human's questions.
```
{context}
```
If you don't know the answer, just say that you don't know.\
""" # adapted from https://smith.langchain.com/hub/rlm/rag-prompt-llama3

RAG_HUMAN_PROMPT = "{input}"

RAG_PROMPT = ChatPromptTemplate.from_messages([
    ("system", RAG_SYSTEM_PROMPT),
    ("human", RAG_HUMAN_PROMPT)
])

def format_docs(docs: List[Document]):
    """Format the retrieved documents"""
    return "\n".join(doc.page_content for doc in docs)

rag_chain = (
    {
        "context": retriever | format_docs, # Retrieve docs from the vectorstore, then format them into a single string
        "input": RunnablePassthrough() # Propagate the 'input' variable to the next step
    }
    | RAG_PROMPT # Format the prompt with the 'context' and 'input' variables
    | rag_llm # Get a response from the LLM using the formatted prompt
    | StrOutputParser() # Extract the string content from the LLM response
)
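
The data flow of the chain above can be sketched in plain Python, with the retriever and the LLM replaced by ordinary callables. This is only an illustration of the order of operations, not how LangChain executes the chain; `join_chunks`, `run_rag`, `retrieve`, and `generate` are hypothetical names, and the prompt text is shortened.

```python
from typing import Callable, List


def join_chunks(chunks: List[str]) -> str:
    """Join retrieved chunks into one context string."""
    return "\n".join(chunks)


def run_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, fill the prompt, and return the model's answer."""
    context = join_chunks(retrieve(question))  # Mirrors: retriever | format_docs
    prompt = f"Context:\n{context}\n\nQuestion: {question}"  # Mirrors: RAG_PROMPT
    return generate(prompt)  # Mirrors: rag_llm | StrOutputParser()
```

Note that the question is used twice: once as the retrieval query and once, unchanged, inside the prompt. That is the role `RunnablePassthrough()` plays in the chain.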

Let's test our RAG pipeline!

In [9]:
await rag_chain.ainvoke("What did paul graham do growing up?")

"According to the provided context, Paul Graham grew up in a family that encouraged his interest in computers. He started writing programs in 9th grade (around 13-14 years old) using an IBM 1401 computer with an early version of Fortran. He and his friend Rich Draves got permission to use the computer in the basement of their junior high school. He tried to write programs, but was puzzled by the 1401 and couldn't figure out what to do with it."

## 4. Generate Evaluation Dataset

To evaluate our RAG pipeline, we need to generate a dataset with questions that are relevant to our data.

Groq can constrain a model to return valid JSON through its JSON mode, which `langchain_groq` exposes as `method='json_mode'`. We will use `json_mode` to generate formatted questions for each document chunk from our data.

We create the `qa_chain` that takes in a chunk of text and returns 3 questions in JSON format.
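
JSON mode is about producing valid JSON; whether the expected keys are present still depends on the model following the prompt, so a light check on the response can be useful. The sketch below assumes the raw model output is available as a string; `parse_questions` and `REQUIRED_KEYS` are hypothetical names, not part of LangChain or Groq.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("question_1", "question_2", "question_3")


def parse_questions(raw: str) -> Dict[str, str]:
    """Parse a JSON string and keep only the three expected question fields."""
    data = json.loads(raw)  # Raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return {key: str(data[key]) for key in REQUIRED_KEYS}
```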

In [10]:
# Create questions for eval pipeline
from typing import TypedDict
# Response object structure
class QAResponse(TypedDict):
    question_1: str
    question_2: str
    question_3: str

QA_HUMAN_PROMPT = """\
You are a Teacher/ Professor. Your task is to setup questions for an upcoming \
quiz/examination. The questions should be diverse in nature across the document. \
Given the context information and not prior knowledge, generate only questions based on the below context. \
Restrict the questions to the context information provided within the delimiters.
```
{text}
```
Output the questions in JSON format with the keys question_1, question_2 and question_3 \
and make sure to escape any special characters to output clean, valid JSON.\
""" # adapted from https://arize.com/blog/evaluate-rag-with-llm-evals-and-benchmarking/

QA_PROMPT = ChatPromptTemplate.from_messages([
    ("human", QA_HUMAN_PROMPT)
])

qa_chain = (
    {"text": RunnablePassthrough()}
    | QA_PROMPT
    # Either 'json_mode' or 'function_calling' to always get responses in JSON format
    | qa_llm.with_structured_output(method='json_mode', schema=QAResponse)
)

We can generate the questions using the `qa_chain` in batches using Langchain's `abatch` method.

In [11]:
texts = [doc.page_content for doc in documents] # Create a list with all the text from the document chunks in the vectorstore

questions: List[Dict] = await qa_chain.abatch(texts)

In [12]:
print(f"From document: \n{texts[0]}\n")
print("Questions generated:")
for i, q in enumerate(questions[0].values(), 1):
    print(f'{i}: {q}')

From document: 
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then s

## 5. Evaluate the RAG Pipeline

Now let's create a pipeline to evaluate our RAG pipeline on our generated set of questions. To get an evaluation, we need to provide the question, the document chunk that the question was generated from, and the answer from the RAG pipeline. We pass all of this to our `benchmark_llm` and use `json_mode` to return a score along with an explanation of that score.
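
Because the judge is an LLM, its score may come back as `10` or as `"10"`, or fall outside the requested range. A small normalization step is one way to guard against that. This is a sketch under that assumption; `normalize_eval` is a hypothetical helper and not something the chain requires.

```python
from typing import Any, Dict


def normalize_eval(response: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce the judge's score to an int and check that it lies in 0-10."""
    # Going through str() accepts 10 and "10" but rejects values such as 7.5
    score = int(str(response["score"]).strip())
    if not 0 <= score <= 10:
        raise ValueError(f"Score out of range: {score}")
    return {"score": score, "explanation": str(response.get("explanation", ""))}
```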

In [13]:
from pydantic import BaseModel

# Response object structure
class EvalResponse(BaseModel):
    score: int
    explanation: str

EVAL_HUMAN_PROMPT = """\
You are given a question, an answer and reference text within marked delimiters. \
You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
```Question
{question}
```
```Reference
{context}
```
```Answer
{answer}
```
Respond with a valid JSON object containing two fields:
{{
    "score": "int: a score between 0-10, 10 being highest, on whether the question is correctly and fully answered by the answer",
    "explanation": "str: Provide an explanation as to why the score was given."
}} 
Make sure to escape any special characters to output clean, valid JSON.\
"""
# adapted from https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/q-and-a-on-retrieved-data

EVAL_PROMPT = ChatPromptTemplate.from_messages([
    ("human", EVAL_HUMAN_PROMPT)
])

from operator import itemgetter

eval_chain = (
    {
        # Pick each field out of the input dict; RunnablePassthrough() here would
        # forward the whole dict into every prompt variable
        "context": itemgetter("context"),
        "question": itemgetter("question"),
        "answer": itemgetter("answer"),
    }
    | EVAL_PROMPT
    | benchmark_llm.with_structured_output(schema=EvalResponse, method='json_mode') # Parse response according to EvalResponse object
)



Let's test a question and context/answer pair with our newly created `eval_chain`.

In [14]:
q1 = questions[-1]['question_1']
t1 = texts[-1]


print(f"Question: {q1}")
a1 = await rag_chain.ainvoke(q1)
print(f"Answer: {a1}")
eval_input = {
    'context': t1,
    'question': q1,
    'answer': a1
}
response = await eval_chain.ainvoke(eval_input)
print("---------------------")
print(f"Score: {response['score']}")
print(f"Explanation: {response['explanation']}")
print("---------------------")

Question: What is the problem with Hacker News (HN) when you both write essays and run a forum?
Answer: According to the given context, when you both write essays and run a forum, a bizarre edge case occurs. When you write essays, people post highly imaginative misinterpretations of them on the forum. Individually, these phenomena are tedious but bearable. However, when combined, they become disastrous.
---------------------
Score: 10
Explanation: The answer accurately summarizes the problem that occurs when you both write essays and run a forum on Hacker News, which is a bizarre edge case where people post misinterpretations of your essays and you have to respond to them, and the combination of these two phenomena becomes disastrous.
---------------------


We can create a function that takes in a list of questions and their corresponding document chunks to run our whole evaluation pipeline.

In [15]:
# Let's run the evaluation for all questions now
from typing import TypedDict
from time import time

class EvalResult(TypedDict): # For type hinting
    question: str
    answer: str
    context: str
    score: int # Score between 0 - 10
    explanation: str # Explanation on why the score was given

async def evaluate(questions: List[Dict] = questions, texts: List[str] = texts) -> List[EvalResult]:
    # Prepare inputs
    batch_rag_inputs: List[Dict] = []
    evals: List[Dict] = []
    for q_dict, context in zip(questions, texts): 
        for question in q_dict.values(): 
            batch_rag_inputs.append(question)
            evals.append({'question': question, 'context': context})

    print(f"Running RAG pipeline for {len(batch_rag_inputs)} questions")
    start = time()
    answers = await rag_chain.abatch(batch_rag_inputs, config={'max_concurrency': 2}) # Reduce concurrency to avoid hitting rate limits
    end = time()
    print(f"Time taken: {end - start}")

    # Update eval_input with the answers from the rag_chain
    for eval_input, answer in zip(evals, answers):
        eval_input.update({'answer': answer})
    
    # Run eval_chain to get evaluation
    print("Evaluating RAG pipeline...")
    start = time()
    batch_score_explanations = await eval_chain.abatch(evals, config={'max_concurrency': 2}) # Pass in eval which contains List of 'answer', 'context', 'question'
    end = time()
    print(f"Time taken: {end - start}")
    
    # Update each eval input with the score and explanation (avoids shadowing the built-in `eval`)
    for eval_input, score_exp_dict in zip(evals, batch_score_explanations):
        eval_input.update({
            'score': score_exp_dict['score'],
            'explanation': score_exp_dict['explanation']
        })
    
    return evals
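
The `max_concurrency` setting used above caps how many requests are in flight at once. The idea can be sketched with `asyncio.Semaphore`; this illustrates the concept only and is not LangChain's implementation. `run_bounded` and `worker` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def run_bounded(
    worker: Callable[[Any], Awaitable[Any]],
    inputs: List[Any],
    max_concurrency: int,
) -> List[Any]:
    """Run `worker` over all inputs with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: Any) -> Any:
        async with semaphore:  # Wait here while the limit is reached
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in inputs))
```

A lower limit trades throughput for fewer rate-limit errors; the right value depends on the limits of your API plan.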

In [16]:
evaluations = await evaluate(questions[:5], texts[:5]) # Remove the `:5` to evaluate all the questions on all your data

Running RAG pipeline for 15 questions
Time taken: 8.43082308769226
Evaluating RAG pipeline...
Time taken: 10.45450210571289


## 6. Display and Save the Evaluation Metrics

Let's now display some statistics of our newly created evaluations. For simplicity, we are printing out the results but in production, you should consider using an analytics/evaluations platform.
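
The cell below uses `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1). As a worked check, the sketch here recomputes the mean and standard deviation by hand for the score distribution this run reports: fourteen scores of 10 and one score of 0.

```python
import math
import statistics

scores = [10] * 14 + [0]  # Distribution reported by this run

mean = sum(scores) / len(scores)
squared_deviations = sum((s - mean) ** 2 for s in scores)
sample_std = math.sqrt(squared_deviations / (len(scores) - 1))  # Divide by n - 1

# The manual result agrees with the standard library
assert math.isclose(sample_std, statistics.stdev(scores))
print(round(mean, 3), round(sample_std, 3))
```

A single 0 among fifteen scores is enough to push the standard deviation to about 2.58, so the spread here reflects one outlier rather than broadly inconsistent answers.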

In [17]:
import csv
import statistics

score_threshold = 5 # Display all results scoring below this threshold

# Calculating basic statistics
scores = [eval['score'] for eval in evaluations]
average_score = sum(scores) / len(scores)
std_dev_score = statistics.stdev(scores)

# Lowest and highest scores
lowest_score = min(scores)
highest_score = max(scores)
lowest_count = scores.count(lowest_score)
highest_count = scores.count(highest_score)

# Display results
print("Average Score:", average_score)
print("Standard Deviation of Score:", std_dev_score)
print("Lowest Score:", lowest_score, "Count:", lowest_count)
print("Highest Score:", highest_score, "Count:", highest_count)

print(f"Evals lower than {score_threshold}")
count = 1
for eval in evaluations:
    if eval['score'] < score_threshold: # Strictly lower, to match the header above
        print("--------------------------")
        print(f"{count}. Score: {eval['score']}")
        print(f"Question: {eval['question']}")
        print(f"Answer: {eval['answer']}")
        print(f"Explanation: {eval['explanation']}")
        count += 1 # Number each low-scoring result

Average Score: 9.333333333333334
Standard Deviation of Score: 2.581988897471611
Lowest Score: 0 Count: 1
Highest Score: 10 Count: 14
Evals lower than 5
--------------------------
1. Score: 0
Question: What was the topic the author chose for their dissertation, which they wrote in a short span of 5 weeks?
Answer: The author does not mention writing a dissertation in this context. The text does not mention a dissertation or a specific topic for a dissertation.
Explanation: The answer is incorrect because the text does mention the topic of the dissertation, which is applications of continuations.


Our RAG pipeline did very well on this 15-question sample!
The pipeline answered all but one question accurately; the remaining question scored 0. Based on the answer and explanation, we can infer that the pipeline might not have retrieved the right chunks from the vectorstore. To improve the pipeline, we can try switching the embeddings model and/or increasing the retriever's `k` parameter so that it returns more documents from the vectorstore.
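
To illustrate why a larger `k` can help, here is a toy top-k retrieval over hand-made 2-D vectors. The vectors and names (`toy_vectors`, `top_k`, `cosine`) are invented for illustration and say nothing about how the real embeddings behave.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: List[float], doc_vectors: Dict[str, List[float]], k: int) -> List[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(doc_vectors, key=lambda name: cosine(query, doc_vectors[name]), reverse=True)
    return ranked[:k]


# Toy data: suppose chunk "c" holds the answer but only ranks third
toy_vectors = {"a": [1.0, 0.1], "b": [0.9, 0.5], "c": [0.5, 0.5], "d": [0.0, 1.0]}
print(top_k([1.0, 0.0], toy_vectors, 2))  # "c" is missed
print(top_k([1.0, 0.0], toy_vectors, 3))  # "c" is now included
```

In LangChain, `k` is typically set when creating the retriever, for example `vectorstore.as_retriever(search_kwargs={"k": 8})`; the value 8 is an arbitrary illustration. A larger `k` also lengthens the prompt, so it is worth re-running the benchmark after changing it.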

Let's save the evaluations in the `evaluations.csv` file.

In [18]:
csv_file = 'evaluations.csv'
with open(csv_file, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Context', 'Score', 'Question', 'Answer', 'Explanation'])
    for eval in evaluations:
        writer.writerow([eval['context'], eval['score'], eval['question'], eval['answer'], eval['explanation']])

print(f"Evaluations saved to {csv_file}")

Evaluations saved to evaluations.csv


## Conclusion

In this notebook, we demonstrated how to set up and evaluate a Retrieval Augmented Generation (RAG) pipeline using Llama3 models powered by Groq and LangChain. The goal of this notebook was to provide an easy-to-follow guide on setting up and evaluating a simple RAG pipeline. We now have valuable metrics and insights that we can use to further optimize and improve our RAG pipeline.

It's important to note that this is just one approach to evaluating your pipeline. In practice, you should consider using dedicated frameworks such as [RAGAs](https://github.com/explodinggradients/ragas) to obtain a variety of detailed evaluation metrics to improve your RAG pipeline.