# Evaluating a RAG Pipeline Using LlamaIndex

## Building a RAG System

In [12]:
!pip install trulens-eval llama-index openai --quiet

In [13]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.llms.openai import OpenAI

import os
import pandas as pd

### Set the OpenAI API Key

In [14]:
import os

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "add your OpenAI api key"

## Download Data

Let's use the text of a Paul Graham essay to build the RAG pipeline.

In [16]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   489k      0 --:--:-- --:--:-- --:--:--  491k


## Load Data and Build Index

In [18]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Define an LLM
llm = OpenAI(model="gpt-4")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

### Build query engine

In [19]:
query_engine = vector_index.as_query_engine()

In [20]:
response_vector = query_engine.query("What did the author do growing up?")

### Check the response

In [22]:
response_vector.response

'The author worked on writing short stories and programming, particularly on an IBM 1401 computer using an early version of Fortran.'

## Let's check the text in the retrieved nodes

By default, the query engine retrieves the two most similar nodes (chunks).
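As a rough picture of what "retrieving the two most similar chunks" means, here is a minimal sketch that ranks chunks by cosine similarity and keeps the top two. The three-dimensional vectors, the chunk IDs, and the helper names `cosine_similarity` and `top_k` are all made up for illustration; this is not LlamaIndex's implementation, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" (assumption: not real model output).
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

top_two = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in top_two:
    print(f"{chunk_id}: {score:.3f}")
```

In this toy setup `chunk_a` and `chunk_b` point in nearly the same direction as the query, so they are returned, while the orthogonal `chunk_c` is dropped.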

In [24]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

In [25]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.\n\nOne night in October 2003 there was a big party at my house. It was a clever idea of my friend Maria Daniels, who was one of the thursday diners. Three separate hosts would all invite their friends to one party. So for every guest, two thirds of the other guests would be people they didn't know but would probably like. One of the guests was someone I didn't know but would turn out to like a lot: a woman called Jessica Livingston. A couple days later I asked her out.\n\nJessica was in charge of marketing at a Boston investment bank. This bank thought it understood startups, but over the next year, as she met friends of mine from the startup world, she was surprised how different reality was. And 

We have built a RAG pipeline and now need to evaluate its performance. Let's make use of LlamaIndex's tools to do that. 

## Let's generate the question-context pairs for RAG evaluation

In [26]:
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

100%|██████████| 59/59 [03:15<00:00,  3.32s/it]


## Retrieval Evaluation

Retrieval evaluation assesses the accuracy and relevance of the information retrieved by the system.

Our next step involves initiating retrieval evaluations. We’ll employ the RetrieverEvaluator with the evaluation dataset we’ve prepared.

In [29]:
retriever = vector_index.as_retriever(similarity_top_k=2)

### We use Hit Rate and MRR metrics to evaluate our Retriever.

#### Hit Rate:
Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

#### Mean Reciprocal Rank (MRR):
MRR serves as a metric for evaluating the accuracy of a system by examining the rank of the highest-placed relevant document for each query. It calculates the average of the reciprocals of these ranks across all queries. For instance, if the first relevant document is ranked highest, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so forth.
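To make the two definitions concrete, here is a small plain-Python sketch that computes both metrics by hand. The node IDs and the helper names `hit` and `reciprocal_rank` are made up for illustration; this is not how LlamaIndex computes the metrics internally, only a worked instance of the definitions above with two retrieved nodes per query.

```python
def hit(retrieved_ids, expected_ids):
    """1.0 if any expected node appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first expected node, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical (retrieved top-2, expected) pairs, one per query.
examples = [
    (["n1", "n7"], {"n1"}),  # expected node ranked first  -> RR = 1
    (["n4", "n2"], {"n2"}),  # expected node ranked second -> RR = 1/2
    (["n9", "n3"], {"n5"}),  # expected node not retrieved -> RR = 0
    (["n6", "n8"], {"n6"}),  # expected node ranked first  -> RR = 1
]

mean_hit_rate = sum(hit(r, e) for r, e in examples) / len(examples)
mean_rr = sum(reciprocal_rank(r, e) for r, e in examples) / len(examples)

print(f"Hit rate: {mean_hit_rate}")  # 3 of 4 queries hit -> 0.75
print(f"MRR: {mean_rr}")             # (1 + 0.5 + 0 + 1) / 4 -> 0.625
```

Note that MRR can never exceed the hit rate: a query that misses contributes 0 to both, and a hit contributes 1 to the hit rate but at most 1 to MRR.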

In [30]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [32]:
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

Let's define a function to display the retrieval evaluation results in table format.

In [33]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

In [34]:
display_results("OpenAI Embedding Retriever", eval_results)

| | Retriever Name | Hit Rate | MRR |
|---|---|---|---|
| 0 | OpenAI Embedding Retriever | 0.813559 | 0.677966 |


The retriever with OpenAI embeddings achieves a hit rate of 0.813559, meaning the expected chunk appears in the top two results for about 81% of the queries. The MRR of 0.677966 suggests there is room for improvement in ranking the most relevant result first.

## Response Evaluation

Response evaluation measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

In [35]:
# Get the list of queries from the above created dataset

queries = list(qa_dataset.queries.values())

### Faithfulness Evaluator

We will use gpt-3.5-turbo to generate responses for a given query and gpt-4 to evaluate them.

In [41]:
# gpt-3.5-turbo generates the responses
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4 evaluates the responses
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)


Create a query engine with the gpt-3.5-turbo service context to generate responses to the queries.

In [42]:
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()

#### Let's create a FaithfulnessEvaluator and evaluate on one question

In [43]:
from llama_index.core.evaluation import FaithfulnessEvaluator
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

eval_query = queries[10]

print(eval_query)

Based on the author's experience and observations, why did he conclude that the approach to Artificial Intelligence during his time in grad school was a hoax? Provide specific examples from the text to support your answer.


#### Generate the response first, then run the faithfulness evaluator

In [44]:
response_vector = query_engine.query(eval_query)

eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)

eval_result.passing

True

### Relevancy Evaluator

RelevancyEvaluator measures whether the response and the source nodes (retrieved context) match the query. It is useful for checking whether the response actually answers the query.

Instantiate a RelevancyEvaluator that uses gpt-4 for the evaluation.

In [45]:
from llama_index.core.evaluation import RelevancyEvaluator
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

Let's run the relevancy evaluation on one of the queries.

In [49]:
query = queries[10]

print(query)

response_vector = query_engine.query(query)

eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)

Based on the author's experience and observations, why did he conclude that the approach to Artificial Intelligence during his time in grad school was a hoax? Provide specific examples from the text to support your answer.


In [72]:
# The `passing` attribute of eval_result indicates whether the response passed the evaluation.
result = eval_result.passing
print(result)

True


In [73]:
# You can get the feedback for the evaluation.
feedback = eval_result.feedback
print(feedback)

YES
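The single-query checks above extend naturally to a batch of queries, where the quantity of interest is the fraction of responses that pass. The sketch below shows only that aggregation step. `StubEvalResult` and `pass_rate` are hypothetical names, and the stub objects stand in for real evaluation results, assuming only the `passing` and `feedback` attributes already used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class StubEvalResult:
    """Hypothetical stand-in exposing the `passing` and `feedback` fields used above."""

    passing: bool
    feedback: str


def pass_rate(results):
    """Fraction of results whose `passing` flag is True (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for result in results if result.passing) / len(results)


# Made-up outcomes for four queries (assumption: not real evaluator output).
stub_results = [
    StubEvalResult(passing=True, feedback="YES"),
    StubEvalResult(passing=True, feedback="YES"),
    StubEvalResult(passing=False, feedback="NO"),
    StubEvalResult(passing=True, feedback="YES"),
]

rate = pass_rate(stub_results)
print(f"Pass rate: {rate:.2f}")
```

In a real run, the list would be filled by calling the evaluator once per query and collecting the returned results; each call costs an LLM request, so a sample of the queries may be enough.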


In this notebook, we have explored how to build and evaluate a RAG pipeline using LlamaIndex, with a specific focus on evaluating the retrieval system and generated responses within the pipeline.