pip install ragas rapidfuzz

https://docs.ragas.io/en/stable/howtos/integrations/_llamaindex/

In [1]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="C:/Praveen/Projects/RAG/data/llama2.pdf")

In [None]:
import os

# Add your OpenAI API key here, for example:
# os.environ["OPENAI_API_KEY"] = "sk-..."
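
The cell above leaves the key handling open. One way to avoid pasting the key into the notebook is to read it from the environment and prompt only when it is missing. This is a minimal sketch, not part of the original notebook: `ensure_openai_key` is a hypothetical helper, and the only external assumption is that the OpenAI client reads the `OPENAI_API_KEY` environment variable.

```python
import getpass
import os


def ensure_openai_key(env=os.environ, ask=getpass.getpass):
    """Return the OpenAI key from `env`, prompting once if it is missing.

    `ensure_openai_key` is a hypothetical helper for this notebook; `env` and
    `ask` are injectable so the function can be exercised without a terminal.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass does not echo the typed key back to the notebook output.
        key = ask("OpenAI API key: ").strip()
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` once before creating the `OpenAI` objects keeps the key out of the saved notebook.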

In [3]:
from ragas.testset import TestsetGenerator

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# generator with openai models
generator_llm = OpenAI(model="gpt-4o")
embeddings = OpenAIEmbedding(model="text-embedding-3-large")

generator = TestsetGenerator.from_llama_index(
    llm=generator_llm,
    embedding_model=embeddings,
)

In [5]:
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    testset_size=5,
)

Applying HeadlinesExtractor:  36%|███▌      | 26/73 [00:08<00:04,  9.93it/s]Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TH1tC97gqLEx1hlPbbwt0DlT on tokens per min (TPM): Limit 30000, Used 30000, Requested 940. Please try again in 1.879s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TH1tC97gqLEx1hlPbbwt0DlT on tokens per min (TPM): Limit 30000, Used 30000, Requested 1063. Please try again in 2.126s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.O
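
The log above shows the client already retrying after each 429 response, and each error message carries a hint of the form "Please try again in 1.879s". For custom calls made outside the library, the same idea can be sketched in plain Python: read the hint from the message and wait at least that long, with growing delays and a little jitter. `suggested_delay` and `call_with_backoff` are hypothetical helpers, the `ms` unit is an assumption about how shorter hints may be formatted, and a real caller would catch only the rate-limit exception rather than `Exception`.

```python
import random
import re
import time

# Matches hints such as "try again in 1.879s"; the "ms" variant is an assumption.
_RETRY_HINT = re.compile(r"try again in ([0-9.]+)(ms|s)")


def suggested_delay(message, default=1.0):
    """Return the wait time in seconds hinted at by an error message."""
    match = _RETRY_HINT.search(message)
    if match is None:
        return default
    value = float(match.group(1))
    return value / 1000.0 if match.group(2) == "ms" else value


def call_with_backoff(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(), sleeping for the hinted delay (doubled per attempt) on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # sketch only: narrow this to the rate-limit error
            if attempt == max_attempts:
                raise
            delay = suggested_delay(str(exc)) * (2 ** (attempt - 1))
            # Jitter keeps parallel workers from retrying at the same instant.
            sleep(delay + random.uniform(0, 0.25))
```

With a 30,000 tokens-per-minute limit, lowering the number of concurrent requests is usually the more effective fix; the backoff only smooths over the bursts that remain.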

In [6]:
df = testset.to_pandas()
df.head()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"What Llama 2 do in AI, how it train and safe, ...",[Contents 1 Introduction 3 2 Pretraining 5 2.1...,Llama 2 involves pretraining with specific dat...,single_hop_specifc_query_synthesizer
1,How does Meta address the environmental impact...,[Time (GPU hours) Power Consumption (W) Carbon...,Meta addresses the environmental impact of pre...,single_hop_specifc_query_synthesizer
2,What are the challenges associated with red te...,"[<1-hop>\n\nYet, when it comes to the ""product...",Red teaming in Large Language Models (LLMs) pr...,multi_hop_abstract_query_synthesizer
3,What are the challenges associated with red te...,"[<1-hop>\n\nYet, when it comes to the ""product...",Red teaming in Large Language Models (LLMs) pr...,multi_hop_abstract_query_synthesizer
4,How does the implementation of GAtt and Safety...,[<1-hop>\n\nmodify the original instruction ha...,The implementation of GAtt and Safety RLHF sig...,multi_hop_specific_query_synthesizer
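
Rows 2 and 3 above begin with the same text, so the generated testset may contain repeated questions (the display is truncated, so this cannot be confirmed from the output alone). A quick check for near-duplicates is sketched below. It uses `difflib` from the standard library as a stand-in for `rapidfuzz`, which is installed at the top of the notebook; `near_duplicates` and the 0.9 threshold are hypothetical choices, and the sample strings are illustrative rather than actual testset rows.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(questions, threshold=0.9):
    """Return (i, j, ratio) for every pair of questions at or above threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        # Case-insensitive character-level similarity in the range [0, 1].
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


# Hypothetical questions, not the actual testset rows.
sample = [
    "What are the challenges of red teaming?",
    "What are the challenges of red teaming?",
    "How is carbon footprint measured?",
]
print(near_duplicates(sample))
```

On the real data this would be called as `near_duplicates(df["user_input"].tolist())`.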


### Building the QueryEngine

In [7]:
# build query engine
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()
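
Conceptually, the query engine built above embeds the question, ranks the indexed chunks by vector similarity, and passes the best matches to the LLM. The toy sketch below illustrates only the ranking step with hand-written two-dimensional vectors. It is not LlamaIndex code: `cosine`, `top_k`, and the value of `k` are hypothetical, and real embeddings have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for chunk embeddings.
chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], chunks))
```

The retrieval-side metrics computed later (context precision and context recall) score exactly this step: whether the chunks that come back are the ones the reference answer needs.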

Let's try a sample question from the generated testset to check that the query engine works.

In [8]:
# convert the testset to a pandas DataFrame
df = testset.to_pandas()
df["user_input"][0]

'What Llama 2 do in AI, how it train and safe, what about ethical stuff?'

In [9]:
response_vector = query_engine.query(df["user_input"][0])

print(response_vector)

Llama 2 is a family of pretrained and fine-tuned models with scales ranging from 7 billion to 70 billion parameters. Efforts have been made to tune these models to avoid topics like misinformation, bioterrorism, and cybercrime. The safety tuning of Llama 2 aims to balance safety with helpfulness, although in some cases, it may err on the side of caution. Users of Llama 2 are advised to be cautious and follow the Responsible Use Guide for tuning and deployment. The release of Llama 2 is done openly to encourage responsible AI innovation and collaboration within the AI community to address potential misuse. The aim is to democratize access to foundational models, promote transparency, and stimulate innovation while eliminating barriers to entry for small businesses. The developers have also provided code examples to replicate safe generations with Llama 2-Chat and apply basic safety techniques.


### Evaluating the QueryEngine

In [10]:
# import metrics
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)

# init metrics with evaluator LLM
from ragas.llms import LlamaIndexLLMWrapper

evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]
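
As a rough guide to reading these metrics: Ragas describes faithfulness as the share of claims in the response that the retrieved context supports, and context recall as the share of claims in the reference that the retrieved context supports. The sketch below shows that ratio and, going the other way, how a reported score hints at the claim counts behind it. `claim_ratio` and `plausible_counts` are hypothetical helpers; returning 0.0 for zero claims is a convention chosen here, and the recovered counts are only the smallest pair consistent with a score.

```python
from fractions import Fraction


def claim_ratio(supported, total):
    """Share of claims judged supported; 0.0 when there are no claims."""
    return supported / total if total else 0.0


def plausible_counts(score, max_claims=20):
    """Smallest (supported, total) pair whose ratio is closest to the score."""
    frac = Fraction(score).limit_denominator(max_claims)
    return frac.numerator, frac.denominator


print(claim_ratio(11, 12))
print(plausible_counts(0.916667))
```

A score such as 0.916667 is therefore consistent with 11 of 12 claims being supported, although 22 of 24 would produce the same value.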

In [11]:
# convert to Ragas Evaluation Dataset
ragas_dataset = testset.to_evaluation_dataset()
ragas_dataset

EvaluationDataset(features=['user_input', 'reference_contexts', 'reference'], len=6)

Finally, let's run the evaluation.

In [12]:
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ragas_dataset,
)

Running Query Engine: 100%|██████████| 6/6 [00:04<00:00,  1.20it/s]
Evaluating:  38%|███▊      | 9/24 [00:13<00:16,  1.09s/it]Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TH1tC97gqLEx1hlPbbwt0DlT on tokens per min (TPM): Limit 30000, Used 30000, Requested 566. Please try again in 1.132s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.

In [13]:
# final scores
print(result)

{'faithfulness': 0.6986, 'answer_relevancy': 0.9632, 'context_precision': 0.5000, 'context_recall': 0.5778}


You can convert the result into a pandas DataFrame for further analysis.

In [14]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,faithfulness,answer_relevancy,context_precision,context_recall
0,"What Llama 2 do in AI, how it train and safe, ...",[Not everyone who uses AI models has good inte...,[Contents 1 Introduction 3 2 Pretraining 5 2.1...,Llama 2 is a family of pretrained and fine-tun...,Llama 2 involves pretraining with specific dat...,0.916667,0.868971,1.0,0.666667
1,How does Meta address the environmental impact...,"[Adam Roberts, Denny Zhou, Quoc V. Le, and Jas...",[Time (GPU hours) Power Consumption (W) Carbon...,Meta addresses the environmental impact of pre...,Meta addresses the environmental impact of pre...,0.0,0.97611,0.0,0.0
2,What are the challenges associated with red te...,[The red teamers probed our models across a wi...,"[<1-hop>\n\nYet, when it comes to the ""product...",The challenges associated with red teaming in ...,Red teaming in Large Language Models (LLMs) pr...,1.0,0.96641,0.0,0.0
3,What are the challenges associated with red te...,[The red teamers probed our models across a wi...,"[<1-hop>\n\nYet, when it comes to the ""product...",The challenges associated with red teaming in ...,Red teaming in Large Language Models (LLMs) pr...,0.833333,0.999263,1.0,1.0
4,How does the implementation of GAtt and Safety...,[modify the original instruction half of the t...,[<1-hop>\n\nmodify the original instruction ha...,The implementation of GAtt in Llama 2-Chat hel...,The implementation of GAtt and Safety RLHF sig...,0.727273,0.968329,0.0,1.0
5,How do red teaming efforts contribute to impro...,[The red teamers probed our models across a wi...,"[<1-hop>\n\nYet, when it comes to the ""product...",Red teaming efforts contribute to improving th...,Red teaming efforts play a crucial role in imp...,0.714286,1.0,1.0,0.8
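
The aggregate scores printed earlier appear to be plain means of the per-sample columns in this table. The check below recomputes them from the values shown and lists the rows where context recall is zero, which are the first candidates for inspecting what the retriever returned. It uses only the standard library and the numbers copied from the table; on the live object the same calculation would typically be a column mean on `result.to_pandas()`.

```python
from statistics import mean

# Per-sample scores copied from the table above (rows 0 to 5).
scores = {
    "faithfulness": [0.916667, 0.0, 1.0, 0.833333, 0.727273, 0.714286],
    "answer_relevancy": [0.868971, 0.97611, 0.96641, 0.999263, 0.968329, 1.0],
    "context_precision": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    "context_recall": [0.666667, 0.0, 0.0, 1.0, 1.0, 0.8],
}

# Mean of each metric, rounded to the four decimals used in the summary.
averages = {name: round(mean(values), 4) for name, values in scores.items()}
print(averages)

# Rows where none of the reference content was retrieved.
weak_rows = [i for i, recall in enumerate(scores["context_recall"]) if recall == 0.0]
print(weak_rows)
```

The recomputed means match the summary (0.6986, 0.9632, 0.5, 0.5778). Row 1 stands out: its retrieved context starts with an author list from the paper's bibliography, which fits its zero scores for faithfulness, context precision, and context recall.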
