In [1]:
%load_ext autoreload
%autoreload 2

# LlamaIndex

[LlamaIndex](https://github.com/run-llama/llama_index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. It makes it easy to connect LLMs with your own data. But to find the best configuration of LlamaIndex for your data, you need an objective measure of performance. This is where Ragas comes in. Ragas helps you evaluate your `QueryEngine` and gives you the confidence to tweak the configuration to get the highest score.

This guide assumes you are familiar with the LlamaIndex framework.

## Building the Testset

You will need a testset to evaluate your `QueryEngine` against. You can either build one yourself or use the [Testset Generator Module](../../getstarted/testset_generation.md) in Ragas to get started with a small synthetic one.

Let's see how that works with LlamaIndex.

In [None]:
%pip install llama-index ragas

In [6]:
# load the documents
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()

Now let's initialize the `TestsetGenerator` object with the corresponding generator and critic LLMs.

In [7]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# generator with openai models
generator_llm = OpenAI(model="gpt-3.5-turbo-16k")
critic_llm = OpenAI(model="gpt-4")
embeddings = OpenAIEmbedding()

generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)

Now you are all set to generate the dataset.

In [8]:
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=5,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

Filename and doc_id are the same for all nodes.                                                         
Generating:  20%|███████████▍                                             | 1/5 [00:08<00:34,  8.75s/it]Retrying llama_index.llms.openai.base.OpenAI._achat in 0.20008403704904898 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4 in organization org-SZPrZ1v0zxrmBrD75hcE5wGU on tokens per min (TPM): Limit 10000, Used 9364, Requested 2064. Please try again in 8.568s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 0.9367278355177181 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4 in organization org-SZPrZ1v0zxrmBrD75hcE5wGU on tokens per min (TPM): Limit 10000, Used 8809, Requested 1602. Please try again in 2.466s. Visit https://p

In [9]:
df = testset.to_pandas()
df.head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the significance of the New York Publi... | [ Others cite the end of the crack epidemic an... | The New York Public Library (NYPL) has the lar... | simple | [{'file_path': '/Users/leiyu/Projects/llm/note... | True |
| 1 | What is the size of the Dominican American pop... | [ immigrants, respectively, and large-scale Ch... | The Dominican American population in New York ... | simple | [{'file_path': '/Users/leiyu/Projects/llm/note... | True |
| 2 | What caused the decline in the Lenape populati... | [ British raids. In 1626, the Dutch colonial D... | Several intertribal wars among the Native Amer... | reasoning | [{'file_path': '/Users/leiyu/Projects/llm/note... | True |
| 3 | How is NYC's fast pace described in terms of i... | [ these universities are ranked among the top ... | The city of New York is home to numerous prest... | multi_context | [{'file_path': '/Users/leiyu/Projects/llm/note... | True |
| 4 | How did the Lenape population diminish between... | [ British raids. In 1626, the Dutch colonial D... | Several intertribal wars among the Native Amer... | simple | [{'file_path': '/Users/leiyu/Projects/llm/note... | True |
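
To check how closely the generated mix follows the requested `distributions`, you can tally the `evolution_type` column. The sketch below hard-codes the five values from the output above so that it runs without an API key; in the notebook you would pass `df["evolution_type"]` instead. The helper name `evolution_mix` is an assumption for illustration, not part of Ragas.

```python
from collections import Counter

# Evolution types copied from the five generated rows shown above.
evolution_types = ["simple", "simple", "reasoning", "multi_context", "simple"]


def evolution_mix(types):
    """Return the share of each evolution type in the testset."""
    counts = Counter(types)
    total = len(types)
    return {name: count / total for name, count in counts.items()}


mix = evolution_mix(evolution_types)
print(mix)
```

With only five samples, the shares can only approximate the requested 0.5 / 0.25 / 0.25 split; a larger `test_size` should track it more closely.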


Now that we have a test dataset, let's build a `QueryEngine` and evaluate it.

## Building the `QueryEngine`

To start, let's build a `VectorStoreIndex` over New York City's [Wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use Ragas to evaluate it.

Since we already loaded the dataset into `documents`, let's use that.

In [10]:
# build a vector index over the loaded documents and expose it as a query engine
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

query_engine = vector_index.as_query_engine()

Let's try a sample question from the generated testset to see if it is working.

In [11]:
# convert the testset to a pandas DataFrame
df = testset.to_pandas()
df["question"][0]

'What is the significance of the New York Public Library in the city?'

In [12]:
response_vector = query_engine.query(df["question"][0])

print(response_vector)

The New York Public Library (NYPL) has the largest collection of any public library system in the United States and is considered a significant cultural institution in the city.


## Evaluating the `QueryEngine`

Now that we have a `QueryEngine` for the `VectorStoreIndex`, we can use Ragas's LlamaIndex integration to evaluate it.

To run an evaluation with Ragas and LlamaIndex, you need three things:

1. LlamaIndex `QueryEngine`: what we will be evaluating
2. Metrics: Ragas defines a set of metrics that measure different aspects of the `QueryEngine`. The available metrics and their meanings can be found [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)
3. Questions: A list of questions that Ragas will test the `QueryEngine` against.

First, let's gather the questions. Ideally, you should use questions that you see in production, so that the distribution of questions used for evaluation matches the distribution seen in production. This ensures that the scores reflect production performance. To start off, though, we'll use a few example questions.

Now let's import the metrics we will use for the evaluation.

In [13]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    harmfulness,
]

Now let's initialize the evaluator model.

In [14]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# using GPT 3.5, use GPT 4 / 4-turbo for better accuracy
evaluator_llm = OpenAI(model="gpt-3.5-turbo")

The `evaluate()` function expects a dict with "question" and "ground_truth" keys for the metrics. You can easily convert the `testset` to that format.

In [15]:
# convert to HF dataset
ds = testset.to_dataset()

ds_dict = ds.to_dict()
ds_dict["question"]
ds_dict["ground_truth"]

['The New York Public Library (NYPL) has the largest collection of any public library system in the United States. It is considered a significant cultural institution in the city.',
 'The Dominican American population in New York City is the largest overall Hispanic population in the United States, numbering 4.8 million.',
 'Several intertribal wars among the Native Americans and some epidemics caused sizeable population losses for the Lenape between the years 1660 and 1670.',
 "The city of New York is home to numerous prestigious universities and colleges, including Princeton University and Yale University. It also hosts smaller private institutions such as Pace University, St. John's University, The Juilliard School, and many more. These universities contribute to the city's vibrant cultural and educational scene, making it a hub for intellectual and artistic pursuits.",
 'Several intertribal wars among the Native Americans and some epidemics brought on by contact with the Europeans 
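
If you write questions and reference answers by hand instead of generating them, the same column-oriented dict can be assembled directly. This is a minimal sketch, assuming `evaluate()` only needs the "question" and "ground_truth" keys described above. `build_eval_dict` is a hypothetical helper, not a Ragas function, and the sample pair is copied from the testset output.

```python
def build_eval_dict(pairs):
    """Turn (question, ground_truth) pairs into a dict of parallel lists."""
    questions, ground_truths = [], []
    for question, ground_truth in pairs:
        # Reject blank entries early so both lists stay aligned and usable.
        if not question.strip() or not ground_truth.strip():
            raise ValueError("question and ground_truth must be non-empty")
        questions.append(question)
        ground_truths.append(ground_truth)
    return {"question": questions, "ground_truth": ground_truths}


pairs = [
    (
        "What is the significance of the New York Public Library in the city?",
        "The New York Public Library (NYPL) has the largest collection of any "
        "public library system in the United States. It is considered a "
        "significant cultural institution in the city.",
    ),
]
manual_ds_dict = build_eval_dict(pairs)
print(list(manual_ds_dict))
```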

Finally, let's run the evaluation.

In [16]:
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ds_dict,
    llm=evaluator_llm,
    embeddings=OpenAIEmbedding(),
)

Running Query Engine: 100%|███████████████████████████████████████████████| 5/5 [00:01<00:00,  3.85it/s]
Evaluating:   0%|                                                                | 0/25 [00:00<?, ?it/s]n values greater than 1 not support for LlamaIndex LLMs
n values greater than 1 not support for LlamaIndex LLMs
n values greater than 1 not support for LlamaIndex LLMs
Evaluating:   4%|██▏                                                     | 1/25 [00:02<00:49,  2.08s/it]n values greater than 1 not support for LlamaIndex LLMs
n values greater than 1 not support for LlamaIndex LLMs
Evaluating: 100%|███████████████████████████████████████████████████████| 25/25 [00:06<00:00,  3.65it/s]


In [17]:
# final scores
print(result)

{'faithfulness': 0.4333, 'answer_relevancy': 0.8918, 'context_precision': 1.0000, 'context_recall': 0.6667, 'harmfulness': 0.0000}


You can convert the result into a pandas DataFrame to run more analysis on it.

In [18]:
result.to_pandas()

| | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall | harmfulness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the significance of the New York Publi... | [==== Firefighting ====\n\nThe Fire Department... | The New York Public Library (NYPL) has the lar... | The New York Public Library (NYPL) has the lar... | 0.5 | 0.856317 | 1.0 | 1.0 | 0 |
| 1 | What is the size of the Dominican American pop... | [By 1900, Germans constituted the largest immi... | The Dominican American population in New York ... | The Dominican American population in New York ... | 0.0 | 0.848461 | 1.0 | 0.0 | 0 |
| 2 | What caused the decline in the Lenape populati... | [=== Dutch rule ===\n\nA permanent European pr... | Intertribal wars among the Native Americans an... | Several intertribal wars among the Native Amer... | 1.0 | 0.97462 | 1.0 | 1.0 | 0 |
| 3 | How is NYC's fast pace described in terms of i... | [=== Pace ===\n\nOne of the most common traits... | NYC's fast pace is described in terms of its c... | The city of New York is home to numerous prest... | 0.0 | 0.826091 | 1.0 | 0.333333 | 0 |
| 4 | How did the Lenape population diminish between... | [=== Dutch rule ===\n\nA permanent European pr... | Several intertribal wars among the Native Amer... | Several intertribal wars among the Native Amer... | 0.666667 | 0.953719 | 1.0 | 1.0 | 0 |
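
The aggregate scores printed earlier match the column means of this per-question table, which gives a quick sanity check and a starting point for finding weak spots. The sketch below copies three score columns from the table so that it runs on its own; with the real result you would work on `result.to_pandas()` directly. Treating the summary as a plain mean is an observation from these numbers, not a statement about Ragas internals.

```python
from statistics import mean

# Per-question scores copied from the table above.
scores = {
    "faithfulness": [0.5, 0.0, 1.0, 0.0, 0.666667],
    "answer_relevancy": [0.856317, 0.848461, 0.97462, 0.826091, 0.953719],
    "context_recall": [1.0, 0.0, 1.0, 0.333333, 1.0],
}

# Column means, rounded to four decimals like the printed summary.
averages = {name: round(mean(values), 4) for name, values in scores.items()}
print(averages)

# Row indices whose faithfulness score falls below 0.5.
weak_rows = [i for i, score in enumerate(scores["faithfulness"]) if score < 0.5]
print(weak_rows)
```

Rows 1 and 3 score 0.0 on faithfulness here, so they are natural candidates for inspecting the retrieved `contexts`.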
