![ragas_image](images/ragas.png)

# RAG Evaluations Demonstration: RAGAS

"Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. There are existing tools and frameworks that help you build these pipelines but evaluating it and quantifying your pipeline performance can be hard. This is where Ragas (RAG Assessment) comes in."

#### Load Evaluation Dataset 

In [3]:
# Download amnesty_qa dataset
from datasets import load_dataset, Dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2", trust_remote_code=True)
eval_data = amnesty_qa['eval']
eval_data

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['question', 'ground_truth', 'answer', 'contexts'],
    num_rows: 20
})

This dataset from Exploding Gradients is representative of the data we would need to gather to run evaluations of our own RAG pipeline.

The following data fields are needed:
- question (input): the initial prompt provided to the LLM (prior to RAG)
- answer (output): the response generated by the LLM (after RAG)
- contexts: the passages picked up by the retrieval stage of RAG
- ground_truth: a reference answer to the question that the LLM output can be compared against
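
The field list above can be checked mechanically before any data is handed to an evaluator. The following is a minimal sketch in plain Python. `REQUIRED_FIELDS`, `validate_record`, and the sample record are hypothetical names and values chosen for illustration; they are not part of RAGAS or the amnesty_qa dataset. The field names follow the dataset columns shown above.

```python
# Hypothetical helper: checks that one RAG evaluation record carries the
# four fields listed above, using the amnesty_qa column names.
REQUIRED_FIELDS = {
    "question": str,      # prompt sent to the LLM
    "answer": str,        # response generated by the RAG pipeline
    "contexts": list,     # passages returned by the retrieval stage
    "ground_truth": str,  # reference answer used for comparison
}


def validate_record(record):
    """Return a list of problems found in one evaluation record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Made-up record, for illustration only
sample = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "contexts": ["Paris is the capital and largest city of France."],
    "ground_truth": "Paris",
}

print(validate_record(sample))                  # empty list: nothing to fix
print(validate_record({"question": "Q only"}))  # reports the missing fields
```

A check like this is cheap to run over every row collected from a pipeline, and it catches the common mistake of passing `contexts` as a single string instead of a list of passages.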

In [4]:
from pprint import pprint
# sample the data

pprint(eval_data[4])

{'answer': 'Amnesty International made several recommendations to the Special '
           'Rapporteur on Human Rights Defenders. These recommendations '
           'include:\n'
           '\n'
           '1. Urging states to fully implement the UN Declaration on Human '
           'Rights Defenders and ensure that national laws and policies are in '
           'line with international human rights standards.\n'
           '\n'
           '2. Calling on states to create a safe and enabling environment for '
           'human rights defenders, including by adopting legislation that '
           'protects defenders from threats, attacks, and reprisals.\n'
           '\n'
           '3. Encouraging states to establish effective mechanisms for the '
           'protection of human rights defenders, such as national human '
           'rights institutions and specialized units within law enforcement '
           'agencies.\n'
           '\n'
           '4. Urging states to investigate and h

RAGAS performs its analysis directly on `Dataset` objects, so we build one from a five-row slice of the evaluation data (rows 2 through 6).

In [5]:
dataset = Dataset.from_dict(eval_data[2:7])

The easiest way to use local LLMs with RAGAS is through LangChain. RAGAS has a built-in wrapper that automatically accepts LangChain models. In this case, we'll use an Ollama server running a local instance of llama3 as our evaluation model (LLM-as-a-judge).

**Note:** It is likely possible to use models from HuggingFace and avoid the need for Ollama by writing a custom class using the RAGAS BaseModel class, but I had a lot of difficulty getting this to work.

In [6]:
from langchain_community.llms import Ollama

# create LLM using ollama
llama3 = Ollama(model="llama3")

In [7]:
from langchain_community.embeddings import OllamaEmbeddings

# Create an embeddings object backed by the same local llama3 model
embeddings = OllamaEmbeddings(model='llama3')

#### Run the Evaluation

In [8]:
# import the metrics from RAGAS
from ragas.metrics import (
    answer_relevancy, 
    context_precision,
    context_recall
)
# import the evaluate function from RAGAS
from ragas import evaluate

# run the evaluation
results = evaluate(
    dataset=dataset,
    llm=llama3,
    embeddings=embeddings,
    metrics=[
        answer_relevancy, 
        context_precision,
        context_recall
    ],
)


Evaluating:   0%|          | 0/15 [00:00<?, ?it/s]

#### Show the results

In [9]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)
results_df = results.to_pandas()
results_df

| | question | ground_truth | answer | contexts | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|
| 0 | Which private companies in the Americas are the largest GHG emitters according to the Carbon Maj... | The largest private companies in the Americas that are the largest GHG emitters according to the... | According to the Carbon Majors database, the largest private companies in the Americas that are ... | [The issue of greenhouse gas emissions has become a major concern for environmentalists and poli... | 0.89088 | 1.0 | 0.666667 |
| 1 | What action did Amnesty International urge its supporters to take in response to the killing of ... | Amnesty International urged its supporters to send appeals for the defenders' freedom to Nigeria... | Amnesty International urged its supporters to write letters to the Nigerian government, calling ... | [In the case of the Ogoni 9, Amnesty International called on its supporters to take action by si... | 0.919549 | 1.0 | 1.0 |
| 2 | What are the recommendations made by Amnesty International to the Special Rapporteur on Human Ri... | The recommendations made by Amnesty International to the Special Rapporteur on Human Rights Defe... | Amnesty International made several recommendations to the Special Rapporteur on Human Rights Def... | [In recent years, Amnesty International has focused on issues such as the increasing threats fac... | 0.905215 | 1.0 | 1.0 |
| 3 | Who are the target audience of the two books created by Amnesty International on child rights? | The target audience of the two books created by Amnesty International on child rights are childr... | The target audience of the two books created by Amnesty International on child rights are likely... | [In addition to children, parents, teachers, and caregivers are also key target audiences for Am... | 0.936374 | 1.0 | 1.0 |
| 4 | Which right guarantees access to comprehensive information about past human rights violations, i... | The right that guarantees access to comprehensive information about past human rights violations... | The right that guarantees access to comprehensive information about past human rights violations... | [The right to truth is a fundamental human right that seeks to uncover the full extent of past h... | 0.876583 | 1.0 | 1.0 |
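
Per-row scores are easier to compare across runs once they are reduced to one number per metric. The sketch below does this in plain Python with the scores copied from the results above. The `scores` and `summary` names are illustrative, and in the notebook itself the same figures could presumably be read from the numeric columns of `results_df`.

```python
from statistics import mean

# Per-row scores copied from the results above (rows 0 through 4)
scores = {
    "answer_relevancy": [0.89088, 0.919549, 0.905215, 0.936374, 0.876583],
    "context_precision": [1.0, 1.0, 1.0, 1.0, 1.0],
    "context_recall": [0.666667, 1.0, 1.0, 1.0, 1.0],
}

# One average per metric, rounded for display
summary = {metric: round(mean(values), 4) for metric, values in scores.items()}

# Index of the weakest row for context recall, worth inspecting by hand
recall = scores["context_recall"]
weakest_recall_row = min(range(len(recall)), key=recall.__getitem__)

for metric, value in summary.items():
    print(f"{metric}: {value}")
print("lowest context_recall at row", weakest_recall_row)
```

With only five rows these averages are indicative at best; the single row with a context recall of 0.666667 moves that metric's mean noticeably.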


Let's look closer at how one of these metrics works:

Answer Relevancy assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.

Answer Relevancy is defined as the mean cosine similarity between the original question and a number of artificial questions, which are generated (reverse engineered) from the answer.

![answer_relvancy_eq](images/answer_relevancy_eq.png)

This means that a good answer relevancy score requires not only accurate retrieval context, synthesis, and generation, but also a strong evaluator, with regard to both generation and embeddings. If your evaluator is unable to generate good questions from the answers, the cosine similarity may suffer and lower the overall score.
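
The definition above can be sketched in a few lines of plain Python. This is an illustration of the formula only, not the RAGAS implementation: the function names are hypothetical, and the two-dimensional vectors are made-up stand-ins for the embeddings that the evaluator's embedding model would produce for the original question and for each question the evaluator LLM generates from the answer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy_score(question_embedding, generated_question_embeddings):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of the questions generated from the answer."""
    similarities = [
        cosine_similarity(question_embedding, generated)
        for generated in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy embeddings (made up): one generated question matches the original,
# one is unrelated (orthogonal), and one is partially related.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

score = answer_relevancy_score(original, generated)
print(round(score, 4))
```

The unrelated generated question contributes a similarity of zero and drags the mean down, which is the mechanism by which a weak evaluator (poor generated questions or poor embeddings) lowers the score even when the answer itself is fine.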

#### RAGAS Pros
- Ease of use (the simplest/most lightweight of the three frameworks)
- Test set generation tools built in


#### RAGAS Cons
- No custom metrics (LLM-as-a-judge only)
- No reasoning supplied with scores (requires a lot of inherent trust)
- Customizing your evaluator LLM is a little more difficult
- Only supports RAG metrics (not well suited for overall evaluations)
