<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/end_to_end/04_evaluating_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [80]:
!pip install -q ragas sacrebleu rouge_score rapidfuzz langchain openai python-dotenv

In [26]:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore, RougeScore
import pandas as pd

## Traditional non LLM Metrics



### Non LLM String Similarity

Non LLM String Similarity:
- A metric that measures how close the generated text is to the reference text using string distance measures such as Levenshtein, Hamming, and Jaro.
- It does not rely on an LLM; it is a purely heuristic, character-level comparison.
- The score ranges from 0 to 1, where 1 means the two strings are identical.

How **NonLLMStringSimilarity** is calculated, with an example for each of three common distance measures:

### 1. Levenshtein Distance
**Calculation**: Measures the minimum number of single-character edits needed to transform one string into another (insertions, deletions, or substitutions).

**Example**:
- **String 1**: "kitten"
- **String 2**: "sitting"
- **Levenshtein Distance**: 3 (replace 'k' with 's', replace 'e' with 'i', insert 'g')
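
The edit count above can be checked with a small dynamic-programming routine. This is a minimal sketch written for illustration (the helper name `levenshtein` is an assumption of this example, not part of any library used in the notebook):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.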

### 2. Hamming Distance
**Calculation**: Measures the number of positions at which the corresponding characters in two strings differ. Applicable to strings of the same length.

**Example**:
- **String 1**: "karolin"
- **String 2**: "kathrin"
- **Hamming Distance**: 3 (positions 3, 4, and 5 differ: 'r'/'t', 'o'/'h', 'l'/'r')
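
As a rough sketch, the Hamming distance is a position-by-position comparison. The helper below is hypothetical and written only for this example; it rejects strings of different lengths, since the measure is not defined for them:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    # Compare the strings character by character and count the mismatches
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


print(hamming("karolin", "kathrin"))  # 3
```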

### 3. Jaro-Winkler Distance
**Calculation**: Measures the similarity between two strings by considering the number and order of matching characters, with adjustments for common prefixes.

**Example**:
- **String 1**: "martha"
- **String 2**: "marhta"
- **Jaro-Winkler Similarity**: about 0.96. The two strings share all six characters, with a single transposition ('t' and 'h') and the common prefix "mar", so they score higher than a pair such as "martha" and "marie", which share fewer characters.
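
The following is a minimal sketch of the textbook Jaro and Jaro-Winkler formulas, assuming the usual prefix scale of 0.1, a prefix cap of 4 characters, and no boost threshold. Library implementations may differ in such details, and the function names are chosen for this example only:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    score = jaro(s1, s2)
    # Length of the common prefix, capped at 4 characters
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro_winkler("martha", "marhta"), 3))  # 0.961
print(round(jaro_winkler("martha", "marie"), 2))   # 0.79
```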

### Summary
- **Levenshtein Distance**: Focuses on the number of edits required to make two strings identical. Lower value = more similar.
- **Hamming Distance**: Focuses on character differences in strings of equal length. Lower value = more similar.
- **Jaro-Winkler Distance**: Focuses on matching characters and their order, with higher values indicating greater similarity.

These traditional string similarity measures help determine how close two strings are based on their character content, rather than semantic meaning.

In [41]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import NonLLMStringSimilarity, DistanceMeasure

In [37]:
sample = SingleTurnSample(
    response = "The Eiffel Tower is located in India",
    reference = "The Eiffel Tower is located in Paris"
)

In [39]:
scorer = NonLLMStringSimilarity()
await scorer.single_turn_ascore(sample)

0.8888888888888888

The distance measure can be switched with the `distance_measure` argument, for example to Hamming distance:

In [44]:
scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.HAMMING)
await scorer.single_turn_ascore(sample)

0.8888888888888888
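
Both calls return the same value for this sample. A plausible explanation, assuming the score is one minus a length-normalized distance (an assumption about the normalization, not something the notebook confirms), is that the two strings have equal length and differ only by character substitutions, so the Levenshtein and Hamming distances coincide:

```python
response = "The Eiffel Tower is located in India"
reference = "The Eiffel Tower is located in Paris"

# Both strings have the same length, so they can be compared position by position
assert len(response) == len(reference)
mismatches = sum(a != b for a, b in zip(response, reference))

# Assumed normalization: similarity = 1 - distance / string length
similarity = 1 - mismatches / len(reference)
print(mismatches, len(reference), round(similarity, 4))  # 4 36 0.8889
```

"India" and "Paris" differ in four of their five characters (only the second 'i' lines up), which gives 1 - 4/36.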

### BLEU Score


- The BLEU score compares the response with the reference text based on n-gram precision, combined with a brevity penalty for responses that are shorter than the reference.
- It was originally created for evaluating machine translation systems, but it is also used for other NLP tasks.
- The BLEU score ranges from 0 to 1.
- It is a non LLM metric.
- It is case sensitive.

In [49]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import BleuScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = BleuScore()
await scorer.single_turn_ascore(sample)

0.7071067811865478
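
To see where a value like this comes from, here is a minimal sentence-level BLEU sketch: the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty. The tokenizer (words and punctuation as separate tokens) and the absence of smoothing are assumptions of this sketch. The library implementation applies its own tokenization and smoothing, so results can differ on other inputs:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: words and punctuation marks become separate, case-sensitive tokens
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(response, reference, max_n=4):
    cand, ref = tokenize(response), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clipped overlap: an n-gram is credited at most as often as it occurs in the reference
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: only responses shorter than the reference are penalized
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


score = bleu("The Eiffel Tower is located in India.", "The Eiffel Tower is located in Paris.")
print(round(score, 4))  # 0.7071
```

With eight tokens per sentence and one wrong word, the n-gram precisions are 7/8, 5/7, 4/6, and 3/5. Their product is 0.25, and its fourth root is about 0.7071.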

### ROUGE Score

- This score is primarily used for evaluating the quality of generated text, such as summaries.
- It measures the overlap between the response and the reference based on n-gram recall, precision, and F1 score.
- It ranges from 0 to 1.


There are four common types of ROUGE score:
1. ROUGE-N: Measures n-gram overlap. For example, ROUGE-1 measures unigrams (single words) and ROUGE-2 measures bigrams (two-word sequences).

2. ROUGE-L:
- Measures the longest common subsequence (LCS) of words, i.e. the longest sequence of words that appears in both texts in the same order, without requiring the words to be consecutive.
- Useful for capturing sentence structure and fluency.

3. ROUGE-W (Weighted LCS): Similar to ROUGE-L, but assigns higher weight to longer consecutive matches.

4. ROUGE-S (Skip-Bigrams): Measures the overlap of word pairs that appear in the same order in both texts, allowing gaps between the two words of a pair.


How does ROUGE calculate precision, recall, and F1?

1. Recall (R) = overlapping n-grams / total n-grams in the reference

High recall means the response covers most of the reference content.

2. Precision (P) = overlapping n-grams / total n-grams in the generated text

High precision means the response is concise and contains little content that is absent from the reference.

3. F1 score = 2 × P × R / (P + R), the harmonic mean of precision and recall.
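
The formulas above can be sketched in a few lines for ROUGE-N and ROUGE-L. This is an illustrative implementation with an assumed tokenizer (lowercased alphanumeric tokens), not the library code used by the metric:

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep alphanumeric tokens only
    return re.findall(r"[a-z0-9]+", text.lower())


def precision_recall_f1(overlap, n_response, n_reference):
    precision = overlap / n_response if n_response else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge_n(response, reference, n=1):
    resp, ref = tokenize(response), tokenize(reference)
    resp_counts = Counter(tuple(resp[i:i + n]) for i in range(len(resp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((resp_counts & ref_counts).values())  # clipped n-gram overlap
    return precision_recall_f1(overlap, sum(resp_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    # Length of the longest common subsequence, computed row by row
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            current.append(previous[j - 1] + 1 if x == y else max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(response, reference):
    resp, ref = tokenize(response), tokenize(reference)
    return precision_recall_f1(lcs_length(resp, ref), len(resp), len(ref))


response = "The Eiffel Tower is located in India."
reference = "The Eiffel Tower is located in Paris."
print([round(v, 4) for v in rouge_n(response, reference, n=1)])  # [0.8571, 0.8571, 0.8571]
print([round(v, 4) for v in rouge_n(response, reference, n=2)])  # [0.8333, 0.8333, 0.8333]
print([round(v, 4) for v in rouge_l(response, reference)])       # [0.8571, 0.8571, 0.8571]
```

Precision and recall are equal here only because both sentences have the same number of tokens; with texts of different lengths the two values diverge.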


In [52]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore()
await scorer.single_turn_ascore(sample)

0.8571428571428571

### Exact Match

It checks whether the generated text exactly matches the reference. The two strings are compared as a whole, so any difference in wording counts as a mismatch.

- `1` if the response is an exact match
- `0` otherwise

In [56]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ExactMatch

sample = SingleTurnSample(
    response="India",
    reference="Paris"
)

scorer = ExactMatch()
await scorer.single_turn_ascore(sample)

0.0

In [58]:
sample = SingleTurnSample(
    response="Paris",
    reference="Paris"
)

scorer = ExactMatch()
await scorer.single_turn_ascore(sample)

1.0

### String Presence

This metric checks whether the response contains the reference text. It is useful for checking whether certain keywords are present in the generated text.

- Can be useful for data extraction
- Remember that it is case sensitive

In [64]:
from ragas.metrics import StringPresence

sample = SingleTurnSample(
    response="This is not Spam mail",
    reference="spam"
)

scorer = StringPresence()
await scorer.single_turn_ascore(sample)

0.0
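
The score is 0.0 because "Spam" in the response and "spam" in the reference differ in case. Assuming the metric behaves like a plain substring check (an assumption of this sketch), the effect can be reproduced in pure Python. Normalizing case on both sides before building the sample is one possible way to make the check case insensitive:

```python
response = "This is not Spam mail"
reference = "spam"

# Plain substring check: fails because of the capital "S"
case_sensitive = float(reference in response)

# Lowercasing both strings first makes the check case insensitive
case_insensitive = float(reference.lower() in response.lower())

print(case_sensitive, case_insensitive)  # 0.0 1.0
```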

## Evaluating Text Summarization

In [67]:
summary_test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
test_sample_data = SingleTurnSample(**summary_test_data)

In [68]:
metric = BleuScore()
metric.single_turn_score(test_sample_data)

0.13718598426177148

In [69]:
rouge_metric = RougeScore()
rouge_metric.single_turn_score(test_sample_data)

0.5666666666666667

## Evaluating RAG Pipeline

We can split the evaluation of the pipeline into two parts:

1. Evaluating the retrieval part
    - Context Precision
    - Context Recall
2. Evaluating the generation part
    - Answer Relevance
    - Faithfulness: whether the answer is factually grounded in the retrieved context or is hallucinated




### Context Precision

In a RAG pipeline, the retrieval step returns a set of documents (say 5). Context precision measures how many of these retrieved documents are relevant to the query, and whether the relevant ones are ranked near the top.

In RAGAS, the LLM-based context precision metrics use an LLM as a judge to decide whether each retrieved chunk is relevant.

To decide whether a retrieved context is relevant, `LLMContextPrecisionWithoutReference` uses the LLM to compare each chunk in `retrieved_contexts` with the `response`, while `LLMContextPrecisionWithReference` compares each chunk with the `reference` instead.
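
Once each chunk has a relevance verdict, the verdicts are combined into a single rank-aware score. The sketch below assumes the commonly documented form of the metric: the average of precision@k over the ranks k that hold a relevant chunk. The verdict lists are hypothetical stand-ins for the judge's output, and the small `eps` in the denominator is an assumption. An epsilon of this kind would also explain a perfect score appearing as 0.9999999999 rather than 1.0.

```python
def context_precision(verdicts, eps=1e-10):
    """verdicts[k] is 1 if the chunk at rank k + 1 was judged relevant, else 0."""
    weighted_precision_sum = 0.0
    relevant_so_far = 0
    for rank, verdict in enumerate(verdicts, start=1):
        relevant_so_far += verdict
        precision_at_rank = relevant_so_far / rank
        # Only ranks that hold a relevant chunk contribute to the sum
        weighted_precision_sum += precision_at_rank * verdict
    # eps avoids division by zero when no chunk is relevant
    return weighted_precision_sum / (sum(verdicts) + eps)


print(round(context_precision([1, 0, 1]), 4))  # 0.8333
print(round(context_precision([0, 0, 1]), 4))  # 0.3333
print(context_precision([0, 0, 0]))            # 0.0
```

The second call shows the rank sensitivity: one relevant chunk placed at rank 3 scores far lower than the same chunk would at rank 1.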

In [84]:
from ragas.llms import LangchainLLMWrapper
from langchain.chat_models import ChatOpenAI
from dotenv import load_dotenv
import os

In [85]:
if os.path.exists(".env"):
    os.remove(".env")

from google.colab import files
uploaded = files.upload()
if uploaded:
    if load_dotenv(".env"):
        print("Uploaded and Loaded Successfully")

Saving .env to .env
Uploaded and Loaded Successfully


In [87]:
chat_model = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
evaluator_llm = LangchainLLMWrapper(chat_model)

In [88]:
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference

context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)


await context_precision.single_turn_ascore(sample)

0.9999999999

In [89]:
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference

context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

await context_precision.single_turn_ascore(sample)

0.9999999999

In [90]:
from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference

context_precision = NonLLMContextPrecisionWithReference()

sample = SingleTurnSample(
    retrieved_contexts=["The Eiffel Tower is located in Paris."],
    reference_contexts=["Paris is the capital of France.", "The Eiffel Tower is one of the most famous landmarks in Paris."]
)

await context_precision.single_turn_ascore(sample)

0.9999999999

## BERTScore

### What is BERTScore?
- **BERTScore** is a token-based similarity evaluation metric.
- It uses contextual embeddings from a pre-trained language model such as BERT to measure the similarity between two texts.

### How BERTScore Works
1. **Tokenization**: The texts are tokenized into smaller units (tokens), usually words or subwords.
2. **Embedding**: Each token is converted into a high-dimensional vector (embedding) using a pre-trained BERT model.
3. **Similarity Calculation**: Cosine similarity is used to compare the embeddings of tokens from the two texts.
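4. **Aggregation**: Each token is greedily matched to its most similar token in the other text. Averaging these best-match similarities over the candidate tokens gives precision, averaging over the reference tokens gives recall, and the F1 score is their harmonic mean.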
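
Steps 3 and 4 can be sketched without any model by using made-up vectors in place of BERT embeddings. The two-dimensional "embeddings" below are hypothetical and chosen only to make the arithmetic easy to follow. The sketch also leaves out optional refinements such as IDF weighting and baseline rescaling:

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    # Pairwise cosine similarity: one row per candidate token, one column per reference token
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: each candidate token takes its best match among the reference tokens
    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    # Recall: each reference token takes its best match among the candidate tokens
    recall = sum(max(row[j] for row in sim) for j in range(len(reference_vecs))) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical token embeddings (a real run would take them from a BERT model)
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 1.0 0.933 0.966
```

Every candidate token has a perfect counterpart in the reference, so precision is 1.0. The middle reference token has no exact counterpart, which lowers recall.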

### Example
```python
from bert_score import score

candidate = ["The cat sat on the mat."]
references = ["The cat was on the mat."]

P, R, F1 = score(candidate, references, lang='en', model_type='bert-base-uncased')
print(f"Precision: {P}\nRecall: {R}\nF1: {F1}")
```

### When to Use BERTScore
1. Text Summarization: To evaluate the quality of generated summaries by comparing them with reference summaries.
2. Machine Translation: To assess the accuracy and relevance of translated text compared to reference translations.
3. Paraphrasing: To measure how closely a paraphrased text matches the original text in meaning.
4. Text Generation: To evaluate the semantic similarity between generated text and reference text in various natural language generation tasks.

### Advantages of BERTScore
1. Captures semantic similarity by considering contextual meaning.
2. Provides a more nuanced evaluation compared to traditional string-based metrics.
3. Effective for evaluating tasks where meaning and context are important.
