---

## Topic 1: Evaluation in Retrieval-Augmented Generation (RAG)

---

### Concept: Why Evaluate RAG Separately?

In Retrieval-Augmented Generation (RAG), model outputs depend not only on prompt quality or fine-tuning, but also on:

1. **Retrieval Quality** (Did the retriever fetch relevant context?)
2. **Groundedness** (Did the output stay faithful to the context?)
3. **Answer Relevance** (Did the answer actually help answer the query?)

Traditional metrics like accuracy or BLEU are not enough to evaluate these aspects.

---

### What We Measure in RAG

| Metric                 | What it Measures                                                        |
| ---------------------- | ----------------------------------------------------------------------- |
| **Groundedness**       | Whether the generated output is strictly based on retrieved content     |
| **Faithfulness**       | Whether the answer is factually consistent with context                 |
| **Relevance**          | Whether the answer is helpful or matches the intent of the question     |
| **Source Attribution** | Whether the answer correctly refers to which document it was drawn from |

---

### Real-World Significance

* In enterprise settings (legal, healthcare, HR), **hallucinations** are unacceptable
* Evaluating faithfulness and groundedness helps ensure trust in the system

---

## Problem Statement

We want to **evaluate the effectiveness** of a RAG pipeline by analyzing how well the generated answers:

* Stick to retrieved context (groundedness)
* Are factually correct (faithfulness)
* Are relevant to the query (relevance)

We assume that:

* Retrieval has already happened
* LLM outputs are available for a set of questions
* We manually score them using simple metrics

---

## Objectives

* Understand how to **quantify quality** of retrieved documents
* Manually rate the **generated answers** based on **relevance**, **faithfulness**, and **groundedness**
* Automate simple scoring for fast analysis

---

## Sample Inputs

Assume we have 3 questions, their retrieved contexts, and the LLM-generated answers.

We will score each answer from 0 to 1 for:

* `groundedness_score`
* `faithfulness_score`
* `relevance_score`

---

### Sample Output

```
Example 1
Question      : What are the key highlights of the 2024 AFGI Report?
Groundedness  : 0.8
Faithfulness  : 1.0
Relevance     : 1.0
```

---



In [1]:
# Step 1: Define evaluation data
data = [
    {
        "question": "What are the key highlights of the 2024 AFGI Report?",
        "retrieved_context": "The 2024 AFGI Report highlighted improvements in access to financing, digital inclusion, and gender-balanced credit participation.",
        "generated_answer": "The AFGI Report 2024 focuses on financial access and gender inclusion.",
    },
    {
        "question": "What was the public debt level in 2024?",
        "retrieved_context": "Public debt rose by 12% due to post-pandemic recovery spending and inflationary pressures.",
        "generated_answer": "In 2024, public debt increased due to high inflation.",
    },
    {
        "question": "How did rural banking evolve in the last year?",
        "retrieved_context": "Rural banking expanded to over 30,000 villages, supported by fintech partnerships and mobile banking.",
        "generated_answer": "Rural banking saw major digital growth with mobile banking reaching thousands of villages.",
    },
]

# Step 2: Manual scoring function
def evaluate_rag_outputs(dataset):
    results = []
    for entry in dataset:
        # Placeholder heuristics that stand in for manual review.
        # In a real evaluation, use human annotators or an LLM-based judge.
        answer = entry["generated_answer"].lower()
        context = entry["retrieved_context"].lower()
        # Groundedness: 1 only if the whole answer appears verbatim in the context.
        g_score = 1 if answer in context else 0.8
        # Faithfulness: 1 if the answer contains one of two hard-coded keywords.
        f_score = 1 if "increased" in answer or "focuses" in answer else 0.6
        # Relevance: 1 if the answer mentions any hard-coded topic word.
        r_score = 1 if any(word in answer for word in ["financial", "banking", "debt"]) else 0.5

        results.append({
            "question": entry["question"],
            "groundedness": g_score,
            "faithfulness": f_score,
            "relevance": r_score,
        })
    return results

# Step 3: Run evaluation
evaluation_scores = evaluate_rag_outputs(data)

# Step 4: Print results
for idx, res in enumerate(evaluation_scores):
    print(f"\nExample {idx+1}")
    print(f"Question      : {data[idx]['question']}")
    print(f"Groundedness  : {res['groundedness']}")
    print(f"Faithfulness  : {res['faithfulness']}")
    print(f"Relevance     : {res['relevance']}")



Example 1
Question      : What are the key highlights of the 2024 AFGI Report?
Groundedness  : 0.8
Faithfulness  : 1
Relevance     : 1

Example 2
Question      : What was the public debt level in 2024?
Groundedness  : 0.8
Faithfulness  : 1
Relevance     : 1

Example 3
Question      : How did rural banking evolve in the last year?
Groundedness  : 0.8
Faithfulness  : 0.6
Relevance     : 1


---

## **Output Breakdown**

### **Example 1**

**Question**: What are the key highlights of the 2024 AFGI Report?
**Generated Answer**: "The AFGI Report 2024 focuses on financial access and gender inclusion."
**Context**: Talks about **financing**, **digital inclusion**, **gender-balanced credit**.

| Metric       | Score | Reasoning                                            |
| ------------ | ----- | ---------------------------------------------------- |
| Groundedness | 0.8   | Not a direct phrase match, partial overlap only.     |
| Faithfulness | 1.0   | Includes key ideas from context (access, inclusion). |
| Relevance    | 1.0   | Fully answers the question on highlights.            |

### **Example 2**

**Question**: What was the public debt level in 2024?
**Generated Answer**: "In 2024, public debt increased due to high inflation."
**Context**: "Public debt rose by 12% due to post-pandemic recovery spending and inflationary pressures."

| Metric       | Score | Reasoning                                            |
| ------------ | ----- | ---------------------------------------------------- |
| Groundedness | 0.8   | Partial match; doesn’t mention 12%, omits one cause. |
| Faithfulness | 1.0   | Correctly states debt increased due to inflation.    |
| Relevance    | 1.0   | Answers the debt-related question directly.          |

### **Example 3**

**Question**: How did rural banking evolve in the last year?
**Generated Answer**: "Rural banking saw major digital growth with mobile banking reaching thousands of villages."
**Context**: Describes **30,000 villages**, **fintech partnerships**, and **mobile banking**.

| Metric       | Score | Reasoning                                                        |
| ------------ | ----- | ---------------------------------------------------------------- |
| Groundedness | 0.8   | Close match, but lacks precision (e.g., "30,000" not mentioned). |
| Faithfulness | 0.6   | Overgeneralized; omits fintech support, exaggerates slightly.    |
| Relevance    | 1.0   | Clearly addresses the question.                                  |

---

## **Overall Assessment**

| Metric       | Average Score (3 samples)        | Evaluation                                                |
| ------------ | -------------------------------- | --------------------------------------------------------- |
| Groundedness | (0.8 + 0.8 + 0.8) / 3 = **0.80** | Decent — answers are mostly grounded in retrieved context |
| Faithfulness | (1.0 + 1.0 + 0.6) / 3 = **0.87** | Good — mostly factually correct, minor exaggeration       |
| Relevance    | 1.0 across all                   | **Excellent** — answers are always on-topic               |

---
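
The averages in the table above can be reproduced with a few lines of Python. This is a minimal sketch: the `scores` list simply restates the three per-example results printed earlier, and the helper name `average_scores` is an assumption made for illustration.

```python
from statistics import mean

# Per-example scores copied from the evaluation output above.
scores = [
    {"groundedness": 0.8, "faithfulness": 1.0, "relevance": 1.0},
    {"groundedness": 0.8, "faithfulness": 1.0, "relevance": 1.0},
    {"groundedness": 0.8, "faithfulness": 0.6, "relevance": 1.0},
]

def average_scores(rows):
    """Return the mean of each metric across all rows, rounded to 2 decimals."""
    metrics = rows[0].keys()
    return {m: round(mean(r[m] for r in rows), 2) for m in metrics}

averages = average_scores(scores)
print(averages)
# {'groundedness': 0.8, 'faithfulness': 0.87, 'relevance': 1.0}
```

With only three samples, a single answer moves an average by a third of its score change, so treat these numbers as a smoke test rather than a benchmark.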

## **Is This Good or Bad?**

### **Good:**

* **Relevance is consistently high (1.0)**, which is critical in RAG.
* **Faithfulness is strong**, though one answer (Example 3) could be more careful in phrasing.
* **Groundedness is decent**, suggesting the model uses retrieved context well but doesn’t quote or align verbatim.

### **Could Be Improved:**

* Groundedness could improve by more tightly integrating actual phrases or details from the context.
* Faithfulness in Example 3 suggests a need for more accurate numeric retention (e.g., "30,000" vs. "thousands").

---

## **What We Understand From This**

1. The **RAG system is generally performing well**: answers are accurate, relevant, and fairly grounded.
2. Most quality issues are around **minor generalizations or partial grounding**, not misinformation.
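3. The evaluation logic is simple substring and keyword matching, so the scores above are illustrative rather than a real quality measurement. It could be enhanced using: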

   * **semantic similarity models (e.g., BERTScore)**
   * **LLM-based evaluation (e.g., GPT-4 as a judge)**

---




## 🔍 What is RAGAS?

https://docs.ragas.io/en/stable/

**RAGAS** stands for **Retrieval-Augmented Generation Assessment**.  
It’s an **open-source evaluation framework** that helps you **measure the quality** of your RAG pipelines.

In simple words:  
> “It tells you how good your RAG system is at retrieving useful chunks and at generating accurate, faithful answers.”

---

## 🧱 Why Do We Need RAGAS?

In RAG systems:
- The LLM answers a question based on documents retrieved by a search engine.
- We want to **ensure each step works well** — both the **retrieval** and the **generation**.

🛠 Traditional LLM evaluation (e.g., BLEU, ROUGE) doesn't work well here because:
- Answers might be factually correct but worded differently.
- We care more about **truthfulness**, **faithfulness**, and **context grounding**.

---

## 🔑 Key Metrics in RAGAS

Let’s walk through **each metric**, explain it in **simple words**, and give a **formula + example**.

---

### 📘 1. **Faithfulness**

**What it asks:**  
> "Is the generated answer truly supported by the retrieved context?"

- It checks if the answer **can be traced back** to the documents retrieved.

**Why it matters:**  
If the LLM "hallucinates" (i.e., makes up stuff), faithfulness will be low.

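#### ✅ Formula (High-level):
Faithfulness = (# claims in the answer supported by the context) / (# total claims in the answer)

- The answer is first broken into individual claims, and each claim is checked against the retrieved context.
- In RAGAS this check is done by an LLM judge acting like a **Natural Language Inference (NLI)** model.
- In NLI terms, only “entailment” counts as supported; “neutral” and “contradiction” do not.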

#### 🧠 Example:
**Context:** "Employees are eligible for 24 days of leave."  
**Question:** "How many leave days do I get?"  
**Answer:** "You get 30 days of leave." ❌  
➡️ Not faithful (answer contradicts the context)
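
RAGAS documents faithfulness as the fraction of claims in the answer that the retrieved context supports. The sketch below shows only that final ratio. The claim verdicts are hand-written assumptions standing in for the LLM-based claim extraction and verification, and `faithfulness_ratio` is a hypothetical helper, not a RAGAS function.

```python
def faithfulness_ratio(verdicts):
    """Fraction of answer claims judged as supported by the context.

    `verdicts` maps each claim (str) to True if the context supports it.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)

# Hand-labelled verdict for the example above (assumed, not model output).
verdicts = {"You get 30 days of leave.": False}
print(faithfulness_ratio(verdicts))  # 0.0

# A two-claim answer where only one claim is supported scores 0.5.
print(faithfulness_ratio({"Leave is 24 days.": True, "Leave is unpaid.": False}))  # 0.5
```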

---

### 🔍 2. **Context Precision**

**What it asks:**  
> "How much of the retrieved context is actually useful for answering the question?"

- It checks how much of the **retrieved info is relevant**.

#### ✅ Formula:
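Context Precision = (# relevant context chunks) / (# total retrieved chunks)

RAGAS refines this ratio to be rank-aware: it averages precision@k over the positions of the relevant chunks, so relevant chunks ranked near the top score higher.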

#### 🧠 Example:
You retrieve 5 chunks, but only 2 were useful →  
Context Precision = 2/5 = 0.4 (low)
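
The plain ratio ignores ranking, while RAGAS's context precision also rewards placing relevant chunks first by averaging precision@k over the ranks that hold a relevant chunk. A minimal sketch of both, assuming relevance labels are given as a list of 0/1 flags in retrieval order (both helper names are hypothetical):

```python
def simple_precision(relevance):
    """Share of retrieved chunks flagged as relevant (1) rather than not (0)."""
    return sum(relevance) / len(relevance) if relevance else 0.0

def rank_aware_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# 5 retrieved chunks, 2 relevant, as in the example above.
print(simple_precision([1, 1, 0, 0, 0]))      # 0.4
print(rank_aware_precision([1, 1, 0, 0, 0]))  # 1.0
print(rank_aware_precision([0, 0, 0, 1, 1]))  # 0.325
```

Both rankings have the same plain ratio (0.4), but the rank-aware score separates a retriever that puts the useful chunks first from one that buries them.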

---

### 🧠 3. **Answer Relevancy**

**What it asks:**  
> "How relevant is the generated answer to the actual question?"

- Even if it’s not totally correct, was it **on topic**?

#### ✅ Formula:
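RAGAS uses an LLM to generate a few questions that the answer would fit, embeds them, and averages their cosine similarity to the original question: **question ↔ questions implied by the answer**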

#### 🧠 Example:
**Question:** "What’s the deadline to claim reimbursements?"  
**Answer:** "Employees must serve a 30-day notice period."  
➡️ Low relevancy (off-topic)

---

### ✅ 4. **Answer Correctness**

**What it asks:**  
> "Is the answer factually correct compared to ground truth?"

- Needs labeled test data with correct answers.
- Often used during **benchmark testing**.

#### ✅ Formula:
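Correctness = Similarity(predicted_answer, ground_truth_answer)

In RAGAS this similarity is a weighted combination of factual overlap (an F1 score over claims shared with the ground truth) and embedding-based semantic similarity.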

#### 🧠 Example:
**Ground Truth:** "10 business days"  
**Answer:** "Ten working days" → ✅ High correctness

---

### 🚫 5. **Context Recall**

**What it asks:**  
> "Did we retrieve all the chunks needed to answer the question completely?"

- Especially useful in **multi-hop QA** (where you need multiple pieces of info).

#### ✅ Formula:
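Context Recall = (# relevant retrieved chunks) / (# relevant total chunks in corpus)

Because the full set of relevant chunks is rarely known, RAGAS estimates this from the ground-truth answer: the share of its claims that can be attributed to the retrieved context. The metric therefore needs ground truth.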
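
The chunk-level ratio can be sketched as follows, assuming you have labelled which chunk IDs are relevant for a question. The chunk IDs and the helper name `context_recall` are hypothetical and not part of the RAGAS API.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that the retriever actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

# Hypothetical multi-hop case: the question needs c1 and c4,
# but the retriever returned c1, c2 and c3.
print(context_recall(["c1", "c2", "c3"], ["c1", "c4"]))  # 0.5
```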

---

## 🔬 Additional Components

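### ✅ 6. **RAGAS Score** (Overall Composite)

A single composite number can be formed as a **weighted average** of the metrics:  
> Faithfulness, Context Precision, Answer Relevancy, Correctness

Current RAGAS releases report each metric separately, so this composite is something you compute yourself, with weights chosen for your use case.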
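
As a rough illustration, a weighted average could be computed as below. The scores are the illustrative values from the example output table that follows, and both the equal default weights and the helper name `composite_score` are assumptions made for this sketch.

```python
def composite_score(scores, weights=None):
    """Weighted average of metric scores; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {
    "faithfulness": 0.88,
    "context_precision": 0.60,
    "answer_relevancy": 0.95,
    "answer_correctness": 0.90,
}
print(round(composite_score(scores), 2))  # 0.83

# Example of emphasising faithfulness, e.g. for a compliance use case.
weights = {"faithfulness": 2, "context_precision": 1,
           "answer_relevancy": 1, "answer_correctness": 1}
print(round(composite_score(scores, weights), 2))  # 0.84
```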

---

## 🧪 Example RAGAS Output

| Metric              | Score |
|---------------------|-------|
| Faithfulness        | 0.88  |
| Context Precision   | 0.60  |
| Answer Relevancy    | 0.95  |
| Answer Correctness  | 0.90  |
| **RAGAS Score**     | 0.83  |

---

## 🛠 How to Use RAGAS in Code

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy, answer_correctness

# Load your question-answer-context triples.
# These column names follow the older RAGAS schema; newer releases name them
# user_input / retrieved_contexts / response / reference.
hf_dataset = Dataset.from_dict({
    "question": [...],
    "contexts": [...],      # one list of context strings per question
    "answer": [...],
    "ground_truth": [...],  # needed by reference-based metrics such as answer_correctness
})

# Evaluate. The LLM-based metrics call a judge model (OpenAI by default),
# so an API key must be configured before running this.
results = evaluate(
    hf_dataset,
    metrics=[faithfulness, context_precision, answer_relevancy, answer_correctness]
)
print(results)
```

---

## 🧠 When Should You Use RAGAS?

- ✅ Benchmarking different RAG pipelines
- ✅ Comparing vector DBs (e.g., FAISS vs Chroma)
- ✅ Checking model hallucination
- ✅ Monitoring production quality of GenAI apps
- ✅ Deciding fine-tuning vs retrieval

---

## 📦 Installation

```bash
pip install ragas
```

Also requires:
```bash
pip install datasets transformers evaluate
```

---

## 📚 Summary Table

| Metric             | Measures                        | Needs Ground Truth?                | Model Used                        |
|--------------------|---------------------------------|------------------------------------|-----------------------------------|
| Faithfulness       | Is answer supported by context? | ❌                                 | LLM judge (NLI-style claim check) |
| Context Precision  | How much context is useful?     | ✅ (reference-free variants exist) | LLM judge                         |
| Answer Relevancy   | Is answer on-topic?             | ❌                                 | LLM + embedding similarity        |
| Answer Correctness | Is answer factually right?      | ✅                                 | LLM + embedding similarity        |
| Context Recall     | All info retrieved?             | ✅                                 | LLM judge                         |

---

In [4]:
!pip install sacrebleu ragas -q

In [5]:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
metric = BleuScore()
# BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between the
# generated response and a human-written reference.
# Higher BLEU = wording closer to the reference (not necessarily a more correct answer).
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)

# It compares the "response" with the "reference" and returns a score between 0 and 1.
# A score closer to 1 means the response is worded very much like the reference.
# The output below (about 0.137) is low even though the summary is reasonable,
# because BLEU penalizes paraphrasing and different word choices.

0.13718598426177148

In [14]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import BleuScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = BleuScore()
await scorer.single_turn_ascore(sample)

0.7071067811865478

## 🔴 What is ROUGE?

**ROUGE** stands for:  
> **Recall-Oriented Understudy for Gisting Evaluation**

It’s a set of **metrics used to compare generated text against a reference**, in tasks like:

- Summarization
- Question Answering
- Retrieval-Augmented Generation (RAG)

---

## 🧠 Simple Explanation

> ROUGE measures **how much of the reference (ground truth) answer appears in the generated answer**.

While **BLEU** focuses on **precision** (how much of the generated content is correct),  
**ROUGE focuses more on recall** (how much of the reference was captured).

---

## 🔢 ROUGE Variants (Most Used)

| Type        | Meaning                                                       |
|-------------|---------------------------------------------------------------|
| **ROUGE-1** | Overlap of **unigrams** (single words)                        |
| **ROUGE-2** | Overlap of **bigrams** (pairs of words)                       |
| **ROUGE-L** | Longest Common Subsequence (LCS) between reference & result   |

---

## 📐 How is ROUGE Calculated?

Let’s use **ROUGE-1** as an example.

---

### ✅ 1. **Unigrams (ROUGE-1)**

Suppose:

- **Generated Answer:** "Employees get 24 leave days"
- **Reference Answer:** "Employees get 24 paid leave days"

Unigrams in Generated = {employees, get, 24, leave, days}  
Unigrams in Reference = {employees, get, 24, paid, leave, days}

**Overlap (common):** employees, get, 24, leave, days → 5 words

---

### ✅ 2. **Formula: ROUGE Recall, Precision, F1**

Let’s define:

- **A** = words in generated answer  
- **B** = words in reference answer  
- **Overlap** = common words in both

#### 📘 ROUGE Recall:
\[
\text{Recall} = \frac{\text{number of overlapping words}}{\text{number of words in reference}} = \frac{5}{6} \approx 0.83
\]

---

#### 📙 ROUGE Precision:
\[
\text{Precision} = \frac{\text{number of overlapping words}}{\text{number of words in generated}} = \frac{5}{5} = 1.0
\]

---

#### 📗 ROUGE-F1:

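\[
\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot 1.0 \cdot 0.83}{1.0 + 0.83} \approx 0.91
\]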
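
The worked example can be checked with a small hand-rolled function. This is a sketch using lowercased whitespace tokens, not the `rouge_score` implementation, and the function name `rouge1` is an assumption.

```python
from collections import Counter

def rouge1(generated, reference):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped count of shared words
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f1 = rouge1("Employees get 24 leave days", "Employees get 24 paid leave days")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.83 f1=0.91
```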

---

## 🧪 Example Summary Table

F1 scores for the example pair above:

| Metric      | Score |
|-------------|-------|
| ROUGE-1     | 0.91  |
| ROUGE-2     | 0.67  |
| ROUGE-L     | 0.91  |

---

## 🔬 ROUGE vs BLEU

| Feature              | BLEU                         | ROUGE                        |
|----------------------|------------------------------|------------------------------|
| Focus                | Precision                    | Recall                       |
| Goal                 | How much of gen is correct   | How much of ref is covered   |
| Good For             | Translation                  | Summarization, QA            |
| Common in            | Machine Translation          | Summarization & RAG          |
| Metric               | n-gram overlap (gen → ref)   | n-gram overlap (ref → gen)   |

---

## 🔍 ROUGE in RAGAS

ROUGE is **not one of the LLM-based RAG metrics in RAGAS** (faithfulness, answer relevancy, etc.), but RAGAS does ship it as a traditional string-overlap metric, `RougeScore`:

- You **can compute ROUGE** to compare **model-generated answers vs. ground truth**
- It helps when evaluating **extractive summarization** or **QA answer overlap**

---

## 🛠 How to Compute ROUGE in Python

```python
from evaluate import load

rouge = load("rouge")

# Sample prediction and reference
predictions = ["Employees get 24 leave days."]
references = ["Employees get 24 paid leave days."]

results = rouge.compute(predictions=predictions, references=references)

print(results)
```

### Output:
```python
# F1 values, rounded to three decimals
{
  'rouge1': 0.909,
  'rouge2': 0.667,
  'rougeL': 0.909,
  'rougeLsum': 0.909
}
```

---

## 📚 Real-World RAG Example

**Question:** "How many paid leaves do I get?"

- **Reference:** "Employees get 24 paid leave days."
- **Model A Answer:** "Employees get 24 leave days."
- **Model B Answer:** "Employees receive 30 days off."

| Metric (F1) | Model A     | Model B     |
|-------------|-------------|-------------|
| ROUGE-1     | 0.91        | 0.36        |
| ROUGE-2     | 0.67        | 0.00        |
| ROUGE-L     | 0.91        | 0.36        |

🧠 **Model A** is closer in language and content. ROUGE confirms that.
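
A comparison like this can be reproduced without any library. The sketch below is a simplified, hand-rolled version of ROUGE-N and ROUGE-L (F1 only, no stemming), so the helper names are assumptions and small differences from `rouge_score` are possible on other inputs.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def f1(overlap, n_gen, n_ref):
    """Harmonic mean of precision (overlap/n_gen) and recall (overlap/n_ref)."""
    if overlap == 0:
        return 0.0
    p, r = overlap / n_gen, overlap / n_ref
    return 2 * p * r / (p + r)

def rouge_n(generated, reference, n):
    """F1 over clipped n-gram overlap."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    gen, ref = ngrams(tokens(generated)), ngrams(tokens(reference))
    return f1(sum((gen & ref).values()), sum(gen.values()), sum(ref.values()))

def rouge_l(generated, reference):
    """F1 based on the longest common subsequence (LCS) of tokens."""
    a, b = tokens(generated), tokens(reference)
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            lcs[i][j] = lcs[i-1][j-1] + 1 if x == y else max(lcs[i-1][j], lcs[i][j-1])
    return f1(lcs[-1][-1], len(a), len(b))

reference = "Employees get 24 paid leave days."
answers = {"Model A": "Employees get 24 leave days.",
           "Model B": "Employees receive 30 days off."}
for name, answer in answers.items():
    print(name,
          round(rouge_n(answer, reference, 1), 2),
          round(rouge_n(answer, reference, 2), 2),
          round(rouge_l(answer, reference), 2))
```

Note that Model B still gets a non-zero ROUGE-1 from shared words like "employees" and "days" even though its number is wrong, which is why overlap metrics are best paired with a faithfulness check.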

---

## ✅ Summary

| Term         | Meaning                                                    |
|--------------|------------------------------------------------------------|
| ROUGE-1      | Overlap of unigrams (single words)                         |
| ROUGE-2      | Bigram overlap (word pairs)                                |
| ROUGE-L      | Longest common subsequence between reference and generated |
| ROUGE Recall | % of reference captured                                    |
| ROUGE F1     | Balance of precision and recall                            |

---

## 🚀 Use ROUGE When:
- Evaluating **QA or summarization** output quality
- Comparing **generated answers to ground-truth**
- Complementing semantic metrics (like **faithfulness** in RAGAS)

---

In [8]:
!pip install rouge_score -q



In [9]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore()
await scorer.single_turn_ascore(sample)

0.8571428571428571

In [10]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore(rouge_type="rouge1")
await scorer.single_turn_ascore(sample)

0.8571428571428571

In [11]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore(mode="recall")
await scorer.single_turn_ascore(sample)

0.8571428571428571

In [12]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ExactMatch

sample = SingleTurnSample(
    response="India",
    reference="Paris"
)

scorer = ExactMatch()
await scorer.single_turn_ascore(sample)

0.0

In [13]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import StringPresence

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="Eiffel Tower"
)
scorer = StringPresence()
await scorer.single_turn_ascore(sample)

1.0

# Happy Learning