# Part 4: Evaluating Retrieval-Augmented LLMs (RAG)

This notebook demonstrates how to compute common retrieval metrics—Precision@k, Recall@k, Hit@k, Mean Reciprocal Rank (MRR), and normalized Discounted Cumulative Gain (nDCG)—on a small simulated dataset. These metrics help assess how well a retrieval system surfaces relevant documents for each query.

## Simulated Dataset

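We simulate three queries, each with a ground-truth set of relevant documents. Two models (A and B) return a ranked list of candidate documents for every query. Precision, Recall, and Hit are computed over the top **k=3** retrieved items, while MRR and nDCG are computed over each full ranked list.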

In [None]:

import math
import pandas as pd

# Queries and ground-truth relevant documents
queries = ['q1', 'q2', 'q3']
relevant_docs = {
    'q1': {'doc1', 'doc3', 'doc4'},
    'q2': {'doc2', 'doc5'},
    'q3': {'doc3', 'doc6', 'doc7', 'doc9'}
}

# Retrieval results for two models
predictions_A = {
    'q1': ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'],
    'q2': ['doc2', 'doc5', 'doc6', 'doc1', 'doc3'],
    'q3': ['doc4', 'doc3', 'doc7', 'doc2', 'doc8', 'doc6', 'doc9']
}

predictions_B = {
    'q1': ['doc2', 'doc4', 'doc5', 'doc1', 'doc3'],
    'q2': ['doc5', 'doc6', 'doc2', 'doc3', 'doc1'],
    'q3': ['doc9', 'doc6', 'doc7', 'doc3', 'doc8', 'doc2']
}


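def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predictions that are relevant."""
    return sum(doc in relevant for doc in predicted[:k]) / k

def recall_at_k(predicted, relevant, k):
    """Fraction of the relevant documents found in the top-k predictions."""
    if not relevant:
        return 0.0  # avoid dividing by zero when a query has no relevant documents
    return sum(doc in relevant for doc in predicted[:k]) / len(relevant)

def hit_at_k(predicted, relevant, k):
    """1.0 if at least one of the top-k predictions is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in predicted[:k]) else 0.0

def mrr(predicted, relevant):
    """Reciprocal rank of the first relevant document for one query (0.0 if none).

    Averaging this value over all queries gives the mean reciprocal rank.
    """
    for idx, doc in enumerate(predicted, start=1):
        if doc in relevant:
            return 1.0 / idx
    return 0.0

def dcg(predicted, relevant):
    """Discounted cumulative gain with binary relevance over the full ranked list."""
    # enumerate() is 0-based, so the document at rank r = i + 1 is discounted by log2(r + 1).
    return sum(1.0 / math.log2(i + 2) for i, doc in enumerate(predicted) if doc in relevant)

def idcg(relevant):
    """DCG of an ideal ranking that places every relevant document first."""
    return sum(1.0 / math.log2(i + 2) for i in range(len(relevant)))

def ndcg(predicted, relevant):
    """DCG normalized by the ideal DCG; 0.0 when there are no relevant documents."""
    denom = idcg(relevant)
    return dcg(predicted, relevant) / denom if denom > 0 else 0.0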

k = 3


In [1]:

# Average each metric over all queries, separately for each model
models = {'Model A': predictions_A, 'Model B': predictions_B}
metrics = []
for model_name, preds in models.items():
    p_sum = r_sum = hit_sum = mrr_sum = ndcg_sum = 0
    for q in queries:
        rel = relevant_docs[q]
        res = preds[q]
        p_sum += precision_at_k(res, rel, k)
        r_sum += recall_at_k(res, rel, k)
        hit_sum += hit_at_k(res, rel, k)
        mrr_sum += mrr(res, rel)
        ndcg_sum += ndcg(res, rel)
    n = len(queries)
    metrics.append({
        'Model': model_name,
        f'Precision@{k}': round(p_sum / n, 3),
        f'Recall@{k}': round(r_sum / n, 3),
        f'Hit@{k}': round(hit_sum / n, 3),
        'MRR': round(mrr_sum / n, 3),
        'nDCG': round(ndcg_sum / n, 3)
    })

metrics_df = pd.DataFrame(metrics)
metrics_df


  Model  Precision@3  Recall@3  Hit@3   MRR  nDCG
Model A        0.667     0.722    1.0 0.833 0.872
Model B        0.667     0.694    1.0 0.833 0.866
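
Note that the `mrr` and `ndcg` functions in this notebook score each full ranked list, so the MRR and nDCG columns are not restricted to the top 3. If cutoff-aware versions are wanted, one possible sketch is below. The names `mrr_at_k` and `ndcg_at_k` are hypothetical helpers, not part of the notebook, and truncating the ideal ranking at k is one common convention rather than the only one.

```python
import math

def mrr_at_k(predicted, relevant, k):
    """Reciprocal rank of the first relevant document within the top k, else 0.0."""
    for rank, doc in enumerate(predicted[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(predicted, relevant, k):
    """Binary-relevance nDCG truncated at k (assumption: the ideal list is also cut at k)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(predicted[:k])
        if doc in relevant
    )
    # An ideal top-k list can hold at most min(|relevant|, k) relevant documents.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Model A's ranking and the ground truth for q3, copied from the notebook
ranked = ['doc4', 'doc3', 'doc7', 'doc2', 'doc8', 'doc6', 'doc9']
relevant = {'doc3', 'doc6', 'doc7', 'doc9'}

print(round(ndcg_at_k(ranked, relevant, 3), 3))
print(mrr_at_k(ranked, relevant, 3))
```

For this ranking the sketch gives nDCG@3 of about 0.531 and MRR@3 of 0.5, compared with about 0.711 for the notebook's full-list nDCG on the same query.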

### Discussion

* **Precision@k** measures the fraction of the top-k retrieved documents that are relevant.
* **Recall@k** measures the fraction of relevant documents that appear in the top-k.
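* **Hit@k** is a binary metric: 1 if at least one relevant document appears in the top-k, otherwise 0.
* **MRR** (mean reciprocal rank) averages, over queries, the reciprocal of the rank of the first relevant document, so it rewards placing a relevant document early. Here it is computed over the full ranked list.
* **nDCG** accounts for the position of every relevant document, discounting each by the logarithm of its rank and normalizing by the score of an ideal ranking. Here it uses binary relevance over the full ranked list.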
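
As a compact reference, the definitions above can be written out as follows, matching how the notebook's functions compute them. The notation is introduced here for illustration and does not appear in the notebook: for one query, $R_q$ is the set of relevant documents, $n$ is the length of the ranked list, and $\mathrm{rel}_i$ is 1 if the document at rank $i$ is in $R_q$ and 0 otherwise (including ranks beyond the end of the list).

$$
\begin{aligned}
\mathrm{Precision@}k &= \frac{1}{k}\sum_{i=1}^{k} \mathrm{rel}_i \\
\mathrm{Recall@}k &= \frac{1}{|R_q|}\sum_{i=1}^{k} \mathrm{rel}_i \\
\mathrm{Hit@}k &= \mathbb{1}\!\left[\sum_{i=1}^{k} \mathrm{rel}_i > 0\right] \\
\mathrm{RR} &= \frac{1}{\min\{\, i : \mathrm{rel}_i = 1 \,\}} \quad (\text{0 if no relevant document is retrieved}) \\
\mathrm{DCG} &= \sum_{i=1}^{n} \frac{\mathrm{rel}_i}{\log_2(i+1)}, \qquad
\mathrm{IDCG} = \sum_{i=1}^{|R_q|} \frac{1}{\log_2(i+1)}, \qquad
\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}
\end{aligned}
$$

Each value in the results table is the mean of the corresponding per-query quantity over the three queries; MRR is the mean of $\mathrm{RR}$.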

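In this simulation, both models retrieve at least one relevant document in the top 3 for every query, and they tie on Precision@3 (0.667), Hit@3 (1.0), and MRR (0.833). **Model A** is slightly ahead of **Model B** on Recall@3 (0.722 vs. 0.694) and nDCG (0.872 vs. 0.866).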
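
Averages can hide where the two models actually differ. A per-query breakdown is one way to look closer; the sketch below recomputes Recall@3 and nDCG for each query using the same data and formulas as the notebook. The helper `per_query_breakdown` and the nested `predictions` dictionary are hypothetical names introduced for this example, and plain dictionaries are used instead of a DataFrame to keep it dependency-free.

```python
import math

# Ground truth and rankings copied from the notebook's first cell.
relevant_docs = {
    'q1': {'doc1', 'doc3', 'doc4'},
    'q2': {'doc2', 'doc5'},
    'q3': {'doc3', 'doc6', 'doc7', 'doc9'},
}
predictions = {
    'Model A': {
        'q1': ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'],
        'q2': ['doc2', 'doc5', 'doc6', 'doc1', 'doc3'],
        'q3': ['doc4', 'doc3', 'doc7', 'doc2', 'doc8', 'doc6', 'doc9'],
    },
    'Model B': {
        'q1': ['doc2', 'doc4', 'doc5', 'doc1', 'doc3'],
        'q2': ['doc5', 'doc6', 'doc2', 'doc3', 'doc1'],
        'q3': ['doc9', 'doc6', 'doc7', 'doc3', 'doc8', 'doc2'],
    },
}

def recall_at_k(predicted, relevant, k):
    """Fraction of the relevant documents found in the top-k predictions."""
    return sum(doc in relevant for doc in predicted[:k]) / len(relevant)

def ndcg(predicted, relevant):
    """Binary-relevance nDCG over the full ranked list, as in the notebook."""
    gain = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(predicted) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(len(relevant)))
    return gain / ideal if ideal > 0 else 0.0

def per_query_breakdown(k=3):
    """Return one row per (model, query) with Recall@k and nDCG, rounded to 3 places."""
    rows = []
    for model_name, preds in predictions.items():
        for q, rel in relevant_docs.items():
            rows.append({
                'Model': model_name,
                'Query': q,
                f'Recall@{k}': round(recall_at_k(preds[q], rel, k), 3),
                'nDCG': round(ndcg(preds[q], rel), 3),
            })
    return rows

breakdown = per_query_breakdown()
for row in breakdown:
    print(row)
```

On this data the breakdown shows that the averages mask opposite per-query results: Model A is stronger on q1 (Recall@3 of 0.667 vs. 0.333), while Model B is stronger on q3 (Recall@3 of 0.75 vs. 0.5, with a perfect nDCG of 1.0).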

---

This notebook is part of a series on evaluating LLMs. You can find the accompanying blog post and other notebooks in the repository.