# Evaluating RAG Systems

This notebook demonstrates how to evaluate the performance of a text search system using standard metrics such as **Hit Rate** and **Mean Reciprocal Rank (MRR)**. The evaluation is run on a set of questions generated from FAQs and indexed documents, using both **Elasticsearch** and **Minsearch**.

## 1. Purpose of the Evaluation

Evaluating the performance of retrieval systems is key to:
- Comparing different engines (for example, Elasticsearch vs. Minsearch).
- Tuning search parameters and configurations.
- Making decisions based on quantitative evidence rather than impressions alone.

## 2. Fundamentals: Ground Truth (Reference Data)

### What is it?

The ground truth is a dataset that links questions (queries) to their relevant documents. It is the basis for computing the evaluation metrics.

### Structure

Each record contains the following fields:
- `question`: a simulated user query.
- `document`: the ID of the relevant document.
- `course`: the context or category of the document.

### How is it generated?

- **Manually**: through human annotation.
- **Automatically**: generated with an LLM such as GPT-4.
- **From real behavior**: based on user clicks and searches.
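
To make the automatic option more concrete, here is a minimal sketch of the two steps around the LLM call: building a prompt from one FAQ document and turning the reply into ground-truth records. The prompt wording, the helper names `build_prompt` and `to_ground_truth_rows`, and the sample document are all assumptions made for illustration; the LLM call itself is replaced by a hard-coded reply.

```python
import json

# Hypothetical prompt; the wording is an assumption, not the one used for this dataset.
PROMPT_TEMPLATE = """
You emulate a student taking a course.
Write 5 questions this student might ask, based on the FAQ record below.
Each question should be complete and answerable from the record.

section: {section}
question: {question}
answer: {text}

Return only a JSON list of strings.
""".strip()


def build_prompt(doc):
    """Fill the template with the fields of one FAQ document."""
    return PROMPT_TEMPLATE.format(**doc)


def to_ground_truth_rows(doc, llm_response):
    """Turn the raw JSON reply for one document into ground-truth records."""
    questions = json.loads(llm_response)
    return [
        {"question": q, "course": doc["course"], "document": doc["id"]}
        for q in questions
    ]


doc = {
    "id": "doc-001",  # made-up ID for the example
    "course": "demo-course",
    "section": "General",
    "question": "Can I still join after the start date?",
    "text": "Yes, you can join at any time.",
}

prompt = build_prompt(doc)
# A real run would send `prompt` to an LLM; here the reply is hard-coded.
fake_response = '["Is late enrollment allowed?", "Can I sign up after the course begins?"]'
rows = to_ground_truth_rows(doc, fake_response)
print(rows[0])
```

Real replies are not always valid JSON, so a production version would likely need error handling around `json.loads`.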

### Best practices

- Assign each document a unique ID (for example, derived from an MD5 hash).
- Keep traceability back to the source content.
- Avoid inconsistencies between documents and their identifiers.
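
As an illustration of the first practice, the sketch below derives a short ID from an MD5 hash of a document's content. The choice of fields (`course`, `question`, and the first 10 characters of `text`), the 8-character truncation, and the helper name `generate_document_id` are assumptions made for this example, not something this notebook specifies.

```python
import hashlib


def generate_document_id(doc):
    """Return a short, deterministic ID for a document dict.

    Assumption: course + question + the start of the text is enough to
    tell documents apart. Adjust the fields for your own data.
    """
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    digest = hashlib.md5(combined.encode("utf-8")).hexdigest()
    return digest[:8]  # 8 hex characters; shorter IDs collide more easily


docs = [
    {"course": "demo-course", "question": "Can I still join?", "text": "Yes, you can join at any time."},
    {"course": "demo-course", "question": "When does it start?", "text": "The start date is announced by email."},
]

for doc in docs:
    doc["id"] = generate_document_id(doc)

# Truncated hashes can collide, so check uniqueness explicitly.
ids = [doc["id"] for doc in docs]
has_duplicates = len(ids) != len(set(ids))
print(ids, "duplicates found" if has_duplicates else "all unique")
```

Because the ID depends only on the content, re-running the script on unchanged documents yields the same IDs, which keeps the ground truth and the index in sync.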


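## 3. Key Evaluation Metrics

### Hit Rate (Recall@k)

Measures whether the relevant document appears among the top `k` results.
For a single query it is binary: 1 if it appears, 0 if it does not. The reported value is the average over all queries.

### MRR (Mean Reciprocal Rank)

Scores the position at which the relevant document appears in the ranking: a query contributes `1 / rank` if the document is found and 0 otherwise, averaged over all queries.
The higher the document ranks, the higher the score, so the metric rewards placing the right answer near the top.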
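
Written out for a set of queries $Q$, with $\mathrm{rank}_q$ denoting the position of the relevant document in the results for query $q$ (treated as larger than $k$ when the document is not retrieved), the two metrics can be expressed as follows. The notation is introduced here for illustration and does not appear in the notebook's code.

```latex
\mathrm{HitRate@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\mathrm{rank}_q \le k\right],
\qquad
\mathrm{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q}
\begin{cases}
\dfrac{1}{\mathrm{rank}_q} & \text{if } \mathrm{rank}_q \le k \\
0 & \text{otherwise}
\end{cases}
```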

### Other possible metrics

- MAP (Mean Average Precision)
- NDCG (Normalized Discounted Cumulative Gain)
- Precision@k
- ERR (Expected Reciprocal Rank), AUC-ROC, F1 score
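
Two of these metrics are easy to sketch on the same kind of boolean relevance list used in this notebook (one flag per returned result). The function names are hypothetical, and the NDCG variant below makes a simplifying assumption: the ideal ranking is built only from the results that were actually retrieved.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def ndcg_at_k(relevance, k):
    """Binary-relevance NDCG@k.

    Assumption: the ideal ranking is built from the retrieved results only,
    i.e. relevant documents that were never retrieved are ignored.
    """
    top = relevance[:k]
    dcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(top, start=1))
    ideal = sorted(top, reverse=True)
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


relevance = [False, False, True, False, False]  # relevant document at rank 3
print(precision_at_k(relevance, 5))  # 0.2
print(ndcg_at_k(relevance, 5))       # 0.5
```

With a single relevant document per query, as in this ground truth, Precision@k can never exceed `1 / k`, which is one reason Hit Rate and MRR are the more informative choices here.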

## 4. Tuning Search Parameters

Tuning search parameters can make a noticeable difference in retrieval quality. Some key aspects:

- Search type: BM25, dense (vector) search, or hybrid.
- Fields searched: body text, section, question, title, etc.
- Boosting: giving more weight to fields such as the question.
- Filters: applying a `filter` by course, category, or other metadata to improve precision and performance.
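
One simple way to explore such parameters is an exhaustive grid search over candidate values. The sketch below is only an illustration: `grid_search` and `toy_score` are hypothetical names, and the scorer is a stand-in that does not query any search engine.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one.

    param_grid maps a parameter name to its candidate values.
    score_fn takes a dict of parameters and returns a number (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")

    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score

    return best_params, best_score


# Stand-in scorer for the example. In practice this would run the
# ground-truth queries with the given boosts and return Hit Rate or MRR.
def toy_score(params):
    return -abs(params["question"] - 3.0) - abs(params["section"] - 0.5)


grid = {"question": [1.0, 2.0, 3.0], "section": [0.5, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

When tuning against a ground truth, it is safer to hold out part of the queries for the final measurement, since parameters selected on the full set tend to look better than they will on new queries.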

----

### 1. Loading the Data

In [2]:
import json

with open('data/documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

### 2. Elasticsearch Setup and Index Creation

Run the following command in a terminal to start Elasticsearch in Docker:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:9.0.3
```

In [6]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

### 3. Indexing the Documents

In [7]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

### 4. Elasticsearch Search Function

In [9]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

### 5. Search Example

In [10]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

### 6. Loading the Ground Truth

In [11]:
import pandas as pd

In [12]:
df_ground_truth = pd.read_csv('data/ground-truth-data.csv')

In [13]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [15]:
ground_truth[:5]

[{'question': 'What is the exact date and time for the commencement of the course?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I stay updated about course announcements?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What should I do before the course begins?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'Where can I find the link to register for the course?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'Is there a specific platform for live Office Hours?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]
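
Before running the evaluation, it can be worth confirming that every `document` ID in the ground truth exists in the indexed corpus, since a missing ID can never be retrieved and silently lowers both metrics. The sketch below uses a hypothetical helper, `find_missing_documents`, and made-up sample data so that it runs on its own.

```python
def find_missing_documents(ground_truth, documents):
    """Return the ground-truth document IDs that are absent from the corpus."""
    known_ids = {doc["id"] for doc in documents}
    return sorted({q["document"] for q in ground_truth} - known_ids)


# Tiny made-up data so the example runs on its own.
sample_documents = [{"id": "a1"}, {"id": "b2"}]
sample_ground_truth = [
    {"question": "q1", "course": "demo-course", "document": "a1"},
    {"question": "q2", "course": "demo-course", "document": "zz"},
]

print(find_missing_documents(sample_ground_truth, sample_documents))  # ['zz']
```

In the notebook, the same function could be called with the real `ground_truth` and `documents` lists; an empty result means the two are consistent.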

### 7. Searching with Elasticsearch

In [16]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4735 [00:00<?, ?it/s]

### 8. Defining the Metrics

In [18]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        # A query counts as a hit if any returned document is the relevant one
        if True in line:
            cnt += 1

    return cnt / len(relevance_total)

In [36]:
def hit_rate_v2(relevance_total):
    hits = [1 for line in relevance_total if True in line]
    
    return sum(hits) / len(relevance_total)    

In [52]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # Add the reciprocal rank of the first relevant result, if any
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break

    return total_score / len(relevance_total)

In [53]:
def mrr_v2(relevance_total):
    hits = [1 / (line.index(True) + 1) if True in line else 0 for line in relevance_total]

    return sum(hits) / len(relevance_total)
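
Since each relevance list is ordered by rank, the same data can also be scored at a smaller cutoff by truncating each list. The following sketch shows that idea with two hypothetical helpers, `hit_rate_at_k` and `mrr_at_k`, and a small made-up sample.

```python
def hit_rate_at_k(relevance_total, k):
    """Hit Rate when only the first k results of each query are considered."""
    return sum(1 for line in relevance_total if True in line[:k]) / len(relevance_total)


def mrr_at_k(relevance_total, k):
    """MRR when only the first k results of each query are considered."""
    total = 0.0
    for line in relevance_total:
        top = line[:k]
        if True in top:
            total += 1 / (top.index(True) + 1)
    return total / len(relevance_total)


sample = [
    [True, False, False],   # relevant at rank 1
    [False, False, True],   # relevant at rank 3
    [False, False, False],  # not retrieved
    [False, True, False],   # relevant at rank 2
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(sample, k), mrr_at_k(sample, k))
```

Hit Rate can only grow as `k` increases, while MRR grows more slowly because hits at lower positions contribute less.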

### 9. Manual Example

In [23]:
example = [
    [True, False, False, False, False],   # 1
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [False, False, True, False, False],   # 1/3
    [False, False, False, False, False],  # 0
]

# Reciprocal rank by position of the relevant document:
# rank 1 => 1
# rank 2 => 1 / 2 = 0.5
# rank 3 => 1 / 3 = 0.3333
# rank 4 => 1 / 4 = 0.25
# rank 5 => 1 / 5 = 0.2
# in general: rank => 1 / rank
# not found => 0

In [38]:
hit_rate(example)

0.5833333333333334

In [39]:
hit_rate_v2(example)

0.5833333333333334

In [54]:
mrr(example)

0.5277777777777778

In [55]:
mrr_v2(example)

0.5277777777777778
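
These values can be checked by hand from the `example` list: 7 of its 12 rows contain a `True`, six of them at rank 1 and one at rank 3.

```latex
\text{Hit Rate} = \frac{7}{12} \approx 0.5833,
\qquad
\text{MRR} = \frac{1}{12}\left(6 \cdot 1 + \frac{1}{3}\right) = \frac{19}{36} \approx 0.5278
```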

### 10. Elasticsearch Results

In [56]:
hit_rate(relevance_total), mrr(relevance_total)

(0.6329461457233369, 0.48975712777191127)

In [57]:
hit_rate_v2(relevance_total), mrr_v2(relevance_total)

(0.6329461457233369, 0.48975712777191127)

### 11. Implementation with Minsearch

In [59]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x7ff2cb80c590>

Search function

In [61]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

Search example

In [62]:
minsearch_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'cours

### 12. Searching with Minsearch

In [63]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4735 [00:00<?, ?it/s]

In [64]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7146779303062302, 0.5827138331573397)

In [65]:
hit_rate_v2(relevance_total), mrr_v2(relevance_total)

(0.7146779303062302, 0.5827138331573389)

### 13. Generalized Evaluation Function

In [69]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
        'hit_rate_v2': hit_rate_v2(relevance_total),
        'mrr_v2': mrr_v2(relevance_total),
    }

### 14. Comparing the Engines

In [70]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

  0%|          | 0/4735 [00:00<?, ?it/s]

{'hit_rate': 0.6329461457233369,
 'mrr': 0.48975712777191127,
 'hit_rate_v2': 0.6329461457233369,
 'mrr_v2': 0.48975712777191127}

In [71]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

  0%|          | 0/4735 [00:00<?, ?it/s]

{'hit_rate': 0.7146779303062302,
 'mrr': 0.5827138331573397,
 'hit_rate_v2': 0.7146779303062302,
 'mrr_v2': 0.5827138331573389}
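
To put the two runs side by side, the short sketch below computes the absolute and relative difference for each metric, using the scores reported above. The variable names are chosen for this example only.

```python
# Scores copied from the two evaluation runs above.
elastic = {"hit_rate": 0.6329461457233369, "mrr": 0.48975712777191127}
minsearch_scores = {"hit_rate": 0.7146779303062302, "mrr": 0.5827138331573397}

improvements = {}
for metric in ("hit_rate", "mrr"):
    absolute = minsearch_scores[metric] - elastic[metric]
    improvements[metric] = absolute / elastic[metric]
    print(f"{metric}: +{absolute:.4f} absolute, +{improvements[metric]:.1%} relative")
```

Note that the two search functions are not configured identically (for instance, the Minsearch function down-weights `section` to 0.5 while the Elasticsearch query leaves it at its default weight), so the gap reflects these particular configurations and not only the engines themselves.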

### Final Notes

- The evaluation process is systematic, repeatable, and scalable.
- It relies on key metrics that reflect the user experience.
- It allows search engines and configurations to be compared objectively.
- New retrieval systems can be plugged in easily while keeping the same evaluation pipeline.