# Evaluating Retrieval

We'll evaluate different search approaches using our synthetic dataset. We'll compare keyword-based search, vector search, and hybrid approaches to see which performs best.

This systematic evaluation will help us understand which search methods work better for different types of queries and use cases.

## Setting Up the Search System

First, let's set up our document index and search function:

In [None]:
import docs

github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)
chunks = docs.chunk_documents(parsed_data)

Unlike the previous lesson, here we chunk the documents into smaller pieces. This allows us to match specific sections rather than entire documents.

Let's index the chunked documents:

In [None]:
from minsearch import Index
from typing import List, TypedDict

index = Index(
    text_fields=["content", "filename", "title", "description"],
)

index.fit(chunks)

Now we define our baseline search function using keyword-based search:

In [None]:
class SearchResult(TypedDict):
    """Represents a single search result entry."""
    start: int
    content: str
    title: str
    description: str
    filename: str


def search(query: str) -> List[SearchResult]:
    """
    Search the index for documents matching the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """
    return index.search(
        query=query,
        num_results=5,
    )


## Loading Ground Truth Data

We need to load the synthetic dataset we created earlier. This will serve as our evaluation benchmark:

In [None]:
import pandas as pd

df_ground_truth = pd.read_csv('ground_truth_evidently.csv')
ground_truth = df_ground_truth.to_dict(orient='records')

The ground truth data contains questions paired with their expected source documents, which allows us to measure search accuracy.

## Collecting Search Results for Evaluation

Now let's run our search function against all questions in the ground truth dataset:

In [None]:
from tqdm.auto import tqdm

all_search_results = []

for gt_rec in tqdm(ground_truth):
    sr = search(gt_rec['question'])
    filename = gt_rec['filename']
    relevance = [filename == sr_rec['filename'] for sr_rec in sr]
    all_search_results.append(relevance)

This creates one list of relevance flags per query. Each entry is True if the search result at that position comes from the expected file, and False otherwise.

This is how the relevance data looks:

In [None]:
[[True, True, True, True, True],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, True],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]

## Evaluation Metrics

We will implement two search evaluation metrics:

- **Hit Rate**: The percentage of queries where at least one relevant document appears in the top results.

- **Mean Reciprocal Rank (MRR)**: The average of reciprocal ranks of the first relevant document. It rewards finding relevant documents early in the result list.
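
Written as formulas, the two definitions can be sketched as follows. The notation is introduced here for illustration only: $Q$ is the set of evaluation queries, $k$ is the number of results returned per query (5 for the `search` function above), and $\mathrm{rank}_q$ is the position of the first relevant result for query $q$.

$$
\text{HitRate@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\mathrm{rank}_q \le k\right]
\qquad
\text{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q}
\begin{cases}
\dfrac{1}{\mathrm{rank}_q} & \text{if } \mathrm{rank}_q \le k \\
0 & \text{if no relevant result is in the top } k
\end{cases}
$$

Both metrics fall between 0 and 1, and MRR can never exceed the hit rate, because each query contributes at most 1 to MRR and only when it also counts as a hit.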

Here's how to calculate them with an example.

With hit rate, we only care whether the document with the right ID appears anywhere in the results:

In [None]:
example = [
    [True, False, False, False, False],  # 1
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [True, False, False, False, False],  # 1
    [False, True, False, False, False],  # 1
    [False, False, True, False, False],  # 1
    [False, False, False, True, False],  # 1
    [True, False, False, False, False],  # 1
    [False, False, True, False, False],  # 1
    [False, False, False, False, False], # 0
]

The hit rate here is 7/12 (about 0.58): 7 of the 12 queries have at least one relevant result.

Let's implement it:

In [None]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

For MRR, we also look at the position:

- if the first relevant result is at position 1, the score is 1
- position 2 => 1/2
- position 3 => 1/3
- position 4 => 1/4
- position 5 => 1/5
- no relevant result => 0

In [None]:
example = [
    # 1      2       3      4      5
    [True, False, False, False, False],  # 1
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0
    [True, False, False, False, False],  # 1
    [False, True, False, False, False],  # 1/2
    [False, False, True, False, False],  # 1/3
    [False, False, False, True, False],  # 1/4
    [True, False, False, False, False],  # 1
    [False, False, True, False, False],  # 1/3
    [False, False, False, False, False], # 0
]


Let's implement it:

In [None]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # Ranks are 1-based; only the first relevant result counts
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break

    return total_score / len(relevance_total)
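
As a quick sanity check, the 12-row example can be run through both metrics. The sketch below restates the two functions in compact form (same logic, just shorter) so that it runs on its own.

```python
example = [
    [True, False, False, False, False],   # first hit at rank 1
    [False, False, False, False, False],  # no hit
    [False, False, False, False, False],  # no hit
    [False, False, False, False, False],  # no hit
    [False, False, False, False, False],  # no hit
    [True, False, False, False, False],   # rank 1
    [False, True, False, False, False],   # rank 2
    [False, False, True, False, False],   # rank 3
    [False, False, False, True, False],   # rank 4
    [True, False, False, False, False],   # rank 1
    [False, False, True, False, False],   # rank 3
    [False, False, False, False, False],  # no hit
]


def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr(relevance_total):
    total_score = 0.0
    for line in relevance_total:
        # Add 1/rank for the first relevant result, nothing if there is none
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break
    return total_score / len(relevance_total)


print(hit_rate(example))
print(mrr(example))
```

The hit rate is 7/12, roughly 0.58. The reciprocal ranks sum to 1 + 1 + 1/2 + 1/3 + 1/4 + 1 + 1/3 = 53/12, so the MRR is 53/144, roughly 0.37.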

We can put them together:

In [None]:
def evaluate(
        ground_truth,
        search_function,
        question_column='question',
        id_column='filename'
):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q[id_column]
        results = search_function(q[question_column])
        relevance = [d[id_column] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

## Baseline Performance Evaluation

Let's evaluate our keyword-based search to establish a baseline:

In [None]:
evaluate(ground_truth, search)

The baseline performance shows:

In [None]:
{'hit_rate': 0.4386694386694387, 'mrr': 0.3626472626472625}

This means our keyword search returns a relevant document in the top 5 for only about 44% of queries, with an MRR of about 0.36. This isn't great, but it gives us a starting point. The synthetic data we generated is quite challenging, which makes it a good benchmark.

## Vector Search Implementation

Keyword search struggles with semantic similarity. Let's try vector search, which can understand the meaning behind queries.
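
To make the idea concrete, here is a minimal sketch of what a vector index does conceptually: it ranks documents by how close their embeddings are to the query embedding. The sketch assumes cosine similarity as the closeness measure and uses hand-made 3-dimensional toy vectors. The helper names `cosine` and `top_k` are hypothetical, and nothing here claims to mirror the internals of any particular library.

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    # Indices of the k documents most similar to the query, best first
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings: documents 0 and 1 point in nearly the same direction,
# document 2 is orthogonal to the query
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

top = top_k(toy_query, toy_docs)
print(top)
```

Real embeddings have hundreds of dimensions and come from a trained model, but the ranking step follows the same pattern: texts with similar meaning get nearby vectors even when they share no keywords.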

We need to add the `sentence-transformers` package:

```bash
uv add sentence-transformers
```

First, we need to set up the embedding model:

In [None]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

Create embeddings for all document chunks:

In [None]:
import numpy as np
from tqdm.auto import tqdm

embeddings = []

for d in tqdm(chunks):
    text = d.get('title', '') + ' ' + d.get('description', '') + ' ' + d.get('content', '')
    text = text.strip()
    v = embedding_model.encode(text)
    embeddings.append(v)

embeddings = np.array(embeddings)

And index them with vector search:

In [None]:
from minsearch import VectorSearch

vindex = VectorSearch()
vindex.fit(embeddings, chunks)

Define the vector search function:

In [None]:
def v_search(query: str) -> List[SearchResult]:
    """
    Search the vector index for chunks semantically similar to the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """

    q = embedding_model.encode(query)

    return vindex.search(
        q,
        num_results=5,
    )


## Vector Search Evaluation

Let's see how vector search performs:

In [None]:
evaluate(ground_truth, v_search)

The results show significant improvement:

In [None]:
{'hit_rate': 0.762993762993763, 'mrr': 0.5922383922383921}

Vector search achieves a hit rate of about 76% and an MRR of about 0.59, well above the keyword baseline (44% and 0.36). On this dataset, matching by meaning works much better than matching exact keywords.

## Hybrid Search Approach

Can we get even better results by combining both approaches? Let's try a hybrid search that uses both vector and keyword search:

In [None]:
def h_search(query: str) -> List[SearchResult]:
    """
    Combine vector and keyword search results for the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - title (str): The title of the matched document.
            - description (str): A short description of the document.
            - filename (str): The path or name of the source file.
    """

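    # Vector results come first, then keyword results:
    # up to 10 entries in total, possibly with duplicates
    return v_search(query) + search(query)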


## Hybrid Search Evaluation

Let's evaluate the hybrid approach:

In [None]:
evaluate(ground_truth, h_search)

The hybrid search shows the best performance:

In [None]:
{'hit_rate': 0.8045738045738046, 'mrr': 0.5981569481569481}

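With an 80% hit rate, the hybrid approach finds relevant documents for 4 out of 5 queries. Keep in mind that `h_search` returns up to 10 results (5 from each method), so its hit rate is measured over a longer list than the other two. The MRR improved only slightly: the first five positions are the same vector results as before, so the gain comes entirely from keyword hits at positions 6 to 10.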
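
Because the hybrid function simply concatenates two lists, the same chunk can appear twice. A minimal sketch of one way to merge the lists without duplicates, and to cut the result back to a fixed size for a fairer comparison, is shown below. It assumes that the pair (`filename`, `start`) identifies a chunk. The helper `merge_results` and the toy records are hypothetical and not part of the lesson's code.

```python
from typing import Any, Dict, List, Set, Tuple


def merge_results(*result_lists: List[Dict[str, Any]], limit: int = 5) -> List[Dict[str, Any]]:
    # Keep the first occurrence of each chunk, preserving the input order
    seen: Set[Tuple[str, int]] = set()
    merged: List[Dict[str, Any]] = []

    for results in result_lists:
        for record in results:
            key = (record["filename"], record["start"])  # assumed chunk identity
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)

    return merged[:limit]


# Toy records standing in for the output of the two search functions
vector_results = [
    {"filename": "a.md", "start": 0},
    {"filename": "b.md", "start": 0},
]
keyword_results = [
    {"filename": "b.md", "start": 0},     # duplicate of a vector result
    {"filename": "c.md", "start": 1000},
]

merged = merge_results(vector_results, keyword_results)
print([r["filename"] for r in merged])
```

With such a helper, a hybrid function could return `merge_results(v_search(query), search(query))` instead of the plain concatenation. Any metrics for that variant would need to be measured separately; the numbers above apply only to the concatenating version.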

## Summary

Our systematic evaluation reveals clear performance differences between search approaches:

- Keyword Search: 44% hit rate, 0.36 MRR
- Vector Search: 76% hit rate, 0.59 MRR
- Hybrid Search: 80% hit rate, 0.60 MRR (measured over up to 10 results instead of 5)

The synthetic dataset is very useful for this evaluation. It provided a large set of queries that allowed us to quantify search quality objectively and select the best search method for our search system.

We can also use the same dataset to evaluate our agent.