# 3.3. Retrieval evaluation: ranking evaluation for text search

### Notebook Summary

#### Purpose:
The notebook evaluates a **text-based (keyword) search system** built on **Elasticsearch** over course-related Q&A data, then repeats the same evaluation with the lightweight `minsearch` library. It measures how well standard text matching retrieves the relevant document for each question.

#### Main Steps:

- **Data Loading and Index Setup**:
  - Loads Q&A data and configures an Elasticsearch index with fields like `question`, `text`, and `section`.

- **Indexing**:
  - Stores each document in Elasticsearch, preparing it for text-based retrieval.

- **Text-Based Search**:
  - Defines `elastic_search()`, a function that uses Elasticsearch's `multi_match` query to search across multiple text fields (`question`, `text`, `section`), filters by `course`, and returns the top 5 hits.
  - Adjusts search relevance by boosting the `question` field (`question^3`) to prioritize direct question matches.

- **Ground Truth Evaluation**:
  - Loads ground truth data to validate search effectiveness.
  - For each query, checks whether the correct document (the `document` value in the ground truth, matched against each result's `id`) appears in the top 5 search results, and stores the per-result relevance flags.

#### Key Metrics:
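- **Hit Rate**: The fraction of queries for which the correct document appears anywhere in the top 5 results.
- **MRR (Mean Reciprocal Rank)**: The average over all queries of 1/rank of the correct document (counted as 0 when it is not retrieved), so it rewards placing the correct document earlier in the list.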
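
Written out as formulas, the two metrics look as follows. The notation is an assumption introduced here for illustration: $Q$ is the set of ground-truth queries, $k$ is the number of results returned (5 in this notebook), and $\text{rank}_q$ is the 1-based position of the correct document for query $q$.

$$
\text{HitRate@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\text{rank}_q \le k\right],
\qquad
\text{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q}
\begin{cases}
1/\text{rank}_q & \text{if } \text{rank}_q \le k \\
0 & \text{otherwise}
\end{cases}
$$

Both values lie between 0 and 1, and MRR can never exceed the hit rate, because each retrieved document contributes at most 1 to the sum.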

#### Conclusions:
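- **Field Prioritization**: Boosting specific fields, such as `question`, is meant to improve relevance when the query closely matches the wording of a stored question.
- **Text Search Limitations**: On the 4,631 ground-truth queries, Elasticsearch reaches a hit rate of about 0.75 and an MRR of about 0.61, while minsearch reaches about 0.79 and 0.67. Text matching is effective but can miss relevant documents that are worded differently from the query, which is where vector-based semantic search may do better.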

In [33]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

In [3]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [4]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

In [6]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [7]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp',
  'id': 'a482086d'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud ac

In [8]:
import pandas as pd

In [9]:
df_ground_truth = pd.read_csv('ground-truth-data.csv')

In [10]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [14]:
# Initialize an empty list to store the relevance results for each query
relevance_total = []

# Loop through each query in the ground truth data using tqdm to show progress
for q in tqdm(ground_truth):
    # Extract the 'document' field from the current query, which is the correct document (ground truth)
    doc_id = q['document']
    
    # Run an Elasticsearch query using the 'question' and 'course' fields from the current query
    results = elastic_search(query=q['question'], course=q['course'])
    
    # Create a list of boolean values by comparing each document's 'id' from the search results
    # with the correct document 'doc_id'. 'True' if the document matches, 'False' otherwise.
    relevance = [d['id'] == doc_id for d in results]
    
    # Append the relevance list (e.g., [True, False, False, ...]) to the relevance_total list
    relevance_total.append(relevance)

  0%|          | 0/4631 [00:00<?, ?it/s]

In [13]:
relevance

[True, False, False, False, False]

In [15]:
relevance_total

[[True, False, False, False, False],
 [True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, True, False, False],
 [False, False, False, False, False],
 [True, False, False, False, False],
 [False, False, False, False, True],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, True, False, False, False],
 [False, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [False, False, False, False, F

In [18]:
def hit_rate(relevance_total):
    # Initialize a counter to keep track of how many queries have at least one relevant document
    cnt = 0

    # Iterate over each relevance list in relevance_total
    for line in relevance_total:
        # Check if there is at least one 'True' in the relevance list, meaning the correct document was found
        if True in line:
            # Increment the counter if the correct document was found for this query
            cnt = cnt + 1

    # Return the hit rate, which is the ratio of queries that had at least one correct document
    return cnt / len(relevance_total)


The following function `mrr` calculates the Mean Reciprocal Rank (MRR): for each query it takes the reciprocal of the rank of the first correct document in the search results (0 if it is absent), then averages over all queries:

In [17]:
def mrr(relevance_total):
    # Initialize total_score to accumulate the Mean Reciprocal Rank (MRR) score
    total_score = 0.0

    # Loop over each relevance list in relevance_total (each query's results)
    for line in relevance_total:
        # Loop through the ranks (positions) in the current relevance list
        for rank in range(len(line)):
            # Check if the document at this rank is relevant (True means it's the correct document)
            if line[rank]:
                # Add the reciprocal rank to the total score (rank + 1 because rank is zero-indexed)
                total_score = total_score + 1 / (rank + 1)
                # Stop after the first relevant document is found for this query (since MRR only cares about the first correct result)
                break

    # Return the mean of the reciprocal ranks by dividing the total_score by the number of queries
    return total_score / len(relevance_total)


In [20]:
example = [
    [True, False, False, False, False], # 1
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [False, False, False, False, False], # 0
    [False, False, False, False, False], # 0 
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1
    [True, False, False, False, False], # 1 
    [False, False, True, False, False],  # 1/3
    [False, False, False, False, False], # 0
]


In [21]:
hit_rate(example)

0.5833333333333334

In [22]:
mrr(example)

0.5277777777777778
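
The two numbers above can be checked by hand: of the 12 example queries, six find the correct document at rank 1, one finds it at rank 3, and five miss it. The sketch below redoes that arithmetic with exact fractions. The list `first_hit_ranks` is a hypothetical re-encoding of `example` introduced only for this check (`None` marks a miss).

```python
from fractions import Fraction

# 1-based rank of the correct document for each of the 12 example queries;
# None means it was not in the top 5 (hypothetical re-encoding of `example`).
first_hit_ranks = [1, None, None, None, None, 1, 1, 1, 1, 1, 3, None]

hits = [rank for rank in first_hit_ranks if rank is not None]

# Hit rate: queries with a hit divided by all queries.
hit_rate_exact = Fraction(len(hits), len(first_hit_ranks))

# MRR: sum of reciprocal ranks (misses contribute 0) divided by all queries.
mrr_exact = sum(Fraction(1, rank) for rank in hits) / len(first_hit_ranks)

print(hit_rate_exact, float(hit_rate_exact))  # exact value: 7/12
print(mrr_exact, float(mrr_exact))            # exact value: 19/36
```

So the hit rate is 7/12 and the MRR is (6 + 1/3) / 12 = 19/36, which agree with the floating-point outputs of `hit_rate(example)` and `mrr(example)`.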

In [23]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7499460159792701, 0.6056683221766367)

## Repeating the evaluation with minsearch

In [25]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-10-21 11:55:37--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-10-21 11:55:37 (61.3 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [26]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.Index at 0x7857928f7590>

In [27]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [28]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4631 [00:00<?, ?it/s]

In [29]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7905419995681279, 0.6709062117613196)

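For comparison, Elasticsearch scored `(0.7499460159792701, 0.6056683221766367)` (hit rate, MRR) on the same ground truth, so minsearch comes out ahead on both metrics in this setup.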
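
To put the gap in numbers, the short sketch below computes the absolute and relative differences from the scores reported in this notebook. The dictionary layout and variable names are illustrative assumptions. Note that the two setups are not identical: the minsearch call also boosts `section` by 0.5, and the two engines score matches differently, so the gap reflects the whole configuration rather than the engine alone.

```python
# Scores copied from the two evaluation runs in this notebook.
results = {
    "elasticsearch": {"hit_rate": 0.7499460159792701, "mrr": 0.6056683221766367},
    "minsearch": {"hit_rate": 0.7905419995681279, "mrr": 0.6709062117613196},
}

diffs = {}
for metric in ("hit_rate", "mrr"):
    es_score = results["elasticsearch"][metric]
    ms_score = results["minsearch"][metric]
    abs_diff = ms_score - es_score   # difference in metric points
    rel_diff = abs_diff / es_score   # difference relative to Elasticsearch
    diffs[metric] = {"abs": abs_diff, "rel": rel_diff}
    print(f"{metric}: {es_score:.4f} -> {ms_score:.4f} "
          f"({abs_diff:+.4f}, {rel_diff:+.1%})")
```

The differences come to roughly 0.04 in hit rate and 0.065 in MRR in favor of minsearch.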

## Making the evaluation generic so it can be reused with any search function

In [30]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [31]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.7499460159792701, 'mrr': 0.6056683221766367}

In [32]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.7905419995681279, 'mrr': 0.6709062117613196}
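
The contract that `evaluate` relies on is small: `search_function` receives one ground-truth record (a dict with `question`, `course`, and `document`) and returns a ranked list of dicts that each carry an `id`. The sketch below illustrates that contract without Elasticsearch or minsearch. Everything prefixed with `toy_` is hypothetical data and code invented for this illustration, `tqdm` is left out, and `hit_rate` and `mrr` are compact rewrites that are meant to behave like the versions above.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result.
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr(relevance_total):
    # Mean of 1/rank of the first relevant result (0 when there is none).
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


def evaluate(ground_truth, search_function):
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["document"] for d in results])
    return {"hit_rate": hit_rate(relevance_total), "mrr": mrr(relevance_total)}


# Hypothetical toy corpus and ground truth, not taken from the course data.
toy_docs = [
    {"id": "a1", "course": "demo", "question": "Can I still join the course?"},
    {"id": "b2", "course": "demo", "question": "Where do I submit homework?"},
    {"id": "c3", "course": "demo", "question": "Is there a certificate?"},
]
toy_ground_truth = [
    {"question": "Can I join late?", "course": "demo", "document": "a1"},
    {"question": "Can I still get a certificate?", "course": "demo", "document": "c3"},
    {"question": "Where do I find the diploma?", "course": "demo", "document": "c3"},
]


def toy_words(text):
    return set(text.lower().replace("?", "").split())


def toy_search(q, num_results=2):
    # Rank documents of the same course by word overlap with the query.
    query_words = toy_words(q["question"])
    candidates = [d for d in toy_docs if d["course"] == q["course"]]
    ranked = sorted(
        candidates,
        key=lambda d: len(query_words & toy_words(d["question"])),
        reverse=True,
    )
    return ranked[:num_results]


scores = evaluate(toy_ground_truth, toy_search)
print(scores)
```

With this toy data the first query hits at rank 1, the second at rank 2, and the third misses, giving a hit rate of 2/3 and an MRR of 0.5. Any search backend can be plugged in the same way as long as it returns dicts with an `id`.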