
### Purpose of notebook:
The notebook sets up and evaluates a vector-based semantic search system using **Elasticsearch** for a Q&A dataset related to various courses. It explores different methods for creating vector embeddings for document retrieval and assesses the effectiveness of each approach.

### Main Steps:

- **Data Loading**: 
  - Loads course-related questions and answers from a JSON file to use in semantic search.

- **Model Setup**: 
  - Installs and initializes a `SentenceTransformer` model (`multi-qa-MiniLM-L6-cos-v1`) that generates vector embeddings representing the semantic meaning of text.

- **Elasticsearch Index Creation**: 
  - Defines an Elasticsearch index with three distinct **dense vector fields** to store embeddings:
    - **`question_vector`**: Embedding generated solely from the question text in each document.
    - **`text_vector`**: Embedding generated solely from the main text content of each document.
    - **`question_text_vector`**: Combined embedding generated from concatenating the question and main text, capturing the document’s overall meaning.
  - Each vector field supports **cosine similarity** for efficient similarity-based searches.

- **Vector Encoding**: 
  - For each document in the dataset:
    - The model encodes **`question`**, **`text`**, and **`question + text`** separately, generating embeddings for each respective vector field in the Elasticsearch index.

- **K-Nearest Neighbors (KNN) Search Functions**:
  - Implements separate KNN search functions using each vector embedding to retrieve the most similar documents based on different aspects of the text.
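    - **Single Field Searches** (each encodes the query question and compares it against one field):
      - **`question_vector_knn`**: Searches only on the `question_vector` field, comparing the query against embeddings of the stored questions.
      - **`text_vector_knn`**: Searches only on the `text_vector` field, comparing the query against embeddings of the main content.
      - **`question_text_vector_knn`**: Searches only on the `question_text_vector` field, comparing the query against embeddings of the combined question and text.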
    - **Combined Search**:
      - **`elastic_search_knn_combined`**: Uses a custom scoring script in Elasticsearch to aggregate similarity scores across `question_vector`, `text_vector`, and `question_text_vector`. This approach integrates information from the question, text, and combined context, potentially enhancing retrieval accuracy.

- **Evaluation Functions**:
  - Evaluates each search approach using:
    - **Hit Rate**: Measures the percentage of queries where at least one relevant document is retrieved.
    - **Mean Reciprocal Rank (MRR)**: Evaluates how early the first relevant document appears in the search results, with higher scores indicating more accurate and immediate relevance.

### Main Conclusions:

- **Effectiveness of Different Embeddings**:
  - **Single Embeddings**: Using `question_vector` or `text_vector` alone provides search results that focus on a specific aspect of the document (question vs. main text), potentially missing some context when the document's meaning relies on both.
  - **Combined Embedding (`question_text_vector`)**: Including both question and text content creates a broader context, which can improve accuracy by capturing the document's overall meaning.
  
- **Combined Vector Scoring**:
  - The **combined search function** (`elastic_search_knn_combined`) integrates scores from all three embeddings, allowing the search to consider multiple aspects of document relevance (question meaning, content context, and combined). This approach can improve performance by leveraging the strengths of each vector type, particularly in complex queries that benefit from a holistic view of the document.

- **Performance Insights**:
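  - On the 4,631 ground-truth queries, the combined embedding `question_text_vector` performed best (Hit Rate 0.933, MRR 0.843), closely followed by the combined scoring function `elastic_search_knn_combined` (Hit Rate 0.921, MRR 0.825). Both clearly outperformed the single-field searches on `text_vector` (Hit Rate 0.844, MRR 0.724) and `question_vector` (Hit Rate 0.790, MRR 0.679).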


In [34]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

In [2]:
# Install Sentence Transformers library to work with pre-trained transformer models
# !pip install sentence_transformers
from sentence_transformers import SentenceTransformer



In [3]:
# Load a pre-trained sentence transformer model designed for semantic similarity and QA tasks
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)

In [4]:
# Encode a query sentence into a dense vector representation (this vector will be used for search)
v = model.encode('I just discovered the course. Can I still join?')

In [5]:
len(v)

384

In [11]:
# Import Elasticsearch Python client to interact with the Elasticsearch server
from elasticsearch import Elasticsearch

# Create an Elasticsearch client that connects to a locally running Elasticsearch instance
es_client = Elasticsearch('http://localhost:9200') 

# Define the settings and mappings for the Elasticsearch index
index_settings = {
    "settings": {
        "number_of_shards": 1,  # Number of primary shards for the index
        "number_of_replicas": 0  # No replicas (for this example, faster writes)
    },
    "mappings": {
        "properties": {
            # Define field types in Elasticsearch for course-related data
            "text": {"type": "text"},  # Text data (course content)
            "section": {"type": "text"},  # Section of the course
            "question": {"type": "text"},  # Question related to the course
            "course": {"type": "keyword"},  # Course name, stored as a keyword
            "id": {"type": "keyword"},  # Unique identifier for each document
            "question_vector": {
                # Vector representing the encoded question (used for semantic search)
                "type": "dense_vector",
                "dims": 384,  # Dimension of the vector from the SentenceTransformer model
                "index": True,  # Index this vector for similarity search
                "similarity": "cosine"  # Use cosine similarity for matching vectors
            },
            "text_vector": {
                # Vector for the course text content
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "question_text_vector": {
                # Combined vector for both question and text (question + text)
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

# The name of the index in Elasticsearch where the documents will be stored
index_name = "course-questions"

# Delete the index if it already exists to start fresh
es_client.indices.delete(index=index_name, ignore_unavailable=True)

# Create a new index with the specified settings and mappings
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
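
The mapping above sets `"similarity": "cosine"` on every vector field. As a rough illustration of what that means, the sketch below computes cosine similarity in plain Python on made-up 3-dimensional vectors (the notebook's vectors have 384 dimensions). The helper names `cosine_similarity` and `knn_score` are illustrative and not part of the notebook; `knn_score` follows the documented Elasticsearch convention that a kNN hit on a cosine field is scored as `(1 + cosine) / 2`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_score(a, b):
    """Map cosine similarity from [-1, 1] to the [0, 1] range used for the kNN _score."""
    return (1 + cosine_similarity(a, b)) / 2

query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # parallel to the query, different length
orthogonal = [0.0, 3.0, 0.0]       # unrelated direction
opposite = [-1.0, 0.0, 0.0]        # pointing the other way

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(knn_score(query, opposite))                # 0.0
```

Vector length does not affect the result, only direction does, which is why the model's embeddings can be compared without extra normalization.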

In [12]:
from tqdm.auto import tqdm

# Loop over each document in the loaded dataset
for doc in tqdm(documents):
    question = doc['question']  # Extract the question part of the document
    text = doc['text']  # Extract the text part of the document
    qt = question + ' ' + text  # Combine the question and text for joint encoding

    # Encode the question, text, and combined question+text into vectors
    doc['question_vector'] = model.encode(question)
    doc['text_vector'] = model.encode(text)
    doc['question_text_vector'] = model.encode(qt)

  0%|          | 0/948 [00:00<?, ?it/s]
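
The loop above calls `model.encode` three times per document. Since `SentenceTransformer.encode` also accepts a list of strings, the same vectors could be produced with three batched calls instead. The sketch below is one possible way to do that; `add_vectors` and `fake_encode` are hypothetical names, and `fake_encode` is only a stand-in so the example runs without downloading a model.

```python
def add_vectors(documents, encode):
    """Attach the three embeddings to each document using three batched encode calls.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors in the same order (for example, model.encode).
    """
    questions = [doc['question'] for doc in documents]
    texts = [doc['text'] for doc in documents]
    combined = [q + ' ' + t for q, t in zip(questions, texts)]

    fields = {
        'question_vector': encode(questions),
        'text_vector': encode(texts),
        'question_text_vector': encode(combined),
    }
    for name, vectors in fields.items():
        for doc, vector in zip(documents, vectors):
            doc[name] = vector
    return documents


# Stand-in encoder (assumption, for illustration only): maps each string
# to a 2-dimensional "vector" of character count and word count.
def fake_encode(sentences):
    return [[float(len(s)), float(len(s.split()))] for s in sentences]

docs = [
    {'question': 'Can I still join?', 'text': 'Yes, you can.'},
    {'question': 'When does it start?', 'text': 'In January.'},
]
add_vectors(docs, fake_encode)
print(docs[0]['question_vector'])
```

With the notebook's model, the call would be `add_vectors(documents, model.encode)`; whether this is noticeably faster depends on the hardware and batch size.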

In [13]:
# Index (store) each document with its vectors in the Elasticsearch index
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)


  0%|          | 0/948 [00:00<?, ?it/s]
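
Indexing one document per request works for 948 documents, but each call is a separate HTTP round trip. A possible alternative is the `bulk` helper from `elasticsearch.helpers`, which takes an iterable of actions. The sketch below only builds those actions so that it runs without a cluster; `bulk_actions` is a hypothetical helper, and the sample document is made up.

```python
def bulk_actions(documents, index_name):
    """Yield one bulk 'index' action per document."""
    for doc in documents:
        source = {
            # Convert NumPy arrays (anything with .tolist()) to plain lists
            key: value.tolist() if hasattr(value, 'tolist') else value
            for key, value in doc.items()
        }
        yield {'_index': index_name, '_source': source}


sample = [{'id': 'a1', 'question': 'Can I still join?', 'question_vector': [0.1, 0.2]}]
actions = list(bulk_actions(sample, 'course-questions'))
print(actions[0]['_index'])  # course-questions

# Against a running cluster (not executed here):
# from elasticsearch.helpers import bulk
# bulk(es_client, bulk_actions(documents, index_name))
```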

In [14]:
# Define a new query to search for in the documents
query = 'I just discovered the course. Can I still join it?'

# Encode the query into a dense vector representation using the same model
v_q = model.encode(query)

In [15]:
def elastic_search_knn(field, vector, course):
    """This function performs a k-nearest neighbors (KNN) search in Elasticsearch, using a pre-encoded vector to find similar documents."""
    knn = {
        "field": field,  # The field in Elasticsearch where the dense vector is stored
        "query_vector": vector,  # The vector we are using to search for similar vectors
        "k": 5,  # We want the 5 nearest neighbors (top 5 results)
        "num_candidates": 10000,  # Elasticsearch will scan up to 10,000 candidates to find the top k results
        "filter": {
            "term": {
                "course": course  # A filter to only search within a specific course
            }
        }
    }

    search_query = {
        "knn": knn,  # The actual KNN query to Elasticsearch
        "_source": ["text", "section", "question", "course", "id"]  # Fields we want to return from the search results
    }

    es_results = es_client.search(
        index=index_name,  # The name of the Elasticsearch index to search in
        body=search_query  # The query body that defines the KNN search and filters
    )
    
    result_docs = []
    
    # Loop over the search results and extract the source fields (the document data)
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs  # Return the list of matched documents
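
Because the vector fields are indexed, Elasticsearch answers this query with an approximate nearest-neighbor search, and the course filter is applied during that search so the `k` results all come from the requested course. To make the intended semantics concrete, the sketch below does the same thing exactly and by brute force in plain Python. The function `knn_exact`, the toy documents, and the 2-dimensional vectors are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_exact(docs, field, query_vector, course, k=5):
    """Exact kNN: filter by course, score every remaining document, keep the top k."""
    candidates = [d for d in docs if d['course'] == course]
    ranked = sorted(candidates, key=lambda d: cosine(query_vector, d[field]), reverse=True)
    return ranked[:k]

docs = [
    {'id': 'a', 'course': 'ml', 'question_vector': [1.0, 0.0]},
    {'id': 'b', 'course': 'ml', 'question_vector': [0.6, 0.8]},
    {'id': 'c', 'course': 'de', 'question_vector': [1.0, 0.0]},  # other course, filtered out
    {'id': 'd', 'course': 'ml', 'question_vector': [0.0, 1.0]},
]
top = knn_exact(docs, 'question_vector', [1.0, 0.1], 'ml', k=2)
print([d['id'] for d in top])  # ['a', 'b']
```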


In [16]:
def question_vector_knn(q):
    """This function takes a dictionary containing a question and course, encodes the question into a vector, 
    and then uses elastic_search_knn() to find similar questions in the index."""
    question = q['question']  # Extract the question text
    course = q['course']  # Extract the course name

    v_q = model.encode(question)  # Encode the question into a dense vector using the SentenceTransformer model

    return elastic_search_knn('question_vector', v_q, course)  # Perform KNN search using the vector and course filter


### Load ground truth data

In [17]:
import pandas as pd
df_ground_truth = pd.read_csv('ground-truth-data.csv')  # Load the ground truth data from a CSV file
ground_truth = df_ground_truth.to_dict(orient='records')  # Convert the DataFrame into a list of dictionaries (records)
ground_truth[0]  # Display the first record from the ground truth data


{'question': 'When does the course begin?',
 'course': 'data-engineering-zoomcamp',
 'document': 'c02e79ef'}

#### The following functions evaluate a search function's retrieval performance by calculating Hit Rate and Mean Reciprocal Rank (MRR), two commonly used metrics in information retrieval.

In [18]:
# Function to calculate Hit Rate
def hit_rate(relevance_total):
    cnt = 0

    # Count the number of queries where at least one relevant document is retrieved
    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    # Calculate the hit rate by dividing relevant hits by the total number of queries
    return cnt / len(relevance_total)


In [19]:
# Function to calculate Mean Reciprocal Rank (MRR)
def mrr(relevance_total):
    total_score = 0.0

    # Iterate over each query's relevance list
    for line in relevance_total:
        # Look for the first relevant document and calculate its reciprocal rank
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank  # Reciprocal rank
                break  # Stop after finding the first relevant document

    # Return the average of the reciprocal ranks across all queries
    return total_score / len(relevance_total)
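
As a small worked example of the two metrics, the sketch below applies compact restatements of `hit_rate` and `mrr` to four made-up queries with three results each. The relevance lists are invented for illustration and are not taken from the dataset.

```python
def hit_rate(relevance_total):
    # Fraction of queries whose result list contains the expected document
    return sum(1 for line in relevance_total if any(line)) / len(relevance_total)

def mrr(relevance_total):
    # Average of 1/rank of the first relevant result; queries with no hit contribute 0
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)

relevance_total = [
    [False, True, False],   # hit at rank 2 -> reciprocal rank 1/2
    [True, False, False],   # hit at rank 1 -> reciprocal rank 1
    [False, False, False],  # no hit        -> contributes 0
    [False, False, True],   # hit at rank 3 -> reciprocal rank 1/3
]
print(hit_rate(relevance_total))       # 0.75
print(round(mrr(relevance_total), 4))  # 0.4583
```

Three of the four queries have a hit, so the Hit Rate is 0.75; the MRR is (1/2 + 1 + 0 + 1/3) / 4, which rewards hits that appear earlier in the list.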


In [20]:
# Function to evaluate a search function's performance using Hit Rate and MRR
def evaluate(ground_truth, search_function):
    relevance_total = []

    # Loop over each query in the ground truth data
    for q in tqdm(ground_truth):
        doc_id = q['document']  # Expected document ID for this query
        results = search_function(q)  # Run the search function on the query
        # Check if the document IDs in the results match the expected document ID
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)  # Add the relevance list for this query

    # Return a dictionary containing Hit Rate and MRR
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

# Run the evaluation using ground truth data and the question_vector_knn search function
evaluate(ground_truth, question_vector_knn)


  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.7903260634852084, 'mrr': 0.6793709062117617}

For comparison, Elasticsearch text-only search: hit rate 0.7395720769397017, MRR 0.6032418413658963

In [25]:
# This function performs a k-nearest neighbors (KNN) search using the text_vector field in Elasticsearch.
# It searches for similar documents using the text_vector field, which represents vector embeddings of the main text content in each document.
def text_vector_knn(q):
    question = q['question']  # Extract the question text from the query
    course = q['course']  # Extract the course name from the query

    v_q = model.encode(question)  # Encode the question text into a dense vector

    # Perform KNN search using the 'text_vector' field with the encoded query vector
    return elastic_search_knn('text_vector', v_q, course)

In [23]:
# This function performs a KNN search using the question_text_vector field in Elasticsearch,
# which stores a single vector per document encoded from the question and text combined.
def question_text_vector_knn(q):
    question = q['question']
    course = q['course']

    v_q = model.encode(question)

    return elastic_search_knn('question_text_vector', v_q, course)


In [26]:
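# This function ranks documents with a scripted similarity score in Elasticsearch that sums the cosine
# similarity of the query vector against question_vector, text_vector, and question_text_vector.
# Unlike elastic_search_knn(), this is an exact script_score query over every document in the course,
# not an approximate KNN search.
# The trailing "+ 1" shifts the score upward because Elasticsearch rejects negative scores; it is only
# sufficient as long as the three similarities sum to at least -1.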
def elastic_search_knn_combined(vector, course):
    search_query = {
        "size": 5,  # Retrieve the top 5 results
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "term": {
                                    "course": course  # Filter by course
                                }
                            },
                            "script": {
                                "source": """
                                    cosineSimilarity(params.query_vector, 'question_vector') + 
                                    cosineSimilarity(params.query_vector, 'text_vector') + 
                                    cosineSimilarity(params.query_vector, 'question_text_vector') + 
                                    1
                                """,
                                "params": {
                                    "query_vector": vector  # Query vector to match against
                                }
                            }
                        }
                    }
                ],
                "filter": {
                    "term": {
                        "course": course  # Course filter to narrow the search space
                    }
                }
            }
        },
        "_source": ["text", "section", "question", "course", "id"]  # Fields to retrieve in the result
    }

    es_results = es_client.search(
        index=index_name,  # Elasticsearch index name
        body=search_query  # Search query body
    )
    
    result_docs = []
    
    # Extract document source fields for each hit in the results
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs  # Return the list of matched documents
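
To show what the script computes for a single document, the sketch below reproduces the scoring formula in plain Python: the three cosine similarities are added with equal weight, plus the constant 1. The helper names, the toy documents, and the 2-dimensional vectors are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

VECTOR_FIELDS = ('question_vector', 'text_vector', 'question_text_vector')

def combined_score(query_vector, doc):
    """Sum of the three cosine similarities plus the constant offset of 1 used in the script."""
    return sum(cosine(query_vector, doc[f]) for f in VECTOR_FIELDS) + 1

docs = [
    {'id': 'x', 'question_vector': [1.0, 0.0], 'text_vector': [0.0, 1.0],
     'question_text_vector': [1.0, 0.0]},
    {'id': 'y', 'question_vector': [0.0, 1.0], 'text_vector': [0.0, 1.0],
     'question_text_vector': [0.0, 1.0]},
]
query = [1.0, 0.0]
for doc in sorted(docs, key=lambda d: combined_score(query, d), reverse=True):
    print(doc['id'], combined_score(query, doc))
# x 3.0
# y 1.0
```

Since each cosine lies in [-1, 1], this score ranges from -2 to 4; weighting the fields differently would be a possible variation, but the notebook does not explore it.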


In [29]:
# This function encodes the question and calls elastic_search_knn_combined(), which scores each
# document against all three vector fields. Combining different aspects of similarity
# may improve retrieval accuracy.

def vector_combined_knn(q):
    question = q['question']  # Extract the question from the query
    course = q['course']  # Extract the course name

    v_q = model.encode(question)  # Encode the question text into a dense vector

    return elastic_search_knn_combined(v_q, course)  # Perform combined KNN search


### Evaluation

In [31]:
evaluate(ground_truth, text_vector_knn)              # Evaluates using only text vector similarity

  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.8438782120492334, 'mrr': 0.723997696681783}

In [32]:
evaluate(ground_truth, question_text_vector_knn)     # Evaluates using question + text vector similarity

  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.9328438782120493, 'mrr': 0.8429784783704027}

In [33]:
evaluate(ground_truth, vector_combined_knn)          # Evaluates using combined similarity across multiple vectors

  0%|          | 0/4631 [00:00<?, ?it/s]

{'hit_rate': 0.9209673936514792, 'mrr': 0.8249370186424827}
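
To compare the four approaches side by side, the snippet below collects the metrics printed by the evaluation cells in this notebook and sorts them by MRR. Only the dictionary layout and variable names are new; the numbers are copied from the outputs above.

```python
# Metrics copied from the evaluation outputs in this notebook
results = {
    'question_vector_knn': {'hit_rate': 0.7903260634852084, 'mrr': 0.6793709062117617},
    'text_vector_knn': {'hit_rate': 0.8438782120492334, 'mrr': 0.723997696681783},
    'question_text_vector_knn': {'hit_rate': 0.9328438782120493, 'mrr': 0.8429784783704027},
    'vector_combined_knn': {'hit_rate': 0.9209673936514792, 'mrr': 0.8249370186424827},
}

# Best approach first, ranked by MRR
ranking = sorted(results, key=lambda name: results[name]['mrr'], reverse=True)
for name in ranking:
    m = results[name]
    print(f"{name:<26} hit_rate={m['hit_rate']:.3f}  mrr={m['mrr']:.3f}")
```

In this run the single combined embedding (`question_text_vector_knn`) ranks first on both metrics, slightly ahead of the scripted sum over all three fields.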