# Evaluating a Semantic Search Retriever
This notebook demonstrates how to build and evaluate a simple semantic search system using the 20 Newsgroups dataset. We'll use a sentence-transformer model to create vector embeddings and then measure the retrieval quality using Precision@K and Recall@K.

- Vectorize Documents: Convert text from the 20 Newsgroups dataset into dense vector embeddings.
- Implement Semantic Search: Use cosine similarity to find the most relevant documents for a given query.
- Calculate Metrics: Compute precision and recall for a set of test queries to evaluate the retriever's performance.

In [9]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import matplotlib.pyplot as plt
import joblib
import os

### Dataset

In [10]:
# Load the 20 Newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

In [11]:
# Convert the dataset to a DataFrame
df = pd.DataFrame({
    'text': newsgroups_train.data,
    'category': newsgroups_train.target
})

df.head()

| | text | category |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


In [12]:
print("\nDataset Size:", df.shape)
print("\nNumber of Categories:", len(newsgroups_train.target_names))
print("\nCategories:", newsgroups_train.target_names)


Dataset Size: (11314, 2)

Number of Categories: 20

Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [13]:
print(f"TEXT:\n\t{df['text'][0]}\nCATEGORY:\n\t{newsgroups_train.target_names[df['category'][0]]}")

TEXT:
	From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





CATEGORY:
	rec.autos


### Preprocessing and Vectorizing Data

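The documents are vectorized with `BAAI/bge-base-en-v1.5`, a pre-trained model from the `sentence-transformers` library that encodes each text into a 768-dimensional vector. Text is cleaned only lightly: leading and trailing whitespace is stripped. The cell below loads the model together with the precomputed document embeddings stored in `embeddings.joblib`; the model is still needed to encode queries.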

In [14]:
model_name =  "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(os.path.join(os.environ['MODEL_PATH'], model_name))

embedding_vectors = joblib.load('embeddings.joblib')

In [16]:
len(embedding_vectors)

11314

In [18]:
embedding_vectors[0].shape

(768,)
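
The notebook loads its embeddings from disk and does not show how they were computed. The sketch below shows one way such a file could be produced. The helper name `embed_corpus`, the batch size, and the assumption that `embeddings.joblib` was built this way are illustrative and not confirmed by the notebook.

```python
import numpy as np

def embed_corpus(texts, model, batch_size=32):
    """Encode a list of documents into a 2-D array of embeddings.

    `model` is any object with a SentenceTransformer-style `encode` method.
    """
    # Light cleaning only: strip leading and trailing whitespace
    cleaned = [text.strip() for text in texts]
    vectors = model.encode(cleaned, batch_size=batch_size, show_progress_bar=False)
    return np.asarray(vectors)

# Hypothetical usage with the objects defined in this notebook:
# embedding_vectors = embed_corpus(df["text"].tolist(), model)
# joblib.dump(embedding_vectors, "embeddings.joblib")
```

Encoding all 11,314 documents can take a while on CPU, which is a plausible reason to cache the result with `joblib`.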

### Functions for retrieval

Implement the retrieval step of a basic RAG pipeline by running a similarity search over the precomputed embeddings. The code ranks documents by their cosine similarity to the query embedding.

In [19]:
def preprocess_text(text):
    """
    Preprocess the text data by removing leading and trailing whitespace.
    """
    text = text.strip()
    return text

In [20]:
def cosine_similarity(v1, array_of_vectors):
    """
    Compute the cosine similarity between a vector and an array of vectors.

    Parameters:
    v1 (array-like): The first vector.
    array_of_vectors (array-like): An array of vectors or a single vector.

    Returns:
    list: A list of cosine similarities between v1 and each vector in array_of_vectors.
    """
    v1 = np.array(v1)
    similarities = []
    if len(np.shape(array_of_vectors)) == 1:
        array_of_vectors = [array_of_vectors]
    # Iterate over each vector in the array
    for v2 in array_of_vectors:
        v2 = np.array(v2)
        # Compute the dot product of v1 and v2
        dot_product = np.dot(v1, v2)
        # Compute the norms of the vectors
        norm_v1 = np.linalg.norm(v1)
        norm_v2 = np.linalg.norm(v2)
        # Compute the cosine similarity and append to the list
        similarity = dot_product / (norm_v1 * norm_v2)
        similarities.append(similarity)
    return similarities
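
The loop above computes one similarity at a time. For a corpus of this size, the same quantity can be computed in a single matrix operation. The following is a minimal sketch, not the notebook's implementation: the name `cosine_similarity_batch` is hypothetical, and zero-length vectors are assigned a similarity of 0 as a design choice.

```python
import numpy as np

def cosine_similarity_batch(query_vector, doc_vectors):
    """Cosine similarity between one query vector and every row of doc_vectors."""
    query = np.asarray(query_vector, dtype=float)
    docs = np.atleast_2d(np.asarray(doc_vectors, dtype=float))
    # Dot product of every document row with the query
    dot_products = docs @ query
    # Product of norms; replace zeros with 1 so zero vectors score 0 instead of NaN
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    return dot_products / safe_norms

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(cosine_similarity_batch([1.0, 0.0], docs))
```

The result is a NumPy array with one score per document, so it can be passed directly to a top-k selection step.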

In [21]:
def top_k_greatest_indices(lst, k):
    """
    Get the indices of the top k greatest items in a list.

    Parameters:
    lst (list): The list of elements to evaluate.
    k (int): The number of top elements to retrieve by index.

    Returns:
    list: A list of indices corresponding to the top k greatest elements in lst.
    """
    indexed_list = list(enumerate(lst))
    sorted_by_value = sorted(indexed_list, key=lambda x: x[1], reverse=True)
    top_k_indices = [index for index, value in sorted_by_value[:k]]
    return top_k_indices
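
Sorting the full list costs O(n log n) even though only k items are needed. A possible alternative, sketched here under the assumption that scores are plain floats, uses `np.argpartition` to select the k largest scores first and then sorts only those. The name `top_k_indices_fast` is hypothetical.

```python
import numpy as np

def top_k_indices_fast(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest."""
    scores = np.asarray(scores, dtype=float).ravel()
    k = min(k, scores.size)
    if k <= 0:
        return []
    # Partition so that the k largest scores occupy the first k positions (unordered)
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates by descending score
    ordered = candidates[np.argsort(-scores[candidates])]
    return ordered.tolist()

print(top_k_indices_fast([0.1, 0.9, 0.5, 0.7], k=2))  # [1, 3]
```

When scores are tied, the order among the tied items may differ from the `sorted`-based version.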

### The retriever function

In [22]:
def retrieve_documents(query, embeddings, model, top_k=5):
    """
    Retrieve the top-k most similar documents to a given query based on cosine similarity.

    Parameters:
    query (str): The search query for which similar documents are to be retrieved.
    embeddings (list): A list of document embeddings against which the query is compared.
    model (object): A model with an 'encode' method to transform the query into an embedding.
    top_k (int, optional): The number of top documents to retrieve. Defaults to 5.
    """
    query_clean = preprocess_text(query)
    # encode() returns a NumPy array by default, which is what cosine_similarity expects
    query_embedding = model.encode(query_clean)

    # One similarity score per document embedding
    cosine_scores = cosine_similarity(query_embedding, embeddings)

    top_results = top_k_greatest_indices(cosine_scores, k=top_k)

    # Display the results
    print(f"Query: {query}")
    for x in top_results:
        print(f"Document: {df.iloc[x]['text'][:200]}...")
        print(f"Category: {newsgroups_train.target_names[df.iloc[x]['category']]}...")
        print("\n\n")

In [24]:
example_query = "space exploration"
retrieve_documents(example_query, embedding_vectors, model, top_k = 3)

Query: space exploration
Document: From: u1452@penelope.sdsc.edu (Jeff Bytof - SIO)
Subject: End of the Space Age?
Organization: San Diego Supercomputer Center @ UCSD
Lines: 16
Distribution: world
NNTP-Posting-Host: penelope.sdsc.edu

...
Category: sci.space...



Document: From: dennisn@ecs.comm.mot.com (Dennis Newkirk)
Subject: Space class for teachers near Chicago
Organization: Motorola
Distribution: usa
Nntp-Posting-Host: 145.1.146.43
Lines: 59

I am posting this for...
Category: sci.space...



Document: From: Wales.Larrison@ofa123.fidonet.org
Subject: Commercial Space News #22
X-Sender: newtout 0.08 Feb 23 1993
Lines: 666

COMMERCIAL SPACE NEWS/SPACE TECHNOLOGY INVESTOR NUMBER 22

   This is number t...
Category: sci.space...





## Retrieval metrics

### Precision

Precision measures the fraction of retrieved documents that are relevant to the query.

$$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$$



In [26]:
def precision(tp, tn, fp, fn):
    """
    Calculate the precision of a binary classification model.
    """
    if tp < 0 or tn < 0 or fp < 0 or fn < 0:
        raise ValueError("All input values must be non-negative.")
    
    if tp + fp == 0:
        return 0.0

    return tp / (tp + fp)
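
As a worked instance with the notebook's default `top_k=5`: if 2 of the 5 retrieved documents belong to the desired category, then TP = 2 and FP = 3, so

$$\text{Precision@5} = \frac{2}{2 + 3} = 0.40$$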

### Recall

Recall measures the fraction of all relevant documents in the dataset that the model retrieves.

$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$

In [27]:
def recall(tp, tn, fp, fn):
    """
    Calculate the recall (sensitivity) of a binary classification model.
    """
    if tp < 0 or tn < 0 or fp < 0 or fn < 0:
        raise ValueError("All input values must be non-negative.")

    if tp + fn == 0:
        return 0.0

    return tp / (tp + fn)

### Computing metrics over some queries

In [28]:
# test queries with their corresponding desired categories

test_queries = [
    {"query": "advancements in space exploration technology", "desired_category": "sci.space"},
    {"query": "real-time rendering techniques in computer graphics", "desired_category": "comp.graphics"},
    {"query": "latest findings in cardiovascular medical research", "desired_category": "sci.med"},
    {"query": "NHL playoffs and team performance statistics", "desired_category": "rec.sport.hockey"},
    {"query": "impacts of cryptography in online security", "desired_category": "sci.crypt"},
    {"query": "the role of electronics in modern computing devices", "desired_category": "sci.electronics"},
    {"query": "motorcycles maintenance tips for enthusiasts", "desired_category": "rec.motorcycles"},
    {"query": "high-performance baseball tactics for championships", "desired_category": "rec.sport.baseball"},
    {"query": "historical influence of politics on society", "desired_category": "talk.politics.misc"},
    {"query": "latest technology trends in the Windows operating system", "desired_category": "comp.os.ms-windows.misc"}
    
]

In [29]:
def compute_metrics(queries, embeddings, model, top_k=5):
    """
    Compute precision and recall metrics for a list of queries against a dataset of document embeddings.

    Parameters:
    queries (list): A list of dictionaries, each containing a "query" and a "desired_category".
    embeddings (list): A list of document embeddings to which queries will be compared.
    model (object): A model with an 'encode' method to transform queries into embeddings.
    top_k (int, optional): The number of top documents to consider for each query. Defaults to 5.

    Returns:
    list: A list of dictionaries, each containing the query, its precision, and recall.
    """
    
    results = []

    for item in queries:
        query = item["query"]
        desired_category = item["desired_category"]

        query_clean = preprocess_text(query)
        # encode() returns a NumPy array by default, which is what cosine_similarity expects
        query_embedding = model.encode(query_clean)

        # Compute cosine similarities with the embeddings passed in (not the global variable)
        cosine_scores = cosine_similarity(query_embedding, embeddings)

        # Retrieve top-k documents based on cosine similarity
        top_results = top_k_greatest_indices(cosine_scores, k=top_k)

        # Check the categories of the retrieved documents
        retrieved_categories = [
            newsgroups_train.target_names[df.iloc[idx]["category"]] for idx in top_results
        ]
        
        # Calculate true positives and false positives
        true_positives = sum(1 for cat in retrieved_categories if cat == desired_category)
        false_positives = top_k - true_positives
        # Note: relevant documents are counted only among the top-k results here, so this
        # difference is always 0. Relevant documents that were never retrieved are not counted.
        false_negatives = sum(
            newsgroups_train.target_names[df.iloc[idx]["category"]] == desired_category 
            for idx in top_results
        ) - true_positives
        # TN (True Negatives) is generally not well-defined in this information retrieval context
        true_negatives = 0

        # Calculate precision and recall using defined functions
        p = precision(true_positives, true_negatives, false_positives, false_negatives)
        r = recall(true_positives, true_negatives, false_positives, false_negatives)

        # Append the results to the list
        results.append({
            "query": query,
            "precision": p,
            "recall": r,
        })

    return results

In [30]:
results = compute_metrics(test_queries, embedding_vectors, model)

print("Results:")
for result in results:
    print(f"Query: {result['query']}, Precision: {result['precision']:.2f}, Recall: {result['recall']:.2f}")

Results:
Query: advancements in space exploration technology, Precision: 1.00, Recall: 1.00
Query: real-time rendering techniques in computer graphics, Precision: 1.00, Recall: 1.00
Query: latest findings in cardiovascular medical research, Precision: 1.00, Recall: 1.00
Query: NHL playoffs and team performance statistics, Precision: 1.00, Recall: 1.00
Query: impacts of cryptography in online security, Precision: 1.00, Recall: 1.00
Query: the role of electronics in modern computing devices, Precision: 1.00, Recall: 1.00
Query: motorcycles maintenance tips for enthusiasts, Precision: 1.00, Recall: 1.00
Query: high-performance baseball tactics for championships, Precision: 1.00, Recall: 1.00
Query: historical influence of politics on society, Precision: 0.40, Recall: 1.00
Query: latest technology trends in the Windows operating system, Precision: 0.80, Recall: 1.00
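
A single summary number is often easier to compare across retrievers than a per-query list. The sketch below averages one metric over a list shaped like the output of `compute_metrics`. The helper name `mean_metric` is hypothetical, and the precision values are copied from the printed results above.

```python
def mean_metric(results, key):
    """Average the value stored under `key` across a list of result dictionaries."""
    values = [result[key] for result in results]
    return sum(values) / len(values) if values else 0.0

# Precision values copied from the printed results: eight queries at 1.00, then 0.40 and 0.80
printed_results = [{"precision": p} for p in [1.0] * 8 + [0.4, 0.8]]
print(f"Mean Precision@5: {mean_metric(printed_results, 'precision'):.2f}")  # Mean Precision@5: 0.92
```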


The **precision** is 1.00 for eight of the ten queries, meaning all five retrieved documents belong to the desired category. It drops for broader queries: "historical influence of politics on society" scores 0.40, so only 2 of the 5 retrieved documents are labeled _talk.politics.misc_, and the Windows query scores 0.80 (4 of 5).

The **recall** of 1.00 for every query should not be read as "all relevant documents were retrieved." `compute_metrics` counts false negatives only among the top-k results, so that count is always zero and recall equals 1 whenever at least one retrieved document is relevant. The 11,314 documents are spread over 20 categories, about 566 per category on average, so a retriever that returns 5 results can recover only a small fraction of any category. A true Recall@K divides the true positives by the total number of documents in the desired category.
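
To make the difference concrete, here is a minimal sketch of Precision@K and Recall@K in which false negatives cover the whole corpus. It is not the notebook's `compute_metrics`: the function name `precision_recall_at_k`, the example labels, and the figure of 500 relevant documents are all hypothetical.

```python
def precision_recall_at_k(retrieved_labels, desired_label, total_relevant):
    """Precision@K and Recall@K for one query, given the labels of its top-K results.

    total_relevant is the number of documents in the whole corpus with desired_label.
    """
    k = len(retrieved_labels)
    true_positives = sum(1 for label in retrieved_labels if label == desired_label)
    precision_at_k = true_positives / k if k else 0.0
    # Relevant documents that were not retrieved count against recall
    recall_at_k = true_positives / total_relevant if total_relevant else 0.0
    return precision_at_k, recall_at_k

# Hypothetical example: 5 results, 2 from the desired category, 500 relevant documents overall
labels = [
    "talk.politics.misc", "talk.politics.guns", "talk.politics.misc",
    "talk.politics.mideast", "talk.religion.misc",
]
p, r = precision_recall_at_k(labels, "talk.politics.misc", total_relevant=500)
print(f"Precision@5: {p:.2f}, Recall@5: {r:.3f}")  # Precision@5: 0.40, Recall@5: 0.004
```

In this notebook, `total_relevant` could be obtained with `(df["category"] == newsgroups_train.target_names.index(desired_category)).sum()`.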