# Top-N Retrieval Comparison

This notebook compares retrieval time and ranking consistency when requesting the top 5, top 10, and top 50 results from an embedding-based semantic search.

In [None]:

from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Create a simple corpus: 5 topics x 40 documents = 200 short sentences
categories = ['sports', 'technology', 'finance', 'health', 'travel']
corpus = [f'This is document {i} about {cat}.' for cat in categories for i in range(40)]

model = SentenceTransformer('all-MiniLM-L6-v2')
# Unit-length embeddings make the dot product equal to cosine similarity
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)


Define a simple semantic search. The score is a dot product, which equals cosine similarity only when the embeddings are unit-length:

In [None]:

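def semantic_search(query_embedding, embeddings, top_k):
    # Dot product of every corpus embedding with the query
    scores = np.dot(embeddings, query_embedding)
    # Full descending sort of all scores, then keep the first top_k indices
    top_indices = np.argsort(scores)[::-1][:top_k]
    return top_indices, scores[top_indices]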
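
The function above sorts every score before slicing, so its cost does not depend on `top_k`. For larger corpora, a partial selection can avoid the full sort. The following is a minimal sketch of that idea using `np.argpartition`; the name `semantic_search_partial` and the synthetic unit vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def semantic_search_partial(query_embedding, embeddings, top_k):
    """Hypothetical variant: pick the top_k scores without sorting all of them."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    scores = embeddings @ query_embedding
    top_k = min(top_k, len(scores))
    # argpartition moves the top_k largest scores into the last top_k slots, unordered
    candidates = np.argpartition(scores, -top_k)[-top_k:]
    # Sort only those candidates, highest score first
    order = np.argsort(scores[candidates])[::-1]
    top_indices = candidates[order]
    return top_indices, scores[top_indices]

# Synthetic unit vectors stand in for model embeddings (assumption)
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(200, 384))
demo_embeddings /= np.linalg.norm(demo_embeddings, axis=1, keepdims=True)
demo_query = demo_embeddings[3]  # querying with document 3 itself

demo_idx, demo_scores = semantic_search_partial(demo_query, demo_embeddings, 5)
print(demo_idx, demo_scores)
```

With only 200 documents the difference from a full sort is likely negligible; the partial selection matters mainly when the corpus is much larger than `top_k`.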


In [None]:

query = 'technology news'
# Normalize the query the same way as the corpus embeddings
query_embedding = model.encode([query], normalize_embeddings=True)[0]


In [None]:

results = {}
for k in [5, 10, 50]:
    # Single-run wall-clock timing; values at this scale are noisy
    start = time.perf_counter()
    idx, scores = semantic_search(query_embedding, corpus_embeddings, k)
    elapsed = time.perf_counter() - start
    results[k] = idx
    print(f'Top {k} retrieval took {elapsed:.6f} seconds')
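
A single `perf_counter` measurement of a sub-millisecond operation can be dominated by noise. As a rough sketch of a steadier measurement, the block below repeats each search with `timeit.repeat` and reports the best per-call time. It uses random vectors in place of model embeddings, and `full_sort_search`, `bench_embeddings`, and `bench_query` are illustrative names, not part of the original notebook.

```python
import timeit
import numpy as np

# Random vectors stand in for the corpus and query embeddings (assumption)
rng = np.random.default_rng(0)
bench_embeddings = rng.normal(size=(200, 384)).astype(np.float32)
bench_query = rng.normal(size=384).astype(np.float32)

def full_sort_search(query_embedding, embeddings, top_k):
    # Same approach as the notebook: score everything, full sort, slice
    scores = embeddings @ query_embedding
    top_indices = np.argsort(scores)[::-1][:top_k]
    return top_indices, scores[top_indices]

bench_times = {}
for k in [5, 10, 50]:
    # 5 batches of 1000 calls each; the minimum batch is the least disturbed
    runs = timeit.repeat(
        lambda: full_sort_search(bench_query, bench_embeddings, k),
        number=1000,
        repeat=5,
    )
    bench_times[k] = min(runs) / 1000
    print(f'Top {k}: {bench_times[k] * 1e6:.2f} microseconds per call (best of 5)')
```

Taking the minimum over batches is one common convention; the mean or median would be a reasonable alternative depending on what is being compared.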


Check whether the top-5 query returns the same documents, in the same order, as the first five results of the top-50 query:

In [None]:

match = np.array_equal(results[5], results[50][:5])
print('Top-5 results match:', match)
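
The same prefix check can be extended to every pair of k values. The sketch below does this with hand-written scores that contain ties (hypothetical values, not model output) and a stable sort, so that tied documents are ordered by their index. The helper name `ranked_indices` is an assumption for illustration.

```python
import numpy as np

def ranked_indices(scores, top_k):
    # Stable sort on negated scores: descending by score, ties by lowest index
    return np.argsort(-scores, kind="stable")[:top_k]

# Hypothetical scores with deliberate ties
tie_scores = np.array([0.9, 0.5, 0.9, 0.7, 0.5, 0.7])
prefix_results = {k: ranked_indices(tie_scores, k) for k in [2, 4, 6]}

# Each shorter result list should be a prefix of the next longer one
ks = sorted(prefix_results)
prefix_ok = all(
    np.array_equal(prefix_results[a], prefix_results[b][:a])
    for a, b in zip(ks, ks[1:])
)
print('All prefixes consistent:', prefix_ok)
```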


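Retrieving more results therefore adds little overhead here, and the ranking of the first five is unchanged. Both follow from the implementation: `semantic_search` sorts all scores and then slices, so the top 5 is always a prefix of the top 50 and `top_k` only changes the length of the slice. Because each timing is a single run on a 200-document corpus, small differences between values of k should be read as measurement noise rather than a real cost.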