**Main Question**: Deploy a vector DB on your own, and implement `vector cosine similarity` without using a high-level library.

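*Assumption*: The main point of this question is to turn the cosine similarity formula into a Python function, `cosine_similarity`. I assume that a dummy `vector_db` (a plain dictionary mapping document IDs to embeddings) is acceptable, and that saving it with `pickle` is enough to count as persistence. Nothing in the code depends on the number of documents or on the embedding dimension, so it can be extended to larger datasets or to real embeddings from LLMs in the future, with the caveat that every query scans the whole database.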
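
For reference, this is the standard definition of cosine similarity between two vectors $a$ and $b$ of the same dimension $n$, which the `cosine_similarity` function below computes term by term. The value is undefined when either vector has zero norm; returning `0.0` in that case is this notebook's convention, not part of the definition.

```latex
\[
\cos(\theta)
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```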

In [40]:
import math
import pickle

In [41]:
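def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        # zip() below would otherwise silently truncate to the shorter vector
        raise ValueError("vectors must have the same length")

    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Cosine similarity is undefined for a zero vector; return 0.0 by convention
    if norm_a == 0 or norm_b == 0:
        return 0.0

    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (norm_a * norm_b)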

In [42]:
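def top_k_query(vector_db, query_vec, k):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scores = []
    # Brute-force scan: score every stored embedding against the query
    for doc_id, embedding in vector_db.items():
        score = cosine_similarity(query_vec, embedding)
        scores.append((doc_id, score))
    # Highest similarity first; slicing also handles k larger than the database
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:k]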

In [43]:
# Dummy vector database
vector_db = {
    "doc_1": [0.1, 0.3, 0.5],
    "doc_2": [0.2, 0.1, 0.4],
    "doc_3": [0.9, 0.8, 0.7],
    "doc_4": [0.5, 0.2, 0.1],
    "doc_5": [0.0, 0.2, 0.9]
}

query_vec = [0.1, 0.2, 0.4]

K = 5
top_k_results = top_k_query(vector_db, query_vec, k=K)

print(f"Top-{K} most similar documents:")
for doc_id, score in top_k_results:
    print(f"{doc_id}: {score:.4f}")

Top-5 most similar documents:
doc_1: 0.9959
doc_2: 0.9524
doc_5: 0.9468
doc_3: 0.8304
doc_4: 0.5179
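
As a quick sanity check on the scores above, the sketch below tests a few properties that any cosine similarity implementation should satisfy: parallel vectors score 1, orthogonal vectors score 0, and opposite vectors score -1. It repeats the similarity function so that the cell runs on its own. The `doc_2` check follows from hand arithmetic: its dot product with the query is 0.20 and both norms are sqrt(0.21), so the score should be 0.20 / 0.21.

```python
import math


def cosine_similarity(a, b):
    # Same formula as the notebook's function, repeated so this cell is self-contained
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Scaling a vector does not change its direction, so the score should be 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors have a zero dot product
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions should score -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
# Zero vector: falls back to the 0.0 convention
zero_vector = cosine_similarity([0.0, 0.0], [1.0, 2.0])
# doc_2 against the query vector from the notebook
doc_2_score = cosine_similarity([0.1, 0.2, 0.4], [0.2, 0.1, 0.4])

for name, value in [
    ("same_direction", same_direction),
    ("orthogonal", orthogonal),
    ("opposite", opposite),
    ("zero_vector", zero_vector),
    ("doc_2_score", doc_2_score),
]:
    print(f"{name}: {value:.4f}")
```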


In [44]:
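# Persist the dummy database to disk
with open("vector_db.pkl", "wb") as f:
    pickle.dump(vector_db, f)

# Reload it to confirm the round trip (only unpickle files from a trusted source)
with open("vector_db.pkl", "rb") as f:
    loaded_vector_db = pickle.load(f)

print("Loaded vector DB keys:", list(loaded_vector_db.keys()))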

Loaded vector DB keys: ['doc_1', 'doc_2', 'doc_3', 'doc_4', 'doc_5']
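
On the assumption that the database may later grow, one possible refinement is to avoid sorting every score when only the top `k` are needed. The sketch below uses the standard library's `heapq.nlargest` for that. The function name `top_k_query_heap` is hypothetical and not part of the original solution, and the scan is still brute force: every stored embedding is compared with the query.

```python
import heapq
import math


def cosine_similarity(a, b):
    # Same formula as the notebook's function, repeated so this cell is self-contained
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k_query_heap(vector_db, query_vec, k):
    """Hypothetical variant of top_k_query that keeps only the k best scores."""
    # Generator: scores are produced one at a time instead of stored in a full list
    scored = (
        (doc_id, cosine_similarity(query_vec, embedding))
        for doc_id, embedding in vector_db.items()
    )
    # nlargest returns at most k pairs, ordered from highest to lowest score
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Same dummy data as in the notebook
vector_db = {
    "doc_1": [0.1, 0.3, 0.5],
    "doc_2": [0.2, 0.1, 0.4],
    "doc_3": [0.9, 0.8, 0.7],
    "doc_4": [0.5, 0.2, 0.1],
    "doc_5": [0.0, 0.2, 0.9],
}
query_vec = [0.1, 0.2, 0.4]

top_2 = top_k_query_heap(vector_db, query_vec, k=2)
for doc_id, score in top_2:
    print(f"{doc_id}: {score:.4f}")
```

For five documents the difference is negligible; the variant only starts to matter when `k` is much smaller than the number of stored vectors.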
