In [1]:
from typing import Dict, List

import numpy as np
import pprint
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [3]:
# Our sample documents to turn into Vector Embeddings
sample_docs = [
    "Pigs are stout-bodied, short-legged, omnivorous mammals, with thick skin usually sparsely coated with short bristles",
    "Cows are four-footed and have a large body. It has two horns, two eyes plus two ears and one nose and a mouth. Cows are herbivorous animals.",
    "Chickens are average-sized fowls, characterized by smaller heads, short beaks and wings, and a round body perched on featherless legs.",
    "NumPy (Numerical Python) is an open source Python library that's used in almost every field of science and engineering. It's the universal standard for working with numerical data in Python, and it's at the core of the scientific Python and PyData ecosystems."
]

# The Embedding function, which is a Neural Network taking in text and outputting vectors
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    # model_kwargs = {'device': 'cuda'},
    model_kwargs = {'device': 'cpu'},
    encode_kwargs = {'normalize_embeddings': True}
)

In [4]:
embeddings = np.array(embedding_function.embed_documents(texts = sample_docs))
for i, embedding in enumerate(embeddings[:1]):
    print(f"First 5 dimensions for embedding of {sample_docs[i]}:")
    print(f"\t {embeddings[i,:5]}") # Only printing the first 5 to shorten it
    print(f"Embedding Dimension: {embeddings[i].shape}")
    print("-" * 80)

First 5 dimensions for embedding of Pigs are stout-bodied, short-legged, omnivorous mammals, with thick skin usually sparsely coated with short bristles:
	 [ 0.03743296  0.00983476  0.05154186 -0.03185306 -0.01204163]
Embedding Dimension: (1024,)
--------------------------------------------------------------------------------


## Retrieving Relevant Documents via Vector Similarity

A key property of these embeddings is that semantically similar texts map to vectors that are close together in the vector space, while unrelated texts map to vectors that are far apart (with a similarity score near 0).  The appropriate similarity metric often depends on how the model was trained.  For this model, we will use cosine similarity.

Given the embeddings, we can measure how similar any two documents are by computing the cosine similarity between their vectors.

In [5]:
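norms = np.linalg.norm(embeddings, axis = 1)
# Entry (i, j) must be divided by norms[i] * norms[j], which is the outer product of the norms
cosine_similarities = (embeddings @ embeddings.T) / np.outer(norms, norms)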
for i in range(len(sample_docs)):
    for j in range(i):
        print(f"Similarity between {sample_docs[j][:20]}... and {sample_docs[i][:20]}...: {cosine_similarities[j][i]}")

Similarity between Pigs are stout-bodie... and Cows are four-footed...: 0.6423076099256174
Similarity between Pigs are stout-bodie... and Chickens are average...: 0.6509346820235993
Similarity between Cows are four-footed... and Chickens are average...: 0.6070802604727121
Similarity between Pigs are stout-bodie... and NumPy (Numerical Pyt...: 0.3806830482417314
Similarity between Cows are four-footed... and NumPy (Numerical Pyt...: 0.35324548971637787
Similarity between Chickens are average... and NumPy (Numerical Pyt...: 0.36846483482141795


As a simple eye test, we can see that the first three documents (all animal descriptions) have relatively high cosine similarities with one another (about 0.61 to 0.65), while each one's similarity to the NumPy description is noticeably lower (about 0.35 to 0.38).
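
Because the embedding function is configured with `normalize_embeddings: True`, the vectors it returns should already have unit length, in which case cosine similarity reduces to a plain dot product. The sketch below illustrates this with hypothetical toy vectors rather than real model output; the helper name `cosine_similarity_matrix` is chosen for illustration and is not part of any library.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for real embeddings.
vectors = np.array([
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 5.0],
])


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1)
    # Entry (i, j) is divided by norms[i] * norms[j].
    return (x @ x.T) / np.outer(norms, norms)


# Scale every row to unit length, mimicking normalized model output.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

full = cosine_similarity_matrix(vectors)
shortcut = unit_vectors @ unit_vectors.T  # dot product only, no division

print(np.round(full, 2))
print("Dot product of unit vectors matches cosine similarity:", np.allclose(full, shortcut))
```

The first two toy vectors point in similar directions and score 0.96, while the third is orthogonal to both and scores 0. Keeping the explicit division by the norms costs little and stays correct even if normalization is turned off.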

## Simplest Vector Search System

These tools alone are technically enough to build a simple semantic search system.

In [7]:
embedding_function = SentenceTransformerEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    # model_kwargs = {'device': 'cuda'},
    model_kwargs = {'device': 'cpu'},
    encode_kwargs = {'normalize_embeddings': True}
)
sample_docs = [
    "Pigs are stout-bodied, short-legged, omnivorous mammals, with thick skin usually sparsely coated with short bristles",
    "Cows are four-footed and have a large body. It has two horns, two eyes plus two ears and one nose and a mouth. Cows are herbivorous animals.",
    "Chickens are average-sized fowls, characterized by smaller heads, short beaks and wings, and a round body perched on featherless legs.",
    "NumPy (Numerical Python) is an open source Python library that's used in almost every field of science and engineering. It's the universal standard for working with numerical data in Python, and it's at the core of the scientific Python and PyData ecosystems."
]

def embed_documents(docs: List[str]) -> np.ndarray:
    """embed all of our documents, only done once"""
    return np.array(embedding_function.embed_documents(docs))

def embed_query(query: str) -> np.ndarray:
    """embed the query, done on demand"""
    return np.array(embedding_function.embed_documents([query]))[0,:]

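def retrieve_relevant_documents(doc_embeddings : np.ndarray, query_embedding : np.ndarray, k : int = 1) -> List[Dict[str, object]]:
    """compute cosine similarity between query and documents, return top k and their scores"""
    # doc_embeddings has shape (n_docs, dim) and query_embedding has shape (dim,), so the scores have shape (n_docs,)
    doc_norms = np.linalg.norm(doc_embeddings, axis = 1)
    cosine_similarities = (doc_embeddings @ query_embedding) / (doc_norms * np.linalg.norm(query_embedding))
    # argsort is ascending, so reverse it to put the most similar documents first
    ranked_indices = np.argsort(cosine_similarities)[::-1][:k]
    # Documents are looked up in the module-level sample_docs list
    return [{'document' : sample_docs[i], 'score' : cosine_similarities[i]} for i in ranked_indices]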

First, we embed our documents. This step is typically done offline.

In [8]:
doc_embeddings = embed_documents(sample_docs)

Then, for every query:
1. Embed the query.
2. Compute the similarity score between the query and each document.
3. Return the top k most similar documents.

In [10]:
query_embedding = embed_query("What is a chicken?")

relevant_docs = retrieve_relevant_documents(doc_embeddings, query_embedding, k = 2)
pprint.pprint(relevant_docs)

[{'document': 'Chickens are average-sized fowls, characterized by smaller '
              'heads, short beaks and wings, and a round body perched on '
              'featherless legs.',
  'score': 0.7052902283407293},
 {'document': 'Cows are four-footed and have a large body. It has two horns, '
              'two eyes plus two ears and one nose and a mouth. Cows are '
              'herbivorous animals.',
  'score': 0.526725074842122}]
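
The retrieval function above looks documents up in the global `sample_docs` list. As a hedged variation, the sketch below passes the documents in explicitly so the same function can serve any corpus. The function name `top_k_documents` and the 2-dimensional toy embeddings are hypothetical and chosen only so the example runs without downloading a model.

```python
from typing import Dict, List

import numpy as np


def top_k_documents(
    docs: List[str],
    doc_embeddings: np.ndarray,
    query_embedding: np.ndarray,
    k: int = 1,
) -> List[Dict[str, object]]:
    """Return the k documents most similar to the query, best match first."""
    doc_norms = np.linalg.norm(doc_embeddings, axis=1)
    query_norm = np.linalg.norm(query_embedding)
    scores = (doc_embeddings @ query_embedding) / (doc_norms * query_norm)
    # argsort is ascending, so reverse it before taking the first k indices.
    best = np.argsort(scores)[::-1][:k]
    return [{"document": docs[i], "score": float(scores[i])} for i in best]


# Hypothetical toy corpus with hand-picked 2-D "embeddings".
toy_docs = ["about chickens", "about cows", "about NumPy"]
toy_doc_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
toy_query_embedding = np.array([1.0, 0.0])

results = top_k_documents(toy_docs, toy_doc_embeddings, toy_query_embedding, k=2)
for result in results:
    print(result)
```

With real data, `doc_embeddings` and `query_embedding` would come from `embed_documents` and `embed_query` as defined earlier. Sorting every score is fine for a handful of documents; larger collections usually call for an approximate nearest-neighbor index instead.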
