# Facebook AI Similarity Search

Comparing two vectors is easy: cosine similarity is a simple and effective technique. But what if you have thousands or even millions of vectors? How do you find the vectors most similar to a given query vector? A brute-force search compares the query against every vector, so the time required scales linearly with the size of the corpus. Comparing against a hundred vectors is not a problem, but even our simple searches through plays in the previous workbook took ~200ms to run. On a much larger corpus, a search would take seconds or tens of seconds. This is the problem that similarity search solves, and it is what [Facebook AI Similarity Search (FAISS)](https://ai.meta.com/tools/faiss/) does: it uses indexing to speed up the search process.



## Motivation

In [None]:
from sentence_transformers import SentenceTransformer

# Available models: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
# model = SentenceTransformer('bert-base-nli-mean-tokens')
# model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')

In [None]:
# Corpus of documents to search
import os

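# Build the path portably; ".\plays" only works on Windows and "\p" is an invalid escape sequence
directory = os.path.join(".", "plays")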
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read()
            files[file_name] = file_contents

For FAISS exploration, let's use a small chunk size so that we get many vectors.

In [None]:
import semchunk
import tiktoken

class Chunk:
    def __init__(self, text, doc_name, embedding):
        self.text = text
        self.doc_name = doc_name
        self.embedding = embedding

# Chunk the files into smaller portions
corpus_chunks = {}
doc_names = list(files.keys())
encoder = tiktoken.encoding_for_model('gpt-4')

print("Chunking plays...")
for play, text in files.items():
    corpus_chunks[play] = [Chunk(t, play, None) for t in semchunk.chunk(text, chunk_size = 64, token_counter=lambda text: len(encoder.encode(text)))]

print("Embedding chunks...")
for play in doc_names:
    for chunk in corpus_chunks[play]:
        chunk.embedding = model.encode(chunk.text)

print("Complete")

for play in doc_names:
    print(f'{play}: {len(corpus_chunks[play])} chunks')

In [None]:
# Combine all plays' chunks into a single list
chunks = [chunk for play in doc_names for chunk in corpus_chunks[play]]
print(sum([len(corpus_chunks[play]) for play in doc_names]))
print(len(chunks))

Let's create functions to search through those chunked plays for relevant sections.

In [None]:
import numpy as np

def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    return dot_product / (norm_vector1 * norm_vector2)

def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, snippet, similarity):
            self.name = name
            self.snippet = snippet
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    relevances = []
    for chunk in chunks:
        similarity = cosine_similarity(query_embedding, chunk.embedding)
        relevances.append(Relevance(chunk.doc_name, chunk.text, similarity))
    relevances.sort(key=lambda x: x.similarity, reverse=True)
    for answer in relevances[:5]:
        print(f'{answer.name} ({answer.similarity}): {answer.snippet[:500]}\n')
    print()

In [None]:
%%time
check_document_relevance("What was Romeo's last name?")
check_document_relevance("Who stabbed Julius Caesar?")

We have a mere 17k vectors, yet it takes hundreds of milliseconds to run two queries. This begins to show the challenge of using cosine similarity directly (its cost scales linearly with the number of vectors) and the need for a more efficient algorithm.
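
Part of the cost above comes from the Python-level loop rather than from the arithmetic itself. As a minimal sketch, the same brute-force comparison can be written as a single matrix-vector product. The random vectors are a stand-in for the chunk embeddings, and `top_k_cosine` is a hypothetical helper that is not part of this notebook. This is usually much faster in practice, but the work still grows linearly with the number of vectors.

```python
import numpy as np

def top_k_cosine(query, matrix, k=5):
    """Return (indices, similarities) of the k rows of `matrix` most similar to `query`."""
    # Normalize the rows and the query so that a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarities = matrix_unit @ query_unit   # one pass over all N vectors
    top = np.argsort(-similarities)[:k]       # highest similarity first
    return top, similarities[top]

# Stand-in data: 1,000 random 768-dimensional vectors (not real embeddings)
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 768)).astype("float32")
# A query that is a slightly perturbed copy of row 42
query = vectors[42] + 0.01 * rng.standard_normal(768).astype("float32")

indices, sims = top_k_cosine(query, vectors)
print(indices, sims)
```

Because the query is a noisy copy of row 42, that row should come back as the best match.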

## Example usage

FAISS requires an index. There are many ways of indexing vectors; this notebook explores some of the simpler ones.

### Flat L2 index

Flat L2 indexes rank vectors by their Euclidean (L2) distance from the query.
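
Conceptually, a flat index does no clustering or compression: it keeps every vector and compares the query against all of them. The NumPy sketch below illustrates that idea on random stand-in data. `flat_l2_search` is a hypothetical helper written for this explanation, not FAISS code. It returns squared Euclidean distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np

def flat_l2_search(queries, vectors, k=5):
    """Exhaustive k-nearest-neighbor search using squared Euclidean distance."""
    # ||q - v||^2 = ||q||^2 - 2 q.v + ||v||^2, computed for every query/vector pair
    q_norms = np.sum(queries ** 2, axis=1, keepdims=True)   # shape (n_queries, 1)
    v_norms = np.sum(vectors ** 2, axis=1)                  # shape (n_vectors,)
    sq_dists = q_norms - 2.0 * queries @ vectors.T + v_norms
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative values caused by rounding
    indices = np.argsort(sq_dists, axis=1)[:, :k]           # smallest distance first
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices

# Stand-in data: 500 random 64-dimensional vectors (not real embeddings)
rng = np.random.default_rng(1)
vectors = rng.standard_normal((500, 64)).astype("float32")
queries = vectors[[7, 123]]  # query with two vectors that are already in the set

distances, indices = flat_l2_search(queries, vectors, k=3)
print(indices)
print(distances)
```

The helper returns `(distances, indices)` in that order to mirror the shape of the results from `index.search`. Each query's nearest neighbor should be itself, at a distance of roughly zero.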

In [None]:
import faiss
import numpy

# Create the index
index = faiss.IndexFlatL2(len(chunks[0].embedding))

# Flat L2 index doesn't need to be trained since it is not specific to the data set
print(index.is_trained)

# Add embeddings (FAISS expects a 2D float32 array)
index.add(numpy.array([c.embedding for c in chunks], dtype="float32"))
print(index.ntotal)

Now, as before, we can search but we'll use `index.search` instead of cosine similarity.

In [None]:
def check_document_relevance_faiss(query):
    class Relevance:
        def __init__(self, name, snippet, similarity):
            self.name = name
            self.snippet = snippet
            self.similarity = similarity

    query_embedding = model.encode([query])

    print(f'Relevance for "{query}":')
    print('--------------')

    # Searching returns both the distances of each vector from the query vector
    # and the indices of those nearest neighbors
    distances, indices = index.search(query_embedding, 5)
    
    # Note that relevance with FAISS is different from cosine similarity.
    # IndexFlatL2 returns squared L2 distances, so *lower* values are more
    # relevant instead of higher ones.
    relevances = [Relevance(chunks[i].doc_name, chunks[i].text, distances[0][j]) for j, i in enumerate(indices[0])]

    for answer in relevances[:5]:
        print(f'{answer.name} ({answer.similarity}): {answer.snippet[:500]}\n')
    print()
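
The two kinds of score are related. For unit-length vectors, squared L2 distance and cosine similarity satisfy `||a - b||^2 = 2 - 2 * cos(a, b)`, so both produce the same ranking, just sorted in opposite directions. The short check below uses random stand-in vectors. Embeddings from a model are not necessarily unit length, in which case the two rankings can differ.

```python
import numpy as np

# Stand-in data: random vectors, not real embeddings
rng = np.random.default_rng(2)
vectors = rng.standard_normal((200, 32))
query = rng.standard_normal(32)

# Scale everything to unit length
vectors_unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

cosine = vectors_unit @ query_unit                         # higher is more similar
sq_l2 = np.sum((vectors_unit - query_unit) ** 2, axis=1)   # lower is more similar

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cosine)

# So the closest vectors by distance are also the most similar by cosine
top_by_l2 = np.argsort(sq_l2)[:5]
top_by_cosine = np.argsort(-cosine)[:5]

print(identity_holds)
print(top_by_l2, top_by_cosine)
```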

In [None]:
%%time
check_document_relevance_faiss("What was Romeo's last name?")
check_document_relevance_faiss("Who stabbed Julius Caesar?")

Much better. But it can still be improved!