# Facebook AI Similarity Search

Comparing two vectors is easy, and cosine similarity is a simple and effective way to do it. But what if you have thousands or even millions of vectors and need to find the ones most similar to a given query vector? The time required to compare the query against every vector (via cosine similarity) scales linearly with the number of vectors. Comparing against a hundred vectors is not a problem, but even our simple searches through the plays in the previous workbook took ~200ms to run, and on a much larger corpus a single search would take seconds or tens of seconds. This is the problem that similarity search libraries solve. [Facebook AI Similarity Search (FAISS)](https://ai.meta.com/tools/faiss/) is one such library: it builds an index over the vectors to speed up the search.



## Motivation

In [3]:
from sentence_transformers import SentenceTransformer

# Available models: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
# model = SentenceTransformer('bert-base-nli-mean-tokens')
# model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')


In [1]:
# Corpus of documents to search
import os

directory = os.path.join(".", "plays")  # avoids the invalid "\p" escape and works on any OS
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read()
            files[file_name] = file_contents

For FAISS exploration, let's use a small chunk size so that we get many vectors.

In [4]:
import semchunk
import tiktoken

class Chunk:
    def __init__(self, text, doc_name, embedding):
        self.text = text
        self.doc_name = doc_name
        self.embedding = embedding

# Chunk the files into smaller portions
corpus_chunks = {}
doc_names = list(files.keys())
encoder = tiktoken.encoding_for_model('gpt-4')

print("Chunking plays...")
for play, text in files.items():
    corpus_chunks[play] = [Chunk(t, play, None) for t in semchunk.chunk(text, chunk_size = 64, token_counter=lambda text: len(encoder.encode(text)))]

print("Embedding chunks...")
for play in doc_names:
    for chunk in corpus_chunks[play]:
        chunk.embedding = model.encode(chunk.text)

print("Complete")

for play in doc_names:
    print(f'{play}: {len(corpus_chunks[play])} chunks')

Chunking plays...
Embedding chunks...
Complete
A Midsummer Night's Dream: 488 chunks
All's Well That Ends Well: 644 chunks
Antony and Cleopatra: 806 chunks
As You Like It: 599 chunks
Cymbeline: 803 chunks
King Lear: 797 chunks
Loves Labours Lost: 640 chunks
Measure for Measure: 631 chunks
Much Ado About Nothing: 598 chunks
Othello the Moore of Venice: 783 chunks
Pericles Prince of Tyre: 539 chunks
Romeo and Juliet: 724 chunks
The Comedy of Errors: 446 chunks
The Life and Death of Julius Caesar: 580 chunks
The Merchant of Venice: 582 chunks
The Merry Wives of Windsor: 692 chunks
The Taming of the Shrew: 627 chunks
The Tempest: 488 chunks
The Tragedy of Coriolanus: 820 chunks
The Tragedy of Hamlet Prince of Denmark: 893 chunks
The Tragedy of Macbeth: 523 chunks
Timon of Athens: 550 chunks
Titus Andronicus: 610 chunks
Troilus and Cressida: 806 chunks
Twelfth Night: 600 chunks
Two Gentlemen of Verona: 491 chunks
Winter's Tale: 682 chunks


In [9]:
# Combine all plays' chunks into a single list
chunks = [chunk for play in doc_names for chunk in corpus_chunks[play]]
print(sum([len(corpus_chunks[play]) for play in doc_names]))
print(len(chunks))

17442
17442


Let's create functions to search through those chunked plays for relevant sections.

In [18]:
import numpy as np

def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    return dot_product / (norm_vector1 * norm_vector2)

def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, snippet, similarity):
            self.name = name
            self.snippet = snippet
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    relevances = []
    for chunk in chunks:
        similarity = cosine_similarity(query_embedding, chunk.embedding)
        relevances.append(Relevance(chunk.doc_name, chunk.text, similarity))
    relevances.sort(key=lambda x: x.similarity, reverse=True)
    for answer in relevances[:5]:
        print(f'{answer.name} ({answer.similarity}): {answer.snippet[:500]}\n')
    print()

In [23]:
%%time
check_document_relevance("What was Romeo's last name?")
check_document_relevance("Who stabbed Julius Caesar?")

Relevance for "What was Romeo's last name?":
--------------
Romeo and Juliet (0.7419300675392151): Enter ROMEO

Romeo and Juliet (0.6840879917144775): JULIET
O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO

Romeo and Juliet (0.6790132522583008): ROMEO
Wilt thou provoke me? then have at thee, boy!
They fight

Romeo and Juliet (0.6586402654647827): Why dost thou stay?
Exit ROMEO

Romeo and Juliet (0.6539598107337952): Not Romeo, prince, he was Mercutio's friend;
His fault concludes but what the law should end,
The life of Tybalt.
PRINCE
And for that offence
Immediately we do exile him hence:
I have an interest in your hate's proceeding,


Relevance for "Who stabbed Julius Caesar?":
--------------
The Life and Death of Julius Caesar (0.7105638384819031): Caesar, thou art revenged,
Even with the sword that kill'd thee.
Dies

The Life and Death of Julius Caesar (0.67732590436935

We have a mere 17k vectors, yet it takes hundreds of milliseconds to run two queries. This begins to show the cost of brute-force cosine similarity, which scales linearly with the number of vectors, and the need for a more efficient approach.
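
Part of that cost is the Python loop, which calls `cosine_similarity` once per chunk. As a minimal sketch (using random stand-in vectors rather than the play embeddings, and a hypothetical helper named `top_k_cosine`), the same brute-force search can be written as one matrix product. It is still linear in the number of vectors, so it only lowers the constant factor and does not remove the scaling problem.

```python
import numpy as np

def top_k_cosine(query, matrix, k=5):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`."""
    # Norm of every stored vector, computed in one call
    matrix_norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query)
    # One matrix-vector product yields every dot product at once
    scores = (matrix @ query) / (matrix_norms * query_norm)
    # Sort descending by score and keep the first k positions
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Assumption: random vectors stand in for real embeddings; 768 is an arbitrary dimension
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 768)).astype(np.float32)
# A query that is a slightly perturbed copy of row 42
query = vectors[42] + 0.01 * rng.normal(size=768).astype(np.float32)

indices, scores = top_k_cosine(query, vectors)
print(indices[0], scores[0])
```

With the notebook's data, `matrix` would be the stacked chunk embeddings and `query` the encoded question.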

## Example usage

FAISS requires an index. There are many ways of indexing vectors; this notebook explores some of the simpler ones.

### Flat L2 index

A flat L2 index stores the vectors as-is and ranks them by their L2 (Euclidean) distance to the query.

In [39]:
import faiss
import numpy

# Create the index
index = faiss.IndexFlatL2(len(chunks[0].embedding))

# A flat L2 index doesn't need to be trained, since it is not specific to the data set
print(index.is_trained)

# Add embeddings
index.add(numpy.array([c.embedding for c in chunks]))
print(index.ntotal)

True
17442
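
Note that L2 distance and cosine similarity are not the same measure, so the ranking from this index may differ from the earlier cosine ranking. The two are closely related for unit-length vectors, where the squared L2 distance equals `2 - 2 * cosine`. The sketch below checks that identity on two random vectors (illustrative values, not embeddings from the plays). Whether to normalize the embeddings before adding them to the index is a design choice this notebook does not make.

```python
import numpy as np

# Assumption: two arbitrary random vectors, used only to illustrate the identity
rng = np.random.default_rng(1)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit vectors the dot product is the cosine similarity
cosine = float(np.dot(a_unit, b_unit))
squared_l2 = float(np.sum((a_unit - b_unit) ** 2))

# The two printed values should agree up to floating-point error
print(squared_l2, 2 - 2 * cosine)
```

So on normalized vectors, the smallest L2 distance and the largest cosine similarity pick out the same neighbours.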


Now, as before, we can search but we'll use `index.search` instead of cosine similarity.

In [None]:
def check_document_relevance_faiss(query):
    # Encoding a list returns a 2D array (one row per query), as index.search expects
    query_embedding = model.encode([query])

    print(f'Relevance for "{query}":')
    print('--------------')

    # search returns one row per query; for IndexFlatL2 the values are
    # squared L2 distances sorted ascending (smaller means more similar)
    distances, indices = index.search(query_embedding, 5)

    # Map each returned position back to the chunk it was added from
    for distance, i in zip(distances[0], indices[0]):
        chunk = chunks[i]
        print(f'{chunk.doc_name} ({distance}): {chunk.text[:500]}\n')
    print()
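
`index.search` returns two arrays with one row per query: the distances and the positions of the nearest vectors. For readers who want to see the shape of that result without FAISS installed, here is a NumPy-only sketch of an exhaustive squared-L2 search. The function `flat_l2_search` and the random data are assumptions for illustration; this is not FAISS's implementation.

```python
import numpy as np

def flat_l2_search(data, queries, k):
    """Exhaustive search: return (distances, indices), each of shape (n_queries, k)."""
    # Pairwise differences via broadcasting: shape (n_queries, n_data, dim)
    diffs = queries[:, None, :] - data[None, :, :]
    # Squared L2 distance between every query and every stored vector
    squared = np.sum(diffs ** 2, axis=2)
    # Positions of the k smallest distances per query, in ascending order
    indices = np.argsort(squared, axis=1)[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices

# Assumption: random stand-in data instead of the play embeddings
rng = np.random.default_rng(2)
data = rng.normal(size=(200, 16)).astype(np.float32)
queries = data[[7]]  # a single query, kept 2D with shape (1, 16)

distances, indices = flat_l2_search(data, queries, k=5)
print(distances.shape, indices.shape)
print(indices[0], distances[0])
```

Because the query here is a copy of row 7, that row comes back first with a distance of zero. Results for a single query are read from row `0` of each array.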