# Sentence Transformers (SBERT, etc.)

Whereas document vectors built with TF-IDF or BM25 are sparse and high-dimensional, [SBERT (Sentence-BERT)](https://www.sbert.net/) and other sentence transformers produce dense, low-dimensional representations of text. SBERT builds on the pre-trained BERT transformer network. TF-IDF and BM25 are purely token-based, whereas sentence transformers preserve semantic meaning, so "puppy" and "dog" end up close together when vectorized. The default model used in this notebook converts text into a 768-dimensional vector. This notebook demonstrates how to use sentence transformers to compute the similarity between two texts.
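
To make the "purely token-based" point concrete, here is a minimal sketch that uses raw token counts as a crude stand-in for TF-IDF (it is not the TF-IDF or BM25 implementation from elsewhere, and the helper names `bag_of_words` and `sparse_cosine` are made up for illustration). Two texts with no tokens in common score exactly zero, however closely related their meanings are.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a crude stand-in for TF-IDF)."""
    return Counter(text.lower().split())

def sparse_cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_tokens = sparse_cosine(bag_of_words("the dog barked"), bag_of_words("the dog slept"))
no_shared_tokens = sparse_cosine(bag_of_words("puppy"), bag_of_words("dog"))

print(same_tokens)       # nonzero: the sentences share "the" and "dog"
print(no_shared_tokens)  # 0.0: no tokens in common, despite the related meaning
```

A dense embedding model is expected to place "puppy" and "dog" near each other instead, which is the gap sentence transformers are meant to close.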

Unlike TF-IDF and BM25, sentence transformers rely on pre-trained models, so we can't build them from scratch without dependencies. Instead, we'll use the `sentence-transformers` library, which provides a simple interface for encoding text with these models and is built on top of Hugging Face Transformers.

In [None]:
from sentence_transformers import SentenceTransformer

# Available models: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
# model = SentenceTransformer('bert-base-nli-mean-tokens')
# model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')

## Example usage

In [None]:
# Corpus of documents to search
import os

directory = os.path.join(".", "plays")  # portable path; avoids the invalid "\p" escape sequence
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read()
            files[file_name] = file_contents

Creating embeddings is *significantly* faster than vectorizing with the sparse vector algorithms. TF-IDF and BM25 vectors took minutes to generate, whereas SBERT generates embeddings in seconds.

In [None]:
corpus_embeddings = model.encode(list(files.values()))
print(corpus_embeddings[:5])
print(corpus_embeddings.shape)

In [None]:
import numpy as np

def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    return dot_product / (norm_vector1 * norm_vector2)
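
For reference, the function above computes the standard cosine similarity: the dot product of the two vectors divided by the product of their Euclidean norms. A small worked example, using made-up vectors $\mathbf{a} = (1, 2, 2)$ and $\mathbf{b} = (2, 1, 2)$ rather than real model output:

$$
\cos\theta
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1^2 + 2^2 + 2^2} \, \sqrt{2^2 + 1^2 + 2^2}}
= \frac{8}{3 \cdot 3}
\approx 0.889
$$

The result ranges from $-1$ (opposite directions) to $1$ (same direction) and ignores vector length, so only the direction of the embeddings matters.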

In [None]:
def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, similarity):
            self.name = name
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    doc_names = list(files.keys())
    top_answers = [Relevance(doc_names[i], cosine_similarity(query_embedding, corpus_embeddings[i])) for i in range(len(files))]
    top_answers.sort(key=lambda x: x.similarity, reverse=True)
    for answer in top_answers[:5]:
        print(f'{answer.name}: {answer.similarity}')
    print()

In [None]:
check_document_relevance("Which play has three witches?")
check_document_relevance("Which play has a friar as an important character?")
check_document_relevance("Which play has a character named Caesar?")
check_document_relevance("Caesar?")
check_document_relevance("Which play is set in Denmark?")
check_document_relevance(
"To be or not to be—that is the question \
Whether ’tis nobler in the mind to suffer \
The slings and arrows of outrageous fortune, \
Or to take arms against a sea of troubles \
And, by opposing, end them. To die, to sleep— \
No more—and by a sleep to say we end \
The heartache and the thousand natural shocks \
That flesh is heir to—’tis a consummation \
Devoutly to be wished. To die, to sleep— \
To sleep, perchance to dream. Ay, there’s the rub, \
For in that sleep of death what dreams may come, \
When we have shuffled off this mortal coil, \
Must give us pause.")

It does OK (though apparently it hasn't read Hamlet!), but SBERT is really meant for sentence- or paragraph-level similarity, so asking questions about a much larger document isn't its intended use case. The model also truncates any input longer than its maximum sequence length (`model.max_seq_length`), so each embedding above reflects only the opening of a play. Curiously, TF-IDF is the best-performing 'which document talks about X' algorithm, even though it lacks many of SBERT's more advanced features (semantically aware dense vectors, etc.).

## Chunking

Let's try chunking the text into smaller pieces and see if we can get better results.
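
Before using a library for this, it may help to see the basic idea in isolation. The sketch below is a deliberately simplified, hypothetical chunker (`chunk_words` is not part of `semchunk` or any other library) that splits on whitespace into fixed-size, overlapping word windows. It counts words rather than model tokens and ignores sentence or scene boundaries.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, overlapping by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny illustration with placeholder words w0..w9
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
for piece in pieces:
    print(piece)
```

The overlap means a line that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.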

In [None]:
import semchunk
import tiktoken

class Chunk:
    def __init__(self, text, doc_name, embedding):
        self.text = text
        self.doc_name = doc_name
        self.embedding = embedding

# Chunk the files into smaller portions
corpus_scenes = {}
corpus_chunks = {}
doc_names = list(files.keys())
encoder = tiktoken.encoding_for_model('gpt-4')

print("Chunking plays...")
for play, text in files.items():
    corpus_scenes[play] = [Chunk(t, play, None) for t in text.split('SCENE ') if len(t) > 10]
    corpus_chunks[play] = [Chunk(t, play, None) for t in semchunk.chunk(text, chunk_size = 512, token_counter=lambda text: len(encoder.encode(text)))]

print("Embedding chunks...")
for play in doc_names:
    for chunk in corpus_scenes[play]:
        chunk.embedding = model.encode(chunk.text)
    for chunk in corpus_chunks[play]:
        chunk.embedding = model.encode(chunk.text)

print("Complete")

for play in doc_names:
    print(f'{play}: {len(corpus_scenes[play])} scenes, {len(corpus_chunks[play])} chunks')

In [None]:
def check_document_relevance(query, chunks):
    class Relevance:
        def __init__(self, name, snippet, similarity):
            self.name = name
            self.snippet = snippet
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    relevances = []
    for play in chunks:
        for chunk in play:
            similarity = cosine_similarity(query_embedding, chunk.embedding)
            relevances.append(Relevance(chunk.doc_name, chunk.text, similarity))
    relevances.sort(key=lambda x: x.similarity, reverse=True)
    for answer in relevances[:3]:
        print(f'{answer.name} ({answer.similarity}): {answer.snippet[:500]}\n')
    print()

In [None]:
check_document_relevance("Which play has three witches?", corpus_chunks.values())
check_document_relevance("Which play has a friar as an important character?", corpus_chunks.values())
check_document_relevance("Which play has a character named Caesar?", corpus_chunks.values())
check_document_relevance("Which play is set in Denmark?", corpus_chunks.values())
check_document_relevance("Who was Juliet's lover?", corpus_chunks.values())

In [None]:
check_document_relevance("Which play has three witches?", corpus_scenes.values())
check_document_relevance("Which play has a friar as an important character?", corpus_scenes.values())
check_document_relevance("Which play has a character named Caesar?", corpus_scenes.values())
check_document_relevance("Which play is set in Denmark?", corpus_scenes.values())
check_document_relevance("Who was Juliet's lover?", corpus_scenes.values())

Chunking worked much better. We could even refine things further by seeing how many different chunks in a given play have high relevance with the query and then use that to determine the most relevant play.
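
As a rough sketch of that refinement, the function below ranks plays by how many of their chunks score at or above a threshold, breaking ties by the best single-chunk score. The name `rank_plays`, the threshold of 0.5, and the similarity values are all made up for illustration. A sensible threshold would need to be tuned against real scores from the model.

```python
def rank_plays(scored_chunks, threshold=0.5):
    """Rank plays by the number of chunks scoring at or above threshold.

    scored_chunks: iterable of (doc_name, similarity) pairs.
    Ties are broken by each play's best single-chunk similarity.
    """
    hits = {}
    best = {}
    for doc_name, similarity in scored_chunks:
        hits[doc_name] = hits.get(doc_name, 0) + (1 if similarity >= threshold else 0)
        best[doc_name] = max(best.get(doc_name, float("-inf")), similarity)
    return sorted(hits, key=lambda name: (hits[name], best[name]), reverse=True)

# Made-up similarity scores, for illustration only
scores = [
    ("macbeth", 0.71), ("macbeth", 0.64), ("macbeth", 0.20),
    ("hamlet", 0.75), ("hamlet", 0.30),
    ("othello", 0.10),
]
ranking = rank_plays(scores)
print(ranking)
```

Here "macbeth" ranks first with two chunks above the threshold, even though "hamlet" has the single highest-scoring chunk. In the notebook, the input pairs could be built from each chunk's `doc_name` and its `cosine_similarity` with the query embedding.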