# SBERT

Whereas TF-IDF and BM25 represent documents as sparse, high-dimensional vectors, [SBERT (Sentence-BERT)](https://www.sbert.net/) produces dense, low-dimensional ones. TF-IDF and BM25 are purely token-based, so two texts only match when they share words. SBERT is built on the pre-trained BERT transformer network, so its vectors capture semantic meaning: "puppy" and "dog" end up close together when vectorized. The model used here converts text into a 768-dimensional vector. This notebook demonstrates how to use SBERT to compute the similarity between two texts.
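
To make the "purely token-based" limitation concrete, here is a minimal sketch using plain word counts as a stand-in for TF-IDF or BM25 term vectors. The helper names `bag_of_words` and `sparse_cosine` are hypothetical and the tokenizer is just whitespace splitting, so treat this as an illustration rather than the implementation of either algorithm.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a crude stand-in for a term vector)."""
    return Counter(text.lower().split())

def sparse_cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# No shared tokens, so the similarity is zero even though the words are related.
print(sparse_cosine(bag_of_words("puppy"), bag_of_words("dog")))

# One of the two tokens is shared, so the similarity is roughly one half.
print(sparse_cosine(bag_of_words("the dog"), bag_of_words("a dog")))
```

A dense embedding model has no such hard requirement for shared tokens, which is what lets related words score as similar.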

Unlike TF-IDF and BM25, SBERT relies on a pre-trained model, so we can't build it from scratch without dependencies. Instead, we'll use the `sentence-transformers` library, which provides a simple interface to SBERT models and is built on top of Hugging Face `transformers`.

In [1]:
from sentence_transformers import SentenceTransformer

# Available models: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
# model = SentenceTransformer('bert-base-nli-mean-tokens')
# model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# model = SentenceTransformer('sentence-transformers/all-roberta-large-v1')


## Example usage

In [2]:
# Corpus of documents to search
import os

directory = os.path.join(".", "plays")
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read()
            files[file_name] = file_contents

Creating embeddings is *significantly* faster than vectorizing with the sparse vector algorithms: where generating TF-IDF or BM25 vectors took minutes, SBERT generates embeddings for the whole corpus in seconds.

In [3]:
corpus_embeddings = model.encode(list(files.values()))
print(corpus_embeddings[:5])
print(corpus_embeddings.shape)

[[-0.19115174 -0.18838881 -0.2232427  ...  0.1413402   0.08265044
  -0.3331467 ]
 [ 0.1953782   0.0505964  -0.17819564 ... -0.01003303  0.24612486
  -0.08150145]
 [ 0.04100218  0.21431017 -0.21289207 ... -0.0960598  -0.01074905
  -0.15659717]
 [-0.19601807  0.4239136  -0.24163108 ...  0.09692784  0.09248476
  -0.12221422]
 [ 0.10492685  0.19338897 -0.25364563 ... -0.07438459  0.15227574
  -0.08392614]]
(27, 768)


In [4]:
import numpy as np

def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    return dot_product / (norm_vector1 * norm_vector2)
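
As a quick sanity check on what this measure returns, the sketch below recomputes the same formula in plain Python (no NumPy, so it runs on its own) for three toy 2-D vectors. The function name `cosine` and the vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Plain-Python version of the same formula: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

a = [3.0, 4.0]
print(cosine(a, [6.0, 8.0]))    # same direction, twice the length: similarity of 1
print(cosine(a, [-4.0, 3.0]))   # perpendicular: similarity of 0
print(cosine(a, [-3.0, -4.0]))  # opposite direction: similarity of -1
```

Because both vectors are divided by their lengths, only direction matters: scaling an embedding leaves its cosine similarity to any other vector unchanged.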

In [8]:
def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, similarity):
            self.name = name
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    doc_names = list(files.keys())
    top_answers = [Relevance(doc_names[i], cosine_similarity(query_embedding, corpus_embeddings[i])) for i in range(len(files))]
    top_answers.sort(key=lambda x: x.similarity, reverse=True)
    for answer in top_answers[:5]:
        print(f'{answer.name}: {answer.similarity}')
    print()

In [6]:
check_document_relevance("Which play has three witches?")
check_document_relevance("Which play has a friar as an important character?")
check_document_relevance("Which play has a character named Caesar?")
check_document_relevance("Caesar?")
check_document_relevance("Which play is set in Denmark?")
check_document_relevance(
"To be or not to be—that is the question \
Whether ’tis nobler in the mind to suffer \
The slings and arrows of outrageous fortune, \
Or to take arms against a sea of troubles \
And, by opposing, end them. To die, to sleep— \
No more—and by a sleep to say we end \
The heartache and the thousand natural shocks \
That flesh is heir to—’tis a consummation \
Devoutly to be wished. To die, to sleep— \
To sleep, perchance to dream. Ay, there’s the rub, \
For in that sleep of death what dreams may come, \
When we have shuffled off this mortal coil, \
Must give us pause.")

Relevance for "Which play has three witches?":
--------------
The Tragedy of Macbeth: 0.6324524283409119
The Taming of the Shrew: 0.5362995266914368
The Tempest: 0.5056280493736267
Twelfth Night: 0.49261975288391113
A Midsummer Night's Dream: 0.48657506704330444

Relevance for "Which play has a friar as an important character?":
--------------
Twelfth Night: 0.5354350805282593
Two Gentlemen of Verona: 0.5252188444137573
Much Ado About Nothing: 0.5229831337928772
Romeo and Juliet: 0.5184388756752014
The Merchant of Venice: 0.5100202560424805

Relevance for "Which play has a character named Caesar?":
--------------
The Tragedy of Coriolanus: 0.5594958662986755
Antony and Cleopatra: 0.5590074062347412
The Life and Death of Julius Caesar: 0.5457943081855774
Cymbeline: 0.5422677993774414
Timon of Athens: 0.5002044439315796

Relevance for "Caesar?":
--------------
Antony and Cleopatra: 0.527336835861206
The Tragedy of Coriolanus: 0.5160210132598877
Titus Andronicus: 0.4949684739112854
The Li

It does OK (though apparently it hasn't read Hamlet!), but SBERT is really meant for sentence- or paragraph-level similarity, so asking questions about a much larger document isn't its intended use case. Transformer encoders accept only a limited number of tokens, and `sentence-transformers` truncates longer inputs, so each play's embedding most likely reflects little more than its opening lines. Curiously, TF-IDF is the best-performing 'which document talks about X' algorithm, even though it lacks SBERT's more advanced features (semantically aware dense vectors, etc.).

## Chunking

Let's try chunking the text into smaller pieces and see if we can get better results.
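
Before the real thing, here is a minimal sketch of the idea behind chunking: slide a fixed-size window over the text so that every piece is small enough to embed in full. The helper `chunk_words` is hypothetical and counts whitespace-separated words rather than model tokens; the cell that follows uses scene splits and the `semchunk` library instead.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Toy input: ten numbered "words", windows of four with one word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The default sizes here are arbitrary assumptions; in practice the chunk size should stay below the embedding model's input limit.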

In [11]:
import semchunk
import tiktoken

class Chunk:
    def __init__(self, text, doc_name, embedding):
        self.text = text
        self.doc_name = doc_name
        self.embedding = embedding

# Chunk the files into smaller portions
corpus_scenes = {}
corpus_chunks = {}
doc_names = list(files.keys())
encoder = tiktoken.encoding_for_model('gpt-4')

print("Chunking plays...")
for play, text in files.items():
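    # Split on scene headings, dropping fragments too short to be a real scene
    corpus_scenes[play] = [Chunk(t, play, None) for t in text.split('SCENE ') if len(t) > 10]
    # Split into chunks of at most 512 tokens, counted with the tiktoken encoder
    corpus_chunks[play] = [Chunk(t, play, None) for t in semchunk.chunk(text, chunk_size=512, token_counter=lambda s: len(encoder.encode(s)))]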

print("Embedding chunks...")
for play in doc_names:
    for chunk in corpus_scenes[play]:
        chunk.embedding = model.encode(chunk.text)
    for chunk in corpus_chunks[play]:
        chunk.embedding = model.encode(chunk.text)

print("Complete")

for play in doc_names:
    print(f'{play}: {len(corpus_scenes[play])} scenes, {len(corpus_chunks[play])} chunks')

Chunking plays...
Embedding chunks...
Complete
A Midsummer Night's Dream: 9 scenes, 73 chunks
All's Well That Ends Well: 23 scenes, 100 chunks
Antony and Cleopatra: 42 scenes, 115 chunks
As You Like It: 22 scenes, 88 chunks
Cymbeline: 26 scenes, 116 chunks
King Lear: 26 scenes, 118 chunks
Loves Labours Lost: 9 scenes, 96 chunks
Measure for Measure: 17 scenes, 94 chunks
Much Ado About Nothing: 17 scenes, 98 chunks
Othello the Moore of Venice: 15 scenes, 116 chunks
Pericles Prince of Tyre: 23 scenes, 80 chunks
Romeo and Juliet: 25 scenes, 103 chunks
The Comedy of Errors: 11 scenes, 67 chunks
The Life and Death of Julius Caesar: 18 scenes, 87 chunks
The Merchant of Venice: 20 scenes, 93 chunks
The Merry Wives of Windsor: 23 scenes, 97 chunks
The Taming of the Shrew: 14 scenes, 96 chunks
The Tempest: 9 scenes, 74 chunks
The Tragedy of Coriolanus: 29 scenes, 121 chunks
The Tragedy of Hamlet Prince of Denmark: 20 scenes, 135 chunks
The Tragedy of Macbeth: 28 scenes, 77 chunks
Timon of Athens

In [31]:
def check_document_relevance(query, chunks):
    class Relevance:
        def __init__(self, name, snippet, similarity):
            self.name = name
            self.snippet = snippet
            self.similarity = similarity

    query_embedding = model.encode(query)

    print(f'Relevance for "{query}":')
    print('--------------')

    relevances = []
    for play in chunks:
        for chunk in play:
            similarity = cosine_similarity(query_embedding, chunk.embedding)
            relevances.append(Relevance(chunk.doc_name, chunk.text, similarity))
    relevances.sort(key=lambda x: x.similarity, reverse=True)
    for answer in relevances[:3]:
        print(f'{answer.name} ({answer.similarity}): {answer.snippet[:500]}\n')
    print()

In [33]:
check_document_relevance("Which play has three witches?", corpus_chunks.values())
check_document_relevance("Which play has a friar as an important character?", corpus_chunks.values())
check_document_relevance("Which play has a character named Caesar?", corpus_chunks.values())
check_document_relevance("Which play is set in Denmark?", corpus_chunks.values())

Relevance for "Which play has three witches?":
--------------
The Tragedy of Macbeth (0.6738752126693726): ACT I
SCENE I. A desert place.
Thunder and lightning. Enter three Witches
First Witch
When shall we three meet again
In thunder, lightning, or in rain?
Second Witch
When the hurlyburly's done,
When the battle's lost and won.
Third Witch
That will be ere the set of sun.
First Witch
Where the place?
Second Witch
Upon the heath.
Third Witch
There to meet with Macbeth.
First Witch
I come, Graymalkin!
Second Witch
Paddock calls.
Third Witch
Anon.
ALL
Fair is foul, and foul is fair:
Hover through the 

The Tragedy of Macbeth (0.6738533973693848): SCENE III. A heath near Forres.
Thunder. Enter the three Witches
First Witch
Where hast thou been, sister?
Second Witch
Killing swine.
Third Witch
Sister, where thou?
First Witch
A sailor's wife had chestnuts in her lap,
And munch'd, and munch'd, and munch'd:--
'Give me,' quoth I:
'Aroint thee, witch!' the rump-fed ronyon cries.
Her husband's t

In [34]:
check_document_relevance("Which play has three witches?", corpus_scenes.values())
check_document_relevance("Which play has a friar as an important character?", corpus_scenes.values())
check_document_relevance("Which play has a character named Caesar?", corpus_scenes.values())
check_document_relevance("Which play is set in Denmark?", corpus_scenes.values())

Relevance for "Which play has three witches?":
--------------
The Tragedy of Macbeth (0.6772112250328064): I. A desert place.
Thunder and lightning. Enter three Witches
First Witch
When shall we three meet again
In thunder, lightning, or in rain?
Second Witch
When the hurlyburly's done,
When the battle's lost and won.
Third Witch
That will be ere the set of sun.
First Witch
Where the place?
Second Witch
Upon the heath.
Third Witch
There to meet with Macbeth.
First Witch
I come, Graymalkin!
Second Witch
Paddock calls.
Third Witch
Anon.
ALL
Fair is foul, and foul is fair:
Hover through the fog and filt

The Tragedy of Macbeth (0.6677723526954651): III. A heath near Forres.
Thunder. Enter the three Witches
First Witch
Where hast thou been, sister?
Second Witch
Killing swine.
Third Witch
Sister, where thou?
First Witch
A sailor's wife had chestnuts in her lap,
And munch'd, and munch'd, and munch'd:--
'Give me,' quoth I:
'Aroint thee, witch!' the rump-fed ronyon cries.
Her husband's to Alep

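Chunking worked much better: both the token-sized chunks and the scene splits put the relevant Macbeth scenes at the top for the witches query. We could refine things further by counting how many chunks in a given play are highly relevant to the query and using that count to determine the most relevant play.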
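
One way that refinement might look is sketched below. It assumes the per-chunk similarities have already been computed and are available as `(play_name, similarity)` pairs. The function name `rank_plays`, the `0.5` threshold, and the sample scores are all made up for illustration and are not results from this notebook.

```python
def rank_plays(scored_chunks, threshold=0.5):
    """Rank plays by how many of their chunks score at or above `threshold`.

    `scored_chunks` is an iterable of (play_name, similarity) pairs.
    Ties are broken by each play's best single-chunk similarity.
    Plays with no chunk above the threshold are left out.
    """
    hits = {}
    best = {}
    for play, similarity in scored_chunks:
        if similarity >= threshold:
            hits[play] = hits.get(play, 0) + 1
            best[play] = max(best.get(play, similarity), similarity)
    return sorted(hits, key=lambda play: (hits[play], best[play]), reverse=True)

# Made-up scores, for illustration only.
scores = [
    ("The Tragedy of Macbeth", 0.67),
    ("The Tragedy of Macbeth", 0.66),
    ("The Tempest", 0.52),
    ("The Tragedy of Hamlet Prince of Denmark", 0.31),
]
print(rank_plays(scores))
```

With the objects defined earlier, the pairs could be built from each chunk's `doc_name` and its cosine similarity to the query embedding. The threshold would need tuning, since similarity scores in the runs above cluster fairly tightly.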