# BM25

[BM25 (aka Okapi BM25)](https://en.wikipedia.org/wiki/Okapi_BM25) is a ranking function that refines TF-IDF. "BM" stands for "Best Matching". With TF-IDF, a term's score increases linearly with its frequency in a document. BM25 adjusts this by dampening the impact of higher frequencies, which is useful because more occurrences of a word don't necessarily mean proportionally more relevance. It also normalizes for document length, so long documents are not favored simply for containing more words.

With TF-IDF, doubling the frequency of a word doubles its score. With BM25, doubling the frequency increases the score by less than double, and the gain keeps shrinking as the frequency grows.
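
To make the saturation concrete, here is a minimal sketch that isolates the term-frequency part of BM25 and prints it next to a plain linear count. The helper name `tf_component` and its `length_ratio` parameter (document length divided by average document length) are my own for illustration; the defaults mirror the `k` and `b` values used in this notebook.

```python
def tf_component(frequency, k=1.25, b=0.75, length_ratio=1.0):
    """Term-frequency part of BM25 for a document whose length is
    length_ratio times the average document length (hypothetical helper)."""
    return frequency * (k + 1) / (frequency + k * (1 - b + b * length_ratio))

# Compare linear growth with the saturating BM25 term-frequency component.
for frequency in (1, 2, 4, 8, 16, 32):
    print(f"frequency={frequency:>2}  linear={frequency:>2}  "
          f"bm25_tf={tf_component(frequency):.3f}")
```

No matter how large the frequency gets, this component stays below `k + 1`, which is what keeps a heavily repeated word from dominating the score.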

### BM25 Definition

In [31]:
import numpy as np

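def bm25(term, doc, corpus, corpus_length = -1, average_doc_length = -1, k = 1.25, b = 0.75):
    # doc is a list of lowercase tokens; corpus is a collection of such lists.
    term = term.lower()
    # -1 means "not supplied": compute the corpus statistics here instead.
    if corpus_length == -1:
        corpus_length = len(corpus)
    if average_doc_length == -1:
        average_doc_length = sum(len(d) for d in corpus) / corpus_length
    frequency = doc.count(term)

    # A term that never appears in the document contributes nothing.
    if frequency == 0:
        return 0.0

    # Term-frequency component: saturates as frequency grows (controlled by k)
    # and is normalized by document length (controlled by b).
    tf = (frequency * (k + 1)) / \
         (frequency + k * (1 - b + b * len(doc) / average_doc_length))

    # Number of documents in the corpus that contain the term.
    docs_with_term = sum(1 for d in corpus if term in d)

    # Inverse document frequency; the "+ 1" inside the log keeps it positive.
    idf = np.log((corpus_length - docs_with_term + 0.5) /
                 (docs_with_term + 0.5)
                 + 1)
    return tf * idf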
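
For reference, the function above can be written as a single formula. This is just a transcription of the code, and the symbols are my own notation: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.

$$
\mathrm{score}(t, d) =
\underbrace{\frac{f_{t,d}\,(k + 1)}{f_{t,d} + k\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}}_{\text{tf}}
\cdot
\underbrace{\ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)}_{\text{idf}}
$$

With $b = 0$ the document length has no effect, and as $f_{t,d}$ grows the first factor approaches $k + 1$.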

### Example usage

In [2]:
# Corpus of documents to search
import string
import os

directory = os.path.join(".", "plays")
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read().lower().translate(str.maketrans('', '', string.punctuation)).split()
            files[file_name] = file_contents


In [16]:
print(f'Romeo: {bm25("Romeo", files["Romeo and Juliet"], files.values())}')
print(f'Aristotle: {bm25("Aristotle", files["Romeo and Juliet"], files.values())}')
print(f'Mercutio: {bm25("Mercutio", files["Romeo and Juliet"], files.values())}')
print(f'Poison: {bm25("Poison", files["Romeo and Juliet"], files.values())}')

Romeo: 6.554749288102218
Aristotle: 0.0
Mercutio: 6.479393819617477
Poison: 0.6396043966805856


As shown above, BM25 can directly determine the relevance of specific terms to a document. It's also possible to vectorize text using BM25 so that documents can be compared for similarity. This is useful for information retrieval tasks based on larger input queries.

In [20]:
def bm25_vectorizer(doc, vocab, corpus, corpus_length = -1, average_doc_length = -1, k = 1.25, b = 0.75):
    ret = []
    for word in vocab:
        ret.append(bm25(word, doc, corpus, corpus_length, average_doc_length, k, b))
    return ret

full_vocab = set([term for doc in files.values() for term in doc])
len(full_vocab)

23443

As before, vectorizing large documents is slow. This takes over 12 minutes on my Surface Book.

In [34]:
vectors = {}
corpus_length = len(files)
average_doc_length = sum([len(d) for d in files.values()]) / corpus_length
for file in files:
    print(f'Vectorizing {file}... {len(vectors)}/{len(files)}')
    vectors[file] = bm25_vectorizer(files[file], full_vocab, files.values(), corpus_length=corpus_length, average_doc_length=average_doc_length)

Vectorizing A Midsummer Night's Dream... 0/27


Vectorizing All's Well That Ends Well... 1/27
Vectorizing Antony and Cleopatra... 2/27
Vectorizing As You Like It... 3/27
Vectorizing Cymbeline... 4/27
Vectorizing King Lear... 5/27
Vectorizing Loves Labours Lost... 6/27
Vectorizing Measure for Measure... 7/27
Vectorizing Much Ado About Nothing... 8/27
Vectorizing Othello the Moore of Venice... 9/27
Vectorizing Pericles Prince of Tyre... 10/27
Vectorizing Romeo and Juliet... 11/27
Vectorizing The Comedy of Errors... 12/27
Vectorizing The Life and Death of Julius Caesar... 13/27
Vectorizing The Merchant of Venice... 14/27
Vectorizing The Merry Wives of Windsor... 15/27
Vectorizing The Taming of the Shrew... 16/27
Vectorizing The Tempest... 17/27
Vectorizing The Tragedy of Coriolanus... 18/27
Vectorizing The Tragedy of Hamlet Prince of Denmark... 19/27
Vectorizing The Tragedy of Macbeth... 20/27
Vectorizing Timon of Athens... 21/27
Vectorizing Titus Andronicus... 22/27
Vectorizing Troilus and Cressida... 23/27
Vectorizing Twelfth Night..
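
Most of that time likely goes to rescanning: for every word in the vocabulary, `bm25` counts the word in the document and then checks every document in the corpus for it. A possible speed-up, sketched below under the assumption that documents are lists of lowercase tokens as above, is to count each document once with `collections.Counter` and precompute document frequencies up front. The function name `build_bm25_vectors` is hypothetical, and I have not timed this against the full corpus.

```python
import math
from collections import Counter

def build_bm25_vectors(docs, k=1.25, b=0.75):
    """docs maps a name to a list of tokens. Returns (vocab, vectors), where
    vectors maps each name to a list of BM25 scores aligned with vocab."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Count every document once instead of calling list.count per word.
    counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    vocab = sorted(doc_freq)
    index = {term: i for i, term in enumerate(vocab)}
    idf = {term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
           for term, df in doc_freq.items()}

    vectors = {}
    for name, term_counts in counts.items():
        length_norm = k * (1 - b + b * len(docs[name]) / avg_len)
        vector = [0.0] * len(vocab)  # terms absent from the document stay 0
        for term, frequency in term_counts.items():
            tf = frequency * (k + 1) / (frequency + length_norm)
            vector[index[term]] = tf * idf[term]
        vectors[name] = vector
    return vocab, vectors

# Toy corpus, made up purely for illustration.
toy_docs = {"first": ["romeo", "juliet", "romeo"], "second": ["juliet", "friar"]}
toy_vocab, toy_vectors = build_bm25_vectors(toy_docs)
print(toy_vocab)
print(toy_vectors["first"])
```

The scores use the same formula as `bm25`, but the vocabulary here is a sorted list rather than a set, so the position of each word in the vector is stable between runs.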

In [35]:
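def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    # An all-zero vector (e.g. a query with no in-vocabulary terms) has no direction.
    if norm_vector1 == 0 or norm_vector2 == 0:
        return 0.0
    return dot_product / (norm_vector1 * norm_vector2)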

corpus_length = len(files)
average_doc_length = sum([len(d) for d in files.values()]) / corpus_length
def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, similarity):
            self.name = name
            self.similarity = similarity

    query_tokens = query.lower().translate(str.maketrans('', '', string.punctuation)).split()
    query_vector = bm25_vectorizer(query_tokens, full_vocab, files.values(), corpus_length=corpus_length, average_doc_length=average_doc_length)
    print(f'Relevance for "{query}":')
    print('--------------')
    top_answers = [Relevance(name, cosine_similarity(query_vector, vectors[name])) for name in vectors]
    top_answers.sort(key=lambda x: x.similarity, reverse=True)
    for answer in top_answers[:5]:
        print(f'{answer.name}: {answer.similarity}')
    print()

In [37]:
check_document_relevance("Which play has three witches?")
check_document_relevance("Which play has a friar as an important character?")
check_document_relevance("Which play has a character named Caesar?")
check_document_relevance("Caesar?")
check_document_relevance("Which play is set in Denmark?")

Relevance for "Which play has three witches?":
--------------
The Comedy of Errors: 0.046227095367736955
The Tragedy of Macbeth: 0.045913790816210893
Twelfth Night: 0.0003240812381666201
All's Well That Ends Well: 0.0003175582201290583
Timon of Athens: 0.0003049081655583579

Relevance for "Which play has a friar as an important character?":
--------------
Much Ado About Nothing: 0.03270930487018254
Measure for Measure: 0.032399563849089555
Romeo and Juliet: 0.021691265120807567
Two Gentlemen of Verona: 0.02063497593366803
All's Well That Ends Well: 0.01723354767858958

Relevance for "Which play has a character named Caesar?":
--------------
Measure for Measure: 0.021260249293138158
Cymbeline: 0.018468163575386397
The Life and Death of Julius Caesar: 0.017032372573670268
Twelfth Night: 0.015455845460253321
The Tragedy of Hamlet Prince of Denmark: 0.015327980459987666

Relevance for "Caesar?":
--------------
The Life and Death of Julius Caesar: 0.0274850129556844
Antony and Cleopatra: 0.

This actually seems to perform worse than TF-IDF for "sentence similarity". For example, the "three witches" query ranks The Comedy of Errors slightly above The Tragedy of Macbeth, and the "character named Caesar" query puts The Life and Death of Julius Caesar third. It's possible that dampening the contribution of frequently repeated terms causes the most important terms in the search query to be swamped by less important ones. TF-IDF appears better at determining relevance for a question with a lot of "filler" words, whereas BM25 is better at determining relevance for a question with fewer, more important words (the bare "Caesar?" query does put Julius Caesar first). Adjusting the `k` and `b` parameters in BM25 could help with this.
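
To get a feel for what those two parameters do before re-running the full vectorization, the sketch below evaluates only the term-frequency part of BM25 for a few settings. The helper `bm25_tf` and the document lengths are made up for illustration, and none of this shows which values would actually improve the rankings above.

```python
def bm25_tf(frequency, doc_length, average_doc_length, k=1.25, b=0.75):
    """Term-frequency part of the BM25 score (hypothetical helper)."""
    length_norm = 1 - b + b * doc_length / average_doc_length
    return frequency * (k + 1) / (frequency + k * length_norm)

# k controls how quickly repeated occurrences stop adding to the score.
for k in (0.5, 1.25, 2.0):
    value = bm25_tf(10, doc_length=1000, average_doc_length=1000, k=k)
    print(f"k={k}: tf for 10 occurrences in an average-length doc = {value:.3f}")

# b controls how strongly document length is taken into account.
for b in (0.0, 0.75, 1.0):
    short_doc = bm25_tf(10, doc_length=500, average_doc_length=1000, b=b)
    long_doc = bm25_tf(10, doc_length=2000, average_doc_length=1000, b=b)
    print(f"b={b}: short doc = {short_doc:.3f}, long doc = {long_doc:.3f}")
```

A larger `k` lets repeated terms keep contributing for longer, which moves the behavior closer to plain TF-IDF, while `b = 0` switches off length normalization entirely.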