# TF-IDF

[TF-IDF (Term Frequency-Inverse Document Frequency)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a statistical measure of how relevant a word is to one document within a collection of documents. It is the product of two metrics: how often the word appears in the document, and the inverse document frequency of the word across the collection.

TF is the frequency of a word in a document. It is calculated as the number of times a word appears in a document, divided by the total number of words in that document.

IDF is the inverse of the document frequency across the whole corpus of documents. It is calculated as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears. IDF increases as the term appears in fewer documents. If a term appears in every document, its IDF is log(1) = 0: the term is ubiquitous, so it does not help distinguish any one document from the others.

TF-IDF is simply the product of TF and IDF.
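
Written out as formulas, the three definitions above look as follows. The notation is introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus $D$, and $n_t$ is the number of documents that contain $t$. The base-10 logarithm matches the `np.log10` call used in the code below; other bases are also common.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log_{10}\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t, D)
$$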

In [9]:
# Corpus of documents to search
import string

a = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way — in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
b = "Once upon a midnight dreary, as I pondered weak and weary,\nOver many a quaint and curious volume of forgotten lore\nWhile I nodded, nearly napping, suddenly there came a tapping,\nAs of someone gently rapping, rapping at my chamber door."
c = "Call me Ishmael. Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off — then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."
d = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or eat: it was a hobbit-hole, and that means comfort."
e = "In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels. Troops went by the house and down the road and the dust they raised powdered the leaves of the trees. The trunks of the trees too were dusty and the leaves fell early that year and we saw the troops marching along the road and the dust rising and leaves, stirred by the breeze, falling and the soldiers marching and afterward the road bare and white except for the leaves."
f = "You don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly - Tom's Aunt Polly, she is - and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before."
g = "It was a bright cold day in April, and the clocks were striking thirteen."
h = "Squire Trelawnay, Dr Livesey, and the rest of these gentlemen having asked me to write down the whole particulars about Treasure Island, from the beginning to the end, keeping nothing back but the bearings of the island, and that only because there is still treasure not yet lifted, I take up my pen in the year of grace 17 — and go back to the time when my father kept the Admiral Benbow inn and the brown old seaman with the sabre cut first took up his lodging under our roof."
i = "When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton."
j = "Two households, both alike in dignity\n(In fair Verona, where we lay our scene),\nFrom ancient grudge break to new mutiny,\nWhere civil blood makes civil hands unclean.\nFrom forth the fatal loins of these two foes\nA pair of star-crossed lovers take their life;\nWhose misadventured piteous overthrows\nDoth with their death bury their parents’ strife.\nThe fearful passage of their death-marked love\nAnd the continuance of their parents’ rage,\nWhich, but their children’s end, naught could remove,\nIs now the two hours’ traffic of our stage;\nThe which, if you with patient ears attend,\nWhat here shall miss, our toil shall strive to mend."

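# Lowercase each whitespace-separated token and strip ASCII punctuation.
# Note: string.punctuation is ASCII-only, so non-ASCII characters such as
# curly apostrophes and long dashes are kept as-is.
strip_punctuation = str.maketrans('', '', string.punctuation)
sample_docs = [
    [term.lower().translate(strip_punctuation) for term in doc.split()]
    for doc in [a, b, c, d, e, f, g, h, i, j]
]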

### TF-IDF function definitions

In [16]:
import numpy as np

# Term frequency: occurrences of term in doc / total number of terms in doc
def tf(term, doc):
    if not doc:
        return 0  # an empty document has no term frequencies
    return doc.count(term) / len(doc)

# Inverse document frequency:
# log10(number of documents / number of documents containing term)
def idf(term, corpus):
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0  # avoid dividing by zero for terms that appear nowhere
    return np.log10(len(corpus) / docs_with_term)

# TF-IDF: product of tf and idf
def tfidf(term, doc, corpus):
    term = term.lower()
    term_frequency = tf(term, doc)
    # Check TF first: if the term is absent from this document the product
    # is 0 anyway, so we can skip the corpus scan in idf().
    if term_frequency == 0:
        return 0
    return term_frequency * idf(term, corpus)

### Example usage

In [11]:
print([tfidf("it", doc, sample_docs) for doc in sample_docs])
print([tfidf("Ishmael", doc, sample_docs) for doc in sample_docs])
print([tfidf("hobbit", doc, sample_docs) for doc in sample_docs])
print([tfidf("some", doc, sample_docs) for doc in sample_docs])

[0.025085832971998432, 0, 0.007378186168234833, 0.011805097869175734, 0, 0.0027366363242180107, 0.021502142547427227, 0, 0, 0]
[0, 0, 0.004901960784313725, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0.0196078431372549, 0, 0, 0, 0, 0, 0]
[0.004357322877336147, 0, 0.005126262208630761, 0, 0, 0.00475344313891216, 0, 0, 0, 0]
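
As a sanity check, the `hobbit` score can be reproduced by hand. The sketch below re-tokenizes document `d` the same way as the sample corpus and assumes, as the output above suggests, that `hobbit` occurs in exactly one of the ten documents. The names `tokens`, `n_docs`, and `docs_with_term` are introduced here for illustration only.

```python
import math
import string

d = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or eat: it was a hobbit-hole, and that means comfort."

# Same preprocessing as the sample corpus: lowercase, strip ASCII punctuation.
strip_punct = str.maketrans('', '', string.punctuation)
tokens = [term.lower().translate(strip_punct) for term in d.split()]

n_docs = 10          # size of the sample corpus
docs_with_term = 1   # assumption read off the output: only one non-zero score

# "hobbit-hole" becomes the single token "hobbithole", so it is not counted.
term_frequency = tokens.count("hobbit") / len(tokens)
inverse_doc_frequency = math.log10(n_docs / docs_with_term)

print(len(tokens), term_frequency, inverse_doc_frequency)
print(term_frequency * inverse_doc_frequency)
```

Because the term appears in only one of ten documents, its IDF is log10(10) = 1, so the score is just the term frequency. The printed product should agree with the non-zero `hobbit` entry in the output above.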


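It's possible to create a vector of TF-IDF values for each document in a corpus. This is done by calculating the TF-IDF value, for that document, of every word in the corpus vocabulary. This is conceptually what `TfidfVectorizer` from the scikit-learn library does, although scikit-learn's defaults differ in detail (a smoothed natural-log IDF and L2-normalized rows), so its numbers will not match the ones computed here.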

In [12]:
# Create a set from the combination of all the documents
vocab = set([term for doc in sample_docs for term in doc])

def tfidfVectorizer(doc, vocab, corpus):
    ret = []
    for word in vocab:
        ret.append(tfidf(word, doc, corpus))
    return ret

for doc in sample_docs:
    print(tfidfVectorizer(doc, vocab, sample_docs))


[0.008333333333333333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.008333333333333333, 0.016666666666666666, 0.008333333333333333, 0, 0, 0, 0, 0, 0, 0.004357322877336147, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.008333333333333333, 0, 0, 0, 0, 0, 0, 0, 0.016666666666666666, 0.008333333333333333, 0.016666666666666666, 0, 0, 0, 0.0012908496665478598, 0, 0, 0, 0, 0, 0, 0, 0.00582475003613349, 0, 0, 0, 0, 0, 0, 0.01130616818427325, 0, 0, 0, 0, 0, 0, 0, 0.00582475003613349, 0, 0.008333333333333333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.016666666666666666, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.008333333333333333, 0, 0, 0.016666666666666666, 0, 0, 0, 0, 0, 0, 0, 0.008333333333333333, 0, 0, 0, 0, 0, 0.008333333333333333, 0, 0, 0, 0, 0.000762624842677919, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0025085832971998433, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.016666666666666666, 0, 0, 0, 0, 0, 0.008333333333333333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.008333333333333

Let's try with a larger data set.

In [13]:
import os

directory = os.path.join(".", "plays")  # portable path; "\p" is not a valid escape sequence
files = {}

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, "r") as file:
            file_name = os.path.splitext(filename)[0]
            file_contents = file.read().lower().translate(str.maketrans('', '', string.punctuation)).split()
            files[file_name] = file_contents

In [14]:
full_vocab = set([term for doc in files.values() for term in doc])
len(full_vocab)

23443

Note that vectorizing large documents with a large vocabulary is slow with this implementation. On my laptop, it takes 30+ seconds per play to vectorize the 27 plays in this Shakespeare corpus.
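
Most of that time goes to rescanning: for every vocabulary word, `tf` walks the whole document with `doc.count`, and `idf` tests `term in doc` against every document in the corpus. One way to avoid this, sketched below under the assumption that each document is a list of tokens, is to count terms once per document with `collections.Counter` and to compute document frequencies once for the whole corpus. The function names `document_frequencies` and `fast_tfidf_vector` and the toy corpus are hypothetical and are not used elsewhere in this notebook.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # set() so a term counts once per document
    return df

def fast_tfidf_vector(doc, vocab_list, df, n_docs):
    """TF-IDF vector for one tokenized doc, ordered like vocab_list."""
    counts = Counter(doc)  # one pass over the document
    total = len(doc)
    vector = []
    for term in vocab_list:
        count = counts.get(term, 0)
        # Absent terms (in this doc or in the whole corpus) score 0,
        # mirroring the tfidf() function defined earlier.
        if count == 0 or df[term] == 0:
            vector.append(0.0)
        else:
            vector.append((count / total) * math.log10(n_docs / df[term]))
    return vector

# Tiny made-up corpus to show the call pattern.
toy_corpus = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
toy_vocab = sorted({term for doc in toy_corpus for term in doc})  # fixed order
toy_df = document_frequencies(toy_corpus)
toy_vectors = [
    fast_tfidf_vector(doc, toy_vocab, toy_df, len(toy_corpus))
    for doc in toy_corpus
]
print(toy_vocab)
print(toy_vectors[0])
```

Sorting the vocabulary into a list also fixes the column order, so a word's position can be looked up cheaply through a precomputed `{term: index}` dictionary instead of searching the vocabulary each time.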

In [15]:
vectors = {}
for file in files:
    print(f'Vectorizing {file}... {len(vectors)}/{len(files)}')
    vectors[file] = tfidfVectorizer(files[file], full_vocab, files.values())

Vectorizing A Midsummer Night's Dream... 0/27
Vectorizing All's Well That Ends Well... 1/27
Vectorizing Antony and Cleopatra... 2/27
Vectorizing As You Like It... 3/27
Vectorizing Cymbeline... 4/27
Vectorizing King Lear... 5/27
Vectorizing Loves Labours Lost... 6/27
Vectorizing Measure for Measure... 7/27
Vectorizing Much Ado About Nothing... 8/27
Vectorizing Othello the Moore of Venice... 9/27
Vectorizing Pericles Prince of Tyre... 10/27
Vectorizing Romeo and Juliet... 11/27
Vectorizing The Comedy of Errors... 12/27
Vectorizing The Life and Death of Julius Caesar... 13/27
Vectorizing The Merchant of Venice... 14/27
Vectorizing The Merry Wives of Windsor... 15/27
Vectorizing The Taming of the Shrew... 16/27
Vectorizing The Tempest... 17/27
Vectorizing The Tragedy of Coriolanus... 18/27
Vectorizing The Tragedy of Hamlet Prince of Denmark... 19/27
Vectorizing The Tragedy of Macbeth... 20/27
Vectorizing Timon of Athens... 21/27
Vectorizing Titus Andronicus... 22/27
Vectorizing Troilus and

Let's look at the TF-IDF values for some words in these documents.

In [34]:
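def check_tf_idf(word):
    word = word.lower()
    print(f'TF-IDF for "{word}":')
    print('--------------')
    # Look up the word's position once. Set iteration order is stable as long
    # as full_vocab has not been modified since the vectors were built.
    index = list(full_vocab).index(word) if word in full_vocab else None
    for file in vectors:
        value = vectors[file][index] if index is not None else 0
        print(f'{file}: {value}')
    print()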

In [48]:
# check_tf_idf("macbeth")
# check_tf_idf("the")
# check_tf_idf("friar")
# check_tf_idf("caesar")
# check_tf_idf("juliet")
# check_tf_idf("poison")
check_tf_idf("witch")

TF-IDF for "witch":
--------------
A Midsummer Night's Dream: 0
All's Well That Ends Well: 0
Antony and Cleopatra: 3.943813192736422e-05
As You Like It: 0
Cymbeline: 1.2269030416699616e-05
King Lear: 1.2773657760377298e-05
Loves Labours Lost: 0
Measure for Measure: 0
Much Ado About Nothing: 1.5710510688823773e-05
Othello the Moore of Venice: 0
Pericles Prince of Tyre: 0
Romeo and Juliet: 0
The Comedy of Errors: 6.567710289886787e-05
The Life and Death of Julius Caesar: 0
The Merchant of Venice: 0
The Merry Wives of Windsor: 0.00013488415094268957
The Taming of the Shrew: 0
The Tempest: 6.105446716752889e-05
The Tragedy of Coriolanus: 0
The Tragedy of Hamlet Prince of Denmark: 1.1008799915956441e-05
The Tragedy of Macbeth: 0.001012578289383548
Timon of Athens: 1.804675983148155e-05
Titus Andronicus: 0
Troilus and Cressida: 2.573493007755663e-05
Twelfth Night: 0
Two Gentlemen of Verona: 0
Winter's Tale: 1.3674335783784216e-05



Let's also see what happens if we construct vectors for questions and compare them using cosine similarity. This is not semantic search, but it should be able to identify the relevant terms in the query and match them with appropriate documents.
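
Cosine similarity measures the angle between two vectors and ignores their lengths, which matters here because a short query and a full play produce TF-IDF values of very different magnitude. A minimal pure-Python sketch shows that rescaling a vector leaves the score unchanged, while vectors with no terms in common score 0. The helper name `cosine` and the toy vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)

query = [0.5, 0.0, 0.5]        # weights on the first and third terms
play = [0.002, 0.0, 0.002]     # same direction, much smaller magnitude
other = [0.0, 0.3, 0.0]        # shares no terms with the query

print(cosine(query, play))     # close to 1: same direction
print(cosine(query, other))    # 0: nothing in common
```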

In [56]:
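def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    # A query with no known terms yields an all-zero vector; return 0 rather than NaN.
    if norm_vector1 == 0 or norm_vector2 == 0:
        return 0.0
    return dot_product / (norm_vector1 * norm_vector2)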

def check_document_relevance(query):
    class Relevance:
        def __init__(self, name, similarity):
            self.name = name
            self.similarity = similarity

    query_tokens = query.lower().translate(str.maketrans('', '', string.punctuation)).split()
    query_vector = tfidfVectorizer(query_tokens, full_vocab, files.values())
    print(f'Cosine Similarity for "{query}":')
    print('--------------')
    top_answers = [Relevance(name, cosine_similarity(query_vector, vectors[name])) for name in vectors]
    top_answers.sort(key=lambda x: x.similarity, reverse=True)
    for answer in top_answers[:5]:
        print(f'{answer.name}: {answer.similarity}')
    print()

In [60]:
check_document_relevance("Which play has three witches?")
check_document_relevance("Which play has a friar as an important character?")
check_document_relevance("Which play has a character named Caesar?")
check_document_relevance("Which play is set in Denmark?")

Cosine Similarity for "Which play has three witches?":
--------------
The Tragedy of Macbeth: 0.01948626250486932
The Comedy of Errors: 0.005611824433082991
Winter's Tale: 0.00037118348325664843
Timon of Athens: 0.00031602372232806956
All's Well That Ends Well: 0.00030145591259637955

Cosine Similarity for "Which play has a friar as an important character?":
--------------
Romeo and Juliet: 0.07641111959611063
Measure for Measure: 0.053790236776151544
Much Ado About Nothing: 0.023315855346946388
Two Gentlemen of Verona: 0.0037530961624678808
All's Well That Ends Well: 0.0021664152548854904

Cosine Similarity for "Which play has a character named Caesar?":
--------------
The Life and Death of Julius Caesar: 0.15360041890648263
Antony and Cleopatra: 0.12476452754254892
Cymbeline: 0.009163618773637387
Measure for Measure: 0.0031858348211999815
Winter's Tale: 0.0026004508016721104

Cosine Similarity for "Which play is set in Denmark?":
--------------
The Tragedy of Hamlet Prince of Denmark