In [1]:
import numpy as np

### TF-IDF Weightage Sample Code
This sample code shows how the TF-IDF weightage model can be applied to calculate the relevance score of each document for a query.

The example is taken from **lecture 3.2**.

**P.S. Do not be confused by the TF-IDF answers given in lecture 3.2. That example showed only documents d1-d5, but the actual collection was larger, so M and k were unknown in that worked solution.** <br>
**In this example, M = 5, because the collection is assumed to contain only these 5 documents.**

<h3 style = 'color:purple;'>Vector Space Model (TF-IDF Weightage Model)</h3>

$$ f(q,d) = sim(q,d) = \sum_{i=1}^n x_i y_i $$
$q = (x_1, \dots, x_n)$ <br>
$d = (y_1, \dots, y_n)$ <br>
$x_i$ = count of word $W_i$ in the query <br>
$y_i$ = TF-IDF weight of word $W_i$ in the document, i.e. $$ y_i = C(W_i, doc) \cdot \log_2 \frac{M+1}{k} $$
$M$ = number of documents in the collection <br>
$k$ = document frequency of $W_i$, i.e. the number of documents that contain $W_i$
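
As a quick check of the weighting formula, the sketch below evaluates $y_i$ for a single word. The helper name `tf_idf_weight` is hypothetical and not part of the lecture code, and the sample values are chosen only for illustration.

```python
import math

def tf_idf_weight(count, M, k):
    """Return C(W_i, doc) * log2((M + 1) / k) for one word in one document.

    count: occurrences of the word in the document, C(W_i, doc)
    M:     number of documents in the collection
    k:     number of documents that contain the word (must be > 0)
    """
    return count * math.log2((M + 1) / k)

# Illustrative values: the word occurs 4 times in the document
# and appears in 4 of the 5 documents in the collection.
weight = tf_idf_weight(count=4, M=5, k=4)
print(weight)  # approximately 2.34
```

Because the numerator is M + 1 rather than M, a word that appears in every document still gets a small positive weight instead of zero.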


In [2]:
#lets say we have the following documents
documents = {
    "d1" : "news about",
    "d2" : "news about organic food campaign",
    "d3" : "news of presidential campaign",
    "d4" : "news of presidential campaign presidential candidate",
    "d5" : "news of organic food campaign campaign campaign campaign"
} # a dictionary with doc# as key and doc content as value

In [3]:
#visualize the dictionary
documents

{'d1': 'news about',
 'd2': 'news about organic food campaign',
 'd3': 'news of presidential campaign',
 'd4': 'news of presidential campaign presidential candidate',
 'd5': 'news of organic food campaign campaign campaign campaign'}

In [4]:
#create a corpus containing the vocabulary of words in the documents
corpus = [] # a list that will store words of the vocabulary
for doc in documents.values(): #iterate through documents
    for word in doc.split(): #go through each word in the current doc
        if word not in corpus:
            corpus.append(word) #add word to corpus if not already added

In [5]:
#visualize the corpus 
corpus

['news',
 'about',
 'organic',
 'food',
 'campaign',
 'of',
 'presidential',
 'candidate']

In [6]:
#lets create a dictionary that will store document frequency for each word in the corpus
df_corpus = {} #document frequency for every word in corpus
for word in corpus:
    k = 0 #initial document frequency set to 0
    for doc in documents.values(): #iterate through documents
        if word in doc.split(): #check if word in doc
            k+=1 
    df_corpus[word] = k

In [7]:
#verify the document frequency values
df_corpus

{'news': 5,
 'about': 2,
 'organic': 2,
 'food': 2,
 'campaign': 4,
 'of': 3,
 'presidential': 2,
 'candidate': 1}
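
The same document frequencies could also be computed in a single pass with `collections.Counter` from the standard library. This is only an alternative sketch; the name `df_counter` is introduced here for illustration and is not used elsewhere in the notebook.

```python
from collections import Counter

documents = {
    "d1": "news about",
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign",
    "d4": "news of presidential campaign presidential candidate",
    "d5": "news of organic food campaign campaign campaign campaign",
}

df_counter = Counter()
for doc in documents.values():
    # set() keeps each word once per document, so repeated words
    # inside one document add only 1 to the document frequency
    df_counter.update(set(doc.split()))

print(dict(df_counter))
```

Converting each document to a set first is what separates document frequency from raw term counts: "campaign" occurs four times in d5 but still contributes only 1 to its document frequency.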

In [8]:
#since we have calculated k (document frequency) for all the words in the corpus, next step is to calculate idf
M = len(documents) #number of documents in the collection
idf_corpus = {} #inverse_document frequency for every word in corpus
for word in corpus:
    idf_corpus[word] = np.log2((M+1) / df_corpus[word]) #log_2 ((M+1)/k) i.e inverse document frequency

In [9]:
#visualize idf values
idf_corpus

{'news': 0.2630344058337938,
 'about': 1.584962500721156,
 'organic': 1.584962500721156,
 'food': 1.584962500721156,
 'campaign': 0.5849625007211562,
 'of': 1.0,
 'presidential': 1.584962500721156,
 'candidate': 2.584962500721156}

We have now calculated the inverse document frequency (IDF) for every word in the corpus. Next, we use these IDF values to calculate the TF-IDF weight of each word with respect to each document.

In [10]:
#calculating tf_idf
tf_idf_docs = {} #will store tf_idf scores for document words
for doc_id in documents.keys():
    tf_idf_docs[doc_id] = {} #initialize empty dictionary for each doc_id

In [11]:
#visualize tf_idf initial state
tf_idf_docs

{'d1': {}, 'd2': {}, 'd3': {}, 'd4': {}, 'd5': {}}

In [12]:
#finalizing the tf_idf calculations
for word in corpus:
    for doc_id,doc in documents.items(): #iterate through key,value pairs where key = doc_id and value = doc content
        tf_idf_docs[doc_id][word] = doc.split().count(word) * idf_corpus[word] #C(W_i,doc) * IDF(W_i) 

In [13]:
#visualize final tf_idf scores for each doc
tf_idf_docs

{'d1': {'news': 0.2630344058337938,
  'about': 1.584962500721156,
  'organic': 0.0,
  'food': 0.0,
  'campaign': 0.0,
  'of': 0.0,
  'presidential': 0.0,
  'candidate': 0.0},
 'd2': {'news': 0.2630344058337938,
  'about': 1.584962500721156,
  'organic': 1.584962500721156,
  'food': 1.584962500721156,
  'campaign': 0.5849625007211562,
  'of': 0.0,
  'presidential': 0.0,
  'candidate': 0.0},
 'd3': {'news': 0.2630344058337938,
  'about': 0.0,
  'organic': 0.0,
  'food': 0.0,
  'campaign': 0.5849625007211562,
  'of': 1.0,
  'presidential': 1.584962500721156,
  'candidate': 0.0},
 'd4': {'news': 0.2630344058337938,
  'about': 0.0,
  'organic': 0.0,
  'food': 0.0,
  'campaign': 0.5849625007211562,
  'of': 1.0,
  'presidential': 3.169925001442312,
  'candidate': 2.584962500721156},
 'd5': {'news': 0.2630344058337938,
  'about': 0.0,
  'organic': 1.584962500721156,
  'food': 1.584962500721156,
  'campaign': 2.3398500028846247,
  'of': 1.0,
  'presidential': 0.0,
  'candidate': 0.0}}
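
The nested dictionary above can also be viewed as a documents-by-words matrix. The following is a minimal NumPy sketch of that view, assuming the same five documents; the names `vocab`, `counts`, `idf` and `tf_idf` are introduced here for illustration only.

```python
import numpy as np

documents = {
    "d1": "news about",
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign",
    "d4": "news of presidential campaign presidential candidate",
    "d5": "news of organic food campaign campaign campaign campaign",
}

# Vocabulary in first-seen order (dict.fromkeys drops duplicates, keeps order)
vocab = list(dict.fromkeys(w for doc in documents.values() for w in doc.split()))

# counts[i, j] = number of times vocab[j] occurs in the i-th document
counts = np.array([[doc.split().count(w) for w in vocab]
                   for doc in documents.values()])

M = len(documents)                # number of documents in the collection
k = (counts > 0).sum(axis=0)      # document frequency of each word
idf = np.log2((M + 1) / k)        # log_2((M + 1) / k) for each word

# Broadcasting multiplies every row of counts by the idf vector
tf_idf = counts * idf
print(tf_idf.round(3))
```

Each row of `tf_idf` corresponds to one document (d1 to d5) and each column to one word of `vocab`, so the entry in row d4, column "presidential" matches the value stored in the dictionary.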

### Querying the documents for relevance scores
Now that we have calculated the TF-IDF weights for all the documents in our collection, let us calculate the relevance score of each document for a given query.

In [14]:
query = "news about presidential campaign" #the query
query

'news about presidential campaign'

In [15]:
query_vocab = [] # will store the unique words that occur in the query
for word in query.split():
    if word not in query_vocab:
        query_vocab.append(word)

In [16]:
query_vocab # the unique words in the query

['news', 'about', 'presidential', 'campaign']

In [17]:
query_wc = {} # a dictionary to store count of a word in the query (i.e x_i according to lecture slides terminology)
for word in query_vocab:
    query_wc[word] = query.split().count(word)

In [18]:
query_wc # the count of each word that occurs in the query

{'news': 1, 'about': 1, 'presidential': 1, 'campaign': 1}

In [19]:
relevance_scores = {} # a dictionary that will store the relevance score for each doc
# doc_id will be the key and relevance score the value for this dictionary
for doc_id in documents.keys():
    score = 0 #initialize the score for the doc to 0 at the start
    for word in query_vocab:
        score += query_wc[word] * tf_idf_docs[doc_id][word] # count of word in query * tf-idf of the word in the doc
    relevance_scores[doc_id] = score

In [20]:
# lets print the relevance scores for the query
print("Document Relevancy Scores\n",relevance_scores)

Document Relevancy Scores
 {'d1': 1.84799690655495, 'd2': 2.432959407276106, 'd3': 2.432959407276106, 'd4': 4.017921907997263, 'd5': 2.6028844087184186}


### What next?

<img src = "https://media.makeameme.org/created/brace-yourself-assignment-5bb85a.jpg" alt = "Assignment is coming" />