# TF - IDF Scores
**Tf-Idf** (**Term Frequency - Inverse Document Frequency**) scores evaluate how relevant a word is to a document in a collection of documents.
 
 ![image.png](attachment:image.png)
 
* **Term Frequency** = how many times a word appears in a document. In this notebook the count is divided by the total number of words in that document.
* **Inverse Document Frequency** = the logarithm of the total number of documents divided by the number of documents that contain the word. This metric is 0 for a word that appears in every document and grows as the word becomes rarer across the collection.
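
The two definitions can be written as formulas. This is a sketch that follows the functions defined later in this notebook (counts normalised by document length, base-10 logarithm). The symbols $f_{t,d}$ (count of term $t$ in document $d$), $N$ (number of documents), and $n_t$ (number of documents containing $t$) are notation introduced here for illustration.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With $N = 2$, a term found in both documents has $\mathrm{idf} = \log_{10}(2/2) = 0$, and a term found in only one has $\mathrm{idf} = \log_{10}(2/1) \approx 0.301$.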


### Importing necessary libraries

In [63]:
import math

### Function to compute Term Frequency

In [64]:
def computeTF(wordDict):
    tfDict = {}
    # Total number of words in the document
    wordcount = 0
    for key, val in wordDict.items():
        wordcount += val
    # Relative frequency of each word
    for key, val in wordDict.items():
        tfDict[key] = val / float(wordcount)
    return tfDict
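
As a point of comparison, term frequency can also be computed straight from raw text with `collections.Counter`. The sketch below is only an illustration: `term_frequency` is a hypothetical helper, not part of this notebook, and unlike `computeTF` it takes a string rather than a dictionary of counts, so words with a zero count never appear in its result.

```python
from collections import Counter

def term_frequency(text):
    """Hypothetical helper: relative frequency of each whitespace-separated token."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # Guard against empty input to avoid dividing by zero
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("Thinking about Information Retrieval")
print(tf)
```

Each of the four words occurs once, so every value is 0.25 and the values sum to 1.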

### Function to compute Inverse Document Frequency

In [65]:
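def computeIDF(docList):
    N = len(docList)
    # Start every word's document count at 0 (vocabulary taken from the first document)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    # Count the documents in which each word occurs
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # Assumes every word occurs in at least one document, otherwise val is 0 here
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
    return idfDict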
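
`computeIDF` relies on two properties of its input: every dictionary shares the keys of the first one, and every word has a non-zero count in at least one document. If either does not hold, it raises `KeyError` or `ZeroDivisionError`. The sketch below shows one possible way to relax both assumptions. `compute_idf_safe` is a hypothetical name introduced here, and the tiny `docs` list is made-up illustration data.

```python
import math

def compute_idf_safe(doc_list):
    """Hypothetical variant of computeIDF: vocabulary is the union of all keys."""
    n_docs = len(doc_list)
    doc_freq = {}
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                doc_freq[word] = doc_freq.get(word, 0) + 1
    # Words with a zero count everywhere never enter doc_freq,
    # so the division below cannot hit zero
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

docs = [
    {"information": 1, "retrieval": 1, "fun": 1, "unused": 0},
    {"information": 1, "thinking": 1},
]
idf = compute_idf_safe(docs)
print(idf)
```

The word present in both documents gets an IDF of 0, words present in one document get `log10(2)`, and the all-zero word is simply left out.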

### Function to compute Tf-Idf Scores

In [66]:
def computeTFIDF(tfs, idfs):
    tfidf = {}
    for word,val in tfs.items():
        tfidf[word] = val*idfs[word]
    return tfidf
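
To make the multiplication concrete, here is the arithmetic for two words, worked by hand. The numbers assume the sample sentences used later in this notebook: "is" occurs 2 times among the 12 words of the first document and in 1 of the 2 documents, while "Information" occurs once in the first document and in both documents.

```python
import math

# "is": tf = 2/12, idf = log10(2 documents / 1 document containing it)
tf_is = 2 / 12
idf_is = math.log10(2 / 1)
tfidf_is = tf_is * idf_is

# "Information": tf = 1/12, idf = log10(2/2) = 0, so the score vanishes
tf_info = 1 / 12
idf_info = math.log10(2 / 2)
tfidf_info = tf_info * idf_info

print(round(tfidf_is, 6))
print(tfidf_info)
```

A word that appears in every document scores 0 no matter how often it occurs, which is the intended effect of the IDF factor.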

### Sample Data

In [67]:
s1_text="Information Retrieval is an easy subject that is fun to learn about"
s2_text="Thinking about Information Retrieval"
s1_dict={}
s2_dict={}
distinct=[] #(distinct words in corpus)

### Preprocessing

In [68]:
#Splitting Sentences to Words
s1=s1_text.split(" ")
s2=s2_text.split(" ")

#Build Term Frequency Vectors
for i in s1:
    if(i not in distinct):
        s1_dict[i]=0
        s2_dict[i]=0
        distinct.append(i)
    s1_dict[i]+=1
for i in s2:
    if(i not in distinct):
        s1_dict[i]=0
        s2_dict[i]=0
        distinct.append(i)
    s2_dict[i]+=1
print("Dictionary for DOC1")
print(s1_dict)
print("Dictionary for DOC2")
print(s2_dict)
print("TF for DOC1")
print(computeTF(s1_dict))
print("TF for DOC2")
print(computeTF(s2_dict))

#Build Inverse Document Frequency Vector
doc_list=[s1_dict,s2_dict]
print("IDF in General")
print(computeIDF(doc_list))

Dictionary for DOC1
{'Information': 1, 'Retrieval': 1, 'is': 2, 'an': 1, 'easy': 1, 'subject': 1, 'that': 1, 'fun': 1, 'to': 1, 'learn': 1, 'about': 1, 'Thinking': 0}
Dictionary for DOC2
{'Information': 1, 'Retrieval': 1, 'is': 0, 'an': 0, 'easy': 0, 'subject': 0, 'that': 0, 'fun': 0, 'to': 0, 'learn': 0, 'about': 1, 'Thinking': 1}
TF for DOC1
{'Information': 0.08333333333333333, 'Retrieval': 0.08333333333333333, 'is': 0.16666666666666666, 'an': 0.08333333333333333, 'easy': 0.08333333333333333, 'subject': 0.08333333333333333, 'that': 0.08333333333333333, 'fun': 0.08333333333333333, 'to': 0.08333333333333333, 'learn': 0.08333333333333333, 'about': 0.08333333333333333, 'Thinking': 0.0}
TF for DOC2
{'Information': 0.25, 'Retrieval': 0.25, 'is': 0.0, 'an': 0.0, 'easy': 0.0, 'subject': 0.0, 'that': 0.0, 'fun': 0.0, 'to': 0.0, 'learn': 0.0, 'about': 0.25, 'Thinking': 0.25}
IDF in General
{'Information': 0.0, 'Retrieval': 0.0, 'is': 0.3010299956639812, 'an': 0.3010299956639812, 'easy': 0.301029995
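
The two loops in the preprocessing cell build one count dictionary per sentence over a shared vocabulary. A more compact way to get the same structure might look like the sketch below. `build_count_dicts` is a hypothetical helper, not part of this notebook, and it uses `split()` without an argument, which also tolerates repeated spaces.

```python
from collections import Counter

def build_count_dicts(texts):
    """Hypothetical helper: one count dict per text over a shared vocabulary."""
    tokenized = [text.split() for text in texts]
    # dict.fromkeys keeps first-seen order while removing duplicates
    vocabulary = list(dict.fromkeys(word for tokens in tokenized for word in tokens))
    count_dicts = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words missing from this text
        count_dicts.append({word: counts[word] for word in vocabulary})
    return vocabulary, count_dicts

vocabulary, (d1, d2) = build_count_dicts([
    "Information Retrieval is an easy subject that is fun to learn about",
    "Thinking about Information Retrieval",
])
print(vocabulary)
print(d1)
print(d2)
```

Because every dictionary carries every vocabulary word, including those with a count of 0, the result can be passed to `computeTF` and `computeIDF` as they are defined in this notebook.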

### Build Tf-Idf matrix

In [69]:
#IDF is the same for every document, so compute it once
idfs=computeIDF(doc_list)
#One row per distinct word, one column per document
tf_idf_vector=[[0,0] for _ in distinct]
for j,doc in enumerate(doc_list):
    temp_dict=computeTFIDF(computeTF(doc),idfs)
    for i in range(len(distinct)):
        tf_idf_vector[i][j]=temp_dict[distinct[i]]
for i in tf_idf_vector:
    print(i)

[0.0, 0.0]
[0.0, 0.0]
[0.050171665943996864, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.025085832971998432, 0.0]
[0.0, 0.0]
[0.0, 0.0752574989159953]
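
The printed rows follow the order of `distinct`, so the first row belongs to "Information" and the last to "Thinking". The self-contained sketch below recomputes the same scores and prints each row next to its word, which may make the matrix easier to read. It is an illustration under the same definitions (length-normalised counts, base-10 logarithm), and the names `docs`, `vocabulary`, and `rows` are introduced here.

```python
import math

docs = [
    "Information Retrieval is an easy subject that is fun to learn about".split(),
    "Thinking about Information Retrieval".split(),
]
# Distinct words in first-seen order
vocabulary = list(dict.fromkeys(w for doc in docs for w in doc))
n_docs = len(docs)

rows = {}
for word in vocabulary:
    # Number of documents that contain the word at least once
    doc_freq = sum(1 for doc in docs if word in doc)
    idf = math.log10(n_docs / doc_freq)
    # One tf-idf score per document
    rows[word] = [doc.count(word) / len(doc) * idf for doc in docs]

for word, scores in rows.items():
    print(f"{word:<12}", [round(s, 4) for s in scores])
```

Words shared by both sentences ("Information", "Retrieval", "about") score 0 in both columns, while "Thinking" has the highest score because it is rare in the collection and makes up a quarter of its short document.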
