# TF-IDF Scores
 **Tf-Idf**, or **term frequency-inverse document frequency**, scores evaluate how relevant a word is to a document in a collection of documents.
 
 ![image.png](attachment:image.png)
 
* **Term Frequency** = how many times a word appears in a document, divided by the total number of words in that document.
* **Inverse Document Frequency** = the logarithm of the total number of documents divided by the number of documents that contain the word. This metric is 0 for a word that appears in every document and grows as the word becomes rarer across the collection.
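
Written out, the definitions that the code in this notebook follows look roughly like this. The symbols $f_{t,d}$ (count of term $t$ in document $d$), $N$ (number of documents) and $n_t$ (number of documents containing $t$) are notation introduced here for illustration. The base-10 logarithm matches the `math.log10` call in `computeIDF`.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Other sources use different variants (natural logarithm, smoothing terms, normalisation), so scores from other tools may not match the numbers produced here.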


### Importing necessary libraries

In [63]:
import math

### Function to compute Term Frequency

In [64]:
def computeTF(wordDict):
    # Relative frequency: each word's count divided by the document's total word count
    tfDict = {}
    wordcount = sum(wordDict.values())
    for key, val in wordDict.items():
        tfDict[key] = val / float(wordcount)
    return tfDict
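
As a small illustration of the same idea, the sketch below computes term frequencies directly from a string with `collections.Counter`. The function name `term_frequencies` and the sample sentence are hypothetical and not part of the notebook. Returning an empty dictionary for an empty document is an assumption made here to avoid a division by zero.

```python
from collections import Counter

def term_frequencies(text):
    """Relative frequency of each whitespace-separated token in `text`."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # An empty document has no terms, so return an empty mapping
    # instead of dividing by zero (an assumption of this sketch)
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 tokens
```

Because every count is divided by the same total, the term frequencies of one document sum to 1.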

### Function to compute Inverse Document Frequency

In [65]:
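def computeIDF(docList):
    # Assumes every dictionary in docList has the same keys (the corpus vocabulary)
    N = len(docList)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    # Count the documents in which each word occurs at least once
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # IDF = log10(N / document frequency); each word must occur in at least one document
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
    return idfDict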
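
To see the effect of document frequency on a slightly larger toy corpus, here is a minimal sketch that works on token lists rather than count dictionaries. The name `inverse_document_frequencies` and the three sample documents are made up for illustration.

```python
import math

def inverse_document_frequencies(documents):
    """Map each word to log10(N / number of documents containing it)."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Use a set so a word counts once per document, however often it repeats
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

docs = [["red", "apple"], ["red", "car"], ["red", "red", "sky"]]
idf = inverse_document_frequencies(docs)
print(idf["red"])    # 0.0, because "red" occurs in all three documents
print(idf["apple"])  # log10(3 / 1), because "apple" occurs in only one
```

Since only words that actually occur are counted, the document frequency is never zero in this version.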

### Function to compute Tf-Idf scores for one document

In [66]:
def computeTFIDF(tfs, idfs):
    tfidf = {}
    for word,val in tfs.items():
        tfidf[word] = val*idfs[word]
    return tfidf
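
The product is easy to check by hand. The values below are hypothetical, chosen to resemble a two-document corpus: a word that makes up one eighth of a document and occurs in one of the two documents.

```python
import math

# Hypothetical values: a word making up 1/8 of a document, found in 1 of 2 documents
tf_value = 1 / 8
idf_value = math.log10(2 / 1)
tfidf_value = tf_value * idf_value
print(round(tfidf_value, 6))

# A word found in both documents has IDF 0, so its score is 0
# regardless of its term frequency
shared_word_score = tf_value * math.log10(2 / 2)
print(shared_word_score)
```

This is why words shared by every document drop out of the final matrix.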

### Sample Data

In [71]:
s1_text="A mathematician found a solution to the problem"
s2_text="The problem was solved by a young mathematician"
s1_dict={}
s2_dict={}
distinct=[] #(distinct words in corpus)

### Preprocessing

In [72]:
#Splitting Sentences to Words
s1=s1_text.split(" ")
s2=s2_text.split(" ")

#Build Term Frequency Vectors
for i in s1:
    if(i not in distinct):
        s1_dict[i]=0
        s2_dict[i]=0
        distinct.append(i)
    s1_dict[i]+=1
for i in s2:
    if(i not in distinct):
        s1_dict[i]=0
        s2_dict[i]=0
        distinct.append(i)
    s2_dict[i]+=1
print("Dictionary for DOC1")
print(s1_dict)
print("Dictionary for DOC2")
print(s2_dict)
print("TF for DOC1")
print(computeTF(s1_dict))
print("TF for DOC2")
print(computeTF(s2_dict))

#Build Inverse Document Frequency Vector
doc_list=[s1_dict,s2_dict]
print("IDF in General")
print(computeIDF(doc_list))

Dictionary for DOC1
{'A': 1, 'mathematician': 1, 'found': 1, 'a': 1, 'solution': 1, 'to': 1, 'the': 1, 'problem': 1, 'The': 0, 'was': 0, 'solved': 0, 'by': 0, 'young': 0}
Dictionary for DOC2
{'A': 0, 'mathematician': 1, 'found': 0, 'a': 1, 'solution': 0, 'to': 0, 'the': 0, 'problem': 1, 'The': 1, 'was': 1, 'solved': 1, 'by': 1, 'young': 1}
TF for DOC1
{'A': 0.125, 'mathematician': 0.125, 'found': 0.125, 'a': 0.125, 'solution': 0.125, 'to': 0.125, 'the': 0.125, 'problem': 0.125, 'The': 0.0, 'was': 0.0, 'solved': 0.0, 'by': 0.0, 'young': 0.0}
TF for DOC2
{'A': 0.0, 'mathematician': 0.125, 'found': 0.0, 'a': 0.125, 'solution': 0.0, 'to': 0.0, 'the': 0.0, 'problem': 0.125, 'The': 0.125, 'was': 0.125, 'solved': 0.125, 'by': 0.125, 'young': 0.125}
IDF in General
{'A': 0.3010299956639812, 'mathematician': 0.0, 'found': 0.3010299956639812, 'a': 0.0, 'solution': 0.3010299956639812, 'to': 0.3010299956639812, 'the': 0.3010299956639812, 'problem': 0.0, 'The': 0.3010299956639812, 'was': 0.301029995
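
Note that the dictionaries above treat `'A'` and `'a'`, and `'The'` and `'the'`, as different words, because the text is split without any case normalisation. The sketch below shows one possible variation, lowercasing before splitting. It is not what the notebook does, and the variable names are illustrative only. Applying it would change the scores that follow.

```python
s1_text = "A mathematician found a solution to the problem"
s2_text = "The problem was solved by a young mathematician"

# Lowercase before splitting so "A"/"a" and "The"/"the" share one vocabulary entry
tokens_1 = s1_text.lower().split()
tokens_2 = s2_text.lower().split()

vocabulary = sorted(set(tokens_1) | set(tokens_2))
counts_1 = {word: tokens_1.count(word) for word in vocabulary}
counts_2 = {word: tokens_2.count(word) for word in vocabulary}

print(len(vocabulary))
print(counts_1["a"], counts_2["the"])
```

With lowercasing, "a" and "the" occur in both sentences, so their IDF becomes 0 under the definition used here.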

### Build Tf-Idf matrix

In [73]:
# Rows are words (in the order of `distinct`), columns are documents
tf_idf_vector=[]
for i in range(len(distinct)):
    temp=[0,0]
    tf_idf_vector.append(temp)
idfs=computeIDF(doc_list) #IDF depends only on the corpus, so compute it once
j=0
for doc in doc_list:
    temp_dict=computeTFIDF(computeTF(doc),idfs)
    for i in range(len(distinct)):
        tf_idf_vector[i][j]=temp_dict[distinct[i]]
    j+=1
for i in tf_idf_vector:
    print(i)

[0.03762874945799765, 0.0]
[0.0, 0.0]
[0.03762874945799765, 0.0]
[0.0, 0.0]
[0.03762874945799765, 0.0]
[0.03762874945799765, 0.0]
[0.03762874945799765, 0.0]
[0.0, 0.0]
[0.0, 0.03762874945799765]
[0.0, 0.03762874945799765]
[0.0, 0.03762874945799765]
[0.0, 0.03762874945799765]
[0.0, 0.03762874945799765]
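
Each row above corresponds to one word, in order of first appearance, and each column to one document. As a cross-check, the following self-contained sketch rebuilds the same matrix from the two sample sentences and prints each row next to its word. The names `docs`, `vocabulary` and `matrix` are chosen here for illustration and do not appear in the notebook.

```python
import math

docs = [
    "A mathematician found a solution to the problem".split(),
    "The problem was solved by a young mathematician".split(),
]

# Vocabulary in order of first appearance, case-sensitive like the notebook
vocabulary = []
for tokens in docs:
    for word in tokens:
        if word not in vocabulary:
            vocabulary.append(word)

n_docs = len(docs)
matrix = []
for word in vocabulary:
    # Number of documents containing the word at least once
    df = sum(1 for tokens in docs if word in tokens)
    idf = math.log10(n_docs / df)
    matrix.append([tokens.count(word) / len(tokens) * idf for tokens in docs])

for word, row in zip(vocabulary, matrix):
    print(f"{word:15s}{row[0]:.4f}  {row[1]:.4f}")
```

Words that occur in both sentences ("mathematician", "a", "problem") score 0 in both columns, while every other word scores 0.125 × log10(2) in the one document that contains it.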
