Adapted largely from https://github.com/mbernico/CS570/blob/master/module_1/TFIDF.ipynb

## Introduction  
<b>tf-idf</b> (or <b>TFIDF</b>, short for term frequency-inverse document frequency) is a numerical statistic used in information retrieval to indicate how important a word is to a document in a collection, or corpus.  
The tf-idf value increases in proportion to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word. This adjusts for the fact that some words appear more frequently in general (e.g. stopwords such as "the" and "and").  
In short, it can be thought of as a term-weighting scheme.  
[Wiki Ref Article](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

In [140]:
import pandas as pd
import math

In [141]:
# A corpus, a term widely used in Natural Language Processing, is simply a collection of documents.
docA = "the cat sat on my face"
docB = "the dog sat on my dog bed"

### Tokenizing
Tokenizing is one of the first steps when handling textual data in machine learning. (The resulting representation is often referred to as the 'Bag of Words' model.)  
In the BOW model, each document is represented as a bag of words, ignoring the order of the words (which may carry meaning of its own).
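
To make the loss of word order concrete, here is a minimal sketch using two made-up sentences (they are illustrative and not part of the corpus used in this notebook). They contain the same words in a different order, so their bags of words are identical even though their meanings differ.

```python
from collections import Counter

# Two illustrative sentences: same words, different order, different meaning
sentence_1 = "the dog chased the cat"
sentence_2 = "the cat chased the dog"

# A bag of words keeps only how often each word occurs
bag_1 = Counter(sentence_1.split())
bag_2 = Counter(sentence_2.split())

print(bag_1)
print(bag_1 == bag_2)  # the two bags compare equal
```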

In [142]:
# Split each document on whitespace to get its tokens; the count vectors are built from these later
bowA = docA.split() # ['the', 'cat', 'sat', 'on', 'my', 'face']
bowB = docB.split() # ['the', 'dog', 'sat', 'on', 'my', 'dog', 'bed']

Once the documents are tokenized, we need to convert them into numbers.  
A simple strategy is to build a vector of all the words occurring in the corpus and, for each document, count how many times each word appears.

In [143]:
bowSet = set(bowA).union(set(bowB))  # unique words in the corpus: {'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the'}

def countVectorizer(bow, bowSet):
    # Start every vocabulary word at 0, e.g. {'on': 0, 'my': 0, 'sat': 0, 'face': 0, 'dog': 0, 'the': 0, 'cat': 0, 'bed': 0}
    bowSetDict = dict.fromkeys(bowSet, 0)
    for word in bow:
        bowSetDict[word] += 1
    return bowSetDict

cvA = countVectorizer(bowA, bowSet) #{'on': 1, 'my': 1, 'sat': 1, 'face': 1, 'dog': 0, 'the': 1, 'cat': 1, 'bed': 0}
cvB = countVectorizer(bowB, bowSet) #{'on': 1, 'my': 1, 'sat': 1, 'face': 0, 'dog': 2, 'the': 1, 'cat': 0, 'bed': 1}

pd.DataFrame([cvA,cvB])

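|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 1 |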
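
The same count vectors can also be built with the standard library's `collections.Counter`. The sketch below is an alternative to the `countVectorizer` function above, not a replacement for it; the names `vocabulary` and `count_vector` are illustrative and do not appear elsewhere in this notebook.

```python
from collections import Counter

doc_a = "the cat sat on my face"
doc_b = "the dog sat on my dog bed"

tokens_a, tokens_b = doc_a.split(), doc_b.split()

# Sort the vocabulary so that the column order is stable
vocabulary = sorted(set(tokens_a) | set(tokens_b))

def count_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`, in vocabulary order."""
    counts = Counter(tokens)  # words missing from the document count as 0
    return [counts[word] for word in vocabulary]

vec_a = count_vector(tokens_a, vocabulary)
vec_b = count_vector(tokens_b, vocabulary)

print(vocabulary)
print(vec_a)
print(vec_b)
```

Sorting the vocabulary is a design choice: a plain `set` has no guaranteed order, so a sorted list keeps each position in the vector tied to the same word across documents.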


With a few steps, we have turned words into a linear algebra problem!  
  
The problem with this strategy, however, is that words commonly used in a language (often called stopwords, e.g. "the", "a", "of") tend to overshadow the rarer words, which may reveal more about a document. In short, with this approach we end up with numbers that don't carry much information.

## TF-IDF  
Rather than just counting, we can use the TF-IDF score of a word to rank its importance.  
The tf-idf score of a word w in a document is:  
  
$$tfidf(w) = tf(w) \times idf(w)$$  
  
where tf(w) = (Number of times the word appears in the document) / (Total number of words in the document)  
and idf(w) = log(Total number of documents / Number of documents that contain the word w).
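
As a worked example, take the word "dog" in docB from above: it appears 2 times among the 7 words of docB, and in 1 of the 2 documents. Assuming the natural logarithm, which is what `math.log` computes in the code of this notebook:

$$tf(\text{dog}) = \frac{2}{7} \approx 0.2857, \qquad idf(\text{dog}) = \ln\frac{2}{1} \approx 0.6931, \qquad tf(\text{dog}) \times idf(\text{dog}) \approx 0.1980$$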

The log dampens the idf value so that extreme ratios do not dominate the score.  
For example, in a corpus of 100 documents, a word that occurs in every document has a ratio of 100/100 = 1, so its idf is log(1) = 0.  
A word that occurs in only one document has a ratio of 100/1 = 100, yet its idf is only log10(100) = 2 (or about 4.6 with the natural log that `math.log` uses in the code below).
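
The dampening effect is easy to check numerically. The sketch below assumes a hypothetical corpus of 100 documents and prints the raw ratio next to its base-10 and natural logarithms; the variable names are illustrative.

```python
import math

total_docs = 100  # hypothetical corpus size

# Compare the raw ratio with its logarithm for common and rare words
for docs_with_word in (100, 10, 1):
    ratio = total_docs / docs_with_word
    print(docs_with_word, ratio, math.log10(ratio), round(math.log(ratio), 4))

idf_common = math.log10(total_docs / 100)  # word present in every document
idf_rare = math.log10(total_docs / 1)      # word present in a single document
```

The ratio spans 1 to 100, while its logarithm only spans 0 to 2 (base 10) or 0 to roughly 4.6 (natural log).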

In [144]:
def computeTF(cv, doc):
    totalWords = len(doc)
    tf = {}
    for word,count in cv.items():
        tf[word] = count / totalWords
    return tf

tfA = computeTF(cvA,bowA) # docA has 6 tokens, each occurring once: {'on': 0.16666666666666666, 'my': 0.16666666666666666, 'sat': 0.16666666666666666, 'face': 0.16666666666666666, 'dog': 0.0, 'the': 0.16666666666666666, 'cat': 0.16666666666666666, 'bed': 0.0}
tfB = computeTF(cvB,bowB)

In [145]:
#idf = log(total no. of docs/ no of docs where term is appearing)

def computeIDF(bowSet,docList):
    totalDocs = len(docList)
    
    bowSetDict = dict.fromkeys(bowSet, 0)
    
    for term in bowSet:
        for doc in docList:
            if term in doc:
                bowSetDict[term] += 1
    
    for key, value in bowSetDict.items():
        bowSetDict[key] = math.log(totalDocs / value)
    
    return bowSetDict

idf = computeIDF(bowSet, [bowA,bowB]) #{'on': 0.0, 'my': 0.0, 'sat': 0.0, 'face': 0.6931471805599453, 'dog': 0.6931471805599453, 'the': 0.0, 'cat': 0.6931471805599453, 'bed': 0.6931471805599453}

In [146]:
def computeTFIDF(tf, idf):
    tfidf = {}
    for key in tf.keys():
        tfidf[key] = tf[key] * idf[key]
    return tfidf

tfidfA = computeTFIDF(tfA, idf)
tfidfB = computeTFIDF(tfB, idf)

In [147]:
pd.DataFrame([tfidfA,tfidfB])

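|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0.0 | 0.115525 | 0.0 | 0.115525 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.099021 | 0.0 | 0.198042 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |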


Notice that the tf-idf values for ['my','on','sat','the'] are 0: these words appear in every document under consideration, which implies that they are not very informative.
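
One way to read these scores is to rank each document's words by their tf-idf value. The sketch below recomputes the scores in a compact, self-contained form under the same definitions (tf as count over document length, idf with the natural log); the helper name `tfidf_scores` is illustrative and not part of the notebook.

```python
import math
from collections import Counter

docs = {
    "docA": "the cat sat on my face".split(),
    "docB": "the dog sat on my dog bed".split(),
}

def tfidf_scores(docs):
    """Score only the words present in each document (absent words are skipped)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for tokens in docs.values() for word in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores

scores = tfidf_scores(docs)
for name, doc_scores in scores.items():
    # Highest-scoring (most distinctive) words first
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    print(name, [(word, round(score, 6)) for word, score in ranked])
```

Words shared by both documents end up at the bottom of each ranking with a score of 0, while "dog", which occurs twice in docB and nowhere else, ranks highest for docB.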

### Calculating TF-IDF values using scikit-learn  
It takes just three lines to compute the values.  
[Scikit-Learn's TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [148]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
tfidf = v.fit_transform([docA,docB])
# fit_transform returns a CSR (compressed sparse row) matrix by default, so convert it to a dense array before building the DataFrame
pd.DataFrame(tfidf.toarray(), columns=v.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires scikit-learn >= 1.0

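|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0.0 | 0.498446 | 0.0 | 0.498446 | 0.354649 | 0.354649 | 0.354649 | 0.354649 |
| 1 | 0.377292 | 0.0 | 0.754584 | 0.0 | 0.268446 | 0.268446 | 0.268446 | 0.268446 |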
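
These values differ from the hand-computed ones. As its documentation describes, `TfidfVectorizer` by default uses raw counts for tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below is a pure-Python approximation of those defaults, not scikit-learn's actual implementation; `sklearn_style_tfidf` is a hypothetical helper name, and whitespace splitting is assumed to stand in for the library's tokenizer (it happens to give the same tokens for these two documents).

```python
import math
from collections import Counter

docs = ["the cat sat on my face", "the dog sat on my dog bed"]

def sklearn_style_tfidf(docs):
    """Approximate TfidfVectorizer defaults: raw tf, smoothed idf, L2 normalization."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokenization
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, so shared words keep a weight of 1
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in df}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {word: counts[word] * idf[word] for word in counts}
        # Scale the document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({word: w / norm for word, w in weights.items()})
    return rows

rows = sklearn_style_tfidf(docs)
for row in rows:
    print({word: round(value, 6) for word, value in sorted(row.items())})
```

Because of the "+ 1" in the smoothed idf, words that occur in every document, such as "the", keep a non-zero weight instead of dropping to 0 as in the hand-computed table.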
