<a href="https://colab.research.google.com/github/kai-lim/NLP_course/blob/main/D3_tfidf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Term frequency x Inverse document frequency - TfIdf

With acknowledgement to Mayank Tripathi https://github.com/mayank408

We will build on the bag-of-words (BoW) model, creating a TfIdf model from first principles.

First, a couple of imports: "string" for string manipulation, and pprint and pandas to help us print our data structures.

In [None]:
import string
import pprint as pp

import pandas as pd



Now our corpus:

In [None]:
documents = ['Klonopin 0.25 mg po every evening, Fluconazole 200 mg po daily, Synthroid 125 mcg po every day',
             'she will not consider switching to clozapine',
             'lovastatin 40 mg one half tab po daily, multivitamin daily, metformin 500 mg one tab po twice a day',
             'Aspirin 81 mg po once daily, Zoloft 25 mg po once daily, Calcium with vitamin D two tablets po once daily']


We will "normalise" our documents by lower-casing them and removing punctuation. Note that removing punctuation also strips decimal points, so "0.25" becomes "025".

In [None]:
normalised_documents = []
for i in documents:
    no_punctuation = ''.join(c for c in i if c not in string.punctuation)
    normalised_documents.append(no_punctuation.lower())
    
for i in normalised_documents:
  print(i)

klonopin 025 mg po every evening fluconazole 200 mg po daily synthroid 125 mcg po every day
she will not consider switching to clozapine
lovastatin 40 mg one half tab po daily multivitamin daily metformin 500 mg one tab po twice a day
aspirin 81 mg po once daily zoloft 25 mg po once daily calcium with vitamin d two tablets po once daily
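As an aside, the same normalisation can be sketched with `str.translate`, which deletes all punctuation characters in one pass. The helper name `normalise` is our own choice for illustration and is not used elsewhere in this notebook.

```python
import string

# Hypothetical helper (name is an assumption); it mirrors the loop above.
def normalise(text):
    """Delete every ASCII punctuation character, then lower-case the text."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table).lower()

sample = 'Klonopin 0.25 mg po every evening, Fluconazole 200 mg po daily'
normalised_sample = normalise(sample)
print(normalised_sample)
```

As with the loop version, the decimal point in "0.25" is removed along with the commas.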


Let's split into tokens, to give our "bags":

In [None]:
bows = []
for i in normalised_documents:
    bows.append(i.split(' '))

print(bows)

[['klonopin', '025', 'mg', 'po', 'every', 'evening', 'fluconazole', '200', 'mg', 'po', 'daily', 'synthroid', '125', 'mcg', 'po', 'every', 'day'], ['she', 'will', 'not', 'consider', 'switching', 'to', 'clozapine'], ['lovastatin', '40', 'mg', 'one', 'half', 'tab', 'po', 'daily', 'multivitamin', 'daily', 'metformin', '500', 'mg', 'one', 'tab', 'po', 'twice', 'a', 'day'], ['aspirin', '81', 'mg', 'po', 'once', 'daily', 'zoloft', '25', 'mg', 'po', 'once', 'daily', 'calcium', 'with', 'vitamin', 'd', 'two', 'tablets', 'po', 'once', 'daily']]
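One thing to be aware of: `split(' ')` splits on every single space, so a document with two spaces in a row would produce an empty-string token. Our corpus has no double spaces, so the result above is clean. The small sketch below, using a made-up string, shows the difference between `split(' ')` and `split()` with no argument.

```python
# Made-up example text with a double space after 'aspirin'.
text = 'aspirin  81 mg po once daily'

by_space = text.split(' ')    # splits at each individual space
by_whitespace = text.split()  # splits on runs of any whitespace

print(by_space)
print(by_whitespace)
```

The first list contains an empty string where the double space was; the second does not.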


We need to get a set containing all of our unique words, so that we can calculate their relative 
frequencies in each document and across all documents.

In [None]:
word_set = set()
for i in bows:
  word_set = word_set.union(set(i))
print(word_set)

{'200', '81', 'once', 'half', 'daily', 'she', 'fluconazole', '500', 'synthroid', 'metformin', 'mcg', 'zoloft', 'to', '125', '40', 'evening', 'not', '25', 'po', '025', 'with', 'tablets', 'd', 'every', 'day', 'calcium', 'vitamin', 'tab', 'aspirin', 'lovastatin', 'a', 'consider', 'clozapine', 'klonopin', 'switching', 'multivitamin', 'two', 'will', 'twice', 'mg', 'one'}


Let's count how many of each word we have for each of our bags:

In [None]:
wordCounts = []
for i in bows:
  thisWordCount = dict.fromkeys(word_set, 0)
  for word in i:
    thisWordCount[word]+=1
  wordCounts.append(thisWordCount)
 

Let's take a look at these counts:

In [None]:
pd.DataFrame(wordCounts)

Unnamed: 0,200,81,once,half,daily,she,fluconazole,500,synthroid,metformin,...,consider,clozapine,klonopin,switching,multivitamin,two,will,twice,mg,one
0,1,0,0,0,1,0,1,0,1,0,...,0,0,1,0,0,0,0,0,2,0
1,0,0,0,0,0,1,0,0,0,0,...,1,1,0,1,0,0,1,0,0,0
2,0,0,0,1,2,0,0,1,0,1,...,0,0,0,0,1,0,0,1,2,2
3,0,1,3,0,3,0,0,0,0,0,...,0,0,0,0,0,1,0,0,2,0
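The same kind of count can also be sketched with `collections.Counter` from the standard library. This is an alternative to the loop above, not the approach the rest of the notebook uses, and the small `bag` and `vocabulary` here are made up for illustration.

```python
from collections import Counter

# A shortened bag, plus a vocabulary that includes one word the bag lacks.
bag = ['aspirin', '81', 'mg', 'po', 'once', 'daily',
       'zoloft', '25', 'mg', 'po', 'once', 'daily']
vocabulary = sorted(set(bag) | {'clozapine'})

counts = Counter(bag)
# A Counter returns 0 for missing keys, so every vocabulary word gets a value.
full_counts = {word: counts[word] for word in vocabulary}
print(full_counts)
```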


Now we will define ***Term Frequency*** (TF) as the relative frequency of a word in a bag (document) - i.e. what fraction of all words in a document is a particular word? We will define a function to compute this for all of the words in a bag.

In [None]:
def computeTF(wordCount, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordCount.items():
        tfDict[word] = count/float(bowCount)
    return tfDict
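To make the definition concrete, here is a quick check by hand for our first document, which has 17 tokens. "mg" occurs twice and "po" three times, so their term frequencies should be 2/17 and 3/17. This sketch recomputes them directly rather than calling `computeTF`.

```python
# First document of the corpus, already normalised and tokenised.
bag = ['klonopin', '025', 'mg', 'po', 'every', 'evening', 'fluconazole',
       '200', 'mg', 'po', 'daily', 'synthroid', '125', 'mcg', 'po',
       'every', 'day']

tf_mg = bag.count('mg') / len(bag)  # 2 / 17
tf_po = bag.count('po') / len(bag)  # 3 / 17
print(len(bag), round(tf_mg, 6), round(tf_po, 6))
```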

We will run this function over all of our bags (documents), and put the resulting TFs in a single data structure. Take a look and see how documents differ, and how the TFs reflect the relative occurrence of a word in each document.

In [None]:
termFreqs = []
for i in range(0,len(bows)): 
  print(wordCounts[i])
  print(bows[i])
  termFreqs.append(computeTF(wordCounts[i],bows[i]))

pd.DataFrame(termFreqs) 

{'200': 1, '81': 0, 'once': 0, 'half': 0, 'daily': 1, 'she': 0, 'fluconazole': 1, '500': 0, 'synthroid': 1, 'metformin': 0, 'mcg': 1, 'zoloft': 0, 'to': 0, '125': 1, '40': 0, 'evening': 1, 'not': 0, '25': 0, 'po': 3, '025': 1, 'with': 0, 'tablets': 0, 'd': 0, 'every': 2, 'day': 1, 'calcium': 0, 'vitamin': 0, 'tab': 0, 'aspirin': 0, 'lovastatin': 0, 'a': 0, 'consider': 0, 'clozapine': 0, 'klonopin': 1, 'switching': 0, 'multivitamin': 0, 'two': 0, 'will': 0, 'twice': 0, 'mg': 2, 'one': 0}
['klonopin', '025', 'mg', 'po', 'every', 'evening', 'fluconazole', '200', 'mg', 'po', 'daily', 'synthroid', '125', 'mcg', 'po', 'every', 'day']
{'200': 0, '81': 0, 'once': 0, 'half': 0, 'daily': 0, 'she': 1, 'fluconazole': 0, '500': 0, 'synthroid': 0, 'metformin': 0, 'mcg': 0, 'zoloft': 0, 'to': 1, '125': 0, '40': 0, 'evening': 0, 'not': 1, '25': 0, 'po': 0, '025': 0, 'with': 0, 'tablets': 0, 'd': 0, 'every': 0, 'day': 0, 'calcium': 0, 'vitamin': 0, 'tab': 0, 'aspirin': 0, 'lovastatin': 0, 'a': 0, 'cons

Unnamed: 0,200,81,once,half,daily,she,fluconazole,500,synthroid,metformin,...,consider,clozapine,klonopin,switching,multivitamin,two,will,twice,mg,one
0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.058824,0.0,...,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.117647,0.0
1,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.142857,0.142857,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0
2,0.0,0.0,0.0,0.052632,0.105263,0.0,0.0,0.052632,0.0,0.052632,...,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.052632,0.105263,0.105263
3,0.0,0.047619,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.095238,0.0


Our next function defines ***Inverse Document Frequency*** - IDF. This measures the rareness of a word across our whole collection of documents. For each word, we divide the total number of documents by the number of documents containing that word, and then take the base-10 log of the result.

In [None]:
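def computeIDF(docList):
    import math
    N = len(docList)  # total number of documents

    # Document frequency: in how many documents does each word appear?
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # IDF: log10 of (total documents / documents containing the word)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict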
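In other words, idf(word) = log10(N / df(word)), where N is the number of documents and df(word) is the number of documents containing the word. As a rough check by hand: our corpus has N = 4 documents, "mg" appears in 3 of them, "day" in 2, and "klonopin" in 1. The sketch below works these out directly, with the document frequencies typed in by hand.

```python
import math

N = 4  # number of documents in our corpus

# Document frequencies counted by hand from the corpus.
doc_freqs = {'mg': 3, 'day': 2, 'klonopin': 1}

idf_values = {word: math.log10(N / df) for word, df in doc_freqs.items()}
for word, value in idf_values.items():
    print(word, round(value, 4))
```

Note that a word appearing in every document would get log10(N / N) = 0, so it would contribute nothing once TF and IDF are multiplied.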

Now we compute IDF for our words. Take a look at the difference between common words like "mg" and rare ones like drug names.

In [None]:
idfs = computeIDF(wordCounts)
pp.pprint(idfs)

{'025': 0.6020599913279624,
 '125': 0.6020599913279624,
 '200': 0.6020599913279624,
 '25': 0.6020599913279624,
 '40': 0.6020599913279624,
 '500': 0.6020599913279624,
 '81': 0.6020599913279624,
 'a': 0.6020599913279624,
 'aspirin': 0.6020599913279624,
 'calcium': 0.6020599913279624,
 'clozapine': 0.6020599913279624,
 'consider': 0.6020599913279624,
 'd': 0.6020599913279624,
 'daily': 0.12493873660829992,
 'day': 0.3010299956639812,
 'evening': 0.6020599913279624,
 'every': 0.6020599913279624,
 'fluconazole': 0.6020599913279624,
 'half': 0.6020599913279624,
 'klonopin': 0.6020599913279624,
 'lovastatin': 0.6020599913279624,
 'mcg': 0.6020599913279624,
 'metformin': 0.6020599913279624,
 'mg': 0.12493873660829992,
 'multivitamin': 0.6020599913279624,
 'not': 0.6020599913279624,
 'once': 0.6020599913279624,
 'one': 0.6020599913279624,
 'po': 0.12493873660829992,
 'she': 0.6020599913279624,
 'switching': 0.6020599913279624,
 'synthroid': 0.6020599913279624,
 'tab': 0.6020599913279624,
 'tabl

Let's define a function to put TF and IDF together.

In [None]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf
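As a small worked example of the multiplication, take the first document (17 tokens). "mg" occurs twice there and appears in 3 of the 4 documents, while "klonopin" occurs once and appears in only 1 document. The numbers below are typed in by hand from the corpus.

```python
import math

n_docs = 4       # documents in the corpus
bag_length = 17  # tokens in the first document

# (count in first document, number of documents containing the word)
tfidf_mg = (2 / bag_length) * math.log10(n_docs / 3)
tfidf_klonopin = (1 / bag_length) * math.log10(n_docs / 1)

print(round(tfidf_mg, 6), round(tfidf_klonopin, 6))
```

Although "mg" is twice as frequent as "klonopin" in this document, its TF-IDF score is lower because it is common across the corpus.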

And now run this over all of the documents in our term frequency list:

In [None]:
tfidfs = []
for i in termFreqs:
  tfidfs.append(computeTFIDF(i, idfs))

pd.DataFrame(tfidfs)


Unnamed: 0,200,81,once,half,daily,she,fluconazole,500,synthroid,metformin,...,consider,clozapine,klonopin,switching,multivitamin,two,will,twice,mg,one
0,0.035415,0.0,0.0,0.0,0.007349,0.0,0.035415,0.0,0.035415,0.0,...,0.0,0.0,0.035415,0.0,0.0,0.0,0.0,0.0,0.014699,0.0
1,0.0,0.0,0.0,0.0,0.0,0.086009,0.0,0.0,0.0,0.0,...,0.086009,0.086009,0.0,0.086009,0.0,0.0,0.086009,0.0,0.0,0.0
2,0.0,0.0,0.0,0.031687,0.013151,0.0,0.0,0.031687,0.0,0.031687,...,0.0,0.0,0.0,0.0,0.031687,0.0,0.0,0.031687,0.013151,0.063375
3,0.0,0.02867,0.086009,0.0,0.017848,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02867,0.0,0.0,0.011899,0.0


How do these compare to the term frequencies? Run the next line to get just the TFs. What differences are there?

In [None]:
pd.DataFrame(termFreqs)

Unnamed: 0,200,81,once,half,daily,she,fluconazole,500,synthroid,metformin,...,consider,clozapine,klonopin,switching,multivitamin,two,will,twice,mg,one
0,0.058824,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.058824,0.0,...,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.117647,0.0
1,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.142857,0.142857,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0
2,0.0,0.0,0.0,0.052632,0.105263,0.0,0.0,0.052632,0.0,0.052632,...,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.052632,0.105263,0.105263
3,0.0,0.047619,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.095238,0.0
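One way to explore the difference is to rank the words of a single document by each score. The sketch below is self-contained and does not reuse the functions above; `top_terms` and its parameters are our own names for illustration. It lists the top three words of the last document by TF and by TF-IDF, breaking ties alphabetically.

```python
import math
from collections import Counter

bags = [
    'klonopin 025 mg po every evening fluconazole 200 mg po daily synthroid 125 mcg po every day'.split(),
    'she will not consider switching to clozapine'.split(),
    'lovastatin 40 mg one half tab po daily multivitamin daily metformin 500 mg one tab po twice a day'.split(),
    'aspirin 81 mg po once daily zoloft 25 mg po once daily calcium with vitamin d two tablets po once daily'.split(),
]
N = len(bags)
# Number of documents each word appears in.
doc_freq = Counter(word for bag in bags for word in set(bag))

def top_terms(bag, weighted, k=3):
    """Return the k highest-scoring words by TF, or by TF-IDF if weighted."""
    scores = {}
    for word, count in Counter(bag).items():
        tf = count / len(bag)
        scores[word] = tf * math.log10(N / doc_freq[word]) if weighted else tf
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

top_tf = top_terms(bags[3], weighted=False)
top_tfidf = top_terms(bags[3], weighted=True)
print(top_tf)
print(top_tfidf)
```

By TF alone, "daily", "once" and "po" tie for first place. With TF-IDF, "once" stays at the top because it occurs only in this document, while "daily" and "po" drop down because they appear in three of the four documents.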
