$$\large \color{green}{\textbf{Computing the Term Frequency-Inverse Document Frequency}}$$
$$\small \color{red}{\textbf{The CopyRight @ Phuong V. Nguyen}}$$
$$\small \color{blue}{\textbf{phuong.nguyen@summer.barcelonagse.eu}}$$

$$\small \color{green}{\textbf{Introduction}}$$

Why compute the Term Frequency-Inverse Document Frequency (TF-IDF)? Machine learning algorithms cannot work with raw text directly; the text must first be converted into vectors of numbers. This project introduces a step-by-step procedure for converting raw text into such vectors.

# Calling the necessary libraries

In [48]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

In [5]:
from pickle import dump
from pickle import load
Purple= '\033[95m'
Cyan= '\033[96m'
Darkcyan= '\033[36m'
Blue = '\033[94m'
Green = '\033[92m'
Yellow = '\033[93m'
Red = '\033[91m'
Bold = "\033[1m"
Reset = "\033[0;0m"
Underline= '\033[4m'
End = '\033[0m'
from pprint import pprint

# Creating documents
In this mini project, for simplicity, we will be working with two simple documents containing one sentence each as follows.

In [2]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

It is worth noting that in Natural Language Processing (NLP), a common technique for extracting features from text is to place all of the words that occur in the text in a bucket, ignoring their order. This approach is called a $\textbf{bag of words}$ model, or $\textbf{BoW}$ for short.
 
# Natural Language Processing (NLP)
## Creating a common BoW
First, we create a separate BoW for each of the two documents above.
### Creating the separate BoWs

In [6]:
bagOfWordsA = documentA.split(' ')
print(Bold+'BoW of the document A:'+End)
print(bagOfWordsA)
bagOfWordsB = documentB.split(' ')
print(Bold+'BoW of the document B:'+End)
print(bagOfWordsB)

BoW of the document A:
['the', 'man', 'went', 'out', 'for', 'a', 'walk']
BoW of the document B:
['the', 'children', 'sat', 'around', 'the', 'fire']


The two BoWs contain duplicate words: $\textbf{"the"}$ appears in both documents, and twice in document B. Next, we merge the two separate BoWs into a single set, which removes every duplicate.

### Merging the two separate BoWs into one

In [7]:
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
print(Bold+'The common BoW of the two documents:'+End)
print(uniqueWords)

The common BoW of the two documents:
{'around', 'went', 'walk', 'sat', 'the', 'children', 'fire', 'man', 'for', 'a', 'out'}


Next, we’ll create a dictionary that maps each word to its number of occurrences, one dictionary for each document in the corpus (the collection of documents).
### Creating the corpus

In [11]:
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for i in bagOfWordsA:
    numOfWordsA[i] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for i in bagOfWordsB:
    numOfWordsB[i] += 1

In [41]:
da = pd.DataFrame([numOfWordsA])
db = pd.DataFrame([numOfWordsB])
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
corpus = pd.concat([da, db], ignore_index=True)
corpus

Unnamed: 0,around,went,walk,sat,the,children,fire,man,for,a,out
0,0,1,1,0,1,0,0,1,1,1,1
1,1,0,0,1,2,1,1,0,0,0,0
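
The same count vectors can be built more compactly with `collections.Counter`. The following is a minimal sketch rather than the notebook's own method; the helper name `count_vector` and the sorted `vocabulary` list are assumptions introduced here so that the column order is reproducible.

```python
from collections import Counter

documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

# Tokenize on whitespace, as in the cells above
tokens_a = documentA.split()
tokens_b = documentB.split()

# Shared vocabulary, sorted so the column order is reproducible
vocabulary = sorted(set(tokens_a) | set(tokens_b))

def count_vector(tokens, vocabulary):
    """Hypothetical helper: map every vocabulary word to its count in tokens."""
    counts = Counter(tokens)  # words missing from tokens count as 0
    return {word: counts[word] for word in vocabulary}

counts_a = count_vector(tokens_a, vocabulary)
counts_b = count_vector(tokens_b, vocabulary)
print(counts_a)
print(counts_b)
```

Because a `Counter` returns 0 for missing keys, there is no need to pre-fill the dictionary with zeros.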


One problem with the bag-of-words approach is that it doesn’t account for noise: certain words are needed to form sentences but add little semantic meaning to the text. For example, the most commonly used word in the English language is "the", which represents roughly 7% of all words written or spoken. You couldn’t deduce anything about a text from the fact that it contains the word "the". On the other hand, words like "good" and "awesome" could be used to determine whether a rating was positive or not. In natural language processing, such low-information words are referred to as stop words.
### Stop words

In [53]:
from nltk.corpus import stopwords
nltk.download('stopwords')
print(Bold+'The English stop words:'+End)
print(stopwords.words('english'))

The English stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor'

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/phuong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Typically, when building a model to analyze text, we remove all stop words. Another strategy is to score the relative importance of words using TF-IDF.
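
Removing stop words from a BoW can be sketched as a simple filter. To keep the example runnable without downloading NLTK data, it uses `SMALL_STOP_WORDS`, a hypothetical hand-picked subset of the NLTK list printed above; in practice one would pass the full list instead. The function name `remove_stop_words` is also an assumption.

```python
# Hypothetical hand-picked subset of the NLTK English stop word list,
# used here so that the example needs no download
SMALL_STOP_WORDS = {'the', 'a', 'for', 'out'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]

filtered_a = remove_stop_words('the man went out for a walk'.split(), SMALL_STOP_WORDS)
filtered_b = remove_stop_words('the children sat around the fire'.split(), SMALL_STOP_WORDS)
print(filtered_a)  # ['man', 'went', 'walk']
print(filtered_b)  # ['children', 'sat', 'around', 'fire']
```
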
# Term Frequency - Inverse Document Frequency
## Term frequency
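Term frequency is the number of times a word appears in a document divided by the total number of words in that document. Every document has its own term frequencies.

$$TF_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}$$

Here $n_{i,j}$ is the number of times word $i$ appears in document $j$, and the sum in the denominator runs over all words $k$ in document $j$.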
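
As a quick worked example using the two documents above: document B has 6 words and contains "the" twice, while document A has 7 words and contains "man" once, so

$$TF_{\text{the},B}=\frac{2}{6}\approx 0.333,\qquad TF_{\text{man},A}=\frac{1}{7}\approx 0.143$$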

In [57]:
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict

In [64]:
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)

TF = pd.DataFrame([tfA, tfB])
print(Bold+'The Term Frequency:'+End)
TF

The Term Frequency:


Unnamed: 0,around,went,walk,sat,the,children,fire,man,for,a,out
0,0.0,0.142857,0.142857,0.0,0.142857,0.0,0.0,0.142857,0.142857,0.142857,0.142857
1,0.166667,0.0,0.0,0.166667,0.333333,0.166667,0.166667,0.0,0.0,0.0,0.0


## Inverse Document Frequency

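Inverse document frequency measures how rare a word is across all documents in the corpus: words that occur in few documents receive a high weight, and words that occur in every document receive a weight of zero.
$$IDF_{i}=\log\left(\frac{N}{df_{i}}\right)$$
Here $N$ is the number of documents in the corpus and $df_{i}$ is the number of documents that contain word $i$.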
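
As a worked example with the two documents above ($N=2$), and assuming the natural logarithm, which is what `math.log` computes in the code of this section: "the" appears in both documents, while "man" appears only in document A, so

$$IDF_{\text{the}}=\log\left(\frac{2}{2}\right)=0,\qquad IDF_{\text{man}}=\log\left(\frac{2}{1}\right)\approx 0.693$$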

In [65]:
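import math

def computeIDF(documents):
    # N is the total number of documents in the corpus
    N = len(documents)

    # Count, for each word, the number of documents that contain it
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1

    # Replace each document count with log(N / document count)
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict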

In [68]:
idfs = computeIDF([numOfWordsA, numOfWordsB])
print(Bold+'The Inverse Document Frequency'+End)
IDF=pd.DataFrame([idfs])
IDF

The Inverse Document Frequency


Unnamed: 0,around,went,walk,sat,the,children,fire,man,for,a,out
0,0.693147,0.693147,0.693147,0.693147,0.0,0.693147,0.693147,0.693147,0.693147,0.693147,0.693147


## TF-IDF
$$TFIDF_{i,j}=TF_{i,j}\times IDF_{i}$$

In [69]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [70]:
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
print(Bold+'The Term Frequency-Inverse Document Frequency'+End)
tfidf = pd.DataFrame([tfidfA, tfidfB])
tfidf

The Term Frequency-Inverse Document Frequency


Unnamed: 0,around,went,walk,sat,the,children,fire,man,for,a,out
0,0.0,0.099021,0.099021,0.0,0.0,0.0,0.0,0.099021,0.099021,0.099021,0.099021
1,0.115525,0.0,0.0,0.115525,0.0,0.115525,0.115525,0.0,0.0,0.0,0.0
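
A few entries of the table above can be checked by hand. The sketch below recomputes three of them directly from the definitions; the variable names are introduced here for illustration only.

```python
import math

n_docs = 2  # documents A and B

# Term frequencies: occurrences divided by document length
tf_man_a = 1 / 7   # 'man' occurs once among the 7 words of document A
tf_fire_b = 1 / 6  # 'fire' occurs once among the 6 words of document B
tf_the_b = 2 / 6   # 'the' occurs twice in document B

# Inverse document frequencies with the natural logarithm
idf_one_doc = math.log(n_docs / 1)    # word found in only one document
idf_both_docs = math.log(n_docs / 2)  # word found in both documents: log(1) = 0

tfidf_man_a = tf_man_a * idf_one_doc
tfidf_fire_b = tf_fire_b * idf_one_doc
tfidf_the_b = tf_the_b * idf_both_docs

print(round(tfidf_man_a, 6))   # 0.099021
print(round(tfidf_fire_b, 6))  # 0.115525
print(tfidf_the_b)             # 0.0
```

Note that "the" scores zero in both documents even though it is the most frequent word in document B, because it appears in every document of the corpus.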


## Scikit-learn

In [71]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([documentA, documentB])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
tfidf_sk = pd.DataFrame(denselist, columns=feature_names)
print(Bold+'The values of TF-IDF:'+End)
tfidf_sk

The values of TF-IDF:


Unnamed: 0,around,children,fire,for,man,out,sat,the,walk,went
0,0.0,0.0,0.0,0.42616,0.42616,0.42616,0.0,0.303216,0.42616,0.42616
1,0.407401,0.407401,0.407401,0.0,0.0,0.0,0.407401,0.579739,0.0,0.0


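The values differ from the manual computation because scikit-learn's `TfidfVectorizer` applies different defaults. It uses a smoothed IDF, $\ln\frac{1+N}{1+df_{i}}+1$, multiplies it by the raw term counts rather than by length-normalized frequencies, and then scales each row to unit Euclidean (L2) length. Its default tokenizer also drops single-character tokens, which is why there is no column for "a".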
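
To see where the scikit-learn numbers come from, the sketch below recomputes them in plain Python. It assumes the documented defaults of `TfidfVectorizer` (lowercasing, tokens of at least two word characters, smoothed IDF $\ln\frac{1+N}{1+df}+1$, raw term counts, L2 row normalization); it is an illustration of those defaults, not scikit-learn's actual implementation, and the helper `tfidf_row` is a name introduced here.

```python
import math
import re
from collections import Counter

docs = ['the man went out for a walk', 'the children sat around the fire']

# Tokens of two or more word characters, mirroring the default token pattern;
# this drops the single-character word 'a'
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
n_docs = len(docs)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: number of documents that contain each word
df = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Raw count times smoothed IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print({t: round(v, 6) for t, v in zip(vocabulary, row) if v > 0})
```

The non-zero entries agree with the table above, for example 0.303216 for "the" in the first document and 0.579739 in the second.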