<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Creating TF-IDF Model from Scratch</p>

In [None]:
'''The TF-IDF model is one of the most widely used models for converting text to numeric form.'''

In [1]:
'''
 Suppose we have a corpus with three sentences:

"I like to play football"
"Did you go outside to play tennis"
"John and I play tennis"
'''


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _1: Tokenize the Sentences</p>

In [None]:
'''
The first step in this regard is to convert the sentences in our corpus into tokens or individual words
Sentence 1      Sentence 2      Sentence 3
I               Did             John
like            you             and
to              go              I
play            outside         play
football        to              tennis
                play
                tennis
'''
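As a minimal sketch of this tokenization step, the snippet below splits the three example sentences on whitespace. The names `toy_corpus` and `toy_tokens` are illustrative and do not appear elsewhere in this notebook, and `str.split` is only a stand-in for a real tokenizer such as the NLTK one used later.

```python
# The three example sentences from the toy corpus above.
toy_corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Whitespace splitting is enough here because the sentences have no punctuation.
toy_tokens = [sentence.split() for sentence in toy_corpus]

for number, tokens in enumerate(toy_tokens, start=1):
    print(f"Sentence {number}: {tokens}")
```

The three token lists have 5, 7, and 5 entries, matching the columns of the table.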

'''The idea behind the TF-IDF approach is that words that are common in one sentence but
 rare in the other sentences should be given high weights.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _2:  Find TF-IDF Values</p>

In [None]:
'''
TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)
'''

'''For instance, look at the word "play" in the first sentence. Its term frequency is 0.20: the word "play"
occurs only once in the sentence and
the total number of words in the sentence is 5, hence 1/5 = 0.20.'''
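The TF formula above can be sketched as a small helper. The function name `term_frequency` is an assumption made for illustration, and the sentence is tokenized by whitespace for simplicity.

```python
def term_frequency(word, tokens):
    """Occurrences of `word` divided by the total number of tokens."""
    return tokens.count(word) / len(tokens)


sentence_1 = "I like to play football".split()
tf_play = term_frequency("play", sentence_1)
print(tf_play)
```

For the first sentence this reproduces the 1/5 = 0.20 worked out above.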

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _3: Find IDF Values</p>

In [None]:
 
'''IDF = (Total number of sentences (documents)) / (Number of sentences (documents) containing the word)'''

'''Let's find the IDF value of the word "play". We have three documents and the word "play" occurs
in all three of them,
so the IDF value of the word "play" is 3/3 = 1.'''
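A minimal sketch of this ratio on the toy corpus follows. The helper name `idf_ratio` is hypothetical, matching is case-sensitive, and the function assumes the word occurs in at least one document (otherwise the division fails).

```python
toy_docs = [
    "I like to play football".split(),
    "Did you go outside to play tennis".split(),
    "John and I play tennis".split(),
]


def idf_ratio(word, documents):
    """Number of documents divided by the number of documents containing `word`."""
    containing = sum(1 for tokens in documents if word in tokens)
    return len(documents) / containing


print(idf_ratio("play", toy_docs))
print(idf_ratio("tennis", toy_docs))
```

"play" occurs in all three sentences and "tennis" in two of them, so the two ratios are 3/3 and 3/2.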

In [None]:
'''Finally, the TF-IDF values are calculated by multiplying TF values with their corresponding IDF values.'''

In [None]:
'''Next, we keep only the 8 most frequently occurring words.

Since IDF values are calculated using the whole corpus, we can calculate the IDF value for each word now.
The following table contains the IDF value for each word.'''

In [None]:
'''
Word        Frequency       IDF
play        3               3/3 = 1
tennis      2               3/2 = 1.5
to          2               3/2 = 1.5
I           2               3/2 = 1.5
...
'''

In [None]:
'''Let's now find the TF-IDF values for all the words in each sentence.'''
'''
Word        Sentence 1          Sentence 2              Sentence 3
play        0.20 x 1 = 0.20     0.14 x 1 = 0.14         0.20 x 1 = 0.20

tennis      0 x 1.5 = 0         0.14 x 1.5 = 0.21       0.20 x 1.5 = 0.30
  ...
'''
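The two rows of the table above can be reproduced with a short script. This is a sketch under the same assumptions as the table: whitespace tokenization, the plain ratio form of IDF (no logarithm), and scores rounded to two decimals. The names `toy_docs` and `tfidf_table` are illustrative.

```python
toy_docs = [
    "I like to play football".split(),
    "Did you go outside to play tennis".split(),
    "John and I play tennis".split(),
]

tfidf_table = {}
for word in ["play", "tennis"]:
    # IDF as a plain ratio: documents in total / documents containing the word.
    doc_count = sum(1 for tokens in toy_docs if word in tokens)
    idf = len(toy_docs) / doc_count
    # TF per sentence, multiplied by the corpus-wide IDF.
    tfidf_table[word] = [
        round(tokens.count(word) / len(tokens) * idf, 2) for tokens in toy_docs
    ]

for word, scores in tfidf_table.items():
    print(word, scores)
```

"tennis" scores higher than "play" in sentences 2 and 3 because it appears in fewer sentences overall.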

In [None]:
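'''In practice, the IDF is computed as the logarithm of this ratio, so that a word occurring in every sentence, such as "play", gets an IDF of log(1) = 0.'''
'''IDF = log((Total number of sentences (documents)) / (Number of sentences (documents) containing the word))'''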
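To see the effect of the logarithm, here is a small sketch on the toy corpus using the natural log from Python's `math` module. The names `toy_docs` and `log_idf` are assumptions for illustration.

```python
import math

toy_docs = [
    "I like to play football".split(),
    "Did you go outside to play tennis".split(),
    "John and I play tennis".split(),
]


def log_idf(word, documents):
    """Natural log of (documents in total / documents containing `word`)."""
    containing = sum(1 for tokens in documents if word in tokens)
    return math.log(len(documents) / containing)


idf_log = {word: log_idf(word, toy_docs) for word in ["play", "tennis", "football"]}
print(idf_log)
```

With the logarithm, "play" is weighted zero because it occurs in every sentence, while "football", which occurs in only one, gets the largest value.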

<p style="font-family:Roboto; font-size: 28px; color: magenta"> TF-IDF Model from Scratch in Python</p>

In [2]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

corpus = nltk.sent_tokenize(article_text)

for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus[i])
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])

wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)
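The counting loop and the `heapq.nlargest` call can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative shown on the three toy sentences rather than the Wikipedia text. The names `toy_sentences`, `toy_freq`, and `toy_most_freq` are illustrative.

```python
from collections import Counter

toy_sentences = [
    "i like to play football",
    "did you go outside to play tennis",
    "john and i play tennis",
]

# Count every token across all sentences in one pass.
toy_freq = Counter(token for sentence in toy_sentences for token in sentence.split())

# most_common(n) returns the n (word, count) pairs with the highest counts.
toy_most_freq = [word for word, count in toy_freq.most_common(4)]
print(toy_most_freq)
```

Here "play" is counted three times, and "i", "to", and "tennis" twice each.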

In [None]:
'''The cell above builds a dictionary of word frequencies and keeps the 200 most frequently occurring words.'''
print(len(most_freq))

200


In [6]:
'''The next step is to find the IDF values for the most frequently occurring words in the corpus.'''
word_idf_values = {}
for token in most_freq:
    doc_containing_word = 0
    for document in corpus:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
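    # Smoothed IDF: the 1 added to the document count departs from the plain
    # log(N / df) formula and makes the value negative for words found in every sentence.
    word_idf_values[token] = np.log(len(corpus)/(1 + doc_containing_word))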
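The effect of the `1 +` in the denominator is easiest to see on the toy corpus. The sketch below uses `math.log` in place of `np.log` (both are the natural logarithm). The helper name `smoothed_idf` is hypothetical.

```python
import math

toy_docs = [
    "i like to play football".split(),
    "did you go outside to play tennis".split(),
    "john and i play tennis".split(),
]


def smoothed_idf(word, documents):
    """log(N / (1 + df)), the same expression as in the cell above."""
    doc_count = sum(1 for tokens in documents if word in tokens)
    return math.log(len(documents) / (1 + doc_count))


for word in ("play", "tennis", "football"):
    print(word, round(smoothed_idf(word, toy_docs), 3))
```

With this variant, "play" (in all three sentences) gets a negative value, "tennis" (in two) gets zero, and only "football" gets a positive weight. The smoothing does guarantee that the denominator is never zero.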

In [None]:
'''The next step is to create the TF dictionary for each word.
word_tf_values is that dictionary: for each word, the list sent_tf_vector holds its term frequency in every sentence.'''
# Tokenize every sentence once up front instead of once per word.
tokenized_docs = [nltk.word_tokenize(document) for document in corpus]

word_tf_values = {}
for token in most_freq:
    sent_tf_vector = []
    for doc_tokens in tokenized_docs:
        doc_freq = doc_tokens.count(token)
        # Guard against sentences left empty by the cleaning step.
        word_tf = doc_freq / len(doc_tokens) if doc_tokens else 0.0
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector

In [8]:
'''Now we have IDF values of all the words, along with TF values of every word across the sentences. 
The next step is to simply multiply IDF values with TF values'''
tfidf_values = []
for token in word_tf_values.keys():
    tfidf_sentences = []
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

In [9]:
'''At this point, tfidf_values is a list of lists: each inner list holds the scores of one word across all sentences.
We need to convert this two-dimensional list to a numpy array.'''
tf_idf_model = np.asarray(tfidf_values)
tf_idf_model

array([[0.        , 0.00678011, 0.        , ..., 0.00794909, 0.01002277,
        0.00390718],
       [0.00894022, 0.00368127, 0.        , ..., 0.00647396, 0.01088375,
        0.00212141],
       [0.04542777, 0.0374111 , 0.03533271, ..., 0.01096532, 0.        ,
        0.04311788],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [10]:
'''We want each row to be the TF-IDF vector of one sentence, with one column per word. We can do so by transposing our numpy array as follows.'''
tf_idf_model = np.transpose(tf_idf_model)
tf_idf_model

array([[0.        , 0.00894022, 0.04542777, ..., 0.        , 0.        ,
        0.        ],
       [0.00678011, 0.00368127, 0.0374111 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.03533271, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.00794909, 0.00647396, 0.01096532, ..., 0.        , 0.        ,
        0.        ],
       [0.01002277, 0.01088375, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00390718, 0.00212141, 0.04311788, ..., 0.        , 0.        ,
        0.        ]])
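To make the orientation change concrete, the sketch below transposes a tiny word-major array built from the "play" and "tennis" rows of the toy table earlier in this notebook. The names `word_major` and `sentence_major` are illustrative.

```python
import numpy as np

# One row per word, one column per sentence (values from the toy TF-IDF table).
word_major = np.asarray([
    [0.20, 0.14, 0.20],  # "play" in sentences 1, 2, 3
    [0.00, 0.21, 0.30],  # "tennis" in sentences 1, 2, 3
])

# After transposing: one row per sentence, one column per word.
sentence_major = np.transpose(word_major)

print(word_major.shape)
print(sentence_major.shape)
print(sentence_major[1])  # TF-IDF vector of sentence 2
```

After the transpose, indexing a row returns the vector for a single sentence, which is the layout most downstream models expect.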