# Introduction

<div class="alert alert-block alert-warning">
Natural Language Processing (NLP) is a subfield of artificial intelligence that deals with understanding and processing human language. Language models are one of the most important parts of NLP.
</div>

# Language Models

<div class="alert alert-block alert-warning"> 

There are primarily two types of language models:
* Statistical Language Models - These models use traditional statistical techniques like n-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
* Neural Language Models - These models generally outperform statistical language models. They use different kinds of neural networks to model language.

</div>

<div class="alert alert-block alert-warning"> 

Both TF-IDF and n-grams are used to prepare text documents for searching. They provide different indexing rules for finding matching documents. In this shot, we will implement a `bigram model` and `TF-IDF`.

</div>

Importing necessary libraries

In [None]:
#!pip3 install nltk
import re
import math
import nltk
import numpy as np
#nltk.download('stopwords')
import pandas as pd
from itertools import islice
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Importing the corpora for analysis.

In [None]:
with open('TheStoryofAnHour-KateChopin.txt') as f:
    corpusA = f.read()

with open('TheStoryofTheSlyFox-Jeanine Pirro.txt') as f:
    corpusB = f.read()

<div class="alert alert-block alert-warning">
    
One problem with raw text analysis is that it does not account for noise. In other words, sentences contain certain words that add no specific meaning to the text. For example, the most commonly used word in the English language is "the", which represents about 7% of all words written or spoken. We cannot deduce anything about a text from the fact that it contains the word "the". On the other hand, words like "good" and "awesome" could be used to determine whether a rating was positive or not.

In natural language processing, such uninformative words are referred to as stop words. The Python Natural Language Toolkit (NLTK) library provides a list of English stop words.
    
</div>
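
To make the idea concrete, here is a minimal sketch of stop-word filtering on a single review sentence. The tiny `TOY_STOPWORDS` set and the `drop_stopwords` helper are assumptions made up for this illustration; the NLTK list used in the rest of this shot is much longer.

```python
# Hypothetical, hand-picked stop list for illustration only;
# NLTK's English stop-word list contains many more entries.
TOY_STOPWORDS = {"the", "was", "and", "a", "is", "it"}

def drop_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stopwords]

review = "The food was good and the service was awesome".split()
print(drop_stopwords(review))
# → ['food', 'good', 'service', 'awesome']
```

Only the words that carry the sentiment of the review survive the filter.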

Importing the English stop words.

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
print('stopwords:\n', stopwords[0:20])

## Preprocessing

<div class="alert alert-block alert-warning">
    
As preprocessing, we remove all capitalization, special characters, and numbers from the text. In addition, we remove common words (called `stopwords`).

After tokenization, we also remove tokens shorter than 2 characters. We perform this step because if the text contains words like “sister’s”, tokenization produces meaningless tokens like “s”. (Warning: this can also drop words such as the personal pronoun “I”, which could be undesirable in practice, but for this exercise it is okay.)

</div>

In [None]:
def preprocesssing(corpus):
    corpus0 = corpus.lower()                                                        # remove all capitalization
    corpus1 = re.sub('[^A-Za-z0-9]+', ' ', corpus0)                                 # replace special characters with spaces
    corpus2 = ''.join([i for i in corpus1 if not i.isdigit()])                      # remove digits
    words = [word for word in corpus2.split() if word not in stopwords]             # tokenize and remove stopwords (text is already lower-case)
    corpus5 = [x for x in words if len(x) >= 2]                                     # drop tokens shorter than 2 characters, e.g. "s"
    return corpus5

<div class="alert alert-block alert-warning">
    
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This approach is called a bag-of-words model, or BoW for short. It is referred to as a "bag" of words because any information about the structure of the sentence is lost. Here, we obtain the BoW for each corpus as `bagOfWordsA` and `bagOfWordsB` after preprocessing.
    
</div>
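
The loss of structure can be seen in a small sketch. The `bag_of_words` helper and the two toy sentences are assumptions for illustration. Here the bag is stored as word counts in a `Counter`, whereas `bagOfWordsA` and `bagOfWordsB` in this shot are kept as token lists and counted later.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

bow_1 = bag_of_words("the dog chased the cat")
bow_2 = bag_of_words("the cat chased the dog")

print(bow_1["the"], bow_1["dog"])  # per-word counts
print(bow_1 == bow_2)              # True: the two bags are identical
```

The two sentences mean different things, yet their bags are equal because only the counts are kept.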

In [None]:
bagOfWordsA = preprocesssing(corpusA)
bagOfWordsB = preprocesssing(corpusB)
print('bagOfWordsA:\n', bagOfWordsA[0:20])
print('\nbagOfWordsB:\n', bagOfWordsB[0:20])

# What is a Bigram Model?

<div class="alert alert-block alert-warning"> 
In this shot, I will be implementing a `bigram model`. In this model, we find bigrams, which are pairs of words that occur next to each other in the corpus (the entire collection of words/sentences). We use a `bigram model` to estimate the conditional probability of the next word given the previous one. To estimate bigram probabilities, we can use the following equation:

$$ P(W_n| W_{n-1}) = \frac {count(W_{n-1}, W_n)} {count(W_{n-1})} $$
    
    
</div>
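
As a rough sketch of how the equation turns into counts, the snippet below estimates one conditional probability from a toy token list. The `bigram_probability` helper and `toy_tokens` are hypothetical names introduced only for this example, and returning `0.0` for an unseen previous word is an assumption rather than part of the equation.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    if unigram_counts[prev_word] == 0:
        return 0.0  # assumption: no estimate for an unseen word, fall back to 0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

toy_tokens = "life might be long life might be short life goes on".split()

# "life" occurs 3 times and is followed by "might" twice, so this is 2/3.
print(bigram_probability(toy_tokens, "life", "might"))
```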

In [None]:
bigramsA = [(s1, s2) for s1, s2 in zip(bagOfWordsA, bagOfWordsA[1:])]
number_of_bigramsA = len(bigramsA)
print('Number of total bigrams: ', number_of_bigramsA)
print('\nBigramsA:\n', bigramsA[0:20])

<div class="alert alert-block alert-warning">
    
Using a word bi-gram language model, what is the probability of the phrase ‘life might’ in corpusA?
</div>
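
One way to read this question, and the one the next cell assumes, is as a joint probability factorised by the chain rule, with the probability of the first word estimated as a relative frequency. Here $T$ is a symbol added for this note; it stands for the total number of tokens in the preprocessed corpus:

$$ P(\text{life}, \text{might}) = P(\text{life}) \, P(\text{might} \mid \text{life}) = \frac{count(\text{life})}{T} \cdot \frac{count(\text{life}, \text{might})}{count(\text{life})} = \frac{count(\text{life}, \text{might})}{T} $$

Under this reading, the result is simply the number of times the bigram occurs divided by the number of tokens.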

In [None]:
prob_life = bagOfWordsA.count("life")/len(bagOfWordsA)
prob_might_given_life = bigramsA.count(('life', 'might'))/bagOfWordsA.count('life')
prob_life_might = prob_life * prob_might_given_life
print ('Probability of the phrase ‘life might’ is:', prob_life_might)

# What is TF-IDF?

<div class="alert alert-block alert-warning">
       
`TF-IDF` is useful in many natural language processing applications. For example, search engines use `TF-IDF` to rank the relevance of a document for a query. `TF-IDF` is also employed in text classification, text summarization, and topic modeling.
    
</div>

In [None]:
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
print('uniqueWords:\n', list(uniqueWords)[:10])

In [None]:
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1      

numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1 
    
print('numOfWordsA:\n', sorted(numOfWordsA.items())[:10])
print('\nnumOfWordsB:\n', sorted(numOfWordsB.items())[:10])

# Term Frequency (TF)

<div class="alert alert-block alert-warning">
    
TF measures how frequently a term occurs in a corpus. Since corpora differ in length, a term may appear many more times in a long corpus than in a short one. Thus, the term frequency is often divided by the corpus length (i.e., the total number of terms in the corpus) as a way of normalization:
   

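$$ tf_{(i,j)} = \frac {n_{(i,j)}} {\sum_k n_{(k,j)}} $$
Here $n_{(i,j)}$ is the number of times term $i$ occurs in corpus $j$, and the denominator sums over all terms $k$ in that corpus. Every corpus has its own term frequency.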
</div>
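
A small worked example may help here. The `term_frequencies` helper and `toy_doc` below are assumptions for illustration; the helper divides each term's count by the total number of tokens, as in the formula above.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one token list."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

toy_doc = ["joy", "free", "free", "body"]
print(term_frequencies(toy_doc))
# → {'joy': 0.25, 'free': 0.5, 'body': 0.25}
```

Because of the normalization, the frequencies of one corpus always sum to 1.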

In [None]:
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bagOfWordsCount)
    return tfDict

In [None]:
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)

def take(n, iterable):
    return list(islice(iterable, n))

print('tfA:\n', take(10, tfA.items()))
print('\ntfB:\n', take(10, tfB.items()))

# Inverse Document Frequency (IDF)

<div class="alert alert-block alert-warning">
    
IDF measures how important a term is. While computing TF, all terms are considered equally important. IDF is the log of the total number of documents divided by the number of documents that contain the word $w$, so it gives more weight to words that are rare across the collection. In this shot, each corpus is treated as one document, so $N = 2$.
    
$$ idf(w) = \log (\frac{N}{df_w}) $$
    
</div>
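
The sketch below works through the formula on two toy documents. The `inverse_document_frequencies` helper and `toy_docs` are hypothetical names used only here, and the natural logarithm is assumed.

```python
import math

def inverse_document_frequencies(documents):
    """documents: list of token lists. Returns {term: log(N / df)}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

toy_docs = [["fox", "sly", "story"], ["hour", "story"]]
idf = inverse_document_frequencies(toy_docs)
print(idf["story"])  # appears in both documents: log(2/2) = 0.0
print(idf["fox"])    # appears in one document: log(2/1)
```

A word that occurs in every document gets an IDF of zero, while a word found in only one document gets the largest weight.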

In [None]:
def computeIDF(documents):
    N = len(documents)
    
    idfDict = dict.fromkeys(documents[0].keys(), 0)   # document-frequency counters start at 0
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
                
                
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    
    return idfDict

In [None]:
idfs = computeIDF([numOfWordsA, numOfWordsB])
print('idfs:\n', take(30, idfs.items()))

# TF-IDF

<div class="alert alert-block alert-warning">

The TF-IDF of a term is calculated by multiplying TF and IDF scores.
    
$$ w_{i,j} = tf_{i,j} * log(\frac{N}{df_i}) $$
    
</div>
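
Putting the two parts together, a minimal end-to-end sketch could look as follows. The `tf_idf` helper and `toy_docs` are assumptions for illustration; length-normalized TF and the natural logarithm are assumed, matching the formulas above.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

toy_docs = [["fox", "sly", "fox", "story"], ["hour", "story"]]
for doc_weights in tf_idf(toy_docs):
    print(doc_weights)
```

In this toy collection, "story" appears in both documents and therefore scores zero, while "fox" gets the highest weight in the first document because it is both frequent there and absent elsewhere.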

In [None]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [None]:
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
df = pd.DataFrame([tfidfA, tfidfB])
df = df.sort_index(axis=1)
df

Let's view the DataFrame again after dropping the columns whose values are all zero.

In [None]:
df = df.replace(0, np.nan)
df = df[df.columns[~df.isnull().all()]]
df

# Using `TfidfVectorizer` from `sklearn`

<div class="alert alert-block alert-warning">
    
`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features. It uses an in-memory vocabulary (as dict) to map the most frequent words to feature indices and hence compute a word occurrence frequency (sparse) matrix.
    
</div>
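
The scores from `TfidfVectorizer` should not be expected to match the manual computation above. As far as its documented defaults go (`smooth_idf=True`, `norm='l2'`), it uses raw term counts as TF, a smoothed IDF of $\ln\frac{1+N}{1+df} + 1$, and then scales each document vector to unit length. It also tokenizes the raw text itself and removes no stop words unless asked to. The pure-Python sketch below mimics that weighting on toy documents. The `sklearn_style_tfidf` helper is a hypothetical name and not part of scikit-learn.

```python
import math
from collections import Counter

def sklearn_style_tfidf(documents):
    """Raw-count TF times smoothed IDF, then L2-normalize each document."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    rows = []
    for tokens in documents:
        raw = {
            # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so it is never zero.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_docs = [["fox", "sly", "fox", "story"], ["hour", "story"]]
for row in sklearn_style_tfidf(toy_docs):
    print(row)
```

Under this weighting, a word that appears in every document still receives a positive score, which is one reason the table below keeps columns that the manual version reduced to zero.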

In [None]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([corpusA, corpusB])
feature_names = vectorizer.get_feature_names_out()   # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns = feature_names)
df = df.sort_index(axis=1)
df

Let's view the DataFrame again after dropping the columns whose values are all zero.

In [None]:
df = df.replace(0, np.nan)
df = df[df.columns[~df.isnull().all()]]
df