## Workflow
1. Split the dataset into training and a testing subset. Use the category “title” for the testing set and the categories “comment” and “post” for the training set. The short length of titles will make them good candidates later as seeds for text generation.
2. 	Build the matrix of prefix—word frequencies.
 -  	Use the ngrams function from nltk.utils to generate all n-grams from the corpus
 -  	Set the following left_pad_symbol = <s> and right_pad_symbol = </s>
3. 	Write a text generation function:
 -  	Takes a bigram as input and generates the next token
 -  	Iteratively slide the prefix over the generated text so that the new prefix includes the most recent token; generates the next token
 -  	To generate each next token, sample the list of words associated with the prefix using the probability distribution of the prefix
 -	    Stop the text generation when a certain number of words have been generated or the latest token is a </s>.
4.	Write a function that can estimate the probability of a sentence and use it to select the most probable sentence out of several candidate sentences.
-   	Split the sentence into trigrams and use the chain rule to calculate the probability of the sentence as a product of the bigrams—tokens probabilities
5.	Implement the perplexity scoring function for a given sentence and for the training corpus.
6.	Implement Additive Laplace smoothing to give a non-zero probability to missing prefix—token combinations when calculating perplexity.
7.	Calculate the perplexity of the language model on the test set composed of titles.
8.	Try to improve the perplexity score of your model by:
 -   	modifying the pre-processing phase of the corpus,
 -   	increasing or decreasing number of tokens in the model (bi grams, 4-grams, etc.),
 -   	varying the delta parameter in the Additive Laplace smoothing step.


## 1. Load Data

In [44]:
import pandas as pd
import numpy as np
import nltk

In [45]:
data = pd.read_csv('../input/build-dom-spec-models/stackexchange_812k_v2.csv')
data.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,tokens,n_tokens
0,291254,,601672.0,The condition makes the gradient unbiased. it ...,comment,the condition makes the gradient unbiased . it...,17
1,115372,,221284.0,"Yes, that sounds fine to me.",comment,"yes , that sounds fine to me .",8
2,327356,,,Consider gaussian variables belonging to a gau...,post,consider gaussian variables belonging to a gau...,31
3,186923,,355055.0,Thanks S. Catterall. - Integrability I knew th...,comment,thanks s . catterall . - integrability i knew ...,30
4,433143,,,Feature with very few extreme values,title,feature with very few extreme values,6


## 2.Split the dataset into training and a testing subset.
Use the category “title” for the testing set and the categories “comment” and “post” for the training set. 

In [46]:
# check distinct values of category
data['category'].value_counts()

comment    540587
post       165377
title       83685
Name: category, dtype: int64

In [47]:
df_test = data[data.category == 'title']

In [48]:
#df_test.head()

In [49]:
df_train = data[data.category != 'title']

In [50]:
df_train.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,tokens,n_tokens
0,291254,,601672.0,The condition makes the gradient unbiased. it ...,comment,the condition makes the gradient unbiased . it...,17
1,115372,,221284.0,"Yes, that sounds fine to me.",comment,"yes , that sounds fine to me .",8
2,327356,,,Consider gaussian variables belonging to a gau...,post,consider gaussian variables belonging to a gau...,31
3,186923,,355055.0,Thanks S. Catterall. - Integrability I knew th...,comment,thanks s . catterall . - integrability i knew ...,30
5,366261,,688119.0,Maybe I'm just a Bayesian at heart A Bayesian ...,comment,maybe i ' m just a bayesian at heart a bayesia...,87


In [51]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 705964 entries, 0 to 789648
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   post_id     705964 non-null  int64  
 1   parent_id   74508 non-null   float64
 2   comment_id  540587 non-null  float64
 3   text        705964 non-null  object 
 4   category    705964 non-null  object 
 5   tokens      705964 non-null  object 
 6   n_tokens    705964 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 43.1+ MB


In [52]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83685 entries, 4 to 789619
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   post_id     83685 non-null  int64  
 1   parent_id   0 non-null      float64
 2   comment_id  0 non-null      float64
 3   text        83685 non-null  object 
 4   category    83685 non-null  object 
 5   tokens      83685 non-null  object 
 6   n_tokens    83685 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 5.1+ MB


## 3. Build the matrix of prefix—word frequencies

In [53]:
from nltk.util import ngrams
from collections import defaultdict , Counter
import math

### Create the Vocabulary

In [54]:
# Get all tokens in training set
train_tokens = []
for txt in df_train['tokens']:
    for tok in txt.split():
        train_tokens.append(tok)


# create dictionary {token:number of occurances in train set}
tok_count = Counter(train_tokens)

# remove tokens with less than 10 occurances
tok_count_f = {key:val for key, val in tok_count.items() if val >= 10}

# The Vocabulary of the stack exchange corpus
vocab = list(set(tok_count_f.keys()))

In [55]:
vocab

['dlogllc',
 'corelation',
 'argmin',
 'computation',
 'donations',
 'ivv',
 'coarser',
 'ph',
 'mud',
 'advancing',
 'damning',
 'hosmer',
 'enrolling',
 'rpg',
 'they',
 'grouped',
 'bender',
 'rbms',
 'grades',
 'yoy',
 'cval',
 'layers',
 'cgpa',
 'maneuver',
 'prepend',
 'merriam',
 'albedo',
 'sokal',
 'keyed',
 'intake',
 'angela',
 'startup',
 'fluorescence',
 'onehotencoder',
 'sqft',
 'procedures',
 'kutner',
 'hexbin',
 'wx',
 'flats',
 'if',
 'soluble',
 'drowning',
 'reasonableness',
 'hspace',
 'credited',
 'detract',
 'fathers',
 'scad',
 'loses',
 'remarked',
 'ohe',
 'herewith',
 'level',
 'oscillatory',
 'damage',
 'cforest',
 'fulfills',
 'krylov',
 'dots',
 'rrf',
 'indqual',
 'fifty',
 'hardt',
 'ranged',
 'award',
 'kindly',
 'haz',
 'multicolinearity',
 'broken',
 'hype',
 'pysal',
 'middling',
 'intr',
 '------------------------------------------------------------------',
 'preparation',
 'algorith',
 'independent',
 'mix',
 'gof',
 'preselected',
 'investors',


In [56]:
len(vocab)

29716

Implement a Trigram Model

In [57]:
trigram_model_count = defaultdict(lambda: defaultdict(lambda: 0))
 
for text in df_train['tokens']:
    for w1, w2, w3 in ngrams(text.split(), 3,pad_right=True, pad_left=True,right_pad_symbol='</s>', left_pad_symbol='<s>'):
        trigram_model_count[(w1, w2)][w3] += 1

In [58]:
trigram_model_count[('That','is')]['fine']

0

In [59]:
bigram_model_count = defaultdict(lambda: defaultdict(lambda: 0))
for text in df_train['tokens']:
    for w1, w2 in ngrams(text.split(),2,pad_right=True, pad_left=True,right_pad_symbol='</s>', left_pad_symbol='<s>'):
        bigram_model_count[w1][w2] += 1

Calculate Log Probability of Sentence
Using Additive Laplace smoothing

In [82]:
alpha = 0.01
# size of vocabulary
vocab_size = len(set(vocab))

def get_sentence_prob(sentence,alpha=0.01):
# use additive Laplace smoothing for calculating probability to handle OOV tokens
    #seed(10)
    sent_prob = 0
    sentence = sentence.split()
    for token in range(len(sentence)-2):
        #print(token)
        #print(trigram_model_count[sentence[token],sentence[token+1]][sentence[token+2]])
        #print(bigram_model_count[sentence[token]][sentence[token+1]])
        sent_prob += math.log2((trigram_model_count[sentence[token],sentence[token+1]][sentence[token+2]] + alpha)
                               /(bigram_model_count[sentence[token]][sentence[token+1]] + alpha*vocab_size))
    
    return sent_prob

In [61]:
def get_perplexity_score(sentence,alpha):
    
    word_count = len(sentence.split()) # Number of words in sentence
    sent_logprob = get_sentence_prob(sentence,alpha) # Log probability of sentence
    cross_ent = - sent_logprob / word_count # Cross Entropy
    perp_score = math.pow(2,cross_ent) # Perplexity
           
    return perp_score


In [83]:
#Test Probability and Perplexity for a sentence
alpha =0.001
sentence = "that sounds fine"
sent_prob = get_sentence_prob(sentence,alpha)
sent_perp_score = get_perplexity_score(sentence,alpha)
print("Probability:{0:.3f}".format(sent_prob))
print("perplexity: {0:.3f}".format(sent_perp_score))


6
549
6
549
Probability:-6.592
perplexity: 4.586


Compute Perplexity on the entire Training Corpus

In [89]:
sent_prob = 0
alpha = 0.001
trigram_cnt = 0
for sentence in df_train['tokens']:
    sentence = sentence.split()
    sentence =  sentence + ['</s>'] + ['</s>']
    #print(len(sentence))
    #print(sentence)
    for token in range(len(sentence)-2):
        sent_prob += math.log2((trigram_model_count[sentence[token],sentence[token+1]][sentence[token+2]] + alpha)
                               /(bigram_model_count[sentence[token]][sentence[token+1]] + alpha*vocab_size))
        trigram_cnt+=1
        #print(token)
        
cross_ent = -sent_prob / trigram_cnt
perp_score = math.pow(2,cross_ent)

print("Training set - Perplexity Score: {0:.3f}".format(perp_score))

Training set - Perplexity Score: 37.424


In [93]:
def get_corpus_perp_score(datframe, alpha =0.001):
    sent_prob = 0
    trigram_cnt = 0
    for sentence in datframe['tokens']:
        sentence = sentence.split()
        sentence =  sentence + ['</s>'] + ['</s>']
        #print(len(sentence))
        #print(sentence)
    for token in range(len(sentence)-2):
        sent_prob += math.log2((trigram_model_count[sentence[token],sentence[token+1]][sentence[token+2]] + alpha)
                               /(bigram_model_count[sentence[token]][sentence[token+1]] + alpha*vocab_size))
        trigram_cnt+=1
        #print(token)
        
    cross_ent = -sent_prob / trigram_cnt
    perp_score = math.pow(2,cross_ent)
    
    return perp_score


In [None]:
alpha = 0.001
perplexity_score = get_corpus_perp_score(df_train,alpha)
print("Training set - Perplexity Score: {0:.3f}".format(perplexity_score))

In [94]:
alpha = 0.001
perplexity_score = get_corpus_perp_score(df_test,alpha)
print("Test set - Perplexity Score: {0:.3f}".format(perplexity_score))

Test set - Perplexity Score: 12.349
