## N-gram Language Model

Split the dataset into a training and a testing subset. Use the category “title” for the testing set and the categories “comment” and “post” for the training set. The short length of titles will make them good candidates later as seeds for text generation.

In [16]:
import pandas as pd
import nltk
from nltk.util import ngrams
from collections import defaultdict, Counter
from tqdm import tqdm
import numpy as np
import random
import math

In [111]:
nltk.__version__

'3.6.5'

In [3]:
df = pd.read_csv("../data/stackexchange_812k.tokenized.csv").sample(frac=1, random_state=8).reset_index(drop=True)

In [4]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,tokens,n_tokens
0,161009,,309845.0,"I can't disclose the algorithm, but I can cert...",comment,"i can ' t disclose the algorithm , but i can c...",40
1,156252,,298634.0,I plan to leave the answer to this question in...,comment,i plan to leave the answer to this question in...,84
2,423360,,790161.0,"Wait, I need to clarify how is Half-normal dis...",comment,"wait , i need to clarify how is half - normal ...",25
3,268623,,,I am fitting several models of the form.. glm ...,post,i am fitting several models of the form .. glm...,82
4,433662,,808873.0,If you really want to calculate some p-value u...,comment,if you really want to calculate some p - value...,66


In [5]:
#Transform the dataset
df['tokens'] = df.tokens.apply(lambda txt : txt.split())

# We split the dataset into a training and a testing subset.
# The testing subset is composed of the titles, the train subset is composed of posts and comments

df_train = df[df.category.isin(['post','comment'])].copy()
df_test = df[df.category.isin(['title'])].copy()

In [6]:
df.sample(5).tokens.values

array([list(['the', 'lasso', 'method', 'for', 'variable', 'selection']),
       list(['if', 'your', 'real', 'task', 'is', 'about', 'discovering', 'the', 'covariance', 'matrix', ',', 'by', 'using', 'trials', ',', 'then', 'you', 'may', 'be', 'interested', 'in', 'a', 'direct', 'analytical', 'solution', 'the', 'variance', '-', 'covariance', 'matrix', 'for', 'a', 'uniform', 'variable', ',', 'in', 'euclidean', 'space', ',', 'on', 'the', 'cube']),
       list(['no', ',', 'i', 'don', "'", 't', 'think', 'as', 'i', 'go', ',', 'i', 'thin', 'afterward', '.', 'it', 'is', 'just', 'that', 'i', 'have', 'to', 'let', 'it', 'run', 'for', 'a', 'long', 'time', 'in', 'order', 'to', 'get', 'the', 'lags', 'i', 'want', '.']),
       list(['it', 'is', 'preferable', 'to', 'do', 'this', 'exercise', 'from', 'definition', 'as', 'done', 'here']),
       list(['well', ',', 'i', 'used', 'to', 'have', 'the', 'same', 'knowledge', 'gap', 'i', "'", 'm', 'referring', 'to', 'in', 'my', 'question', '.', 'it', 'seems', 'to', 

## Build the matrix of prefix—word frequencies.

### Use the ngrams function from nltk.util to generate all n-grams from the corpus, with left_pad_symbol = \<s> and right_pad_symbol = \</s>

In [7]:
df_train.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,tokens,n_tokens
0,161009,,309845.0,"I can't disclose the algorithm, but I can cert...",comment,"[i, can, ', t, disclose, the, algorithm, ,, bu...",40
1,156252,,298634.0,I plan to leave the answer to this question in...,comment,"[i, plan, to, leave, the, answer, to, this, qu...",84
2,423360,,790161.0,"Wait, I need to clarify how is Half-normal dis...",comment,"[wait, ,, i, need, to, clarify, how, is, half,...",25
3,268623,,,I am fitting several models of the form.. glm ...,post,"[i, am, fitting, several, models, of, the, for...",82
4,433662,,808873.0,If you really want to calculate some p-value u...,comment,"[if, you, really, want, to, calculate, some, p...",66


In [13]:
# Example
list(
    ngrams([1,2,3,4,5]
            , 2
            , pad_left=True
            , pad_right=True
            , left_pad_symbol='<s>'
            , right_pad_symbol='</s>'
          )
    )

[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]

In [8]:
ngrams_degree = 3
counts = defaultdict(Counter)
for tokens in tqdm(df_train.tokens.values):
    for ngram in ngrams(
            tokens, 
            n= ngrams_degree,  
            pad_right=True, 
            pad_left=True, 
            left_pad_symbol="<s>", 
            right_pad_symbol="</s>"):
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        counts[prefix][token] +=1

100%|██████████| 705964/705964 [01:04<00:00, 10901.18it/s]


## Write a text generation function

- Take a bigram as input and generate the next token.
- Iteratively slide the prefix over the generated text so that the new prefix includes the most recent token, then generate the next token.
- To generate each next token, sample from the words associated with the prefix, using the probability distribution of that prefix.
- Stop the text generation when a set number of words has been generated or the latest token is \</s>.

In [95]:
def normalize(dictionary):
    summed = math.fsum(dictionary.values())
    return { k : v/summed for k, v in dictionary.items() }

In [96]:
freq = defaultdict(dict)
for prefix, tokens in counts.items():
    freq[prefix] = normalize({token : counts[prefix][token] for token in tokens })

In [98]:
for i in range(5):
    prefix = random.choice(list(freq.keys()))
    print("{}: \t{}".format(prefix,freq[prefix]))

('<s>', 'booby'): 	{"'": 1.0}
('detailed', 'must'): 	{'be': 1.0}
('retain', 'that'): 	{'.': 0.16666666666666666, 'characteristic': 0.16666666666666666, 'meaning': 0.16666666666666666, 'number': 0.16666666666666666, 'customer': 0.16666666666666666, 'tomorrow': 0.16666666666666666}
('actionable', 'input'): 	{'to': 1.0}
('intensity', 'constructed'): 	{'and': 1.0}


In [99]:
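def generate(text, n_words = 40):
    for i in range(n_words):
        # the prefix is the last (ngrams_degree - 1) tokens of the text so far
        prefix = tuple(text.split()[-ngrams_degree+1:])
        # .get avoids adding an empty entry to the defaultdict for unseen prefixes
        distribution = freq.get(prefix, {})
        # unseen prefix: no candidate for the next token
        if len(distribution) == 0:
            break
        candidates = list(distribution.keys())
        # renormalize to guard against floating-point drift
        probas     = np.array(list(distribution.values()))
        probas /= probas.sum()
        text      += ' ' + np.random.choice(candidates, p = probas)
        if text.endswith('</s>'):
            break

    return text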

In [100]:
text      = 'the model'
print()
print(generate(text))

print()
text      = 'that distribution'
print(generate(text))

print()
text      = 'to determine'
print(generate(text))


the model choice for convienience ? and those are two overall k records and to adjust for rater exercise status x id , mixture models and clustering is a sequence . this issue . </s>

that distribution . e . only the p - value from the grand mean , sd . indeed , the map maximum aposteriori probability density function . see , inter - dependent covariates . the format and open an issue of distinct

to determine an increase in error terms . first model predictor lt - lmer out between within is not reported . for factors such as cca in a country group medium s . does the relative mse at </s>


## Write a function that can estimate the probability of a sentence and use it to select the most probable sentence out of several candidate sentences.

- Split the sentence into trigrams and use the chain rule to calculate the probability of the sentence as the product of the conditional probabilities of each token given its bigram prefix.
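
The chain-rule scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the toy corpus and the names `padded_ngrams`, `toy_counts`, `sentence_logprob` and `most_probable` are hypothetical, and `padded_ngrams` is a small stand-in that mimics the padded output of `nltk.util.ngrams` so the block runs on its own. Log-probabilities are summed instead of multiplying raw probabilities, and a sentence containing an unseen trigram gets a log-probability of minus infinity.

```python
import math
from collections import Counter, defaultdict

def padded_ngrams(tokens, n):
    # hypothetical helper: pads with n - 1 symbols on each side, like nltk's ngrams
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# toy corpus for illustration only; the notebook would use df_train.tokens
toy_corpus = [
    "the model is good".split(),
    "the model is good".split(),
    "the model is bad".split(),
    "the model fits well".split(),
]

n = 3
toy_counts = defaultdict(Counter)
for tokens in toy_corpus:
    for ngram in padded_ngrams(tokens, n):
        toy_counts[ngram[:-1]][ngram[-1]] += 1

def sentence_logprob(tokens):
    """Sum of log P(token | prefix) over the padded trigrams of the sentence."""
    logprob = 0.0
    for ngram in padded_ngrams(tokens, n):
        prefix, token = ngram[:-1], ngram[-1]
        count = toy_counts.get(prefix, Counter())[token]
        if count == 0:
            # unseen prefix or token: the unsmoothed model assigns probability 0
            return float("-inf")
        logprob += math.log(count / sum(toy_counts[prefix].values()))
    return logprob

def most_probable(candidates):
    """Return the candidate sentence with the highest log-probability."""
    return max(candidates, key=lambda s: sentence_logprob(s.split()))

candidates = ["the model is bad", "the model is good", "the model is purple"]
print(most_probable(candidates))
```

Working in log space avoids numerical underflow on long sentences; ranking by log-probability gives the same order as ranking by probability.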

The cell below does not score sentences. It is a variant of the generation function that reshapes each next-token distribution with a temperature exponent before sampling.

In [43]:
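def generate_temp(text, temperature = 1, n_words=30):
    for i in range(n_words):
        prefix = tuple(text.split()[-ngrams_degree+1:])
        # .get avoids adding an empty entry to the defaultdict for unseen prefixes
        distribution = freq.get(prefix, {})
        # no available next word
        if len(distribution) == 0:
            break
        candidates  = list(distribution.keys())
        initial_probas = list(distribution.values())
        # modify distribution: raise each probability to the power `temperature`, then renormalize.
        # With this exponent, temperature > 1 sharpens the distribution and temperature < 1
        # flattens it (the inverse of the common p ** (1 / T) convention).
        denom   = sum( [ p ** temperature for p in initial_probas ] )
        probas  = [ p ** temperature / denom  for p in initial_probas ]
        text       += ' ' + np.random.choice(candidates, p = probas)
        if text.endswith('</s>'):
            break

    return text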
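
To see what the reweighting step in generate_temp does to a distribution, here is a small standalone sketch. The function name `reweight` and the example probabilities are made up for illustration. It raises each probability to a power and renormalizes, as generate_temp does with its `temperature` argument. Note that many references define temperature the other way round, as p ** (1 / T).

```python
def reweight(probas, exponent):
    """Raise each probability to `exponent` and renormalize so they sum to 1."""
    powered = [p ** exponent for p in probas]
    total = sum(powered)
    return [p / total for p in powered]

probas = [0.7, 0.2, 0.1]           # made-up next-token distribution
sharp = reweight(probas, 2.0)      # exponent > 1: mass concentrates on the likeliest token
flat = reweight(probas, 0.5)       # exponent < 1: the distribution moves toward uniform
same = reweight(probas, 1.0)       # exponent = 1: unchanged

for name, dist in [("sharp", sharp), ("flat", flat), ("same", same)]:
    print(name, [round(p, 3) for p in dist])
```

With an exponent of 1 the sampling is identical to the plain generate function, which is why the default call behaves like ordinary generation.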

In [45]:
text      = 'the model'
print()
print(generate_temp(text))

print()
text      = 'that distribution'
print(generate_temp(text))

print()
text      = 'to determine'
print(generate_temp(text))


the model is capturing seasonality properly . you would make sense to interpolate between points . </s>

that distribution ? is your description , but it is simple to express the value of outcome you regard patients as sick retrieved . on noisy observations is smaller than the median

to determine sensitivity and specificity happen to the population sizes of fish per unit time days until the predictions and confidence interval has only participants for correction for overdispersion which means i


## Implement the perplexity scoring function for a given sentence and for the training corpus.

The following code snippet skips n-grams and tokens that are missing from the model instead of smoothing them, which keeps the function fast. Laplace smoothing is added in the next section.
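
For reference, the perplexity function in the next cell computes the following quantity, written here for a trigram model and a sentence of $N$ tokens $w_1, \dots, w_N$ (the symbols are introduced for illustration and do not appear in the notebook):

$$
\mathrm{PP}(w_1, \dots, w_N) = \exp\left( -\frac{1}{N} \sum_{i} \log P(w_i \mid w_{i-2}, w_{i-1}) \right)
$$

The sum runs over the padded trigrams of the sentence. A lower value means the model finds the sentence less surprising.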

In [49]:
#Calculate the perplexity on some sentences.
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

def perplexity(sentence):
    sentence = tokenizer.tokenize(sentence.lower())
    N = len(sentence)
    logprob = 0

    for ngram in ngrams(
          sentence, 
          n= ngrams_degree,  
          pad_right=True, pad_left=True, 
          left_pad_symbol="<s>", right_pad_symbol="</s>"):
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        try:
            logprob += np.log( freq[ prefix ][token]  )
        except KeyError:
            # unseen prefix or token: skip it (no smoothing here)
            pass

    return np.exp(- logprob / N)

In [47]:
text      = 'the model'
print()
print(perplexity(text))

print()
text      = 'that distribution'
print(perplexity(text))

print()
text      = 'to determine'
print(perplexity(text))


1051.7392900892169

12757.759076995711

8858.233390878902


## Implement Additive Laplace smoothing to give a non-zero probability to missing prefix—token combinations when calculating perplexity.

In [86]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

def perplexity_laplace(sentence,delta = 1):
    sentence = tokenizer.tokenize(sentence.lower())
    N = len(sentence)
    logprob = 0
    for ngram in ngrams(
          sentence, 
          n= ngrams_degree,  
          pad_right=True, pad_left=True, 
          left_pad_symbol="<s>", right_pad_symbol="</s>"):
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        # dictionary membership test; no need to build a list of all keys on every n-gram
        if prefix in counts:
            total = sum( counts[prefix].values()  )
            # a Counter returns 0 for an unseen token, so one expression covers both cases.
            # Note: the denominator adds delta * N (sentence length); textbook additive
            # smoothing adds delta * V, with V the vocabulary size.
            logprob += np.log( (counts[prefix][token] + delta)/ (total + delta * N ) )
        else:
            # unseen prefix: fall back to a probability of 1 / N
            logprob += - np.log( N )

    return np.exp(-logprob / N)

In [87]:
sentence = "this model belongs on a different planet"
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 10), sentence))

sentence = "this question really belongs on a different site."
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 10), sentence))

[perplexity 142.66] this model belongs on a different planet
[perplexity 35.50] this question really belongs on a different site.


In [88]:
sentence = "this model belongs on a different planet"
print("\n[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 1), sentence))

sentence = "this question really belongs on a different site."
print("\n[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 1), sentence))


[perplexity 319.66] this model belongs on a different planet

[perplexity 36.10] this question really belongs on a different site.
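
The perplexity_laplace function above adds delta * N to the denominator, where N is the sentence length. For comparison, here is a sketch of the textbook form of additive smoothing, which adds delta times the vocabulary size V, so that the smoothed probabilities of all possible next tokens sum to 1 for any prefix. The toy corpus and the names `padded_ngrams`, `toy_counts`, `vocab` and `smoothed_proba` are hypothetical and are not part of the notebook.

```python
from collections import Counter, defaultdict

def padded_ngrams(tokens, n):
    # hypothetical helper: pads with n - 1 symbols on each side, like nltk's ngrams
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# toy corpus for illustration only
toy_corpus = [
    "the model is good".split(),
    "the model is good".split(),
    "the model is bad".split(),
    "the model fits well".split(),
]

n = 3
toy_counts = defaultdict(Counter)
for tokens in toy_corpus:
    for ngram in padded_ngrams(tokens, n):
        toy_counts[ngram[:-1]][ngram[-1]] += 1

# every token that can be predicted: the corpus words plus the end-of-sentence symbol
vocab = {tok for tokens in toy_corpus for tok in tokens} | {"</s>"}
V = len(vocab)

def smoothed_proba(prefix, token, delta=1.0):
    """Add-delta estimate of P(token | prefix) with the vocabulary size in the denominator."""
    counter = toy_counts.get(prefix, Counter())
    return (counter[token] + delta) / (sum(counter.values()) + delta * V)

prefix = ("the", "model")
print(sum(smoothed_proba(prefix, tok) for tok in vocab))  # sums to 1 up to rounding
print(smoothed_proba(prefix, "purple"))                   # unseen token, still non-zero
```

With this denominator an unseen prefix yields the uniform probability 1 / V for every token, which plays the role of the 1 / N fallback used in the notebook's function.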


## Calculate the perplexity of the language model on the test set composed of titles.


In [85]:
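# logproba_sentence and corpus are defined in a later cell (In [109]); this cell was executed out of order
scores = [logproba_sentence(sentence) for sentence in corpus]
score = np.ma.masked_invalid(scores).sum()
- score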

21542.76845163307

In [108]:
def corpus_perplexity(corpus):
    # start by calculating the total number of tokens in the corpus
    token_count = np.sum([len(tokenizer.tokenize(sentence)) for sentence in corpus])
    # sum the sentence log-probabilities, masking any NaN or infinite value
    logproba_corpus_scores = np.array([logproba_sentence(sentence) for sentence in corpus])
    logproba_corpus = np.ma.masked_invalid(logproba_corpus_scores).sum()
    # perplexity = exp of the negative average log-probability per token
    perplexity =  np.exp(- logproba_corpus/token_count)
    print(token_count, logproba_corpus)
    return perplexity

In [109]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

def logproba_sentence(sentence, delta = 1):
    # delta is kept in the signature but is not used: missing n-grams are skipped, not smoothed
    sentence = tokenizer.tokenize(sentence.lower())
    logprob = 0
    for ngram in ngrams(
        sentence, n= ngrams_degree,  
        pad_right=True, pad_left=True, 
        left_pad_symbol="<s>", right_pad_symbol="</s>"):
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        try:
            logprob += np.log( freq[prefix][token] )
        except KeyError:
            # unseen prefix or token: contributes nothing to the sum
            pass
    return logprob

# The perplexity of a sample of 1000 titles
corpus = df_test.text.sample(1000, random_state = 8).values
corpus_perplexity(corpus)

10334 -31745.58810864042


21.584069021216223

In [110]:
corpus_perplexity(df_test.text.values)

871654 -2661949.8855554415


21.197994526450948

## Try to improve the perplexity score of your model by:
- modifying the preprocessing phase of the corpus,
- increasing or decreasing the order of the n-grams in the model (bigrams, 4-grams, etc.),
- varying the delta parameter in the Additive Laplace smoothing step.
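
One way to explore the last two suggestions is a small grid over the n-gram order and delta. The sketch below runs on a toy corpus and is only an illustration under stated assumptions: the names `padded_ngrams`, `train` and `corpus_perplexity_smoothed` are hypothetical, the smoothing uses the vocabulary size in the denominator, and the perplexity is normalized by the number of predicted tokens (padding included), which differs from the normalization used in the cells above.

```python
import math
from collections import Counter, defaultdict

def padded_ngrams(tokens, n):
    # hypothetical helper: pads with n - 1 symbols on each side, like nltk's ngrams
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def train(corpus, n):
    """Count prefix -> next-token occurrences and collect the vocabulary."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        for ngram in padded_ngrams(tokens, n):
            counts[ngram[:-1]][ngram[-1]] += 1
    vocab = {tok for tokens in corpus for tok in tokens} | {"</s>"}
    return counts, vocab

def corpus_perplexity_smoothed(corpus, counts, vocab, n, delta):
    """Perplexity with add-delta smoothing, averaged over every predicted token."""
    logprob, n_predictions = 0.0, 0
    for tokens in corpus:
        for ngram in padded_ngrams(tokens, n):
            counter = counts.get(ngram[:-1], Counter())
            proba = (counter[ngram[-1]] + delta) / (sum(counter.values()) + delta * len(vocab))
            logprob += math.log(proba)
            n_predictions += 1
    return math.exp(-logprob / n_predictions)

# toy data for illustration only; the notebook would use df_train and df_test
train_corpus = [
    "the model is good".split(),
    "the model is bad".split(),
    "the model fits well".split(),
]
test_corpus = ["the model is bad".split(), "the data fits well".split()]

results = {}
for n in (2, 3):
    counts, vocab = train(train_corpus, n)
    for delta in (0.01, 0.1, 1.0):
        results[(n, delta)] = corpus_perplexity_smoothed(test_corpus, counts, vocab, n, delta)

# lowest perplexity first
for (n, delta), score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"n={n} delta={delta}: perplexity {score:.2f}")
```

On the real data the same loop would be run on a held-out sample, since recounting the n-grams of the full training set for each order takes about a minute per pass.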