# n-gram model

In [1]:
import numpy as np

In [2]:
import pandas as pd

df = pd.read_csv('tokenized.csv')

In [3]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,eliciting priors from experts,title
1,2,,,what is normality ?,title
2,3,,,what are some valuable statistical analysis op...,title
3,4,,,assessing the significance of differences in d...,title
4,6,,,the two cultures : statistics vs . machine lea...,title


## Train-test split

In [4]:
test = df[df.category=='title']
train = df[df.category!='title']

In [5]:
train

Unnamed: 0,post_id,parent_id,comment_id,text,category
91736,1,,,how should i elicit prior distributions from e...,post
91737,2,,,in many different statistical methods there is...,post
91738,3,,,what are some valuable statistical analysis op...,post
91739,4,,,i have two groups of data . each with a differ...,post
91740,5,3.0,,the r project r is valuable and significant be...,post
...,...,...,...,...,...
810595,279994,,536471.0,"it does run , and gives very valid looking est...",comment
810596,279998,,536439.0,it seems to me that you are correct ; the doub...,comment
810597,279998,,536514.0,it would not be the first time a grader has mi...,comment
810598,279999,,536802.0,the basic idea is to compare the clustering co...,comment


## Prefix-word matrix

In [6]:
from nltk.util import ngrams

In [7]:
line = train.text.values[0]
line

'how should i elicit prior distributions from experts when fitting a bayesian model ?'

In [8]:
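# Note: the pad symbols only take effect together with pad_left=True / pad_right=True,
# so as called here no '<s>' or '</s>' tokens are added.
list(ngrams(line.split(), n=3, left_pad_symbol='<s>', right_pad_symbol='</s>'))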

[('how', 'should', 'i'),
 ('should', 'i', 'elicit'),
 ('i', 'elicit', 'prior'),
 ('elicit', 'prior', 'distributions'),
 ('prior', 'distributions', 'from'),
 ('distributions', 'from', 'experts'),
 ('from', 'experts', 'when'),
 ('experts', 'when', 'fitting'),
 ('when', 'fitting', 'a'),
 ('fitting', 'a', 'bayesian'),
 ('a', 'bayesian', 'model'),
 ('bayesian', 'model', '?')]
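
The output above contains no boundary tokens. For comparison, here is a minimal sketch of trigrams with sentence-boundary padding, written in plain Python so it runs without NLTK. The helper name `padded_trigrams` is hypothetical and not part of the notebook; it pads with two symbols on each side, which is what `ngrams` is expected to do for `n=3` when `pad_left=True` and `pad_right=True` are passed.

```python
def padded_trigrams(tokens, left='<s>', right='</s>'):
    """Return trigrams over tokens padded with two boundary symbols per side."""
    # n - 1 = 2 padding symbols on each side for trigrams
    padded = [left] * 2 + list(tokens) + [right] * 2
    return list(zip(padded, padded[1:], padded[2:]))


example = padded_trigrams('what is normality ?'.split())
for trigram in example:
    print(trigram)
```

With padding, the first trigram starts from two `<s>` symbols and the last one ends in two `</s>` symbols, so a model trained on them can learn how sentences begin and end.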

In [9]:
from collections import defaultdict, Counter

In [10]:
word_matrix = defaultdict(Counter)

In [11]:
from tqdm.notebook import tqdm

In [12]:
# word_matrix maps each bigram prefix to a Counter of the words that follow it
for text in tqdm(train.text):
    for trigram in ngrams(text.split(), n=3, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bigram, token = trigram[:-1], trigram[-1]
        word_matrix[bigram][token] += 1

## Text generation

In [13]:
import random

In [14]:
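def make_temp_sample(elems, weights, tau=1.0):
    """Sample one element with temperature tau; weights may be unnormalized counts."""
    # tau < 1 sharpens the distribution, tau > 1 flattens it
    scaled = [w ** (1.0 / tau) for w in weights]
    return random.choices(elems, scaled)[0]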
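
To see what the temperature does, the sketch below normalizes the rescaled weights for a toy pair of counts. The helper `temperature_weights` and the counts `[8, 2]` are illustrative assumptions, not part of the notebook.

```python
def temperature_weights(counts, tau=1.0):
    """Return normalized sampling probabilities for raw counts at temperature tau."""
    scaled = [c ** (1.0 / tau) for c in counts]
    total = sum(scaled)
    return [s / total for s in scaled]


counts = [8, 2]
sharp = temperature_weights(counts, tau=0.5)  # exponent 2: 64 vs 4
plain = temperature_weights(counts, tau=1.0)  # exponent 1: 8 vs 2
flat = temperature_weights(counts, tau=2.0)   # exponent 0.5: sqrt(8) vs sqrt(2)
print(sharp, plain, flat)
```

Lower temperatures concentrate probability on the most frequent continuation, while higher temperatures move the distribution toward uniform.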

In [15]:
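def generate_text(bigram, n=None):
    """Extend a seed bigram by sampling next words in proportion to trigram counts."""
    first, second = bigram
    result = [first, second]
    # '</s>' can only be sampled if the n-grams were built with right padding;
    # otherwise generation ends at an unseen prefix or after n tokens.
    while second != '</s>':
        # .get avoids adding empty entries to the defaultdict for unseen prefixes
        cnt = word_matrix.get((first, second))
        if not cnt:
            break
        pairs = cnt.most_common()
        words, counts = zip(*pairs)
        token = make_temp_sample(words, counts, tau=1)
        result.append(token)
        first, second = second, token
        if n is not None and len(result) >= n:
            break
    return ' '.join(result)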

In [16]:
generate_text(('you', 'see'), n=100)

'you see a more frequentist like approach ? or is there a better approach than psm . if you are trying to do more preprocessing after initial communalities as smc of p instead of evaluating how recoverable the solutions different ? or please define all possible types , so by the intercept : or produce , so in a more sophisticated hampel filter . so i only got half of a dense representation from a bayesian approach or at least pages long . given the value of x values , i do not see why , please elaborate a little'

## Probability of a sentence

In [17]:
VOCABULARY_SIZE = len({token for text in train.text for token in text.split()})

In [18]:
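def calc_trigram_proba(trigram, delta=1.0):
    """Additively smoothed P(token | bigram); delta=1.0 is Laplace smoothing."""
    bigram = trigram[:-1]
    token = trigram[-1]
    freqs = word_matrix.get(bigram, {})
    count_tok = freqs.get(token, 0)
    count_pref = sum(freqs.values())
    # An unseen prefix has count_pref == 0, giving the uniform 1 / VOCABULARY_SIZE
    p = (count_tok + delta) / (count_pref + delta * VOCABULARY_SIZE)
    return p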
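
As a sanity check on the smoothing formula, the sketch below applies the same expression to a toy four-word vocabulary and confirms that the smoothed probabilities for one prefix sum to 1. The names `smoothed_proba`, `toy_vocab`, and `toy_freqs` are made up for this illustration.

```python
from collections import Counter


def smoothed_proba(freqs, token, vocab_size, delta=1.0):
    """Same additive-smoothing expression, with the counts passed in explicitly."""
    count_tok = freqs.get(token, 0)
    count_pref = sum(freqs.values())
    return (count_tok + delta) / (count_pref + delta * vocab_size)


toy_vocab = ['a', 'b', 'c', 'd']
toy_freqs = Counter({'a': 3, 'b': 1})  # continuations observed after one prefix
probs = {w: smoothed_proba(toy_freqs, w, len(toy_vocab)) for w in toy_vocab}
total = sum(probs.values())
print(probs, total)
```

Words never seen after the prefix (`c` and `d` here) still receive a nonzero share, taken from the mass of the observed words.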

In [19]:
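def calc_text_proba(text):
    """Probability of a text as the product of its smoothed trigram probabilities."""
    trigrams = ngrams(text.split(), n=3, left_pad_symbol='<s>', right_pad_symbol='</s>')
    # A plain product of many small factors can underflow to 0.0 for long texts
    p = 1.0
    for trigram in trigrams:
        p *= calc_trigram_proba(trigram)
    return p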

In [20]:
text = train.sample(1).text.values[0]
print(text)

calc_text_proba(text)

no , i mean training set error where it is written . the training error is the number of misclassified examples in the training set divided by training set size . similarly test set error is number of misclassified examples in test set divided by training set size . also you may want to check coursera machine learning class , especially videos for advice for applying machine learning . those advice are quite relevant to your situation .


7.45363538246843e-288
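
A value this small is already close to the limits of double precision, and a longer text would underflow to exactly 0.0. A common workaround, sketched here as an assumption rather than as part of the notebook, is to sum log-probabilities instead of multiplying probabilities. The helper `log_proba` and the constant per-token probability are illustrative only.

```python
import math


def log_proba(token_probas):
    """Sum of natural-log probabilities; stays finite where the product underflows."""
    return sum(math.log(p) for p in token_probas)


probas = [1e-3] * 200  # 200 tokens, each with probability 0.001

naive = 1.0
for p in probas:
    naive *= p  # underflows to 0.0 well before the loop finishes

log_p = log_proba(probas)
print(naive, log_p)
```

The log-probability remains usable for comparing texts even when the raw product is no longer representable.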

## Perplexity

In [21]:
import math

In [22]:
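def calc_perplexity(text):
    """Perplexity of a text: geometric mean of the inverse trigram probabilities."""
    trigrams = list(ngrams(text.split(), n=3, left_pad_symbol='<s>', right_pad_symbol='</s>'))
    # Texts with fewer than three tokens yield no trigrams and return 1.0
    pp = 1.0
    for trigram in trigrams:
        pp *= (1 / calc_trigram_proba(trigram)) ** (1 / len(trigrams))
    return pp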

In [23]:
text = train.sample(1).text.values[0]
print(text)

calc_perplexity(text)

i am predicting the binary class , ie if it is in top or not , of a security based upon it is performance using predictors from current time . so it is simply a cross sectional classifier . as of now i have used random forest and neural net for this purpose . now i want to extend it so that i can one step ahead prediction of the class . please suggest some starting point . i understand it might be open ended question . thanks for reading . i know how to use time series , but i am not sure how to go so for a classifier . also all the predictors are numeric variables , none of them are categorical . i am doing all in r , so it would be great if i get related pointers , not a strict constraint though .


2068.6367285395027
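
The same quantity can be written as the exponential of the average negative log-probability, which is often easier to reason about. The sketch below is a hypothetical helper, not the notebook's function; it takes a list of per-trigram probabilities and, as a design choice of this sketch, returns NaN rather than 1.0 when the list is empty.

```python
import math


def perplexity_from_probas(token_probas):
    """exp of the mean negative log-probability (the geometric mean of 1/p)."""
    if not token_probas:
        return float('nan')  # assumption: treat an empty text as undefined
    avg_neg_log = -sum(math.log(p) for p in token_probas) / len(token_probas)
    return math.exp(avg_neg_log)


uniform = perplexity_from_probas([0.25] * 10)  # uniform choice among 4 words
mixed = perplexity_from_probas([0.5, 0.125])   # sqrt(2 * 8)
print(uniform, mixed)
```

A model that always assigns probability 1/4 has perplexity 4, which is why perplexity is often read as an effective branching factor.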

In [24]:
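# This is the sum of per-text perplexities over the training texts, not a mean;
# divide by len(train) for the average. The held-out `test` titles are not scored here.
sum(calc_perplexity(text) for text in tqdm(train.text))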

4033762010.4495244
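
A sum of per-text perplexities grows with the number of texts, so it is hard to compare across corpora. One alternative, sketched here under the assumption that a single corpus-level figure is wanted, pools the log-probabilities of all trigrams before exponentiating. The function `corpus_perplexity` and the toy probabilities are hypothetical.

```python
import math


def corpus_perplexity(per_text_probas):
    """Perplexity over all trigrams of all texts, pooled in log space."""
    total_log = 0.0
    total_n = 0
    for probas in per_text_probas:
        total_log += sum(math.log(p) for p in probas)
        total_n += len(probas)
    return math.exp(-total_log / total_n) if total_n else float('nan')


# Two toy texts: per-text perplexities are 2 and 8
texts = [[0.5, 0.5], [0.125] * 6]
pooled = corpus_perplexity(texts)
mean_of_texts = (2.0 + 8.0) / 2
print(pooled, mean_of_texts)
```

The pooled value weights each text by its number of trigrams, so it generally differs from the unweighted mean of per-text perplexities.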