# M1 Loading and Preparing the Dataset

## Objective

The goal of this preliminary milestone is to load and preprocess the dataset. The raw text is noisy and we want to remove nonwords and non-ASCII characters, keep punctuation to a minimum, and reduce the overall vocabulary of the corpus.

- Although this corpus is not as noisy as a text directly extracted from a social network (for example, Twitter or Facebook), it is still not as structured as academic papers or newspaper articles. Furthermore, the corpus displays some interesting particularities, such as the presence of HTML markup and LaTeX-formatted equations. The corpus is also rich in specific entities, names of theorems, and statistical test algorithms, and it mixes colloquial writing with more formally structured paragraphs.


- The garbage-in, garbage-out golden rule of machine learning is also applicable to language models. Simply put, if we skip the preprocessing/cleaning part of the project, the vocabulary of our language model will be too vast and noisy to make any sense. Generated text, for instance, may mix in mathematical symbols with punctuation signs or random HTML tags and numbers. By reducing the volume of the corpus vocabulary, we increase the relevance and quality of the generated text and improve the reliability of sentence selection based on their respective probabilities. We also reduce the memory imprint of our code and its execution time.


- Preprocessing the text to reduce noise and vocabulary size is an iterative process. You should start simple and further refine the preprocessing steps after building and evaluating your first language models.~

In [2]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

## Load the dataset into a pandas DataFrame

In [3]:
df = pd.read_csv('~/data/stackexchange_812k.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 812132 entries, 0 to 812131
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   post_id     812132 non-null  int64  
 1   parent_id   75535 non-null   float64
 2   comment_id  553076 non-null  float64
 3   text        812132 non-null  object 
 4   category    812132 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 31.0+ MB


In [5]:
df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


In [6]:
df.text.sample(10)

146574    <p>In this question on <a href="https://stacko...
165751    <p>My company makes widgets. We take a random ...
49517     Determining the PMF of the maximum of dependen...
346266    Did you find out what different inputs (levels...
809970    Is this is a balanced design? Also, I assume t...
330062    @PeterFlom I've cleared the question up a bit ...
566356    The model would be different if the categories...
806572    @rvl Thanks for the interest. I asked this bec...
96792     <p><strong>N.B.</strong>: <em>This was previou...
15296     How to explain smoothing functions in the logi...
Name: text, dtype: object

## Use regular expressions to remove elements that are not words, such as HTML tags, LaTeX expressions, URLs, digits, and line returns.

In [7]:
HTML = "<[^>]*>"
LATEX = "\$[^>]*\$"
URLS = "http\S+"
CRS = "[\r\n]+"
DIGITS = "\$[^>]*\$"
SPACES = "\s\s+"
PUNCT = '"#$%&()*+/:;<=>@[\\]^_`{|}~”“'
pattern = r"[{}]".format(PUNCT)

def clean_text(text):
    """
        text: a string        
        return: modified initial string
    """
    text = re.sub(HTML,' ', text)
    text = re.sub(LATEX,' ', text)
    text = re.sub(URLS,' ', text)
    text = re.sub(CRS,' ', text)
    text = re.sub(DIGITS,' ', text)
    text = re.sub(pattern,' ', text)
    text = re.sub(SPACES,' ', text)
    text = re.sub(DIGITS,' ', text)
    return text.strip()

In [8]:
clean_text('Formulate hypotheses when $\mu_A < \mu_B$')

'Formulate hypotheses when'

In [9]:
clean_text('See my response to <a href="https://stackoverflow.com/questions/2252144/datasets-for-running-statistical-analysis-on')

'See my response to a href'

In [10]:
# Sample of comments
for p in df[df.category == 'comment'].text.sample(3).values:
  print('-' * 20)
  print(p)

--------------------
@ssdecontrol Yes, true. My comment was at least partly in jest. Should have put a smiley at the end. Sorry!
--------------------
I'm still carefully reading the references and will likely have follow up questions, but this is most definitely the answer I was looking for.
--------------------
Have a look at this question: http://stats.stackexchange.com/questions/77573/invariance-property-of-mle-what-is-the-mle-of-theta2-of-normal-barx2


In [11]:
df.text = df.text.apply(clean_text)

In [12]:
# Post clean sample of comments
for p in df[df.category == 'comment'].text.sample(3).values:
  print('-' * 20)
  print(p)

--------------------
Note that if you have more than 2 classes then 55 is far better then coin flipping. several classes sounds to be more than 2 classes
--------------------
winperikle very relevant indeed thanks! Though, i wonder how should I include the 3rd order interaction term. Any idea?
--------------------
could you make the statistical query clearer, the question is heavily weighted towards code which would be more suitable to stack overflow. Are you wanting readers to work out why code if giving an unexpected answer if so try SO or are you asking for the statistical explanation of the observed results? For the latter more figures and background to your data would be needed.


## Remove texts that contain blanks only.

In [13]:
df.text.count()

812132

In [14]:
df[df.text.str.len() == 0].text.count()

1422

1422 out of 812132 entries have a zero length text.

In [15]:
df = df[df.text.str.len() > 0]

In [16]:
df.text.count()

810710

## Remove texts that are extremely large or too short to add any information to the model. 

We want to keep paragraphs that contain at least a few words and remove the paragraphs that are composed of large numerical tables.

In [17]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
df['tokens'] = df.text.apply(lambda t : tokenizer.tokenize(t.lower()))

In [18]:
df['n_tokens'] = df.tokens.apply(len)

In [19]:
df.n_tokens.describe()

count    810710.000000
mean         63.246199
std         122.586727
min           1.000000
25%          16.000000
50%          36.000000
75%          72.000000
max       14835.000000
Name: n_tokens, dtype: float64

In [20]:
df.n_tokens.max()

14835

In [21]:
df = df[(df.n_tokens > 4) & (df.n_tokens < 5000)].reset_index(drop = True)
df.shape

(791172, 7)

## Use a tokenizer to create a version of the original text that is a string of space-separated lowercase tokens. 

For instance,

- Thank you!, This equation y = ax + by=ax+b, is very helpful.

    would be transformed to:

    thank you ! this equation , is very helpful .

- “retrieve a distance matrix” is a matter of coding. It also might be irrelevant: one can imagine creative answers.

    becomes, if you choose to remove double quotes from the original text:

    retrieve a distance matrix is a matter of coding. it also might be irrelevant : one can imagine creative answers .

Note that punctuation signs (, . : !) are also represented as tokens.

In [22]:
from nltk import word_tokenize
from nltk import Text

In [23]:
def space_separated_lower(text):
    tokens = word_tokenize(text.lower())
    return " ".join(list(filter(lambda x: x not in ['“', "”"], tokens)))

In [24]:
text = '“retrieve a distance matrix” is a matter of coding. It also might be irrelevant: one can imagine creative answers.'
space_separated_lower(text)

'retrieve a distance matrix is a matter of coding . it also might be irrelevant : one can imagine creative answers .'

In [25]:
df['tokens'] = df.text.apply(space_separated_lower)

KeyboardInterrupt: 

## Export the resulting DataFrame into a CSV file.

In [None]:
import csv
df.to_csv("../data/stackexchange_cleaned.csv", quoting = csv.QUOTE_ALL, index = False)

# M2 N-gram Language Model

## Objective

In this second milestone of the liveProject, the objective is to build an n-gram language model that is defined by the probabilities of all the n-grams in the corpus. Assuming that each token only depends on n-1 previous tokens, the language model is fully defined by the probability of any token in the corpus given its n-1 previous tokens (the prefix).

You will use the language model to complete the following tasks:

Generate text and complete queries from sequences of n-grams, using temperature sampling to tune the randomness of the generated text.
Calculate the probability of a sentence and select the most probable sentence among several candidates.
Score the quality of the language model using perplexity.
Handle out-of-vocabulary (OOV) tokens with Laplace smoothing.

Although simple in its approach, an n-gram language model with additive smoothing is a fast and reliable way to build a model that you can exploit for simple tasks such as query completion and sentence selection, provided that the training dataset is large and specific enough for the domain of interest.

This first language model will serve as a baseline for the more complex language models that we will create in subsequent tasks. It also underlines the different problems and challenges inherent to any NLP task, such as handling out-of-vocabulary tokens, the importance of cleaning the original raw data, and the quality assessment of a language model.

An n-gram model is defined as the probabilities of all the n-grams in the corpus. Under certain Markovian independence assumptions, this is equivalent to evaluating the probability of any token given its n-1 previous tokens (the prefix). For a given prefix, the probabilities of all the following tokens add up to 1 and constitute the probability distribution of the prefix.

For instance, in our current corpus, the prefix “how many” may be followed by the words “people,” “times,” or “ways,” with respective frequencies of 0.46, 0.31, and 0.23, while the prefix “the model” is followed by the words “parameters” or “is” or a period, with frequencies 0.43, 0.36, and 0.21, and so forth.

In an n-gram language model, the probability of a token given a prefix of n-1 tokens is given by its maximum likelihood estimate (MLE).

In [None]:
# Set some global parameters

# Displaying all columns when displaying dataframes
pd.options.display.max_columns = None

# We will work with trigrams 
ngrams_degree = 3


## Split the dataset into a training and a testing subset. 

Use the category “title” for the testing set and the categories “comment” and “post” for the training set. The short length of titles will make them good candidates later as seeds for text generation.

In [None]:
df = pd.read_csv('../data/stackexchange_cleaned.csv').sample(frac=1).reset_index(drop = True)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.text[:5]

In [None]:
df['tokens'] = df.tokens.apply(lambda t : tokenizer.tokenize(' '.join(t)))

In [None]:
df.sample(5).tokens.values

In [None]:
# split the dataset into train and testing subset
df_train = df[df.category.isin(['post', 'comment'])].copy()
df_test = df[df.category.isin(['title'])].copy()

In [None]:
# Display the dimensions of the dataframe 
print("-- Training set: {}\n".format(df_train.shape))
# and the 1st 5 lines
print(df_train.head())

print("\n-- Testing set {}\n".format(df_test.shape))
print(df_test.head())

## Build the matrix of prefix—word frequencies.

- Use the ngrams function from nltk.utils to generate all n-grams from the corpus.


- Set the following: left_pad_symbol = \<s> and right_pad_symbol = \</s>.

### Counting bigrams and following tokens

We build a counts object defined as a defaultdict(Counter). 

Taking into account all trigrams (ngrams_degree = 3) that we break into prefix (bigrams) followed by single tokens. 

The counts object will have the bigrams as keys and for each key a Counter of all the potential tokens. 

For instance, if the corpus contains a 100 instances of "*how many people*" and a 120 instances of "*how many times*" we would get the following entry:

    counts[('how', 'many')] = Counter('people': 100, 'times': 120, .... )

Similarly if the corpus contains "*the model is*" 500 times and "*the model parameters*" 200 times, we end up with:

    counts[('the', 'model')] = Counter('is': 500, 'parameters': 200, .... )

To split the tokens into bigramns we use the [ntlk.ngrams](https://www.nltk.org/api/nltk.html#nltk.util.ngrams) function:


    Return the ngrams generated from a sequence of items, as an iterator.
    For example:

    >>> from nltk.util import ngrams
    >>> list(ngrams([1,2,3,4,5], 3))
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

The next cell should take a couple of minutes.

Note that we build the mode on the training subset df_train and leave the testing subset aside.

In [None]:
from collections import defaultdict, Counter
from nltk.util import ngrams
from tqdm import tqdm
import numpy as np

In [None]:
def n_grams(sentence, ngrams_degree=3):
    return ngrams(
        sentence, 
        n = ngrams_degree,  
        pad_right = True, 
        pad_left = True, 
        left_pad_symbol = "<s>", 
        right_pad_symbol = "</s>")

In [None]:
sentence = "the difference between the two approaches is discussed here"
list(n_grams(sentence.split(), 4 ) )[:10]

In [None]:
counts = defaultdict(Counter)
for tokens in tqdm(df_train.tokens.values):
    for ngram in n_grams(tokens):      
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        counts[prefix][token] +=1

In [None]:
print("we have {} bigrams".format(len(counts.keys())))

In [None]:
import random

for i in range(5):
    prefix = random.choice(list(counts.keys()))
    print("{}: \t{}".format(prefix,counts[prefix]))

In [None]:
tokens_count = [ len(v)   for k,v in counts.items() ]

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize=(12,6))
plt.hist(tokens_count, bins = 100);

In [None]:
bigrams_with_single_tokens = [ k   for k,v in counts.items() if len(v) == 1 ]
bigrams_with_two_tokens = [ k   for k,v in counts.items() if len(v) == 2 ]

print("{} bigrams_with_single_tokens".format(len(bigrams_with_single_tokens)))
print("{} bigrams_with_two_tokens".format(len(bigrams_with_two_tokens)))

In [None]:
tokens_dict = { k:len(v)   for k,v in counts.items() if len(v) > 10000 }
tokens_dict

In [None]:
for prefix, tokens in counts.items():
    print("prefix=", prefix, "\ntokens=", tokens.most_common(10))
    print(sum(counts[prefix].values()))
    break

###  token / prefix probabilities

To obtain token / prefix probabilities using the Maximum Likelihood Estimator, we must simply normalize each (prefix - token) count by the total number of the prefix occurence. 

$$p(token / prefix) = \frac{count(prefix + token)} {count(prefix)}$$


Keeping the same defaultdict(Counter) structure for the freq object, we should obtain something similar to 


    freq[('how', 'many')] = {'people': 0.14, 'times': 120, .... }

with 
* p(people / how many) = c('how many people') / c('how many') 
* p(times / how many) = c('how many times') / c('how many')

In [None]:
freq = defaultdict(dict)
for prefix, tokens in counts.items():
    total = sum(counts[prefix].values())
    for token, count in tokens.items():
        freq[prefix][token] = count / total

In [None]:
for i in range(5):
    prefix = random.choice(list(freq.keys()))
    print("{}: \t{}".format(prefix,freq[prefix]))

## Write a text generation function with the following features:

- Takes a bigram as input and generates the next token


- Iteratively slides the prefix over the generated text so that the new prefix includes the most recent token; generates the next token


- To generate each next token, samples the list of words associated with the prefix using the probability distribution of the prefix


- Stops the text generation when a certain number of words have been generated or the latest token is a \</s>

In [None]:
def generate(text, n_words = 40):
    for i in range(n_words):
        prefix = tuple(text.split()[-ngrams_degree+1:])
        if len(freq[prefix]) == 0:
            break
        candidates  = list(freq[prefix].keys())
        probabilities = list(freq[prefix].values())
        text += ' ' + np.random.choice(candidates, p = probabilities)
        if text.endswith('</s>'):
            break
    return text

In [None]:
tuple('the model'.split()[-ngrams_degree+1:])

In [None]:
text      = 'the model'
print()
print(generate(text))

print()
text      = 'that distribution'
print(generate(text))

print()
text      = 'to determine'
print(generate(text))

## Write a function that can estimate the probability of a sentence and use it to select the most probable sentence out of several candidate sentences.


Split the sentence into trigrams and use the chain rule to calculate the probability of the sentence as a product of the bigrams—tokens probabilities.

- Estimate the probability of a sentence and use it to select the most probable sentence out of several candidate sentences.

- Similar to the above process calculate the candidates and initial_probabilites from freq dictionary

- Here we will modify the initial_probabilites using temperature and normalizing it to generate random candidates. 

### Temperature sampling

As you may have noticed, for some bigrams, one particular token may be much more frequent than the others potential tokens. 

For instance:

* ('building', 'machine'): 	{'learning': 0.875, 'classification': 0.125}

when generating the next token based on the bigram "*building machine*", most of the times the word "learning" will be chosen instead of "classification".

In order to compensate these imbalances and improve the chances of less frequent tokens to be chosen we can sample with temperature.

In order to increase the randomness of the next token selection given a prefix, we can flatten the distribution using the temperature $$\tau$$ to define a new probability distribution as such:

$$f_{\tau}(p_i) = \frac{ p_i^{\frac{1}{\tau}} }{ \sum_j p_j^{\frac{1}{\tau}} }$$

See [this post](https://stats.stackexchange.com/questions/255223/the-effect-of-temperature-in-temperature-sampling) for a more in-depth explanation on temperature sampling.

In [None]:
def generate_temp(text, temperature = 1, n_words = 30):
    for i in range(n_words):
        prefix = tuple(text.split()[-ngrams_degree+1:])
        if len(freq[prefix]) == 0:
            break
        candidates  = list(freq[prefix].keys())
        initial_probabilities = list(freq[prefix].values())
        denom   = sum([p ** temperature for p in initial_probabilities])
        probabilities  = [p ** temperature / denom  for p in initial_probabilities]
        text  += ' ' + np.random.choice(candidates, p = probabilities)
        if text.endswith('</s>'):
            break
    return text

In [None]:
text  = 'the model'
# text  = 'to determine'
# text  = 'not sure'

for tau in [0.01, 0.5, 1, 3, 10]:
    print(tau)
    print(generate_temp(text, temperature = tau))

## Implement the perplexity scoring function for a given sentence and for the training corpus.

Let's now implement a way to measure the quality of our model.

The idea is to estimate the probability of a test sentence given our model. 
An uncommon sentence should be less probable than a common one.


Notes : 
  1. At this point the sentence should exist in the corpus. Our model does not know yet how to handle out-of-vocabulary (OOV) bigrams, trigrams or tokens.
  2. To avoid the problem of underflow caused by multiplying multiple very small floats, we work in the log space:

So instead of calculating perplexity with (case ngrams_degree = 3):
 
$$PP(w_{1},\cdots, w_N) = ( \prod_{i = 3}^{N} \frac{1}{ p(w_i/ w_{i-2}w_{i-1} )} )^{\frac{1}{N}}$$

We compute

$$PP(w_{1},\cdots, w_N) = \exp [ - \frac{1}{N} {\sum_{i = 3}^{N} \log {p(w_i/ w_{i-2}w_{i-1}} } ) ]$$

In [None]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

def perplexity(sentence):
    sentence = tokenizer.tokenize(sentence.lower())
    N = len(sentence)
    logprob = 0

    for ngram in n_grams(sentence): 
        try:
            prefix = ngram[:ngrams_degree-1] 
            token = ngram[ngrams_degree-1]
            logprob += np.log(freq[prefix][token])
        except:
            pass

    return np.exp(- logprob / N)

In [None]:
sentence = "the difference between the two approaches is discussed here"
print("[perplexity {:.2f}] {}".format(perplexity(sentence), sentence))

sentence = "this question really belongs on a different site"
print()
print("[perplexity {:.2f}] {}".format(perplexity(sentence), sentence))

sentence = "The function may only be linear in the region where the points were taken"
print()
print("[perplexity {:.2f}] {}".format(perplexity(sentence), sentence))

## Implement additive Laplace smoothing to give a non-zero probability to missing prefix—token combinations when calculating perplexity.

### Out of Vocabulary (OOV) 

The main weakness of our model so far is that it does not know how to handle elements that are not already in the original corpus.

Since both when generating text and when calculating perplexity we use the count of the prefix in the corpus, when that prefix is missing, the counts = 0  which causes problems with logs and divisions.

To remediate to that problem we can artificially assign a probability (although a very low one) to missing ngrams and tokens.

This method is called Laplace smoothing. It relies on calculating the frequency of a token / prefix with:

$$ p(token / prefix) = \frac{ count( prefix + token) + \delta}{count(prefix) + \delta \times |N| }$$


Where 

* N is the total number of prefixes in the model
* delta is an arbitrary number 

When the prefix is missing from the original corpus, the probability of a token / prefix will now be:

$$p(token / prefix) = \frac{1} { | N |}$$

Let's implement that perplexity with Laplace Smoothing


In [None]:
def perplexity_laplace(sentence, delta = 1):
    sentence = tokenizer.tokenize(sentence.lower())
    N = len(sentence)
    logprob = 0
    for ngram in n_grams(sentence): 
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        if prefix in list(counts.keys()):
            total = sum(counts[prefix].values())
            if token in counts[prefix].keys():
                logprob += np.log((counts[prefix][token] + delta)/ (total + delta * N))
            else:
                logprob += np.log((delta) / (total + delta * N ))
        else:
            logprob += - np.log(N)
  
    return np.exp(- logprob / N)

In [None]:
# calculate the perplexity of sentences that were not present in the original corpus.

sentence = "this model belongs on a different planet"
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 10), sentence))

sentence = "this question really belongs on a different site."
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 10), sentence))

## Calculate the perplexity of the language model on the test set composed of titles.

Perplexity on the test corpus and sentence probability.

How do we calculate the perplexity of a model on a test corpus?

Let's say we have *m* sentences in the corpus, the perplexity of the corpus is given by 

$$ PP(Corpus) = P(S_1, \cdots, S_m)^{-\frac{1}{N}} $$

We can assume that the sentences are independent

$$ PP(Corpus) = (\prod_{k = 1}^{m}  P(S_k))^{-\frac{1}{N}} $$

Which we calculate in the log space to avoid underflow

$$ PP(Corpus) = \exp ( -\frac{1}{N} \sum_{k = 1}^{m}  log(P(S_k)) $$

So to calculate the perplexity on a test corpus we need to calculate the probability of each single sentence.

The following function calculates the probability of a sentence. 

Instead of using laplace smoothing to deal with the missing bigrams and tokens, we will simply skip missing elements to make the function faster.
Implementing laplace smoothing requires several extra conditions that are taking too much time to run.

In [None]:
def sentence_log_probability(sentence, delta = 1, ngrams_degree = 3):
    sentence = tokenizer.tokenize(sentence.lower())
    logprob = 0
    for ngram in n_grams(sentence, ngrams_degree):
        prefix = ngram[:ngrams_degree-1]
        token = ngram[ngrams_degree-1]
        try:
            logprob += np.log( freq[prefix][token] )
        except:
            pass

    return logprob

In [None]:
def corpus_perplexity(corpus, ngrams_degree = 3):
    # start by calculating the total number of tokens in the corpus
    all_sentences = ' '.join(corpus)

    all_tokens =  tokenizer.tokenize(all_sentences.lower())
    N = len(tokens)

    logprob = 0
    probs = []
    for sentence in tqdm(corpus):
        lp = sentence_log_probability(sentence, ngrams_degree)
        probs.append(lp)
        if lp != np.inf:
            logprob += lp
        else:
            print(lp)
#     print(probs)        
    print(logprob, N)
    return np.exp( - logprob / N)

In [None]:
# The perplexity of a sample of 1000 titles
corpus = df_test.text.sample(1000, random_state = 8).values
corpus_perplexity(corpus)

In [None]:
# and the perplexity of the whole test corpus
corpus_perplexity(df_test.text.values)

## Try to improve the perplexity score of your model as follows:

- Modify the preprocessing phase of the corpus.


- Increase or decrease the number of tokens in the model (bigrams, 4-grams, and so on).


- Vary the delta parameter in the additive Laplace smoothing step.

### Modify the preprocessing phase of the corpus.

Not clear what should change, so not tried.

### Increase or decrease the number of tokens in the model (bigrams, 4-grams, and so on).

In [None]:
# The perplexity of a sample of 1000 titles and using 4-grams
# original 3-gram score was -31728.947433616246
# did perplexity increase?
corpus = df_test.text.sample(1000, random_state = 8).values
corpus_perplexity(corpus, ngrams_degree=4)

In [None]:
# and the perplexity of the whole test corpus using 4-grams
# original 3-gram score was -2671335.5473894435
# did perplexity decrease?
corpus_perplexity(df_test.text.values, ngrams_degree=4)

### Vary the delta parameter in the additive Laplace smoothing step.

Perplexity using delta value of 10:

[perplexity 145.18] this model belongs on a different planet

[perplexity 35.54] this question really belongs on a different site.

Below using a delta value of 20 seems to reduce perplexity:

In [None]:
# calculate the perplexity of sentences that were not present in the original corpus.

sentence = "this model belongs on a different planet"
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 20), sentence))

sentence = "this question really belongs on a different site."
print("[perplexity {:.2f}] {}".format(perplexity_laplace(sentence, delta = 20), sentence))

## Building an n-gram language model using NLTK

Since version 3.4 the nltk library includes a language model module.

Let's install the right version of nltk. Feel free to install any version > 3.4.5. 

After running the pip install command below you will need to restart the runtime. This will erase all the local variables. So we will reload and prepare the dataset from scratch.

In [None]:
import nltk 
nltk.__version__

In [None]:
import pandas as pd
import numpy as np
import re
import csv
from tqdm import tqdm
from collections import defaultdict, Counter
from nltk.util import ngrams

ngrams_degree = 3

In [None]:
# Load data into pandas dataframe, shuffle it and reset the index
# ../data/stackexchange_812k.csv
# ../data/stackexchange_cleaned.csv
df = pd.read_csv('../data/stackexchange_812k.tokenized.csv').sample(frac=1).reset_index(drop = True)

In [None]:
df['tokens'] = df.tokens.apply(lambda txt : txt.split())
df_train = df[df.category.isin(['post','comment'])].copy()
df_test = df[df.category.isin(['title'])].copy()

In [None]:
from nltk.lm import MLE
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import padded_everygram_pipeline

In [None]:
# len(list(vocab)) = 50543343

In [None]:
# define the model
model = MLE(ngrams_degree,vocabulary=Vocabulary(unk_cutoff = 20))

train, vocab = padded_everygram_pipeline(ngrams_degree, df_train.tokens.values)

# fit the model
model.fit(train, vocab)

Then you can use the perplexity and generate functions of the lm module.

In [None]:
model.perplexity(ngrams(df_test.tokens.values[0], 2))

In [None]:
sentence = "the difference between the two approaches is discussed here"
print("[perplexity {:.2f}] {}".format(model.perplexity(sentence), sentence))

sentence = "this question really belongs on a different site"
print()
print("[perplexity {:.2f}] {}".format(model.perplexity(sentence), sentence))

sentence = "The function may only be linear in the region where the points were taken"
print()
print("[perplexity {:.2f}] {}".format(model.perplexity(sentence), sentence))

### Signature: model.generate(num_words=1, text_seed=None, random_seed=None)

Generate words from the model.

:param int num_words: How many words to generate. By default 1.

:param text_seed: Generation can be conditioned on preceding context.

:param random_seed: A random seed or an instance of `random.Random`. If provided,
makes the random sampling part of generation reproducible.

:return: One (str) word or a list of words generated from model.

Examples:

\>>> from nltk.lm import MLE

\>>> lm = MLE(2)

\>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])

\>>> lm.fit([[("a",), ("b",), ("c",)]])

\>>> lm.generate(random_seed=3)

\'a'

\>>> lm.generate(text_seed=['a'])

\'b'

In [None]:
model.generate(num_words=10, random_seed=2, text_seed=['people'])

In [None]:
random_seed=2
n_words=40
text      = 'the model'
print()
print(text+' '+' '.join(model.generate(n_words,text_seed=text,random_seed=random_seed)))

print()
text      = 'that distribution'
print(text+' '+' '.join(model.generate(n_words,text_seed=text,random_seed=random_seed)))

print()
text      = 'to determine'
print(text+' '+' '.join(model.generate(n_words,text_seed=text,random_seed=random_seed)))

# M3 Deep Learning Language Model

## Objective

In this milestone, we will build a language model using a long short-term memory (LSTM) neural network. The problem is framed as a multiclass classification problem where the number of classes corresponds to the size of the vocabulary. The number can be quite large. Given a sequence of n-grams, the classifier predicts the following token as a class. The input of the neural network is an array of sequences of tokens for the design matrix and the output is a vector of labels that corresponds to the target token.

When the vocabulary size is too large, training the model takes too long and the performance degrades. The challenge, therefore, lies in finding the right balance between the feasibility of the task and the quality of the model by reducing the vocabulary size but preserving its diversity.

Your goal is to create a language model that generates high-quality text with a low perplexity score on a validation set and is reasonably fast to train.

Creating a deep learning token-based language model brings specific challenges. Successfully implementing and training such a model on a real-world dataset requires the following:

- Optimizing Python structures and control flows to minimize memory impact and reduce processing times

- Balancing data diversity and dataset reduction


There are many parameters to handle both in the data processing and model fitting phases, and finding the right balance is also a challenge.

The language model building approach is entirely different from the n-gram approach. Instead of estimating tokens’ probability distributions using a maximum likelihood approach (counting occurrences of tokens) as we did for the n-grams language model, we train a classification model using a recurrent neural network approach.

There will be a comparison of the n-gram language model built in the previous milestone with this deep learning model, highlighting the strengths, weaknesses, and difficulties inherent to each method.

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import re
import csv
from tqdm import tqdm
from collections import defaultdict, Counter

In [None]:
# Import required libraries

# Note that we will not use keras tokenizer but keep using the same NLTK tokenizer from task 1
from nltk.tokenize import WordPunctTokenizer

# We use Keras here for simplicity. Replace with your neural network of choice.

#Load Keras libraries

# dataframe display option
pd.options.display.max_columns = None

## Preparing the data

### Load the dataset that was prepared in Milestone 1.

In [None]:
# setup variables
POSTS_TYPE = 'post'
MIN_TOKEN_LENGTH = 100
MAX_TOKEN_LENGTH = 200
DF_SAMPLE_COUNT = 20000

TOKENS_MIN_COUNT = 10

SEQUENCE_WINDOW = 4
SEQUENCE_LEN = 13

In [None]:
df_full = pd.read_csv('../data/stackexchange_812k.tokenized.csv').sample(frac=1).reset_index(drop = True)

### The original dataset is too large and needs to be reduced. To reduce it, you can, for instance, use the following techniques:

- Filter out items that have too many or too few tokens.

- Select items of a certain type, such as posts, comments, or titles.

- Sub-sample items randomly.

In [None]:
df_full.describe()

In [None]:
df = df_full[
            (df_full.category == POSTS_TYPE) & 
            (df_full.n_tokens > MIN_TOKEN_LENGTH)  & 
            (df_full.n_tokens < MAX_TOKEN_LENGTH)
        ].sample(DF_SAMPLE_COUNT).reset_index(drop = True)

print("df.shape: ", df.shape)
print(df.text.sample(2).values)

In [None]:
# transform the tokens field from white space separated strings into list of tokens
df['tokens'] = df.tokens.apply(lambda t : np.array(t.split()))
print(df.tokens.sample().values)

### Build the vocabulary as the set of all unique tokens to construct the list of token indexes.

Filtering on token frequency is one way to reduce the overall size of the vocabulary.

In [None]:
#generate vocabulary
#filter out words that are too scarce
import itertools
all_tokens = list(itertools.chain.from_iterable(df.tokens))

#filter out least common tokens
from collections import Counter
counter_tokens = Counter(all_tokens)

vocab_size  = len(set(all_tokens))
vocab       = list(set(all_tokens))
print("original number of tokens", len(all_tokens))
print("original vocab_size", vocab_size)

#remove all tokens that appear in less than TOKENS_MIN_COUNT times
fltrd_tokens = [ token for token in all_tokens if counter_tokens[token] > TOKENS_MIN_COUNT ]

print("new number of tokens", len(fltrd_tokens))
print("new vocab_size", len(set(fltrd_tokens)))

vocab_size  = len(set(fltrd_tokens))
vocab       = list(set(fltrd_tokens))
vocab.append('UNK')
vocab_size +=1 

In [None]:
# rejected tokens
rejected_tokens =  [ token for token in all_tokens if counter_tokens[token] <= TOKENS_MIN_COUNT ]

In [None]:
print("len(rejected_tokens): ", len(rejected_tokens))
print(np.random.choice(rejected_tokens, 100, replace = False))
# len(rejected_tokens):  25497

### Set a fixed sequence length and build sequences of token indexes from the corpus. (See, for instance, Keras pad_sequences.)

In [None]:
vocab.append('UNK')
vocab_size +=1 

In [None]:
mapping = { w : i for i, w in enumerate(vocab) }

def getidx(token):
    try:
        return mapping[token]
    except:
        return mapping['UNK']

df['tokens_idx'] = df.tokens.apply(lambda tokens : np.array([getidx(token) for token in tokens]))

In [None]:
print(df.tokens_idx.head(2).values)

### Split the sequences into predictors and labels (keras.utils.to_categorical).

In [None]:
# most likely using tf version 2.9 - go back to 2.8 and original path works
from keras.preprocessing.sequence import pad_sequences

In [None]:
sequence = [[1], [2, 3], [4, 5, 6]]
pad_sequences(sequence)

In [None]:
# Generate sequences
def generate_sequences(sentence):
    sequences = []
    _end = SEQUENCE_WINDOW
    while _end < len(sentence) + SEQUENCE_WINDOW:
        sequences.append(sentence[:_end])
        _end += SEQUENCE_WINDOW
    padded_seqs = pad_sequences(sequences, maxlen=SEQUENCE_LEN, padding='pre')
    return padded_seqs

In [None]:
# Apply the sequence generation 
multi_sequences = df.tokens_idx.apply(generate_sequences)

In [None]:
i = 0
for d in tqdm(multi_sequences.values):
    if i == 0:
        all_sequences = d
    else:
        all_sequences = np.concatenate( ( all_sequences, d )  )
    i +=1
print("\nsequences.shape: ",all_sequences.shape)
# expected sequences.shape:  (722881, 13)

In [None]:
# sample N% of the sequences to reduce the input dataset.
if True:
    mask = np.random.choice([False, True], len(all_sequences), p=[0.50, 0.50])
    sequences = all_sequences[mask].copy()
else:
    sequences = all_sequences.copy()
    print("\nsequences.shape: ",sequences.shape)

In [None]:
# Splits the sequences into predictors and labels
# tf.keras.utils.to_categorical
# import tensorflow as tf
# from tensorflow.keras.utils import to_categorical
from keras.utils.np_utils import to_categorical
from tensorflow import keras

In [None]:
to_categorical([0, 1, 2, 3], num_classes=4)

In [None]:
# create the predictors and labels for the classificaton task.
predictors  = sequences[:,:-1]
label       = sequences[:,-1]

print("predictors.shape", predictors.shape)
print("label.shape", label.shape)

# The to_categorical Keras function transforms the vocab_size vector of labels into a one hot encoded matrix of dimension (n, vocab_size)
label_cat       = to_categorical(label, num_classes=vocab_size)

print("label_cat.shape", label_cat.shape)

## The model

The data is now ready to be used to fit a neural network.

### Define a simple sequential model with an embedding layer, LSTM(s), and a dense layer with softmax activation. 

Feel free to experiment with dropouts and different optimizers. Here, we would be focusing on Keras to perform language modeling in this milestone.

In [None]:
# !pip install tensorflow==2.10.1

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import RMSprop

In [None]:
'''
Define model
an embedding dimension (32, 64, ...), 
2 LSTM layers 
followed by a dense layer with softmax activation
the optimizer is RMSprop with a learning rate of 0.01
'''
embedding_dimension = 64
model = Sequential()
model.add(
    Embedding(vocab_size,
        embedding_dimension,
        input_length=SEQUENCE_LEN -1)
    )
model.add(LSTM(128, return_sequences = True))
model.add(LSTM(64))
model.add(Dense(vocab_size, activation='softmax'))
optimizer = RMSprop(lr=0.01)

model.compile(loss='categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'])

print(model.summary())

### Specify the number of epochs, the batch size, and other fitting parameters.

In [None]:
batch_size = 256
epochs = 4
verbose = 1

### Fit the network.

In [None]:
'''
Model Fitting!
'''
model.fit(predictors, label_cat, batch_size = batch_size, epochs=epochs, verbose=verbose)


In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

## Assessing the results

### Write a function that generates text

In [None]:
# helper function to sample an index from a probability array
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [None]:
def generate_text(nmax, text, temperature):
    n = 0
    tokens = tokenizer.tokenize(text)
    while (len(tokens) < nmax) :
        n +=1
        # only takes known words into account
        tokens_idx = [ vocab.index(word) if word in vocab else vocab.index('UNK') for word in tokens  ]
        # print(tokens_idx)
        tokens_list = pad_sequences([tokens_idx], maxlen=SEQUENCE_LEN-1, padding='pre')
        probas = model.predict(tokens_list, verbose=0)[0]
        next_word_idx = sample(probas, temperature = temperature)
        next_word = vocab[next_word_idx]
        # print(next_word_idx, next_word)

        # next_word = np.random.choice(vocab, p = probas)
        if next_word != '?':
            print(next_word, probas[vocab.index(next_word)]  )
            text += ' ' + next_word
        # print(text)
        tokens = tokenizer.tokenize(text)
        if n> 200:
            break;
    return text

### Generate some text and take note of the following:
- Token repetitions
- Missing punctuation
- Other anomalies

In [None]:
generate_text(15, 'a random variable', 3)

### Write a function that calculates the perplexity of a sentence and apply it to a subset of sentences to evaluate the model.

In [None]:
SEQUENCE_WINDOW = 1

 # and define the perplexity for a sentence

def perplexity(sentence):
    # tokenize
    tokens = tokenizer.tokenize(sentence.lower())
    N = len(tokens)
    # find the indexes of the tokens from the vocabulary
    tokens_idx = [ vocab.index(word) if word in vocab else vocab.index('UNK') for word in tokens  ]
    # generate a N x SEQUENCE_LEN array of padded sequences 
    sequences = generate_sequences(tokens_idx)
    predictors  = sequences[:,:-1]
    label       = sequences[:,-1]
    # the probabilities of all the words in the vocab given each padded sequence
    probas = model.predict(predictors, verbose=0)
    # add the log of the probability of the label given the padded sequence
    logprob = 0
    for k in range(N):
        p = probas[k,label[k]]
        logprob += np.log( p  )    
    return np.exp(- logprob / N), logprob

In [None]:
sentence = "In a fixed-effects model only time-varying variables can be used."
print(sentence, perplexity(sentence))

sentence = "I know a pretty little place in Southern California, down San Diego way."
print(sentence, perplexity(sentence))

sentence = "This that is noon but yes apple whatever did regression variable"
print(sentence, perplexity(sentence))

### Define a validation set, such as 1,000 titles.

In [None]:
# Validation set
df_valid = df_full[(df_full.category == 'title') & (df_full.n_tokens > 10)].sample(100, random_state = 88).reset_index(drop = True)
print("df_valid",df_valid.shape)
print(df_valid.head(2))

In [None]:
def corpus_perplexity(corpus):
    # start by calculating the total number of tokens in the corpus
    all_sentences = ' '.join(corpus)
    all_tokens =  tokenizer.tokenize(all_sentences.lower())
    N = len(all_tokens)
    logproba = 0
    perps = []
    for sentence in corpus:
        pp, logp = perplexity(sentence)
        logproba += logp
        perps.append(pp)
        print ("{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}\t{}".format(pp, np.mean(perps), logp, logproba, np.exp( - logproba / (N  )), sentence  ))

    return np.exp( - logproba / (N)), perps

### Transform that validation set into sequences of tokens using the training vocabulary.

In [None]:
# Calculate Perplexity score on the validation set
corpus = df_valid.tokens.values
perplexity_score, scores = corpus_perplexity(corpus)
print(" Corpus perplexity: {:.2f}".format(perplexity_score ))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import math

In [None]:
plt.title("perplexity scores of sentences - histogram")
plt.hist([sc for sc in scores if sc < 5000], bins=30);
plt.xlabel('perplexity')
plt.ylabel('# of sentences')
plt.grid(alpha = 0.3)

### Tune the neural net and the parameters of the preprocessing phase to improve the model’s perplexity score.

# M4 Character-based Language Model with AllenNLP

## Objective

In this milestone, we will switch to character-based language models. We will implement the same deep learning multinomial classification approach that we completed in the previous milestone.

The goal remains to predict the next token given a preceding sequence of tokens. However, by using characters as tokens, instead of words, we solve two problems:

- There are no more out-of-vocabulary (OOV) tokens since all the characters are known in advance.

- The total number of classes to predict is reduced to a few dozen characters instead of thousands of different words. (This is true for alphabetic-based scripts such as Latin, Arabic, or Cyrillic but not in the case of logographic scripts used in Mandarin, Korean, or Japanese.)

To build a character-based language model on our domain-specific corpus, we will use the AllenNLP framework. AllenNLP is a state-of-the-art NLP framework created by the Allen Institute for AI. Its generic approach allows us to work on a wide range of NLP problems. AllenNLP is based on PyTorch.

Experimenting with character-based language models underlines its differences compared to word-based models in terms of the implementation process and the resulting outputs (generated texts and perplexity scores). In particular, switching from words to characters as the target multiclass has two main advantages:

- The number of target classes is drastically reduced from thousands of tokens to less than a hundred characters.

- All characters are known in advance.

The abstraction level of the AllenNLP framework makes it particularly well suited to handle all sorts of NLP tasks (POS, NER, and so on). And the investment required to learn the framework is well worth it.

In [28]:
import re
from typing import Dict, List, Tuple, Set

import pandas as pd
import numpy as np
import torch
import torch.optim as optim
from collections import Counter
from torch.nn import LSTM, Linear
from allennlp.common.file_utils import cached_path
from allennlp.common.util import START_SYMBOL, END_SYMBOL
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
from allennlp.data.data_loaders import SimpleDataLoader
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, CharacterTokenizer
from allennlp.data.vocabulary import Vocabulary, DEFAULT_PADDING_TOKEN
from allennlp.models import Model
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.trainer import Trainer
from allennlp.training.gradient_descent_trainer import GradientDescentTrainer

In [29]:
!cuda

True


## 1. Explore and analyze the set of unique characters present in the dataset.

In [30]:
df_full = pd.read_csv('../data/stackexchange_812k.tokenized.csv').sample(frac=1).reset_index(drop = True)

In [31]:
def split_text(text):
    return [char for char in text]

In [5]:
set(split_text('well'))

{'e', 'l', 'w'}

In [6]:
#Concatenate all the original texts from the dataset and list the unique 
text = ''.join(df_full.text.values).lower()
print(len(text))
print(type(text))
print(text[:100])
# text.split()[:100]

239219650
<class 'str'>
confusion about pooling layer, is it trainable or not?is your case somehow different from dozens of 


In [32]:
# all_characters = np.unique(np.array(split_text(text)))

In [33]:
all_characters = [s for s in text]
unique_characters = np.unique(all_characters) 
print(unique_characters)

['\t' '\x0b' '\x0c' ' ' '!' "'" ',' '-' '.' '?' '\\' 'a' 'b' 'c' 'd' 'e'
 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w'
 'x' 'y' 'z' '\x7f' '\xa0' '¡' '¢' '£' '¥' '¦' '§' '¨' '©' 'ª' '«' '¬'
 '\xad' '®' '¯' '°' '±' '²' '³' '´' 'µ' '¶' '·' '¹' 'º' '»' '¼' '½' '¾'
 '¿' '×' 'ß' 'à' 'á' 'â' 'ã' 'ä' 'å' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'ì' 'í' 'î'
 'ï' 'ð' 'ñ' 'ò' 'ó' 'ô' 'õ' 'ö' '÷' 'ø' 'ù' 'ú' 'û' 'ü' 'ý' 'ā' 'ă' 'ą'
 'ć' 'č' 'ē' 'ĕ' 'ė' 'ę' 'ğ' 'ī' 'ı' 'ĺ' 'ļ' 'ł' 'ń' 'ō' 'ő' 'œ' 'ř' 'ś'
 'ş' 'š' 'ū' 'ů' 'ŷ' 'ź' 'ž' 'ƒ' 'ơ' 'ƴ' 'ț' 'ȳ' 'ɑ' 'ə' 'ɛ' 'ɣ' 'ɪ' 'ɵ'
 'ʃ' 'ʊ' 'ʒ' 'ʼ' 'ˆ' 'ˇ' 'ˈ' 'ˉ' 'ˌ' '˙' '˚' '˜' '̀' '́' '̂' '̃' '̄' '̅'
 '̇' '̈' '̧' '̶' '̸' '͝' ';' '΄' 'ά' 'έ' 'ή' 'ί' 'α' 'β' 'γ' 'δ' 'ε' 'ζ'
 'η' 'θ' 'ι' 'κ' 'λ' 'μ' 'ν' 'ξ' 'ο' 'π' 'ρ' 'ς' 'σ' 'τ' 'υ' 'φ' 'χ' 'ψ'
 'ω' 'ό' 'ύ' 'ώ' 'ϐ' 'ϕ' 'ϵ' 'а' 'б' 'в' 'г' 'д' 'е' 'ж' 'з' 'и' 'й' 'к'
 'л' 'м' 'н' 'о' 'п' 'р' 'с' 'т' 'у' 'ф' 'х' 'ц' 'ч' 'ш' 'щ' 'ъ' 'ы' 'ь'
 'э' 'ю' 'я' 'ё' 'є' 'א' 'ב' 'ד' 'ה' 'ו' 'ח' 'י' 'כ' '

In [34]:
# Split string into list of characters and use Counter from collections library to find most common characters

char_count = Counter(all_characters)

# Limit the count of characters to MAX_VOCAB_SIZE using char_count.most_common
char_count.most_common(10)

[(' ', 41838479),
 ('e', 22649492),
 ('t', 18848027),
 ('a', 15614041),
 ('i', 15307965),
 ('o', 14812551),
 ('s', 13076523),
 ('n', 12756881),
 ('r', 11311575),
 ('h', 8004404)]

In [36]:
# limit the allowed characters to MAX_VOCAB_SIZE
MAX_VOCAB_SIZE = 40
valid_characters = [t[0] for t in  char_count.most_common(MAX_VOCAB_SIZE)]
valid_characters.sort()
print(valid_characters)

[' ', '!', "'", ',', '-', '.', '?', '\\', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'β', '–', '—', '‘', '’', '═']


## 2. Subsample the dataset to take into account the titles only.

In [37]:
POSTS_TYPE = 'title'
DF_SAMPLE_COUNT = 10000

In [38]:
df_full.columns

Index(['post_id', 'parent_id', 'comment_id', 'text', 'category', 'tokens',
       'n_tokens'],
      dtype='object')

In [12]:
# subsample the original dataset

df = df_full[df_full.category == POSTS_TYPE].sample(DF_SAMPLE_COUNT).reset_index(drop=True)

print("df.shape: ", df.shape)

print(df.text.sample(2).values)

df.shape:  (10000, 7)
['Is gradient checking useless in high dimensional setting?'
 'Linear regression what does the F statistic, R squared and residual standard error tell us?']


## 3. Implement the character tokenization of the dataset and transform the tokens into AllenNLP instances.

In [39]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [40]:
tokenizer = CharacterTokenizer()

In [42]:
train_set = df.text.apply(lambda txt : tokenizer.tokenize(txt.lower())).values

In [44]:
train_set[:10]

array([list([d, o, e, s,  , s, p, e, a, r, m, a, n, ', s,  , i, n, d, i, c, a, t, e,  , a, g, r, e, e, m, e, n, t, ?]),
       list([c, r, o, s, s, -, v, a, l, i, d, a, t, i, o, n,  , e, r, r, o, r,  , a, p, p, r, o, x, i, m, a, t, i, o, n,  , i, n, c, o, n, s, i, s, t, e, n, t,  , w, i, t, h,  , t, h, e,  , t, e, s, t,  , s, e, t,  , e, r, r, o, r]),
       list([h, e, l, p,  , s, o, l, v, i, n, g,  , f, o, r,  , l, o, g,  , l, i, k, e, l, i, h, o, o, d]),
       list([c, o, r, r, e, l, a, t, i, o, n,  , e, s, t, i, m, a, t, i, o, n,  , o, n,  , h, a, l, f, -, n, o, r, m, a, l,  , d, i, s, t, r, i, b, u, t, i, o, n]),
       list([t, e, c, h, n, i, q, u, e, s,  , a, n, d,  , t, i, p, s,  , f, o, r,  , i, n, t, e, r, p, r, e, t, i, n, g,  , a,  , c, l, u, s, t, e, r,  , a, n, a, l, y, s, i, s]),
       list([t, e, s, t,  , w, h, e, t, h, e, r,  , d, i, f, f, e, r, e, n, c, e,  , i, n,  , p, r, o, p, o, r, t, i, o, n, s,  , d, i, f, f, e, r, s,  , f, r, o, m,  , a,  , n, o, n, -, z, e, 

In [45]:
# generate an Instance for each list of token. 
# The function takes a list of tokens and an indexer as input and returns an instance composed of the input and output tokens.
def tokens_to_instance(tokens: List[Token], token_indexers: Dict[str, TokenIndexer]):
    tokens = list(tokens)
    tokens.insert(0, Token(START_SYMBOL))
    tokens.append(Token(END_SYMBOL))

    input_field  = TextField(tokens[:-1], token_indexers)
    output_field = TextField(tokens[1:], token_indexers)
    return Instance({'input_tokens': input_field, 'output_tokens': output_field})        

In [46]:
token_indexers = {'tokens': SingleIdTokenIndexer()}
instances = [tokens_to_instance(tokens, token_indexers) for tokens in train_set]

In [47]:
token_counts = {char: 1 for char in valid_characters}
vocab = Vocabulary({'tokens': token_counts})

## 4. Design an RNN with AllenNLP that includes the following:
- An embedding of the tokens
- A seq2seq LSTM layer
- A feed-forward layer that outputs a probability distribution of the characters

In [48]:
EMBEDDING_SIZE = 32
HIDDEN_SIZE = 256
BATCH_SIZE = 128

In [49]:
class RNNLanguageModel(Model):
    def __init__(self,
                 embedder: TextFieldEmbedder,
                 hidden_size: int,
                 max_len: int,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)

        self.embedder = embedder

        # initialize a Seq2Seq encoder, LSTM
        self.rnn = PytorchSeq2SeqWrapper(LSTM(EMBEDDING_SIZE, HIDDEN_SIZE, batch_first=True))
        self.hidden2out = Linear(in_features=self.rnn.get_output_dim(), out_features=vocab.get_vocab_size('tokens'))
        self.hidden_size = hidden_size
        self.max_len = max_len
        
    def forward(self, input_tokens, output_tokens):
        '''
        This is the main process of the Model where the actual computation happens. 
        Each Instance is fed to the forward method. 
        It takes dicts of tensors as input, with same keys as the fields in your Instance (input_tokens, output_tokens)
        It outputs the results of predicted tokens and the evaluation metrics as a dictionary. 
        '''

        # code goes here
        embeddings = self.embedder(input_tokens)
        mask = get_text_field_mask(input_tokens)
        rnn_hidden = self.rnn(embeddings, mask)
        out_logits = self.hidden2out(rnn_hidden)
        loss = sequence_cross_entropy_with_logits(out_logits, output_tokens['tokens']['tokens'], mask)

        return {'loss': loss}

    def generate(self) -> Tuple[List[Token], torch.tensor]:

        # code goes here
        start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
        end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
        padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')
        
        log_likelihood = 0.
        words = []
        state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))
        
        word_idx = start_symbol_idx
        
        for i in range(self.max_len):
            tokens = torch.tensor([[word_idx]])

            embeddings = self.embedder({'tokens': {'tokens':tokens}})
            output, state = self.rnn._module(embeddings, state)
            output = self.hidden2out(output)

            log_prob = torch.log_softmax(output[0, 0], dim=0)

            dist = torch.exp(log_prob)

            word_idx = start_symbol_idx

            while word_idx in {start_symbol_idx, padding_symbol_idx}:
                word_idx = torch.multinomial(
                    dist, num_samples=1, replacement=False).item()

            log_likelihood += log_prob[word_idx]

            if word_idx == end_symbol_idx:
                break

            token = Token(text=self.vocab.get_token_from_index(word_idx, 'tokens'))
            words.append(token)
        
        return words, log_likelihood

## 5. Train the model.

In [50]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'), embedding_dim=EMBEDDING_SIZE)

embedder = BasicTextFieldEmbedder({"tokens": token_embedding})

model = RNNLanguageModel(embedder=embedder, hidden_size=HIDDEN_SIZE, max_len=80, vocab=vocab)
model = model.to(device)

# device = torch.cuda.current_device()

# model.to(device)

In [52]:
data_loader = SimpleDataLoader(instances,BATCH_SIZE , shuffle=True)
data_loader.index_with(vocab)
data_loader.set_target_device(device)
optimizer = optim.Adam(model.parameters(), lr=5.e-3)

In [54]:
trainer = GradientDescentTrainer(model=model,
                                data_loader=data_loader,
                                optimizer=optimizer,
                                num_epochs=40
                                # cuda_device=-1
                                )

trainer.train()

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

  0%|          | 0/79 [00:00<?, ?it/s]

{'best_epoch': 39,
 'peak_worker_0_memory_MB': 7252.85546875,
 'peak_gpu_0_memory_MB': 107.44140625,
 'training_duration': '0:04:08.826468',
 'epoch': 39,
 'training_loss': 0.8103542312791076,
 'training_worker_0_memory_MB': 7252.85546875,
 'training_gpu_0_memory_MB': 105.78369140625}

## 6. Evaluate the model by calculating the loss of some sentences and by generating text.

In [55]:
def predict(text: str, model: Model) -> float:
    tokenizer = CharacterTokenizer()
    tokens = tokenizer.tokenize(text)
    
    token_indexers = {'tokens': SingleIdTokenIndexer()}
    instance = tokens_to_instance(tokens, token_indexers)
    output = model.forward_on_instance(instance)
    print(output)

In [56]:
sentence = "In a fixed-effects model only time-varying variables can be used."
predict(sentence, model)

sentence = "I know a pretty little place in Southern California, down San Diego way."
predict(sentence, model)

sentence = "This that is noon but yes apple whatever did regression variable"
predict(sentence, model)

{'loss': 1.4806435}
{'loss': 4.4319882}
{'loss': 2.4859297}


Cannot resolve this error: Could not run 'aten::values' with arguments from the 'CPU' backend.

In [58]:
for _ in range(50):
    tokens, _ = model.cpu().generate()
    print(''.join(token.text for token in tokens))

how to add covariance the binomial logit method to friedman? a set of importance
phose binary correspored for optimal e-values after its because continuous measu
in last feature selection deskets for linear regression?-normal distributions ar
r-dead to approximate from a link packages with binomial probability estimator??
nominal model removed beta distribution? ligh? 'c bell-curve why? mcmclut to ove
hidden markov model when i testing? what model with zero number read? , are not 
data analysis of auding clusters makes theourned from treatment median specis tr
question simulation of an averaging normality? term order thing in a multiple gr
if which vs variation tees the coefficients of a single predictors incom-negativ
negative individual change methods to plot decision tree? is variable! majoding 
confused about frequency distribution? number of variables? archices? then formu
any good modeling . var x y x, y? linear model? of ensing at not process? for a 
simple rungbu'n survival res