# N-gram language modeling
Let's train n-gram language models that avoid zero probabilities and then sample (generate) text from them!

## Setup
Skip this section if you have already installed `nltk` and `scikit-learn`.

In [None]:
! pip install --user nltk
! pip install --user scikit-learn

Now select **Kernel > Restart Kernel** from the menu bar.

In [None]:
# Test importing
import nltk
import sklearn

# Train an n-gram language model on Reuters data that handles 0 probabilities

## Load news text from Reuters
Newswire stories from Reuters, published in 1987. Old news.

In [None]:
# Only need to run this once on your CRCD account
import nltk
nltk.download('reuters')

In [None]:
import nltk
from nltk.corpus import reuters

## Preprocess the text into ngrams

In [None]:
sents = reuters.sents() # Load all sentences in the corpus (each sentence is a list of tokens)

# Lowercase the data
sents = [[word.lower() for word in sent] for sent in sents]
sents[0][:10]

In [None]:
# Randomly split into training and test sets to evaluate perplexity
from sklearn.model_selection import train_test_split

random_seed = # FILL IN an integer
train, test = train_test_split(sents, test_size=0.1, random_state=random_seed)
print(len(train))
print(len(test))
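
To see what the split does before you pick your own seed, here is a minimal sketch on made-up data. The toy sentences and the seed value of 42 are assumptions for illustration only; they are not the answer to the FILL IN above.

```python
from sklearn.model_selection import train_test_split

# Ten made-up "sentences" (lists of tokens), for illustration only
toy_sents = [["sentence", str(i)] for i in range(10)]

# Hypothetical seed: any fixed integer makes the split repeatable
toy_seed = 42
toy_train, toy_test = train_test_split(toy_sents, test_size=0.1, random_state=toy_seed)
toy_train_again, toy_test_again = train_test_split(toy_sents, test_size=0.1, random_state=toy_seed)

print(len(toy_train), len(toy_test))   # sizes of the two splits
print(toy_test == toy_test_again)      # same seed, same held-out sentences
```

Fixing the seed matters here because perplexity is measured on the held-out sentences, so a different split would give a slightly different number.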

In [None]:
# Prepare the input for n-gram language model training with NLTK (sequences of n-grams)
from nltk.lm.preprocessing import padded_everygram_pipeline

n = # FILL IN what order of ngram (2 for bigram, 3 for trigram, etc.)
processed_train, vocab = padded_everygram_pipeline(n, train)
# Unfortunately we can't inspect these directly: both are generators that are evaluated ("filled in") lazily
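
Since the pipeline output is lazy, one way to see what it produces is to run the same two steps by hand on a single sentence. This is a sketch on a made-up two-word sentence with a bigram order of 2 (both are assumptions for illustration). It uses `pad_both_ends` and `everygrams`, which are the NLTK helpers the pipeline is built from.

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams

toy_sentence = ["old", "news"]  # made-up example sentence
toy_order = 2                   # bigram model, chosen only for illustration

# Step 1: add sentence-boundary symbols (order - 1 of them on each side)
padded = list(pad_both_ends(toy_sentence, n=toy_order))

# Step 2: collect every n-gram from length 1 up to the model order
toy_everygrams = list(everygrams(padded, max_len=toy_order))

print(padded)
print(toy_everygrams)
```

The model is trained on all orders up to `n` so that it has unigram counts as well as higher-order counts available.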

## Train an n-gram language model with Lidstone smoothing
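To avoid assigning zero probability to n-grams that never occur in the training data (including those containing `<UNK>`, the token NLTK substitutes for unseen words), we will add a small pseudocount of $\gamma = 0.001$ to every count. This is called additive or "Lidstone" smoothing, and the pseudocount is usually chosen so that $0 < \gamma < 1$. At $\gamma = 1$, this is the same as Laplace (add-one) smoothing.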
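
To make the pseudocount concrete, the smoothed estimate for a word $w$ after a context $c$ is $P(w \mid c) = \frac{\text{count}(c, w) + \gamma}{\text{count}(c) + \gamma |V|}$, where $|V|$ is the vocabulary size. The sketch below checks that formula by hand against NLTK on a made-up two-sentence corpus. The toy sentences and variable names are assumptions for illustration.

```python
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # made-up corpus
gamma = 0.001

toy_ngrams, toy_vocab = padded_everygram_pipeline(2, toy_sents)
toy_lm = Lidstone(gamma=gamma, order=2)
toy_lm.fit(toy_ngrams, toy_vocab)

vocab_size = len(toy_lm.vocab)                  # includes <s>, </s> and <UNK>
count_the_cat = toy_lm.counts[["the"]]["cat"]   # times "cat" follows "the"
count_the = toy_lm.counts[["the"]].N()          # times anything follows "the"

# Lidstone estimate computed by hand
by_hand = (count_the_cat + gamma) / (count_the + gamma * vocab_size)

print(by_hand, toy_lm.score("cat", ["the"]))    # the two values should agree
print(toy_lm.score("dog", ["cat"]))             # unseen bigram: small but not zero
```

A larger `gamma` moves more probability mass from seen n-grams to unseen ones.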

In [None]:
from nltk.lm import Lidstone

# Initialize and fit an ngram language model
lm = Lidstone(gamma=0.001, order=n) # you can also play around with changing gamma
lm.fit(processed_train, vocab)

In [None]:
# Test how it handles unseen words
weird_word = 'weirdo' # FILL IN a rare word that likely does not occur in 1980s news
print(lm.vocab.lookup(weird_word)) # How is it treating that word?
print(lm.score(weird_word)) # What probability is it giving that word in the model?

## Evaluate perplexity on the test set

In [None]:
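# Get the test set into the format lm.perplexity expects: one flat list of n-gram tuples

processed_test_generator, _ = padded_everygram_pipeline(n, test) # "_" avoids overwriting the training vocab
# Flatten across sentences and keep only the full n-grams of order n
processed_test = [ngram for sent in processed_test_generator for ngram in sent if len(ngram) == n]
len(processed_test)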

In [None]:
lm.perplexity(processed_test)

Look at that! A real value instead of `inf`.
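
To see where the `inf` comes from, here is a sketch that compares an unsmoothed model with a smoothed one on a made-up corpus. It assumes the `inf` came from an unsmoothed model such as NLTK's `MLE`. The toy data and the `fit_toy` helper are hypothetical. Perplexity is 2 raised to the average negative log2 probability of the test n-grams, so a single zero-probability n-gram makes the whole value infinite.

```python
import math
from nltk.lm import MLE, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # made-up training data
toy_test_bigrams = [("the", "cat"), ("cat", "dog")]         # ("cat", "dog") is unseen

def fit_toy(model):
    # Hypothetical helper: the pipeline returns one-shot generators,
    # so it has to be rebuilt for every model we fit
    ngrams, vocab = padded_everygram_pipeline(2, toy_sents)
    model.fit(ngrams, vocab)
    return model

toy_mle = fit_toy(MLE(2))
toy_lidstone = fit_toy(Lidstone(gamma=0.001, order=2))

# Perplexity by hand for the smoothed model
log_probs = [math.log2(toy_lidstone.score(w, [c])) for c, w in toy_test_bigrams]
by_hand = 2 ** (-sum(log_probs) / len(log_probs))

print(toy_mle.perplexity(toy_test_bigrams))       # unsmoothed model
print(toy_lidstone.perplexity(toy_test_bigrams))  # smoothed model
print(by_hand)                                    # should match the line above
```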

## Sample (generate) from this trained language model

In [None]:
# Run this cell as many times as you like to take new samples (generate new phrases)
# Record or copy the cell if there are any good ones you want to save and report back to the class

num_tokens =  # FILL IN with how many tokens you want to generate
prompt = [] # FILL IN with a list of tokens as a prompt (prior context). Or pass an empty list to just start generating
generated_toks = lm.generate(num_words=num_tokens, text_seed=prompt)
' '.join(prompt + generated_toks)
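
Each run gives a different sample because `generate` draws the next token at random from the model's distribution for the current context. To reproduce a sample you liked, you can pass the optional `random_seed` argument. The sketch below uses a made-up corpus, prompt and seed (all assumptions for illustration), and also strips the `<s>` and `</s>` padding tokens so the output reads more naturally.

```python
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # made-up corpus
toy_ngrams, toy_vocab = padded_everygram_pipeline(2, toy_sents)
toy_lm = Lidstone(gamma=0.001, order=2)
toy_lm.fit(toy_ngrams, toy_vocab)

toy_prompt = ["the"]  # hypothetical prompt
first = toy_lm.generate(num_words=5, text_seed=toy_prompt, random_seed=7)
second = toy_lm.generate(num_words=5, text_seed=toy_prompt, random_seed=7)

# Drop the sentence-boundary padding tokens before printing
readable = [tok for tok in toy_prompt + first if tok not in ("<s>", "</s>")]
print(" ".join(readable))
print(first == second)  # same seed, same sample
```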

# Train an n-gram language model from a different data source
Alright, let's train an n-gram language model from different data sources. You can choose from the following options, or try loading some other text data if you want!
* Airbnb descriptions
* Shakespeare plays

Whichever you choose, you can skip to the corresponding part of the notebook.

## Airbnb descriptions

In [None]:
# Load data
import pandas as pd

airbnb_filepath = '' # FILL IN the filepath to the CSV file with the Airbnb listings from session 2
# If you don't have any Airbnb data, open and run session2_text_normalization.ipynb
listings = pd.read_csv(airbnb_filepath) # reads CSV file into a pandas dataframe
len(listings)

In [None]:
# Preprocess description column
from nltk import word_tokenize
from tqdm.auto import tqdm # for progress bar
tqdm.pandas()

def preprocess_airbnb(text):
    stray_html = ' ' # FILL IN with HTML tag that we removed from this data earlier. Or if you forget this, just leave it as a blank
    processed = text.replace(stray_html, ' ').lower()
    return word_tokenize(processed)

processed_airbnb = listings.description.dropna().progress_map(preprocess_airbnb).tolist()
processed_airbnb[0]

In [None]:
# Prepare input for NLTK
from nltk.lm.preprocessing import padded_everygram_pipeline

n = # FILL IN what order of ngram
listings_input, vocab = padded_everygram_pipeline(n, processed_airbnb) 

In [None]:
# Train n-gram language model
from nltk.lm import Lidstone

# Initialize and fit an ngram language model
lm = Lidstone(gamma=0.001, order=n) # You can also play around with changing the gamma value
lm.fit(listings_input, vocab)

In [None]:
# Run this cell as many times as you like to take new samples (generate new phrases)
# Record or copy the cell if there are any good ones you want to save and report back to the class

num_tokens =  # FILL IN with how many tokens you want to generate
prompt = [] # FILL IN with a list of tokens as a prompt (prior context). Or pass an empty list to just start generating
generated_toks = lm.generate(num_words=num_tokens, text_seed=prompt)
' '.join(prompt + generated_toks)

## Shakespeare plays

In [None]:
# Load Shakespeare plays
import pandas as pd # needed here in case you skipped the Airbnb section

shakespeare = pd.read_csv('data/shakespeare_plays.csv', delimiter=';', header=None, names=['line_id', 'play', 'something', 'something_else', 'character', 'text'])
shakespeare.info()
shakespeare.head()

In [None]:
# Preprocess Shakespeare play lines
from nltk import word_tokenize
from tqdm.auto import tqdm # for progress bar
tqdm.pandas()

def preprocess_shakespeare(text):
    processed = text.lower()
    return word_tokenize(processed)

processed = shakespeare.text.dropna().progress_map(preprocess_shakespeare).tolist()

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

n = # FILL IN what order of ngram
shakespeare_input, vocab = padded_everygram_pipeline(n, processed) 

In [None]:
from nltk.lm import Lidstone

# Initialize and fit an ngram language model
lm = Lidstone(gamma=0.001, order=n)
lm.fit(shakespeare_input, vocab)

In [None]:
# Run this cell as many times as you like to take new samples (generate new phrases)
# Record or copy the cell if there are any good ones you want to save and report back to the class

num_tokens =  # FILL IN with how many tokens you want to generate
prompt = [] # FILL IN with a list of tokens as a prompt (prior context). Or pass an empty list to just start generating
generated_toks = lm.generate(num_words=num_tokens, text_seed=prompt)
' '.join(prompt + generated_toks)