<h1><center>HW4 Language Modeling (LM)</center></h1>

In this homework, you will first implement a simple trigram language model on a dataset of news headlines and learn the basic concepts of Markov modeling, word sampling, and perplexity.

Then things get more fun and open-ended. You will be shown a simple word-based RNN LM. Understand how it works, and then modify it as you wish. Things you can try include, but are not limited to:

1. Word-based RNN model with subword embeddings
2. Character-based RNN model
3. Try a different model architecture
4. Try a different training corpus
5. Personalized LM

**You are given the following files**:
- `Language_Modeling.ipynb`: Notebook file with starter code
- `headlines.train`: Training set to train your model
- `headlines.dev`: Development set used to report your model’s performance
- `glove_300d.csv`: GloVe embeddings truncated to the vocabulary of the training data
- `../utils/`: folder containing all utility code for the series of homeworks

**Deliverables**:
- PDF or HTML export of the notebook
- A report on your own model, if you built one

### ======================== Coding starts here ====================

# Setup

##  Load functions

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os, sys, random
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path

from utils.hw4 import (load_data, load_data_char, gen_vocab, START, END, UNK, 
                       load_embedding)
from utils.general import sigmoid, tanh, show_keras_model

## Load data 

In [None]:
# The input is truncated for fast iteration
# Remember to use the full data set for your final model training
# It may take some time
headlines_train = load_data("headlines.train")[:10000]
headlines_dev = load_data("headlines.dev")[:100]

# Before we begin, let's look at what some of the headlines look like. 
# Run the following code block as many times as you want to get a sense 
# of what kind of headlines we hope to generate.
for headline in random.sample(headlines_train, 5):
    print(' '.join([START] + list(headline) + [END]))

In [None]:
# Calculate the vocab list and the embedding 
# It might be helpful to remove low-frequency words, so the model learns how to
# treat unseen vocabulary
vocab, re_vocab = gen_vocab(headlines_train, 4)
sent_len = max([len(s) for s in headlines_train]) + 1

print("Size of vocab: ", len(vocab))
print("Longest sentence length: ", sent_len)

# Load the embedding. A trick is used to fill in the missing vocab;
# you can look into the source file to see what it actually does.
# This embedding file is truncated to the vocab used in this dataset.
# If you train your own model with your own data, remember to download
# the original embedding here: https://nlp.stanford.edu/projects/glove/
glove = load_embedding('glove_300d.csv', vocab=vocab)

glove.T.head()

In [None]:
# Transform the DF to np array
glove = glove.values
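
The "trick" for filling missing vocabulary lives in `load_embedding`, whose source is not shown here. One common approach, and the one the later comments describe (random values for missing words), can be sketched as follows. The helper `fill_missing_rows` and the toy data are hypothetical and are not the actual `utils.hw4` code.

```python
import numpy as np

def fill_missing_rows(vectors, vocab, dim, scale=0.1, seed=0):
    """Return a (len(vocab), dim) matrix; words without a vector get random values.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.RandomState(seed)
    matrix = np.zeros((len(vocab), dim))
    for idx, word in enumerate(vocab):
        if word in vectors:
            matrix[idx] = vectors[word]                        # known word: copy its vector
        else:
            matrix[idx] = rng.normal(scale=scale, size=dim)    # unknown word: small random vector
    return matrix

toy_vectors = {"china": [0.1, 0.2, 0.3], "trade": [0.4, 0.5, 0.6]}
toy_vocab = ["<s>", "china", "trade", "</s>"]
toy_matrix = fill_missing_rows(toy_vectors, toy_vocab, dim=3)
print(toy_matrix.shape)   # (4, 3)
```

Row `i` of the resulting matrix is the vector for label `i`, which is the layout `glove[label]` relies on in the utility functions.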

## Utility functions

In [None]:
def to_label(token):
    """
    Convert a token to its numerical label. If the token is not in
    the vocab, return the label of UNK.
    input: 
        token: str
        
    output:
        int
    """
    return re_vocab.get(token, re_vocab[UNK])

def to_embedding(X):
    """
    For the 2-dimensional input X filled with vocabulary labels, 
    return an np.array of their embeddings
    
    input:
        X: np.array(n_sample, sent_len)
        
    return:
        embedding: np.array(n_sample, sent_len, embedding_dim)
    """
    embedding = np.zeros((len(X), len(X[0]), glove.shape[1]))
    
    for i in range(len(X)):
        for j in range(len(X[0])):
            embedding[i,j,:] = glove[X[i][j]] 
    
    return embedding

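def sample_with_weight(prob, avoid_UNK=True):
    """
    For a given probability distribution, return a random int sampled
    according to that distribution. 
    
    input:
        prob: np.array of float probabilities
        avoid_UNK: boolean, if UNK should be excluded

    output:
        int, the sampled label
    """
    # Work on a float64 copy so the caller's array is not modified and the
    # normalized probabilities sum to 1 within np.random.choice's tolerance
    prob = np.array(prob, dtype=np.float64)
    unk_idx = re_vocab[UNK]
    
    if avoid_UNK: 
        prob[unk_idx] = 0 # Make sure we do not use UNK in the generated text
    
    # If no probability mass is left, fall back to a uniform distribution
    # over every label except index 0
    if prob.sum() <= 0:
        prob[1:] = 1.0
        if avoid_UNK:
            prob[unk_idx] = 0
        
    return np.random.choice(len(prob), p=prob/prob.sum())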
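
`to_embedding` fills its output one cell at a time. Because `glove` is a NumPy array indexed by label, the same lookup can also be written with integer-array indexing. A small sketch on a toy matrix (the names `toy_glove` and `to_embedding_vectorized` are made up for this example):

```python
import numpy as np

toy_glove = np.arange(12, dtype=float).reshape(4, 3)  # 4 labels, 3-dim vectors

def to_embedding_vectorized(X, matrix):
    """Look up every label in X at once; the result has shape X.shape + (dim,)."""
    return matrix[np.asarray(X, dtype=int)]

labels = [[0, 2], [3, 1]]
emb = to_embedding_vectorized(labels, toy_glove)
print(emb.shape)   # (2, 2, 3)
print(emb[0, 1])   # [6. 7. 8.]
```

For large batches this avoids the Python double loop, which may matter once you train on the full data set.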

# Tri-gram (second order) Markov Model

## Build FNN Tri-Gram model

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Reshape

n_gram = 2  # number of previous words fed to the model as context

# For simplicity, we feed the word embeddings to the model directly, so there is
# no need to add an Embedding layer at the beginning. For possibly better performance
# you can add an Embedding layer, ideally using the GloVe embedding matrix as its
# initial value.
# This is also useful because the embeddings of missing vocabulary were filled with
# random values: allowing the embedding matrix to be updated during training can improve
# the performance for these words as well. Be prepared for slower training, though.
FNN_model = Sequential()
FNN_model.add(Reshape(target_shape=(n_gram * glove.shape[1],), 
                      input_shape=(n_gram, glove.shape[1],)))
FNN_model.add(Dense(100, activation="relu", name="Dense-1"))
FNN_model.add(Dense(len(vocab), activation="softmax", name="Dense-2"))

FNN_model.summary()
show_keras_model(FNN_model)
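
To make the shapes concrete, the forward pass of this network can be sketched in NumPy. The sketch uses random, untrained weights and toy sizes (all names below are illustrative and not part of the notebook), so it only shows how two word vectors become one probability distribution over the vocabulary.

```python
import numpy as np

rng = np.random.RandomState(0)
n_ctx, dim, hidden, vocab_size = 2, 5, 4, 7   # toy sizes

W1, b1 = rng.normal(size=(n_ctx * dim, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, vocab_size)) * 0.1, np.zeros(vocab_size)

def fnn_forward(x):
    """x: (batch, n_ctx, dim) -> probabilities of shape (batch, vocab_size)."""
    flat = x.reshape(len(x), -1)                          # Reshape layer: concatenate the context vectors
    h = np.maximum(flat @ W1 + b1, 0.0)                   # Dense layer with ReLU
    logits = h @ W2 + b2                                  # output Dense layer
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)               # softmax

probs = fnn_forward(rng.normal(size=(3, n_ctx, dim)))
print(probs.shape)         # (3, 7)
print(probs.sum(axis=1))   # each row sums to 1
```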

## Training data generator

In [None]:
from keras.utils import to_categorical
import random

def gen_sample_FNN(data, batch_size=1000, one_hot=True):
    """
    For training the model, we need to shift the data by -1 to produce
    the labels, e.g.
    ["word1", "word2", "word3", "word4"] --> 
    X: [[START, START],
        [START, 1],
        [1, 2],
        [2, 3],
        [3, 4]]
    Y: [1, 2, 3, 4, END] if one_hot is False; the labels are converted to 
        one-hot vectors if one_hot is True
    
    inputs:
        data: list of list of strings
        batch_size: int
        one_hot: boolean
        
    outputs:
        X: np.array(batch_size, n_gram, embedding_dim)
        Y: np.array(batch_size, ) or np.array(batch_size, vocab_size)
        
    batch_size controls the number of samples in each batch;
    set batch_size = -1 to generate all samples in a single batch.
    Note that data is shuffled in place.
    """
    if batch_size == -1:
        batch_size = sum([len(s) + 1 for s in data])
        
    while True:
        # Use shuffle so the order in each epoch is different
        random.shuffle(data)

        X, Y = [], []
        for d in data:
            encodes = [re_vocab[START], re_vocab[START]] +\
                      [to_label(t) for t in d] +\
                      [re_vocab[END]]
            for i in range(len(encodes) - 2):
                X.append([encodes[i], encodes[i+1]])
                Y.append(encodes[i+2])

                if len(X) >= batch_size:
                    X = to_embedding(X)
                    Y = np.array(Y)
                    
                    if one_hot:
                        Y = to_categorical(Y, num_classes=len(re_vocab))
                        
                    yield X, Y
                    X, Y = [], []
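
The shifting described in the docstring can be seen on a toy sentence. This sketch uses plain strings instead of labels and embeddings, with placeholder `"<s>"` / `"</s>"` markers standing in for `START` / `END` (their actual values are defined in `utils.hw4`).

```python
def trigram_pairs(tokens, start="<s>", end="</s>"):
    """Return (context, target) pairs: the two previous tokens predict the next one."""
    padded = [start, start] + list(tokens) + [end]
    return [((padded[i], padded[i + 1]), padded[i + 2])
            for i in range(len(padded) - 2)]

for context, target in trigram_pairs(["stocks", "rally", "today"]):
    print(context, "->", target)
# ('<s>', '<s>') -> stocks
# ('<s>', 'stocks') -> rally
# ('stocks', 'rally') -> today
# ('rally', 'today') -> </s>
```

A sentence of `n` tokens yields `n + 1` pairs, which is why the sample count is computed as `len(s) + 1` per sentence.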

## Generate text

In [None]:
def generate_text_FNN(model, max_len=sent_len-1, seed=None):
    """
    For a given FNN model, generate text. If seed is not provided,
    use START as initial seed.
    
    inputs:
        model: FNN model
        max_len: int, maximum length of the sentence
        seed: str, the seed word used to generate the text
    """
    
    result = []
    
    """
    Add your code here
    
    hints:
    1. It's a trigram model, so what should your initial seed look like?
    2. The prediction for each state should return a list of probabilities; use the 
       `sample_with_weight` function to help you sample the next word.
    3. When the word END is sampled, you need to stop the sentence. Also use max_len
       to force the sentence to end, so the program does not run forever.
    """
        
    return ' '.join(result)
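
The control flow behind the hints can be illustrated without Keras. The sketch below swaps the neural network for a hand-written lookup table (`toy_next_probs`, hypothetical) so that only the loop itself is shown: keep the last two words as context, sample the next word, and stop at the end marker or at `max_len`.

```python
import random

START, END = "<s>", "</s>"   # placeholder markers for this example only

# Hypothetical table: (previous two words) -> {next word: probability}
toy_next_probs = {
    (START, START): {"stocks": 0.5, "china": 0.5},
    (START, "stocks"): {"rally": 1.0},
    (START, "china"): {"trade": 1.0},
    ("stocks", "rally"): {END: 1.0},
    ("china", "trade"): {"talks": 0.5, END: 0.5},
    ("trade", "talks"): {END: 1.0},
}

def generate(max_len=10):
    context, result = (START, START), []
    for _ in range(max_len):
        dist = toy_next_probs[context]
        # Sample one word according to its probability
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == END:
            break
        result.append(word)
        context = (context[1], word)   # slide the two-word window
    return " ".join(result)

random.seed(0)
print(generate())
```

In the notebook, the table lookup would be replaced by a model prediction on the embedded context, and `random.choices` by `sample_with_weight`.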

In [None]:
# Before we train the model, let's first check if the text generation function
# works as expected. Don't worry if the sentence doesn't make any sense.
# We haven't trained the model yet!
generate_text_FNN(FNN_model, seed="china")

## Calculate Perplexity

In [None]:
def calculate_perplexity_FNN(model, X, y):
    """
    For a given FNN model and test data, calculate the perplexity.
    The definition of perplexity is:
    
    perplexity = exp(-(1/N) * sum_i log(P_i))
    
    where P_i is the probability the model assigns to the i-th correct word
    and N is the number of predicted words.
    
    inputs:
        model: FNN model
        X: np.array(n_sample, n_gram, embedding_dim)
        y: np.array(n_sample), int label of the next word    
    """
    
    """
    Add your code here
    
    hints:
        1. First predict the probabilities using the model
        2. The probability at the position of y is what you are looking for
    
    When you have too many UNK words, you will find the perplexity to be lower, but it doesn't 
    really mean your model is better. Can you think of why?
    """
    
    return perplexity
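
As a worked instance of the formula, the sketch below computes perplexity from a list of probabilities that a model assigned to the correct next words (the numbers are made up). It also evaluates the case of a uniform guess over a vocabulary of `vocab_size` words.

```python
import math

def perplexity(true_word_probs):
    """exp of the average negative log-probability of the correct words."""
    n = len(true_word_probs)
    return math.exp(-sum(math.log(p) for p in true_word_probs) / n)

# Geometric mean of the inverse probabilities: (2 * 4 * 2 * 4) ** (1/4)
print(perplexity([0.5, 0.25, 0.5, 0.25]))

vocab_size = 1000
print(perplexity([1.0 / vocab_size] * 50))   # approximately 1000
```

Perplexity can be read as the effective number of equally likely choices the model is hesitating between at each step.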

In [None]:
# Let's check the perplexity of the untrained model
# Is your value close to the vocabulary size? 
# Is this a coincidence?
X_dev_FNN, y_dev_FNN = next(gen_sample_FNN(headlines_dev, batch_size=-1, one_hot=False))

calculate_perplexity_FNN(FNN_model, X_dev_FNN, y_dev_FNN)

## Training the model

In [None]:
"""
Let's use the function defined above to report the model performance
after each epoch
"""
def on_epoch_end_FNN(epoch, logs):
    print('----- Generating text after Epoch: %d' % epoch)
    for i in range(3):
        print(generate_text_FNN(FNN_model))
    print('Current perplexity on dev data: ', 
          calculate_perplexity_FNN(FNN_model, X_dev_FNN, y_dev_FNN), '\n')

In [None]:
from keras.callbacks import LambdaCallback

"""
Notice how the metrics / generated text evolve after each epoch
"""
FNN_model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy', 'top_k_categorical_accuracy'])

batch_size = 512
steps_per_epoch = sum([len(s) + 1 for s in headlines_train]) // batch_size + 1
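# Note: fit_generator is deprecated in recent Keras releases, where
# fit accepts generators directly
FNN_model.fit_generator(gen_sample_FNN(headlines_train, batch_size=batch_size), 
                        epochs = 10, steps_per_epoch=steps_per_epoch,
                        callbacks=[LambdaCallback(on_epoch_end=on_epoch_end_FNN)])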

# Word-based RNN Language Model

## Build LSTM model

In [None]:
from keras.layers import Dense, LSTM, Activation, TimeDistributed

# For simplicity, we feed the word embeddings to the model directly, so there is
# no need to add an Embedding layer at the beginning. For possibly better performance
# you can add an Embedding layer, ideally using the GloVe embedding matrix as its
# initial value.
# This is also useful because the embeddings of missing vocabulary were filled with
# random values: allowing the embedding matrix to be updated during training can improve
# the performance for these words as well. Be prepared for slower training, though.

# Unfortunately, Keras does not have an easy way to support variable-length input for this RNN model,
# so we pad (or truncate) all the sentences to sent_len.
batch_size = 10
RNN_train_model = Sequential()
RNN_train_model.add(
    LSTM(128, input_shape=(sent_len, glove.shape[1]), return_sequences=True)
    )
RNN_train_model.add(TimeDistributed(Dense(len(vocab), activation='softmax')))
RNN_train_model.summary()
show_keras_model(RNN_train_model)

In [None]:
from keras.layers import Dense, LSTM, Activation, TimeDistributed

"""
# It's tricky to explain why we need the RNN_pred_model. 
# RNN_train_model.predict requires a fixed-length input (sent_len in our case).
# This is inconvenient because we need to generate the text one word at a time.
# The trick we play here is to create a shadow model with only 1 time step. We will
# copy the parameters of RNN_train_model to this model once it's trained.
# Check the generate_text_RNN function for details; there is some discussion 
# here: "https://github.com/keras-team/keras/issues/8771"
"""

RNN_pred_model = Sequential()
RNN_pred_model.add(
    LSTM(128, batch_input_shape=(1, 1, glove.shape[1]), return_sequences=True, stateful=True)
    )
RNN_pred_model.add(TimeDistributed(Dense(len(vocab), activation='softmax')))
RNN_pred_model.summary()
show_keras_model(RNN_pred_model)
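
Copying weights into a one-step stateful model works because a recurrent layer applies the same weights at every time step and only carries a hidden state forward. The NumPy sketch below uses a plain tanh recurrence rather than an LSTM (a simplification chosen for brevity; all names are illustrative) to show that processing a sequence in one call and feeding it one step at a time with a saved state produce the same outputs.

```python
import numpy as np

rng = np.random.RandomState(0)
dim, hidden, steps = 3, 4, 5
Wx = rng.normal(size=(dim, hidden))      # input-to-hidden weights
Wh = rng.normal(size=(hidden, hidden))   # hidden-to-hidden weights

def run_sequence(xs):
    """Process the whole sequence in one call, starting from a zero state."""
    h, outputs = np.zeros(hidden), []
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh)
        outputs.append(h)
    return np.array(outputs)

class StatefulStep:
    """Process one time step per call, keeping the state between calls."""
    def __init__(self):
        self.reset_states()

    def reset_states(self):
        self.h = np.zeros(hidden)

    def predict(self, x):
        self.h = np.tanh(x @ Wx + self.h @ Wh)
        return self.h

xs = rng.normal(size=(steps, dim))
full = run_sequence(xs)
stepper = StatefulStep()
stepwise = np.array([stepper.predict(x) for x in xs])
print(np.allclose(full, stepwise))   # True
```

Forgetting to reset the state between sentences would leak context from the previous sentence into the next one.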

## Training data generator

In [None]:
from keras.preprocessing.sequence import pad_sequences

def gen_sample_RNN(data, batch_size=100, one_hot=True):
    """
    The input is the same as for the FNN model, but the output training data is different.
    
    inputs:
        data: list of list of strings
        batch_size: int
        one_hot: boolean
        
    output:
        X: np.array(batch_size, sent_len, embedding_dim)
        Y: np.array(batch_size, sent_len, ) or np.array(batch_size, sent_len, vocab_size)
    """
    if batch_size == -1:
        batch_size = len(data)
        
    while True:
        # Shuffle the data so data order is different for different epochs
        random.shuffle(data)

        X, Y = [], []
        for s in data:
            X.append([to_label(START)] + [to_label(t) for t in s])
            Y.append([to_label(t) for t in s] + [to_label(END)])
            
            if len(X) >= batch_size:   
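                # truncating='post' keeps START at the front if a sentence is longer than sent_len
                X = pad_sequences(sequences=X, maxlen=sent_len, padding='post',
                                  truncating='post', value=to_label(END))
                Y = pad_sequences(sequences=Y, maxlen=sent_len, padding='post',
                                  truncating='post', value=to_label(END))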
          
                if one_hot: Y = to_categorical(Y, num_classes=len(re_vocab))
                
                yield to_embedding(X), Y
                
                X, Y = [], []
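
For the RNN, each sentence becomes one input sequence and one target sequence shifted by one position, both padded to a fixed length. A sketch with plain strings (the `"<s>"` / `"</s>"` markers and the `shift_and_pad` helper are placeholders for this example; the notebook itself pads label sequences with `pad_sequences`):

```python
def shift_and_pad(tokens, length, start="<s>", end="</s>"):
    """Build input/target sequences and right-pad (or cut) both to `length`."""
    def pad(seq):
        return (seq + [end] * length)[:length]

    x = [start] + list(tokens)   # input: the sentence shifted right by one
    y = list(tokens) + [end]     # target: the next word at every position
    return pad(x), pad(y)

x, y = shift_and_pad(["stocks", "rally"], length=5)
print(x)   # ['<s>', 'stocks', 'rally', '</s>', '</s>']
print(y)   # ['stocks', 'rally', '</s>', '</s>', '</s>']
```

At every position `t`, `y[t]` is the word that follows `x[t]`; the trailing end markers are padding.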

## Generate text

In [None]:
def generate_text_RNN(model, max_len=sent_len-1, seed=None):
    """
    Use the RNN_pred_model to generate text. Notice how we use the stateful model to generate
    the next word one by one. Make sure you fully understand each line of this code. 
    """
    if seed is None:
        seed = START
        result = []
    else:
        result = [seed]
    
    model.reset_states()
    
    for i in range(max_len):
        X = to_embedding([[to_label(seed)]])
        idx = sample_with_weight(model.predict(X)[0][0])
        
        if vocab[idx] == END: break
            
        seed = vocab[idx]
        result.append(seed)
        
    return ' '.join(result)

In [None]:
generate_text_RNN(RNN_pred_model, seed="china")

## Calculate Perplexity

In [None]:
def calculate_perplexity_RNN(model, X, y):
    """
    For a given RNN model and test data, calculate the perplexity.
    The definition of perplexity is:
    
    perplexity = exp(-(1/N) * sum_i log(P_i))
    
    where P_i is the probability the model assigns to the i-th correct word
    and N is the number of predicted words.
    
    inputs:
        model: RNN model
        X: np.array(n_sample, sent_len, embedding_dim)
        y: np.array(n_sample, sent_len), int labels
    """
    
    """
    Add your code here
    
    hints:
        1. First predict the probabilities using the RNN_train_model
        2. The probability at the position of y is what you are looking for
        3. All sentences have a fixed length, so a sentence can end with several padding END
           tokens. Consider stopping the count once you hit the first END; otherwise
           your perplexity will look too good.
    
    When you have too many UNK words, you will find the perplexity to be lower, but it doesn't 
    really mean your model is better. Can you think of why?
    """
    
    return perplexity
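
Hint 3 can be illustrated on toy data. The sketch below assumes the per-position probabilities of the correct words have already been extracted, and counts each sentence only up to and including its first END label (that cutoff is one reading of the hint; the label value `END_LABEL = 0` and all numbers are made up).

```python
import math

END_LABEL = 0   # hypothetical label of the END token

def masked_perplexity(true_word_probs, labels):
    """Perplexity over all positions up to and including the first END per sentence."""
    log_sum, count = 0.0, 0
    for probs, labs in zip(true_word_probs, labels):
        for p, lab in zip(probs, labs):
            log_sum += math.log(p)
            count += 1
            if lab == END_LABEL:
                break            # ignore the padding after the first END
    return math.exp(-log_sum / count)

labels = [[5, 7, 0, 0, 0],
          [3, 0, 0, 0, 0]]
probs = [[0.1, 0.2, 0.5, 0.99, 0.99],
         [0.25, 0.4, 0.99, 0.99, 0.99]]
print(masked_perplexity(probs, labels))   # uses only 5 of the 10 positions
```

Including the padded positions, which a trained model predicts with high probability, would pull the value down without the model being any better at real words.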

In [None]:
# Let's check the perplexity of the untrained model
# Is your value close to the vocabulary size? 
# Is this a coincidence?
X_dev_RNN, y_dev_RNN = next(gen_sample_RNN(headlines_dev, batch_size=-1, one_hot=False))

calculate_perplexity_RNN(RNN_train_model, X_dev_RNN, y_dev_RNN)

## Train model

In [None]:
"""
Let's use the function defined above to report the model performance
after each epoch
"""
def on_epoch_end_RNN(epoch, logs):
    RNN_pred_model.set_weights(RNN_train_model.get_weights())
    print('----- Generating text after Epoch: %d' % epoch)
    for i in range(3):
        print(generate_text_RNN(RNN_pred_model))
    print('Current perplexity on dev data: ', 
          calculate_perplexity_RNN(RNN_train_model, X_dev_RNN, y_dev_RNN), '\n')

In [None]:
"""
Notice how the metrics / generated text evolve after each epoch
"""
batch_size = 10
num_batches = len(headlines_train) // batch_size 
RNN_train_model.compile(loss='categorical_crossentropy', optimizer='adam')
RNN_train_model.fit_generator(gen_sample_RNN(headlines_train, batch_size), 
                              steps_per_epoch=num_batches, epochs=3,
                              callbacks=[LambdaCallback(on_epoch_end=on_epoch_end_RNN)])