# HW 2: Language Modeling

In this homework you will be building several varieties of language models.

## Goal

We ask that you construct the following models in PyTorch:

1. A trigram model with linear-interpolation. $$p(y_t | y_{1:t-1}) =  \alpha_1 p(y_t | y_{t-2}, y_{t-1}) + \alpha_2 p(y_t | y_{t-1}) + (1 - \alpha_1 - \alpha_2) p(y_t) $$
2. A neural network language model (consult *A Neural Probabilistic Language Model* http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
3. An LSTM language model (consult *Recurrent Neural Network Regularization*, https://arxiv.org/pdf/1409.2329.pdf) 
4. Your own extensions to these models...


Consult the papers provided for hyperparameters.

 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [None]:
# Text text processing library
import torchtext
from torchtext.vocab import Vectors

The dataset we will use of this problem is known as the Penn Treebank (http://aclweb.org/anthology/J93-2004). It is the most famous dataset in NLP and includes a large set of different types of annotations. We will be using it here in a simple case as just a language modeling dataset.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [None]:
# Our input $x$
TEXT = torchtext.data.Field()

Next we input our data. Here we will use the first 10k sentences of the standard PTB language modeling split, and tell it the fields.

In [None]:
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", 
    train="train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)

The data format for language modeling is strange. We pretend the entire corpus is one long sentence.

In [None]:
print('len(train)', len(train))

Here's the vocab itself. (This dataset has unk symbols already, but torchtext adds its own.)

In [None]:
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

When debugging you may want to use a smaller vocab size. This will run much faster.

In [None]:
if False:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)

The batching is done in a strange way for language modeling. Each element of the batch consists of `bptt_len` words in order. This makes it easy to run recurrent models like RNNs. 

In [None]:
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=10, device=-1, bptt_len=32, repeat=False)

Here's what these batches look like. Each is a string of length 32. Sentences are ended with a special `<eos>` token.

In [None]:
it = iter(train_iter)
batch = next(it) 
print(vars(batch))
print("Size of text batch [max bptt length, batch size]", batch.text.size())
print("Second in batch", batch.text[:, 2])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

The next batch will be the continuation of the previous. This is helpful for running recurrent neural networks where you remember the current state when transitioning.

In [None]:
batch = next(it)
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 3].data]))
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 4].data]))

There are no separate labels. But you can just use an offset `batch.text[1:]` to get the next word.

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 3 different torch models that take in batch.text and produce a distribution over the next word. 

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition: https://www.kaggle.com/c/cs287-hw2-s18

For the final Kaggle test, we will have you do a next word prediction task. We will provide a 10 word prefix of sentences, and it is your job to predict 10 possible next word candidates

In [None]:
!head input.txt

As a sample Kaggle submission, let us build a simple unigram model.  

In [None]:
from collections import Counter
count = Counter()
for b in iter(train_iter):
    count.update(b.text.view(-1).data.tolist())
count[TEXT.vocab.stoi["<eos>"]] = 0
predictions = [TEXT.vocab.itos[i] for i, c in count.most_common(20)]
print(predictions)
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print("%d,%s"%(i, " ".join(predictions)), file=fout)


In [None]:
!head sample.txt

The metric we are using is mean average precision of your 20-best list. 

$$MAP@20 = \frac{1}{|D|} \sum_{u=1}^{|D|} \sum_{k=1}^{20} Precision(u, 1:k)$$

Ideally we would use log-likelihood or ppl as discussed in class, but this is the best Kaggle gives us. This takes into account whether you got the right answer and how highly you ranked it. 

In particular, we ask that you do not game this metric. Please submit *exactly 20* unique predictions for each example.


As always you should put up a 5-6 page write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

# Neural Probabalistic Language Model

In [15]:
# Text text processing library
import torchtext
import torch
import math 
import torch.nn as nn 
from torchtext.vocab import Vectors
from torch.autograd import Variable
DEBUG = True 
# Our input $x$
TEXT = torchtext.data.Field()
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", 
    train="train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

if DEBUG == True:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)
    print('len(TEXT.vocab)', len(TEXT.vocab))
    
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=12, device=-1, bptt_len=32, repeat=False, shuffle=False)

len(TEXT.vocab) 10001
len(TEXT.vocab) 1002


In [25]:
class LSTM(nn.Module):
    
    def __init__(self, V_vocab_dim, M_embed_dim, H_hidden_dim, N_seq_len, B_batch_size):
        super(LSTM, self).__init__()
        self.batch_size = B_batch_size
        self.hidden_dim = H_hidden_dim
        self.vocab_dim = V_vocab_dim
        self.embed = nn.Embedding(V_vocab_dim, M_embed_dim)
        self.dropout = nn.Dropout(p=0.3)
        self.lstm = nn.LSTM(M_embed_dim, self.hidden_dim, dropout=0.3)
        self.fc = nn.Linear(self.hidden_dim, V_vocab_dim)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        if torch.cuda.is_available():
            return (torch.autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)).cuda(),
                torch.autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)).cuda())
        else:   
            return (torch.autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)),
                torch.autograd.Variable(torch.zeros(1, self.batch_size, self.hidden_dim)))

    def forward(self, sentence):
        #print(sentence.shape)
        #input size N_seq_len x B_batch_size
        embeds = self.embed(sentence)
        #print(embeds.shape)
        # embeds size N_seq_len x B_batch_size x M_embed_dim
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        #print(lstm_out.shape)
        # lstm_out N_seq_len x B_batch_size x H_hidden_dim
        out = self.fc(self.dropout(lstm_out))
        #print(out.shape)
        # out N_seq_len x B_batch_size x V_vocab_dim
        return out
    
    def repackage_hidden(self, h):
        if type(h) == Variable:
            return Variable(h.data)
        else:
            return tuple(self.repackage_hidden(v) for v in h)

In [26]:
def evaluate(model, data_iterator):
    # Turn on evaluation mode which disables dropout.
    #model.eval()
    total_loss = 0
    batch_count = 0
    for batch in iter(data_iterator):
        #model.hidden = model.init_hidden()
        model.hidden = model.repackage_hidden(model.hidden)
        output = model(batch.text)
        batch_loss = criterion(output.view(-1, model.vocab_dim), batch.target.view(-1)).data
        total_loss += batch_loss
        batch_count += 1
    return total_loss[0] / batch_count

def train_batch(model, criterion, optim, batch, target):
    # initialize hidden vectors
    model.zero_grad()
    #model.hidden = model.init_hidden()
    model.hidden = model.repackage_hidden(model.hidden)
    # calculate forward pass
    y = model(batch)
    # calculate loss
    loss = criterion(y.view(-1, model.vocab_dim), target.view(-1))
    # backpropagate and step
    loss.backward()
    torch.nn.utils.clip_grad_norm(model.parameters(), 0.25)
    optim.step()
    return loss.data[0]

# training loop
def run_training(model, criterion, optim, data_iterator, val_iter):

    for e in range(n_epochs):
        batches = 0
        epoch_loss = 0
        for batch in iter(data_iterator):
            batch_loss = train_batch(model, criterion, optim, batch.text, batch.target)
            batches += 1
            epoch_loss += batch_loss
        epoch_loss /= batches
        print("Epoch ", e, " Loss: ", epoch_loss, "Perplexity: ", math.exp(epoch_loss))
        train_loss = evaluate(model, data_iterator)
        print("Epoch Train Loss: ", train_loss, "Perplexity: ", math.exp(train_loss))
        val_loss = evaluate(model, val_iter)
        print("Epoch Val Loss: ", val_loss, "Perplexity: ", math.exp(val_loss))
        torch.save(model.state_dict(), 'LSTM_small_model.pt')

In [28]:
# size of the embeddings and vectors
n_embedding = 30
n_hidden = 30
seq_len = 32
batch_size = 12

# initialize LSTM
lstm_model = LSTM(len(TEXT.vocab), n_embedding, n_hidden, seq_len, batch_size)

n_epochs = 10
learning_rate = .5
criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(lstm_model.parameters(), lr = learning_rate)


run_training(lstm_model, criterion, optim, train_iter, val_iter)



Epoch  0  Loss:  4.47050894769151 Perplexity:  87.40119433178403


  


Epoch Train Loss:  4.309644532863486 Perplexity:  74.4140324953781
Epoch Val Loss:  4.352856571810233 Perplexity:  77.70010213179337
Epoch  1  Loss:  4.2357210825416285 Perplexity:  69.11149586332118
Epoch Train Loss:  4.184478343788724 Perplexity:  65.65924043081858
Epoch Val Loss:  4.229981417482999 Perplexity:  68.71595524657562
Epoch  2  Loss:  4.140948725751194 Perplexity:  62.86243237589756
Epoch Train Loss:  4.112700798352954 Perplexity:  61.11154484647933
Epoch Val Loss:  4.156256641130991 Perplexity:  63.83212824659654
Epoch  3  Loss:  4.078079010917351 Perplexity:  59.03196110744583
Epoch Train Loss:  4.059363786271169 Perplexity:  57.937438756913835
Epoch Val Loss:  4.1050750258055375 Perplexity:  60.64729448904507
Epoch  4  Loss:  4.03260193901188 Perplexity:  56.40748931785042
Epoch Train Loss:  4.018811635171417 Perplexity:  55.634951906998936
Epoch Val Loss:  4.065726957172927 Perplexity:  58.30728001499659


KeyboardInterrupt: 

In [None]:
lstm_model = LSTM(len(TEXT.vocab), n_embedding, n_hidden, seq_len, batch_size)
lstm_model.load_state_dict(torch.load('LSTM_small_model.pt'))
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print(l)
        input_tokens = l.split(" ")[:-1]
        input_index = torch.LongTensor([TEXT.vocab.stoi[t] for t in input_tokens]).unsqueeze(1)
        lstm_model.hidden = lstm_model.init_hidden()
        output = lstm_model(torch.autograd.Variable(input_index))
        clean_output = output[-1, 0, :].view(-1, lstm_model.vocab_dim)
        max_values, max_indices = torch.topk(clean_output, 20)
        print(" ".join([TEXT.vocab.itos[int(i)] for i in max_indices.data[0, :]]))

        #print("%d,%s"%(i, " ".join(predictions)), file=fout)