# HW 2: Language Modeling

In this homework you will be building several varieties of language models.

## Goal

We ask that you construct the following models in PyTorch:

1. A trigram model with linear-interpolation. $$p(y_t | y_{1:t-1}) =  \alpha_1 p(y_t | y_{t-2}, y_{t-1}) + \alpha_2 p(y_t | y_{t-1}) + (1 - \alpha_1 - \alpha_2) p(y_t) $$
2. A neural network language model (consult *A Neural Probabilistic Language Model* http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
3. An LSTM language model (consult *Recurrent Neural Network Regularization*, https://arxiv.org/pdf/1409.2329.pdf) 
4. Your own extensions to these models...


Consult the papers provided for hyperparameters.

 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [None]:
# Text text processing library
import torchtext
from torchtext.vocab import Vectors

The dataset we will use of this problem is known as the Penn Treebank (http://aclweb.org/anthology/J93-2004). It is the most famous dataset in NLP and includes a large set of different types of annotations. We will be using it here in a simple case as just a language modeling dataset.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [None]:
# Our input $x$
TEXT = torchtext.data.Field()

Next we input our data. Here we will use the first 10k sentences of the standard PTB language modeling split, and tell it the fields.

In [None]:
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", 
    train="train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)

The data format for language modeling is strange. We pretend the entire corpus is one long sentence.

In [None]:
print('len(train)', len(train))

Here's the vocab itself. (This dataset has unk symbols already, but torchtext adds its own.)

In [None]:
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

When debugging you may want to use a smaller vocab size. This will run much faster.

In [None]:
if False:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)

The batching is done in a strange way for language modeling. Each element of the batch consists of `bptt_len` words in order. This makes it easy to run recurrent models like RNNs. 

In [None]:
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=10, device=-1, bptt_len=32, repeat=False)

Here's what these batches look like. Each is a string of length 32. Sentences are ended with a special `<eos>` token.

In [None]:
it = iter(train_iter)
batch = next(it) 
print(vars(batch))
print("Size of text batch [max bptt length, batch size]", batch.text.size())
print("Second in batch", batch.text[:, 2])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

The next batch will be the continuation of the previous. This is helpful for running recurrent neural networks where you remember the current state when transitioning.

In [None]:
batch = next(it)
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 3].data]))
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 4].data]))

There are no separate labels. But you can just use an offset `batch.text[1:]` to get the next word.

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 3 different torch models that take in batch.text and produce a distribution over the next word. 

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition: https://www.kaggle.com/c/cs287-hw2-s18

For the final Kaggle test, we will have you do a next word prediction task. We will provide a 10 word prefix of sentences, and it is your job to predict 10 possible next word candidates

In [None]:
!head input.txt

As a sample Kaggle submission, let us build a simple unigram model.  

In [None]:
from collections import Counter
count = Counter()
for b in iter(train_iter):
    count.update(b.text.view(-1).data.tolist())
count[TEXT.vocab.stoi["<eos>"]] = 0
predictions = [TEXT.vocab.itos[i] for i, c in count.most_common(20)]
print(predictions)
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print("%d,%s"%(i, " ".join(predictions)), file=fout)


In [None]:
!head sample.txt

The metric we are using is mean average precision of your 20-best list. 

$$MAP@20 = \frac{1}{|D|} \sum_{u=1}^{|D|} \sum_{k=1}^{20} Precision(u, 1:k)$$

Ideally we would use log-likelihood or ppl as discussed in class, but this is the best Kaggle gives us. This takes into account whether you got the right answer and how highly you ranked it. 

In particular, we ask that you do not game this metric. Please submit *exactly 20* unique predictions for each example.


As always you should put up a 5-6 page write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

# Neural Probabalistic Language Model

In [94]:
# Text text processing library
import torchtext
import torch
import math 
import torch.nn as nn 
from torchtext.vocab import Vectors
DEBUG = True 
# Our input $x$
TEXT = torchtext.data.Field()
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", 
    train="train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

if DEBUG == True:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)
    print('len(TEXT.vocab)', len(TEXT.vocab))
    
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=12, device=-1, bptt_len=32, repeat=False, shuffle=False)

len(TEXT.vocab) 10001
len(TEXT.vocab) 1002


In [95]:
class NPLM(nn.Module):

    def __init__(self, V_vocab_dim, M_embed_dim, H_hidden_dim, N_seq_len):
        super(NPLM, self).__init__()
        self.vocab_size = V_vocab_dim
        self.embed = nn.Embedding(V_vocab_dim, M_embed_dim)
        self.hidden_linear = nn.Linear(M_embed_dim*N_seq_len, H_hidden_dim, bias=True)
        self.tanh_act = nn.Tanh()
        self.U_linear = nn.Linear(H_hidden_dim, V_vocab_dim, bias=True)
        self.W_linear = nn.Linear(M_embed_dim*N_seq_len, V_vocab_dim, bias=True)
        self.softmax = nn.LogSoftmax()

    def forward(self, x):
        x_embed = self.embed(x)
        print(x.shape)
        #print(x_embed.shape)
        x_flat = torch.autograd.Variable(torch.cat(x_embed.data, dim=1))
        #print(x_flat.shape)
        hidden_feat = self.hidden_linear(x_flat)
        #print(hidden_feat.shape)
        hidden_act = self.tanh_act(hidden_feat)
        #print(hidden_act.shape)
        hidden_out = self.U_linear(hidden_act)
        #print(hidden_out.shape)
        direct = self.W_linear(x_flat)
        #print(direct.shape)
        out = direct + hidden_out
        #print(out.shape)
        return self.softmax(out)

In [96]:
def evaluate(model, data_iterator):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0
    batch_count = 0 
    for batch in iter(data_iterator):
        for i in range(batch.text.size(0) - 5):
            output = model(batch.text[i:i+5])
            target = batch.target[i+5]
            total_loss += criterion(output, target).data
            batch_count += 1 
    return total_loss[0] / batch_count

def train_batch(model, criterion, optim, batch, label):
    # initialize hidden vectors
    model.zero_grad()
    # calculate forward pass
    y = model(batch)
    # calculate loss    
    loss = criterion(y, label)
    # backpropagate and step
    loss.backward()
    optim.step()
    return loss.data[0]

# training loop
def train(model, criterion, optim, data_iterator):

    for e in range(n_epochs):
        batches = 0
        epoch_loss = 0
        avg_loss = 0
        for batch in iter(data_iterator):
            for i in range(batch.text.size(0) - 5): 
                batch_loss = train_batch(model, criterion, optim, batch.text[i:i+5], batch.target[i+5])
                batches += 1
                epoch_loss += batch_loss
                avg_loss = ((avg_loss * (batches - 1)) + batch_loss) / batches
        print("Epoch ", e, " Loss: ", epoch_loss, "Perplexity: ", math.exp(avg_loss))
        loss = evaluate(model, val_iter)
        print("Epoch Val Loss: ", loss, "Perplexity: ", math.exp(loss)) 
        loss = evaluate(model, test_iter)
        print("Epoch Test Loss: ", loss, "Perplexity: ", math.exp(loss))

# size of the embeddings and vectors
n_embedding = 30
n_hidden = 60
seq_len = 5

# initialize LSTM
#npl = NPLM(len(TEXT.vocab), n_embedding, n_hidden, seq_len)

n_epochs = 10
learning_rate = .1
criterion = nn.NLLLoss()
optim = torch.optim.SGD(npl.parameters(), lr = learning_rate)


train(npl, criterion, optim, test_iter)



Epoch  0  Loss:  22673.985867977142 Perplexity:  79.34773217255078


  


Epoch Val Loss:  4.03994298864294 Perplexity:  56.82310315089835
Epoch Test Loss:  4.03994298864294 Perplexity:  56.82310315089835
Epoch  1  Loss:  21528.389946460724 Perplexity:  63.615235842668454
Epoch Val Loss:  3.901693085093557 Perplexity:  49.486162502651055
Epoch Test Loss:  3.901693085093557 Perplexity:  49.486162502651055
Epoch  2  Loss:  20885.206489682198 Perplexity:  56.192433694483874
Epoch Val Loss:  3.8063414773823303 Perplexity:  44.98555676089384
Epoch Test Loss:  3.8063414773823303 Perplexity:  44.98555676089384
Epoch  3  Loss:  20430.30175971985 Perplexity:  51.47161198226394
Epoch Val Loss:  3.7365888430748457 Perplexity:  41.95463196198353
Epoch Test Loss:  3.7365888430748457 Perplexity:  41.95463196198353
Epoch  4  Loss:  20079.03037214279 Perplexity:  48.09940214962427
Epoch Val Loss:  3.681728033371914 Perplexity:  39.71496359275268
Epoch Test Loss:  3.681728033371914 Perplexity:  39.71496359275268
Epoch  5  Loss:  19793.89736711979 Perplexity:  45.525255551392

In [97]:
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print(l)
        input_tokens = l.split(" ")[-6:-1]
        input_index = torch.LongTensor([TEXT.vocab.stoi[t] for t in input_tokens]).unsqueeze(1)
        output = npl(torch.autograd.Variable(input_index))
        max_values, max_indices = torch.topk(output, 20)
        print(" ".join([TEXT.vocab.itos[int(i)] for i in max_indices.data[0, :]]))
        
        #print("%d,%s"%(i, " ".join(predictions)), file=fout)

but while the new york stock exchange did n't fall ___

market says <eos> <unk> to august it of new that steel in mr. a 's six was meanwhile he N
some circuit breakers installed after the october N crash failed ___

and financing N <unk> <eos> rose from had in mr. is its yet make 's did earnings year to index
the N stock specialist firms on the big board floor ___

many <unk> a firms <eos> its will and he the to company risk N general their they by in j.
big investment banks refused to step up to the plate ___

he <unk> <eos> new the secretary that is in late a as now to they n't then have such de
heavy selling of standard & poor 's 500-stock index futures ___

closed <unk> the that have still a $ N plans line points to n't commission debt much be yet out
seven big board stocks ual amr bankamerica walt disney capital ___

and in <unk> <eos> british director chairman is mr. soviet says under it are the of james management to does
once again the specialists were not able to handle the __



<eos> and of <unk> first 's to N it car with a retail he in $ for name him but
by lifting <unk> production the expansion will also lower the ___

<unk> future is he to tax can and major <eos> for a offer quarter in have had of the 's
quantum is also tightening its grip on its one large ___

that <unk> to campaign and in would is by according ever based a as <eos> N big said talks more
through a venture with its investment banker first boston corp. ___

<unk> N is after <eos> drexel that annual $ monday 's it as in and which the to loss firms
some analysts speculate the weakening stock may yet attract a ___

<unk> <eos> N to price that it yesterday of economy in $ with and bid 's on last boston offer
the name <unk> in rumors is british petroleum co. which ___

<unk> to the that earnings looking and in <eos> division a of for firms get were on been mr. big
asked about a bid for quantum a <unk> spokesman says ___

is that <unk> N has big a have would to plan for at from while mr. paid wil


<unk> first buy-out in with is N a he and of that such under for second <eos> would to average
california 's <unk> federal bank awarded its $ N million ___

N <unk> <eos> the year from says $ in for it this a which have to said is but was
the account was previously handled by davis ball & <unk> ___

<eos> <unk> he to its it mr. and a under N is begin traders be new last an all for
royal crown <unk> co. has ended its relationship with the ___

of <unk> to down and 's said will <eos> market from a index the francisco for N at would is
the account had billed about $ N million in N ___

<unk> and <pad> <eos> a in because to from year is that higher it close trading on up with quarter
as expected young & rubicam inc. along with two senior ___

share <unk> offered years notes have to <eos> the he year for that by international debt mr. individual with john
the government has charged that they <unk> <unk> officials to ___

<unk> international into <eos> 's a canadian N that two to union man 

<unk> investment with the of a price corp. their york face income yesterday believe and it is to $ in
in addition many cable-tv systems themselves are airing more local ___

<unk> to traders around buy-out industry the corp. loans buy earnings also with it that a concern more are 's
its watchers are on the whole a <unk> group of ___

<unk> in whose on would <eos> need and customers officials for N that net have is says major who the
that 's less than one-third the time that viewers watch ___

<eos> its <unk> to the in and executive but been who a of he or west not from what it
the brief attention viewers give <unk> could put it at ___

<unk> through <eos> the loss lynch is march and help prices N been up industry that of market york several
its strategy in the past has been to serve as ___

as N the <unk> at <eos> and buyers on their buy from an to too its robert into have expected
it focused on building up its news bureaus around the ___

francisco says market of <unk> and is center t

them the british for <unk> big week in on expected and soviet earnings savings a that bank of when its
the amount of income <unk> up for each man woman ___

<unk> <eos> a its in and he their as but by traders the might new said transportation that loan corp.
per capita personal income ranged from $ N in mississippi ___

<unk> year <eos> on and the its mr. N now west when that china market to each had if company
there are N million students in college this fall up ___

<unk> N <eos> n't to $ a and in is it market street at the stock he they new was
about N N are women and N N are <unk> ___

<eos> and close <unk> a there try that 's rate in on rose n't been it national he is yesterday
trouble with a capital <unk> and that <unk> with <unk> ___

and <unk> <eos> that the last other would in made there a of for have it at did than N
more than N years ago prof. harold hill the con ___

and is <eos> says such <unk> many in that its there the on it last which he a but black
now <unk> spirits on

<unk> and <eos> from but that the in a might on companies quickly at it N stock of would mr.
most often they do just that because stocks have proved ___

<unk> <eos> prices your a the in and to new is at lot N that down has on july of
if you bought after the crash you did very very ___

<unk> the during after kind 'll says <eos> in was have he prices morgan other that today to begin and
the $ N billion california public employees retirement system for ___

<unk> N in that at is to and have ' mr. he it growth <eos> the for we before were
the last crash taught institutional investors that they have to ___

such <unk> efforts the a be <eos> and recent said 's that N more about or settlement mr. major dealers
those that pulled out of stocks <unk> it he said ___

<unk> <eos> he 's with again orders N n't at company in to might dropped medical up although and this
stocks as measured by the standard & poor 's 500-stock ___

futures <unk> <eos> to by in it a the with of off adds its can market

$ N million of remic mortgage securities offered by citicorp ___

and <eos> <unk> executives might a group ' sell have but i on the bonds its wo that we from
the offering series N is backed by freddie mac N ___

and drop last to in <eos> a N year according it its old of was high from the <unk> that
$ N million of stripped mortgage securities underwritten by <unk> ___

<unk> and customers <eos> a mr. we total executives ltd. inc. most capital in its at N all have but
the agency 's first strips issue collateralized by freddie mac ___

and <unk> <eos> the its a in N bank to been says 's is when be that mr. $ he
the <unk> securities will be <unk> by <unk> securities into ___

commission and once executive commercial have still new the old N of to it for soon & we that in
the <unk> securities pay the principal from the underlying freddie ___

<unk> is and the banks <eos> a that N in other working agency he finance price it area mr. domestic
freddie mac said the <unk> securities were priced 

many of them stress that the selling can be orderly ___

<eos> of <unk> the off by last been to seeking he in we as 's says a company and very
on thursday william <unk> a seattle money manager used futures ___

capital <unk> services case turned this the that share N computers mr. expected there a better already got do n't
he thinks the underlying inflation rate is around N N ___

share a from <eos> $ with N over traders before for issue that to <unk> some when its in the
in the pension accounts he manages mr. <unk> has raised ___

when <eos> less taken it and already years <unk> in project a mr. from that N says to could for
he thinks government officials are <unk> to let a recession ___

a <eos> did in be something take <unk> said would of that thing drop bid is it for and his
so he thinks the government will <unk> on the side ___

<unk> u.s. remain and mr. says as late a world according so exchange of then if help yesterday higher that
as a result mr. <unk> says i think the ball ___

N computers stocks <eos> to sell further was much he the bought a with for any position there but would
during the quarter the company realized a pretax gain of ___

<unk> N in and new after <eos> to that would own include on paper the which stock john market for
combined foreign exchange and bond trading profits dipped N N ___

a in <eos> before for to $ <unk> N year from with futures the that its it was sales this
gains from first chicago 's venture capital unit a big ___

like <eos> holdings center even through official she the washington longer most agency n't means <unk> things to calif. without
greece 's second <unk> of general elections this year is ___

<unk> in department market buying there officials firm analyst around growth and said it other impact makers trade to make
for those hoping to see a <unk> of political <unk> ___

and its <eos> the <unk> enough of finance for got up in black a who you make says 're 's
in the <unk> round of voting <unk> gave no clear ___

and $ bi

<unk> <eos> such major four are to of their the much N include according by and will that who in
on friday andrew <unk> intel president and chief executive officer ___

<unk> in <eos> sold a to build the david N them after will was company leading were would on profit
our bookings improved as the quarter <unk> and september was ___

N chicago <eos> are with black <unk> in it increase could has new be before near to $ will says
for the full quarter our bookings were higher than the ___

magazine banks <unk> of for <eos> at 's the to closed N expected was that department and been year point
for the nine-month period intel reported net of $ N ___

or in and a <unk> N they that on year the <eos> its at to from of mr. 's $
revenue amounted to $ N billion up slightly from $ ___

million billion <unk> we this sale then a <eos> to or that us and the in day it but made
after N years in prison mr. sisulu the <unk> former ___

and <eos> the make of a <unk> is work in an its very when $ they that 