# HW 2: Language Modeling

In this homework you will be building several varieties of language models.

## Goal

We ask that you construct the following models in PyTorch:

1. A trigram model with linear interpolation. $$p(y_t | y_{1:t-1}) = \alpha_1 p(y_t | y_{t-2}, y_{t-1}) + \alpha_2 p(y_t | y_{t-1}) + (1 - \alpha_1 - \alpha_2) p(y_t)$$
2. A neural network language model (consult *A Neural Probabilistic Language Model* http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
3. An LSTM language model (consult *Recurrent Neural Network Regularization*, https://arxiv.org/pdf/1409.2329.pdf) 
4. Your own extensions to these models...


Consult the papers provided for hyperparameters.
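To make the interpolation formula in item 1 concrete, here is a minimal count-based sketch in plain Python. The helper names (`train_counts`, `interpolated_prob`), the toy corpus, and the fixed values of $\alpha_1$ and $\alpha_2$ are all assumptions for illustration; the assignment's own trigram model may store and tune these differently.

```python
from collections import Counter

def train_counts(tokens):
    """Collect n-gram counts plus the context counts used as denominators."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Context counts only include positions that are followed by another word
    bi_ctx = Counter(tokens[:-1])
    tri_ctx = Counter(zip(tokens[:-2], tokens[1:-1]))
    return uni, bi, tri, bi_ctx, tri_ctx

def interpolated_prob(w, prev2, prev1, counts, alpha1, alpha2):
    """Mix trigram, bigram, and unigram maximum-likelihood estimates."""
    uni, bi, tri, bi_ctx, tri_ctx = counts
    p_uni = uni[w] / sum(uni.values())
    # Unseen contexts contribute zero instead of dividing by zero
    p_bi = bi[(prev1, w)] / bi_ctx[prev1] if bi_ctx[prev1] else 0.0
    p_tri = (tri[(prev2, prev1, w)] / tri_ctx[(prev2, prev1)]
             if tri_ctx[(prev2, prev1)] else 0.0)
    return alpha1 * p_tri + alpha2 * p_bi + (1 - alpha1 - alpha2) * p_uni

tokens = "the cat sat on the mat the cat ran".split()
counts = train_counts(tokens)
p = interpolated_prob("sat", "the", "cat", counts, alpha1=0.5, alpha2=0.3)
```

Because each component is a proper distribution whenever its context has been seen, the mixture also sums to one over the vocabulary for such contexts.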

 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [1]:
!pip install -q torch torchtext opt_einsum
!pip install -qU git+https://github.com/harvardnlp/namedtensor

In [2]:
%run models/trigram.py
%run models/utils.py

In [3]:
# Text processing library
import torchtext
from torchtext.vocab import Vectors
import torch.sparse as sp
import torch
from tqdm import tqdm_notebook as tqdm
from torch import nn
from namedtensor import ntorch
import namedtensor.nn as nnn

The dataset we will use for this problem is known as the Penn Treebank (http://aclweb.org/anthology/J93-2004). It is the most famous dataset in NLP and includes a large set of different types of annotations. We will be using it here in a simple case as just a language modeling dataset.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [6]:
# Our input $x$
TEXT = torchtext.data.Field()

Next we input our data. Here we will use the first 10k sentences of the standard PTB language modeling split, and tell it the fields.

In [7]:
# Data distributed with the assignment
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".", 
    train="train.txt", validation="valid.txt", test="valid.txt", text_field=TEXT)

The data format for language modeling is strange. We pretend the entire corpus is one long sentence.

In [8]:
print('len(train)', len(train))

len(train) 1


Here's the vocab itself. (This dataset has unk symbols already, but torchtext adds its own.)

In [9]:
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

len(TEXT.vocab) 10001


When debugging you may want to use a smaller vocab size. This will run much faster.

In [10]:
if False:
    TEXT.build_vocab(train, max_size=1000)
    len(TEXT.vocab)

The batching is done in a strange way for language modeling. Each element of the batch consists of `bptt_len` words in order. This makes it easy to run recurrent models like RNNs. 

In [11]:
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(
    (train, val, test), batch_size=10, bptt_len=32, repeat=False)
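The layout this iterator produces can be approximated in a few lines of plain PyTorch. This is a simplified sketch, not the torchtext implementation: `bptt_batches` is a hypothetical helper, and it drops the leftover tail of the corpus, which the real iterator may handle differently.

```python
import torch

def bptt_batches(token_ids, batch_size, bptt_len):
    """Yield (text, target) pairs, each shaped [<= bptt_len, batch_size]."""
    data = torch.tensor(token_ids)
    n_per_stream = data.numel() // batch_size
    # Drop the tail that does not fit, then lay out one contiguous stream per column
    data = data[: n_per_stream * batch_size].view(batch_size, n_per_stream).t()
    for start in range(0, n_per_stream - 1, bptt_len):
        length = min(bptt_len, n_per_stream - 1 - start)
        text = data[start : start + length]
        target = data[start + 1 : start + 1 + length]  # same window shifted by one
        yield text, target

# Toy corpus of 20 "word ids": column 0 reads 0..9 and column 1 reads 10..19
batches = list(bptt_batches(list(range(20)), batch_size=2, bptt_len=4))
```

Each column is an independent contiguous stream of the corpus, which is why consecutive batches continue where the previous one stopped.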

Here's what these batches look like. Each is a string of length 32. Sentences are ended with a special `<eos>` token.

In [12]:
it = iter(train_iter)
batch = next(it) 
print("Size of text batch [max bptt length, batch size]", batch.text.size())
print("Second in batch", batch.text[:, 2])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Size of text batch [max bptt length, batch size] torch.Size([32, 10])
Second in batch tensor([   8,  202,   77,    5,  183,  561, 3837,   18,  975,  976,    7,  943,
           5,  157,   78, 1571,  289,  645,    3,   30,  132,    0,   20,    2,
         273, 7821,   17,    9,  117, 2815,  969,    6])
Converted back to string:  in part because of buy programs generated by stock-index arbitrage a form of program trading involving futures contracts <eos> but interest <unk> as the day wore on and investors looked ahead to


The next batch will be the continuation of the previous. This is helpful for running recurrent neural networks where you remember the current state when transitioning.

In [13]:
batch = next(it)
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Converted back to string:  the release later this week of two important economic reports <eos> the first is wednesday 's survey of purchasing managers considered a good indicator of how the nation 's manufacturing sector fared
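One common way to exploit this continuity is to pass the recurrent state from one batch into the next. The sketch below uses random token ids in place of real batches, and the sizes are arbitrary assumptions. It detaches the state between batches so that gradients stop at the batch boundary, which is a frequent choice but not the only one.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_dim, hidden_dim = 50, 8, 16   # illustrative sizes only
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim)           # expects [seq_len, batch, features]

state = None  # passing None makes nn.LSTM start from a zero state
for _ in range(3):  # stand-in for three consecutive BPTT batches
    text = torch.randint(0, vocab_size, (32, 10))
    output, state = lstm(embed(text), state)
    # Keep the values but cut the graph, so backprop does not reach earlier batches
    state = tuple(s.detach() for s in state)
```

When the iterator is restarted for a new epoch, or shuffled, the state should be reset to `None`.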


In [14]:
for batch in train_iter:
    print(tensor_to_text(batch.text[:,3], TEXT))
    break

say they also find it <unk> that cbs news is apparently concentrating on mr. hoffman 's problems as a <unk> <eos> this is dangerous and <unk> abbie 's life says ms. <unk>


In [15]:
batch.target

tensor([[ 9972,     5,   202,    39,     9,    99,  1176,   654,   374,     4],
        [ 9973,    28,    77,    60,   630,    90,    20,   271,    39,    49],
        [ 9975,   247,     5,   678,   564,  2255,     7,     9,   276,  1077],
        [ 9976,    61,   183,    15,     9,     2,     0,   501,     0,    13],
        [ 9977,    12,   561,     0,   224,   313,   155,   274,  8018,     4],
        [ 9981,   216,  3837,    11,   185,  1642,    24,  1560,     3,    22],
        [ 9982,     5,    18,  1017,   128,     5,  1891,     3,  2786,    70],
        [ 9983,     0,   975,   310,    19,  1064,    31,    15,  3660,    41],
        [ 9984,  1847,   976,    14,     7,     9,     7,  3020,  4360,     3],
        [ 9985,    10,     7,  1078,   829,   714,     0,     6,    81,  1324],
        [ 9987,     4,   943,  6361,     0,     2,     0,   586,   635,   160],
        [ 9988,    72,     5,    17,     3,   265,     3,    85,    28,    18],
        [ 9989,   547,   157,    24,    

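There are no separate labels: the target is simply the input shifted by one position. `batch.target` above holds this shifted copy, and within a single batch `batch.text[1:]` gives the next word for each position of `batch.text[:-1]`.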
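A tiny illustration of the offset, using a made-up tensor in place of `batch.text` (the ids are consecutive integers so the shift is easy to see):

```python
import torch

# Toy stand-in for batch.text with shape [bptt_len=6, batch_size=2]
text = torch.arange(12).view(2, 6).t()

inputs = text[:-1]   # every position except the last in each column
targets = text[1:]   # the word that follows each input position
```

Here `targets[i, j]` is the word that follows `inputs[i, j]` in column `j`.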

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 3 different torch models that take in batch.text and produce a distribution over the next word. 

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition: https://www.kaggle.com/c/cs287-hw2-s18

For the final Kaggle test, we will have you do a next-word prediction task. We will provide 10-word sentence prefixes, and it is your job to predict 20 candidate next words for each.

In [16]:
!head input.txt

but while the new york stock exchange did n't fall ___
some circuit breakers installed after the october N crash failed ___
the N stock specialist firms on the big board floor ___
big investment banks refused to step up to the plate ___
heavy selling of standard & poor 's 500-stock index futures ___
seven big board stocks ual amr bankamerica walt disney capital ___
once again the specialists were not able to handle the ___
<unk> james <unk> chairman of specialists henderson brothers inc. it ___
when the dollar is in a <unk> even central banks ___
speculators are calling for a degree of liquidity that is ___


As a sample Kaggle submission, let us build a simple unigram model.  

In [17]:
from collections import Counter
count = Counter()
for b in iter(train_iter):
    count.update(b.text.view(-1).data.tolist())
count[TEXT.vocab.stoi["<eos>"]] = 0
predictions = [TEXT.vocab.itos[i] for i, c in count.most_common(20)]
with open("sample.txt", "w") as fout: 
    print("id,word", file=fout)
    for i, l in enumerate(open("input.txt"), 1):
        print("%d,%s"%(i, " ".join(predictions)), file=fout)

In [18]:
!head sample.txt

id,word
1,the <unk> N of to a in and 's that for $ is it said on by at as from
2,the <unk> N of to a in and 's that for $ is it said on by at as from
3,the <unk> N of to a in and 's that for $ is it said on by at as from
4,the <unk> N of to a in and 's that for $ is it said on by at as from
5,the <unk> N of to a in and 's that for $ is it said on by at as from
6,the <unk> N of to a in and 's that for $ is it said on by at as from
7,the <unk> N of to a in and 's that for $ is it said on by at as from
8,the <unk> N of to a in and 's that for $ is it said on by at as from
9,the <unk> N of to a in and 's that for $ is it said on by at as from


The metric we are using is mean average precision of your 20-best list. 

$$MAP@20 = \frac{1}{|D|} \sum_{u=1}^{|D|} \sum_{k=1}^{20} Precision(u, 1:k)$$

Ideally we would use log-likelihood or perplexity (ppl), as discussed in class, but this is the best Kaggle gives us. MAP takes into account both whether you got the right answer and how highly you ranked it.

In particular, we ask that you do not game this metric. Please submit *exactly 20* unique predictions for each example.
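For local validation, a rough sketch of the metric may help. It assumes the common convention that, with a single correct word per prefix, average precision reduces to the reciprocal of the rank at which that word appears (and zero if it is missing). Kaggle's exact scoring may differ in detail, and the function names here are hypothetical.

```python
def average_precision_at_k(predicted, actual, k=20):
    """AP@k with one correct word: 1/rank if it is in the top k, else 0."""
    for rank, word in enumerate(predicted[:k], start=1):
        if word == actual:
            return 1.0 / rank
    return 0.0

def map_at_k(all_predicted, all_actual, k=20):
    """Mean of AP@k over all examples."""
    scores = [average_precision_at_k(p, a, k)
              for p, a in zip(all_predicted, all_actual)]
    return sum(scores) / len(scores)

# Three toy examples: correct at rank 1, correct at rank 2, and a miss
score = map_at_k([["a", "b"], ["c", "d"], ["x", "y"]], ["a", "d", "z"])
```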


As always, you should put up a 5-6 page write-up following the template provided in the repository: https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

In [30]:
# Trigram model
%run models/trigram.py
%run models/utils.py

model_tri = Trigram(TEXT)
model_tri.get_probabilities(train_iter)
out = model_tri.predict(batch.text)  # `batch` is left over from the loop over train_iter above

optimizer = torch.optim.Adam([model_tri.log_weights], lr=0.1)

def cb(**kwargs):
    print(kwargs['epoch'], kwargs['loss'].item(), 
          torch.softmax(model_tri.log_weights, dim=0))

train_model(model_tri, trigram_loss_fn, optimizer, val_iter, val_iter=None,
            inner_callback=cb)

In [405]:
# class NNLang(nn.Sequential):
#     def __init__(self, embedding_dim, hidden_dims, TEXT):
#         super().__init__()
#         self.embeds = nn.Embedding(len(TEXT.vocab), embedding_dim)
#         self.self.Linear(5,)

1

In [514]:
%run models/neural_net_lang.py

In [517]:
net = get_nn_lang_model(embedding_dim=10,
                        hidden=10,
                        TEXT=TEXT,
                        n_hidden_layers=3)

In [518]:
net

Sequential(
  (0): Pad()
  (1): Embedding(10001, 10)
  (2): Flatten()
  (3): Linear(in_features=320, out_features=10, bias=True)
  (4): Tanh()
  (5): Linear(in_features=10, out_features=10, bias=True)
  (6): Tanh()
  (7): Linear(in_features=10, out_features=10, bias=True)
  (8): Tanh()
  (9): Linear(in_features=10, out_features=10001, bias=True)
  (10): Softmax()
)

In [483]:
embedding_dim = 20
hidden = 100
seqlen = 32
pad_i = TEXT.vocab.stoi['<pad>']

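class Pad(nn.Module):
    """Left-pad a [seq, batch] tensor of token ids up to a fixed length."""
    def __init__(self, seqlen, pad_i):
        super().__init__()
        self.seqlen = seqlen
        self.pad_i = pad_i
    def forward(self, x):
        # Use the value stored on the module rather than the global pad_i
        init = torch.ones(self.seqlen, x.shape[1]) * self.pad_i
        init[-x.shape[0]:, :] = x
        return init.long()

    
class Flatten(nn.Module):
    """Concatenate each sequence's embeddings: [seq, batch, emb] -> [batch, seq * emb]."""
    def __init__(self):
        super().__init__()
    def forward(self, x):
        return x.permute(1,0,2).flatten(start_dim=1,end_dim=2)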

net = nn.Sequential(
    Pad(seqlen, pad_i),
    nn.Embedding(len(TEXT.vocab), embedding_dim),
    Flatten(),
    nn.Linear(seqlen * embedding_dim, hidden),
    nn.Tanh(),
    nn.Linear(hidden, len(TEXT.vocab)),
    nn.Softmax(dim=0)
)

In [484]:
criterion = nn.CrossEntropyLoss()
def loss_fn(model, batch):
    return criterion(model(batch.text), batch.target[-1,:])
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
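One thing worth checking in this setup: `nn.CrossEntropyLoss` applies a log-softmax to its input itself, so it expects raw scores (logits), whereas `net` above ends in `nn.Softmax(dim=0)`, which also normalizes over the batch axis rather than the vocabulary axis. The sketch below uses made-up tensors, not the homework model, to show how passing probabilities instead of logits keeps the loss high even for confident, correct predictions. Whether this explains the training behavior here is an assumption that would need to be verified.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(10, 100) * 10    # toy [batch, vocab] raw scores
target = logits.argmax(dim=1)         # pretend the top-scoring word is always correct

criterion = nn.CrossEntropyLoss()
loss_from_logits = criterion(logits, target)
# Softmax outputs lie in [0, 1], so a second softmax inside the loss is nearly uniform
loss_from_probs = criterion(torch.softmax(logits, dim=1), target)
```

A common alternative is to drop the final softmax during training and apply `torch.softmax(..., dim=1)` only when probabilities are needed for prediction.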

In [486]:
def cb(**kwargs):
    print(kwargs['epoch'], kwargs['train_loss'])
train_model(net, loss_fn=loss_fn, optimizer=optimizer, train_iter=train_iter, 
            callback=cb, progress_bar=True)

0 26634.14468574524
1 26134.826528549194
2 25545.120619773865
3 25191.55255126953
4 25005.816497802734


In [492]:
val_batch = next(iter(val_iter))

In [511]:
_, arg = torch.topk( net(val_batch.text), 20)
answers = [tensor_to_text(argmax, TEXT).split(' ') for argmax in arg]
for i, ans in enumerate(answers):
    print(f"{tensor_to_text(val_batch.text[:, i], TEXT)}\n{'/'.join(ans)}\n{tensor_to_text(val_batch.target[-1, i].unsqueeze(0), TEXT)}\n\n\n")

consumers may want to move their telephones a little closer to the tv set <eos> <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play
sales/problems/comes/lose/future/mixte/focus/bills/officer/deposit/before/schemes/open/generally/again/sam/methods/contractor/consequences/denver
in



said <eos> he said he would n't comment on the cftc plan until the exchange has seen the full proposal <eos> but at a meeting last week tom <unk> the board of
china/america/both/only/immediately/the/an/pacific/make/hbo/not/ge/acquisitions/rules/everyone/ad/lockheed/decision/expect/rose
trade



analysis <eos> he found N still <unk> and N fairly valued <eos> nicholas parks a new york money manager expects the market to decline about N N <eos> i 've been two-thirds
know/include/groups/ministry/kageyama/half/galileo/profitable/made/offer/cities/fournier/sit/dead/division/compared/means/companies/decline/individual
in



novel of <unk> like <unk> herself still ruled

In [417]:
embeds = nn.Embedding(len(TEXT.vocab), embedding_dim)

In [430]:
embeds(batch.text).permute(1,0,2).flatten(start_dim=1,end_dim=2).shape

torch.Size([10, 640])

In [463]:
net.train?

Signature: net.train(mode=True)
Docstring:
Sets the module in training mode.

This has any effect only on certain modules. See documentations of
particular modules for details of their behaviors in training/evaluation
mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
etc.

Returns:
    Module: self
File:      /anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py
Type:      method
