# HW 3: Neural Machine Translation

In this homework you will build a full neural machine translation system using an attention-based encoder-decoder network to translate from German to English. The encoder-decoder network with attention forms the backbone of many current text generation systems. See [Neural Machine Translation and Sequence-to-sequence Models: A Tutorial](https://arxiv.org/pdf/1703.01619.pdf) for an excellent tutorial that also contains many modern advances.

## Goals


1. Build a non-attentional baseline model (pure seq2seq as in [ref](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)). 
2. Incorporate attention into the baseline model ([ref](https://arxiv.org/abs/1409.0473) but with dot-product attention as in class notes).
3. Implement beam search ([review/tutorial](http://www.phontron.com/slides/nlp-programming-en-13-search.pdf)).
4. Visualize the attention distribution for a few examples. 

Consult the papers provided for hyperparameters, and the course notes for formal definitions.

This will be the most demanding assignment in terms of both difficulty and training time, so we recommend that you get started early!
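
As a rough illustration of goal 3, here is a minimal beam search sketch in plain Python. It is not tied to any particular model: `step_fn`, the toy lookup table `TOY`, and all other names are hypothetical stand-ins for one step of a real decoder, which would return log-probabilities over the full vocabulary.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """Return up to `beam_size` hypotheses as (tokens, log_prob), best first.

    step_fn(prefix) is a hypothetical stand-in for one decoder step:
    it maps a token prefix to a dict {next_token: log_prob}.
    """
    beams = [([bos], 0.0)]  # live hypotheses
    finished = []           # hypotheses that already produced eos
    for _ in range(max_len):
        # expand every live hypothesis by every possible next token
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # keep only the `beam_size` best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


# Toy "model": the next-token distribution depends only on the last token.
TOY = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.55), "y": math.log(0.45)},
    "b": {"z": math.log(1.0)},
    "x": {"</s>": math.log(1.0)},
    "y": {"</s>": math.log(1.0)},
    "z": {"</s>": math.log(1.0)},
}

def toy_step(prefix):
    return TOY[prefix[-1]]

best = beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5)
for tokens, score in best:
    print(" ".join(tokens), round(math.exp(score), 3))
```

In this toy example greedy decoding would commit to `a` first and end with probability 0.33, while a beam of size 2 keeps `b` alive and finds the hypothesis with probability 0.4. With a real decoder you would also carry each hypothesis's hidden state along with its tokens.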

## Setup

This notebook provides a working definition of the setup of the problem itself. Feel free to construct your models inline, or use an external setup (preferred) to build your system.

In [None]:
# Text processing library and methods for pretrained word embeddings
from torchtext import data
from torchtext import datasets

We first need to process the raw data using a tokenizer. We are going to be using spacy, which can be installed via:  
  `[sudo] pip install spacy`  
  
Tokenizers for English/German can be installed via:  
  `[sudo] python -m spacy download en`  
  `[sudo] python -m spacy download de`
  
This isn't *strictly* necessary, and you can use your own tokenization rules if you prefer (e.g. a simple `split()` plus some rules to account for punctuation), but we recommend sticking to the above.

In [None]:
import spacy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]


Note that we need to add the beginning-of-sentence token `<s>` and the end-of-sentence token `</s>` to the target so we know when to begin/end translating. We do not need to do this on the source side.

In [None]:
BOS_WORD = '<s>'
EOS_WORD = '</s>'
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, eos_token = EOS_WORD) # only target needs BOS/EOS

Let's download the data. This may take a few minutes.

**While this dataset of 200K sentence pairs is relatively small compared to others, it will still take some time to train, so for this homework we will only work with sentences of at most 20 tokens. Please train only on this reduced dataset.**

In [None]:
MAX_LEN = 20
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN), 
                                         filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
                                         len(vars(x)['trg']) <= MAX_LEN)
print(train.fields)
print(len(train))
print(vars(train[0]))

Now we build the vocabulary and convert the text corpus into indices. We replace tokens that occur fewer than 5 times with `<unk>` and take the rest as our vocab.

In [None]:
MIN_FREQ = 5
DE.build_vocab(train.src, min_freq=MIN_FREQ)
EN.build_vocab(train.trg, min_freq=MIN_FREQ)
print(DE.vocab.freqs.most_common(10))
print("Size of German vocab", len(DE.vocab))
print(EN.vocab.freqs.most_common(10))
print("Size of English vocab", len(EN.vocab))
print(EN.vocab.stoi["<s>"], EN.vocab.stoi["</s>"]) #vocab index for <s>, </s>
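
To make the effect of the frequency cutoff concrete, here is a plain-Python sketch of the idea. It is not torchtext's implementation: `build_vocab`, `numericalize`, and the three-sentence corpus are hypothetical, and the tie-breaking order is an arbitrary choice made only so the output is deterministic.

```python
from collections import Counter

def build_vocab(sentences, min_freq, specials=("<unk>", "<pad>")):
    """Map tokens seen at least `min_freq` times to indices; specials come first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(specials)
    # most frequent first, ties broken alphabetically for determinism
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(sentence, stoi):
    """Convert tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in sentence]

corpus = [["das", "ist", "gut"], ["das", "ist", "neu"], ["das", "war", "gut"]]
stoi, itos = build_vocab(corpus, min_freq=2)
print(itos)
print(numericalize(["das", "war", "neu"], stoi))
```

Here `war` and `neu` each occur only once, so both fall below the cutoff and map to the `<unk>` index.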

Now we split our data into batches as usual. Batching for MT is slightly tricky because source/target will be of different lengths. Fortunately, `torchtext` lets you handle this by passing in a `sort_key` function. This minimizes the amount of padding on the source side, but since some padding remains, your model will inadvertently "attend" to these padding tokens.

One way to handle padding is to pass a binary `mask` vector to your attention module so that the attention score (before the softmax) is minus infinity for each padding token. Another way (which is how we do it for our projects, e.g. OpenNMT) is to manually sort the data into batches so that every sentence in a batch has exactly the same source length (this means that some batches will be smaller than the desired batch size, though).

However, for this homework padding won't matter too much, so it's fine to ignore it.
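
If you do want to mask padding, the first option can be sketched as follows. This is only an illustration under several assumptions: it uses batch-first tensors (the notebook's tensors are time-first), it assumes a recent PyTorch rather than the older `Variable` API used elsewhere in this notebook, and `masked_dot_attention` and `PAD_IDX` are hypothetical names (the padding index 1 matches what the notebook prints for `<pad>`).

```python
import torch
import torch.nn.functional as F

PAD_IDX = 1  # assumed padding index

def masked_dot_attention(dec_hidden, enc_outputs, src_tokens, pad_idx=PAD_IDX):
    """
    dec_hidden:  batch x tgt_len x hidden
    enc_outputs: batch x src_len x hidden
    src_tokens:  batch x src_len (token indices, used only to locate padding)
    Returns context (batch x tgt_len x hidden) and weights (batch x tgt_len x src_len).
    """
    # dot-product score between every decoder state and every encoder state
    scores = torch.bmm(dec_hidden, enc_outputs.transpose(1, 2))
    # True wherever the source token is padding; broadcast over target positions
    pad_mask = (src_tokens == pad_idx).unsqueeze(1)
    scores = scores.masked_fill(pad_mask, float("-inf"))
    # softmax over source positions; masked positions get weight exactly 0
    weights = F.softmax(scores, dim=2)
    context = torch.bmm(weights, enc_outputs)
    return context, weights

torch.manual_seed(0)
src = torch.tensor([[5, 6, 7, 8], [9, 4, PAD_IDX, PAD_IDX]])
enc = torch.randn(2, 4, 8)
dec = torch.randn(2, 3, 8)
context, weights = masked_dot_attention(dec, enc, src)
print(weights[1])  # the last two columns are padding positions
```

Note that a source row consisting entirely of padding would produce NaN weights, since every score in that row would be minus infinity; this cannot happen as long as every source sentence has at least one real token.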

In [None]:
BATCH_SIZE = 32
train_iter, val_iter = data.BucketIterator.splits((train, val), batch_size=BATCH_SIZE, device=0,
                                                  repeat=False, sort_key=lambda x: len(x.src))

Let's check that the BOS/EOS tokens are indeed added to the target (English) sentence.

In [None]:
batch = next(iter(val_iter))
print("Source")
print(batch.src)
print("Target")
print(batch.trg)
print(batch.src.volatile)

Success! Now that we've processed the data, we are ready to begin modeling.

# Baseline Model

In [1]:
import torch
import random
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torchtext import data
from torchtext import datasets
import spacy
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

In [2]:
use_cuda = torch.cuda.is_available()
print(use_cuda)

True


In [3]:
# Set up 
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, eos_token = EOS_WORD) # only target needs BOS/EOS

MAX_LEN = 20
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN), 
                                         filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
                                         len(vars(x)['trg']) <= MAX_LEN)
print(train.fields)
print(len(train))
print(vars(train[0]))

MIN_FREQ = 5
DE.build_vocab(train.src, min_freq=MIN_FREQ)
EN.build_vocab(train.trg, min_freq=MIN_FREQ)
print(DE.vocab.freqs.most_common(10))
print("Size of German vocab", len(DE.vocab))
print(EN.vocab.freqs.most_common(10))
print("Size of English vocab", len(EN.vocab))
# print(DE.vocab.stoi["<s>"], DE.vocab.stoi["</s>"]) #vocab index for <s>, </s>
print(EN.vocab.stoi["<s>"], EN.vocab.stoi["</s>"]) #vocab index for <s>, </s>
print(EN.vocab.stoi["<pad>"])

{'src': <torchtext.data.field.Field object at 0x7f6f9c785550>, 'trg': <torchtext.data.field.Field object at 0x7f6f9c785780>}
119076
{'src': ['David', 'Gallo', ':', 'Das', 'ist', 'Bill', 'Lange', '.', 'Ich', 'bin', 'Dave', 'Gallo', '.'], 'trg': ['David', 'Gallo', ':', 'This', 'is', 'Bill', 'Lange', '.', 'I', "'m", 'Dave', 'Gallo', '.']}
[('.', 113253), (',', 67237), ('ist', 24189), ('die', 23778), ('das', 17102), ('der', 15727), ('und', 15622), ('Sie', 15085), ('es', 13197), ('ich', 12946)]
Size of German vocab 13353
[('.', 113433), (',', 59512), ('the', 46029), ('to', 29177), ('a', 27548), ('of', 26794), ('I', 24887), ('is', 21775), ("'s", 20630), ('that', 19814)]
Size of English vocab 11560
2 3
1


In [4]:
print(DE.vocab.stoi["<pad>"])
BATCH_SIZE = 64
if use_cuda: 
    train_iter, val_iter = data.BucketIterator.splits((train, val), batch_size=BATCH_SIZE, device=0,
                                                  repeat=False, sort_key=lambda x: len(x.src))
else: 
    train_iter, val_iter = data.BucketIterator.splits((train, val), batch_size=BATCH_SIZE, device=-1,
                                                  repeat=False, sort_key=lambda x: len(x.src)) 

1


In [5]:
print(torch.cuda.device_count())
batch = next(iter(train_iter))
print("Source")
print(batch.src)
#     print("Target")
#print(batch.trg)
print(len(list(train_iter)))

1
Source
Variable containing:
    23     26    589  ...     481    218     12
     4      4    320  ...     282     59    503
   130     19     13  ...    1371     11      9
        ...            ⋱           ...         
    69      0      3  ...    1062      0      5
   154      0   1933  ...       4   1167   1898
     2      2     16  ...       2      2      2
[torch.cuda.LongTensor of size 16x64 (GPU 0)]

1861


In [6]:
# bidir models
SOS_token = 2
EOS_token = 3
PAD_token = 1 

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, h_size, batch_size, n_layers=1, dropout=0, bidir=False):
        super(EncoderLSTM, self).__init__()
        self.num_layers = n_layers
        self.hidden_size = h_size
        self.batch_size = batch_size
        self.bidir=bidir
        self.embed = nn.Embedding(input_size, h_size)
        self.lstm = nn.LSTM(h_size, h_size, dropout=dropout, num_layers=n_layers, bidirectional=bidir)

    def forward(self, input_src, hidden):
        embedded = self.embed(input_src)
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        # a bidirectional LSTM keeps one state per direction and layer
        num_directions = 2 if self.bidir else 1
        shape = (self.num_layers * num_directions, self.batch_size, self.hidden_size)
        h0 = Variable(torch.zeros(*shape))
        c0 = Variable(torch.zeros(*shape))
        if use_cuda:
            h0, c0 = h0.cuda(), c0.cuda()
        return (h0, c0)

class Attn(nn.Module):
    def __init__(self, hidden_size):
        super(Attn, self).__init__()
        self.hidden_size = hidden_size

    def forward(self, hidden, encoder_outputs):
        # hidden: target_len x batch_size x hidden_dim
        hidden = hidden.transpose(0, 1) # batch_size x target_len x hidden_dim
        
        # encoder_outputs: max_len x batch_size x hidden_dim
        encoder_outputs = encoder_outputs.permute(1, 2, 0) # batch_size x hidden_dim x max_len
        
        # dot-product scores between every decoder state and every encoder state
        attn_energies = torch.bmm(hidden, encoder_outputs) # batch_size x target_len x max_len

        # normalize over source positions
        return F.softmax(attn_energies, dim=2)
        
    
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, batch_size, dropout=0.1, n_layers=1, max_length=MAX_LEN):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size*2
        self.output_size = output_size
        self.max_length = max_length
        self.num_layers = n_layers
        self.batch_size = batch_size

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = Attn(hidden_size)
        
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, num_layers=n_layers, dropout=dropout)
        self.out = nn.Linear(self.hidden_size*2, self.output_size)

    def forward(self, input_data, hidden, encoder_outputs):
        # input_data: target_len x batch_size

        embedded = self.embedding(input_data) # target_len x batch_size x hidden_dim

        # lstm_output: target_len x batch_size x hidden_dim
        lstm_output, lstm_hidden = self.lstm(embedded, hidden)

        #attn input 0 to T-1 
        if hidden[0].size()[0] != 1: 
            attn_hidden = hidden[0][-1].unsqueeze(0)
        else: 
            attn_hidden = hidden[0]
        
        if(lstm_output.size()[0] > 1):  
            attn_input = torch.cat((attn_hidden, lstm_output[:-1]))
        else: 
            attn_input = attn_hidden
        # encoder_outputs -> max_len x batch_size x hidden_dim
        attn_weights = self.attn(attn_input, encoder_outputs)
        
        # context = batch_size x target_length x hidden_dim 
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1)) 
        
        context = context.transpose(1, 0) #target_length x batch_size x hidden_dim
        
        output = torch.cat((lstm_output, context), 2)

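        # Final output layer. Dropout is applied to the features before the projection:
        # applying it to the log-probabilities would zero some of them, i.e. turn them into probability 1.
        final_output = F.log_softmax(self.out(self.dropout(output)), dim=2)
        return final_output, lstm_hidden, attn_weights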



    def init_hidden(self):
        # zero-initialized (h_0, c_0) pair for the decoder LSTM
        shape = (self.num_layers, self.batch_size, self.hidden_size)
        h0 = Variable(torch.zeros(*shape))
        c0 = Variable(torch.zeros(*shape))
        if use_cuda:
            h0, c0 = h0.cuda(), c0.cuda()
        return (h0, c0)

In [8]:
def prune(beams, k):
    """
    Prunes all but the top k beams, by summative score
    """
    beams.sort(key=lambda x: x[1], reverse=True) #sort beams by second element (score)
    return beams[:k] #return top k

def evaluate_kaggle(encoder, decoder, string, k = 3, ngrams = 3, max_length = 20, batch_size=1):
    # Run string through encoder

    encoder_input = string.unsqueeze(1).expand(-1, batch_size)

    layers = encoder.num_layers
    encoder_hidden = encoder.init_hidden()
    encoder_output_short, encoder_hidden = encoder(encoder_input, encoder_hidden)
    
    #expand encoder outputs
    input_length = string.size()[0]
    encoder_outputs = Variable(torch.zeros(max_length, batch_size, encoder.hidden_size*2))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
 
    print(encoder_output_short.shape)
    print(encoder_outputs.shape)
    encoder_outputs[:input_length, :, :] = encoder_output_short
    
    #decoder_input = Variable(torch.ones(1, batch_length).long()*SOS_token)

    decoder_input = Variable(torch.ones(1, batch_size).long()*SOS_token) #1 x batch_length
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    
    if layers != 1: 
        decoder_hidden = (torch.cat((encoder_hidden[0][-layers:], encoder_hidden[0][-layers:]), dim=2) , 
                      torch.cat((encoder_hidden[1][-layers:], encoder_hidden[1][-layers:]), dim=2)) 
    else: 
         decoder_hidden = (torch.cat((encoder_hidden[0][0].unsqueeze(0), encoder_hidden[0][1].unsqueeze(0)), dim=2) , 
                      torch.cat((encoder_hidden[1][0].unsqueeze(0), encoder_hidden[1][1].unsqueeze(0)), dim=2)) 
      

    # base case - get top k predictions from SOS_token
    decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)

    # Get most likely word index from output
    
    topk_probs, topk_word_idx = decoder_output.data.topk(k, dim = 2)
    print(topk_word_idx[:, 0].shape)
    print(' '.join([EN.vocab.itos[id] for id in topk_word_idx[:, 0][0]]))
    decoder_input = Variable(topk_word_idx[:, 0]) # Chosen word is next input
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    
    beam_outputs = [([topk_word_idx.view(-1)[i]], topk_probs.view(-1)[i]) for i in range(k)]
    

    # non base case
    for trg_word_idx in range(0, ngrams - 1): # <s> shouldn't count
        decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
        

        # Get most likely word index from output
        topk_probs, topk_word_idx = decoder_output.data.topk(k, dim = 2)
        
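        # expand every beam with its k best continuations (k*k candidates)
        temp_beam = []
        for i in range(k):
            beam_words = beam_outputs[i][0]
            beam_score = beam_outputs[i][1]
            for j in range(k):
                index = i * k + j
                curr_word_index = topk_word_idx.view(-1)[index]
                curr_score = topk_probs.view(-1)[index]
                # remember the parent beam i so its decoder state can be recovered after pruning
                temp_beam.append((beam_words + [curr_word_index], beam_score + curr_score, i))

        # prune the k*k candidates back down to the k best
        beam_outputs = prune(temp_beam, k)
        new_beams = [beam[0] for beam in beam_outputs]

        # reorder the decoder state so that each surviving beam continues from its parent's state
        parent_idx = Variable(torch.LongTensor([beam[2] for beam in beam_outputs]))
        parent_idx = parent_idx.cuda() if use_cuda else parent_idx
        decoder_hidden = tuple(h.index_select(1, parent_idx) for h in decoder_hidden)

        # the last word of each surviving beam is the next decoder input
        new_beam_input = [[beam[0][-1]] for beam in beam_outputs]
        decoder_input = Variable(torch.LongTensor(new_beam_input)).transpose(0,1) # 1 x k
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input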
        

    #out_idx = zip(*beam_outputs)
    kaggle_outputs = ['|'.join([EN.vocab.itos[id] for id in beam]) for beam in new_beams]
    return ' '.join(kaggle_outputs)

In [9]:
hidden_size = 512
encoder2 = EncoderLSTM(len(DE.vocab), hidden_size, batch_size=100, dropout=0.3, n_layers=2, bidir=True)
decoder2 = AttnDecoderRNN(hidden_size, len(EN.vocab), batch_size=100, dropout=0.3, n_layers=2)

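encoder2.load_state_dict(torch.load('10_attn_encoder_model.pt'))
decoder2.load_state_dict(torch.load('10_attn_decoder_model.pt'))

# switch to inference mode so that dropout is disabled while decoding
encoder2.eval()
decoder2.eval()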

In [10]:
match = 0 
total = 0 
top_match = 0
for batch in iter(val_iter): 
    for t in range(batch.src.size()[1]): 
        string = batch.src[:,t]
        decode_str = batch.trg[:, t]
        #print (' '.join([DE.vocab.itos[id.data[0]] for id in string]))
        #print (' '.join([EN.vocab.itos[id.data[0]] for id in decode_str[1:]]))
        answer_token ='|'.join([EN.vocab.itos[id.data[0]] for id in decode_str[1:4]])
        output_tokens = evaluate_kaggle(encoder2.cuda(), decoder2.cuda(), string, k = 100, ngrams = 3, batch_size=100).split(" ")
        print(answer_token)
        print(output_tokens)
        if answer_token in output_tokens: 
            match += 1 
            if answer_token in output_tokens[:3]: 
                top_match += 1 
        total += 1 
    print(top_match/total)
    print(match/total)
print("accuracy: ", match/total)
#64: 0.46875, 0.65625 #128 0.4609 0.633

torch.Size([6, 100, 1024])
torch.Size([20, 100, 1024])
torch.Size([1, 100])
stuff OK 10 For water next after done him human thought Why My bit find better end four each whole technology long wanted looking away second Okay women working five try between Do place system few called his most How 'll back thing In where work lot she through problem day tell If come course year today ! being started put ca These show other two Thank when way their world will think these You as want more time things have do " was of . in and 're But We about - there my like new down even take
I|want|to
['stuff|end|able', 'stuff|end|number', 'stuff|end|country', 'stuff|end|future', 'stuff|end|mean', 'stuff|end|Let', 'stuff|end|Because', 'stuff|end|change', 'stuff|end|A', 'stuff|end|percent', 'stuff|end|;', 'stuff|end|course', 'stuff|end|story', 'stuff|end|year', 'stuff|end|tell', 'stuff|end|problem', 'stuff|end|point', 'stuff|end|before', 'stuff|end|brain', 'stuff|end|old', 'stuff|end|part', 'stuff|end|find',

KeyboardInterrupt: 

In [None]:
with open('source_test.txt', 'r') as fp: 
    lines = fp.readlines()
    
print(len(lines))
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")

with open('sample1.txt', 'w') as fp: 
    fp.write('id,word\n')
    for i in range(len(lines)): 
        if (i%100 == 0): 
            print(i)
        line = lines[i]
        tokens = line.strip("\n").split(" ")
        input_index = [DE.vocab.stoi[t] for t in tokens]
        input_index = Variable(torch.Tensor((input_index)).long().cuda())
        output_str = evaluate_kaggle(encoder2.cuda(), decoder2.cuda(), input_index, k = 100, ngrams = 3, batch_size=100)
        output_str = escape(output_str)
        fp.write(str(i+1) + ',' + output_str + '\n')

In [None]:
with open('sample1.txt', 'r') as fp: 
    lines = fp.readlines()
    with open('sample2.txt', 'w') as wp:
        wp.write('id,word\n')
        for i in range(1, len(lines)): 
            line=lines[i]
            tokens = line.split(",")
            print(tokens)
            wp.write(str(i) + ',' + tokens[1])

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Once a model is trained, use the following test function to produce predictions, then upload them to the Kaggle competition: https://www.kaggle.com/c/cs287-hw3-s18/

For the final Kaggle test, we will provide the source sentences, and you are to predict the **first three words of the target sentence**. The source sentences can be found in `source_test.txt`.

In [None]:
!head source_test.txt

Similar to HW1, you are to predict the 100 most probable 3-grams that will begin the target sentence. In the submission format, the words within a 3-gram are separated by "|", and the 3-grams are separated by spaces. For example, here is what a submission might look like with the 5 most likely 3-grams (instead of 100).

```
id,word
1,Newspapers|talk|about When|I|was Researchers|call|the Twentysomethings|like|Alex But|before|long
2,That|'s|what Newspapers|talk|about You|have|robbed It|'s|realizing My|parents|wanted
3,We|forget|how We|think|about Proust|actually|links Does|any|other This|is|something
4,But|what|do And|it|'s They|'re|on My|name|is It|only|happens
```

When you write out your predictions, you will need to escape quotes and commas with the following function so that Kaggle does not complain.

In [None]:
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")
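
As a small sketch of how the pieces fit together, the snippet below turns ranked 3-grams into submission rows. `format_row` and the sample `rows` are hypothetical; `escape` is repeated from the cell above only so that the snippet runs on its own.

```python
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")

def format_row(row_id, trigrams):
    """trigrams: list of 3-token lists, most probable first."""
    # words within a 3-gram are joined by "|", 3-grams are joined by spaces
    joined = " ".join("|".join(tokens) for tokens in trigrams)
    # escape after joining so that the separating comma after the id is kept
    return "{},{}".format(row_id, escape(joined))

rows = [
    [["That", "'s", "what"], ["Well", ",", "I"]],
    [["He", "said", '"'], ["It", "'s", "realizing"]],
]
lines = ["id,word"] + [format_row(i + 1, r) for i, r in enumerate(rows)]
print("\n".join(lines))
```

Escaping only the prediction string, and not the whole row, keeps the single comma that separates the `id` column from the `word` column.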

You should perform your hyperparameter search/early stopping/write-up based on perplexity, not the above metric. (In practice, people use a metric called [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf), which is roughly a geometric average of 1-gram, 2-gram, 3-gram, 4-gram precision, with a brevity penalty for producing translations that are too short.)
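
For reference, both quantities can be sketched in a few lines of plain Python. This is a simplified illustration: `perplexity` assumes you have already collected the natural-log probability your model assigned to each target token, and `sentence_bleu` is an unsmoothed, single-reference, sentence-level variant (BLEU is normally computed over a whole corpus). All function names are hypothetical.

```python
import math
from collections import Counter

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU; returns 0.0 if any n-gram precision is zero."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # clipped counts: an n-gram is credited at most as often as it occurs in the reference
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # brevity penalty: only candidates shorter than the reference are penalized
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    # geometric mean of the n-gram precisions, scaled by the brevity penalty
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
print(perplexity([math.log(0.25)] * 4))
print(sentence_bleu(reference, reference))
print(sentence_bleu("the cat sat on the".split(), reference))
```

The last call shows the brevity penalty at work: every n-gram of the shortened candidate appears in the reference, so the score is reduced only because the candidate has 5 tokens against the reference's 6.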

Finally, as always, please put up a (short) write-up following the template provided in the repository: https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/
