<a href="https://colab.research.google.com/github/marissa-graham/deep_learning/blob/master/Lab_7_(Transformer_Original).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installs and imports

In [0]:
!pip3 install torch spacy torchtext==0.2.3

!python -m spacy download en
!python -m spacy download es
!python -m spacy download de

Collecting torch
[?25l  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
[K    100% |████████████████████████████████| 519.5MB 30kB/s 
tcmalloc: large alloc 1073750016 bytes == 0x58fb8000 @  0x7f20ad2e62a4 0x594e17 0x626104 0x51190a 0x4f5277 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x4f3338 0x510fb0 0x5119bd 0x4f6070
Collecting torchtext==0.2.3
[?25l  Downloading https://files.pythonhosted.org/packages/78/90/474d5944d43001a6e72b9aaed5c3e4f77516fbef2317002da2096fd8b5ea/torchtext-0.2.3.tar.gz (42kB)
[K    100% |████████████████████████████████| 51kB 19.3MB/s 
Collecting numpy>=1.15.0 (from spacy)
[?25l  Downloading https://files.pythonhosted.org/packages/22/02/bae88c4aaea4256d890adbf3f7cf33e59a443f9985cf91

In [0]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable

from torchtext import data, datasets
import spacy

import numpy as np
import math
import copy
import time

# Model Architecture

## Model-building helper functions

Effectively identical in jupyter and colab

In [0]:
# Called in Encoder, Decoder, EncoderLayer, DecoderLayer, MultiHeadedAttention
def clones(module, N):
    """Produce N identical copies of a layer (for multi-headed attention)."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# Called by MultiHeadedAttention
def attention(query, key, value, mask=None, dropout=0.0):
    """Compute scaled dot product attention."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    
    # jupyter: dropout default is None, and this line becomes
    # if dropout is not None: p_attn = dropout(p_attn)
    # Effectively does the same thing
    p_attn = F.dropout(p_attn, p=dropout)
    return torch.matmul(p_attn, value), p_attn

## Non-structural class pieces of the model 

All identical in jupyter and colab

In [0]:
# Used to generate predictions
class Generator(nn.Module):
    """Standard linear + softmax generation step."""
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

# Standard piece of the model; called by make_model
class PositionwiseFeedForward(nn.Module):
    """Two linear transformations with a ReLU in between."""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        # Torch linears have a `b` by default. 
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

# Allow the model to pay attention to relative positions of tokens
class PositionalEncoding(nn.Module):
    """
    Sinusoid-based positional encoding to allow the model to extrapolate to 
    sequence lengths longer than those seen in training.
    
    One sinusoid of a different frequency for each dimension in the embedding.
    """
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Compute the positional encodings once in log space.
        
        pe = torch.zeros(max_len, d_model) # 2d
        #position = torch.arange(0, max_len).unsqueeze(1) 
        #div_term = torch.exp(torch.arange(0, d_model, 2) *
        #                     -(math.log(10000.0) / d_model))
        position = torch.arange(0, max_len).unsqueeze(1).double()
        div_term = torch.exp(torch.arange(0, d_model, 2).double() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        
        """
        pe = np.zeros((max_len, d_model))
        position = np.arange(0, max_len).reshape((max_len, 1))
        div_term = np.exp(np.arange(0, d_model, 2) * -np.log(10000.0/d_model))
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        pe = torch.tensor(pe, dtype=torch.long).unsqueeze(0)
        """
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)#.float()
        return self.dropout(x)

# Learned embedding model for conversion of input/output tokens.
class Embeddings(nn.Module):
    
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # Scale by sqrt(d_model) (to match the scaled dot product?)
        return self.lut(x) * math.sqrt(self.d_model)
    
class MultiHeadedAttention(nn.Module):
    
    def __init__(self, h, d_model, dropout=0.1):
        """Take in model size and number of heads."""
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.p = dropout
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.p)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

## Structural class pieces of the model

All identical in jupyter and colab

In [0]:
class LayerNorm(nn.Module):
    """Basic layer normalization."""
    
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2*(x - mean)/(std + self.eps) + self.b_2
    
class SublayerConnection(nn.Module):
    """A residual (skip) connection followed by a layer norm."""
    
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Apply residual connection to any sublayer with the same size
        return x + self.dropout(sublayer(self.norm(x)))

# This gets fed to an Encoder class
class EncoderLayer(nn.Module):
    """An encoder layer consists of self-attention and feed-forward."""
    
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        """Follow Figure 1 in the paper for connections."""
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
    
class Encoder(nn.Module):
    """The main encoder is a stack of N encoder layers."""
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
    
class DecoderLayer(nn.Module):
    """
    A decoder layer consists of self-attention, source-attention, and 
    feed-forward.
    """
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
 
    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
    
class Decoder(nn.Module):
    """The main decoder is a stack of N decoder layers."""
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

## Colab-style EncoderDecoder (RUN ONLY ONE OF THESE TWO)

Has only a "forward" function, which encodes and decodes.

In [0]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Made of the encoder and decoder,
    source and target embeddings, and a generator for making predictions.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        # Take in and process masked source and target sequences.
        memory = self.encoder(self.src_embed(src), src_mask)
        output = self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
        return output

## Jupyter-style EncoderDecoder 

Has callable "encode", "decode", and "foward" functions, and the forward function puts the encode and decode functions together.

In [0]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many 
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask,
                            tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

## Putting it all together

make_model is identical in Jupyter and Colab

In [0]:
def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, 
               dropout=0.1):
    """
    Put all the pieces of the model together, given the source and target 
    vocabularies, number of layers for the encoder and decoder, dimensions for
    the model embedding, dimensions for feed-forward, number of heads to use 
    for multi-headed attention, and dropout parameter.
    """
    
    # Convenience definition to make copies of things
    c = copy.deepcopy
    
    # Attention, feed-forward, & positional encoding to feed Encoder and Decoder
    attn = MultiHeadedAttention(h, d_model, dropout)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))
    
    # Important from their (whose?) code. Initialize params w/ Glorot or fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform(p)
    return model

# Setup of model and associated necessities

## Classes

Everything in here is the same in both the colab and jupyter notebooks.

In [0]:
# Note: This part is incredibly important. 
# Need to train with this setup or the model is very unstable.

# Called by get_std_opt to make model_opt in main
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup**(-1.5)))
    
# Used to make 'criterion' in main
class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None
        
    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        
        # If the mask is not empty
        if mask.nelement() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))

# Make the batches to iterate over for training and validation
class MyIterator(data.Iterator):
    def create_batches(self):
        if self.train:
            def pool(d, random_shuffler):
                for p in data.batch(d, self.batch_size * 100):
                    p_batch = data.batch(
                        sorted(p, key=self.sort_key),
                        self.batch_size, self.batch_size_fn)
                    for b in random_shuffler(list(p_batch)):
                        yield b
            self.batches = pool(self.data(), self.random_shuffler)
            
        else:
            self.batches = []
            for b in data.batch(self.data(), self.batch_size,
                                          self.batch_size_fn):
                self.batches.append(sorted(b, key=self.sort_key))

## Helper functions

Everything in here is identical in jupyter and colab

In [0]:
# Called in main to make 'model_opt'
def get_std_opt(model):
    """ Set up default parameters for the Noam Optimizer. """
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), 
                             eps=1e-9))

# Called by make_std_mask 
def subsequent_mask(size):
    """
    Mask out subsequent positions to ensure predictions for position i can only
    depend on the known outputs at positions less than i.
    """
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0


# Called to set up the MyIterator things
global max_src_in_batch, max_tgt_in_batch
def batch_size_fn(new, count, sofar):
    """
    Ensure that the batch size padded to the maximum batchsize does not
    surpass a threshold.
    """
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch,  len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch,  len(new.trg) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)

## Colab-style Batch class

Makes the source and target masks inside the rebatch function, because they're required arguments of the Batch class.

In [0]:
class Batch:
    def __init__(self, src, trg, src_mask, trg_mask, ntokens):
        self.src = src
        self.trg = trg
        self.src_mask = src_mask
        self.trg_mask = trg_mask
        self.ntokens = ntokens
        
# Called by data_gen and rebatch
def make_std_mask(src, tgt, pad):
    """ Hide future words and batch padding. """
    src_mask = (src != pad).unsqueeze(-2)
    tgt_mask = (tgt != pad).unsqueeze(-2)
    tgt_mask = tgt_mask & Variable(subsequent_mask(tgt.size(-1)).type_as(
        tgt_mask.data))
    return src_mask, tgt_mask

# Called on the stuff in MyIterator at each epoch
def rebatch(pad_idx, batch):
    """
    Ensure that we have very evenly divided batches with minimal padding
    (despite variations in sentence length).
    """
    src, trg = batch.src.transpose(0, 1), batch.trg.transpose(0, 1)
    src_mask, trg_mask = make_std_mask(src, trg, pad_idx)
    return Batch(src, trg, src_mask, trg_mask, (trg[1:] != pad_idx).data.sum())

## Jupyter-style Batch class

Makes the source and target masks inside the Batch class, instead of in the rebatch function. 

Target is an optional keyword argument.

In [0]:
class Batch:
    "Object for holding a batch of data with mask during training."
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = \
                self.make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()
    
    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask
    
def rebatch(pad_idx, batch):
    "Fix order in torchtext to match ours"
    src, trg = batch.src.transpose(0, 1), batch.trg.transpose(0, 1)
    return Batch(src, trg, pad_idx)

## Loss and epoch running (colab style)

In [0]:
# Loss function (compute each timestep separately to optimize memory)
def loss_backprop(generator, criterion, out, targets, normalize):
    
    assert out.size(1) == targets.size(1)
    total = 0.0
    out_grad = []
    for i in range(out.size(1)):
        out_column = Variable(out[:, i].data, requires_grad=True)
        gen = generator(out_column)
        loss = criterion(gen, targets[:, i]) / normalize.float()
        total += loss.data[0]
        loss.backward()
        out_grad.append(out_column.grad.data.clone())
    out_grad = torch.stack(out_grad, dim=1)
    out.backward(gradient=out_grad)
    return total

def train_epoch(train_iter, model, criterion, opt, transpose=False):
    model.train()
    for i, batch in enumerate(train_iter):
        src, trg, src_mask, trg_mask = \
            batch.src, batch.trg, batch.src_mask, batch.trg_mask
        out = model.forward(src, trg[:, :-1], src_mask, trg_mask[:, :-1, :-1])
        loss = loss_backprop(model.generator, criterion, out, trg[:, 1:], 
                             batch.ntokens) 
                        
        model_opt.step()
        model_opt.optimizer.zero_grad()
        if i % 10 == 1:
            print("Batch", i, "Loss", np.round(loss.item(),4), 
                 "Learning Rate", np.round(model_opt._rate,8))
            
def valid_epoch(valid_iter, model, criterion, transpose=False):
    model.eval()
    total = 0
    for i, batch in enumerate(valid_iter):
        src, trg, src_mask, trg_mask = \
            batch.src, batch.trg, batch.src_mask, batch.trg_mask
        out = model.forward(src, trg[:, :-1], src_mask, trg_mask[:, :-1, :-1])
        loss = loss_backprop(model.generator, criterion, out, trg[:, 1:], 
                             batch.ntokens) 
        print("Batch", i, "validation loss:", loss.item())

## Loss and epoch running (jupyter style)

Keeps track of token number manually inside run_epoch, since it's not a property of the Batch class.

In [0]:
class SimpleLossCompute:
    "A simple loss compute and train function."
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt
        
    def __call__(self, out, y, norm):
        
        #gen = self.generator(out)
        #loss = self.criterion(gen.contiguous().view(-1, gen.size(-1)), 
        #                      y.contiguous().view(-1)) / norm
        #loss.backward()
        total = 0.0
        out_grad = []
        
        
        if self.opt is not None:
            # Avoid calling loss.backward() each time during validation without
            # zeroing the gradients
            loss.backward()
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data[0] * norm
    
def run_epoch(data_iter, model, loss_compute):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, 
                            batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                    (i, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens

Both loss/epoch cells can be run with no problem--namespaces do not overlap.

# Actually run the model

## Stock Setup

Identical for both jupyter and colab (up to some parameters, as mentioned in comments)

### Dataset and vocabulary initialization

### DO NOT RUN EVERY TIME, IT TAKES A WHILE

In [0]:
# Set up tokenizing of input--modify slightly for en/es vs. en/de
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Set up a few special tokens
BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"

# data is a torchtext module, remember
SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                 eos_token = EOS_WORD, pad_token=BLANK_WORD)

# READ IN THE DATASET--WILL HAVE TO MODIFY FOR GENERAL CONFERENCE
MAX_LEN = 100
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(SRC, TGT), 
                        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
                                              len(vars(x)['trg']) <= MAX_LEN)
# What does the dataset look like? 
    # Has attributes .src and .trg
    # Can be fed into a torchtext data Iterator

# Set up vocabularies for each language
MIN_FREQ = 2 # how often must a word appear to be counted in the vocabulary
             # Note: it's 1 in the colab notebook, 2 in the jupyter

SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)


downloading de-en.tgz
.data/iwslt/de-en/IWSLT16.TEDX.dev2012.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TEDX.tst2013.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2011.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TEDX.tst2013.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.tst2011.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2010.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TEDX.tst2014.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.tst2014.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.tst2014.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TEDX.dev2012.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TEDX.tst2014.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.dev2010.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.tst2013.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2012.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2010.de-en.de.xml
.data/iwslt/de-en/IWSLT16.TED.dev2010.de-en.en.xml
.data/iwslt/de-en/IWSLT16.TED.tst2012.de-en.de.xml
.data/iwslt/de-en/train.tags.de-en.de
.data/iwslt/de-e

### Initialize actual model stuff

In [0]:
# Set up a few constant parameters
n_src = len(SRC.vocab)
n_tgt = len(TGT.vocab)
pad_idx = TGT.vocab.stoi["<blank>"]

# Batch size is 4096 in the colab notebook and 12000 in the jupyter
BATCH_SIZE = 4096

# Set up the model 
model = make_model(n_src, n_tgt, N=6)
model_opt = get_std_opt(model)
model.cuda()

# Set up the label smoothing to penalize overconfidence
criterion = LabelSmoothing(size=n_tgt, padding_idx=pad_idx, smoothing=0.1)
criterion.cuda()

# Set up training and validation iterators
train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=True)

valid_iter = MyIterator(val, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=False)

num_train_batches = 0
for i, batch in enumerate(train_iter):
    num_train_batches += 1
num_valid_batches = 0
for i, batch in enumerate(valid_iter):
    num_valid_batches += 1
print("\nNumber of training batches:", num_train_batches)
print("Number of validation batches:", num_valid_batches, "\n")




Number of training batches: 1111
Number of validation batches: 9 



  return Variable(arr, volatile=not train)


## Train (colab style) 

Must be paired with colab-style loss/epoch running.

In [0]:
# Train the model
for epoch in range(15):
    print("\n\n%%%%%%%%%% EPOCH " + str(epoch) + " %%%%%%%%%%\n")
    train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, 
                model_opt)
    valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)

It did actually train the whole time, I just don't have the output of that with the losses and stuff anymore after the session quit. The output for the translations from the full training time is below.

## Train (jupyter style)

Must be paired with jupyter-style loss/epoch running.

In [0]:
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 2000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), 
                             eps=1e-9))
train_loss = SimpleLossCompute(model.generator, criterion, opt=model_opt)
validation_loss = SimpleLossCompute(model.generator, criterion, opt=None)
for epoch in range(10):
    
    model.train()
    run_epoch((rebatch(pad_idx, b) for b in train_iter), model, train_loss)
    
    model.eval()
    loss = run_epoch((rebatch(pad_idx, b) for b in valid_iter), model, 
                     validation_loss)
    print(loss)

torch.Size([8, 23, 36320]) torch.Size([8, 23])


RuntimeError: ignored

## Produce translations (code found only in jupyter)

In [0]:
# Predict a translation using greedy decoding for simplicity
# You must use the jupyter EncoderDecoder class to run this!
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len-1):
        out = model.decode(memory, src_mask, 
                           Variable(ys), 
                           Variable(subsequent_mask(ys.size(1))
                                    .type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim = 1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, 
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], 
                       dim=1)
    return ys

# Decode the model to produce translations (first sentence in validation set)
model.eval()
for param in model.parameters():
    param.requires_grad = False

for i, batch in enumerate(valid_iter):
    
    src = batch.src.transpose(0, 1)[:1]
    src_mask = (src != SRC.vocab.stoi["<blank>"]).unsqueeze(-2)
    out = greedy_decode(model, src, src_mask, 
                        max_len=60, start_symbol=TGT.vocab.stoi["<s>"])
    
    #print(out.size())
    #print(out)
    #print(batch.trg.data.size())
    #print(batch.trg.data)
    print("Translation:", end="\t")
    for i in range(1, out.size(1)):
        sym = TGT.vocab.itos[out[0, i]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    
    print("Target:", end="\t")
    for i in range(1, batch.trg.size(0)):
        sym = TGT.vocab.itos[batch.trg.data[i, 0]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    print()

  return Variable(arr, volatile=not train)


Translation:	So I planted a plant in front of my house . 
Target:	So what I did , I planted a food forest in front of my house . 

Translation:	You can see the central figures , like the leaders are the group . 
Target:	You can see the hubs , like who are the leaders in the group . 

Translation:	And I 'll tell you , " I do n't know that 's because responsibility in my responsibility . " 
Target:	And the first answer is , " I do n't know , they do n't put me in charge of that . " 

Translation:	Over four decades , <unk> regimes have broken the regimes , the infrastructure and the moral structure of society , the moral structure of society . 
Target:	For four decades Gaddafi 's tyrannical regime destroyed the infrastructure as well as the culture and the moral fabric of Libyan society . 

Translation:	I was just three years old when my brother was born , and I was so excited that I had a new creature in my life . 
Target:	I was just three years old when my brother came along , and I was

# General Conference Dataset

In [0]:
!wget -O ./text_files.tar.gz 'http://liftothers.org/dokuwiki/lib/exe/fetch.php?media=cs501r_f2018:es-en-general-conference.tar.gz' 
!tar -xzf text_files.tar.gz

import glob
print(glob.glob('./*'))

--2018-10-21 18:28:03--  http://liftothers.org/dokuwiki/lib/exe/fetch.php?media=cs501r_f2018:es-en-general-conference.tar.gz
Resolving liftothers.org (liftothers.org)... 50.62.229.1
Connecting to liftothers.org (liftothers.org)|50.62.229.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18318204 (17M) [application/octet-stream]
Saving to: ‘./text_files.tar.gz’


2018-10-21 18:28:04 (12.5 MB/s) - ‘./text_files.tar.gz’ saved [18318204/18318204]

['./sample_data', './en-es_gc_2010-2017_en.txt', './text_files.tar.gz', './en-es_gc_2010-2017_es.txt', './en-es_conference.csv']


In [0]:
gc_data_csv = './en-es_conference.csv'

spacy_es = spacy.load('es')

def tokenize_es(text):
    return [tok.text for tok in spacy_es.tokenizer(text)]

SRC = data.Field(tokenize=tokenize_es, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                 eos_token = EOS_WORD, pad_token=BLANK_WORD)

gc_data = data.TabularDataset(gc_data_csv, 'csv',
                              fields=[('Index', None),('trg', TGT),('src', SRC)],
                              skip_header=True)

print(gc_data[0])
print(gc_data[0].src, gc_data[0].trg)
train, val = gc_data.split(split_ratio=0.99)
print(len(gc_data))
print(len(train), ":", len(val))

MAX_LEN = 100
MIN_FREQ = 1

SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

print(len(SRC.vocab))
print(len(TGT.vocab))

# Set up a few constant parameters
n_src = len(SRC.vocab)
n_tgt = len(TGT.vocab)
pad_idx = TGT.vocab.stoi["<blank>"]

<torchtext.data.example.Example object at 0x7ffaac7feba8>
['El', 'templo', 'tiene', 'un', 'lugar', 'en', 'el', 'centro', 'mismo', 'de', 'nuestras', 'creencias', 'más', 'sagradas', 'y', 'el', 'Señor', 'nos', 'pide', 'que', 'asistamos', ',', 'meditemos', ',', 'estudiemos', 'y', 'encontremos', 'significado', 'personal', 'y', 'aplicación', 'individual', '.'] ['The', 'temple', 'holds', 'a', 'place', 'at', 'the', 'very', 'center', 'of', 'our', 'most', 'sacred', 'beliefs', ',', 'and', 'the', 'Lord', 'asks', 'that', 'we', 'attend', ',', 'ponder', ',', 'study', ',', 'and', 'find', 'personal', 'meaning', 'and', 'application', 'individually', '.']
105786
104728 : 1058
46981
32034


In [0]:
# Batch size is 4096 in the colab notebook and 12000 in the jupyter
BATCH_SIZE = 4096

# Set up the model 
model = make_model(n_src, n_tgt, N=6)
model_opt = get_std_opt(model)
model.cuda()

# Set up the label smoothing to penalize overconfidence
criterion = LabelSmoothing(size=n_tgt, padding_idx=pad_idx, smoothing=0.1)
criterion.cuda()

# Set up training and validation iterators
train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=True)

valid_iter = MyIterator(val, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=False)

num_train_batches = 0
for i, batch in enumerate(train_iter):
    num_train_batches += 1
num_valid_batches = 0
for i, batch in enumerate(valid_iter):
    num_valid_batches += 1
print("\nNumber of training batches:", num_train_batches)
print("Number of validation batches:", num_valid_batches, "\n")




Number of training batches: 779
Number of validation batches: 12 



  return Variable(arr, volatile=not train)


In [0]:
# Train the model
for epoch in range(15):
    print("\n\n%%%%%%%%%% EPOCH " + str(epoch) + " %%%%%%%%%%\n")
    train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, 
                model_opt)
    valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)



%%%%%%%%%% EPOCH 0 %%%%%%%%%%



  # Remove the CWD from sys.path while we load stuff.


Batch 1 Loss 8.7362 Learning Rate 7e-07
Batch 11 Loss 8.1152 Learning Rate 4.19e-06
Batch 21 Loss 8.2874 Learning Rate 7.69e-06
Batch 31 Loss 9.1505 Learning Rate 1.118e-05
Batch 41 Loss 8.8966 Learning Rate 1.467e-05
Batch 51 Loss 8.2365 Learning Rate 1.817e-05
Batch 61 Loss 8.5972 Learning Rate 2.166e-05
Batch 71 Loss 8.1104 Learning Rate 2.516e-05
Batch 81 Loss 6.8483 Learning Rate 2.865e-05
Batch 91 Loss 8.0527 Learning Rate 3.214e-05
Batch 101 Loss 7.3726 Learning Rate 3.564e-05
Batch 111 Loss 7.3194 Learning Rate 3.913e-05
Batch 121 Loss 7.3304 Learning Rate 4.263e-05
Batch 131 Loss 7.0184 Learning Rate 4.612e-05
Batch 141 Loss 6.6664 Learning Rate 4.961e-05
Batch 151 Loss 6.8967 Learning Rate 5.311e-05
Batch 161 Loss 6.3782 Learning Rate 5.66e-05
Batch 171 Loss 6.1456 Learning Rate 6.009e-05
Batch 181 Loss 6.4922 Learning Rate 6.359e-05
Batch 191 Loss 5.629 Learning Rate 6.708e-05
Batch 201 Loss 6.0124 Learning Rate 7.058e-05
Batch 211 Loss 5.759 Learning Rate 7.407e-05
Batch 22

  return Variable(arr, volatile=not train)


Batch 0 validation loss: 3.7105350494384766
Batch 1 validation loss: 4.268636703491211
Batch 2 validation loss: 5.592027187347412
Batch 3 validation loss: 4.039651393890381
Batch 4 validation loss: 4.270091533660889
Batch 5 validation loss: 4.324169635772705
Batch 6 validation loss: 4.464666843414307
Batch 7 validation loss: 4.521942138671875
Batch 8 validation loss: 4.539643287658691
Batch 9 validation loss: 4.68431282043457
Batch 10 validation loss: 4.729336261749268
Batch 11 validation loss: 5.159358978271484


%%%%%%%%%% EPOCH 1 %%%%%%%%%%

Batch 1 Loss 3.9577 Learning Rate 0.00027287
Batch 11 Loss 4.8538 Learning Rate 0.00027636
Batch 21 Loss 4.1408 Learning Rate 0.00027986
Batch 31 Loss 4.5991 Learning Rate 0.00028335
Batch 41 Loss 4.1712 Learning Rate 0.00028685
Batch 51 Loss 4.4945 Learning Rate 0.00029034
Batch 61 Loss 4.7803 Learning Rate 0.00029383
Batch 71 Loss 4.5315 Learning Rate 0.00029733
Batch 81 Loss 3.5574 Learning Rate 0.00030082
Batch 91 Loss 3.7748 Learning Rate 0

In [0]:
# Predict a translation using greedy decoding for simplicity
# You must use the jupyter EncoderDecoder class to run this!
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len-1):
        out = model.decode(memory, src_mask, 
                           Variable(ys), 
                           Variable(subsequent_mask(ys.size(1))
                                    .type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim = 1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, 
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], 
                       dim=1)
    return ys# Decode the model to produce translations (first sentence in validation set)

model.eval()
for param in model.parameters():
    param.requires_grad = False
    
print(TGT.vocab)

for i, batch in enumerate(valid_iter):
    
    src = batch.src.transpose(0, 1)[:1]
    src_mask = (src != SRC.vocab.stoi["<blank>"]).unsqueeze(-2)
    out = greedy_decode(model, src, src_mask, 
                        max_len=60, start_symbol=TGT.vocab.stoi["<s>"])
    
    print("Translation:", end="\t")
    for i in range(1, out.size(1)):
        sym = TGT.vocab.itos[out[0, i]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    
    print("Target:", end="\t")
    for i in range(1, batch.trg.size(0)):
        sym = TGT.vocab.itos[batch.trg.data[i, 0]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    print()

<torchtext.vocab.Vocab object at 0x7ffac8d86438>


  return Variable(arr, volatile=not train)


Translation:	As daughters of God , you were born to lead . 
Target:	As daughters of God , you were born to lead . 

Translation:	Let not your heart be troubled , neither let it be afraid . ” 
Target:	Let not your heart be troubled , neither let it be afraid . ” 

Translation:	The other counselor was a prominent judge in the city . 
Target:	The other counselor was a prominent judge in the city . 

Translation:	My wife , Harriet was always the best at finding something inspirational , uplifting , or humorous to share . 
Target:	My wife , Harriet , was always the best at finding something inspirational , uplifting , or humorous to share . 

Translation:	“ And Jesus said unto them : Pray on ; nevertheless they did not cease to pray ” ( 3 Nephi 19:26 ) . 
Target:	“ And Jesus said unto them : Pray on ; nevertheless they did not cease to pray ” ( 3 Nephi 19:26 ) . 

Translation:	“ And he that receiveth my Father receiveth my Father ’s kingdom ; therefore all that my Father hath shall be given

# Two-layer training and results

In [0]:
# Set up the model 
model = make_model(n_src, n_tgt, N=2)
model_opt = get_std_opt(model)
model.cuda()

# Set up the label smoothing to penalize overconfidence
criterion = LabelSmoothing(size=n_tgt, padding_idx=pad_idx, smoothing=0.1)
criterion.cuda()

# Set up training and validation iterators
train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=True)

valid_iter = MyIterator(val, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=False)

num_train_batches = 0
for i, batch in enumerate(train_iter):
    num_train_batches += 1
num_valid_batches = 0
for i, batch in enumerate(valid_iter):
    num_valid_batches += 1
print("\nNumber of training batches:", num_train_batches)
print("Number of validation batches:", num_valid_batches, "\n")

# Train the model
for epoch in range(10):
    print("\n\n%%%%%%%%%% EPOCH " + str(epoch) + " %%%%%%%%%%\n")
    train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, 
                model_opt)
    valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)

  return Variable(arr, volatile=not train)



Number of training batches: 779
Number of validation batches: 12 



%%%%%%%%%% EPOCH 0 %%%%%%%%%%



  # Remove the CWD from sys.path while we load stuff.


Batch 1 Loss 8.7888 Learning Rate 7e-07
Batch 11 Loss 8.177 Learning Rate 4.19e-06
Batch 21 Loss 8.3963 Learning Rate 7.69e-06
Batch 31 Loss 9.2697 Learning Rate 1.118e-05
Batch 41 Loss 8.9931 Learning Rate 1.467e-05
Batch 51 Loss 8.3166 Learning Rate 1.817e-05
Batch 61 Loss 8.6715 Learning Rate 2.166e-05
Batch 71 Loss 8.1716 Learning Rate 2.516e-05
Batch 81 Loss 6.8748 Learning Rate 2.865e-05
Batch 91 Loss 8.1221 Learning Rate 3.214e-05
Batch 101 Loss 7.4344 Learning Rate 3.564e-05
Batch 111 Loss 7.3857 Learning Rate 3.913e-05
Batch 121 Loss 7.3995 Learning Rate 4.263e-05
Batch 131 Loss 7.0885 Learning Rate 4.612e-05
Batch 141 Loss 6.7308 Learning Rate 4.961e-05
Batch 151 Loss 6.9675 Learning Rate 5.311e-05
Batch 161 Loss 6.4291 Learning Rate 5.66e-05
Batch 171 Loss 6.2009 Learning Rate 6.009e-05
Batch 181 Loss 6.5411 Learning Rate 6.359e-05
Batch 191 Loss 5.6554 Learning Rate 6.708e-05
Batch 201 Loss 6.0499 Learning Rate 7.058e-05
Batch 211 Loss 5.7795 Learning Rate 7.407e-05
Batch 2

Accidentally had extra print statements in the results printer, have to print them again

In [0]:
# Decode the model to produce translations (first sentence in validation set)
model.eval()
for param in model.parameters():
    param.requires_grad = False
    
for i, batch in enumerate(valid_iter):
    
    src = batch.src.transpose(0, 1)[:1]
    src_mask = (src != SRC.vocab.stoi["<blank>"]).unsqueeze(-2)
    out = greedy_decode(model, src, src_mask, 
                        max_len=60, start_symbol=TGT.vocab.stoi["<s>"])
    
    print("Translation:", end="\t")
    for i in range(1, out.size(1)):
        sym = TGT.vocab.itos[out[0, i]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    
    print("Target:", end="\t")
    for i in range(1, batch.trg.size(0)):
        sym = TGT.vocab.itos[batch.trg.data[i, 0]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    print()

  return Variable(arr, volatile=not train)


Translation:	As daughters of God , you were born to lead . 
Target:	As daughters of God , you were born to lead . 

Translation:	Let not your heart be troubled , neither let it be afraid . ” 
Target:	Let not your heart be troubled , neither let it be afraid . ” 

Translation:	The other counselor was a prominent judge in the city . 
Target:	The other counselor was a prominent judge in the city . 

Translation:	My wife , Harriet , was always the best at finding something inspirational , uplifting , or humorous to share . 
Target:	My wife , Harriet , was always the best at finding something inspirational , uplifting , or humorous to share . 

Translation:	“ And Jesus said unto them : Pray on ; nevertheless they did not cease to pray ” ( 3 Nephi 19:26 ) . 
Target:	“ And Jesus said unto them : Pray on ; nevertheless they did not cease to pray ” ( 3 Nephi 19:26 ) . 

Translation:	“ And he that receiveth my Father ’s kingdom , and therefore all that my Father hath shall be given unto him . ” 

# One-layer training and results (+summary paragraph)

In [0]:
# Set up the model 
model = make_model(n_src, n_tgt, N=1)
model_opt = get_std_opt(model)
model.cuda()

# Set up the label smoothing to penalize overconfidence
criterion = LabelSmoothing(size=n_tgt, padding_idx=pad_idx, smoothing=0.1)
criterion.cuda()

# Set up training and validation iterators
train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=True)

valid_iter = MyIterator(val, batch_size=BATCH_SIZE, device=0,
                        repeat=False, sort_key=lambda x: (len(x.src), 
                                                          len(x.trg)),
                        batch_size_fn=batch_size_fn, train=False)

num_train_batches = 0
for i, batch in enumerate(train_iter):
    num_train_batches += 1
num_valid_batches = 0
for i, batch in enumerate(valid_iter):
    num_valid_batches += 1
print("\nNumber of training batches:", num_train_batches)
print("Number of validation batches:", num_valid_batches, "\n")

# Train the model
for epoch in range(10):
    print("\n\n%%%%%%%%%% EPOCH " + str(epoch) + " %%%%%%%%%%\n")
    train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, 
                model_opt)
    valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)

  return Variable(arr, volatile=not train)



Number of training batches: 779
Number of validation batches: 12 



%%%%%%%%%% EPOCH 0 %%%%%%%%%%



  # Remove the CWD from sys.path while we load stuff.


Batch 1 Loss 8.7637 Learning Rate 7e-07
Batch 11 Loss 8.1585 Learning Rate 4.19e-06
Batch 21 Loss 8.3962 Learning Rate 7.69e-06
Batch 31 Loss 9.3083 Learning Rate 1.118e-05
Batch 41 Loss 9.0539 Learning Rate 1.467e-05
Batch 51 Loss 8.3758 Learning Rate 1.817e-05
Batch 61 Loss 8.7287 Learning Rate 2.166e-05
Batch 71 Loss 8.2275 Learning Rate 2.516e-05
Batch 81 Loss 6.9796 Learning Rate 2.865e-05
Batch 91 Loss 8.167 Learning Rate 3.214e-05
Batch 101 Loss 7.4875 Learning Rate 3.564e-05
Batch 111 Loss 7.4344 Learning Rate 3.913e-05
Batch 121 Loss 7.4456 Learning Rate 4.263e-05
Batch 131 Loss 7.1431 Learning Rate 4.612e-05
Batch 141 Loss 6.7854 Learning Rate 4.961e-05
Batch 151 Loss 7.0102 Learning Rate 5.311e-05
Batch 161 Loss 6.4727 Learning Rate 5.66e-05
Batch 171 Loss 6.2427 Learning Rate 6.009e-05
Batch 181 Loss 6.5768 Learning Rate 6.359e-05
Batch 191 Loss 5.6957 Learning Rate 6.708e-05
Batch 201 Loss 6.0747 Learning Rate 7.058e-05
Batch 211 Loss 5.8043 Learning Rate 7.407e-05
Batch 2

### Print out the 1-layer results again so we can look at the spanish

In [0]:
# Decode the model to produce translations (first sentence in validation set)
model.eval()
for param in model.parameters():
    param.requires_grad = False

for i, batch in enumerate(valid_iter):
    
    src = batch.src.transpose(0, 1)[:1]
    src_mask = (src != SRC.vocab.stoi["<blank>"]).unsqueeze(-2)
    out = greedy_decode(model, src, src_mask, 
                        max_len=60, start_symbol=TGT.vocab.stoi["<s>"])
    
    print("Original:", end="\t")
    for i in range(1, batch.src.size(0)):
        sym = SRC.vocab.itos[batch.src.data[i,0]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    
    print("Translation:", end="\t")
    for i in range(1, out.size(1)):
        sym = TGT.vocab.itos[out[0, i]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    
    print("Target:", end="\t")
    for i in range(1, batch.trg.size(0)):
        sym = TGT.vocab.itos[batch.trg.data[i, 0]]
        if sym == "</s>": break
        print(sym, end =" ")
    print()
    print()

  return Variable(arr, volatile=not train)


Original:	hijas de Dios , nacieron para liderar . 
Translation:	As daughters of God , you were born to lead . 
Target:	As daughters of God , you were born to lead . 

Original:	se turbe vuestro corazón ni tenga miedo ” . 
Translation:	Let not your heart be troubled , neither let it be afraid . ” 
Target:	Let not your heart be troubled , neither let it be afraid . ” 

Original:	otro consejero era un juez importante de la ciudad . 
Translation:	The other counselor was a prominent judge in the city . 
Target:	The other counselor was a prominent judge in the city . 

Original:	esposa Harriet era la mejor para hallar algo inspirador , edificante o cómico para compartir ; 
Translation:	My wife , Harriet , was always the best at finding something inspirational , uplifting , or humorous to share . 
Target:	My wife , Harriet , was always the best at finding something inspirational , uplifting , or humorous to share . 

Original:	Y Jesús les dijo : Seguid orando ; y ellos no cesaban de orar ” ( 

I printed out the original Spanish for each sentence because I suspected that when we use fewer heads of attention, the system is not as able to correctly translate phrases where spanish grammar puts words in a different order than english, so there is not as direct of a one-to-one mapping between words. It can still switch adjective order for adjacent words ('existencia mortal' does become 'mortal existence', as it should, even with one layer), but it has trouble with longer phrases with fewer layers. In some cases, though, the increasing freedom to move away from the original structure goes a bit awry in an overfitting-esque way.

A couple of illustrative examples:

"The … miracle for me occurred in the Family History office of Mel Olsen"
* Spanish: “El … milagro tuvo lugar para mí en la oficina de Mel Olsen, en Historia Familiar"
* 1 layer: "The … miracle for me occurred in the Family History office of felt calm Olsen" (a bit confused)
* 2 layers: "The miracle … occurred to me in the Family History office of Grandmother" (even more confused)
* 6 layers: "The … miracle for me occurred in the Family History office of Mel Olsen" (correct!)

"And he that receiveth my Father receiveth my Father’s kingdom" 
* Spanish: "y el que recibe a mi Padre, recibe el reino de mi Padre" 
* 1 layer: "And he that receiveth my Father, receiveth the kingdom of my Father" (maintaining Spanish word order)
* 2 layers: "And he that receiveth my Father ’s kingdom" (lost part of the sentence, but got the English word order)
* 6 layers: "And he that receiveth my Father receiveth my Father ’s kingdom" (correct!)

"Oh, I’ve got to get my visiting teaching done!"
* Spanish: "¡Ay, tengo que hacer mis visitas!"
* 1 layer: "Oh, I’ve got to make my visiting teaching visits!" (follows Spanish grammar, but makes sense)
* 2 layers: "Oh, I’ve got to get my visiting teaching" (follows English grammar, but loses a word)
* 6 layers: "Oh, I’ve been on my visits" (misinterprets the sentence completely)
