# Introduction
In this notebook I will once again tackle the task of machine translation using an encoder-decoder setup. I will also give attention a more honest try compared to last time.

I have just started trying out PyTorch instead of Keras, and have thus far enjoyed the increased flexibility. It turns out that PyTorch offers a [tutorial for machine translation using an encoder-decoder setup and attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-download-intermediate-seq2seq-translation-tutorial-py), which I will draw a lot of inspiration from.

Some notes on the baseline model I will use:
* Encoder-Decoder Architecture
* Operates on word level
* Trains embeddings for both languages
* Attention used in the decoder

# Data
I will be using data from the same source as Chollet (exactly as in my previous notebooks), http://www.manythings.org/anki/. I'm using the 17303 sentence long swe-eng data set, that contains english sentences and their swedish translations. The french data set used by Chollet is much larger, but he limited his training set to 10 000 sentences and used 20% of it for validation during training.

The data will be implicitly padded in the model so, I don't have to di it explicitly.

In [1]:
data_path = 'data/swe-eng/swe.txt'
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

Read all sentences.

In [2]:
input_sentences, target_sentences = [], []
for line in lines:
    try:
        input_text, target_text, *_ = line.split('\t')
    except ValueError:
        print(line)
        
    input_sentences.append(input_text)
    target_sentences.append(target_text)




## Tokenize and Normalise
I want to word tokenize by using NLTK, and then convert to lower case.

In [3]:
from nltk.tokenize import word_tokenize

In [4]:
input_tokenized = [list(map(str.lower, word_tokenize(sentence))) for sentence in input_sentences]
target_tokenized = [list(map(str.lower, word_tokenize(sentence, language='swedish'))) for sentence in target_sentences]

In [5]:
input_tokenized[0]

['run', '!']

## Investigate sentence lengths.

The attention model proposed in the PyTorch tutorial is limited to fixed length sequences, so I will choose a fitting max length. Probably the same as what is used in the PyTorch tutorial: 10.

In [6]:
import numpy as np

In [7]:
input_seq_lens = np.array([len(sentence) for sentence in input_tokenized])
target_seq_lens = np.array([len(sentence) for sentence in target_tokenized])

In [8]:
max_seq_len = 10

In [9]:
input_idx = np.where(input_seq_lens <= max_seq_len)
target_idx = np.where(target_seq_lens <= max_seq_len)

In [10]:
print("{} input sentences with {} or fewer characters".format(len(input_idx[0]), max_seq_len))
print("{} target sentences with {} or fewer characters".format(len(target_idx[0]), max_seq_len))

16362 input sentences with 10 or fewer characters
16502 target sentences with 10 or fewer characters


In [11]:
keep_idx = np.intersect1d(input_idx, target_idx)

In [12]:
print("{} input sentence pairs with {} or fewer characters in both languages".format(len(keep_idx), max_seq_len))

16186 input sentence pairs with 10 or fewer characters in both languages


In [13]:
input_sentences = np.array(input_tokenized)[keep_idx]
target_sentences = np.array(target_tokenized)[keep_idx]

## Build vocabularies
I'll pretty much copy the approach of the PyTorch tutorial to construct my vocabularies. Just doing some small code tweaks.

In [31]:
class Vocab:
    
    def __init__(self, name):
        self.name = name
        self.word2index = {"<SOS>" : 0, "<EOS>" : 1}
        self.word2count = {}
        # Reverse word2index
        self.index2word = dict([(b, a) for a, b in self.word2index.items()])
        self.n_words = len(self.word2index)
        
    def add_sentence(self, sentence):
        for word in sentence:
            self.add_word(word)
        
    def add_word(self, word):
        # If there any many more tokens than unique tokens, 
        # it is more efficient to assume the token is already in the vocabulary
        try:
            self.word2count[word] += 1
        except KeyError:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1    

In [20]:
input_vocab = Vocab('eng')
target_vocab = Vocab('swe')

In [21]:
%%time
for input_sentence, target_sentence in zip(input_sentences, target_sentences):
    input_vocab.add_sentence(input_sentence)
    target_vocab.add_sentence(target_sentence)

Wall time: 155 ms


In [22]:
for vocab in [input_vocab, target_vocab]:
    print("Vocab %s: %d unique tokens and %d tokens total" % (vocab.name, vocab.n_words, sum(vocab.word2count.values())))

Vocab eng: 4552 unique tokens and 97545 tokens total
Vocab swe: 6511 unique tokens and 94538 tokens total


## Divide data into a training and a validation set
I will use 8 000 sentances as training set and 2000 as validation set.

In [146]:
trainig_size, validation_size = 8000, 2000

In [147]:
shuffle_idx = np.random.permutation(len(input_sentences))

In [148]:
train_idx, val_idx = shuffle_idx[:trainig_size], shuffle_idx[trainig_size:trainig_size+validation_size]

In [162]:
input_train, input_val = input_sentences[train_idx], input_sentences[val_idx]
target_train, target_val = target_sentences[train_idx], target_sentences[val_idx]

## Convert data to tensors
Just like in the PyTorch tutorial I will set up functions to convert the sentences to tensors.

In [33]:
import torch

In [37]:
def sentence2indexes(vocab, sentence):
    return [vocab.word2index[word] for word in sentence]


def sentence2tensor(vocab, sentence):
    indexes = sentence2indexes(vocab, sentence)
    indexes.append(vocab.word2index['<EOS>'])
    return torch.tensor(indexes, dtype=torch.long).view(-1, 1)


def pair2tensor(pair):
    input_tensor = sentence2tensor(input_vocab, pair[0])
    target_tensor = sentence2tensor(target_vocab, pair[1])
    return (input_tensor, target_tensor)

# Model
Finally, let's define our model.

## Encoder

In [39]:
import torch.nn as nn

In [179]:
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        
    def forward(self, input_word, hidden_state):
        embedded = self.embedding(input_word).view(1, 1, -1)
        output, hidden_state = self.gru(embedded, hidden_state)
        return output, hidden_state
    
    # Returns a tensor with only zeroes to be used as initial hidden state
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size)
        

## Decoder

In [181]:
import torch.nn.functional as F

In [182]:
class DecoderRNNAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, dropout_p, max_length):
        super(DecoderRNNAttention, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.dropout_p = dropout_p
        self.max_length = max_length
        
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.dropout = nn.Dropout(dropout_p)
        self.attention = nn.Linear(hidden_size * 2, max_length)
        self.attention_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, input_word, hidden_state, encoder_outputs):
        
        embedded = self.embedding(input_word).view(1, 1, -1)
        embedded = self.dropout(embedded)
        
        attention_weights = F.softmax(
            self.attention(torch.cat((embedded[0], hidden_state[0]), dim = 1)),
            dim = 1
        )
        
        attention_applied = torch.bmm(attention_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        
        output = torch.cat((embedded[0], attention_applied[0]), dim=1)
        output = self.attention_combine(output).unsqueeze(0)
        
        output = F.relu(output)
        output, hidden_state = self.gru(output, hidden_state)
        
        output = F.log_softmax(self.output(output[0]), dim=1)
        return output, hidden_state, attention_weights
    
    # Returns a tensor with only zeroes to be used as initial hidden state
    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size) 
        
        

# Training
The way I set up the model it can only train on one sentance at a time. This will probably add significant overhead, but tackeling that is a task for another day.

As in all my previous notebooks I will use teacher forcing to train. However, this time I am able to introduce a new variant! As suggested in the PyTorch tutorial it's possible to alternate between teacher forcing and feeding the previous output as input.

In [183]:
import random

In [194]:
def train_model(encoder, decoder, input_tensor, target_tensor, encoder_optimizer, 
                decoder_optimizer, criterion, max_length, teacher_forcing_ratio = 1.0):
    
    encoder_hidden_state = encoder.init_hidden()
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    input_length = input_tensor.shape[0]
    target_length = target_tensor.shape[0]
    
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size)
    
    loss = 0
    
    # Encode input tensor
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden_state)
        encoder_outputs[ei] = encoder_output
        
    # Decode input tensor, seeding the decoding with the <BOS> token
    decoder_input = torch.tensor([[target_vocab.word2index['<SOS>']]])
    decoder_hidden_state = encoder_hidden_state
    
    # Use teacher-forcing according to dice roll
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    
    if use_teacher_forcing:
        for di in range(target_length):
            # Forward step and loss
            decoder_output, decoer_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden_state, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            
            # Update input for next step (teacher-forcing)
            decoder_input = target_tensor[di]
    else:      
        for di in range(target_length):
            # Forward step and loss
            decoder_output, decoer_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden_state, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            
            # Update input for next step (using output at this step)
            log_likelihood, top_word = decoder_output.topk(1) 
            decoder_input = top_word.squeeze().detach()
            
            if decoder_input.item() == target_vocab.word2index['<EOS>']:
                break
    
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item() / target_length
    
    

In [208]:
def tensor2sentence(tensor, vocab):
    sent = []
    for i in tensor:
        sent.append(vocab.index2word[i.item()])
    return sent

In [218]:
def evaluate_model(encoder, decoder, input_tensor, target_tensor, max_length):
    
    with torch.no_grad():

        encoder_hidden_state = encoder.init_hidden()

        input_length = input_tensor.shape[0]
        target_length = target_tensor.shape[0]

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size)

        # Encode input tensor
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden_state)
            encoder_outputs[ei] = encoder_output

        # Decode input tensor, seeding the decoding with the <BOS> token
        decoder_input = torch.tensor([[target_vocab.word2index['<SOS>']]])
        decoder_hidden_state = encoder_hidden_state
        decoder_word_outputs = []

        for di in range(target_length):
            # Forward step and loss
            decoder_output, decoer_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden_state, encoder_outputs)

            # Update input for next step (using output at this step)
            log_likelihood, top_word = decoder_output.topk(1) 
            decoder_input = top_word.squeeze().detach()
            decoder_word_output = target_vocab.index2word[decoder_input.item()]
            decoder_word_outputs.append(decoder_word_output)
            if decoder_word_output == '<EOS>':
                break
                

        print("Input: ", " ".join(tensor2sentence(input_tensor, input_vocab)))
        print("Target: ", " ".join(tensor2sentence(target_tensor, target_vocab)))
        print("Result: ", " ".join(decoder_word_outputs))

    

In [200]:
encoder = EncoderRNN(input_vocab.n_words, 100)
decoder = DecoderRNNAttention(target_vocab.n_words, 100, .1, 11)

In [201]:
encoder_optimizer = torch.optim.Adam(encoder.parameters())
decoder_optimizer = torch.optim.Adam(decoder.parameters())

criterion = nn.NLLLoss()

In [202]:
from tqdm import tqdm

In [222]:
def train_iteratively(encoder, decoder, tensor_pairs, encoder_optimizer, 
                    decoder_optimizer, criterion, max_length, validation_pairs,
                    teacher_forcing_ratio = 1.0, epochs = 2):
    
    for epoch in range(epochs):
        print("Epoch %d / %d" % (epoch + 1, epochs))
        loss = 0
        for input_tensor, target_tensor in tqdm(tensor_pairs):
            
            loss += train_model(encoder, decoder, input_tensor, target_tensor, encoder_optimizer, 
                         decoder_optimizer, criterion, max_length, teacher_forcing_ratio)
        
        
        # Print some statistics every epoch
        epoch_loss = loss / len(tensor_pairs)
        print('Loss: {:.4f}'.format(epoch_loss))
        
        # Print some examples after each epoch
        # TODO: Don't just print the first 10 sentences of the validation set...
        for input_tensor, target_tensor in validation_pairs[:10]:
            evaluate_model(encoder, decoder, input_tensor, target_tensor, max_length)
            print("")
        print("")
        

In [216]:
def evaluate(encoder, decoder, tensor_pairs,  max_length):
    

    for input_tensor, target_tensor in tensor_pairs:
        evaluate_model(encoder, decoder, input_tensor, target_tensor, max_length)

In [192]:
training_sentences = zip(input_train, target_train)
validation_sentences = zip(input_val, target_val)

training_tensors = [pair2tensor(sent_pair) for sent_pair in training_sentences]
validation_tensors = [pair2tensor(sent_pair) for sent_pair in validation_sentences]

My training setup could definitely be improved, but I can't spend to mucg time per coding session writing boilerplate. 
I'll improve it iteratively over coming projects!

Let's launch training!

In [223]:
train_iteratively(encoder, decoder, training_tensors, encoder_optimizer, decoder_optimizer, criterion, 11, validation_tensors)

Epoch 1 / 2


100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [21:10<00:00,  6.29it/s]


Loss: 3.8954
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han har en gång . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  att jag är inte att jag tycker . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan jag kan inte

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den är väldigt . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom har inte . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var min penna . <EOS>

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är glad . <EOS>

Input:  i thanked tom for his help . <EOS>
Target:  jag ta

100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:26<00:00,  5.24it/s]


Loss: 3.0453
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han vill ha en gång . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  vi ses mig . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan ta . <EOS>

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom har inte ha inte . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem det ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var min jacka min jacka .

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är väldigt väldigt väldigt väldigt väldigt

Input:  i thanked tom for his help . <EOS>
T

In [224]:
train_iteratively(encoder, decoder, training_tensors, encoder_optimizer, decoder_optimizer, criterion, 11, validation_tensors, epochs=5)

Epoch 1 / 5


100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:06<00:00,  5.31it/s]


Loss: 2.6477
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han köpte du vill ha en penna .

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  vi ses på honom . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan vara . <EOS>

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den här . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom har inte en penna . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem vill det ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var inte att jag var min

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är det är det är det

Input:  i thanked tom for his help . <EOS>
Tar

100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:07<00:00,  5.31it/s]


Loss: 2.3354
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han brukar . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  vi ses mig . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan studera . <EOS>

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den står på . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom kan inte haft en sköldpadda . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem ska jag gav det ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var min lägenhet var min lägenhet

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är det här . <EOS>

Input:  i thanked tom for his hel

100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:07<00:00,  5.31it/s]


Loss: 2.0872
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han brukar vilja ett öga . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  vi ses att vi ses på honom . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan ta med .

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den här boken . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom har inte haft en penna . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem ska jag gav den ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var min lägenhet var min familj

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är din dotter . <EOS>

Input:  i th

100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:08<00:00,  5.30it/s]


Loss: 1.8795
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han stängde på det . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  vi ses på honom . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan olycka . <EOS>

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den här boken . <EOS>

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom har inte en tonårig aldrig äpple .

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem bör den ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var inte mina mina mina mina

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är snäll och gör det här

Input:  i thanked tom for h

100%|██████████████████████████████████████████████████████████████████████████████| 8000/8000 [25:01<00:00,  5.33it/s]


Loss: 1.7069
Input:  he behaved like a child . <EOS>
Target:  han betedde sig som ett barn . <EOS>
Result:  han brukar . <EOS>

Input:  `` trust me , '' he said . <EOS>
Target:  ” lita på mig ” , sade han . <EOS>
Result:  hurdan . <EOS>

Input:  i can smell smoke . <EOS>
Target:  det luktar rök . <EOS>
Result:  jag kan greta . <EOS>

Input:  the noise will wake the baby up . <EOS>
Target:  ljudet kommer att väcka bebisen . <EOS>
Result:  den kommer att behöva lite imorgon .

Input:  tom ca n't sing a high a . <EOS>
Target:  tom kan inte sjunga höga a . <EOS>
Result:  tom en tonårig . <EOS>

Input:  who gave it to me ? <EOS>
Target:  vem gav den till mig ? <EOS>
Result:  vem jag gav det ? <EOS>

Input:  it was actually my fault . <EOS>
Target:  det var faktiskt mitt fel . <EOS>
Result:  det var inte mina mina mina mina

Input:  tom ran over someone 's dog . <EOS>
Target:  tom körde över någons hund . <EOS>
Result:  tom är ditt namn . <EOS>

Input:  i thanked tom for his help . <EOS>
Tar

In [228]:
torch.save(encoder.state_dict(), 'pytorch_models/encoder.pt')
torch.save(decoder.state_dict(), 'pytorch_models/decoder.pt')

After almost 3 hours of training the model does not really perform that well, but more training would probably help. It doesn't feel that good to always stop training at a stage where the model does not perform well, but at this stage of exploration I don't want to run models for many hours before getting any results.

The PyTorch tutorial limited their translation task to sentences with similar structure. This makes the task less interesting, but perhaps such a task can be solved in a more reasonable time frame?

Let's try the same approach!