<a href="https://colab.research.google.com/github/olivermueller/aml4ta-2021/blob/main/Session_08/08_sequence2sequence_FULL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IF USING GOOGLE COLABORATORY -> RUN FIRST!!!
# OTHERWISE -> IGNORE ;-)

from google.colab import drive

drive.mount('/content/gdrive')

!pip install pymysql

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
# <font color="#003660">Lesson 8: Creating a Neural Translator with Sequence-to-Sequence Models</font>

<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... understand the inner workings of sequence-to-sequence architectures;<br>
        ... implement your own sequence-to-sequence models.<br>
    </font>
</div>
</center>
</p>

<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/tip.png"></center>

<p><center><font color="red"><strong><i>This entire tutorial is based on the implementations by <a href="https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html">Robertson (n.d.)</a> and <a href="https://github.com/bentrevett/pytorch-seq2seq/blob/master/2%20-%20Learning%20Phrase%20Representations%20using%20RNN%20Encoder-Decoder%20for%20Statistical%20Machine%20Translation.ipynb">Trevett (n.d.)</a> of the architecture proposed by Cho et al. (2014).</i></strong></font></center></p>

# 1. What is a Sequence-to-Sequence Model?


## 1.1 General Idea

<table class="image">
<center>
<caption align="bottom">(Lane et al., 2019, p. 318)</caption>
<tr><td><img width=540 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_6/images/seq2seq.png'></td></tr>
</center>
</table>

## 1.2 Applications
<center><img width=100 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/question.png"></center>

<p><center><b>What are possible applications of sequence-to-sequence models?<br>What are they good for?</b></center></p>

# 2. Dataset

<p>Like <a href="https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html">Robertson (n.d.)</a>, this tutorial uses a dataset of English-German sentence pairs provided by <a href="https://tatoeba.org/eng/">Tatoeba</a>.</p>

In [None]:
%matplotlib inline

################
# Load dataset #
################

# Import
import re
import getpass
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Get credentials
user = input("Username: ")
host = input("Host: ")
db = input("Database: ")
passwd = getpass.getpass("Password: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, host, db))

# Define SQL query
sql_query = "SELECT english, german FROM TatoebaEnglishGerman"

# Query dataset (pandas)
data = pd.read_sql(sql=sql_query, con=engine)

# Sample
data.head()

In [None]:
###################
# Explore dataset #
###################

data.info()

<p>The documents in this dataset have already been preprocessed and contain at most 10 tokens each. Some further preprocessing is still required before training: each document is tokenized and wrapped in <code>&lt;sos&gt;</code> and <code>&lt;eos&gt;</code> markers.</p>

<table class="image">
<center>
<caption align="bottom">(Lane et al., 2019, p. 319)</caption>
<tr><td><img width=768 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_6/images/preprocessing.png'></td></tr>
</center>
</table>
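
<p>To see what this preprocessing amounts to, here is a minimal sketch on two made-up sentences. The helper name <code>preprocess_sketch</code> and the sentences are illustrative assumptions and not part of the dataset; the tokenization rule (runs of word characters) mirrors the regular expression used in this notebook.</p>

```python
import re

def preprocess_sketch(doc):
    # Split on runs of word characters, which drops punctuation
    tokens = re.findall(r'\w+', doc)
    # Wrap the token list in start-of-sentence and end-of-sentence markers
    return ['<sos>'] + tokens + ['<eos>']

english_example = preprocess_sketch("I can't wait!")
german_example = preprocess_sketch("Ich heiße Müller.")

print(english_example)
print(german_example)
```

<p>The apostrophe splits "can't" into two tokens, while umlauts and "ß" are kept because <code>\w</code> matches Unicode word characters in Python 3.</p>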

In [None]:
#################
# Preprocessing #
#################

# Import
import re

def preprocessing(doc):
    
    # Tokenize
    doc = re.findall(r'\w+', doc)
    
    # Add <sos> token (start of sentence)
    doc.insert(0, '<sos>')
    
    # Add <eos> token (end of sentence)
    doc.insert(len(doc), '<eos>')
    
    return doc

data.english = data.english.apply(preprocessing)
data.german = data.german.apply(preprocessing)

data.head()

In [None]:
#################
# Split dataset #
# 90% / 5% / 5% #
#################

# Import
from sklearn.model_selection import train_test_split

# Train / Test split
train, test = train_test_split(data,
                               test_size=0.1,
                               random_state=42)
test, val = train_test_split(test,
                             test_size=0.5,
                             random_state=42)

print(train.shape)
print(val.shape)
print(test.shape)

In [None]:
######################
# Build vocabularies #
######################

def vocabulary_generator(corpus):
    
    vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3}
    
    for doc in corpus:
        for token in doc:
            if token not in vocab:
                # Next free index (0-3 are reserved for the special tokens)
                vocab[token] = len(vocab)
                
    return vocab
    
# Source vocabulary – i.e., English
src_vocabulary = vocabulary_generator(train.english)

# Target vocabulary – i.e., German
trg_vocabulary = vocabulary_generator(train.german)

print(len(src_vocabulary))
print(len(trg_vocabulary))
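
<p>The following sketch illustrates, on a toy corpus, how a vocabulary of this kind turns a token list into a fixed-length list of indices. The names <code>build_vocab</code> and <code>encode</code>, the toy sentences, and the length of 8 are assumptions made for illustration; the sketch assigns consecutive indices and uses plain Python lists instead of tensors.</p>

```python
def build_vocab(corpus):
    # Reserve the first indices for the special tokens
    vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3}
    for doc in corpus:
        for token in doc:
            # setdefault only inserts tokens that are not in the vocabulary yet
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(doc, vocab, max_length):
    # Map tokens to indices, falling back to <unk> for unseen tokens
    ids = [vocab.get(token, vocab["<unk>"]) for token in doc[:max_length]]
    # Right-pad with the <pad> index up to the fixed length
    return ids + [vocab["<pad>"]] * (max_length - len(ids))

toy_corpus = [["<sos>", "i", "am", "here", "<eos>"],
              ["<sos>", "i", "am", "tired", "<eos>"]]
toy_vocab = build_vocab(toy_corpus)

# "hungry" never occurs in the toy corpus, so it is mapped to <unk>
encoded = encode(["<sos>", "i", "am", "hungry", "<eos>"], toy_vocab, max_length=8)

print(toy_vocab)
print(encoded)
```

<p>Building the vocabulary from the training split only means that validation and test documents can contain unseen tokens, which is why the <code>&lt;unk&gt;</code> fallback is needed.</p>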

In [None]:
#######################
# Vectorize documents #
#######################

# Import
import torch

MAX_LENGTH = 12  # up to 10 tokens plus <sos> and <eos>

def document_vectorizer(corpus, vocab):
    
    output_corpus = list()
    
    for doc in corpus:
    
        # Start from all zeros, i.e., the <pad> index
        output_tensor = torch.zeros(MAX_LENGTH, dtype=torch.int64)

        for index, token in enumerate(doc):
            
            # Tokens not seen in the training vocabulary map to <unk>
            output_tensor[index] = vocab.get(token, vocab['<unk>'])
            
        output_corpus.append(output_tensor)
    
    return torch.stack(output_corpus)

src_train = document_vectorizer(train.english.to_list(), src_vocabulary)
trg_train = document_vectorizer(train.german.to_list(), trg_vocabulary)

src_val = document_vectorizer(val.english.to_list(), src_vocabulary)
trg_val = document_vectorizer(val.german.to_list(), trg_vocabulary)

src_test = document_vectorizer(test.english.to_list(), src_vocabulary)
trg_test = document_vectorizer(test.german.to_list(), trg_vocabulary)

print(f'>>> Train:\n• src -> {src_train.shape}\n• trg -> {trg_train.shape}')
print(f'\n>>> Val.:\n• src -> {src_val.shape}\n• trg -> {trg_val.shape}')
print(f'\n>>> Test:\n• src -> {src_test.shape}\n• trg -> {trg_test.shape}')

In [None]:
########################
# Generate DataLoaders #
########################

BATCH_SIZE = 256

def dataloader_generator(src, trg, shuffle):

    dataset = torch.utils.data.TensorDataset(src, trg)
    return torch.utils.data.DataLoader(dataset=dataset,
                                       num_workers=8,
                                       shuffle=shuffle,
                                       batch_size=BATCH_SIZE)

train_dataloader = dataloader_generator(src_train, trg_train, True)
val_dataloader = dataloader_generator(src_val, trg_val, False)
test_dataloader = dataloader_generator(src_test, trg_test, False)

# 3. Implementation

In [None]:
##########
# Import #
##########

import random
import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F 

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## 3.1 Encoder

<table class="image">
<center>
<caption align="bottom">(Trevett, n.d.)</caption>
<tr><td><img width=512 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_6/images/trevett_encoder.png'></td></tr>
</center>
</table>
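
<p>The encoder passes on the final hidden state of its recurrent layer, and the shape of that state depends on the number of layers and directions. As a rough orientation, the sketch below runs a random batch through an embedding layer and a GRU, once unidirectional and once bidirectional. All sizes are toy values chosen for illustration.</p>

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration only)
BATCH_SIZE, SEQ_LEN, VOCAB_SIZE = 5, 12, 50
EMB_DIM, HIDDEN_DIM = 8, 16

# A batch as a DataLoader would deliver it: (batch_size, sequence_length)
batch = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN))

# nn.GRU expects (sequence_length, batch_size, *) unless batch_first=True
batch = batch.permute(1, 0)

embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)

shapes = {}
for bidirectional in (False, True):
    rnn = nn.GRU(EMB_DIM, HIDDEN_DIM, num_layers=1, bidirectional=bidirectional)
    outputs, hidden_state = rnn(embedding(batch))
    # outputs:      (sequence_length, batch_size, hidden_dim * directions)
    # hidden_state: (num_layers * directions, batch_size, hidden_dim)
    shapes[bidirectional] = (tuple(outputs.shape), tuple(hidden_state.shape))
    print(f'bidirectional={bidirectional}: {shapes[bidirectional]}')
```

<p>Because the hidden state grows along its first dimension with the number of directions, a decoder that receives it directly has to use the same number of layers and directions as the encoder.</p>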

In [None]:
###########
# Encoder #
###########

class Encoder(nn.Module):
    
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0., bidirectional=True):
        
        super(Encoder, self).__init__()
        
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

        # Define embedding layer (vocab_size, emb_dim)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        
        # Define recurrent layer (emb_dim, hidden_dim, *)
        # Example -> Bidirectional GRU
        self.rnn = nn.GRU(
            emb_dim, 
            hidden_dim,
            num_layers=num_layers,
            dropout=dropout,
            bidirectional=bidirectional
        )
        
    def forward(self, input):
                
        # Embedding layer
        # Input -> (sequence_length, batch_size)
        embedding = self.embedding(input)
                        
        # Recurrent layer
        # Input -> (sequence_length, batch_size, embedding_dim)
        _, hidden_state = self.rnn(embedding)
        
        # Return hidden state
        # Output -> (n_layers * directions, batch_size, hidden_dim)
        return hidden_state

In [None]:
#################
# Debug encoder #
#################

DEBUG_BATCH_SIZE = 5

debug_encoder = Encoder(len(src_vocabulary)+1, 200, 256, 1, 0., False)
hidden_state = debug_encoder.forward(src_train[:DEBUG_BATCH_SIZE].permute(1,0))
hidden_state.shape
# Debugging
# (if required!)

## 3.2 Decoder

<table class="image">
<center>
<caption align="bottom">(Trevett, n.d.)</caption>
<tr><td><img width=256 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_6/images/trevett_decoder.png'></td></tr>
</center>
</table>

In [None]:
###########
# Decoder #
###########

class Decoder(nn.Module):
    
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_layers=1, dropout=0., bidirectional=True):
        
        super(Decoder, self).__init__()
        
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        
        # Define embedding layer (vocab_size, emb_dim)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        
        # Define recurrent layer (emb_dim, hidden_dim, *)
        # Example -> Bidirectional GRU
        self.rnn = nn.GRU(
            emb_dim,
            hidden_dim,
            num_layers=num_layers,
            dropout=dropout,
            bidirectional=bidirectional
        )
        
        # Define fully connected layer (hidden_dim * directions, vocab_size)
        self.out = nn.Linear(hidden_dim * (2 if bidirectional else 1), vocab_size)
                            
    def forward(self, input, hidden_state):
        
        # Add dimension at position 0
        # Input -> (batch_size)
        input = input.unsqueeze(0)
        
        # Embedding layer
        # Input -> (1, batch_size)
        embedded = self.embedding(input)
                
        # Recurrent layer
        # Input -> (sequence_length, batch_size, embedding_dim)
        output, hidden_state = self.rnn(embedded, hidden_state)
        
        # FCL
        # Input -> (sequence_length, batch_size, hidden_dim * directions)
        predictions = self.out(output).squeeze(0)
        
        # Return prediction + hidden_state
        # Prediction -> (batch_size, vocab_size)
        return predictions, hidden_state

In [None]:
#################
# Debug decoder #
#################

DEBUG_BATCH_SIZE = 5

debug_encoder = Encoder(len(src_vocabulary)+1, 200, 256, 1, 0., True)
hidden_state = debug_encoder.forward(src_train[:DEBUG_BATCH_SIZE].permute(1,0))

debug_decoder = Decoder(len(trg_vocabulary)+1, 200, 256, 1, 0., True)
prediction, hidden_state = debug_decoder.forward(trg_train[:DEBUG_BATCH_SIZE].permute(1,0)[0], hidden_state)

# Debugging
# (if required!)

## 3.3 Model

<table class="image">
<center>
<caption align="bottom">(Trevett, n.d.)</caption>
<tr><td><img width=512 src='https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_6/images/trevett_seq2seq.png'></td></tr>
</center>
</table>

In [None]:
#########
# Model #
#########

class Model(nn.Module):
    
    def __init__(self, encoder, decoder, device):
        
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        
        # Initialize output vector (sequence_length, batch_size, vocab_size)
        output_vector = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.vocab_size).to(self.device)
        
        # Encoder
        # Input -> (sequence_length, batch_size)
        hidden_state = self.encoder(src)
        
        # First token (i.e., <sos>)
        # Input -> (sequence_length, batch_size)
        # Output -> (batch_size)
        decoder_input = trg[0]
                
        for index in range(1, trg.shape[0]):
            
            # Decoder
            # Input -> (batch_size)
            # Hidden state -> tensor (e.g., RNN / GRU) or tuple (e.g., LSTM)
            predictions, hidden_state = self.decoder(decoder_input, hidden_state)

            # Append 
            output_vector[index] = predictions
                        
            # Teacher forcing
            teacher_forcing = random.random() < teacher_forcing_ratio
            decoder_input = trg[index] if teacher_forcing else torch.argmax(predictions, dim=1)
            
        # Return output_vector
        # Output -> (sequence_length, batch_size, trg_vocab_size)
        return output_vector
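
<p>Teacher forcing decides, at every decoding step, whether the next decoder input is the ground-truth token or the model's own previous prediction. The pure-Python sketch below mimics that choice on token lists; the function name <code>decoder_inputs</code> and the example sentences are hypothetical, and the "predictions" are hard-coded rather than produced by a model.</p>

```python
import random

def decoder_inputs(target, predicted, teacher_forcing_ratio, rng):
    # target / predicted: token lists of equal length; position 0 is <sos>
    inputs = [target[0]]
    # The token chosen after the last step is never fed back, hence len - 1
    for index in range(1, len(target) - 1):
        # With probability teacher_forcing_ratio feed the ground-truth token,
        # otherwise feed the (here: hard-coded) prediction for that position
        use_truth = rng.random() < teacher_forcing_ratio
        inputs.append(target[index] if use_truth else predicted[index])
    return inputs

target = ['<sos>', 'ich', 'bin', 'hier', '<eos>']
predicted = ['<sos>', 'ich', 'war', 'da', '<eos>']
rng = random.Random(42)

always_forced = decoder_inputs(target, predicted, 1.0, rng)
never_forced = decoder_inputs(target, predicted, 0.0, rng)
mixed = decoder_inputs(target, predicted, 0.5, rng)

print(always_forced)
print(never_forced)
print(mixed)
```

<p>A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the model's own output, which is the situation the model faces at inference time.</p>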

In [None]:
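###############
# Debug model #
###############

DEBUG_BATCH_SIZE = 2

# Vocabulary sizes use +1, as for the encoder and decoder elsewhere in this notebook,
# so the embedding tables cover the highest token index
debug_encoder = Encoder(len(src_vocabulary)+1, 200, 256, 1, 0., True)
debug_decoder = Decoder(len(trg_vocabulary)+1, 200, 256, 1, 0., True)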

debug_model = Model(debug_encoder, debug_decoder, 'cpu')

output_vector = debug_model.forward(src_train[:DEBUG_BATCH_SIZE].permute(1,0),
                                    trg_train[:DEBUG_BATCH_SIZE].permute(1,0), 
                                    teacher_forcing_ratio=0.5)

# Debugging
# (if required!)

## 3.4 Training

In [None]:
####################
# Initialize model #
####################

# Hyperparameters
EMB_DIM = 300
HIDDEN_DIM = 512
NUM_LAYERS = 1
DROPOUT = 0.
BIDIRECTIONAL = True

# Encoder
encoder = Encoder(len(src_vocabulary)+1, EMB_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT, BIDIRECTIONAL)

# Decoder
decoder = Decoder(len(trg_vocabulary)+1, EMB_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT, BIDIRECTIONAL)

# Model
model = Model(encoder, decoder, device)
model.to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(),lr=1e-3, weight_decay=1e-5)

# Loss function (a.k.a. criterion)
criterion = nn.CrossEntropyLoss(ignore_index=0)

print(model)
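
<p>Setting <code>ignore_index</code> to the padding index keeps padded positions from contributing to the loss. The sketch below checks this on random toy logits: the masked loss should match the loss computed on the non-padded positions alone. The tensor sizes and the name <code>PAD_IDX</code> are assumptions for illustration.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 0  # index of the <pad> token

# Toy logits for 4 token positions over a vocabulary of 6 entries
logits = torch.randn(4, 6)

# The last two positions are padding
targets = torch.tensor([3, 5, PAD_IDX, PAD_IDX])

# Loss that skips padded positions
masked = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits, targets)

# Reference: plain loss on the two real tokens only
manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

# For comparison: plain loss that treats <pad> as a regular class
unmasked = nn.CrossEntropyLoss()(logits, targets)

print(masked.item(), manual.item(), unmasked.item())
```

<p>Without the mask, the model would also be rewarded for predicting <code>&lt;pad&gt;</code> at the end of short sentences, which would distort the reported loss.</p>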

In [None]:
#######################
# Evaluation function #
#######################

def evaluate(model, dataloader, criterion):
    
    model.eval()
    
    total_loss = 0
    
    for batch_id, batch in enumerate(dataloader):

        # Get validation/test data
        src, trg = batch

        # Permute dimensions
        # Input -> (batch_size, sequence_length)
        # Output -> (sequence_length, batch_size)
        src = src.permute(1,0).to(device)
        trg = trg.permute(1,0).to(device)
         
        with torch.no_grad():
            
            # Encoder
            hidden_state = model.encoder(src)

            # Initialize output vector (sequence_length, batch_size, vocab_size)
            # Sized by the target sequence, since the loss is computed against trg
            outputs = torch.zeros(trg.shape[0], trg.shape[1], model.decoder.vocab_size).to(device)

            # First token (i.e., <sos>) of the target sequence
            decoder_input = trg[0]

            for index in range(1, trg.shape[0]):
                predictions, hidden_state = model.decoder(decoder_input, hidden_state)
                outputs[index] = predictions
                decoder_input = torch.argmax(predictions, dim=1)

        # Skip first token (i.e., <sos>) + reshape output
        # Input -> (sequence_length, batch_size, trg_vocab_size)
        # Output -> ((sequence_length-1) * batch_size, trg_vocab_size)
        outputs = outputs[1:].reshape(-1, outputs.shape[2])

        # Compute loss
        loss = criterion(outputs, trg[1:].reshape(-1))

        # Update loss
        total_loss += loss.item()
        
    return total_loss / len(dataloader)

In [None]:
###############
# Train model #
###############

EPOCHS = 15

train_history = list()
validation_history = list()

best_validation_loss = np.inf

# Training loop
for epoch in range(EPOCHS):
    
    model.train()
        
    total_loss = 0
        
    for batch_id, batch in enumerate(train_dataloader):
        
        # Get training data
        src, trg = batch
        
        # Permute dimensions
        # Input -> (batch_size, sequence_length)
        # Output -> (sequence_length, batch_size)
        src = src.permute(1,0).to(device)
        trg = trg.permute(1,0).to(device)
  
        # Clear gradients
        optimizer.zero_grad()
        
        # Feedforward
        # teacher_forcing_ratio in [0, 1]: probability of feeding the ground-truth
        # token (instead of the model's own prediction) into the next decoder step
        outputs = model(src, trg, teacher_forcing_ratio=0.4)
               
        # Skip first token (i.e., <sos>) + reshape output
        # Input -> (sequence_length, batch_size, trg_vocab_size)
        # Output -> ((sequence_length-1) * batch_size, trg_vocab_size)
        outputs = outputs[1:].reshape(-1, outputs.shape[2])
                        
        # Compute loss
        loss = criterion(outputs, trg[1:].reshape(-1))
        
        # Backpropagate errors
        loss.backward()
        
        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)

        # Update weights
        optimizer.step()
        
        # Update loss
        total_loss += loss.item()
        
    # Average training loss per batch (same scale as the validation loss)
    train_loss = total_loss / len(train_dataloader)

    # Validation
    validation_loss = evaluate(model, val_dataloader, criterion)
    
    if validation_loss < best_validation_loss:

        best_validation_loss = validation_loss
        torch.save(model, 'best_seq2seq_model.pt')
    
    train_history.append(train_loss)
    validation_history.append(validation_loss)
    
    print({ 'epoch': epoch, 'training loss': train_loss, 'validation loss': validation_loss})

print('\n>>> DONE!')
print(f'>>> BEST MODEL (VALIDATION): {best_validation_loss}')

In [None]:
################
# Plot history #
################

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

axs[0].plot(train_history)
axs[0].set_title('Training Loss')

axs[1].plot(validation_history)
axs[1].set_title('Validation Loss')

plt.show()

## 3.5 Evaluation

In [None]:
#########################
# Evaluate model (test) #
#########################

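# Load model
# The checkpoint stores the whole pickled model, so weights_only=False is required
# on recent PyTorch releases; drop the argument on older releases that reject it.
best_model = torch.load('best_seq2seq_model.pt', map_location=device, weights_only=False)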

# Evaluate
print(evaluate(best_model, test_dataloader, criterion))
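
<p>An alternative to pickling the whole model object is to save only its parameters via <code>state_dict()</code> and to rebuild the architecture before loading. The sketch below shows the round trip with a small GRU as a stand-in for the seq2seq model; the stand-in module, its sizes, and the file name are assumptions for illustration.</p>

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the seq2seq model (assumption: any nn.Module works the same way)
model_a = nn.GRU(input_size=8, hidden_size=16)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.pt')  # hypothetical file name

    # Save only the parameter tensors rather than the pickled object
    torch.save(model_a.state_dict(), path)

    # Rebuild the architecture first, then load the saved weights into it
    model_b = nn.GRU(input_size=8, hidden_size=16)
    model_b.load_state_dict(torch.load(path, map_location='cpu'))

# Every parameter of the restored model should equal its original counterpart
same = all(torch.equal(p, q)
           for p, q in zip(model_a.parameters(), model_b.parameters()))
print(same)
```

<p>This variant does not depend on the class definitions being importable under the same names at load time, at the cost of having to recreate the model with identical hyperparameters.</p>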

In [None]:
#############
# Translate #
#############

def get_translation(document, seq2seq_model):
    
    with torch.no_grad():
        
        # Add dimension
        document = document.unsqueeze(1)
        
        # Encoder
        hidden_state = seq2seq_model.encoder(document.to(device))
        
        # Initialize output document
        output_document = [trg_vocabulary['<sos>']]

        # Decode at most as many steps as the (padded) input document is long
        for _ in range(1, document.shape[0]):
                        
            outputs, hidden_state = seq2seq_model.decoder(torch.LongTensor([output_document[-1]]).to(device), hidden_state)            
            output_document.append(outputs.argmax().item())
            
            if output_document[-1] == trg_vocabulary['<eos>']:
                break
            
        return output_document

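def tokens_lookup(tokens_idx, vocabulary):
    # Invert the vocabulary once (index -> token) instead of searching it per token
    index_to_token = {index: token for token, index in vocabulary.items()}
    tokens = [index_to_token[token] for token in tokens_idx if token != vocabulary['<pad>']]
    return ' '.join(tokens)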
        
# Document ID
ID = 13 # 442 # 178

# Translate
tokens_idx = get_translation(src_test[ID], best_model)
translation = tokens_lookup(tokens_idx, trg_vocabulary)
print(f'>>> Translation: {translation}')

# Source & Target
print(f'\n>>> Source: {tokens_lookup(src_test[ID].tolist(), src_vocabulary)}')
print(f'>>> Target: {tokens_lookup(trg_test[ID].tolist(), trg_vocabulary)}')

<ul style="list-style-type:round">
<i>
    <li>Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &amp; Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation.</li>
    <li>Lane, H., Howard, C., &amp; Hapke, H.M. (2019). Natural Language Processing in Action. Shelter Island, NY: Manning Publications Co.</li>
    <li>Rao, D., &amp; McMahan, B. (2019). Natural Language Processing with PyTorch. Sebastopol, CA: O'Reilly Media.</li>
    <li>Robertson, S. (n.d.). NLP from Scratch: Translation with a Sequence-to-Sequence Network and Attention. <a href="https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html">Link</a>.</li>
    <li>Trevett, B. (n.d.). pytorch-seq2seq:  Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. <a href="https://github.com/bentrevett/pytorch-seq2seq/blob/master/2%20-%20Learning%20Phrase%20Representations%20using%20RNN%20Encoder-Decoder%20for%20Statistical%20Machine%20Translation.ipynb">Link</a>.</li>
</i>
</ul>