# CS 4464/7643 Deep Learning HW 4
### Machine Translation with Seq2Seq and Transformers
In this exercise you will implement a [Sequence to Sequence(Seq2Seq)](https://arxiv.org/abs/1703.03906) and a [Transformer](https://arxiv.org/pdf/1706.03762.pdf) model and use them to perform machine translation.

**A quick note: if you receive the following TypeError "super(type, obj): obj must be an instance or subtype of type", try re-importing that part or restarting your kernel and re-running all cells.** Once you have finished making changes to the model constuctor, you can avoid this issue by commenting out all of the model instantiations after the first (e.g. lines starting with "model = TransformerTranslator(*args, **kwargs)").

#### Google Colab Setup
Edit and run the cell below to setup the environment for Google Colab (and only for Google Colab).

In [1]:
# Cell 1
from google.colab import drive
drive.mount('/content/drive')

# Change this path to the correct one for you
%cd /content/drive/MyDrive/hw4_student_version/

%pip install torchtext==0.9

Mounted at /content/drive
/content/drive/MyDrive/hw4_student_version
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.9
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
[K     |████████████████████████████████| 7.1 MB 4.4 MB/s 
[?25hCollecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
[K     |████████████████████████████████| 735.5 MB 14 kB/s 
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is

### Introduction

#### Multi30K: Multilingual English-German Image Descriptions

[Multi30K](https://github.com/multi30k/dataset) is a dataset for machine translation tasks. It is a multilingual corpus containing English sentences and their German translation. In total it contains 31014 sentences(29000 for training, 1014 for validation, and 1000 for testing).
As one example:

En: `Two young, White males are outside near many bushes.`

De: `Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.`

You can read more info about the dataset [here](https://arxiv.org/abs/1605.00459). The following parts of this assignment will be based on this dataset.

#### TorchText: A PyTorch Toolkit for Text Dataset and NLP Tasks
[TorchText](https://github.com/pytorch/text) is a PyTorch package that consists of data processing utilities and popular datasets for natural language. The key idea of TorchText is that datasets can be organized in *Field*, *TralsationDataset*, and *BucketIterator* classes. They serve to help with data splitting and loading, token encoding, sequence padding, etc. You don't need to know about how TorchText works in detail, but you might want to know about why those classes are needed and what operations are necessary for machine translation. This knowledge can be migrated to all sequential data modeling. In the following parts, we will provide you with some code to help you understand.

 You can refer to torchtext's documentation(v0.6.0) [here](https://pytorch.org/text/).

#### Spacy
Spacy is package designed for tokenization in many languages. Tokenization is a process of splitting raw text data into lists of tokens that can be further processed. Since TorchText only provides tokenizer for English, we will be using Spacy for our assignment. 

### Prerequisites
Before you start this assignment, please make sure you have the following package installed:

`PyTorch, TorchText, Spacy, Tqdm, Numpy`

You can first check using either `pip freeze` in terminal or `conda list` in conda environment. Then run the following code blocks to make sure they can be imported.

In [2]:
# Cell 4
import numpy as np
import csv
import torch

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [3]:
%cd /content/drive/MyDrive/hw4_student_version/

# Cell 5
# Just run this block. Please do not modify the following code.
import math
import time

# Pytorch package
import torch
import torch.nn as nn
import torch.optim as optim

# Torchtext package
from torchtext.legacy.datasets	import Multi30k
from torchtext.legacy.data import Field, BucketIterator

# Tqdm progress bar
from tqdm import tqdm_notebook

# Code provided to you for training and evaluation
from utils import train, evaluate, set_seed_nb, unit_test_values

/content/drive/MyDrive/hw4_student_version


Once you properly import the above packages, you can proceed to download Spacy English and German tokenizers by running the following commands in your **terminal** (if working locally). Otherwise, run the following cell on Google Colab. They will take some time.

In [4]:
# Cell 6
!python -m spacy download en
!python -m spacy download de

2022-11-10 22:04:42.223335: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[38;5;3m⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.[0m
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
[K     |████████████████████████████████| 12.8 MB 5.3 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
2022-11-10 22:04:54.493460: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[38;5;3m⚠ As of spaCy v3.0, shortcuts like 'de' are deprecated. Please use the
full pipe

Check your GPU availability and load some sanity checkers

In [5]:
# Cell 7
# Check device availability

print(torch.cuda.is_available())
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("You are using device: %s" % device)

False
You are using device: cpu


#### Preprocess data

With TorchText and Spacy tokenizers ready, you can now prepare the data using TorchText objects. Just run the following code blocks. Read the comment and try to understand what they are for.

In [6]:
# Cell 8
# load checkers
d1 = torch.load('./data/d1.pt') 
d2 = torch.load('./data/d2.pt')
d3 = torch.load('./data/d3.pt')
d4 = torch.load('./data/d4.pt')

In [7]:
# Cell 9
# You don't need to modify any code in this block

# Define the maximum length of the sentence. Shorter sentences will be padded to that length and longer sentences will be croped. Given that the average length of the sentence in the corpus is around 13, we can set it to 20
MAX_LEN = 20

# Define the source and target language
SRC = Field(tokenize = "spacy",
            tokenizer_language="de_core_news_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            fix_length = MAX_LEN,
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en_core_web_sm",
            init_token = '<sos>',
            eos_token = '<eos>',
            fix_length = MAX_LEN,
            lower = True)

# Download and split the data. It should take some time
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))

In [8]:
# Cell 10
# Define Batchsize
BATCH_SIZE = 128

# Build the vocabulary associated with each language
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

# Get the padding index to be ignored later in loss calculation
PAD_IDX = TRG.vocab.stoi['<pad>']

# Get data-loaders using BucketIterator
train_loader, valid_loader, test_loader = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE, device = device)

# Get the input and the output sizes for model
input_size = len(SRC.vocab)
output_size = len(TRG.vocab)

### Part 1: RNNs and LSTMs

In this section, you will need to implement a Vanilla RNN and an LSTM unit using PyTorch Linear layers and nn.Parameter. This is designed to help you to understand how they work behind the scene. The code you will be working with is in *LSTM.py* and *RNN.py* under *naive* folder. Please refer to instructions among this notebook and those files. 

#### 1.1 Implement an RNN Unit

In this section you will be using PyTorch Linear layers and activations to implement a vanilla RNN unit. Run the following block to check your implementation.

In [9]:
# Cell 11
from models.naive.RNN import VanillaRNN

set_seed_nb()
x1,x2 = (1,4), (-1,2)
h1,h2 = (-1,2,0,4), (0,1,3,-1)
batch = 5
x = torch.FloatTensor(np.linspace(x1,x2,batch))
h = torch.FloatTensor(np.linspace(h1,h2,batch))

rnn = VanillaRNN(x.shape[-1], h.shape[-1], 3)
out, hidden = rnn.forward(x,h)
print(out.size())
print(hidden.size())
print(hidden)
expected_out, expected_hidden = unit_test_values('rnn')

if out is not None:
    print('Close to out: ', expected_out.allclose(out, atol=1e-4))
    print('Close to hidden: ', expected_hidden.allclose(hidden, atol=1e-4))
else:
    print("NOT IMPLEMENTED")

torch.Size([5, 3])
torch.Size([5, 4])
tensor([[-0.9717,  0.9257, -0.9781,  0.9998],
        [-0.8943,  0.9526, -0.9394,  0.9994],
        [-0.6433,  0.9699, -0.8381,  0.9977],
        [-0.0845,  0.9810, -0.6024,  0.9915],
        [ 0.5330,  0.9880, -0.1771,  0.9690]], grad_fn=<SliceBackward>)
Close to out:  True
Close to hidden:  True


#### 1.2 Implement an LSTM Unit

In this section you will be using PyTorch nn.Parameter and activations to implement an LSTM unit. You can simply translate the following equations using nn.Parameter and PyTorch activation functions to build an LSTM from scratch: 
\begin{array}{ll} \\
    i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
    f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
    g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
    o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
    c_t = f_t \odot c_{t-1} + i_t \odot g_t \\
    h_t = o_t \odot \tanh(c_t) \\
\end{array}

Here's a great visualization of the above equation from [Colah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) to help you understand LSTM unit. You can also read more about it from that blog.

If you want to see nn.Parameter in example, check out this [tutorial](https://pytorch.org/tutorials/beginner/nn_tutorial.html) from PyTorch. Run the following block to check your implementation

In [10]:
# Cell 12
from models.naive.LSTM import LSTM

set_seed_nb()
x1,x2 = np.mgrid[-1:3:3j, -1:4:2j]
h1,h2 = np.mgrid[-2:2:3j, 1:3:4j]
batch = 4
x = torch.FloatTensor(np.linspace(x1,x2,batch))
h = torch.FloatTensor(np.linspace(h1,h2,batch))

lstm = LSTM(x.shape[-1], h.shape[-1])
h_t, c_t = lstm.forward(x)

expected_ht, expected_ct = unit_test_values('lstm')

print('Close to h_t: ', expected_ht.allclose(h_t, atol=1e-3))
print('Close to c_t: ', expected_ct.allclose(c_t, atol=1e-3))

Close to h_t:  True
Close to c_t:  True


### Part 2: Train a Seq2Seq Model
In this section, you will be working on implementing a simple Seq2Seq model. You will first implement an Encoder and a Decoder, and then join them together with a Seq2Seq architecture. You will need to complete the code in *Decoder.py*, *Encoder.py*, and *Seq2Seq.py* under *seq2seq* folder. Please refer to the instructions in those files.

#### Implement the Encoder

In this section you will be implementing an RNN/LSTM based encoder to model English texts. Please refer to the instructions in *seq2seq/Encoder.py*. Run the following block to check your implementation. 

In [None]:
# Cell 13
from models.seq2seq.Encoder import Encoder as Encoder
import pdb

set_seed_nb()
i, n, h = 10, 4, 2

encoder = Encoder(i, n, h, h)
x_array = np.random.rand(5,1) * 10
x = torch.LongTensor(x_array)
out, hidden = encoder.forward(x)
print(hidden.size())
print(out.size())

expected_out, expected_hidden = unit_test_values('encoder')
print('Close to out: ', expected_out.allclose(out, atol=1e-4))
print('Close to hidden: ', expected_hidden.allclose(hidden, atol=1e-4))

torch.Size([1, 5, 2])
torch.Size([5, 1, 2])
Close to out:  True
Close to hidden:  True


#### Implement the Decoder

In this section you will be implementing an RNN/LSTM based decoder to model German texts. Please refer to the instructions in *seq2seq/Decoder.py*. Run the following block to check your implementation. 

In [None]:
# Cell 14
from models.seq2seq.Decoder import Decoder as Decoder
import pdb

set_seed_nb()
i, n, h =  10, 2, 2

decoder = Decoder(h, n, n, i)
x_array = np.random.rand(5, 1) * 10
x = torch.LongTensor(x_array)
_, enc_hidden = unit_test_values('encoder')
out, hidden = decoder.forward(x,enc_hidden)

expected_out, expected_hidden = unit_test_values('decoder')
print(hidden.size())
print(expected_hidden.size())
print(out.size())
print(expected_out.size())
print('Close to out: ', expected_out.allclose(out, atol=1e-4))
print('Close to hidden: ', expected_hidden.allclose(hidden, atol=1e-4))


torch.Size([1, 5, 2])
torch.Size([1, 5, 2])
torch.Size([5, 10])
torch.Size([5, 10])
Close to out:  True
Close to hidden:  True


#### Implement the Seq2Seq

In this section you will be implementing the Seq2Seq model that utilizes the Encoder and Decoder you implemented. Please refer to the instructions in *seq2seq/Seq2Seq.py*. Run the following block to check your implementation.

In [None]:

# Cell 15
import pdb

from models.seq2seq.Seq2Seq import Seq2Seq
from models.seq2seq.Decoder import Decoder
from models.seq2seq.Encoder import Encoder
set_seed_nb()
embedding_size = 32
hidden_size = 32
input_size = 8
output_size = 8
batch, seq = 1, 2

encoder = Encoder(input_size, embedding_size, hidden_size, hidden_size)
decoder = Decoder(embedding_size, hidden_size, hidden_size, output_size)

seq2seq = Seq2Seq(encoder, decoder, 'cpu')
x_array = np.random.rand(batch, seq) * 10
x = torch.LongTensor(x_array)

out = seq2seq.forward(x)
expected_out = unit_test_values('seq2seq')
print(out.size())
print(expected_out)
print('Close to out: ', expected_out.allclose(out, atol=1e-4))

torch.Size([1, 2, 8])
tensor([[[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
           0.0000],
         [-2.4136, -2.2861, -1.7145, -2.5612, -1.9864, -2.0557, -1.7461,
          -2.1898]]])
Close to out:  True


#### Train your Seq2Seq model

Now its time to combine what we have and train a Seq2Seq translator. We provided you with some training code and you can simply run them to see how your translator works. If you implemented everything correctly, you should see some meaningful translation in the output. You can modify the hyperparameters to improve the results. You can also tune the BATCH_SIZE in section Preprocess data.

In [None]:
# Cell 16
# Hyperparameters. You are welcome to modify these
encoder_emb_size = 64
encoder_hidden_size = 64
encoder_dropout = 0.1

decoder_emb_size = 64
decoder_hidden_size = 64
decoder_dropout = 0.1

learning_rate = 5e-3
model_type = "RNN"

EPOCHS = 5

#input size and output size
input_size = len(SRC.vocab)
output_size = len(TRG.vocab)

In [None]:
# Cell 17
# Declare models, optimizer, and loss function
encoder = Encoder(input_size, encoder_emb_size, encoder_hidden_size, decoder_hidden_size, dropout = encoder_dropout, model_type = model_type)
decoder = Decoder(decoder_emb_size, encoder_hidden_size, encoder_hidden_size, output_size, dropout = decoder_dropout, model_type = model_type)
seq2seq_model = Seq2Seq(encoder, decoder, device)

optimizer = optim.Adam(seq2seq_model.parameters(), lr = learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

In [None]:
# Cell 18
# If training freezes it is due to memory error
for epoch_idx in range(EPOCHS):
    print("-----------------------------------")
    print("Epoch %d" % (epoch_idx+1))
    print("-----------------------------------")
    
    train_loss, avg_train_loss = train(seq2seq_model, train_loader, optimizer, criterion)
    scheduler.step(train_loss)

    val_loss, avg_val_loss = evaluate(seq2seq_model, valid_loader, criterion)

    avg_train_loss = avg_train_loss.item()
    avg_val_loss = avg_val_loss.item()
    print("Training Loss: %.4f. Validation Loss: %.4f. " % (avg_train_loss, avg_val_loss))
    print("Training Perplexity: %.4f. Validation Perplexity: %.4f. " % (np.exp(avg_train_loss), np.exp(avg_val_loss)))

-----------------------------------
Epoch 1
-----------------------------------


  0%|          | 0/227 [00:00<?, ?it/s]



  0%|          | 0/8 [00:00<?, ?it/s]

Training Loss: 5.5072. Validation Loss: 5.1946. 
Training Perplexity: 246.4544. Validation Perplexity: 180.2892. 
-----------------------------------
Epoch 2
-----------------------------------


  0%|          | 0/227 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]

Training Loss: 5.2549. Validation Loss: 5.1946. 
Training Perplexity: 191.5023. Validation Perplexity: 180.2892. 
-----------------------------------
Epoch 3
-----------------------------------


  0%|          | 0/227 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]

Training Loss: 5.2549. Validation Loss: 5.1946. 
Training Perplexity: 191.5050. Validation Perplexity: 180.2892. 
-----------------------------------
Epoch 4
-----------------------------------


  0%|          | 0/227 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]

Training Loss: 5.2549. Validation Loss: 5.1946. 
Training Perplexity: 191.5006. Validation Perplexity: 180.2892. 
-----------------------------------
Epoch 5
-----------------------------------


  0%|          | 0/227 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]

Training Loss: 5.2550. Validation Loss: 5.1946. 
Training Perplexity: 191.5145. Validation Perplexity: 180.2892. 


### **2.1: Report Section: Seq2Seq Results [4 pts]**
Please edit this section to answer the following questions:

1) Put your loss & perplexities from training here, both before and after hyperparameter tuning.

2) Explain what you did here as well.


### Part 3: Train a Transformer

We will be implementing a one-layer Transformer **encoder** which, similar to an RNN, can encode a sequence of inputs and produce a final output of possibility of tokens in target language. 

You can refer to the [original paper](https://arxiv.org/pdf/1706.03762.pdf) for more details.

#### The Corpus of Linguistic Acceptability (CoLA)

The Corpus of Linguistic Acceptability ([CoLA](https://nyu-mll.github.io/CoLA/)) in its full form consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by their original authors. Native English speakers consistently report a sharp contrast in acceptability between pairs of sentences. 
Some examples include:

`What did Betsy paint a picture of?` (Correct)

`What was a picture of painted by Betsy?` (Incorrect)

You can read more info about the dataset [here](https://arxiv.org/pdf/1805.12471.pdf). This is a binary classification task (predict 1 for correct grammar and 0 otherwise).

We will be using this dataset as a sanity checker for the forward pass of the Transformer architecture discussed in class. The general intuitive notion is that we will _encode_ the sequence of tokens in the sentence, and then predict a binary output based on the final state that is the output of the model.

#### Load the preprocessed data

We've appended a "CLS" token to the beginning of each sequence, which can be used to make predictions. The benefit of appending this token to the beginning of the sequence (rather than the end) is that we can extract it quite easily (we don't need to remove paddings and figure out the length of each individual sequence in the batch). We'll come back to this.

We've additionally already constructed a vocabulary and converted all of the strings of tokens into integers which can be used for vocabulary lookup for you. Feel free to explore the data here.

In [11]:
# Cell 19
train_inxs = np.load('./data/train_inxs.npy')
val_inxs = np.load('./data/val_inxs.npy')
train_labels = np.load('./data/train_labels.npy')
val_labels = np.load('./data/val_labels.npy')

# load dictionary
word_to_ix = {}
with open("./data/word_to_ix.csv", "r") as f:
    reader = csv.reader(f)
    for line in reader:
        word_to_ix[line[0]] = line[1]
print("Vocabulary Size:", len(word_to_ix))
        
print(train_inxs.shape) # 7000 training instances, of (maximum/padded) length 43 words.
print(val_inxs.shape) # 1551 validation instances, of (maximum/padded) length 43 words.
print(train_labels.shape)
print(val_labels.shape)

d1 = torch.load('./data/d1.pt') 
d2 = torch.load('./data/d2.pt')
d3 = torch.load('./data/d3.pt')
d4 = torch.load('./data/d4.pt')

Vocabulary Size: 1542
(7000, 43)
(1551, 43)
(7000,)
(1551,)


Instead of using numpy for this model, we will be using Pytorch to implement the forward pass. You will not need to implement the backward pass for the various layers in this assigment.

The file `models/Transformer.py` contains the model class and methods for each layer. This is where you will write your implementations.

#### 3.1 Embeddings

We will format our input embeddings similarly to how they are constructed in [BERT (source of figure)](https://arxiv.org/pdf/1810.04805.pdf). Recall from lecture that unlike a RNN, a Transformer does not include any positional information about the order in which the words in the sentence occur. Because of this, we need to append a positional encoding token at each position. (We will ignore the segment embeddings and [SEP] token here, since we are only encoding one sentence at a time). We have already appended the [CLS] token for you in the previous step.

Your first task is to implement the embedding lookup, including the addition of positional encodings. Open the file `transformer.py` and complete all code parts.

In [70]:
# Cell 20
from models.Transformer import TransformerTranslator

inputs = train_inxs[0:2]
inputs = torch.LongTensor(inputs)

model = TransformerTranslator(input_size=len(word_to_ix), output_size=2, device='cpu', hidden_dim=128, num_heads=2, dim_feedforward=2048, dim_k=96, dim_v=96, dim_q=96, max_length=train_inxs.shape[1])

embeds = model.embed(inputs)
print(inputs.size())
print(d1.size())
print(embeds.size())
try:
    print("Difference:", torch.sum(torch.pairwise_distance(embeds, d1)).item()) # should be very small (<0.01)
except:
    print("NOT IMPLEMENTED")

torch.Size([2, 43])
torch.Size([2, 43, 128])
torch.Size([2, 43, 128])
Difference: 0.0017998493276536465


#### 3.2 Multi-head Self-Attention

We want to have multiple self-attention operations, computed in parallel. Each of these is called a *head*. We concatenate the heads and multiply them with the matrix `attention_head_projection` to produce the output of this layer.

After every multi-head self-attention and feedforward layer, there is a residual connection + layer normalization. 

Open the file `models/transformer.py` and implement the `multihead_attention` function. 
We have already initialized all of the layers you will need in the constructor.

In [71]:
# Cell 21
hidden_states = model.multi_head_attention(embeds)

try:
    print("Difference:", torch.sum(torch.pairwise_distance(hidden_states, d2)).item()) # should be very small (<0.01)
except:
    print("NOT IMPLEMENTED")

Difference: 0.0017089800676330924


#### 3.3 Element-Wise Feed-forward Layer

Open the file `models/transformer.py` and complete codes: Include layer norm and addition as per transformer diagram.

In [72]:
# Cell 22
outputs = model.feedforward_layer(hidden_states)

try:
    print("Difference:", torch.sum(torch.pairwise_distance(outputs, d3)).item()) # should be very small (<0.01)
except:
    print("NOT IMPLEMENTED")

Difference: 0.001713332487270236


#### 3.4 Final Layer

Open the file `models/transformer.py` and complete codes, to produce logits for all tokens in target language.

NOTE: Since the transformer is for translation and not classification, the size of the `scores` tensor will be \[2, 43, 2\]. This is okay for our purposes.

In [73]:
# Cell 23
scores = model.final_layer(outputs)

try:
    print("Difference:", torch.sum(torch.pairwise_distance(scores, d4)).item()) # should be very small (<1e-5)
except:
    print("NOT IMPLEMENTED")

Difference: 2.6047251594718546e-05


#### 3.5 Forward Pass

Open the file `models/Transformer.py` and complete the method `forward`, by putting together all of the methods you have developed in the right order to perform a full forward pass.

In [74]:
# Cell 24
inputs = train_inxs[0:2]
inputs = torch.LongTensor(inputs)

outputs = model.forward(inputs)

try:
    print("Difference:", torch.sum(torch.pairwise_distance(outputs, scores)).item()) # should be very small (<1e-5)
except:
    print("NOT IMPLEMENTED")

Difference: 2.6229748982586898e-05


Great! We've just implemented a Transformer forward pass for translation. One of the big perks of using PyTorch is that with a simple training loop, we can rely on automatic differentation ([autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)) to do the work of the backward pass for us. This is not required for this assignment, but you can explore this on your own.

#### 3.6 Training and Hyperparameter Tuning

Now you can start training the Transformer translator on the original sequence to sequence task. We provided you with some training code and you can simply run them to see how your translator works. If you implemented everything correctly, you should see some meaningful translation in the output. Compare the results from the Seq2Seq model, which one is better? You can modify the hyperparameters to improve the results. You can also tune the BATCH_SIZE in section Preprocess Data.

In [None]:
# Cell 25
from models.Transformer import TransformerTranslator

# Hyperparameters
learning_rate = 1e-1
EPOCHS = 5

# Model
model = TransformerTranslator(input_size, output_size, device, max_length = MAX_LEN).to(device)

# optimizer = optim.Adam(model.parameters(), lr = learning_rate)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

In [None]:
for epoch_idx in range(EPOCHS):
    print("-----------------------------------")
    print("Epoch %d" % (epoch_idx+1))
    print("-----------------------------------")
    
    train_loss, avg_train_loss = train(model, train_loader, optimizer, criterion)
    scheduler.step(train_loss)

    val_loss, avg_val_loss = evaluate(model, valid_loader, criterion)

    avg_train_loss = avg_train_loss.item()
    avg_val_loss = avg_val_loss.item()
    print("Training Loss: %.4f. Validation Loss: %.4f. " % (avg_train_loss, avg_val_loss))
    print("Training Perplexity: %.4f. Validation Perplexity: %.4f. " % (np.exp(avg_train_loss), np.exp(avg_val_loss)))

### **Report Section: Transformer Results [5 pts]**
Please edit this section to answer the following questions:

1) Put your loss & perplexities from training here, both before and after hyperparameter tuning.

2) Explain what you did here as well.

**Translations**

Run the code below to see some of your translations. Modify to your liking.

In [None]:
# Cell 26
def translate(model, dataloader):
    model.eval()
    with torch.no_grad():
        # Get the progress bar 
        #progress_bar = tqdm(dataloader, asci = True)
        for batch_idx, data in enumerate(dataloader):
            source = data.src.transpose(1,0)
            target = data.trg.transpose(1,0)

            translation = model(source)
            return target, translation

In [None]:
# Cell 27
# Select Transformer or Seq2Seq model
# model = trans_model
model = seq2seq_model

In [None]:
# Cell 28
#Set model equal to trans_model or seq2seq_model
target, translation = translate(model, valid_loader)

In [None]:
# Cell 29
raw = np.array([list(map(lambda x: TRG.vocab.itos[x], target[i])) for i in range(target.shape[0])])
print(raw)

[['<sos>' 'a' 'man' ... '<pad>' '<pad>' '<pad>']
 ['<sos>' 'a' 'man' ... '<pad>' '<pad>' '<pad>']
 ['<sos>' 'a' 'man' ... '<pad>' '<pad>' '<pad>']
 ...
 ['<sos>' 'boy' 'doing' ... '<pad>' '<pad>' '<pad>']
 ['<sos>' 'kids' 'are' ... '<pad>' '<pad>' '<pad>']
 ['<sos>' 'a' 'man' ... '<pad>' '<pad>' '<pad>']]


In [None]:
# Cell 30
token_trans = np.argmax(translation.cpu().numpy(), axis = 2)
translated = np.array([list(map(lambda x: TRG.vocab.itos[x], token_trans[i])) for i in range(token_trans.shape[0])])
print(translated)

[['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']
 ['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']
 ['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']
 ...
 ['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']
 ['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']
 ['<unk>' 'a' 'man' ... '<eos>' '<eos>' '<eos>']]


### Zip submission and submit to Gradescope

If on Linux or Mac, run the following cell:

In [None]:
!cd /content/drive/MyDrive/hw4_student_version
%rm -rf assignment4_submission.zip
%zip -r assignment4_submission.zip models/ Machine_Translation.ipynb

UsageError: Line magic function `%zip` not found.


If on Windows, run the following cell:

In [None]:
!cd /content/drive/MyDrive/hw4_student_version/
!collect_submission.bat

/bin/bash: collect_submission.bat: command not found


In [None]:
!sh collect_submission.sh

  adding: models/ (stored 0%)
  adding: models/seq2seq/ (stored 0%)
  adding: models/seq2seq/__pycache__/ (stored 0%)
  adding: models/seq2seq/__pycache__/Decoder.cpython-37.pyc (deflated 43%)
  adding: models/seq2seq/__pycache__/Encoder.cpython-37.pyc (deflated 44%)
  adding: models/seq2seq/__pycache__/Seq2Seq.cpython-37.pyc (deflated 40%)
  adding: models/seq2seq/Decoder.py (deflated 74%)
  adding: models/seq2seq/Encoder.py (deflated 74%)
  adding: models/seq2seq/Seq2Seq.py (deflated 72%)
  adding: models/naive/ (stored 0%)
  adding: models/naive/__pycache__/ (stored 0%)
  adding: models/naive/__pycache__/RNN.cpython-37.pyc (deflated 52%)
  adding: models/naive/__pycache__/LSTM.cpython-37.pyc (deflated 47%)
  adding: models/naive/RNN.py (deflated 72%)
  adding: models/naive/LSTM.py (deflated 74%)
  adding: models/__pycache__/ (stored 0%)
  adding: models/__pycache__/Transformer.cpython-37.pyc (deflated 58%)
  adding: models/Transformer.py (deflated 81%)
  adding: Machine_Translation.