## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2024.

## **Task 3: RNN Application -- Neural Machine Translation** (25%)

In this task, you are going to perform neural machine translation (NMT). NMT involves using a neural network to translate from one language to another. This is a widely studied natural language processing (NLP) problem and has tremendous real-world applications.

Machine Translation is a challenging task that involves both the usage of complex architectures and data processing tricks to obtain human-level performance. **In this notebook, you will implement a simple Seq2Seq architecture using RNN layers in keras.**

**The goal is to train a model to translate from Dutch (input language) to English (target language)**. This notebook uses data from the [Tab Delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/) repository, which also contains many other language pairs.

## <span style="color:red"><strong>NOTE: Training this model may take 10-15 minutes depending on your hardware, so please plan accordingly.</strong></span>

In [1]:
# Import modules
import tensorflow as tf
import numpy as np
import json
import matplotlib.pyplot as plt

%matplotlib inline

%load_ext autoreload
%autoreload 2

### Broad Overview of Steps:
1. Preprocess and encode data
2. Create dataset/dataloaders
3. Define Model Architecture
4. Train Model
5. Evaluate results

Step 1 has already been completed for you. We provide two .npy files that contain the data:
- `nmt_eng.npy` contains the encoded English sentences.
- `nmt_nl.npy` contains the encoded Dutch sentences.

The sentences have already been normalized, padded, and wrapped with the start (`[SOS]`) and end (`[EOS]`) tokens.

We also provide two vocabulary files `eng_vocab.txt` and `nl_vocab.txt` for the English and Dutch languages respectively. The vocabulary files will be used for decoding the input and output of our model.
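To make the preprocessing in Step 1 concrete, here is a minimal sketch of how a sentence could be turned into a fixed-length id sequence. The ids 0, 2 and 3 for `[PAD]`, `[SOS]` and `[EOS]` match the sample printed in Part 2; the id 1 for `[UNK]`, the toy vocabulary, and the helper name `encode_sentence` are assumptions for illustration and not the course's actual preprocessing code.

```python
# Toy word -> id mapping. Ids 0, 2, 3 follow the sample output in this notebook;
# id 1 for [UNK] and the word ids below are assumed for illustration.
PAD, UNK, SOS, EOS = 0, 1, 2, 3
toy_word_to_id = {"ik": 4, "bel": 5, "je": 6, "morgen": 7}

def encode_sentence(sentence, word_to_id, max_len=30):
    """Lowercase, split on whitespace, add [SOS]/[EOS], and pad to max_len."""
    tokens = sentence.lower().split()
    ids = [SOS] + [word_to_id.get(tok, UNK) for tok in tokens] + [EOS]
    ids = ids[:max_len]  # truncate overly long sentences
    return ids + [PAD] * (max_len - len(ids))

encoded = encode_sentence("Ik bel je morgen", toy_word_to_id, max_len=8)
print(encoded)  # [2, 4, 5, 6, 7, 3, 0, 0]
```

Decoding with the provided vocabulary files goes in the opposite direction: each id is looked up to recover its token.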

## Part 1. Load Encoded Data

<font color="red"><strong>TODO:</strong></font> Execute the following cells to load the text data.

In [2]:
# Load vocabulary files (dictionaries of int:word pairs after the key conversion below)
with open("text_data/eng_vocab.txt", 'r') as f:
    eng_vocab = json.load(f)

with open("text_data/nl_vocab.txt", 'r') as f:
    nl_vocab = json.load(f)
    
eng_vocab = {int(key): value for key, value in eng_vocab.items()}
nl_vocab = {int(key): value for key, value in nl_vocab.items()}

print(f'Size of english vocab: {len(eng_vocab)}')
print(f'Size of dutch vocab: {len(nl_vocab)}')

Size of english vocab: 9044
Size of dutch vocab: 10000


In [3]:
# Load Encoded Sentence Data
eng_text = np.load("text_data/nmt_eng.npy")
nl_text = np.load("text_data/nmt_nl.npy")

print(f'Shape of english text data: {eng_text.shape}')
print(f'Shape of dutch text data: {nl_text.shape}')

Shape of english text data: (75298, 30)
Shape of dutch text data: (75298, 30)


## Part 2: Datasets and Dataloading (3%)

<font color="red"><strong>TODO:</strong></font> <b>Complete the functions in utils/translation/text_data.py</b>

This will create the train, validation, and test datasets for our translation model.
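As a rough illustration of the partitioning idea, the sketch below shuffles example indices and splits them 90/10. The 90/10 ratio is inferred from the sizes printed in the next cell; the function name `split_indices` and the seed are hypothetical, and this is not the expected implementation of `get_dataset_partitions_tf`.

```python
import random

def split_indices(n, train_frac=0.9, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and validation lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = int(train_frac * n)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(75298)
print(len(train_idx), len(val_idx))  # 67768 7530
```

With `tf.data`, the same idea is typically expressed with `Dataset.shuffle`, `Dataset.take`, and `Dataset.skip`.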

In [4]:
from utils.translation.text_data import get_dataset, get_dataset_partitions_tf, decode_text

text_ds = get_dataset(nl_text, eng_text)
train_ds, val_ds = get_dataset_partitions_tf(text_ds, len(text_ds))
print(f"Train size: {len(train_ds)}")
print(f"Validation size: {len(val_ds)}")

Train size: 67768
Validation size: 7530


In [5]:
# Let's have a look at a sample from the dataset
sample = next(iter(train_ds))
sample[0], sample[1]

(<tf.Tensor: shape=(30,), dtype=int64, numpy=
 array([  2,  57,  82,  27,   9, 167,  62, 341,   7,   3,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0])>,
 <tf.Tensor: shape=(30,), dtype=int64, numpy=
 array([  2,  28,  63,  47,   6,  50,   9, 319, 158,   8,   3,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0])>)

In [6]:
decoded_nl = decode_text(sample[0].numpy(), vocab=nl_vocab)
decoded_eng = decode_text(sample[1].numpy(), vocab=eng_vocab)
print('NL:', decoded_nl)
print()
print('EN:', decoded_eng)

NL: ['[SOS]', 'hoe', 'laat', 'ben', 'je', 'gisteren', 'gaan', 'slapen', '?', '[EOS]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

EN: ['[SOS]', 'what', 'time', 'did', 'you', 'go', 'to', 'sleep', 'yesterday', '?', '[EOS]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


## Part 3: Model Architecture (15%)

### Seq2Seq Model

In the general case, input sequences and output sequences have different lengths (e.g. machine translation), and the entire input sequence is required before the model can start predicting the target. This requires a more advanced setup, which is what people commonly mean by "sequence-to-sequence models" when no further context is given. Here's how it works (the description follows a character-level English-to-French example; in this notebook the sequence elements are word tokens and the languages are Dutch and English):

- An RNN layer (or stack thereof) acts as "encoder": it processes the input sequence and returns its own internal state. Note that we discard the outputs of the encoder RNN, only recovering the state. This state will serve as the "context", or "conditioning", of the decoder in the next step.
- Another RNN layer (or stack thereof) acts as "decoder": it is trained to predict the next characters of the target sequence, given previous characters of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences but offset by one timestep in the future, a training process called "teacher forcing" in this context. Importantly, the decoder uses as initial state the state vectors from the encoder, which is how the decoder obtains information about what it is supposed to generate. Effectively, the decoder learns to generate targets[t+1...] given targets[...t], conditioned on the input sequence.

![teacher_forcing](./img/seq2seq-teacher-forcing.png)
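The one-step offset used in teacher forcing can be sketched with plain lists. The ids are the English target from the sample printed in Part 2 (padding shortened for readability); the variable names are illustrative only.

```python
# English target from the sample shown in Part 2 (padding shortened).
target = [2, 28, 63, 47, 6, 50, 9, 319, 158, 8, 3, 0, 0]

decoder_input = target[:-1]   # starts with [SOS]; what the decoder is fed
decoder_target = target[1:]   # same sequence shifted left by one; what it must predict

# At every position the decoder sees the true previous token and predicts the next one.
for fed, expected in list(zip(decoder_input, decoder_target))[:3]:
    print(f"fed {fed:>3} -> predict {expected}")
```

This is also why the evaluation cells later compare predictions against `val_target[:, 1:]`: the leading `[SOS]` token is never a prediction target.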

In inference mode, i.e. when we want to decode unknown input sequences, we go through a slightly different process:

1) Encode the input sequence into state vectors.
2) Start with a target sequence of size 1 (just the start-of-sequence character).
3) Feed the state vectors and 1-char target sequence to the decoder to produce predictions for the next character.
4) Sample the next character using these predictions (we simply use argmax).
5) Append the sampled character to the target sequence
6) Repeat until we generate the end-of-sequence character or we hit the character limit.

![seq2seq-inference](./img/seq2seq-inference.png)
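The inference steps above can be sketched as a greedy decoding loop. `toy_decoder_step` is a hypothetical stand-in that replays a fixed script; in the real model, each step would run the decoder RNN and produce scores over the target vocabulary.

```python
SOS, EOS = 2, 3  # ids taken from the sample output in Part 2

def toy_decoder_step(state, last_token):
    """Hypothetical stand-in for one decoder step: returns (scores, new_state).

    It replays a fixed script so the loop is easy to follow; a real decoder
    would compute the scores from its RNN state and the last token.
    """
    script = [5, 6, 7, EOS]
    scores = [0.0] * 8
    scores[script[min(state, len(script) - 1)]] = 1.0
    return scores, state + 1

def greedy_decode(encoder_state, step_fn, max_len=30):
    tokens = [SOS]                      # step 2: start with the start token only
    state = encoder_state               # step 1: state produced by the encoder
    while len(tokens) < max_len:        # step 6: stop at the length limit
        scores, state = step_fn(state, tokens[-1])                    # step 3
        next_token = max(range(len(scores)), key=scores.__getitem__)  # step 4: argmax
        tokens.append(next_token)       # step 5
        if next_token == EOS:           # step 6: stop at the end token
            break
    return tokens

print(greedy_decode(0, toy_decoder_step))  # [2, 5, 6, 7, 3]
```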

The seq2seq model requires a more complex setup than `keras.Sequential` provides.
You will write modular code using custom `keras.layers.Layer` and `keras.Model` subclasses. **First, please read https://keras.io/guides/making_new_layers_and_models_via_subclassing/** to get an idea of how custom modules are written in TensorFlow/Keras, which is the common practice for implementing complex architectures.

<font color="red"><strong>TODO:</strong></font> <b>Based on the above, you need to complete the code in utils/translation/layers.py</b>

In [7]:
# Batch, cache, and prefetch the datasets (prefetch goes last so it overlaps with training)
# You can change the batch size based on memory requirements
BATCH_SIZE = 64
train_loader = train_ds.batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)
val_loader = val_ds.batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)

In [8]:
from utils.translation.layers import TranslationModel

eng_vocab_size = len(eng_vocab)
nl_vocab_size = len(nl_vocab)
hidden_size = 256

# Initialize Model
model = TranslationModel(nl_vocab_size, eng_vocab_size, hidden_size, eng_vocab)

## Part 4: Training the Model (5%)

The following cell(s) will train your Machine Translation model. The loss function used is Cross Entropy (since we are performing classification across the vocabulary at each time step). In practice, we usually implement a machine translation metric such as BLEU or ROUGE ([reference](https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb#:~:text=While%20BLEU%20score%20is%20primarily,the%20reference%20translations%20or%20summaries.)), and compute it for the validation set after each epoch. For this task, it is sufficient to just observe the train loss values.
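To see what the per-time-step cross-entropy measures, here is a small framework-free sketch. Skipping padded positions is an assumed design choice; whether the provided training function masks padding is not shown in this notebook. The helper names are hypothetical.

```python
import math

def cross_entropy(probs, target_id):
    """Negative log-probability assigned to the correct token at one time step."""
    return -math.log(probs[target_id])

def sequence_loss(step_probs, targets, pad_id=0):
    """Mean per-token cross-entropy, skipping padded positions (an assumed choice)."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, targets) if t != pad_id]
    return sum(losses) / len(losses)

vocab_size = 9044  # English vocabulary size printed in Part 1
uniform = [1.0 / vocab_size] * vocab_size  # a model that knows nothing yet
loss = sequence_loss([uniform, uniform, uniform], targets=[28, 63, 0])
print(round(loss, 2))  # 9.11
```

A model that spreads probability uniformly over the 9044 English tokens has a loss of ln(9044), about 9.11, so a first-iteration loss near that value is consistent with an untrained model.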

You are already provided the `train_seq2seq_model` function in `utils/translation/train_funcs.py`. Refer to this file to see the loss function and how the custom training loop is implemented. Execute the cell below to train your model.

**Note that training will proceed as expected only if the implementation of your model is correct.** You can monitor the training loss to make sure that the model training is proceeding as expected. **Training may take 10-15 minutes depending on the strength of the system.**

If you have spare time, feel free to increase the number of epochs and check whether the performance improves.

In [9]:
from utils.translation.train_funcs import train_seq2seq_model

# Train the model. Use the Adam optimizer with 1e-3 learning rate.
num_epochs = 8
learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

train_seq2seq_model(
    model,
    train_loader,
    optimizer,
    num_epochs
)


Epoch: 1/8
Iter: 0, Loss (iter): 9.11008358001709, Mean Loss (over last 50 iters): 9.11008358001709
Iter: 50, Loss (iter): 5.750728607177734, Mean Loss (over last 50 iters): 6.467575550079346
Iter: 100, Loss (iter): 5.2706170082092285, Mean Loss (over last 50 iters): 5.489830493927002
Iter: 150, Loss (iter): 5.337277889251709, Mean Loss (over last 50 iters): 5.267693996429443
Iter: 200, Loss (iter): 5.086941242218018, Mean Loss (over last 50 iters): 5.1465654373168945
Iter: 250, Loss (iter): 5.174102783203125, Mean Loss (over last 50 iters): 5.066580772399902
Iter: 300, Loss (iter): 4.774022579193115, Mean Loss (over last 50 iters): 4.946993350982666
Iter: 350, Loss (iter): 4.859435558319092, Mean Loss (over last 50 iters): 4.828985214233398
Iter: 400, Loss (iter): 4.607417106628418, Mean Loss (over last 50 iters): 4.692594051361084
Iter: 450, Loss (iter): 4.604677677154541, Mean Loss (over last 50 iters): 4.642575263977051
Iter: 500, Loss (iter): 4.480602741241455, Mean Loss (over las

## Part 5: Evaluating Results (2%)

Our training function only shows the training loss value. To assess the performance of the model, we can perform some predictions and decode the input/output sentences. 

<font color="red"><strong>TODO:</strong></font> Run the following cells to qualitatively assess the quality of the generated sentences and the performance of the trained model.

**NOTE**: As we are dealing with a generation task, the outputs will vary depending on the final trained model. Therefore, we have provided a set of example outputs with the translation quality you can expect from the trained model. Your results may be different.
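If you want a rough number to accompany the qualitative inspection, a clipped unigram precision is one simple option. This is only one ingredient of BLEU, not the full metric, and the function name is hypothetical; the sentence pair is taken from the sample outputs in this notebook.

```python
from collections import Counter

def clipped_unigram_precision(prediction, reference):
    """Fraction of predicted tokens that also occur in the reference (counts clipped)."""
    pred_counts = Counter(prediction)
    ref_counts = Counter(reference)
    overlap = sum(min(count, ref_counts[tok]) for tok, count in pred_counts.items())
    return overlap / max(len(prediction), 1)

reference = "i want to visit you".split()
prediction = "i want to see you".split()
print(clipped_unigram_precision(prediction, reference))  # 0.8
```

Clipping prevents a prediction from earning credit by repeating a correct word more often than it appears in the reference.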



In [10]:
# Run these cells to evaluate your model on one batch of the validation set
val_sample = val_loader.shuffle(10000).take(1)
val_sample = next(iter(val_sample))

In [11]:
val_inp, val_target = val_sample
decoded_inputs = []
reserved_tokens = ['[PAD]', '[SOS]', '[EOS]']
for inp in val_inp.numpy():
    decoded_text = decode_text(inp, vocab=nl_vocab)
    decoded_inputs.append([token for token in decoded_text if token not in reserved_tokens])

val_pred = model(val_inp, training=False)
decoded_outputs = model.decode_tokens(val_pred)
decoded_ground_truth = model.decode_tokens(val_target[:, 1:])

samples_to_show = 10 # Should be <= batch size

for i, decoded_data in enumerate(zip(decoded_inputs, decoded_ground_truth, decoded_outputs)):
    nl_sentence = ' '.join(decoded_data[0])
    gt_en_sentence = ' '.join(decoded_data[1])
    pred_en_sentence = ' '.join(decoded_data[2])
    print('Sample:', i+1)
    print('Dutch Sentence: ', nl_sentence)
    print('English Sentence (Truth): ', gt_en_sentence)
    print('English Sentence (Pred)', pred_en_sentence)
    if i >= samples_to_show - 1:  # stop after exactly samples_to_show samples
        break

Sample: 1
Dutch Sentence:  mijn vrienden gingen zonder mij naar de film
English Sentence (Truth):  my friends went to the movies without me
English Sentence (Pred) my parents went to the movies without my cousin
Sample: 2
Dutch Sentence:  ik kan autorijden
English Sentence (Truth):  i am able to drive a car
English Sentence (Pred) i can drive a car
Sample: 3
Dutch Sentence:  ik wil je bezoeken
English Sentence (Truth):  i want to visit you
English Sentence (Pred) i want to see you
Sample: 4
Dutch Sentence:  ik probeerde te ontspannen maar dat lukte niet
English Sentence (Truth):  i tried to relax but couldn t
English Sentence (Pred) i tried to hear that i didn t see anything
Sample: 5
Dutch Sentence:  ik bel je morgen
English Sentence (Truth):  i ll call you tomorrow
English Sentence (Pred) i ll call you tomorrow
Sample: 6
Dutch Sentence:  je zult er op tijd aankomen zolang je tenminste de trein niet mist
English Sentence (Truth):  you ll get there in time so long as you don t miss the

### Example of Expected Outputs

In [12]:
"""
Sample: 1
Dutch Sentence:  waar hebben jullie het verstopt ?
English Sentence (Truth):  where did you hide it ?
English Sentence (Pred) where did you do it ?
Sample: 2
Dutch Sentence:  denk jij dit ?
English Sentence (Truth):  is that what you think ?
English Sentence (Pred) do you think this ?
Sample: 3
Dutch Sentence:  de treinen rijden s nachts minder vaak
English Sentence (Truth):  the trains don t run as often at night
English Sentence (Pred) the [UNK] [UNK] to [UNK] a week
Sample: 4
Dutch Sentence:  ik weet hoe dit werkt
English Sentence (Truth):  i know how this works
English Sentence (Pred) i know that this dictionary
Sample: 5
Dutch Sentence:  houd je toespraak kort
English Sentence (Truth):  keep your speech short
English Sentence (Pred) look at your country
Sample: 6
Dutch Sentence:  help me alsjeblieft een trui uit te kiezen die bij mijn nieuwe jurk past
English Sentence (Truth):  please help me pick out a sweater which matches my new dress
English Sentence (Pred) please give me a new dictionary for me this year ago
Sample: 7
Dutch Sentence:  welk jaar is het ?
English Sentence (Truth):  what year is it ?
English Sentence (Pred) what is the last ?
Sample: 8
Dutch Sentence:  welk verschil is er tussen dit en dat ?
English Sentence (Truth):  what is the difference between this and that ?
English Sentence (Pred) what s the difference between this bird ?
Sample: 9
Dutch Sentence:  is dat onze bus ?
English Sentence (Truth):  is that our bus ?
English Sentence (Pred) is this the book ?
Sample: 10
Dutch Sentence:  kun je me vannacht een [UNK] doen en op mijn kinderen oppassen ?
English Sentence (Truth):  could you do me a favor and [UNK] my kids tonight ?
English Sentence (Pred) can you please tell the truth to me a doctor ?
Sample: 11
Dutch Sentence:  ik bleef daar
English Sentence (Truth):  i stayed there
English Sentence (Pred) i felt [UNK]
"""


### <font color="red"><strong>TODO:</strong></font> <b>Answer the following questions:</b>

1. **Describe your observations of the model's evaluation performance. Briefly explain any one method to improve the model architecture based on the lecture readings, or online sources.**

<span style="color:red">__Answer:__</span>



After increasing the number of epochs, the model's performance was fairly accurate, producing outputs similar to those in the expected output. One technique to improve it further, based on online resources, would be to introduce an attention mechanism (https://medium.com/@prakhargannu/attention-mechanism-in-deep-learning-simplified-d6a5830a079d), which would let the model focus dynamically on specific parts of the input sequence.
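As a rough illustration of the idea, the sketch below computes dot-product attention for a single decoder step. It is a generic textbook formulation with toy numbers, not part of the model implemented in this notebook; all names and values are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step."""
    # Score each encoder output by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    # The context vector is the attention-weighted sum of the encoder outputs.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs)) for i in range(dim)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values, one per input token
weights, context = dot_product_attention([0.0, 2.0], encoder_outputs)
print([round(w, 3) for w in weights])
```

The decoder would recompute these weights at every step, so each output word can draw on a different part of the input instead of relying on one fixed state vector.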

2. **During the data preprocessing, we encoded each word in the input/target language to a number based on the vocabulary. This is known as tokenization. Briefly explain any one other method of tokenization, and why it might be beneficial to this particular task.**

<span style="color:red">__Answer:__</span>



One such technique is subword tokenization (https://towardsdatascience.com/a-comprehensive-guide-to-subword-tokenisers-4bbd3bad9a7c). In our case it could be highly beneficial, as it can handle rare and out-of-vocabulary words and thereby improve generalization and translation quality. (I have also provided the reference, so I think a lengthier response is not needed.)
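A minimal sketch of how subword tokenization can avoid `[UNK]` for unseen words is shown below. It uses greedy longest-match splitting with a `##` continuation prefix, in the style of WordPiece inference; the toy vocabulary and the function name are assumptions, and a real subword vocabulary would be learned from the corpus.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"trein", "##en", "verjaar", "##dag", "##s"}  # hypothetical subword vocabulary
print(subword_tokenize("treinen", toy_vocab))     # ['trein', '##en']
print(subword_tokenize("verjaardag", toy_vocab))  # ['verjaar', '##dag']
```

Even if "treinen" never appeared as a whole word in the vocabulary, it can still be represented through pieces the model has seen.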

We implemented a simple LSTM-based seq2seq model. The performance may not be the best, since Dutch and English are naturally complex languages. The state-of-the-art translation models are based on Transformer Networks that use the attention mechanism. (further reading: https://nlpprogress.com/english/machine_translation.html)

## (BONUS) Part 6: Bidirectional LSTM (5%)

One simple modification that can significantly improve the quality of the generated sentences is to make the Encoder LSTM bidirectional. This helps because different languages tend to have different sentence structures, and in the case of Dutch, crucial information for a given word may not be available until later in the sentence.

**SIDE NOTE**: Changing the decoder to be bidirectional will not work in a text generation task (in our case, translation). Feel free to think of why this is the case. (No writing is required).
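To illustrate what "bidirectional" means for the encoder state, here is a toy sketch with a scalar RNN. For brevity both directions share the same toy weights; in a real bidirectional layer each direction has its own parameters. The function names and weights are assumptions, and this is not the `BidirectionalEncoder` you are asked to implement.

```python
import math

def run_rnn(inputs, w=0.5, u=0.8):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}); returns the final state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w * x + u * h)
    return h

def bidirectional_final_state(inputs):
    """Concatenate the final states of a forward pass and a reversed pass."""
    forward = run_rnn(inputs)
    backward = run_rnn(list(reversed(inputs)))
    return [forward, backward]

state = bidirectional_final_state([1.0, -2.0, 0.5])
print(len(state))  # 2: twice the size of a one-directional state
```

In Keras, wrapping an LSTM that has `return_state=True` in `tf.keras.layers.Bidirectional` returns separate forward and backward states. Because the combined state is larger than a one-directional one, the states must be combined or the decoder size adjusted before they can initialize the decoder.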

<font color="red"><strong>BONUS TODO:</strong></font> <b> Complete the BidirectionalEncoder Class in utils/translation/layers.py, and run the training and validation loops to compare the generated translations with the previous model you implemented.</b>

In [14]:
from utils.translation.layers import TranslationModel

eng_vocab_size = len(eng_vocab)
nl_vocab_size = len(nl_vocab)
hidden_size = 256

# Initialize the model with a bidirectional encoder (note bidirectional_encoder=True)
model = TranslationModel(nl_vocab_size, eng_vocab_size, hidden_size, eng_vocab, bidirectional_encoder=True)

In [17]:
from utils.translation.train_funcs import train_seq2seq_model

# Train the model. Use the Adam optimizer with 1e-3 learning rate.
num_epochs = 8
learning_rate = 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

train_seq2seq_model(
    model,
    train_loader,
    optimizer,
    num_epochs
)

Epoch: 1/8
Iter: 0, Loss (iter): 5.482997894287109, Mean Loss (over last 50 iters): 5.482997894287109
Iter: 50, Loss (iter): 5.286454200744629, Mean Loss (over last 50 iters): 5.355573654174805
Iter: 100, Loss (iter): 5.019895553588867, Mean Loss (over last 50 iters): 5.1938700675964355
Iter: 150, Loss (iter): 5.138948917388916, Mean Loss (over last 50 iters): 5.054823398590088
Iter: 200, Loss (iter): 4.7865447998046875, Mean Loss (over last 50 iters): 4.873647689819336
Iter: 250, Loss (iter): 4.744236946105957, Mean Loss (over last 50 iters): 4.716247081756592
Iter: 300, Loss (iter): 4.362544059753418, Mean Loss (over last 50 iters): 4.543221950531006
Iter: 350, Loss (iter): 4.470834255218506, Mean Loss (over last 50 iters): 4.418442726135254
Iter: 400, Loss (iter): 4.184449195861816, Mean Loss (over last 50 iters): 4.2786478996276855
Iter: 450, Loss (iter): 4.202212333679199, Mean Loss (over last 50 iters): 4.2479448318481445
Iter: 500, Loss (iter): 4.092261791229248, Mean Loss (over

In [18]:
# Run these cells to evaluate your model on one batch of the validation set
val_sample = val_loader.shuffle(10000).take(1)
val_sample = next(iter(val_sample))

In [19]:
val_inp, val_target = val_sample
decoded_inputs = []
reserved_tokens = ['[PAD]', '[SOS]', '[EOS]']
for inp in val_inp.numpy():
    decoded_text = decode_text(inp, vocab=nl_vocab)
    decoded_inputs.append([token for token in decoded_text if token not in reserved_tokens])

val_pred = model(val_inp, training=False)
decoded_outputs = model.decode_tokens(val_pred)
decoded_ground_truth = model.decode_tokens(val_target[:, 1:])

samples_to_show = 10 #Should be <= batch size

for i, decoded_data in enumerate(zip(decoded_inputs, decoded_ground_truth, decoded_outputs)):
    nl_sentence = ' '.join(decoded_data[0])
    gt_en_sentence = ' '.join(decoded_data[1])
    pred_en_sentence = ' '.join(decoded_data[2])
    print('Sample:', i+1)
    print('Dutch Sentence: ', nl_sentence)
    print('English Sentence (Truth): ', gt_en_sentence)
    print('English Sentence (Pred)', pred_en_sentence)
    if i >= samples_to_show - 1:  # stop after exactly samples_to_show samples
        break

Sample: 1
Dutch Sentence:  ik wil dat dingen veranderen
English Sentence (Truth):  i want things to change
English Sentence (Pred) i want to be prepared
Sample: 2
Dutch Sentence:  tom is nog steeds niet gewend aan het leven in de stad
English Sentence (Truth):  tom is still not accustomed to city life
English Sentence (Pred) tom isn t very interested in the [UNK]
Sample: 3
Dutch Sentence:  ik dacht dat je dat wist
English Sentence (Truth):  i thought you knew that
English Sentence (Pred) i thought that you said that
Sample: 4
Dutch Sentence:  waarom houdt iedereen van katten ?
English Sentence (Truth):  why does everybody love cats ?
English Sentence (Pred) why do people like this ?
Sample: 5
Dutch Sentence:  hij studeert [UNK]
English Sentence (Truth):  he is studying [UNK]
English Sentence (Pred) he is studying [UNK]
Sample: 6
Dutch Sentence:  ze vierde gisteren haar [UNK] verjaardag
English Sentence (Truth):  she celebrated her fifteenth birthday yesterday
English Sentence (Pred) sh

<font color="red"><strong>BONUS TODO:</strong></font> <b> Briefly describe the differences in quality of the translations between the model with and without a bidirectional encoder. (1-2 sentences is enough) </b>

<span style="color:red">__Answer:__</span>



In general, both models reach a sufficient level of accuracy, as the difference between their losses is within the limits mentioned on edstem. However, the model with the bidirectional encoder seems slightly more fluent and seems to handle uncertainty better. I think this should be more visible in the case of longer sentences.