## Neural Machine Translation with Attention

In this notebook we will train a `seq2seq` model for English-to-Turkish
translation. Once the model is trained, you will be able to translate
English sentences into Turkish.

In [11]:
import os
import time

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import tensorflow as tf
from IPython.display import Video
from sklearn.model_selection import train_test_split

import utils

### Sections of the Notebook
1. [Loading the Dataset](#load)
2. [Tokenizing and Encoding](#tokenize_encode)
3. [Embeddings](#embeddings)
4. [The Model and Training](#model)<br>
    4.1 [Custom Loss and Accuracy](#loss_acc)
5. [Exercises](#exercises)

<a id="load"></a>
### 1. Loading the Dataset

We'll use a language dataset provided by http://www.manythings.org/anki/, which
offers translation datasets between English and 80 other languages. Each
dataset is a tab-separated file with three columns. The first column is a
sentence in one of the 80 languages, and the second is its English translation.
The third column records the source of the row; we can ignore it for our
purposes.
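To make the file layout concrete, here is a minimal sketch of reading such a tab-separated file with the standard library. The sample rows are made up for illustration, and `read_pairs` is a hypothetical helper; the notebook's own `utils.load_dataset` may parse the file differently.

```python
import csv
import io

# Made-up sample rows in the three-column layout described above
# (sentence, translation, source); they are not taken from the real file.
SAMPLE = (
    "Merhaba.\tHello.\tsource note A\n"
    "Teşekkürler.\tThanks.\tsource note B\n"
)


def read_pairs(handle):
    """Return (first_column, second_column) tuples, dropping the source column."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs)
```

The same function accepts any open text file, so a file opened with `encoding="utf-8"` could be passed in place of the `io.StringIO` object.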

In [21]:
path_to_zip = tf.keras.utils.get_file(
    "tur-eng.zip", origin="https://github.com/haluk/NLP_course_materials/blob/master/hw4/tur-eng.zip?raw=true",
    extract=True)
path_to_file = os.path.join(os.path.dirname(path_to_zip), "tur.txt")

input_tensor, target_tensor, inp_lang, targ_lang = utils.load_dataset(
    path_to_file, None
)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]

# Creating training and validation sets using an 80-20 split
(
    input_tensor_train,
    input_tensor_val,
    target_tensor_train,
    target_tensor_val,
) = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(
    "{:15s} => {:10s}: {}\t{:15s}: {}".format(
        "Input language",
        "Training size",
        len(input_tensor_train),
        "Validation size",
        len(input_tensor_val),
    )
)
print(
    "{:15s} => {:10s}: {}\t{:15s}: {}".format(
        "Target language",
        "Training size",
        len(target_tensor_train),
        "Validation size",
        len(target_tensor_val),
    )
)

Input language  => Training size: 106280	Validation size: 26571
Target language => Training size: 106280	Validation size: 26571
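The sizes printed above can be checked by hand. The sketch below reproduces them, assuming that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set; the two reported sizes are copied from the output above.

```python
import math

n_train_reported = 106280
n_val_reported = 26571
n_total = n_train_reported + n_val_reported  # 132851 sentence pairs

test_size = 0.2
# Assumption: the validation share is rounded up, and the training set
# takes whatever is left.
n_val = math.ceil(test_size * n_total)
n_train = n_total - n_val

print(n_total, n_train, n_val)  # 132851 106280 26571
```

Because 20% of 132851 is 26570.2, the validation set ends up one row larger than a plain truncation would give.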


We will use the `tf.data.Dataset` API to build an optimized input pipeline that
keeps the GPU from starving for data. In general such a pipeline reads the data
(text in our case), creates batches and sends them to the GPU. Here the encoded
sentences are already in memory, so we only need to shuffle and batch them.

In [22]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices(
    (input_tensor_train, target_tensor_train)
).shuffle(BUFFER_SIZE)

dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
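To see what shuffling followed by batching with `drop_remainder=True` does to the training set, here is a pure-Python stand-in. It is only an illustration of the resulting batch structure, not how `tf.data` is implemented; `shuffled_batches` is a hypothetical helper, and the training size is the one printed earlier.

```python
import random


def shuffled_batches(n_examples, batch_size, seed=0):
    """Emulate shuffle(...) followed by batch(..., drop_remainder=True)."""
    indices = list(range(n_examples))
    # A buffer as large as the dataset amounts to a full shuffle.
    random.Random(seed).shuffle(indices)
    n_full = n_examples // batch_size  # the incomplete last batch is dropped
    return [indices[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


batches = shuffled_batches(n_examples=106280, batch_size=64)
n_dropped = 106280 - len(batches) * 64

print(len(batches), n_dropped)  # 1660 40
```

The batch count matches `steps_per_epoch`, and every batch has exactly 64 rows, which keeps tensor shapes fixed during training at the cost of skipping 40 examples per pass.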

## Seq2Seq Models

We will use [Jay Alammar's](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) wonderful visualizations to explain the `seq2seq` model and the `attention` mechanism.

In [23]:
Video("https://jalammar.github.io/images/seq2seq_2.mp4", width=900, height=200)

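A Neural Machine Translation (NMT) model is composed of an `encoder` and a `decoder`. The encoder part of the model processes the input sentence one item at a time and compiles what it captures into a context vector, which the decoder then uses to produce the output sentence item by item.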

In [24]:
Video("https://jalammar.github.io/images/seq2seq_4.mp4", width=900, height=200)
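The encoder's job of folding a whole sentence into a single vector can be sketched with a plain tanh recurrent cell in pure Python. This is a toy illustration with random weights and made-up sizes, not the model trained in this notebook; `encode` and all the names below are hypothetical.

```python
import math
import random


def encode(inputs, w_x, w_h):
    """Run a plain tanh RNN over `inputs` and return the final hidden state."""
    hidden = [0.0] * len(w_h)
    for x in inputs:
        # Each new hidden unit mixes the current input with the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_x[j], x))
                + sum(w * hi for w, hi in zip(w_h[j], hidden))
            )
            for j in range(len(w_h))
        ]
    return hidden


rng = random.Random(0)
input_dim, hidden_dim = 3, 4  # toy sizes, far smaller than 256 and 1024
w_x = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(hidden_dim)]
w_h = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]

# Five random vectors stand in for the embeddings of a five-word sentence.
sentence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(5)]
context = encode(sentence, w_x, w_h)
print(len(context))  # 4
```

Whatever the sentence length, the context has the same size as the hidden state. An attention mechanism relaxes this bottleneck by also letting the decoder look at the intermediate hidden states.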

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Please contact Haluk Dogan (<a href="mailto:hdogan@vivaldi.net">hdogan@vivaldi.net</a>) for further questions or inquiries.