# Translations 2.0 - Neural Machine Translation

According to the Google paper [*Attention is all you need*](https://arxiv.org/abs/1706.03762), you only need layers of Attention to make a Deep Learning model understand the complexity of a sentence. We will try to implement this type of model for our translator. 

### Data import 

You will use the same `.txt` file, in which each line contains a sentence and its translation separated by a tab (`\t`). You will have to import this data and read it with `pandas`.

Your data can be found on this link: https://go.aws/38ECHUB

### Preprocessing 

The whole purpose of your preprocessing is to express each (French) input sentence as a sequence of indices.

For example:

* je suis heureux ---> `[123, 21, 34, 0, 0, 0, 0, 0]`

This gives a *shape* -> `(batch_size, max_len_of_a_sentence)`.

The zeros come from zero-padding (see [*pad_sequences*](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)), which gives every sequence in a set the same length. Your algorithm requires this.

You will run the same preprocessing on the target sequences, and add a `<start>` token at the beginning of each sequence.

* `<start>` I am happy ---> `[1, 43, 2, 42, 0, 0]`

### Modeling 

For modeling, you will need to set up layers of attention. You'll need to: 

* Create an `Encoder` class that inherits from `tf.keras.Model`.
* Create a Bahdanau attention layer as a class that inherits from `tf.keras.layers.Layer`.
* Finally create a `Decoder` class that inherits from `tf.keras.Model`.


You will need to create your own cost function as well as your own training loop. 


### Tips 

Don't use the whole dataset for your first experiments; take just 5000 or even 3000 sentences. This lets you iterate faster and avoids problems that come only from limited computing power and memory.

Good Luck!


In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np 
import tensorflow_datasets as tfds
import tensorflow as tf 
tf.__version__

'2.7.0'

## Importing data & Preprocessing

1. Load the data from the following URL: https://go.aws/38ECHUB. You can read it using `pd.read_csv` with the `"\t"` delimiter and `header=None`.

|   | 0 | 1 |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Hi. | Salut ! |
| 2 | Run! | Cours ! |
| 3 | Run! | Courez ! |
| 4 | Wow! | Ça alors ! |


In [None]:
len(doc)

160538

2. Create an object `doc` containing the first 5000 rows from the file.

3. Add the word `<start>` to the beginning of each target sentence in order to create a new column named `padded_en`

|   | 0 | 1 |
|---|---|---|
| 0 | `<start> Go.` | Va ! |
| 1 | `<start> Hi.` | Salut ! |
| 2 | `<start> Run!` | Cours ! |
| 3 | `<start> Run!` | Courez ! |
| 4 | `<start> Wow!` | Ça alors ! |
| ... | ... | ... |
| 4995 | `<start> I am so sorry.` | Je suis tellement désolé ! |
| 4996 | `<start> I am so sorry.` | Je suis tellement désolée ! |
| 4997 | `<start> I am very sad.` | Je suis très triste. |
| 4998 | `<start> I ate a donut.` | J'ai mangé un beignet. |
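
Steps 1 to 3 can be sketched as follows. This is only an illustration: three inline lines stand in for the downloaded file (an assumption made so the snippet runs without network access), and only 2 rows are kept instead of 5000.

```python
import io

import pandas as pd

# Three inline lines stand in for the real tab-separated file (assumption).
sample_file = io.StringIO("Go.\tVa !\nHi.\tSalut !\nRun!\tCours !\n")

# header=None: the file has no header row, so the columns are named 0 and 1.
df = pd.read_csv(sample_file, sep="\t", header=None)

# Keep only the first rows (5000 in the exercise, 2 in this toy example).
doc = df.iloc[:2].copy()

# Column 0 holds the English (target) sentences: prepend the start token.
doc["padded_en"] = "<start> " + doc[0]
print(doc)
```

With the real data, the `io.StringIO` object would be replaced by the URL or a local path to the `.txt` file.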


4. Create two objects : `tokenizer_fr` and `tokenizer_en` that will be instances of the `tf.keras.preprocessing.text.Tokenizer` class. 

Be careful! The special token we added contains special characters (`<` and `>`), so make sure you set up the tokenizers so that this token is kept intact (the `filters` argument can help, for example).

5. Fit the tokenizers on the French and English sentences respectively.

6. Create two new columns in your DataFrame, `fr_indices` and `en_indices`, for the encoded French and English sentences.

|   | 0 | 1 | fr_indices | en_indices |
|---|---|---|---|---|
| 0 | `<start> Go.` | Va ! | [36] | [1, 11] |
| 1 | `<start> Hi.` | Salut ! | [404] | [1, 616] |
| 2 | `<start> Run!` | Cours ! | [1212] | [1, 111] |
| 3 | `<start> Run!` | Courez ! | [1213] | [1, 111] |
| 4 | `<start> Wow!` | Ça alors ! | [22, 1214] | [1, 872] |
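
One possible way to keep `<start>` intact is sketched below: the default `filters` string of the Keras `Tokenizer` is reused with `<` and `>` removed. The three sentences and the variable name `keep_brackets` are illustrative assumptions, not part of the exercise.

```python
import tensorflow as tf

sentences_en = ["<start> Go.", "<start> Hi.", "<start> Run!"]

# Default Keras filters minus '<' and '>', so "<start>" survives as one token.
keep_brackets = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'

tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters=keep_brackets)
tokenizer_en.fit_on_texts(sentences_en)
en_indices = tokenizer_en.texts_to_sequences(sentences_en)

# For comparison: with the default filters the brackets are stripped.
default_tokenizer = tf.keras.preprocessing.text.Tokenizer()
default_tokenizer.fit_on_texts(sentences_en)

print(tokenizer_en.word_index)
print(default_tokenizer.word_index)
print(en_indices)
```

Because `<start>` appears in every sentence, it is the most frequent token and receives index 1, which matches the leading `1` in the `en_indices` column above.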


7. Sequences of variable length are difficult to work with. Use zero-padding so that all the sequences in each category (French, English) have the same length.

8. What are the shapes of the arrays you just created for the French and English sentences?

(5000, 10)

(5000, 5)
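
A minimal sketch of the zero-padding step, using `pad_sequences` on a few index lists copied from the table above. Note that `padding="post"` is an assumption chosen to match the example at the top of this notebook, where the zeros come after the words; the default is `"pre"`.

```python
import tensorflow as tf

fr_indices = [[36], [404], [22, 1214]]
en_indices = [[1, 11], [1, 616], [1, 872]]

pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

# Every row is padded with zeros up to the length of the longest sequence.
padded_fr = pad_sequences(fr_indices, padding="post")
padded_en = pad_sequences(en_indices, padding="post")

print(padded_fr)
print(padded_fr.shape, padded_en.shape)
```

Both results are NumPy arrays of shape `(number_of_sentences, max_len_of_a_sentence)`.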

9. Use the `train_test_split` function from `sklearn` to divide your sample into train and validation sets.

10. Set a `BATCH_SIZE`, then create `train` and `val` tensor datasets. Apply `.shuffle` to the `train` set and `.batch` to both sets.
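
Steps 9 and 10 could look like the sketch below. The two small integer arrays are stand-ins for the padded French and English arrays, and the names `X_train`, `y_train`, `y_val`, as well as `test_size` and the batch size, are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Stand-ins for the padded arrays: 10 "sentences" of length 4 (fr) and 3 (en).
padded_fr = np.arange(40).reshape(10, 4)
padded_en = np.arange(30).reshape(10, 3)

X_train, X_val, y_train, y_val = train_test_split(
    padded_fr, padded_en, test_size=0.2, random_state=0
)

BATCH_SIZE = 4

# Shuffle only the training set; batch both sets.
train = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .shuffle(len(X_train))
    .batch(BATCH_SIZE)
)
val = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(BATCH_SIZE)

inp, targ = next(iter(train))
print(inp.shape, targ.shape)
```

Each element of `train` and `val` is a pair `(input_batch, target_batch)`, which is the form a training step taking `inp` and `targ` expects.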

## Modeling

1. Set up the following variables:
  * `n_embed`: the output dimension of the models' embedding layers
  * `n_gru`: the number of units in the models' GRU layers
  * `vocab_inp_size`: the French vocabulary size
  * `vocab_tar_size`: the English vocabulary size

### Encoder

2. Define a class `encoder_maker` inheriting from `tf.keras.Model` that can instantiate an encoder-type model according to the following schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

3. Define an instance of the class called... `encoder`!

4. Use the `__call__` method of `encoder` on some data to create an object `encoder_output`, and an `encoder_state` (remember your encoder has two different outputs!). Then print out `encoder_output`, and `encoder_state`.

<tf.Tensor: shape=(1, 10, 256), dtype=float32, numpy=
array([[[-0.00253955,  0.01535846, -0.01031921, ..., -0.0036075 ,
          0.00527863, -0.03740017],
        [ 0.00914116,  0.01068447, -0.01797251, ..., -0.00501267,
         -0.02302877, -0.01950926],
        [-0.01169751,  0.00807895, -0.02574131, ..., -0.00959103,
         -0.0026719 , -0.02316577],
        ...,
        [ 0.01122705,  0.00844394,  0.0134197 , ...,  0.00524029,
          0.07471443, -0.0542349 ],
        [ 0.01121978,  0.00809082,  0.01428621, ...,  0.00589937,
          0.07630372, -0.05507544],
        [ 0.01115769,  0.00785431,  0.01476404, ...,  0.00638014,
          0.07713117, -0.0555497 ]]], dtype=float32)>

<tf.Tensor: shape=(1, 256), dtype=float32, numpy=
array([[ 1.11576924e-02,  7.85430986e-03,  1.47640351e-02,
         1.07137319e-02,  3.64588830e-03,  1.67278796e-02,
         1.34425284e-03, -2.37888610e-03, -4.53043841e-02,
         3.18491012e-02,  1.95410363e-02,  1.14593888e-02,
         1.75643116e-02,  1.36310114e-02, -2.52847336e-02,
         2.52709687e-02, -1.18106306e-02, -8.87616407e-05,
         5.62975556e-03,  1.87446177e-02,  1.26475617e-02,
        -1.11991875e-02,  1.76293682e-03,  1.63977174e-03,
        -3.16058937e-03, -3.12474612e-02, -6.21723616e-03,
         6.01544650e-03,  4.35616225e-02, -2.77330466e-02,
         8.30608141e-03, -3.16027477e-02, -3.33204158e-02,
         3.90100740e-02, -5.50506823e-02,  4.36665788e-02,
        -3.52847390e-02, -1.49069668e-03, -9.91806202e-03,
        -4.39796485e-02, -3.00285988e-03,  4.04703580e-02,
         5.69241755e-02,  1.07780751e-02,  3.27506550e-02,
        -1.17954444e-02,  3.48997936e-02,  4.54235673e-02,
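
A possible encoder is sketched below. It assumes the schema boils down to an embedding layer followed by a single GRU that returns both the full output sequence and its final state, which is what the `n_embed` and `n_gru` variables and the two outputs above suggest. The toy sizes are arbitrary.

```python
import tensorflow as tf


class encoder_maker(tf.keras.Model):
    """Embedding followed by a GRU returning all steps and the last state."""

    def __init__(self, vocab_size, n_embed, n_gru):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, n_embed)
        self.gru = tf.keras.layers.GRU(
            n_gru, return_sequences=True, return_state=True
        )

    def call(self, inputs):
        embedded = self.embedding(inputs)   # (batch, seq_len, n_embed)
        output, state = self.gru(embedded)  # (batch, seq_len, n_gru), (batch, n_gru)
        return output, state


# Toy sizes, chosen only to keep the example fast.
encoder = encoder_maker(vocab_size=50, n_embed=8, n_gru=16)
sample_batch = tf.constant([[3, 7, 1, 0, 0]])  # one padded sentence of length 5
encoder_output, encoder_state = encoder(sample_batch)
print(encoder_output.shape, encoder_state.shape)
```

With this design the last time step of `encoder_output` is identical to `encoder_state`, which is also what the printed tensors above show.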
      

### Attention layer

5. Create a `Bahdanau_attention_maker` class that lets you instantiate an attention layer that you will include in your decoder model. You may follow the instructions from this schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

And get inspiration (as much as you want) from the lecture's demo!

6. Create an instance of the class called `attention_layer`.

7. Try out the `__call__` method on the `encoder_output`, and `encoder_state`.

(<tf.Tensor: shape=(1, 256), dtype=float32, numpy=
 array([[ 0.00601813,  0.00965912,  0.00083819,  0.0061054 ,  0.00741023,
          0.01029313,  0.01178071, -0.00325181, -0.03398979,  0.0228165 ,
          0.01502588,  0.00566412,  0.00525974,  0.00832514, -0.02218461,
          0.02579355, -0.01195296, -0.00294421, -0.00418221,  0.0133222 ,
          0.00819678, -0.00276295, -0.00685048, -0.00093522, -0.00660039,
         -0.01658776,  0.01074449,  0.01261417,  0.03227958, -0.01585295,
         -0.00188884, -0.01266814, -0.01212784,  0.02138761, -0.04008794,
          0.02293937, -0.01680868, -0.00520804, -0.00773489, -0.02959757,
         -0.01014823,  0.01726578,  0.03042765,  0.00458782,  0.02141881,
         -0.01147711,  0.02137397,  0.02372793,  0.02370477, -0.00983301,
         -0.00022486,  0.00519349,  0.00588073, -0.00999522, -0.00343707,
          0.01357444,  0.0222452 , -0.0225017 ,  0.01612515,  0.01630028,
          0.03240031,  0.04815333,  0.01252025, -0.00452948, 
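
The sketch below shows one common way to write additive (Bahdanau) attention as a Keras layer. The internal layer names (`W1`, `W2`, `V`) and the choice to return both the context vector and the attention weights are assumptions. Random tensors stand in for the real encoder outputs.

```python
import tensorflow as tf


class Bahdanau_attention_maker(tf.keras.layers.Layer):
    """Additive attention: score = V(tanh(W1(enc_output) + W2(state)))."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, enc_output, state):
        # state: (batch, units) -> (batch, 1, units) to broadcast over time.
        state_with_time = tf.expand_dims(state, 1)
        # score: (batch, seq_len, 1), one unnormalised score per input step.
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(state_with_time)))
        # Normalise over the time axis so the weights sum to 1 per sentence.
        weights = tf.nn.softmax(score, axis=1)
        # Weighted sum of the encoder outputs: (batch, units).
        context = tf.reduce_sum(weights * enc_output, axis=1)
        return context, weights


# Random stand-ins for the encoder outputs (batch=1, seq_len=5, units=16).
encoder_output = tf.random.normal((1, 5, 16))
encoder_state = tf.random.normal((1, 16))

attention_layer = Bahdanau_attention_maker(units=16)
context, weights = attention_layer(encoder_output, encoder_state)
print(context.shape, weights.shape)
```

The context vector has the same shape as the encoder state, `(batch_size, n_gru)`, which is consistent with the `(1, 256)` tensor printed above.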

### Decoder

8. Set up a `decoder_maker` class that will let you create decoder models according to the demo and the following schema: 

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

9. Create an instance of the class called...... `decoder` !

10. Try out the decoder on some teacher forcing data and the encoder outputs.

(<tf.Tensor: shape=(1, 1258), dtype=float32, numpy=
 array([[0.0007928 , 0.0007945 , 0.00079508, ..., 0.00079173, 0.00080088,
         0.00080663]], dtype=float32)>,
 <tf.Tensor: shape=(1, 256), dtype=float32, numpy=
 array([[ 0.0054262 , -0.00104087, -0.01437734,  0.0007463 , -0.00516099,
         -0.00511794, -0.02686284,  0.02661309,  0.0079101 , -0.00073266,
          0.00072341,  0.00591044, -0.00800723, -0.01275809,  0.02003843,
         -0.00484268,  0.00313709,  0.00535898,  0.01624558, -0.02003136,
         -0.02237273, -0.01843331, -0.01609821,  0.00016066,  0.00276765,
         -0.00078844,  0.01207494,  0.01926803, -0.01058505,  0.01601942,
          0.00838074,  0.01534662, -0.00091441, -0.00283618,  0.0260405 ,
         -0.00851085, -0.01272523,  0.00645019, -0.02285067,  0.00688291,
         -0.00792905, -0.00216173, -0.01906213,  0.01370876, -0.00043282,
         -0.00464217,  0.0030584 ,  0.009551  , -0.01040792, -0.01235285,
         -0.01967028,  0.01735196, -0.01187
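
A decoder along these lines is sketched below. The order of operations (attention on the previous state, concatenation of the context with the embedded input token, one GRU step, softmax over the vocabulary) is an assumption about the schema, and the attention layer is repeated in compact form so the snippet runs on its own.

```python
import tensorflow as tf


class Bahdanau_attention_maker(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, enc_output, state):
        score = self.V(
            tf.nn.tanh(self.W1(enc_output) + self.W2(tf.expand_dims(state, 1)))
        )
        weights = tf.nn.softmax(score, axis=1)
        return tf.reduce_sum(weights * enc_output, axis=1), weights


class decoder_maker(tf.keras.Model):
    """Predicts the next target token from one input token and the encoder outputs."""

    def __init__(self, vocab_size, n_embed, n_gru):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, n_embed)
        self.attention = Bahdanau_attention_maker(n_gru)
        self.gru = tf.keras.layers.GRU(n_gru, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size, activation="softmax")

    def call(self, dec_input, enc_output, state):
        context, weights = self.attention(enc_output, state)  # (batch, n_gru)
        embedded = self.embedding(dec_input)                  # (batch, 1, n_embed)
        gru_input = tf.concat([tf.expand_dims(context, 1), embedded], axis=-1)
        output, new_state = self.gru(gru_input, initial_state=state)
        return self.fc(output), new_state, weights            # probabilities first


# Random stand-ins for the encoder outputs (batch=1, seq_len=5, n_gru=16).
encoder_output = tf.random.normal((1, 5, 16))
encoder_state = tf.random.normal((1, 16))

decoder = decoder_maker(vocab_size=30, n_embed=8, n_gru=16)
dec_input = tf.constant([[1]])  # index of <start>, shape (batch, 1)
probs, dec_state, attn_weights = decoder(dec_input, encoder_output, encoder_state)
print(probs.shape, dec_state.shape)
```

The first output is a probability vector over the target vocabulary, and the second is the updated state to feed back in at the next step.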

### Loss

11. Look at the following loss function. What is its purpose, and how will it change the way the model learns?

In [None]:
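optimizer = tf.keras.optimizers.Adam()
# reduction='none' keeps one loss value per element instead of averaging them
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')

def loss_function(real, pred):
  # True where the target is a real token, False where it is padding (index 0)
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  # Multiply the per-element loss by the mask
  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)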
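
To see the effect numerically, the small experiment below applies the same loss function to two target positions, one real token and one padding index. The uniform prediction over 4 classes is a made-up input chosen so the numbers are easy to check by hand.

```python
import numpy as np
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)


real = tf.constant([3, 0])    # a real token, then a padding index
pred = tf.fill([2, 4], 0.25)  # uniform probabilities over 4 classes

per_position = loss_object(real, pred)  # each entry is about log(4)
masked = loss_function(real, pred)      # about log(4) / 2

print(per_position.numpy(), float(masked), np.log(4))
```

The padded position contributes nothing to the sum. Note that `tf.reduce_mean` still divides by the total number of positions, padded ones included.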

12. Set up a checkpoint for the optimizer, the encoder, and the decoder.

## Training 

1. Define a `train_step` function that takes two arguments: `inp`, a batch of input sequences, and `targ`, a batch of target sequences.

This function will:
* Initialize `loss` to zero
* Track all operations with `tf.GradientTape() as tape`
* Use the encoder on `inp` to compute its outputs
* Set `dec_state` as the encoder state
* Set `dec_input` as the first sequence element of the target batch `targ` (careful with the shapes)
* Start a loop that goes through each subsequent element of the target sequence and does the following:
  * Apply the decoder to the encoder outputs and `dec_input`; this produces the prediction's probability vector and updates the decoder state
  * Calculate the loss from the next element of `targ` and the prediction probability vector, and add it to `loss`
  * Set the new decoder input as the next element of `targ`
* Create `batch_loss` as the average value of the loss over the target sequence.
* Create a `variables` object containing both the encoder's and the decoder's training variables.
* Compute the gradient and update the training variables.
* Return `batch_loss`
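
The steps above can be sketched as follows. Everything outside `train_step` (the compact encoder, attention and decoder, the toy sizes and the two hand-written batches) is a stand-in assumed only to make the snippet runnable on its own; your own classes should take their place.

```python
import tensorflow as tf

layers = tf.keras.layers


class encoder_maker(tf.keras.Model):
    def __init__(self, vocab_size, n_embed, n_gru):
        super().__init__()
        self.embedding = layers.Embedding(vocab_size, n_embed)
        self.gru = layers.GRU(n_gru, return_sequences=True, return_state=True)

    def call(self, inputs):
        output, state = self.gru(self.embedding(inputs))
        return output, state


class Bahdanau_attention_maker(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1, self.W2, self.V = layers.Dense(units), layers.Dense(units), layers.Dense(1)

    def call(self, enc_output, state):
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(tf.expand_dims(state, 1))))
        weights = tf.nn.softmax(score, axis=1)
        return tf.reduce_sum(weights * enc_output, axis=1), weights


class decoder_maker(tf.keras.Model):
    def __init__(self, vocab_size, n_embed, n_gru):
        super().__init__()
        self.embedding = layers.Embedding(vocab_size, n_embed)
        self.attention = Bahdanau_attention_maker(n_gru)
        self.gru = layers.GRU(n_gru, return_state=True)
        self.fc = layers.Dense(vocab_size, activation="softmax")

    def call(self, dec_input, enc_output, state):
        context, weights = self.attention(enc_output, state)
        x = tf.concat([tf.expand_dims(context, 1), self.embedding(dec_input)], axis=-1)
        output, new_state = self.gru(x, initial_state=state)
        return self.fc(output), new_state, weights


optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")


def loss_function(real, pred):
    mask = tf.cast(tf.math.not_equal(real, 0), pred.dtype)
    return tf.reduce_mean(loss_object(real, pred) * mask)


encoder = encoder_maker(vocab_size=20, n_embed=8, n_gru=16)
decoder = decoder_maker(vocab_size=15, n_embed=8, n_gru=16)


def train_step(inp, targ):
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_output, enc_state = encoder(inp)
        dec_state = enc_state
        dec_input = tf.expand_dims(targ[:, 0], 1)  # (batch, 1): the <start> tokens
        for t in range(1, targ.shape[1]):
            pred, dec_state, _ = decoder(dec_input, enc_output, dec_state)
            loss += loss_function(targ[:, t], pred)
            dec_input = tf.expand_dims(targ[:, t], 1)  # teacher forcing
    batch_loss = loss / int(targ.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss


# Two toy padded batches; index 1 plays the role of <start> in the targets.
inp = tf.constant([[4, 9, 2, 0], [7, 3, 0, 0]])
targ = tf.constant([[1, 5, 6, 0], [1, 8, 0, 0]])
batch_loss = train_step(inp, targ)
print(float(batch_loss))
```

The function runs eagerly here for simplicity. Feeding the true next token as the new decoder input, rather than the model's own prediction, is what "teacher forcing" refers to.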


2. Code the training loop.
It should loop over the number of epochs you wish to train for, call `train_step` on each batch, print the training loss every few batches and, optionally, the validation loss at the end of each epoch.

Epoch 1 Batch 0 Loss 3.7139
Epoch 1 Batch 10 Loss 3.0329
Epoch 1 Batch 20 Loss 3.0076
Epoch 1 Loss 94.7033
Time taken for 1 epoch 4.985331773757935 sec

 val loss : tf.Tensor(3.7366273, shape=(), dtype=float32) 

Epoch 2 Batch 0 Loss 2.7408
Epoch 2 Batch 10 Loss 2.7151
Epoch 2 Batch 20 Loss 2.7714
Epoch 2 Loss 82.6822
Time taken for 1 epoch 4.810648202896118 sec

 val loss : tf.Tensor(3.8331704, shape=(), dtype=float32) 

Epoch 3 Batch 0 Loss 2.6021
Epoch 3 Batch 10 Loss 2.5657
Epoch 3 Batch 20 Loss 2.6074
Epoch 3 Loss 77.7419
Time taken for 1 epoch 4.990307807922363 sec

 val loss : tf.Tensor(3.9707232, shape=(), dtype=float32) 

Epoch 4 Batch 0 Loss 2.4774
Epoch 4 Batch 10 Loss 2.4839
Epoch 4 Batch 20 Loss 2.3897
Epoch 4 Loss 73.2277
Time taken for 1 epoch 4.866182804107666 sec

 val loss : tf.Tensor(4.07519, shape=(), dtype=float32) 

Epoch 5 Batch 0 Loss 2.3309
Epoch 5 Batch 10 Loss 2.2713
Epoch 5 Batch 20 Loss 2.3188
Epoch 5 Loss 68.7288
Time taken for 1 epoch 5.3698811531066895 s

3. What do you think of the training process? Did it work well on the train set? On the validation set?

4. Use `X_val` to compute all the predictions for the validation set and convert them back to text. Compare them with the actual target values. What do you think? What about the results on the training set?

pred: off it wasn't me
true: go away


pred: find a job a
true: get a job


pred: i won win win
true: did i win


pred: now drink up tom
true: now drink up


pred: stop that out of
true: come off it


pred: a nap a nap
true: i have proof


pred: where is he is
true: where is it


pred: i was busy got
true: i was busy


pred: i've tried it out
true: i tried


pred: i'm starved all set
true: i'm through
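
Converting predicted indices back to text can be done with the tokenizer's `sequences_to_texts` method, as in the sketch below. The two training sentences and the "predicted" index lists are made up for illustration; in the exercise they would come from your fitted `tokenizer_en` and from the decoder's most probable index at each step.

```python
import tensorflow as tf

# Default Keras filters minus '<' and '>', so "<start>" stays a single token.
keep_brackets = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters=keep_brackets)
tokenizer_en.fit_on_texts(["<start> go away", "<start> get a job"])
word_index = tokenizer_en.word_index

# Hypothetical model outputs: one index per decoding step, 0 being padding.
predicted = [
    [word_index["go"], word_index["away"], 0, 0],
    [word_index["get"], word_index["a"], word_index["job"], word_index["a"]],
]

# Indices with no matching word (such as the padding index 0) are skipped.
pred_texts = tokenizer_en.sequences_to_texts(predicted)
for text in pred_texts:
    print("pred:", text)
```

Since the decoder always produces a fixed number of steps, predictions can run past the end of the true sentence, as in several of the examples above.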




5. Now that everything works well, it's time to increase the number of samples and start another training run. Did the results improve?