# Translation with an Encoder-Decoder

We'll see that LSTMs and the encoder-decoder framework let us build some pretty powerful things, such as *translators*! Let's see how we can create a French > English translator with TensorFlow.

### Tips 

Don't use the whole dataset for your first experiments; take just 5000 or even 3000 sentences. This lets you iterate faster and avoids failures that are caused only by a lack of computing power.

Let's get started!

## Import Libraries

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np 
import sklearn
import tensorflow_datasets as tfds
import tensorflow as tf 
tf.__version__

'2.6.0'

## Importing data 

1. Load the data from https://go.aws/38ECHUB using `pd.read_csv` with the `"\t"` delimiter and `header=None`.

|   | 0 | 1 |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Hi. | Salut ! |
| 2 | Run! | Cours ! |
| 3 | Run! | Courez ! |
| 4 | Wow! | Ça alors ! |


2. Create an object `doc` containing the first 5000 rows from the file.
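
A minimal sketch of steps 1 and 2, not the reference solution. To stay self-contained it reads a small in-memory tab-separated sample (the five rows shown above) instead of the real URL, and `n_rows` is set to 3 rather than 5000. The names `sample_tsv`, `df` and `n_rows` are assumptions made for illustration. Taking an explicit `.copy()` of the slice is one way to avoid pandas' `SettingWithCopyWarning` when new columns are added later.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: tab-separated, no header row.
sample_tsv = (
    "Go.\tVa !\n"
    "Hi.\tSalut !\n"
    "Run!\tCours !\n"
    "Run!\tCourez !\n"
    "Wow!\tÇa alors !\n"
)

# With the real data, pass the URL from step 1 instead of io.StringIO(...).
df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", header=None)

n_rows = 3  # would be 5000 in the exercise
doc = df.iloc[:n_rows, :].copy()  # .copy() makes doc an independent DataFrame

print(doc)
```

Column `0` holds the English sentences and column `1` the French ones, as in the preview above.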

3. In your opinion, are we going to need to lemmatize and remove stop words for a translation problem?

4. Add the word `<start>` to the beginning of each target sentence in order to create a new column named `padded_en`



|   | 0 | 1 | padded_en |
|---|---|---|---|
| 0 | Go. | Va ! | <start> Go. |
| 1 | Hi. | Salut ! | <start> Hi. |
| 2 | Run! | Cours ! | <start> Run! |
| 3 | Run! | Courez ! | <start> Run! |
| 4 | Wow! | Ça alors ! | <start> Wow! |
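
One possible way to build the `padded_en` column is plain string concatenation on the English column. This is a sketch on a toy stand-in for `doc` (three rows typed by hand), assuming column `0` is English and column `1` is French as in the preview.

```python
import pandas as pd

# Toy stand-in for `doc`: column 0 is English (target), column 1 is French.
doc = pd.DataFrame({0: ["Go.", "Hi.", "Run!"], 1: ["Va !", "Salut !", "Cours !"]})

# Prepend the start token, followed by a space so it becomes its own word.
doc["padded_en"] = "<start> " + doc[0]

print(doc["padded_en"].tolist())
```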


5. Create two objects, `tokenizer_fr` and `tokenizer_en`, that are instances of the `tf.keras.preprocessing.text.Tokenizer` class.

Be careful! Since we added a special token containing special characters, make sure you set up the tokenizers so that this token is interpreted correctly (use the `filters` argument, for example).

6. Fit the tokenizers on the French and **padded** English sentences respectively.

7. Create three new columns in your DataFrame for the encoded French, English, and padded English sentences.



|   | 0 | 1 | padded_en | fr_indices | en_indices | padded_en_indices |
|---|---|---|---|---|---|---|
| 0 | Go. | Va ! | <start> Go. | [36] | [11] | [1, 11] |
| 1 | Hi. | Salut ! | <start> Hi. | [404] | [616] | [1, 616] |
| 2 | Run! | Cours ! | <start> Run! | [1212] | [111] | [1, 111] |
| 3 | Run! | Courez ! | <start> Run! | [1213] | [111] | [1, 111] |
| 4 | Wow! | Ça alors ! | <start> Wow! | [22, 1214] | [872] | [1, 872] |
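
Steps 5 to 7 could look roughly like the sketch below. It works on plain Python lists rather than DataFrame columns to stay short, and the `filters` string is the Keras default with `<` and `>` removed, which is one way (an assumption, not necessarily the demo's) to keep `<start>` as a single token.

```python
import tensorflow as tf

french = ["Va !", "Salut !", "Cours !"]
english = ["Go.", "Hi.", "Run!"]
padded_english = ["<start> " + sentence for sentence in english]

# Default Keras filters minus "<" and ">", so "<start>" survives as one token.
filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'

tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(filters=filters)
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters=filters)

# The English tokenizer is fitted on the padded sentences so that
# "<start>" gets an index of its own.
tokenizer_fr.fit_on_texts(french)
tokenizer_en.fit_on_texts(padded_english)

fr_indices = tokenizer_fr.texts_to_sequences(french)
en_indices = tokenizer_en.texts_to_sequences(english)
padded_en_indices = tokenizer_en.texts_to_sequences(padded_english)

print(tokenizer_en.word_index)
print(padded_en_indices)
```

Because `<start>` appears in every sentence it is the most frequent token, so it receives index 1; index 0 is never assigned and stays free for zero-padding.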


8. We learned from the tutorial that the padded target sequences (the decoder inputs) need to have the same length as the target sequences, so remove the last element of each padded target sequence. This is what lets us apply teacher forcing.



|   | 0 | 1 | padded_en | fr_indices | en_indices | padded_en_indices | padded_en_indices_clean |
|---|---|---|---|---|---|---|---|
| 0 | Go. | Va ! | <start> Go. | [36] | [11] | [1, 11] | [1] |
| 1 | Hi. | Salut ! | <start> Hi. | [404] | [616] | [1, 616] | [1] |
| 2 | Run! | Cours ! | <start> Run! | [1212] | [111] | [1, 111] | [1] |
| 3 | Run! | Courez ! | <start> Run! | [1213] | [111] | [1, 111] | [1] |
| 4 | Wow! | Ça alors ! | <start> Wow! | [22, 1214] | [872] | [1, 872] | [1] |
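
To see why dropping the last element sets up teacher forcing, here is a small pure-Python sketch on hand-written index lists (index 1 is assumed to be `<start>`). After the cut, position `t` of the decoder input holds the true token that precedes position `t` of the target.

```python
# Toy index sequences; 1 is assumed to be the index of "<start>".
padded_en_indices = [[1, 11], [1, 616], [1, 8, 6]]
en_indices = [[11], [616], [8, 6]]

# Dropping the last token turns "<start> w1 ... wn" into "<start> w1 ... w(n-1)".
padded_en_indices_clean = [seq[:-1] for seq in padded_en_indices]

for decoder_input, target in zip(padded_en_indices_clean, en_indices):
    # Same length: at each step the decoder sees the previous true token
    # and has to predict the current one.
    assert len(decoder_input) == len(target)
    print(decoder_input, "->", target)
```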


9. Sequences of variable length are rather difficult to work with. Use zero-padding to give all the sequences in each category the same length.

10. What are the shapes of the arrays you just created for the French, padded English, and English sentences?

(5000, 10)

(5000, 4)

(5000, 4)
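
A sketch of steps 9 and 10 using `tf.keras.preprocessing.sequence.pad_sequences` on three toy sequences per category. `padding="post"` is an assumption; it matches the trailing zeros visible in the `true:` arrays printed later in this notebook.

```python
import tensorflow as tf

fr_indices = [[36], [404], [22, 1214]]
en_indices = [[11], [616], [872, 5]]
padded_en_indices_clean = [[1], [1], [1, 872]]

pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

# Each call pads to the longest sequence of its own category.
padded_fr = pad_sequences(fr_indices, padding="post")
padded_en = pad_sequences(en_indices, padding="post")
padded_en_in = pad_sequences(padded_en_indices_clean, padding="post")

print(padded_fr.shape, padded_en_in.shape, padded_en.shape)
print(padded_fr)
```

The first dimension is the number of sentences and the second is the length of the longest sequence in that category, which is how shapes such as `(5000, 10)` and `(5000, 4)` arise.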

11. Use the `train_test_split` function from `sklearn` to split your data into train and validation sets.
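
A minimal sketch of the split, using zero-filled stand-in arrays with the shapes printed above. Passing the three arrays in one call keeps their rows aligned. `test_size=0.3` and `random_state=0` are assumptions; 0.3 was chosen because it yields the 3500 training rows seen in the model outputs further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_sentences = 5000
padded_fr = np.zeros((n_sentences, 10), dtype="int32")     # encoder input
padded_en_in = np.zeros((n_sentences, 4), dtype="int32")   # teacher forcing input
padded_en = np.zeros((n_sentences, 4), dtype="int32")      # target

# One call splits all arrays with the same row permutation.
(fr_train, fr_val,
 en_in_train, en_in_val,
 en_train, en_val) = train_test_split(
    padded_fr, padded_en_in, padded_en, test_size=0.3, random_state=0
)

print(fr_train.shape, fr_val.shape)
```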

## Model

Now it's time to code the model. Thankfully, you can base most of it on the code provided during the demo!

1. Create the following variables:
* `n_embed`: the number of dimensions you want for the embedding output space
* `n_lstm`: the number of units you want for the LSTM layers
* `fr_len`: the length of a French sentence
* `en_len`: the length of an English or teacher forcing sentence
* `vocab_size_fr`: the number of tokens in the French vocabulary
* `vocab_size_en`: the number of tokens in the English vocabulary (based on the padded sequences, so `<start>` is included!)

In [None]:
# let's start by defining the number of units needed for the embedding and
# the lstm layers

n_embed = 
n_lstm = 
fr_len = 
en_len = 
vocab_size_fr = 
vocab_size_en = 
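
One way to fill in these variables is to derive the lengths and vocabulary sizes from the objects built earlier instead of typing numbers by hand. The sketch below uses stand-ins: zero arrays with the shapes printed above, and tiny dictionaries in place of `tokenizer_fr.word_index` and `tokenizer_en.word_index`. The value of `n_embed` is an arbitrary assumption; `n_lstm = 64` matches the 64-wide states in the encoder output shown below.

```python
import numpy as np

# Stand-ins for objects built earlier in the notebook.
padded_fr = np.zeros((5000, 10), dtype="int32")
padded_en = np.zeros((5000, 4), dtype="int32")
word_index_fr = {"va": 1, "salut": 2, "cours": 3}   # would be tokenizer_fr.word_index
word_index_en = {"<start>": 1, "go": 2}             # would be tokenizer_en.word_index

n_embed = 32    # assumption: free hyperparameter
n_lstm = 64     # assumption: free hyperparameter
fr_len = padded_fr.shape[1]
en_len = padded_en.shape[1]
vocab_size_fr = len(word_index_fr)
vocab_size_en = len(word_index_en)  # includes "<start>"

print(fr_len, en_len, vocab_size_fr, vocab_size_en)
```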

2. Set up the encoder

This works the same way as in the demo. Just make sure the input dimension of the embedding equals the number of words in the French vocabulary + 1 (for the zero-padding).

3. Try the encoder on the French training data (using the call method).

[<tf.Tensor: shape=(3500, 64), dtype=float32, numpy=
 array([[ 0.03516009,  0.03100508,  0.02265517, ..., -0.01395426,
         -0.02611356,  0.009006  ],
        ...,
        [ 0.03321327,  0.02693242,  0.0135023 , ..., -0.01000259,
         -0.02379798,  0.01082663]], dtype=float32)>,
 <tf.Tensor: shape=(3500, 64), dtype=float32, numpy=
 array([[ 0.03516009,  0.03100508,  0.02265517, ..., -0.01395426,
         -0.02611356,  0.009006  ],
        ...], dtype=float32)>]
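
The demo code is not reproduced here, so the following is only a sketch of what an encoder of this kind might look like: an `Embedding` followed by an `LSTM` that returns its final hidden and cell states. The class name `Encoder`, the hyperparameter values and the random `fake_fr` batch are all assumptions for illustration.

```python
import numpy as np
import tensorflow as tf


class Encoder(tf.keras.Model):
    """Embeds a French index sequence and returns the final LSTM states."""

    def __init__(self, vocab_size, n_embed, n_lstm):
        super().__init__()
        # +1 because index 0 is reserved for zero-padding.
        self.embedding = tf.keras.layers.Embedding(vocab_size + 1, n_embed)
        self.lstm = tf.keras.layers.LSTM(n_lstm, return_state=True)

    def call(self, inputs):
        embedded = self.embedding(inputs)
        # With return_state=True the LSTM returns (output, state_h, state_c).
        _, state_h, state_c = self.lstm(embedded)
        return [state_h, state_c]


vocab_size_fr, n_embed, n_lstm, fr_len = 50, 32, 64, 10
encoder = Encoder(vocab_size_fr, n_embed, n_lstm)

# Random stand-in for a batch of 8 padded French sentences.
fake_fr = np.random.randint(0, vocab_size_fr + 1, size=(8, fr_len)).astype("int32")
states = encoder(fake_fr)
print([tuple(s.shape) for s in states])
```

Each state has shape `(batch, n_lstm)`, which corresponds to the `(3500, 64)` tensors in the output above.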

4. Set up the decoder

This works the same way as in the demo. Just make sure the input dimension of the embedding equals the number of words in the English vocabulary + 1 (for the zero-padding). The same goes for the number of units of the last Dense layer!

5. Try the decoder on the French training data and the teacher forcing data.

<tf.Tensor: shape=(3500, 4, 1258), dtype=float32, numpy=
array([[[0.00079917, 0.00078856, 0.00079867, ..., 0.00080011,
         0.00079256, 0.00079229],
        ...,
        [0.00079622, 0.00079408, 0.00079392, ..., 0.00079977,
         0.00079026, 0.00079613]],

       ...], dtype=float32)>
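
As with the encoder, this is a hedged sketch rather than the demo's code: a training decoder that embeds the teacher forcing sequence, runs an `LSTM` initialised with the encoder states, and ends with a softmax `Dense` layer over the English vocabulary (+1 for padding). The class name `Decoder`, the sizes and the random inputs are assumptions; random states stand in for real encoder outputs.

```python
import numpy as np
import tensorflow as tf


class Decoder(tf.keras.Model):
    """Predicts a distribution over English tokens at every time step."""

    def __init__(self, vocab_size, n_embed, n_lstm):
        super().__init__()
        # +1 in both layers because index 0 is reserved for zero-padding.
        self.embedding = tf.keras.layers.Embedding(vocab_size + 1, n_embed)
        self.lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True)
        self.dense = tf.keras.layers.Dense(vocab_size + 1, activation="softmax")

    def call(self, inputs, initial_state):
        embedded = self.embedding(inputs)
        # The encoder states initialise the decoder LSTM.
        sequence = self.lstm(embedded, initial_state=initial_state)
        return self.dense(sequence)


vocab_size_en, n_embed, n_lstm, en_len, batch = 40, 32, 64, 4, 8
decoder = Decoder(vocab_size_en, n_embed, n_lstm)

# Stand-ins: teacher forcing batch and (state_h, state_c) from an encoder.
fake_en_in = np.random.randint(0, vocab_size_en + 1, size=(batch, en_len)).astype("int32")
fake_states = [tf.random.normal((batch, n_lstm)), tf.random.normal((batch, n_lstm))]

probs = decoder(fake_en_in, fake_states)
print(tuple(probs.shape))
```

The output has shape `(batch, en_len, vocab_size_en + 1)`, the same layout as the `(3500, 4, 1258)` tensor above, and each time step sums to 1.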

6. Set up the inference decoder

The code here is identical to the code from the demo, unless you changed some names!

7. Compile the decoder (the training version) using the appropriate loss and metric functions.

8. Train the decoder for 50 epochs; this should take 10 minutes. Is there overfitting?
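
A sketch of steps 7 and 8 under stated assumptions. Because the demo's classes are not shown here, a small functional-API encoder-decoder stands in for the training decoder; the data is random, the sizes are tiny, and only 2 epochs are run. Since the targets are integer indices and the model outputs probabilities, `SparseCategoricalCrossentropy` and `SparseCategoricalAccuracy` are a natural pairing.

```python
import numpy as np
import tensorflow as tf

fr_len, en_len, n_embed, n_lstm = 10, 4, 16, 32
vocab_size_fr, vocab_size_en = 50, 40
layers = tf.keras.layers

# Functional stand-in for the encoder + training decoder.
enc_in = tf.keras.Input(shape=(fr_len,), dtype="int32")
dec_in = tf.keras.Input(shape=(en_len,), dtype="int32")
enc_emb = layers.Embedding(vocab_size_fr + 1, n_embed)(enc_in)
_, state_h, state_c = layers.LSTM(n_lstm, return_state=True)(enc_emb)
dec_emb = layers.Embedding(vocab_size_en + 1, n_embed)(dec_in)
dec_seq = layers.LSTM(n_lstm, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c]
)
probs = layers.Dense(vocab_size_en + 1, activation="softmax")(dec_seq)
model = tf.keras.Model([enc_in, dec_in], probs)

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

rng = np.random.default_rng(0)

def fake_batch(n):
    """Random (french, teacher forcing input, target) index arrays."""
    fr = rng.integers(1, vocab_size_fr + 1, size=(n, fr_len)).astype("int32")
    en_in = rng.integers(1, vocab_size_en + 1, size=(n, en_len)).astype("int32")
    en = rng.integers(1, vocab_size_en + 1, size=(n, en_len)).astype("int32")
    return fr, en_in, en

fr_train, en_in_train, en_train = fake_batch(64)
fr_val, en_in_val, en_val = fake_batch(32)

history = model.fit(
    [fr_train, en_in_train], en_train,
    validation_data=([fr_val, en_in_val], en_val),
    epochs=2, batch_size=32, verbose=0,
)

# Overfitting shows up as training loss falling while validation loss rises.
print(history.history["loss"], history.history["val_loss"])
```

Passing `validation_data` is what makes the overfitting question answerable: compare the `loss` and `val_loss` curves stored in `history.history` across epochs.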

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fd0cf49ff90>

9. Adapt the code from the demo to make some predictions on the validation data.

Be careful: in the demo, the starting index for the teacher forcing sequences was 0. What index is the starting point of the teacher forcing sequences now?

Set up the first decoder input with the right dimension too!
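
The prediction loop can be sketched independently of TensorFlow. In the sketch below, `step_fn` is a hypothetical stand-in for one call of the inference decoder, and `toy_step` is a made-up function used only to make the example runnable. `start_index=1` assumes `<start>` received index 1, as in the encoded table earlier; check `tokenizer_en.word_index["<start>"]` in your own run.

```python
import numpy as np


def greedy_decode(step_fn, init_state, start_index, max_len):
    """Greedy decoding: feed each prediction back in as the next input.

    step_fn(tokens, state) -> (probs, new_state), where tokens has shape
    (batch, 1) and probs has shape (batch, vocab).
    """
    batch = init_state.shape[0]
    # First decoder input: one <start> index per sentence, shape (batch, 1).
    tokens = np.full((batch, 1), start_index, dtype="int32")
    state = init_state
    outputs = []
    for _ in range(max_len):
        probs, state = step_fn(tokens, state)
        # Keep the most likely token and use it as the next input.
        tokens = probs.argmax(axis=-1).reshape(batch, 1).astype("int32")
        outputs.append(tokens)
    return np.concatenate(outputs, axis=1)  # shape (batch, max_len)


vocab = 10

def toy_step(tokens, state):
    """Hypothetical decoder step: always predicts (previous token + 1)."""
    probs = np.zeros((tokens.shape[0], vocab))
    probs[np.arange(tokens.shape[0]), (tokens[:, 0] + 1) % vocab] = 1.0
    return probs, state


pred = greedy_decode(toy_step, init_state=np.zeros((3, 64)), start_index=1, max_len=4)
print(pred)
```

With a real model, `init_state` would be the encoder states for the validation sentences and `max_len` would be `en_len`.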

pred: [ 2 18 54 54]
true: [26 29  5  0]


pred: [ 41 207 207  32]
true: [41 32  0  0]


pred: [8 6 8 7]
true: [ 28  78 192   0]


pred: [  2  25 238   5]
true: [  2 284   5   0]


pred: [  8   6 291 120]
true: [  8   6  34 574]


pred: [  3  46 106 101]
true: [  3  46 233   0]


pred: [ 15 130 136  39]
true: [13  5 89  0]


pred: [  3 128  77  32]
true: [  2 706   4   0]


pred: [ 25   5 117   7]
true: [ 25   5 516   0]


pred: [  2  33 193  77]
true: [  2  33 193   0]




10. Use the tokenizer to convert the target and predicted sequences back to text. What do you think of the translations?
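
A minimal sketch of the conversion with the tokenizer's `sequences_to_texts` method, on a tokenizer fitted on two made-up sentences. The arrays `true_seq` and `pred_seq` are hand-written stand-ins for the padded targets and the model predictions. Index 0 has no entry in the tokenizer's vocabulary, so padding disappears from the decoded text.

```python
import numpy as np
import tensorflow as tf

# Same filters idea as before: keep "<" and ">" so "<start>" stays one token.
tokenizer_en = tf.keras.preprocessing.text.Tokenizer(
    filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
)
tokenizer_en.fit_on_texts(["<start> i got fined", "<start> can you swim"])

# Stand-ins for one padded target row and one predicted row.
true_seq = np.array([[2, 3, 4, 0]])
pred_seq = np.array([[2, 3, 4, 7]])

# sequences_to_texts expects lists of integer lists.
true_text = tokenizer_en.sequences_to_texts(true_seq.tolist())
pred_text = tokenizer_en.sequences_to_texts(pred_seq.tolist())

for t, p in zip(true_text, pred_text):
    print("true:", t)
    print("pred:", p)
```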

true: i'll get you
pred i was good good


true: tom's here
pred tom's mad mad here


true: he's so young
pred he is he a


true: i called you
pred i can read you


true: he is no fool
pred he is kind too


true: i'm not mean
pred i'm not sure done


true: are you ready
pred be careful well in


true: i oppose it
pred i'm sorry busy here


true: can you pitch
pred can you swim a


true: i got fined
pred i got fined busy




11. Now that you have reached the end of the exercise, go back to the beginning and increase the number of sentences your model trains on. This should significantly improve the quality of the predictions!