In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#Machine Translation
In this notebook, we aim to translate English phrases into French using a recurrent neural network (RNN).

#Introduction
In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. Your completed pipeline will accept English text as input and return the French translation.

In [183]:
#Now importing modules
!pip install helper
!pip install keras
!pip install tensorflow
import helper
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy



In [184]:
import tensorflow as tf

#Load Data
The small_vocab_en file contains English sentences, and the small_vocab_fr file contains their French translations. Load the English and French data from these files by running the cells below.

In [220]:
english_path='https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_en.txt'
french_path='https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_fr.txt'

Load the dataset and split each file into lines.

In [221]:
import os

def load_data(path):
  input_file = os.path.join(path)
  with open(input_file, "r") as f:
    data = f.read()

  return data.split('\n')

In [222]:
#Using tf.keras.utils.get_file to download the dataset files
english_data=tf.keras.utils.get_file('file1',english_path)
french_data=tf.keras.utils.get_file('file2',french_path)

In [223]:
#Now loading data
english_sentences=load_data(english_data)
french_sentences=load_data(french_data)

In [224]:
len(french_sentences), len(english_sentences)

(137860, 137860)

In [225]:
english_sentences

['new jersey is sometimes quiet during autumn , and it is snowy in april .',
 'the united states is usually chilly during july , and it is usually freezing in november .',
 'california is usually quiet during march , and it is usually hot in june .',
 'the united states is sometimes mild during june , and it is cold in september .',
 'your least liked fruit is the grape , but my least liked is the apple .',
 'his favorite fruit is the orange , but my favorite is the grape .',
 'paris is relaxing during december , but it is usually chilly in july .',
 'new jersey is busy during spring , and it is never hot in march .',
 'our least liked fruit is the lemon , but my least liked is the grape .',
 'the united states is sometimes busy during january , and it is sometimes warm in november .',
 'the lime is her least liked fruit , but the banana is my least liked .',
 'he saw a old yellow truck .',
 'india is rainy during june , and it is sometimes warm in november .',
 'that cat was my most l

#Analysis of Dataset
Let us look at a few examples from the dataset in both languages.

In [226]:
for i in range(3):
  print('Sample :',i)
  print(english_sentences[i])
  print(french_sentences[i])
  print('-'*50)

Sample : 0
new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
--------------------------------------------------
Sample : 1
the united states is usually chilly during july , and it is usually freezing in november .
les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
--------------------------------------------------
Sample : 2
california is usually quiet during march , and it is usually hot in june .
california est généralement calme en mars , et il est généralement chaud en juin .
--------------------------------------------------


#Convert to Vocabulary
The complexity of the problem is determined by the complexity of the vocabulary: the more distinct words there are, the harder the problem. Let's look at the vocabulary of the dataset we'll be working with.

In [227]:
import collections

In [228]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('English Vocab:',len(english_words_counter))
print('French Vocab:',len(french_words_counter))

English Vocab: 227
French Vocab: 355


#Tokenize (IMPLEMENTATION)
For a neural network to predict on text data, it first has to be turned into data it can understand. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be number(s).

We can turn each character into a number or each word into a number. These are called character and word ids, respectively.
- Character ids are used for character level models that generate text predictions for each character.
- A word level model uses word ids that generate text predictions for each word. Word level models tend to learn better, since they are lower in complexity, so we'll use that.
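
As a rough illustration of the two options above, the sketch below encodes one toy sentence both ways. The id scheme (sorted unique symbols, ids starting at 1) is an assumption made for this example only; it is not how the Keras Tokenizer assigns ids.

```python
# Minimal sketch: character ids vs. word ids for a toy sentence.
# Assumption: ids are assigned by sorting the unique symbols and counting from 1.
sentence = "the dog saw the cat"

# Character-level: one id per distinct character (the space included).
chars = sorted(set(sentence))
char_to_id = {ch: i + 1 for i, ch in enumerate(chars)}
char_ids = [char_to_id[ch] for ch in sentence]

# Word-level: one id per distinct word.
words = sorted(set(sentence.split()))
word_to_id = {w: i + 1 for i, w in enumerate(words)}
word_ids = [word_to_id[w] for w in sentence.split()]

print(len(char_ids), "character ids:", char_ids)
print(len(word_ids), "word ids:", word_ids)
```

The word-level sequence is much shorter than the character-level one, which is one reason the word-level model has less to learn per sentence.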

**TO_DO:** Turn each sentence into a sequence of word ids using Keras's Tokenizer class. Use the tokenize function to tokenize english_sentences and french_sentences in the cell below.

In [229]:
from collections import Counter
def tokenize(x):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(x)
  return tokenizer

In [230]:
# Tokenize Sample output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']

text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
text_tokenized = text_tokenizer.texts_to_sequences(text_sentences)

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
  print('Sequence {} in x'.format(sample_i + 1))
  print('  Input:  {}'.format(sent))
  print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}
Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


#Padding (IMPLEMENTATION)
When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the end of each sequence using Keras's pad_sequences function.
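
To make the idea concrete, here is a minimal pure-Python sketch of end ("post") padding. `pad_post` is a hypothetical helper written for this illustration, not part of Keras. It only mimics what `pad_sequences(..., padding='post')` does for sequences that fit within the target length (Keras truncates longer sequences from the front by default, whereas this sketch cuts from the end).

```python
def pad_post(sequences, length=None, value=0):
    """Right-pad lists of ids with `value` up to a common length.

    Hypothetical helper for illustration; sequences longer than `length`
    are cut at the end, which differs from the Keras default.
    """
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [list(seq[:length]) + [value] * (length - len(seq)) for seq in sequences]


# The three tokenized sample sentences from the tokenizer output above.
tokenized = [
    [1, 2, 4, 5, 6, 7, 1, 8, 9],
    [10, 11, 12, 2, 13, 14, 15, 16, 3, 17],
    [18, 19, 3, 20, 21],
]
padded = pad_post(tokenized)
for row in padded:
    print(row)
```

Every row now has the length of the longest sentence, with zeros filling the tail. Id 0 is never assigned to a real word by the tokenizer, so it can safely stand for padding.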

In [231]:
def pad(x, length=None):
  ## TO_DO:
  text_padded = pad_sequences(x, maxlen=length, padding='post')
  return text_padded

In [232]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    x_tk = tokenize(x)
    preprocess_x = x_tk.texts_to_sequences(x)
    y_tk = tokenize(y)
    preprocess_y = y_tk.texts_to_sequences(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    #Expanding dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)


print('Data Preprocessed.')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

print("Are all English sequences of the same length?", all(len(seq) == max_english_sequence_length for seq in preproc_english_sentences))
print("Are all French sequences of the same length?", all(len(seq) == max_french_sequence_length for seq in preproc_french_sentences))


Data Preprocessed.
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344
Are all English sequences of the same length? True
Are all French sequences of the same length? True


#Create Model



The neural network outputs word ids, which isn't the final form we want. We want the French translation. The function logits_to_text bridges the gap between the logits from the neural network and the French translation. You'll be using this function to better understand the output of the neural network.

In [233]:
def logits_to_text(logits, tokenizer):
  index_to_words = {id: word for word, id in tokenizer.word_index.items()}
  index_to_words[0] = '<PAD>'

  #For each time step, pick the word id with the highest score (argmax over the vocabulary axis)
  #Then map each id back to its word through the reversed word index
  return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
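
The decoding step can be sketched without NumPy on a tiny hand-made example. The vocabulary and scores below are invented for illustration, and `decode` is a hypothetical stand-in for logits_to_text that works on plain lists.

```python
def decode(scores, index_to_word):
    """Pick the highest-scoring id at each time step and join the words."""
    ids = [max(range(len(step)), key=step.__getitem__) for step in scores]
    return " ".join(index_to_word[i] for i in ids)


# Invented 4-word vocabulary; id 0 is reserved for padding.
index_to_word = {0: "<PAD>", 1: "il", 2: "est", 3: "calme"}

# One row per time step, one score per vocabulary id.
scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.2, 0.6, 0.2],
    [0.1, 0.1, 0.2, 0.6],
    [0.9, 0.0, 0.1, 0.0],
]
decoded = decode(scores, index_to_word)
print(decoded)  # il est calme <PAD>
```

The trailing `<PAD>` shows why the prediction code later in the notebook cuts the decoded string at the first padding token.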

![Model](https://github.com/tommytracey/AIND-Capstone/raw/8267d4fe72e48c595a0aff46eaf0a805fff0f36d/images/embedding.png)

#Building Model
Here we use an RNN built from GRU units for translation.
In the code section below, we give a simple example model. You can first run this model and experiment with it. Then you can change the model architecture by following Exercise 4 to get better results.

In [199]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """

    ## TO_DO: Improve the layers (See Exercise 4)
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    return model

In [200]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
print(f"Shape before reshaping: {tmp_x.shape}")
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))
print(f"Shape after reshaping: {tmp_x.shape}")

Shape before reshaping: (137860, 21)
Shape after reshaping: (137860, 21)


Finally calling the model function

In [201]:
# Hyperparameters
learning_rate = 0.005

In [202]:
simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

For every time step, the model outputs a probability distribution over the French vocabulary, and the usual cross-entropy loss compares it with a one-hot encoded target. Our dataset contains integer tokens instead of one-hot encoded arrays. Each one-hot encoded array has as many elements as the vocabulary has words, so it would be extremely wasteful to convert the entire dataset to one-hot encoded arrays. A better way is to use a so-called sparse cross-entropy loss function, which takes the integer targets directly and is equivalent to converting them to one-hot encoded arrays internally.
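
The equivalence can be checked on a single time step with plain Python. The functions below are simplified sketches written for this illustration (one target, one distribution, no batching); they are not the Keras implementations.

```python
import math


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def categorical_crossentropy_sketch(target_one_hot, probs):
    """Cross-entropy against a one-hot target (dense form)."""
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, probs) if t > 0)


def sparse_categorical_crossentropy_sketch(target_index, probs):
    """Cross-entropy against an integer target (sparse form)."""
    return -math.log(probs[target_index])


probs = [0.1, 0.2, 0.6, 0.1]  # invented softmax output over 4 words
target = 2                    # integer id of the correct word

dense_loss = categorical_crossentropy_sketch(one_hot(target, len(probs)), probs)
sparse_loss = sparse_categorical_crossentropy_sketch(target, probs)
print(dense_loss, sparse_loss)
```

Both calls give the same loss, but the sparse form never materialises the one-hot vector, which matters when the vocabulary has hundreds or thousands of entries per time step.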

In [203]:
# Compile model
simple_rnn_model.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

In [204]:
simple_rnn_model.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 21, 256)           51200     
                                                                 
 gru_10 (GRU)                (None, 21, 256)           394752    
                                                                 
 time_distributed_20 (TimeD  (None, 21, 1024)          263168    
 istributed)                                                     
                                                                 
 dropout_10 (Dropout)        (None, 21, 1024)          0         
                                                                 
 time_distributed_21 (TimeD  (None, 21, 345)           353625    
 istributed)                                                     
                                                                 
Total params: 1062745 (4.05 MB)
Trainable params: 106

#Training the model
Here we train the model on the padded English sequences (tmp_x), with the French sequences as targets, holding out 20% of the data for validation.

In [205]:
history=simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#Arbitrary Predictions

Try arbitrary examples from the corpus to see the translation.

In [218]:
import re
def final_predictions(text):
  y_id_to_word = {value: key for key, value in french_tokenizer.word_index.items()}
  y_id_to_word[0] = '<PAD>'

  sentence = [english_tokenizer.word_index[word] for word in text.split()]
  sentence = pad_sequences([sentence], maxlen=preproc_french_sentences.shape[-2], padding='post')
  french_translation = logits_to_text(simple_rnn_model.predict(sentence[:1], verbose=0)[0], french_tokenizer)
  return re.split(r"\s*<PAD>", french_translation, 1)[0]

In [207]:
txt = english_sentences[0].lower()
print('English: ', english_sentences[0])
print('French: ', final_predictions(re.sub(r'[^\w]', ' ', txt)))

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French:  new jersey est parfois calme en l' automne il est neigeux en avril


# Evaluation

In this section, we provide example code for evaluating the translations with the BLEU metric.

In [208]:
# useful tokenization
import re
from functools import lru_cache


class BaseTokenizer:
    """A base dummy tokenizer to derive from."""

    def signature(self):
        """
        Returns a signature for the tokenizer.
        :return: signature string
        """
        return "none"

    def __call__(self, line):
        """
        Tokenizes an input line with the tokenizer.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        return line


class TokenizerRegexp(BaseTokenizer):
    def signature(self):
        return "re"

    def __init__(self):
        self._re = [
            # language-dependent part (assuming Western languages)
            (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
            # tokenize period and comma unless preceded by a digit
            (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
            # tokenize period and comma unless followed by a digit
            (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
            # tokenize dash when preceded by a digit
            (re.compile(r"([0-9])(-)"), r"\1 \2 "),
            # one space only between words
            # NOTE: Doing this in Python (below) is faster
            # (re.compile(r'\s+'), r' '),
        ]

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Common post-processing tokenizer for `13a` and `zh` tokenizers.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        for (_re, repl) in self._re:
            line = _re.sub(repl, line)

        # no leading or trailing spaces, single space within words
        # return ' '.join(line.split())
        # This line is changed with regards to the original tokenizer (seen above) to return individual words
        return line.split()


class Tokenizer13a(BaseTokenizer):
    def signature(self):
        return "13a"

    def __init__(self):
        self._post_tokenizer = TokenizerRegexp()

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Tokenizes an input line using a relatively minimal tokenization
        that is however equivalent to mteval-v13a, used by WMT.

        :param line: a segment to tokenize
        :return: the tokenized line
        """

        # language-independent part:
        line = line.replace("<skipped>", "")
        line = line.replace("-\n", "")
        line = line.replace("\n", " ")

        if "&" in line:
            line = line.replace("&quot;", '"')
            line = line.replace("&amp;", "&")
            line = line.replace("&lt;", "<")
            line = line.replace("&gt;", ">")

        return self._post_tokenizer(f" {line} ")

In [209]:
import collections
import math


def get_ngrams(segment, max_order):
  """Extracts all n-grams up to a given maximum order from an input segment.

  Args:
    segment: text segment from which n-grams will be extracted.
    max_order: maximum length in tokens of the n-grams returned by this
        method.

  Returns:
    The Counter containing all n-grams up to max_order in segment
    with a count of how many times each n-gram occurred.
  """
  ngram_counts = collections.Counter()
  for order in range(1, max_order + 1):
    for i in range(0, len(segment) - order + 1):
      ngram = tuple(segment[i:i+order])
      ngram_counts[ngram] += 1
  return ngram_counts


def compute_bleu(reference_corpus, translation_corpus, max_order=4):
  """Computes BLEU score of translated segments against one or more references.

  Args:
    reference_corpus: list of lists of references for each translation. Each
        reference should be tokenized into a list of tokens.
    translation_corpus: list of translations to score. Each translation
        should be tokenized into a list of tokens.
    max_order: Maximum n-gram order to use when computing BLEU score.

  Returns:
    6-tuple with the BLEU score, n-gram precisions, brevity penalty, length
    ratio, translation length and reference length.
  """
  matches_by_order = [0] * max_order
  possible_matches_by_order = [0] * max_order
  reference_length = 0
  translation_length = 0
  for (references, translation) in zip(reference_corpus,
                                       translation_corpus):
    reference_length += min(len(r) for r in references)
    translation_length += len(translation)

    merged_ref_ngram_counts = collections.Counter()
    for reference in references:
      merged_ref_ngram_counts |= get_ngrams(reference, max_order)
    translation_ngram_counts = get_ngrams(translation, max_order)
    overlap = translation_ngram_counts & merged_ref_ngram_counts
    for ngram in overlap:
      matches_by_order[len(ngram)-1] += overlap[ngram]
    for order in range(1, max_order+1):
      possible_matches = len(translation) - order + 1
      if possible_matches > 0:
        possible_matches_by_order[order-1] += possible_matches

  precisions = [0] * max_order
  for i in range(0, max_order):
    if possible_matches_by_order[i] > 0:
      precisions[i] = (float(matches_by_order[i]) /
                       possible_matches_by_order[i])
    else:
      precisions[i] = 0.0

  if min(precisions) > 0:
    ## TO_DO: compute the geometric mean of all modified precision scores
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
  else:
    geo_mean = 0


  ## TO_DO: compute the brevity penalty (BP)
  ratio = translation_length / reference_length
  if ratio > 1.0:
    bp = 1.0
  else:
    bp = math.exp(1.0 - 1.0 / ratio)

  # final bleu score
  bleu = geo_mean * bp

  return (bleu, precisions, bp, ratio, translation_length, reference_length)

In [210]:
# Evaluation
def compute_bleu_score(predictions, references, tokenizer=Tokenizer13a(), max_order=4):
      # if only one reference is provided make sure we still use list of lists
      if isinstance(references[0], str):
          references = [[ref] for ref in references]

      references = [[tokenizer(r) for r in ref] for ref in references]
      predictions = [tokenizer(p) for p in predictions]
      score = compute_bleu(
          reference_corpus=references, translation_corpus=predictions, max_order=max_order)
      (bleu, precisions, bp, ratio, translation_length, reference_length) = score
      return {
          "bleu": bleu,
          "precisions": precisions,
          "brevity_penalty": bp,
          "length_ratio": ratio,
          "translation_length": translation_length,
          "reference_length": reference_length,
      }

A small example of a real evaluation. Feel free to change the final_predictions function to make it more adaptable.

In [211]:
references = french_sentences[:5]
predictions = [final_predictions(re.sub(r'[^\w]', ' ', txt)) for txt in english_sentences[:5]]
compute_bleu_score(predictions, references, max_order=2)

{'bleu': 0.5901239588230737,
 'precisions': [0.7941176470588235, 0.5714285714285714],
 'brevity_penalty': 0.8760317528329519,
 'length_ratio': 0.8831168831168831,
 'translation_length': 68,
 'reference_length': 77}
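
As a sanity check, the reported score can be recomputed by hand from the other fields of the output above. The sketch below uses the same formulas as compute_bleu (geometric mean of the precisions times the brevity penalty) and takes the printed precisions and lengths as inputs.

```python
import math

# Values copied from the evaluation output above (max_order=2).
precisions = [0.7941176470588235, 0.5714285714285714]
translation_length = 68
reference_length = 77

# Geometric mean of the modified n-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: 1 when the translation is longer than the reference,
# exp(1 - reference/translation) otherwise.
ratio = translation_length / reference_length
bp = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)

bleu = geo_mean * bp
print(f"geo_mean={geo_mean:.4f} bp={bp:.4f} bleu={bleu:.4f}")
```

The translations are about 12% shorter than the references, so the brevity penalty alone lowers the score from the geometric mean of roughly 0.67 to the reported 0.59.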

## Exercises:

* Please complete the code under **TO_DO**
* Complete the evaluation metrics (BLEU) and evaluate the whole dataset.
* Train with more epochs. Does it improve the translations?
* Change the architecture of the neural network. Does it improve the translations? For example:
    * change the number of GRU layers
    * change embedding-size
    * try Bidirectional-RNN
* Finally, please submit the notebook with the best architecture settings that you found and comment on your results.


## 1) EUROPARL Dataset import

In [None]:
!wget https://www.statmt.org/europarl/v7/fr-en.tgz
!tar -xzf fr-en.tgz

--2024-06-18 18:38:32--  https://www.statmt.org/europarl/v7/fr-en.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.32.28
Connecting to www.statmt.org (www.statmt.org)|129.215.32.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202718517 (193M) [application/x-gzip]
Saving to: ‘fr-en.tgz’


2024-06-18 18:38:35 (88.0 MB/s) - ‘fr-en.tgz’ saved [202718517/202718517]



In [212]:
### Loading the dataset ###
english_path='/content/europarl-v7.fr-en.en'
french_path='/content/europarl-v7.fr-en.fr'

def load_data(path):
  input_file = os.path.join(path)
  with open(input_file, "r") as f:
    data = f.read()
  return data.split('\n')

english_sentences = load_data(english_path)
french_sentences = load_data(french_path)

num_sentences = 15000
english_sentences = english_sentences[:num_sentences]
french_sentences = french_sentences[:num_sentences]

print("Anglais : ", english_sentences[:3])
print(len(english_sentences))
print("Français : ", french_sentences[:3])
print(len(french_sentences))

Anglais :  ['Resumption of the session', 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful."]
15000
Français :  ['Reprise de la session', 'Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.', 'Comme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit. En revanche, les citoyens d\'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.']
15000


In [213]:
new_english_sentences = english_sentences[:5]

text_tokenizer = tokenize(new_english_sentences)
print(text_tokenizer.word_index)
text_tokenized = text_tokenizer.texts_to_sequences(new_english_sentences)

for sample_i, (sent, token_sent) in enumerate(zip(new_english_sentences, text_tokenized)):
  print('Sequence {} in x'.format(sample_i + 1))
  print('  Input:  {}'.format(sent))
  print('  Output: {}'.format(token_sent))


{'the': 1, 'of': 2, 'a': 3, 'in': 4, 'you': 5, 'session': 6, 'i': 7, 'on': 8, 'to': 9, 'have': 10, 'european': 11, 'like': 12, 'that': 13, 'as': 14, 'number': 15, 'countries': 16, 'requested': 17, 'this': 18, 'resumption': 19, 'declare': 20, 'resumed': 21, 'parliament': 22, 'adjourned': 23, 'friday': 24, '17': 25, 'december': 26, '1999': 27, 'and': 28, 'would': 29, 'once': 30, 'again': 31, 'wish': 32, 'happy': 33, 'new': 34, 'year': 35, 'hope': 36, 'enjoyed': 37, 'pleasant': 38, 'festive': 39, 'period': 40, 'although': 41, 'will': 42, 'seen': 43, 'dreaded': 44, "'millennium": 45, "bug'": 46, 'failed': 47, 'materialise': 48, 'still': 49, 'people': 50, 'suffered': 51, 'series': 52, 'natural': 53, 'disasters': 54, 'truly': 55, 'were': 56, 'dreadful': 57, 'debate': 58, 'subject': 59, 'course': 60, 'next': 61, 'few': 62, 'days': 63, 'during': 64, 'part': 65, 'meantime': 66, 'should': 67, 'observe': 68, "minute'": 69, 's': 70, 'silence': 71, 'members': 72, 'behalf': 73, 'all': 74, 'victims':

## Data Preprocessing



In [214]:
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)


print('Data Preprocessed.')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

print("Are all English sequences of the same length?", all(len(seq) == max_english_sequence_length for seq in preproc_english_sentences))
print("Are all French sequences of the same length?", all(len(seq) == max_french_sequence_length for seq in preproc_french_sentences))

Data Preprocessed.
Max English sentence length: 150
Max French sentence length: 147
English vocabulary size: 13086
French vocabulary size: 18974
Are all English sequences of the same length? True
Are all French sequences of the same length? True


In [215]:
# english_tokenizer_dataset.word_index
# preproc_french_sentences_dataset

In [216]:
def logits_to_text(logits, tokenizer):
  index_to_words = {id: word for word, id in tokenizer.word_index.items()}
  index_to_words[0] = '<PAD>'

  #For each time step, pick the word id with the highest score (argmax over the vocabulary axis)
  #Then map each id back to its word through the reversed word index
  return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

## Training

In [None]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """

    ## TO_DO: Improve the layers (See Exercise 4)
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    return model

In [None]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
print(f"Shape before reshaping: {tmp_x.shape}")
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))
print(f"Shape after reshaping: {tmp_x.shape}")

Shape before reshaping: (2000, 138)
Shape after reshaping: (2000, 138)


In [None]:
learning_rate = 0.005

simple_rnn_model2 = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

simple_rnn_model2.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

simple_rnn_model2.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 138, 256)          1271040   
                                                                 
 gru_5 (GRU)                 (None, 138, 256)          394752    
                                                                 
 time_distributed_10 (TimeD  (None, 138, 1024)         263168    
 istributed)                                                     
                                                                 
 dropout_5 (Dropout)         (None, 138, 1024)         0         
                                                                 
 time_distributed_11 (TimeD  (None, 138, 6588)         6752700   
 istributed)                                                     
                                                                 
Total params: 8681660 (33.12 MB)
Trainable params: 868

In [None]:
history=simple_rnn_model2.fit(tmp_x, preproc_french_sentences, batch_size=128, epochs=5, validation_split=0.3)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## 2) Dataset evaluation

In [None]:
from tqdm import tqdm
def final_predictions(text):
  y_id_to_word = {value: key for key, value in french_tokenizer.word_index.items()}
  y_id_to_word[0] = '<PAD>'

  sentence = [english_tokenizer.word_index[word.lower()] for word in text.split()]
  sentence = pad_sequences([sentence], maxlen=preproc_french_sentences.shape[-2], padding='post')
  french_translation = logits_to_text(simple_rnn_model.predict(sentence[:1], verbose=0)[0], french_tokenizer)
  return re.split(r"\s*<PAD>", french_translation, 1)[0]

references = french_sentences[:5]
predictions = [final_predictions(re.sub(r'[^\w]', ' ', txt)) for txt in english_sentences[:5]]
compute_bleu_score(predictions, references, max_order=2)

KeyError: 'bug'

## **I WAS UNABLE TO RESOLVE THE ERROR ABOVE (KeyError: 'bug'), WHICH PREVENTED ME FROM MAKING PREDICTIONS WITH THE NEW DATASET, SO I HAD TO DO THE REST OF THE LAB WITH THE SAMPLE DATASET**
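
A likely cause, judging from the tokenizer output earlier in the notebook: the Keras Tokenizer keeps apostrophes, so the vocabulary contains the token `bug'`, while the regex applied before prediction strips the apostrophe and leaves `bug`, which is not in `word_index`. One possible workaround, sketched below under that assumption, is to skip words that are missing from the vocabulary instead of indexing the dictionary directly. `encode_known_words` and the toy vocabulary are hypothetical names used only for this illustration.

```python
import re


def encode_known_words(text, word_index):
    """Map words to ids, skipping any word missing from word_index."""
    cleaned = re.sub(r"[^\w]", " ", text.lower())
    return [word_index[w] for w in cleaned.split() if w in word_index]


# Hypothetical miniature vocabulary; note the tokens that carry an apostrophe.
toy_word_index = {"the": 1, "dreaded": 2, "'millennium": 3, "bug'": 4, "failed": 5}

ids = encode_known_words("the dreaded 'millennium bug' failed", toy_word_index)
print(ids)  # [1, 2, 5]
```

Calling `english_tokenizer.texts_to_sequences([text])` on the raw sentence is another option: it applies the same filtering as during fitting and, when no `oov_token` is set, drops unknown words rather than raising. Either way, dropped words are lost to the model, so the translation quality for such sentences should be expected to suffer.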

In [236]:
# Keep only the references that match the 2000 sentences predicted below
references = french_sentences[:2000]

predictions = []
for txt in tqdm(english_sentences[:2000], desc="Prédictions"):
  clean_txt = re.sub(r'[^\w\']', ' ', txt)
  prediction = final_predictions(clean_txt)
  predictions.append(prediction)

compute_bleu_score(predictions, references, max_order=2)

Prédictions: 100%|██████████| 2000/2000 [02:35<00:00, 12.84it/s]


{'bleu': 0.714328185625965,
 'precisions': [0.9071064564022311, 0.7527594782077571],
 'brevity_penalty': 0.8644513190877329,
 'length_ratio': 0.8728590942523905,
 'translation_length': 24921,
 'reference_length': 28551}

## First results with the RNN (for lack of computational resources, I use only 2,000 sentences)

RNN characteristics:
- batch_size=1024,
- epochs=10,
- validation_split=0.2,
- learning_rate=0.005,
- 5 layers: Embedding / GRU / TimeDistributed / Dropout / TimeDistributed

We obtain:

**Bleu** : 0.7143

**Precisions** : [0.9071, 0.7528]

**brevity_penalty** : 0.8645

**length_ratio** : 0.8729

**translation_length** : 24921

**reference_length** : 28551
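
The metrics above are linked by the standard BLEU definition: the score is the brevity penalty multiplied by the geometric mean of the n-gram precisions. The sketch below recomputes the score from the reported components. It assumes that `compute_bleu_score` follows this standard definition with uniform weights and no smoothing, which the notebook does not show explicitly.

```python
import math

# Values copied from the BLEU output reported above.
precisions = [0.9071064564022311, 0.7527594782077571]
translation_length = 24921
reference_length = 28551

# Brevity penalty: 1 if the output is at least as long as the reference,
# otherwise exp(1 - reference_length / translation_length).
length_ratio = translation_length / reference_length
brevity_penalty = 1.0 if length_ratio >= 1.0 else math.exp(1.0 - 1.0 / length_ratio)

# Geometric mean of the n-gram precisions (uniform weights, max_order=2).
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = brevity_penalty * geo_mean
print(round(brevity_penalty, 4), round(bleu, 4))  # 0.8645 0.7143
```

Read this way, the brevity penalty alone removes roughly 14% of the score: the predictions are about 13% shorter than the references, even though unigram precision is above 0.90.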





## 3) Train with more epochs. Does it improve the translations?

In [237]:
simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

simple_rnn_model.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

history=simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)

predictions = []
for txt in tqdm(english_sentences[:2000], desc="Prédictions"):
  clean_txt = re.sub(r'[^\w\']', ' ', txt)
  prediction = final_predictions(clean_txt)
  predictions.append(prediction)

compute_bleu_score(predictions, references, max_order=2)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Prédictions: 100%|██████████| 2000/2000 [02:30<00:00, 13.29it/s]


{'bleu': 0.716483611506371,
 'precisions': [0.9098103808452515, 0.757076708020269],
 'brevity_penalty': 0.8632982779433895,
 'length_ratio': 0.8718433680081258,
 'translation_length': 24892,
 'reference_length': 28551}

With the same configuration but 20 epochs instead of 10, we obtain the following metrics:

**Bleu** : 0.7165

**Precisions** : [0.9098, 0.7571]

**brevity_penalty** : 0.8633

**length_ratio** : 0.8718

**translation_length** : 24892

**reference_length** : 28551

The BLEU score is almost unchanged (0.7165 versus 0.7143 with 10 epochs), and the other metrics differ only marginally. Beyond a certain threshold, the number of epochs is no longer a decisive factor.
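
One common way to find that threshold automatically is early stopping: training halts once the validation loss has not improved for a given number of epochs. The sketch below reimplements the idea in plain Python. The function name `early_stopping_epoch` and the loss values are hypothetical and are not taken from this notebook's training logs.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses for illustration only.
illustrative_val_losses = [1.20, 0.80, 0.60, 0.55, 0.56, 0.55, 0.57, 0.56]
stop = early_stopping_epoch(illustrative_val_losses, patience=3)
print(stop)  # 7
```

In Keras, the `EarlyStopping` callback (with `monitor='val_loss'` and a `patience` value) provides the same behaviour and can be passed to `fit` through its `callbacks` argument.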

## 4) Change the architecture of the neural network. Does it improve the translations?


### -->  Number of GRU layers = 3

In [239]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(GRU(256, return_sequences=True))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    return model

simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

simple_rnn_model.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

history=simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

predictions = []
for txt in tqdm(english_sentences[:2000], desc="Prédictions"):
  clean_txt = re.sub(r'[^\w\']', ' ', txt)
  prediction = final_predictions(clean_txt)
  predictions.append(prediction)

compute_bleu_score(predictions, references, max_order=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Prédictions: 100%|██████████| 2000/2000 [02:38<00:00, 12.64it/s]


{'bleu': 0.7143483148839451,
 'precisions': [0.9078424286230574, 0.7534384141815482],
 'brevity_penalty': 0.8637357732958085,
 'length_ratio': 0.8722286434800882,
 'translation_length': 24903,
 'reference_length': 28551}

**Bleu** : 0.7143

**Precisions** : [0.9078, 0.7534]

**brevity_penalty** : 0.8637

**length_ratio** : 0.8722

**translation_length** : 24903

**reference_length** : 28551

Adding GRU layers does not change the results much either (BLEU 0.7143, the same as the single-layer baseline to four decimal places). We could have studied the impact of the number of units in the GRU layer rather than the number of layers.
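
To compare "more layers" with "more units" on an equal footing, it can help to count trainable parameters. The sketch below uses the parameter formula for a Keras GRU layer, assuming the TensorFlow 2 default `reset_after=True` (two bias vectors per gate). The helper name `gru_param_count` and the GRU(512) alternative are hypothetical and were not run in this notebook.

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Trainable parameters of one GRU layer (3 gates: update, reset, candidate)."""
    bias_per_gate = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias_per_gate)

embedding_dim = 256

# Three stacked GRU(256) layers, as in the experiment above.
stacked = gru_param_count(embedding_dim, 256) + 2 * gru_param_count(256, 256)

# Hypothetical alternative: a single, wider GRU(512) layer.
wide = gru_param_count(embedding_dim, 512)

print(stacked, wide)  # 1184256 1182720
```

Under these assumptions the two variants have almost the same recurrent parameter budget, so a single GRU(512) layer would be a fair point of comparison for the three-layer model.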

### -->  Embedding size = 512

In [240]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    model = Sequential()
    model.add(Embedding(english_vocab_size, 512, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    return model

simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

simple_rnn_model.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

history=simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

predictions = []
for txt in tqdm(english_sentences[:2000], desc="Prédictions"):
  clean_txt = re.sub(r'[^\w\']', ' ', txt)
  prediction = final_predictions(clean_txt)
  predictions.append(prediction)

compute_bleu_score(predictions, references, max_order=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Prédictions: 100%|██████████| 2000/2000 [02:38<00:00, 12.59it/s]


{'bleu': 0.7152663616859067,
 'precisions': [0.9074490287365549, 0.7548001396404259],
 'brevity_penalty': 0.8642526006518635,
 'length_ratio': 0.872683969037862,
 'translation_length': 24916,
 'reference_length': 28551}

**Bleu** : 0.7153

**Precisions** : [0.9074, 0.7548]

**brevity_penalty** : 0.8643

**length_ratio** : 0.8727

**translation_length** : 24916

**reference_length** : 28551


### -->  Bidirectional-RNN

In [241]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(Bidirectional(GRU(256, return_sequences=True)))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    return model

simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

simple_rnn_model.compile(loss=sparse_categorical_crossentropy,
                         optimizer=Adam(learning_rate),
                         metrics=['accuracy'])

history=simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

predictions = []
for txt in tqdm(english_sentences[:2000], desc="Prédictions"):
  clean_txt = re.sub(r'[^\w\']', ' ', txt)
  prediction = final_predictions(clean_txt)
  predictions.append(prediction)

compute_bleu_score(predictions, references, max_order=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Prédictions: 100%|██████████| 2000/2000 [02:48<00:00, 11.90it/s]


{'bleu': 0.8034979934093085,
 'precisions': [0.9785226816539543, 0.8838061981667394],
 'brevity_penalty': 0.8640140935655726,
 'length_ratio': 0.872473818780428,
 'translation_length': 24910,
 'reference_length': 28551}

**Bleu** : 0.8035

**Precisions** : [0.9785, 0.8838]

**brevity_penalty** : 0.8640

**length_ratio** : 0.8725

**translation_length** : 24910

**reference_length** : 28551


Of the models we tested, the bidirectional RNN performed best on these metrics: it achieved the highest BLEU score (0.8035), about 0.09 above the second-best model (0.7165, the GRU trained for 20 epochs). It also had the highest n-gram precisions (0.9785 for unigrams and 0.8838 for bigrams).
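
The comparison can be reproduced from the BLEU values reported in the cells above. The short sketch below ranks the five configurations and computes the gap between the best and the second best; the configuration labels are shorthand introduced here, not names used in the notebook.

```python
# BLEU scores copied from the outputs above (2000 sentences, max_order=2).
results = {
    "GRU, 10 epochs (baseline)": 0.714328185625965,
    "GRU, 20 epochs": 0.716483611506371,
    "3 stacked GRU layers": 0.7143483148839451,
    "Embedding size 512": 0.7152663616859067,
    "Bidirectional GRU": 0.8034979934093085,
}

# Sort configurations from highest to lowest BLEU.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
best, runner_up = ranked[0], ranked[1]
gap = best[1] - runner_up[1]

for name, bleu in ranked:
    print(f"{name:<28} {bleu:.4f}")
print(f"Gap between best and runner-up: {gap:.4f}")
```

The four unidirectional variants all fall within about 0.002 BLEU of each other, so the bidirectional layer is the only change tested here that makes a clear difference.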