# Language Translator

Made by <a href="https://github.com/SeanvonB">SeanvonB</a> | <a href="https://github.com/SeanvonB/language-translator">Source</a>

This project was part of my [Natural Language Processing Nanodegree](https://www.udacity.com/course/natural-language-processing-nanodegree--nd892), which I completed in late 2020. This particular Nanodegree – in fact, this particular *project* – had been my goal throughout my studies of machine learning. I was just so excited to work on it back then, and I'm still excited to share the work with you now. Machine translation has a long and fascinating history that involved [many](https://en.wikipedia.org/wiki/Rule-based_machine_translation) [different](https://en.wikipedia.org/wiki/Statistical_machine_translation) [approaches](https://en.wikipedia.org/wiki/Example-based_machine_translation) before the widespread commercial adoption of [Neural Machine Translation](https://en.wikipedia.org/wiki/Neural_machine_translation) (NMT) around 2016 or so. The following NMT pipeline, which I created with TensorFlow via Keras, reflects some of the state-of-the-art ideas from that time period, but it was already somewhat outdated when I built it in 2020, thanks largely to [Google Brain](https://research.google/teams/brain/)'s [Transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) model with attention.

This notebook includes three main sections:
1.	Preprocessing, where I examine, tokenize, and pad the dataset.
2.	Models, where I showcase three different network features on their own before combining them into the final model.
3.	Prediction, where I show how the trained model performs.

Let's get started with a whole bunch of workspace helpers and imports:

In [1]:
%load_ext autoreload
%aimport helper
%autoreload 1

In [2]:
import collections
import helper
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Ain't nobody got time for training networks on CPU, so this cell lists the devices that TensorFlow can see, which shows whether the running workspace has access to a GPU, whether through a Udacity Workspace, Amazon Web Services, Google Cloud Platform, or an onboard device. Here's what this run reported:

In [3]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 5959198815678897253
xla_global_id: -1
]


# 1.0 Preprocessing

## 1.1 Dataset

Language datasets are some of the oldest, largest, and best-maintained datasets available to data science, and the most commonly used translation sets are apparently those from [WMT](http://www.statmt.org/). However, these sets are **enormous**, so Udacity provided truncated versions of these datasets as vocabulary subsets that can train simple networks much faster. These files, for English and French, are located in the `data` directory and will be loaded below using the provided `helper.py` module:

In [4]:
# Load English data
english_sentences = helper.load_data('data/small_vocab_en')

# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


## 1.2 Sample the Data
Each index of `small_vocab_en` and `small_vocab_fr` contains the same sentence in its respective language.

The following simply prints the first two pairs:

In [5]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  paris is sometimes pleasant during october , but it is sometimes quiet in june .
small_vocab_fr Line 1:  paris est parfois agréable en octobre , mais il est parfois calme en juin .
small_vocab_en Line 2:  new jersey is never hot during june , and it is beautiful in september .
small_vocab_fr Line 2:  new jersey est jamais chaud en juin , et il est beau en septembre .


Obviously, this data has already undergone some preprocessing, because everything is lowercase and the punctuation is delimited with spaces. This isn't surprising, as these samples come from established datasets that are used for research, but lowercasing the text and separating the punctuation would otherwise have been Steps 1 and 2.

## 1.3 Vocabulary Complexity

In this instance, "complexity" refers to the total number of words in the dataset and the number of unique words among them. You can probably intuit that more "complex" problems require more complex solutions, so the following will provide some insight into the complexity of what Udacity selected:

In [6]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1731746 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1862955 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


For comparison, Lewis Carroll's *Alice's Adventures in Wonderland* has 15,500 total words and 2,766 unique words.

So, there isn't that much complexity to this dataset.

## 1.4 Tokenize the Vocabulary

There are many steps involved in assembling a computer vision pipeline that a natural language processing pipeline can thankfully skip. However, there's one significant difference that must be addressed: unlike image data, language data isn't already numerical. Networks can't perform massive matrix maths on letters.

That's where **tokenizing** comes in. Tokenizing can occur at the character level; but, for this application, I'll tokenize at the word level. This will create a dictionary of word IDs that each represent one word. Fortunately, this process is very easy with the Keras [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) object.

I'll also print the outcome as an example:

In [7]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    x_tk = Tokenizer(char_level = False)
    x_tk.fit_on_texts(x)
    
    return x_tk.texts_to_sequences(x), x_tk

# Test function and print results
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


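As you can see, the `Tokenizer` lowercases the text, strips the punctuation, and assigns IDs by word frequency: the most common words get the lowest numbers, and ties go in the order that the words first appear. That's why `a`, which shows up twice, gets ID `3` ahead of `brown`.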
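
To make the ID assignment concrete, here's a rough pure-Python sketch of what the `Tokenizer` does with its default settings. The helper names `split_words`, `build_word_index`, and `texts_to_ids` are hypothetical and not part of Keras; the sketch assumes the default behavior of lowercasing, filtering punctuation (except apostrophes), and ranking words by frequency:

```python
import collections
import string

# Punctuation to strip; like the Keras default filter, this keeps apostrophes
PUNCTUATION = string.punctuation.replace("'", "")
_TO_SPACES = str.maketrans({ch: " " for ch in PUNCTUATION})

def split_words(sentence):
    """Lowercase a sentence, replace punctuation with spaces, and split on whitespace."""
    return sentence.lower().translate(_TO_SPACES).split()

def build_word_index(sentences):
    """Assign IDs starting at 1, most frequent word first (0 stays free for padding)."""
    counts = collections.Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    # most_common() sorts by count; equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(sentences, word_index):
    """Convert sentences to lists of word IDs, skipping words that were never seen."""
    return [[word_index[w] for w in split_words(s) if w in word_index] for s in sentences]

sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
word_index = build_word_index(sentences)
print(texts_to_ids(sentences, word_index)[2])  # [18, 19, 3, 20, 21]
```

On these three test sentences, the sketch reproduces the same IDs that the real `Tokenizer` printed above, but it's only meant as an illustration, not a replacement.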

## 1.5 Pad the Inputs

The network will expect every batch of word ID sequences (an abstract way of saying "batch of sentences") to be the same length, but that doesn't naturally occur in either dimension: length varies between different sentences within each language and between the same sentence in different languages. Since sentences/sequences are fully dynamic in length, **padding** must be added to the **end** of each sequence to make them all as long as the longest sample in the dataset.

Keras provides another function, [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences), for just this purpose:

In [8]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    if length is None:
        length = max([len(sentence) for sentence in x])
    
    return pad_sequences(x, maxlen = length, padding = "post")

# Test function and print results
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]
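
For intuition, post-padding can be sketched in a few lines of plain Python. The `pad_post` name is hypothetical; the sketch returns nested lists rather than a NumPy array, and it assumes the `pad_sequences` default of dropping items from the *front* of any sequence that is longer than the target length:

```python
def pad_post(sequences, length=None, value=0):
    """Pad each sequence at the end with `value` until it is `length` items long."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the last `length` items (like truncating="pre"),
        # then fill the tail with the padding value
        kept = list(seq[max(len(seq) - length, 0):])
        padded.append(kept + [value] * (length - len(kept)))
    return padded

print(pad_post([[1, 2, 4], [10, 11], [18]]))  # [[1, 2, 4], [10, 11, 0], [18, 0, 0]]
```

Because the padding value is `0`, that ID has to stay reserved: no real word may ever be assigned to it.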


## 1.6 Preprocess Pipeline

Here's the full preprocessing pipeline, which includes the above `tokenize` and `pad` functions, plus a `.reshape()` of the labels to accommodate how Keras implements `sparse_categorical_crossentropy`, the loss function that I've chosen for this project. Finally, the vocabulary sizes must be increased by `1` to account for the new `<PAD>` token (ID `0`) – this dumb thing had me stumped for a while.

In [9]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Loss function requires labels to be in 3D
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]

# Add 1 for <PAD> token
english_vocab_size = len(english_tokenizer.word_index) + 1
french_vocab_size = len(french_tokenizer.word_index) + 1

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 200
French vocabulary size: 345
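
Here's a toy illustration of the two details that tripped me up, using nested lists in place of NumPy arrays. The three-word `word_index` and the label values are made up for the example:

```python
# Toy word index: IDs start at 1, so 0 is free to act as the <PAD> token
word_index = {"new": 1, "jersey": 2, "est": 3}
padded_labels = [[1, 2, 3, 0], [3, 0, 0, 0]]   # shape (2, 4)

# The output layer needs one class per possible ID, including 0
num_classes = len(word_index) + 1
assert all(0 <= word_id < num_classes for row in padded_labels for word_id in row)

# Equivalent of reshape(*shape, 1): wrap each ID in its own list -> shape (2, 4, 1)
labels_3d = [[[word_id] for word_id in row] for row in padded_labels]
print(num_classes, labels_3d[0])  # 4 [[1], [2], [3], [0]]
```

Without the extra class, the highest word ID would fall outside the range of the output layer.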


And that's all for data preprocessing!

# 2.0 Models

This section showcases some experimentation with neural network architectures. From the start, I was pretty certain that the final architecture would use all of the tested features, so the goal wasn't to pick a winner; I was mostly just curious how much of an impact each would have on performance.

Here are the four architectures that will be shown in this section:
1.	Simple RNN
2.	RNN with Embedding
3.	Bidirectional RNN
4.	Final Model

But, first, there's an issue with what all of these models will output...

## 2.1 IDs to Text

Everything that was done to preprocess the data was done to help the network handle it. But, regardless of the architecture, every model must end with a function that converts the base output – a sequence of word IDs – back into sentences that humans can understand. That's what the following `logits_to_text` function does:

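Note: strictly speaking, **logits** are the raw scores that a network produces for each candidate word. In this function, the `logits` argument holds the model's per-position scores over the French vocabulary (softmax probabilities, in the models below), and `np.argmax` picks the highest-scoring word ID at each position within a sequence.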

In [10]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
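
The decoding step can be sketched without NumPy to show what's going on at each position. The `decode_scores` name, the four-word vocabulary, and the scores are all invented for this example:

```python
def decode_scores(scores, index_to_word):
    """Pick the highest-scoring word ID at each position and join the words."""
    words = []
    for position_scores in scores:
        # Plain-Python argmax: the index whose score is largest
        best_id = max(range(len(position_scores)), key=position_scores.__getitem__)
        words.append(index_to_word[best_id])
    return " ".join(words)

index_to_word = {0: "<PAD>", 1: "est", 2: "paris", 3: "calme"}
scores = [
    [0.1, 0.1, 0.7, 0.1],   # position 0 -> ID 2
    [0.0, 0.8, 0.1, 0.1],   # position 1 -> ID 1
    [0.1, 0.2, 0.1, 0.6],   # position 2 -> ID 3
    [0.9, 0.0, 0.1, 0.0],   # position 3 -> ID 0
]
print(decode_scores(scores, index_to_word))  # paris est calme <PAD>
```

This is greedy decoding: each position is decided independently, which is simple but can't recover from an early mistake.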


## 2.2 Model #1: Simple RNN

It feels pedantic to say "a simple RNN" – y'know, just your run-of-the-mill *Recurrent Neural Network*. There isn't anything simple about RNNs, which I used previously in my [Image Captioner](https://seanvonb.github.io/image-captioner/) project. What RNNs added that previous neural networks lacked is **memory between steps**. As you can see in the following diagram, each step passes information both **out of the network** and **forward to the next step**, which allows the network to handle sequential data, like language, where subsequent outputs are determined as much by previous outputs as they are by current inputs.

<img src='images/simple.png' width="100%" height="auto" style="max-width: 800px;">
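
The "memory between steps" idea fits in a few lines of plain Python. This is a one-unit *vanilla* RNN with made-up weights, not the gated GRU that the models below actually use, so treat it as a sketch of the recurrence only:

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = 0.0          # memory carried from step to step
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)   # each step also emits its state as output
    return states

# Only the first input is non-zero, yet later states are non-zero too
print(rnn_forward([1.0, 0.0, 0.0]))
```

The later states stay above zero even though their inputs are zero, because the first step's information is carried forward through `h`.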

But this project will build upon this foundation with some new twists; so, for this notebook, I'll start with a *simple* RNN:

In [11]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.01
    
    input_seq = Input(input_shape[1:])
    rnn = GRU(256, return_sequences = True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    
    model = Model(input_seq, Activation("softmax")(logits))
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])

    return model

# Reshape input to work with base Keras RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

Epoch 1/10
  7/108 [>.............................] - ETA: 42s - loss: 3.5084 - accuracy: 0.3735


KeyboardInterrupt



Well, that's an actual sentence... with essentially the opposite of the intended meaning.

## 2.3 Model #2: RNN with Embedding

Word IDs are a pretty basic way to represent a word for the network; there's a better way: **word embeddings**. Unlike word IDs, which represent each word as a single arbitrary integer, word embeddings represent words as vectors in n-dimensional space, i.e. a big cloud of words, where similar words can cluster closer to each other. Word embeddings can help the network understand nuances in language, like how `hot` can be closer to `cold` in one dimension and closer to `sexy` in another. In the example below, you can see the word `the` – with the word ID `8` – being embedded as the vector `[0.2, 4, 2.4, 1.1, ...]`, which continues for `n` dimensions.

<img src='images/embedding.png' width="100%" height="auto" style="max-width: 800px;">
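
Under the hood, an embedding layer is essentially a lookup table with one row per word ID. Here's a minimal sketch with hand-picked 3-dimensional vectors; the values are invented for illustration, whereas a real layer learns them during training:

```python
import math

# Hypothetical 3-dimensional embeddings; row i is the vector for word ID i
embedding_matrix = [
    [0.0, 0.0, 0.0],    # 0: <PAD>
    [0.9, 0.1, 0.3],    # 1: hot
    [0.8, 0.2, -0.4],   # 2: cold
    [-0.5, 0.9, 0.1],   # 3: truck
]

def embed(word_ids):
    """Look up the vector for each word ID."""
    return [embedding_matrix[i] for i in word_ids]

def cosine(u, v):
    """Cosine similarity: closer to 1 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

hot, cold, truck = embed([1, 2, 3])
print(cosine(hot, cold) > cosine(hot, truck))  # True
```

With these toy vectors, `hot` sits closer to `cold` than to `truck`, which is the kind of clustering that trained embeddings tend to discover on their own.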

The following uses a Keras `Embedding` layer with `n` set to `256`:

In [12]:
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.01
    
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length = output_sequence_length))
    model.add(GRU(256, return_sequences = True))
    model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
    
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    
    return model


# Reshape input
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)

# Train network
embed_rnn_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


Now that's pretty good!

## 2.4 Model #3: Bidirectional RNN

An RNN allows the model to handle sequential data, like language; but a bidirectional RNN allows the model to handle language *better*. That's because a bidirectional RNN can also see *future* inputs! That might not be necessary for rote and inflexible sentence structures, but most instances of English will feature split, subordinate, or conditional clauses, phrasal verb tenses, or prepositional phrases – these can cause all manner of unusual splices and inversions of sentence structure. And that's *just* English – I have no idea what linguistic chicanery French gets up to!

<img src='images/bidirectional.png' width="100%" height="auto" style="max-width: 800px;">
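
To show what "seeing future inputs" means, here's a toy bidirectional pass built from a one-unit vanilla RNN. It's a simplification under two loud assumptions: both directions share the same made-up weights (the Keras `Bidirectional` layer trains a separate copy for each direction), and the cell is a plain `tanh` recurrence rather than a GRU:

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """One-unit vanilla RNN: return the hidden state after each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    """Pair each step's forward state with the state of a right-to-left pass."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]   # re-align with original positions
    return list(zip(forward, backward))

# The only non-zero input is at the END of the sequence
states = run_bidirectional([0.0, 0.0, 1.0])
print(states[0])
```

At the first position, the forward state is still `0.0` because nothing has happened yet, but the backward state is already non-zero: it carries information from the last input back to the start.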

This time, the model features a Keras `Bidirectional` layer:

In [13]:
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a bidirectional RNN model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.001
    
    model = Sequential()
    model.add(Bidirectional(GRU(256, return_sequences = True), input_shape = input_shape[1:]))
    model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
    
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    
    return model 

# Train network
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

bd_rnn_model = bd_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
bd_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(bd_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

Train on 110288 samples, validate on 27573 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
new jersey est parfois calme en mois et il il il en en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


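Uh-oh, that's somehow worse... Oh, of course! A `Bidirectional` layer has *two* GRUs to train, and this model also lacks the `Embedding` layer and uses a learning rate ten times smaller than Model #2 (`0.001` instead of `0.01`), so it probably just needs more training time!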

## 2.5 Model #4: Final Model

At this point, the architecture is becoming a little complicated, and its training needs are becoming a little less reasonable. But you know I'm still gonna mash the three previous approaches together with some `Dropout` and see what happens. Clearly, the `Embedding` layer had by far the most significant impact; however, I'm curious whether the `Bidirectional` layer will perform better on word embeddings. The following model begs for more training time, but I gave this one the same `10` epochs that the previous models had.

Here's the final model:

In [12]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that combines word embedding, a bidirectional RNN, and dropout on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.001
    
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256,
                        input_length = output_sequence_length,
                        input_shape = input_shape[1:]))
    model.add(Bidirectional(GRU(256, return_sequences = True)))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation = "softmax")))
    
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    
    return model

print('Final Model Loaded')

# Train network
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

final_rnn_model = model_final(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
final_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=2, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(final_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

Final Model Loaded
Epoch 1/2
Epoch 2/2
paris est parfois agréable en octobre mais il est parfois calme en juin <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


That's almost dead on, with only an errant `space` in `l'automne` from the printed sample.

# 3.0 Prediction

The following function was provided by Udacity to assess my work on the Nanodegree assignment, which was found to be successful:

In [35]:
def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    x = pad(x, max_french_sequence_length)
    
    model = model_final(
        x.shape,
        y.shape[1],
        english_vocab_size,
        french_vocab_size)
    model.fit(x, y, batch_size=1024, epochs=3, validation_split=0.2)

    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = ''

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))

final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Sample 1:
il a pas vu camion camion               
Il a vu un vieux camion jaune
Sample 2:
new jersey est parfois calme pendant l' et il il il en en        
new jersey est parfois calme pendant l' automne et il est neigeux en avril       
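
One caveat about the assessment code: `x_tk.word_index[word]` raises a `KeyError` for any word that wasn't in the training vocabulary. Here's a hedged sketch of a more forgiving encoder; the `encode_sentence` helper and the tiny `toy_index` are hypothetical, and dropping unknown words is just one possible policy:

```python
def encode_sentence(sentence, word_index, length, pad_id=0):
    """Turn a raw sentence into a fixed-length list of word IDs.

    Returns the padded IDs plus the list of words that had to be skipped.
    """
    words = sentence.lower().split()
    unknown = [w for w in words if w not in word_index]
    ids = [word_index[w] for w in words if w in word_index]
    ids = ids[:length]                                  # truncate if too long
    return ids + [pad_id] * (length - len(ids)), unknown

toy_index = {"he": 1, "saw": 2, "a": 3, "old": 4, "yellow": 5, "truck": 6}
ids, unknown = encode_sentence("He saw a shiny yellow truck", toy_index, length=8)
print(ids, unknown)  # [1, 2, 3, 5, 6, 0, 0, 0] ['shiny']
```

Reporting the skipped words makes it obvious when a translation is bad because of the vocabulary rather than the model.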


I'm still floored by achieving 95% validation accuracy after only 10 epochs, because this model could still benefit from so much more training. Further enhancements to this architecture could also be made, like the encoder-decoder arrangement I used for the [Image Captioner](https://seanvonb.github.io/image-captioner/). I'm so proud to have reached this point, and I hope you found the journey interesting.

Thanks for reading!
