# Neural Machine Translation with Tensorflow 2.0: English - Russian
## By: Jacob Gursky

Hello all!  

This project deals with a booming subfield of modern machine learning: **neural machine translation**, also called NMT.  NMT has seen huge gains over previous methods, and even has applications in other fields of machine learning, such as generating contextual embeddings.

This tutorial assumes you already have some familiarity with neural networks, and more specifically Tensorflow and Keras.  

So what exactly is NMT?  At a high level, neural machine translation takes advantage of sequence-to-sequence modeling, which attempts to generate a variable length sequence given an arbitrarily long input sequence.  This is a huge advantage over previous network designs, as we are no longer constrained by sequence length!  The network is typically designed with three portions, an encoder, a decoder, and an attention mechanism.  

The encoder is usually constructed with an initial embedding layer and a recurrent layer or two in-between, producing a thought vector that is then passed to the decoder along with the last hidden state. The decoder is usually the mirror image: a recurrent layer or two with a corresponding dense layer that outputs the translated sequence.  The attention mechanism is a set of dense layers that produce a context vector that helps the network focus only on important words.  This usually helps with longer sequences.

I won't get too in-depth on attention mechanisms, but here is a useful link if you are interested in learning more:

http://akosiorek.github.io/ml/2017/10/14/visual-attention.html

Okay, now that we have a brief understanding of NMT, we are going to dive into the code!

If you want a more in-depth explanation of seq2seq models, the tensorflow notebook is fantastic!

https://github.com/tensorflow/nmt

Also, I would like to make clear that a good portion of the code used below comes from the offical Tensorflow 2.0 tutorial on NMT, though modified considerably. The link is provided below:

https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

Furthermore, some code has been adapted from this article, which provides a much lower-level introduction to Seq2Seq modeling:

https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f

### Loading Packages

Below are the packages needed to run this notebook.  Note that we are using the Tensorflow 2.0 preview for this project, though you could also use 1.13 with eager execution enabled.  Note that as I am running this on my laptop, I do not have access to a GPU, though I would highly highly recommend using one!

In [1]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time
import random

try:
    tf.enable_eager_execution()
except:
    pass
print('Tensorflow Version: ', tf.__version__)
print('Using Eager Execution?: ', tf.executing_eagerly())

Tensorflow Version:  2.0.0-dev20190305
Using Eager Execution?:  True


In [2]:
# First lets make sure we are operating on GPU
if not tf.test.gpu_device_name():
    print('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

No GPU found. Please use a GPU to train your neural network.


## About the data

The data I am using for this tutorial comes from the last article I referenced above, which contains roughly 130K english and french sentence pairs that use a relatively small number of words, making it an excellent small dataset to play with!  However, we are only going to use the english sentences, and translate as many as we can into russian using the wonderful Yandex Translate API.

## Defining our Helper Functions

First we need to define our function to load our sentences from the needed text file:

In [3]:
# Defining our reading function for pulling in data
def load_data(path):
    with open(path, 'r', encoding='utf-8') as file:
        data = file.read()
    return data

Now we need a helper function to pad our punctutation with whitespaces so our model treats them as separate tokens!

In [4]:
# Looks like we need to whitespace the punctuation
def whitespace_punct(sent_list):
    whitespaced = [re.sub('([.,!?;()"])', r' \1 ', x).strip() for x in sent_list]
    whitespaced = [re.sub('\s{2,}', ' ', x).strip() for x in whitespaced]
    whitespaced = [x.replace('-',' - ') for x in whitespaced]
    return whitespaced

Now lets load our english sentences into memory and take a look at what we are dealing with:

In [5]:
# Importing our small english vocab
eng = load_data('small_vocab_en.txt')
print('Number of characters: ', len(eng))
eng = eng.split('\n')
print('Number of sentences: ', len(eng))

Number of characters:  9085267
Number of sentences:  137861


## Using Yandex Translate

Now that we have our english sentences, we need to use the `yandex-translater` package to access the Yandex Translate API for Python to translate each of our sentences.  Note that there is a 10 million character free limit per month to use this API, so I will use as many sentences as can be translated in time!  Note that there are some requirements to using the translation results that can be seen here:

https://tech.yandex.com/translate/doc/dg/concepts/design-requirements-docpage/

In [None]:
# Using Yandex translater to create our target corpus
# Determining how many we need
rus = load_data('small_vocab_ru.txt')
rus = rus.split('\n')
rus_transl = rus

start_word = len(rus_transl)
print('Starting at position', start_word)

# Doing the translation
from yandex.Translater import Translater
tr = Translater()
tr.set_key('get your own API key!')
tr.set_from_lang('en')
tr.set_to_lang('ru')

for i in range(start_word, len(eng)):
    #print(i+1)
    tr.set_text(eng[i])
    rus_transl.append(tr.translate())

Now that we have translated as many sentences as possible, we will save the results for persistence:

In [None]:
# Saving our translations to a txt file
if len(rus_transl) != 0:
    f = open('small_vocab_ru.txt','w', encoding = 'utf-8')
    f.write('\n'.join(rus_transl))
    f.close()
    print('Saved Translated Data!')
else:
    print("Whoops, looks like the existing Russian corpus isn't loaded!")

Now lets load our data back into memory and take a look at a sentence pair to see if our data is aligned properly:

In [6]:
# Lets compare lines and make sure everything is correct
eng_test = load_data('small_vocab_en.txt')
rus_test = load_data('small_vocab_ru.txt')
eng_test = eng_test.split('\n')
rus_test = rus_test.split('\n')
max_len = min([len(eng_test),len(rus_test)])
random_line = random.sample(range(0,max_len),1)[0]
print(eng_test[random_line])
print(rus_test[random_line])
# Everything looks fine and lined up, we probably just need more data

he thinks it's easy to translate english to portuguese .
он думает, что это легко переводить с английского на португальский .


Powered by Yandex Translate

http://translate.yandex.com/

Looks like everything is aligned properly!  Now lets define some more helper functions to aid in setting up our models.  First we need a function to preprocess our sentences, removing whitespaces and adding the start and end tokens.

In [7]:
def preprocess_sentence(x):
    # Making sure sentences are in lowercase and removing leading and trailing whitespaces
    x = x.lower().rstrip().strip()
    
    # Adding our start and stop tokens
    x = '<start> ' + x + ' <end>'
    return x

We can bundle all of the above to create a helper function that neatly prepares all of our data:

In [8]:
def create_dataset(eng_path, rus_path):
    # First english
    eng_corpus = load_data(eng_path)
    eng_corpus = eng_corpus.split('\n')
    eng_corpus = [preprocess_sentence(x) for x in eng_corpus]
    
    # Now for Russian
    rus_corpus = load_data(rus_path)
    rus_corpus = rus_corpus.split('\n')
    rus_corpus = [preprocess_sentence(x) for x in rus_corpus]
    
    # Slimming down data as the lengths may differ
    last_pair = min([len(eng_corpus),len(rus_corpus)])
    eng_corpus, rus_corpus = eng_corpus[0:last_pair], rus_corpus[0:last_pair]

    return eng_corpus, rus_corpus

We also need to define a few more helper functions that will be used later, such as determing the max length of our sequences for padding, tokenizing our sentences, and another helper function that combines a few that we have defined already:

In [9]:
def max_length(tensor):
    return max(len(t) for t in tensor)
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [10]:
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    
    lang_tokenizer.fit_on_texts(lang)
    
    tensor = lang_tokenizer.texts_to_sequences(lang)
    
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
  
    return tensor, lang_tokenizer
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [11]:
def load_dataset(eng_path, rus_path):
    # creating cleaned input, output pairs
    inp_lang, targ_lang = create_dataset(eng_path, rus_path)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

## Preparing the Data

Now we are finally ready to load our data in the format that we will feed into the NMT network!  We also use a train-test split of 20% for validation purposes using the scikit-learn implementation:

In [12]:
# Loading our data
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset('small_vocab_en.txt','small_vocab_ru.txt')

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

In [13]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

(6085, 6085, 1522, 1522)

Now that we have all of our data neatly prepared, lets take a look at how the tokenization process is working:

In [14]:
def convert(lang, tensor):
    for t in tensor:
        if t!=0:
            print ("%d ----> %s" % (t, lang.index_word[t]))
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [15]:
print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ("\nTarget Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

Input Language; index to word mapping
3 ----> <start>
101 ----> we
88 ----> dislike
78 ----> lemons
2 ----> ,
84 ----> apples
2 ----> ,
11 ----> and
81 ----> limes
5 ----> .
4 ----> <end>

Target Language; index to word mapping
1 ----> <start>
117 ----> мы
7 ----> не
155 ----> любим
59 ----> лимоны
4 ----> ,
65 ----> яблоки
8 ----> и
85 ----> лаймы
3 ----> .
2 ----> <end>


We also need to declare some of our important model hyperparameters, such as number of epochs to train over, dropout rates, learning rate, etc:

In [16]:
# Define the buffer size as the number of training/validation obs
BUFFER_SIZE = len(input_tensor_train)
VAL_BUFFER_SIZE = len(input_tensor_val)

# Setting our batch size
BATCH_SIZE = 32

# Number of epochs to train over
EPOCHS = 50

# Number of rounds with no improvement to stop after
early_stopping_rounds = 5

# How many steps do we need to take per epoch?
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
val_steps_per_epoch = len(input_tensor_val)//BATCH_SIZE

# The dimension of our word embeddings
embedding_dim = 256

# The number of RNN cells to include in the recurrent layer
units = 128

# The dropout rate of the recurrent cells to help generalize
dropout = 0.5

# Determine the clipping threshold for our gradients to ease training
gradient_clip = 1

# Define the learning rate of our optimizer
learning_rate = 0.001

# Setting vocab sizes
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

We can also use Tensorflow's Dataset class to make our training process easier!

In [17]:
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

dataset_val = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val)).shuffle(VAL_BUFFER_SIZE)
dataset_val = dataset_val.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
print(example_input_batch.shape, example_target_batch.shape)
example_input_batch, example_target_batch = next(iter(dataset_val))
print(example_input_batch.shape, example_target_batch.shape)

# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

(32, 19) (32, 20)
(32, 19) (32, 20)


## Defining our NMT Model

Now comes the fun stuff! We first need to define our encoder class, which takes in the input sequence and passes the thought vector and hidden state to the decoder.  We can do this easily using Keras's Model class.  Note that we are using a GRU layer hear instead of LSTM.  In my experience GRU performs nearly identically as long as you are using an attention mechanism, and is much faster to train!  We are only using a single layer here, but in larger scale projects you would use many more, sometimes even as a residual network.

In [18]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz, dropout):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units, 
                                       return_sequences=True, 
                                       return_state=True, 
                                       recurrent_initializer='glorot_uniform')
        self.drop = tf.keras.layers.Dropout(rate=dropout)
        
    def call(self, x, hidden, dropout=False):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)     
        if dropout:
            output = self.drop(output)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

Let's create our encoder object, using some of the hyperparameters defined above:

In [19]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE, dropout)

Now lets define our Attention class!  I won't get too deep into attention mechanisms, but again, this helps our model with longer sequences.

In [20]:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
  
    def call(self, query, values):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, hidden_size)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
    
        return context_vector, attention_weights
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [21]:
attention_layer = BahdanauAttention(10)

Now we need to declare the last part of our NMT model, which is the decoder.  Note again we are using a GRU layer instead of LSTM, and have dropout implemented to help make our model more generalizable:

In [22]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz, dropout):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units, 
                                       return_sequences=True, 
                                       return_state=True, 
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.drop = tf.keras.layers.Dropout(rate=dropout)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output, dropout=False):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # Applying dropout
        if dropout:
            output = self.drop(output)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [23]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE, dropout)

Time to declare our optimizer!  I opted for the ever-popular Adam optimizer, though there is much literature stating that SGD with momentum often compares comparably, even better sometimes!  I used Adam because I don't have the compute power with me right now to tune the momentum hyperparameter, so this works for our current needs.  We also define our loss function as sparse categorical cross-entropy, as opposed to the usual categorical cross-entropy, as our tokens are in integer format rather than one-hot encoding.

In [24]:
optimizer = tf.keras.optimizers.Adam(lr=learning_rate, clipvalue=gradient_clip)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    
    return tf.reduce_mean(loss_)
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

Its usually a good idea to save you model as you train, so we are going to set up a checkpoint directory:

In [25]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

## Training and Validation Functions

The last step we need to do before training is to define our training and validation tensorflow functions.  Note that for our training function, we use teacher forcing on the decoder, meaning that at each time-step it is fed the true value at the previous time-step to speed up training, while in the validation function we do not do this to get a better idea of model performance, as we cannot teacher force in the real world.

In [26]:
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
        
    with tf.GradientTape() as tape:
        
        enc_output, enc_hidden = encoder(inp, enc_hidden, dropout=True)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)       

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder

            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output, dropout=True)
        
            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)
    
    optimizer.apply_gradients(zip(gradients, variables))
  
    return batch_loss
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [27]:
@tf.function
def calc_val_error(inp, targ, enc_hidden):
    loss = 0
   
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden
        
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']]*BATCH_SIZE, 1) 
        
    for t in range(1, targ.shape[1]):
        predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
        
        loss += loss_function(targ[:, t], predictions)
        
        #dec_input = tf.expand_dims(targ[:, t], 1)
        
        dec_input = tf.expand_dims(tf.argmax(predictions,1),1)

    return loss/targ.shape[0]

## Putting it all together!
Finally, the time has come to train our NMT model!  The process is relatively straight forward, but note that I also implemented early-stopping here to prevent our model from overfitting as we are using a relatively small dataset.

In [28]:
rounds_not_improved = 0
prev_val_loss = 200
for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss, val_loss = 0, 0
    
    # Calculation loss and applying gradients on training batches
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        
    # Calculating validation error for this epoch to determine early stopping
    for (batch, (inp, targ)) in enumerate(dataset_val.take(val_steps_per_epoch)):
        batch_loss = calc_val_error(inp, targ, enc_hidden)
        val_loss += batch_loss
        
    # Creating our meaned losses
    total_loss = total_loss/steps_per_epoch
    val_loss = val_loss/val_steps_per_epoch
    
    if epoch+1 == 1 or (epoch+1)% 5==0:
        print('Epoch = {} | Train Loss = {:.4f} | Val Loss = {:.4f} | Train Time = {:.2f} sec\n'.format(epoch + 1,
                                                                                                        total_loss,
                                                                                                        val_loss,
                                                                                                        time.time() - start))
    
    # We need to test for early stopping rounds
    if val_loss>prev_val_loss:
        rounds_not_improved += 1
        if rounds_not_improved==5:
            print('Epoch = {} | Train Loss = {:.4f} | Val Loss = {:.4f} | Train Time = {:.2f} sec\n'.format(epoch + 1,
                                                                                                            total_loss,
                                                                                                            val_loss,
                                                                                                            time.time() - start))
            print('Early stopping limit reached!')
            break
    else:
        rounds_not_improved = 0
    prev_val_loss = val_loss
    
checkpoint.save(file_prefix = checkpoint_prefix)

# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

Epoch = 1 | Train Loss = 2.5532 | Val Loss = 2.0912 | Train Time = 49.01 sec

Epoch = 5 | Train Loss = 0.9926 | Val Loss = 2.2485 | Train Time = 12.75 sec

Epoch = 10 | Train Loss = 0.6854 | Val Loss = 2.0896 | Train Time = 12.78 sec

Epoch = 15 | Train Loss = 0.5625 | Val Loss = 1.8739 | Train Time = 12.87 sec

Epoch = 20 | Train Loss = 0.4601 | Val Loss = 1.6985 | Train Time = 12.64 sec

Epoch = 25 | Train Loss = 0.2926 | Val Loss = 1.4711 | Train Time = 12.62 sec

Epoch = 30 | Train Loss = 0.2220 | Val Loss = 1.4201 | Train Time = 12.64 sec

Epoch = 35 | Train Loss = 0.1628 | Val Loss = 1.3953 | Train Time = 12.83 sec

Epoch = 40 | Train Loss = 0.1122 | Val Loss = 1.3606 | Train Time = 12.76 sec

Epoch = 45 | Train Loss = 0.0851 | Val Loss = 1.3814 | Train Time = 12.86 sec

Epoch = 50 | Train Loss = 0.0834 | Val Loss = 1.4094 | Train Time = 12.98 sec



'./training_checkpoints\\ckpt-1'

## Using our Trained Model for Translation

Now that we have our trained NMT model, we can use it to translate sentences from english to russian!  Note that we also need to define an evaluatation function that is very similar to our validation function defined above, as well as a translation wrapper to use it:

In [36]:
def evaluate(sentence):
    
    sentence = preprocess_sentence(sentence)
    
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], 
                                                           maxlen=max_length_inp, 
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
    
    for t in range(max_length_targ):
        predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_out)

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence

# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

In [41]:
def translate(sentence):
    result, sentence = evaluate(sentence)
        
    print('Input: %s' % (sentence).encode('utf-8'))
    print('Predicted translation: {}\n'.format(result))
# Some code adapted from
# https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/sequences/nmt_with_attention.ipynb

To evaluate our model performance I took some english sentences that were not in either the training or validation set, so we can see how well our we are able to translate: 

In [46]:
sentences = ['california is usually beautiful during november , and it is never nice in april .',
             'the peach is your least liked fruit , but the apple is our least liked .',
             'he dislikes peaches , grapefruit , and lemons .',
             'india is usually freezing during autumn , and it is usually cold in february .',
             'he dislikes limes , apples , and grapefruit .',
             'we like apples and peaches .',
             'this rabbit was her favorite animal .',
             'france is sometimes snowy during july , and it is quiet in september .']

for i in sentences:
    translate(i)

Input: b'<start> california is usually beautiful during november , and it is never nice in april . <end>'
Predicted translation: калифорния обычно прекрасный в ноябре , и это не приятно в апреле . <end> 

Input: b'<start> the peach is your least liked fruit , but the apple is our least liked . <end>'
Predicted translation: персик ты не любил фрукты , но яблоко ты не любил фрукты , но яблоко ты не любил фрукты , 

Input: b'<start> he dislikes peaches , grapefruit , and lemons . <end>'
Predicted translation: он не любит персики , грейпфруты и лимоны . <end> 

Input: b'<start> india is usually freezing during autumn , and it is usually cold in february . <end>'
Predicted translation: индия обычно заморозки осенью , и это, как правило, холодно в феврале . <end> 

Input: b'<start> he dislikes limes , apples , and grapefruit . <end>'
Predicted translation: он не любит лимоны , яблоки и грейпфруты . <end> 

Input: b'<start> we like apples and peaches . <end>'
Predicted translation: мы как ябл

**Wow!** our model seems almost perfect even with such little data!  

However, its probably time to discuss how this project could be improved:

- Getting a larger and more diverse dataset
- Using pretrained word embeddings
- Using Stochastic Gradient Descent with Momentum

If you have any questions/concerns about this project, feel free to send me an email at gursky021197@gmail.com