# Neural Machine Translation

## Encoder-Decoder Model for Neural Machine Translation

* **Baseline Model:**
    * Embedding: 512-dimensions.
    * RNN Cell: Gated Recurrent Unit or GRU.
    * Encoder: Bidirectional.
    * Encoder Depth: 2-layers (1 layer in each direction).
    * Decoder Depth: 2-layers.
    * Attention: Bahdanau-style.
    * Optimizer: Adam.
    * Dropout: 20% on input.

* **Word Embedding Size:**
    * Start with a small embedding size, such as 128, and consider increasing it later for a minor lift in skill.

* **RNN Cell Type:**
    * Use LSTM RNN units in your model.

* **Encoder-Decoder Depth:**
    * Start with a **1-layer bidirectional encoder and extend to 2 bidirectional layers** for a small lift in skill.
    * Start with a **1-layer decoder and move to a 4-layer decoder** for better results.

* **Direction of Encoder Input:** 
    * Use a **reversed-order input sequence, or move to a bidirectional encoder,** for a small lift in model skill.

* **Attention Mechanism:**
    * Use attention, and prefer Bahdanau-style (weighted-average) attention.

* **Inference:**
    * Start with a greedy search (beam=1) and tune based on your problem.

* **Final Model:**
    * The following parameters may be taken as a good starting point when developing your own encoder-decoder model for an NLP application.

![image.png](attachment:image.png)
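
The inference advice above (start with greedy search, i.e. a beam width of 1, then tune) can be sketched with a small beam search over a fixed table of per-step word probabilities. This is a simplified illustration only: in a real decoder the probabilities at each step depend on the words already chosen, and the function name `beam_search_decoder` and the toy table are assumptions made for this example, not part of the notebook.

```python
from math import log

def beam_search_decoder(step_probs, beam_width):
    """Return the beam_width best (sequence, score) pairs.

    step_probs is a list of per-timestep probability lists (a toy stand-in
    for decoder output). The score is the summed log-probability.
    """
    beams = [([], 0.0)]
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for word_id, p in enumerate(probs):
                candidates.append((seq + [word_id], score + log(p)))
        # keep only the beam_width highest-scoring partial sequences
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# toy table: 3 timesteps over a vocabulary of 4 ids (all values hypothetical)
toy_probs = [
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
greedy = beam_search_decoder(toy_probs, beam_width=1)
wider = beam_search_decoder(toy_probs, beam_width=3)
print(greedy[0][0])
for seq, score in wider:
    print(seq, round(score, 3))
```

With `beam_width=1` only the single most likely word is kept at each step, which is exactly greedy search; larger widths keep several alternatives alive at the cost of more computation.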

# Develop a Neural Machine Translation Model

## Prepare Data

In [1]:
import string
import re
from pickle import load
from pickle import dump
from unicodedata import normalize
from numpy import array
from numpy.random import shuffle

In [2]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

In [3]:
# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

In [4]:
# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [re_punc.sub('', w) for w in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)
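
To see what the cleaning steps do to a single sentence, here is a minimal sketch that applies the same operations to one line at a time. The helper name `clean_line` and the raw input sentences are assumptions chosen for illustration; they are not taken from the dataset file.

```python
import re
import string
from unicodedata import normalize

def clean_line(line):
    """Apply the cleaning steps of clean_pairs to one sentence."""
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # NFD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the marks, and also any
    # character with no ASCII base letter (such as the German sharp s)
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    words = [word.lower() for word in line.split()]
    words = [re_punc.sub('', w) for w in words]
    words = [re_print.sub('', w) for w in words]
    # drop tokens that are empty or contain digits
    return ' '.join(w for w in words if w.isalpha())

print(clean_line('Hör auf, Zeit zu schinden!'))   # hor auf zeit zu schinden
print(clean_line('Ich hatte Spaß hier.'))         # ich hatte spa hier
print(clean_line("That's not Tom."))              # thats not tom
```

Note the side effects: umlauts lose their diaeresis, the sharp s disappears entirely, and apostrophes are removed, which is why forms such as `spa` and `thats` show up in the evaluation output later in this notebook.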

In [5]:
# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    # use a context manager so the file is closed after writing
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

In [6]:
# load dataset
filename = 'deu-eng/deu.txt'
doc = load_doc(filename)

In [7]:
# split into english-german pairs
pairs = to_pairs(doc)

In [8]:
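# clean sentences
# note: this rebinds the name clean_pairs from the function to its result,
# so the function must be redefined before this cell can be run again
clean_pairs = clean_pairs(pairs)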

In [10]:
#for i in range(100):
    #print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

## Split Dataset

In [12]:
# reduce dataset size
n_sentences = 10000
dataset = clean_pairs[:n_sentences, :]

In [13]:
# random shuffle
shuffle(dataset)

In [52]:
# split into train/test
train, test = dataset[:9000], dataset[9000:]
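
The shuffle above is unseeded, so the train/test split changes from run to run. A minimal sketch of a reproducible split follows, using only the standard library; the helper name `split_pairs`, the seed value, and the toy pairs are assumptions for illustration. With NumPy, calling `numpy.random.seed` before `shuffle` should serve the same purpose.

```python
import random

def split_pairs(pairs, n_train, seed=1):
    """Shuffle a copy of pairs reproducibly and split into train/test."""
    shuffled = list(pairs)
    # a dedicated generator with a fixed seed gives the same order every run
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# hypothetical stand-in for the cleaned sentence pairs
pairs = [('english %d' % i, 'deutsch %d' % i) for i in range(10)]
train_a, test_a = split_pairs(pairs, n_train=9)
train_b, test_b = split_pairs(pairs, n_train=9)
print(len(train_a), len(test_a))
print(train_a == train_b)
```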

In [53]:
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Saved: english-german-both.pkl
Saved: english-german-train.pkl
Saved: english-german-test.pkl


## Train Model

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.callbacks import ModelCheckpoint

In [20]:
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

In [21]:
# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)

In [22]:
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
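
What the tokenizer and padding steps produce can be illustrated without Keras. The sketch below approximates the behaviour relied on here: word ids start at 1 with the most frequent word first, 0 is reserved for padding, and short sequences are padded at the end. The helper names `fit_word_index` and `encode_and_pad` are assumptions for this example; unlike the Keras `Tokenizer`, the sketch does no lowercasing or punctuation filtering, because the data is already cleaned.

```python
from collections import Counter

def fit_word_index(lines):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(word_index, length, lines):
    """Integer-encode each line, then pad with trailing zeros up to length."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.split() if w in word_index]
        # over-long lines keep their last `length` ids (assumes length >= 1),
        # mirroring the Keras default truncating='pre'
        ids = ids[-length:]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

lines = ['tom came home', 'tom cheered', 'we forgot']
word_index = fit_word_index(lines)
padded = encode_and_pad(word_index, 4, lines)
print(word_index)
print(padded)
print('vocabulary size:', len(word_index) + 1)
```

The `+ 1` in the vocabulary size accounts for the padding id 0, which is why the notebook computes `len(tokenizer.word_index) + 1`.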

In [23]:
# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
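
One-hot encoding turns every target word id into a vector as long as the vocabulary, so the target array grows quickly. The sketch below shows the encoding on a toy sequence and estimates the size of the training targets for the sizes printed later in this notebook (9,000 sentences, length 5, vocabulary 2,256). The helper `one_hot` is an assumption for illustration, and the 4-byte figure assumes the values are stored as float32.

```python
def one_hot(sequence, vocab_size):
    """Turn a list of integer ids into a list of one-hot rows."""
    return [[1.0 if i == word_id else 0.0 for i in range(vocab_size)]
            for word_id in sequence]

rows = one_hot([1, 4, 0, 0], vocab_size=7)
print(rows[1])   # only position 4 is set

# size of the one-hot training targets, assuming 4-byte float32 values
n_train, eng_length, eng_vocab_size = 9000, 5, 2256
n_values = n_train * eng_length * eng_vocab_size
n_bytes = n_values * 4
print('%d values, about %.0f MB' % (n_values, n_bytes / 1e6))
```

If memory becomes a constraint on a larger dataset, one option worth considering is keeping the integer targets and training with Keras's `sparse_categorical_crossentropy` loss instead of one-hot targets.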

In [24]:
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    # compile model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # summarize defined model
    model.summary()
    #plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [25]:
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

English Vocabulary Size: 2256
English Max Length: 5


In [26]:
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

German Vocabulary Size: 3586
German Max Length: 9


In [27]:
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)

In [28]:
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

In [29]:
# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 9, 256)            918016    
_________________________________________________________________
lstm (LSTM)                  (None, 256)               525312    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 5, 256)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 5, 256)            525312    
_________________________________________________________________
time_distributed (TimeDistri (None, 5, 2256)           579792    
Total params: 2,548,432
Trainable params: 2,548,432
Non-trainable params: 0
_________________________________________________________________
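
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding, an LSTM layer with bias, and a dense layer with bias; the helper function names are assumptions for illustration.

```python
def embedding_params(vocab_size, n_units):
    # one vector of n_units weights per vocabulary entry
    return vocab_size * n_units

def lstm_params(input_dim, n_units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + n_units + 1) * n_units)

def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output
    return (input_dim + 1) * output_dim

ger_vocab_size, eng_vocab_size, n_units = 3586, 2256, 256
counts = [
    embedding_params(ger_vocab_size, n_units),   # embedding
    lstm_params(n_units, n_units),               # encoder LSTM
    lstm_params(n_units, n_units),               # decoder LSTM
    dense_params(n_units, eng_vocab_size),       # output layer
]
print(counts)        # [918016, 525312, 525312, 579792]
print(sum(counts))   # 2548432
```

`RepeatVector` has no weights, and `TimeDistributed` shares one dense layer across all output timesteps, so neither adds parameters beyond those listed.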


In [30]:
# fit model
checkpoint = ModelCheckpoint('neural_machine_translation_model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Epoch 1/30
141/141 - 26s - loss: 4.1486 - val_loss: 3.4042

Epoch 00001: val_loss improved from inf to 3.40416, saving model to neural_machine_translation_model.h5
Epoch 2/30
141/141 - 16s - loss: 3.2267 - val_loss: 3.2502

Epoch 00002: val_loss improved from 3.40416 to 3.25018, saving model to neural_machine_translation_model.h5
Epoch 3/30
141/141 - 16s - loss: 3.0809 - val_loss: 3.1502

Epoch 00003: val_loss improved from 3.25018 to 3.15016, saving model to neural_machine_translation_model.h5
Epoch 4/30
141/141 - 16s - loss: 2.9215 - val_loss: 3.0110

Epoch 00004: val_loss improved from 3.15016 to 3.01100, saving model to neural_machine_translation_model.h5
Epoch 5/30
141/141 - 15s - loss: 2.7617 - val_loss: 2.9080

Epoch 00005: val_loss improved from 3.01100 to 2.90798, saving model to neural_machine_translation_model.h5
Epoch 6/30
141/141 - 15s - loss: 2.6345 - val_loss: 2.8277

Epoch 00006: val_loss improved from 2.90798 to 2.82768, saving model to neural_machine_translation_model

<tensorflow.python.keras.callbacks.History at 0x1f4ce7cdee0>

## Evaluate Neural Translation Model

In [54]:
from numpy import argmax
from tensorflow.keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

In [55]:
# load a clean dataset
def load_clean_sentences(filename):
    # use a context manager so the file is closed after reading
    with open(filename, 'rb') as file:
        return load(file)

In [56]:
# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
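
`word_for_id` scans the whole vocabulary for every predicted word. A minimal sketch of an alternative is to invert the mapping once and then use plain dictionary lookups. The helper name `build_index_word` and the toy `word_index` are assumptions for illustration; the Keras `Tokenizer` may already expose an equivalent `index_word` attribute, depending on the installed version.

```python
def build_index_word(word_index):
    """Invert a word -> id mapping once so each lookup is a dict access."""
    return {index: word for word, index in word_index.items()}

# hypothetical word_index in the shape a fitted tokenizer exposes
word_index = {'tom': 1, 'came': 2, 'home': 3}
index_word = build_index_word(word_index)
print(index_word.get(2))   # came
print(index_word.get(0))   # None, because id 0 is padding and has no word
```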

In [57]:
# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            # id 0 is the padding value and maps to no word, so stop decoding
            break
        target.append(word)
    return ' '.join(target)

In [74]:
# evaluate the skill of the model
def evaluate_model(model, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, eng_tokenizer, source)
        raw_target, raw_src, t = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    # note: these weights sum to 0.9; exact uniform 3-gram weights would be 1/3 each
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
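
To make the BLEU-1 number less opaque, here is a minimal single-sentence sketch: clipped unigram precision multiplied by the brevity penalty. The function `sentence_bleu1` is an assumption written for illustration and is not NLTK's implementation; `corpus_bleu` pools the match counts over all sentences before dividing, so its result is not the average of per-sentence scores.

```python
from collections import Counter
from math import exp

def sentence_bleu1(reference, hypothesis):
    """Clipped unigram precision times the brevity penalty, for one sentence."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    # each hypothesis word is credited at most as often as it occurs in the reference
    matches = sum(min(count, ref_counts[word])
                  for word, count in Counter(hyp).items())
    precision = matches / len(hyp)
    # penalize hypotheses that are shorter than the reference
    brevity = 1.0 if len(hyp) > len(ref) else exp(1 - len(ref) / len(hyp))
    return brevity * precision

print(sentence_bleu1('tom cheered', 'tom cheered'))     # 1.0
print(sentence_bleu1('tom drives', 'tom is'))           # 0.5
print(sentence_bleu1('toms deaf', 'tom is'))            # 0.0
print(sentence_bleu1('who is she', 'whos is'))          # shorter than reference, so penalized
```

In `evaluate_model`, each reference is wrapped in an extra list (`[raw_target.split()]`) because `corpus_bleu` accepts several reference translations per sentence.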

In [75]:
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

In [76]:
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])

In [77]:
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])

In [78]:
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])

In [79]:
# load model
model = load_model('neural_machine_translation_model.h5')

In [80]:
# test on some training sequences
print('train')
evaluate_model(model, trainX, train)

train
src=[das ist nicht tom], target=[thats not tom], predicted=[its isnt tom]
src=[tom jubelte], target=[tom cheered], predicted=[tom cheered]
src=[es ist mein ernst], target=[im serious], predicted=[its serious]
src=[zeig ihn ihm], target=[show it to him], predicted=[show it to him]
src=[lassen sie uns heraus], target=[let us out], predicted=[let us out]
src=[tom kam nach hause], target=[tom came home], predicted=[tom came home]
src=[ich rieche benzin], target=[i smell gas], predicted=[i smell gas]
src=[ist das ein nein], target=[is that a no], predicted=[is that a no]
src=[wir haben es vergessen], target=[we forgot], predicted=[we forgot]
src=[tom fahrt], target=[tom drives], predicted=[tom is]
BLEU-1: 0.831789
BLEU-2: 0.764197
BLEU-3: 0.650997
BLEU-4: 0.345318


In [81]:
# test on some test sequences
print('test')
evaluate_model(model, testX, test)

test
src=[wer ist sie], target=[who is she], predicted=[whos is]
src=[hor auf zeit zu schinden], target=[quit stalling], predicted=[stop at]
src=[sie hat ihm geholfen], target=[she helped him], predicted=[she called him]
src=[ich hatte spa hier], target=[i had fun here], predicted=[i was gas]
src=[wunsch dir was], target=[make a wish], predicted=[they yourself]
src=[mir tut der nacken weh], target=[my neck hurts], predicted=[my hip hurts]
src=[wir haben mittag gegessen], target=[we had lunch], predicted=[we want music]
src=[ich will eins], target=[i want one], predicted=[i want one]
src=[tom ist taub], target=[toms deaf], predicted=[tom is]
src=[lassen sie tom frei], target=[set tom free], predicted=[release tom]
BLEU-1: 0.525618
BLEU-2: 0.384386
BLEU-3: 0.290757
BLEU-4: 0.121809
