## Machine Translation Using SeqToSeq Model
The application of sequence-to-sequence model and transformers have greatly enhanced the results of machine translation. In this project, I will walk through the steps needed to train a machine translation model using sequence-to-sequence model and tranformer.

In [1]:
#imprt libraries
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import math
import re
import random
import string
import time

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Dataset: English-French Sentence Pair
The data set used in this project comes from Tatoeba (https://tatoeba.org/en), a website where people can upload sentences in any languages and contribute thier own versions of translations in other languages. The English-French dataset used here was downloaded from https://www.manythings.org/anki/, where they preprocessed the Tatoeba dataset so that it became a text file of English Frence sentence pair. 

## Prepare data: Load file
The sentence pair is saved in datapath as a .txt file. The following function ingests the files and make them into pairs.

In [2]:
def readData(lang1, lang2, dataPath, reverse = False):
    print("Reading dataset...")
    
    #open the file and split by lines (\n)
    lines = open(dataPath, encoding = 'utf-8').read().strip().split('\n')
    
    #split lines into pairs (separated by tab, or \t) and normalize
    
    pairs = [[cleanString(s) for s in l.split('\t')] for l in lines]
    
    # Reverse if spesified
    
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        source_lang = LangDict(lang2)
        target_lang = LangDict(lang1)
    else:
        source_lang = LangDict(lang1)
        target_lang = LangDict(lang2)
    
    return source_lang, target_lang, pairs

## Preprocessing: Create Token-Index Dictionary 
When building neural netword models, it is common practice to map words to a number because models can only "understand" numbers". The following LangDict class is used to walk through the wholde dataset and create word-number mapping dictionary and word count dictionary.

In [3]:
SOS_token = 0
EOS_token =1

class LangDict:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0:"SOS", 1:"EOS"}
        self.n_words = 2 # SOS + COS = 2

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)
            
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
        

The following functions are used for preprocessing data (e.g. convert text to lowercase, remove puctuations).

In [4]:
#Helper functions

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def cleanString(s):
    #transform letters to lower case ones and remove non-letter symbols
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

In [5]:
Max_length = 10

exn_prefix = ("i am", "i m",
              "he is", "he s",
              "she is", "she s",
              "you are", "you re",
              "we are", "we re",
              "they are", "they re"              
             )

def filterPair(p):
    p1_tok = p[0].split(' ')
    p2_tok = p[1].split(' ')
    
    if len(p1_tok) < Max_length and len(p2_tok) < Max_length:
        return True
    else:
        return False

def BuildfilterdPairs(pairs):
    pairList = list()
    for pair in pairs:
        if filterPair(pair)==True:
            pairList.append([pair[0], pair[1]])
    return pairList

In [6]:
def prepareData(lang1, lang2, dataPath, reverse = False):
    input_lang, output_lang, pairs = readData(lang1, lang2, dataPath, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = BuildfilterdPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', r'./data/eng-fra_train.txt', False)

print("Below is an example of sentence pair:")
print(random.choice(pairs))        

Reading dataset...
Read 153873 sentence pairs
Trimmed to 109310 sentence pairs
Counting words...
Counted words:
eng 11027
fra 17859
Below is an example of sentence pair:
['what do you want for breakfast ?', 'que veux tu pour le petit dejeuner ?']


## Encoder Setup
Here we define the structure of the first sequence model, the encoder, which encodes the information of the sentence in the source language. The encoder outputs a fixed length embedding, which will serve as the input of the second sequence model, the decoder, to output prediction of word in target language. The type of layer in the sequence model family used for the encoder is to be choosed. Here, I put the GRU (gated recurrent unit) for demonstration purposes.

In [7]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device = device)

## Decoder Setup
The decoder is also a RNN model, which takes the hidden layer of the encoder as input to generate sentence word by work in the target language. The sentence in source language is represented with a set of numbers. This is helpful because often translation cannot be done in word by word fasion.

In [8]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim =1)

    
    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device = device)

In [9]:
## Define helper functions
def indexFromSentence(lang, sentence):
    return [lang.word2index[word]  for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype = torch.long, device = device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

## Model Training

In [10]:
teacherForcing_r = 0.5

#one training iteration
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, 
         decoder_optimizer, criterion, max_length = Max_length):
    
    #initialize encoder hidden layer weights
    encoder_hidden = encoder.initHidden()
    
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device = device)
    
    loss = 0
    
    #encoder
    
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]
    
    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden
    
    teacherForcing = True if random.random() < teacherForcing_r else False
    #decoder    
    if teacherForcing:
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]
            
            
    else:
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()
        
            loss += criterion(decoder_output, target_tensor[di])
        
            if decoder_input.item() == EOS_token:
                break;
    
    loss.backward()
    
    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item() / target_length

In [11]:
#helper function: timers

def asMinutes(s):
    m = math.floor(s/ 60)
    s -= m * 60
    return '%dm %ds' % (m,s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
    

In [12]:
#iterate over training process
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every = 100, learning_rate = 0.01):
    
    start = time.time()
    plot_losses = []
    print_loss_total = 0
    plot_loss_total = 0
    
    encoder_optimizer = optim.SGD(encoder.parameters(), lr = learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr = learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                     for i in range(n_iters)]

    criterion = nn.NLLLoss()
    
    for iter in range(1, n_iters + 1):
        
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        
        loss = train(input_tensor, target_tensor, encoder,
                    decoder, encoder_optimizer, decoder_optimizer, criterion)
        
        print_loss_total += loss
        plot_loss_total += loss
        
        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter/ n_iters), 
                                         iter, iter/ n_iters  *  100, print_loss_avg))


In [13]:
#plot resutls
import matplotlib.pyplot as plt
plt.switch_backend('agg')

import matplotlib.ticker as ticker
import numpy as np

def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    #ticks at regular inter
    loc = ticker.MultipleLocator(base= 0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

## Evaluate the model

In [14]:
def evaluate(encoder, decoder, sentence, max_length=Max_length):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device = device)
        for ei in range(input_length):

            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] = encoder_output[0, 0]
            
            
        decoder_input = torch.tensor([[SOS_token]], device = device)#Start of sentens(SOS)
        
        decoder_hidden  = encoder_hidden
        
        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)
        
        for di in range(max_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            topv, topi, = decoder_output.data.topk(1)
            
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break;
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            
            decoder_input = topi.squeeze().detach()
        
        return decoded_words, decoder_attentions[:di + 1]
            

In [15]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder1 = DecoderRNN(hidden_size, output_lang.n_words).to(device)

trainIters(encoder1, decoder1, 60000, print_every=5000)

14m 5s (- 155m 1s) (5000 8%) 4.7450
27m 44s (- 138m 43s) (10000 16%) 4.1841
41m 21s (- 124m 5s) (15000 25%) 3.8686
55m 10s (- 110m 20s) (20000 33%) 3.6714
69m 35s (- 97m 26s) (25000 41%) 3.5068
83m 41s (- 83m 41s) (30000 50%) 3.3660
97m 41s (- 69m 46s) (35000 58%) 3.2737
111m 38s (- 55m 49s) (40000 66%) 3.1905
125m 41s (- 41m 53s) (45000 75%) 3.0839
140m 2s (- 28m 0s) (50000 83%) 2.9935
154m 49s (- 14m 4s) (55000 91%) 2.9505
169m 53s (- 0m 0s) (60000 100%) 2.8866


In [16]:
def evaluateRandomly(encoder, decoder, n = 10):
    for i in range(n):
        pair = random.choice(pairs)
        print('Sentence from source language: ', pair[0])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('Model generated sentence: ', output_sentence)
        print('Sentence from target language: ', pair[1])
        print('')

## Examples of tranlation produced by current model

In [17]:
evaluateRandomly(encoder1, decoder1)

Sentence from source language:  we re not family .
Model generated sentence:  nous ne sommes pas . <EOS>
Sentence from target language:  nous ne sommes pas de la meme famille .

Sentence from source language:  tom is heavily armed .
Model generated sentence:  tom est a . . <EOS>
Sentence from target language:  tom est lourdement arme .

Sentence from source language:  thanks for your hard work .
Model generated sentence:  merci pour ton travail . <EOS>
Sentence from target language:  merci d avoir travaille si durement .

Sentence from source language:  i want to be alone for a while .
Model generated sentence:  je veux juste un moment pour un moment . <EOS>
Sentence from target language:  je veux etre seule un moment .

Sentence from source language:  tom has to go even if it rains .
Model generated sentence:  tom a a y a y y . <EOS>
Sentence from target language:  tom doit y aller meme s il pleut .

Sentence from source language:  how s your family ?
Model generated sentence:  commen

## Evaluate the model using BLEU Score
One common measure of machine translation performance is the BLEU (Bilingual Evaluation Understudy) score. The BLEU score measures if all the N-grams in the ground truth translation are covered by the generated translation. Consider the following ground truth sentence pair:

(Eng) How are you doing - (French) Comment ca va

Based on the English sentence, the model generates the following sentence:

Tout va bien

If we measure the number 1 gram in the ground truth appears in the generated sentence, there is only va, which leads to the BLEU score of 1 / 3 ~ 0.33.

If we measure the number 2-grams in the ground truth that appear in the generated sentence, there is zero (Comment ca & ca va), which leads to the BLEU score of 0 / 2 = 0.

From this example, we can see the problem of BLEU score. The number of overlapping n-grams is not a perfect measure for translation because 'comment ca va' and ' and 'tout va bien' can both be correct translations for 'how are you doing'. However, the BLEU score is still a widely used measure for machine learning tasks using large amounts of data.

Below is how to use a pre-built bleu score module from NLTK to calculate the model performance using the test set.


In [18]:
input_lang_test, output_lang_test, pairs_test = prepareData('eng', 'fra', r'./data/eng-fra_test.txt', False)

Reading dataset...
Read 19236 sentence pairs
Trimmed to 13660 sentence pairs
Counting words...
Counted words:
eng 4970
fra 7125


In [50]:
#Calculate BLEU score (1-gram)
from nltk import translate

TotalBleuScore = 0
BleuScoreNum = 0
NoBleuScoreNum = 0


for i in range(1, len(pairs_test)):
    
    pair = pairs_test[i]
    try:
        output_words, attentions = evaluate(encoder1, decoder1, pair[0])
        predicted_sentence = ' '.join(output_words[:-1])
        groundTruth = pair[1]
        
        TotalBleuScore += translate.bleu_score.sentence_bleu(groundTruth, predicted_sentence, weights =(1, 0, 0, 0))
        BleuScoreNum += 1
    except:
        NoBleuScoreNum += 1
        

TotalBleuScore /= (len(pairs_test)-1)

print("Average BLEU score for unseen test dataset is:")
print(TotalBleuScore)

print("Number of pairs with no BLEU Score:")
print(NoBleuScoreNum)

Average BLEU score for unseen test dataset is:
0.43858213061744744
Number of pairs with no BLEU Score:
435


## References

1. [NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html), PyTorch tutorial by Sean  Robertson.
2. [Aladdin Persson's Github](https://github.com/aladdinpersson)
3. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
4. K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL.