## Machine Translation
Sequence-to-sequence models and transformers have greatly improved the quality of machine translation. In this project, I walk through the steps needed to train a machine translation model using a sequence-to-sequence model and a transformer.

In [1]:
# import libraries
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Dataset: English-French Sentence Pair
The dataset used in this project comes from Tatoeba (https://tatoeba.org/en), a website where people can upload sentences in any language and contribute their own translations into other languages. The English-French dataset used here was downloaded from https://www.manythings.org/anki/, which preprocesses the Tatoeba data into a text file of English-French sentence pairs.

## Prepare data: Load file
The sentence pairs are saved at `dataPath` as a .txt file, one pair per line, with the two sentences separated by a tab. The following function reads the file and splits it into pairs.

In [2]:
def readData(lang1, lang2, dataPath, reverse = False):
    print("Reading dataset...")
    
    # open the file (closing it afterwards) and split by lines (\n)
    with open(dataPath, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # split lines into pairs (separated by tab, or \t) and normalize
    pairs = [[cleanString(s) for s in l.split('\t')] for l in lines]

    # Reverse if specified
    
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        source_lang = LangDict(lang2)
        target_lang = LangDict(lang1)
    else:
        source_lang = LangDict(lang1)
        target_lang = LangDict(lang2)
    
    return source_lang, target_lang, pairs
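
To make the file layout concrete, here is a minimal sketch of the split step on an in-memory sample. The two sample lines and the helper name `split_pairs` are illustrative assumptions, not part of the dataset or of `readData`; normalization is left out to keep the example short.

```python
# Hypothetical two-line sample in the "english<TAB>french" layout.
sample = "Go.\tVa !\nI am cold.\tJ'ai froid.\n"

def split_pairs(text, reverse=False):
    """Split raw text into [source, target] pairs, one pair per line."""
    lines = text.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    if reverse:
        # Swap the columns so the second language becomes the source.
        pairs = [list(reversed(p)) for p in pairs]
    return pairs

print(split_pairs(sample))
print(split_pairs(sample, reverse=True))
```

With `reverse=True` the same file can be used to train a model in the opposite direction (here, French to English) without touching the data on disk.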

## Preprocessing: Create Token-Index Dictionary 
When building neural network models, it is common practice to map each word to a number, because models can only "understand" numbers. The following LangDict class walks through the whole dataset and builds a word-to-index dictionary, an index-to-word dictionary, and a word-count dictionary.

In [3]:
SOS_token = 0
EOS_token = 1

class LangDict:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0:"SOS", 1:"EOS"}
        self.n_words = 2 # SOS + EOS = 2

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
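
Once the vocabulary is built, a sentence can be turned into the list of indices that a model consumes. The sketch below assumes a small hand-built `word2index` dictionary in the same layout as `LangDict.word2index` and a hypothetical helper `indexesFromSentence`. Appending `EOS_token` at the end is a common convention and an assumption here, not something this notebook has defined so far.

```python
SOS_token = 0
EOS_token = 1

# Toy vocabulary in the same layout as LangDict.word2index
# (indices 0 and 1 are reserved for SOS and EOS).
word2index = {"i": 2, "cannot": 3, "excuse": 4, "her": 5, ".": 6}

def indexesFromSentence(word2index, sentence):
    """Map each space-separated token to its index and append the EOS marker."""
    return [word2index[word] for word in sentence.split(' ')] + [EOS_token]

indexes = indexesFromSentence(word2index, "i cannot excuse her .")
print(indexes)  # [2, 3, 4, 5, 6, 1]
```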
        

The following functions are used for preprocessing the data (e.g. converting text to lowercase, stripping accents, and removing punctuation other than `.`, `!`, and `?`).

In [4]:
#Helper functions

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def cleanString(s):
    # lowercase, strip accents, split off . ! ? as separate tokens,
    # and replace every other non-letter character with a space
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
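
To see what each normalization step does, the following sketch traces one sentence through the same operations, one at a time. The sample sentence is an illustrative assumption, not taken from the dataset.

```python
import re
import unicodedata

raw = "J'ai très froid."

# Step 1: lowercase and trim surrounding whitespace.
s = raw.lower().strip()
# Step 2: decompose accented letters (NFD) and drop the combining marks (category 'Mn').
s = ''.join(c for c in unicodedata.normalize('NFD', s)
            if unicodedata.category(c) != 'Mn')
# Step 3: put a space in front of . ! ? so each becomes its own token.
s = re.sub(r"([.!?])", r" \1", s)
# Step 4: collapse every run of other characters (spaces, apostrophes, digits) into one space.
s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)

print(s)  # j ai tres froid .
```

Note that the apostrophe is replaced by a space, which is why contractions such as `l'excuser` appear as two tokens (`l excuser`) in the cleaned pairs.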

In [5]:
Max_length = 10

# English sentence prefixes; defined here but not used by filterPair below
exn_prefix = ("i am", "i m",
              "he is", "he s",
              "she is", "she s",
              "you are", "you re",
              "we are", "we re",
              "they are", "they re"              
             )

def filterPair(p):
    # keep a pair only if both sentences have fewer than Max_length tokens
    p1_tok = p[0].split(' ')
    p2_tok = p[1].split(' ')
    return len(p1_tok) < Max_length and len(p2_tok) < Max_length

def BuildfilterdPairs(pairs):
    # build a new list containing only the pairs that pass the length filter
    return [[pair[0], pair[1]] for pair in pairs if filterPair(pair)]

In [6]:
def prepareData(lang1, lang2, reverse = False):
    dataPath = r'./data/eng-fra_small.txt'
    input_lang, output_lang, pairs = readData(lang1, lang2, dataPath, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = BuildfilterdPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', False)

print("Below is an example of sentence pair:")
print(random.choice(pairs))        

Reading dataset...
Read 46301 sentence pairs
Trimmed to 45974 sentence pairs
Counting words...
Counted words:
eng 5683
fra 9741
Below is an example of sentence pair:
['i cannot excuse her .', 'je ne peux pas l excuser .']


## Encoder Setup
Here we define the first sequence model, the encoder, which encodes the sentence in the source language. The encoder outputs a fixed-length representation, which serves as the input to the second sequence model, the decoder, which predicts words in the target language. The type of recurrent layer used in the encoder is a design choice. Here, I use a GRU (gated recurrent unit) for demonstration purposes.

In [7]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout_rate):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # nn.Embedding takes no dropout argument, so dropout is a separate layer
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers)

    def forward(self, input, hidden):
        # embed a single token and reshape to (seq_len=1, batch=1, hidden_size)
        embedded = self.dropout(self.embedding(input)).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        # initial hidden state of shape (num_layers, batch=1, hidden_size)
        return torch.zeros(self.num_layers, 1, self.hidden_size, device=device)
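
The sketch below shows how an encoder of this kind might be run over a sentence one token at a time, and what shapes come out. It uses a bare `nn.Embedding` and a single-layer `nn.GRU` rather than the class above, and the vocabulary size, hidden size, and token indices are illustrative assumptions, not the notebook's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_size = 10, 8  # illustrative sizes (assumptions)

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

# A toy "sentence" of four token indices, ending with EOS (index 1).
sentence = torch.tensor([2, 3, 4, 1])

hidden = torch.zeros(1, 1, hidden_size)  # (num_layers, batch, hidden_size)
outputs = []
for token in sentence:
    # Embed one token and reshape to (seq_len=1, batch=1, hidden_size).
    embedded = embedding(token).view(1, 1, -1)
    output, hidden = gru(embedded, hidden)
    outputs.append(output[0, 0])

encoder_outputs = torch.stack(outputs)
print(encoder_outputs.shape)  # torch.Size([4, 8])
print(hidden.shape)           # torch.Size([1, 1, 8])
```

The final `hidden` tensor has the same size regardless of sentence length, which is what makes it usable as a fixed-length summary for a decoder. The per-token `encoder_outputs` are what an attention mechanism would typically look at.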

## Decoder Setup

## References

1. [NLP From Scratch: Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html), PyTorch tutorial by Sean Robertson.
2. [Aladdin Persson's GitHub](https://github.com/aladdinpersson)
3. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR, abs/1406.1078, 2014.
