## Machine Translation
Sequence-to-sequence models and transformers have greatly improved the quality of machine translation. In this project, I walk through the steps needed to train a machine translation model using a sequence-to-sequence model and a transformer.

In [2]:
# Import libraries
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Dataset: English-French Sentence Pairs
The dataset used in this project comes from Tatoeba (https://tatoeba.org/en), a website where people can upload sentences in any language and contribute their own translations into other languages. The English-French dataset used here was downloaded from https://www.manythings.org/anki/, which preprocesses the Tatoeba data into a text file of English-French sentence pairs.

## Prepare data: Load file
The sentence pairs are saved as a .txt file at the location given by dataPath. The following function reads the file and turns each line into a pair of sentences.

In [3]:
def readData(lang1, lang2, dataPath, reverse = False):
    print("Reading dataset...")
    
    # Open the file and split it into lines (\n)
    with open(dataPath, encoding = 'utf-8') as f:
        lines = f.read().strip().split('\n')
    
    # Split each line into fields (separated by tab, or \t), keep the first
    # two (source and target sentence), and clean each sentence
    pairs = [[cleanString(s) for s in l.split('\t')[:2]] for l in lines]
    
    # Reverse the pairs if specified
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        source_lang = LangDict(lang2)
        target_lang = LangDict(lang1)
    else:
        source_lang = LangDict(lang1)
        target_lang = LangDict(lang2)
    
    return source_lang, target_lang, pairs
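To make the file format concrete, here is a minimal sketch of the line-splitting step on its own. It assumes each line has the form `English<TAB>French`; some downloads may carry an extra tab-separated attribution column, so the sketch keeps only the first two fields. The function name `read_pairs` and the sample sentences are hypothetical and used for illustration only.

```python
import os
import tempfile

def read_pairs(path, reverse=False):
    """Read tab-separated sentence pairs (hypothetical stand-in, no cleaning)."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().strip().split("\n")
    # Keep only the first two tab-separated fields of each line
    pairs = [line.split("\t")[:2] for line in lines]
    if reverse:
        # Swap source and target, e.g. to translate French -> English
        pairs = [list(reversed(p)) for p in pairs]
    return pairs

# Write a tiny made-up sample file so the sketch runs on its own
sample = "Go.\tVa !\nI see.\tJe comprends.\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

pairs_en_fr = read_pairs(sample_path)
pairs_fr_en = read_pairs(sample_path, reverse=True)
os.remove(sample_path)

print(pairs_en_fr)  # [['Go.', 'Va !'], ['I see.', 'Je comprends.']]
print(pairs_fr_en)  # [['Va !', 'Go.'], ['Je comprends.', 'I see.']]
```

The `reverse` flag only swaps the order inside each pair, so the same file can serve both translation directions.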

## Preprocessing: Create Token-Index Dictionary
When building neural network models, it is common practice to map each word to a number, because models can only "understand" numbers. The following LangDict class walks through the whole dataset and builds a word-to-index dictionary, an index-to-word dictionary, and a word-count dictionary.

In [2]:
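# Reserved indices for the start-of-sentence and end-of-sentence tokens
SOS_token = 0
EOS_token = 1

class LangDict:
    def __init__(self, name):
        self.name = name
        self.word2index = {}   # word -> index
        self.word2count = {}   # word -> number of occurrences
        self.index2word = {0: "SOS", 1: "EOS"}   # index -> word
        self.n_words = 2  # SOS + EOS = 2
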
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
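As a rough illustration of what these mappings look like after a couple of sentences, here is a minimal sketch using plain dictionaries that mirror `word2index`, `word2count`, and `index2word`. The helper names `build_vocab` and `sentence_to_indexes` and the sample sentences are hypothetical and not part of the class above.

```python
SOS_token = 0
EOS_token = 1

def build_vocab(sentences):
    """Build word/index/count mappings from whitespace-tokenized sentences."""
    word2index, word2count = {}, {}
    index2word = {SOS_token: "SOS", EOS_token: "EOS"}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word2index:
                idx = len(index2word)  # next free index after SOS/EOS
                word2index[word] = idx
                index2word[idx] = word
                word2count[word] = 1
            else:
                word2count[word] += 1
    return word2index, word2count, index2word

def sentence_to_indexes(word2index, sentence):
    """Map a sentence to a list of indices, terminated by EOS."""
    return [word2index[w] for w in sentence.split(" ")] + [EOS_token]

word2index, word2count, index2word = build_vocab(["i am cold .", "i am here ."])
print(word2index)  # {'i': 2, 'am': 3, 'cold': 4, '.': 5, 'here': 6}
print(word2count)  # {'i': 2, 'am': 2, 'cold': 1, '.': 2, 'here': 1}
print(sentence_to_indexes(word2index, "i am here ."))  # [2, 3, 6, 5, 1]
```

Indices 0 and 1 stay reserved for the SOS and EOS tokens, so the first real word receives index 2.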
        

In [3]:
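# Helper functions

def unicodeToAscii(s):
    # Decompose accented characters (NFD) and drop the combining marks ('Mn')
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def cleanString(s):
    # Lowercase, trim, and strip accents
    s = unicodeToAscii(s.lower().strip())
    # Put a space before ., ! and ? so that they become separate tokens
    s = re.sub(r"([.!?])", r" \1", s)
    # Replace every run of other non-letter characters with a single space
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    # Trim so that splitting on ' ' does not produce empty tokens
    return s.strip()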
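To see what each cleaning step does, here is a minimal sketch that applies the same operations one at a time to a single sentence. The sample sentence and the variable names `step1` to `step4` are made up for illustration.

```python
import re
import unicodedata

raw = "Où est-ce que tu vas ?"

# Step 1: lowercase and strip surrounding whitespace
step1 = raw.lower().strip()

# Step 2: NFD splits an accented letter into base letter + combining mark;
# dropping category 'Mn' (nonspacing marks) keeps only the base letter
step2 = "".join(
    c for c in unicodedata.normalize("NFD", step1)
    if unicodedata.category(c) != "Mn"
)

# Step 3: put a space before ., ! and ? so they become separate tokens
step3 = re.sub(r"([.!?])", r" \1", step2)

# Step 4: replace every run of other non-letter characters with one space
step4 = re.sub(r"[^a-zA-Z.!?]+", r" ", step3)

print(step2)  # ou est-ce que tu vas ?
print(step4)  # ou est ce que tu vas ?
```

The hyphen in "est-ce" disappears in the last step, so the sentence splits into the tokens `ou`, `est`, `ce`, `que`, `tu`, `vas`, and `?`.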

## References

1. Sean Robertson. [NLP From Scratch: Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html). PyTorch tutorial.
2. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR, abs/1406.1078, 2014.
