# Automatic translation with seq2seq + attention

This tutorial is available at https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py

We use a sequence-to-sequence model that combines two recurrent neural networks: the encoder maps the input sequence to a vector, and the decoder maps that vector to a sentence in the target language. In addition, a so-called attention mechanism lets the model learn to focus on a specific range of the input sequence.

## The data
The data can be downloaded at: https://download.pytorch.org/tutorial/data.zip

It comes from https://tatoeba.org/; in particular, the file `data/eng-fra.txt` contains more than a hundred thousand pairs of sentences (English-French), one pair per line, with the two sentences separated by a tab.

First, we need to represent each word in both languages as a one-hot vector, which requires a unique index for every word. To do so, we define a helper class, `Lang`, that maps each word of a language to a unique index (and back) and keeps track of how many times each word occurs in the data.

In [1]:
# packages
#from io import open
import unicodedata
import string
import re
import random

In [2]:
# helper class Lang
SOS_token = 0 # manually set start-of-sentence and end-of-sentence tokens
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # count SOS and EOS

    def addWord(self, word): # function to add a word to the dictionaries
        if word not in self.word2index:
            self.word2index[word] = self.n_words # index is progressive with the order in which words are found
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1 # if the word was already in the dict, simply increase counter
        
    def addSentence(self, sentence): # function to add a sentence to the dictionary
        for word in sentence.split(' '):
            self.addWord(word)

For example:

In [17]:
# create Lang object
test = Lang("test")

# add sentence
test.addSentence("colorless green ideas")

In [18]:
# inspect dictionary
test.word2index

{'colorless': 2, 'green': 3, 'ideas': 4}

In [19]:
# add another sentence
test.addSentence("the quick brown fox")

# inspect dictionary
test.word2index

{'brown': 7,
 'colorless': 2,
 'fox': 8,
 'green': 3,
 'ideas': 4,
 'quick': 6,
 'the': 5}

In [20]:
# add another sentence
test.addSentence("the ideas of a green fox")

# inspect dictionary
test.word2index

{'a': 10,
 'brown': 7,
 'colorless': 2,
 'fox': 8,
 'green': 3,
 'ideas': 4,
 'of': 9,
 'quick': 6,
 'the': 5}

In [21]:
# counts
test.word2count

{'a': 1,
 'brown': 1,
 'colorless': 1,
 'fox': 2,
 'green': 2,
 'ideas': 2,
 'of': 1,
 'quick': 1,
 'the': 2}
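
To illustrate how these indices relate to the one-hot vectors mentioned earlier, here is a minimal sketch in plain Python. The helper `one_hot` and the small hand-written `word2index` dictionary are assumptions made for this example and are not part of the tutorial's code; indices 0 and 1 are left for SOS and EOS, as in `Lang`.

```python
# Hypothetical helper, for illustration only: a one-hot vector as a plain list.
def one_hot(index, size):
    vec = [0] * size   # all zeros...
    vec[index] = 1     # ...except at the word's index
    return vec

# hand-written dictionary mimicking Lang.word2index after the first sentence
word2index = {'colorless': 2, 'green': 3, 'ideas': 4}
n_words = 5  # SOS, EOS and the three words above

green_vec = one_hot(word2index['green'], n_words)
print(green_vec)  # [0, 0, 0, 1, 0]
```

The vector length equals the vocabulary size (`n_words`), so real vocabularies give long, sparse vectors; in practice one usually passes the index itself to the model rather than materializing the vector.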

Next, we simplify the data a bit by converting it from Unicode to ASCII, lowercasing it, and stripping (most) punctuation.

In [22]:
# turn a Unicode string to plain ASCII (see https://stackoverflow.com/a/518232/2809427)
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# normalization
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip()) # to ASCII and lowercased
    s = re.sub(r"([.!?])", r" \1", s) # add space before <.>, <!> and <?>
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s) # replace each run of characters other than letters, <.>, <!> and <?> with a single space
    return s

For example:

In [24]:
normalizeString("HeLlO! 3 people?")

'hello ! people ?'
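
To see what each normalization step does to accented text, here is a step-by-step sketch on a French sentence. The function `strip_accents` repeats the logic of `unicodeToAscii` so that the snippet runs on its own; the names `strip_accents`, `raw`, and `step1` to `step3` are assumptions for this illustration.

```python
import re
import unicodedata

# same logic as unicodeToAscii, repeated so the snippet is self-contained
def strip_accents(s):
    decomposed = unicodedata.normalize('NFD', s)  # 'ê' becomes 'e' + a combining mark
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

raw = "Arrête-toi !"
step1 = strip_accents(raw.lower().strip())     # lowercase, trim, drop accents
step2 = re.sub(r"([.!?])", r" \1", step1)      # put a space before . ! ?
step3 = re.sub(r"[^a-zA-Z.!?]+", r" ", step2)  # collapse everything else to one space
print(step3)  # arrete toi !
```

Note that the hyphen is turned into a space, which is why apostrophes and hyphens show up as word boundaries in the normalized data.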

Next, we read the data: we split the input file into lines, then each line into a pair of sentences.

In [25]:
def readLangs(lang1, lang2, reverse=False): # add a reverse option, see below
    print("Reading lines...")

    # read the file and split into lines (the with-block closes the file afterwards)
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # reverse pairs (if needed) then build Lang instances (we'll populate them later)
    if reverse: # the data is eng --> other_lang; if we want reverse translation we can set the flag to true
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)
        
    print("Done!")

    return input_lang, output_lang, pairs

For example:

In [30]:
eng, fra, pairs = readLangs("eng", "fra")

Reading lines...


In [37]:
# inspect the pairs
pairs[0:10]

[['go .', 'va !'],
 ['run !', 'cours !'],
 ['run !', 'courez !'],
 ['wow !', 'ca alors !'],
 ['fire !', 'au feu !'],
 ['help !', 'a l aide !'],
 ['jump .', 'saute .'],
 ['stop !', 'ca suffit !'],
 ['stop !', 'stop !'],
 ['stop !', 'arrete toi !']]

In [36]:
# how many pairs?
print(len(pairs))

135842
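
To make the line and pair splitting concrete, here is a small sketch that parses an in-memory string instead of the file. The two sample lines are made up for the example, under the assumption that each line holds an English and a French sentence separated by a tab; the names `sample`, `pairs_demo`, and `reversed_demo` are hypothetical.

```python
# two made-up lines in the assumed "english<TAB>french" layout
sample = "go .\tva !\nrun !\tcours !\n"

lines = sample.strip().split('\n')                  # one entry per line
pairs_demo = [line.split('\t') for line in lines]   # [english, french] per line
print(pairs_demo)  # [['go .', 'va !'], ['run !', 'cours !']]

# what reverse=True does to each pair: swap source and target
reversed_demo = [list(reversed(p)) for p in pairs_demo]
print(reversed_demo)  # [['va !', 'go .'], ['cours !', 'run !']]
```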


To be able to train and evaluate (and play with) a model quickly, we artificially reduce the data set to sentences that begin with forms such as "I am", "you are" and similar; moreover, we keep only pairs in which both sentences are shorter than a maximum length (here, fewer than 10 space-separated tokens).

In [40]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ", # these variants account for the apostrophes removed above
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filterPair(p): # returns True only if all three conditions are met
    # parentheses are used for line continuation: a comment cannot follow a backslash
    return (len(p[0].split(' ')) < MAX_LENGTH and # input shorter than MAX_LENGTH tokens
            len(p[1].split(' ')) < MAX_LENGTH and # output shorter than MAX_LENGTH tokens
            p[0].startswith(eng_prefixes)) # English version must begin with one of the prefixes above

def filterPairs(pairs): # filter based on boolean above
    return [pair for pair in pairs if filterPair(pair)]

For example:

In [42]:
# how many pairs "survive"?
len(filterPairs(pairs))

10599

In [43]:
# take a look
filterPairs(pairs)[0:10]

[['i m .', 'j ai ans .'],
 ['i m ok .', 'je vais bien .'],
 ['i m ok .', 'ca va .'],
 ['i m fat .', 'je suis gras .'],
 ['i m fat .', 'je suis gros .'],
 ['i m fit .', 'je suis en forme .'],
 ['i m hit !', 'je suis touche !'],
 ['i m hit !', 'je suis touchee !'],
 ['i m ill .', 'je suis malade .'],
 ['i m sad .', 'je suis triste .']]
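
The filter relies on the fact that `str.startswith` accepts a tuple of prefixes and returns `True` if any of them matches. Here is a minimal sketch with a shortened prefix tuple and three hand-written pairs; the names `prefixes_demo`, `max_len_demo`, `keep`, and `candidates` are assumptions for this example.

```python
prefixes_demo = ("i am ", "i m ", "you are", "you re ")  # shortened prefix list
max_len_demo = 10

def keep(pair):
    eng, fra = pair
    return (len(eng.split(' ')) < max_len_demo and
            len(fra.split(' ')) < max_len_demo and
            eng.startswith(prefixes_demo))  # True if any prefix matches

candidates = [
    ['i m ok .', 'je vais bien .'],  # kept: short and starts with "i m "
    ['go .', 'va !'],                # dropped: no matching prefix
    ['you are ' + 'very ' * 8 + 'late .', 'tu es tres en retard .'],  # dropped: 12 tokens
]
kept = [p for p in candidates if keep(p)]
print(kept)  # [['i m ok .', 'je vais bien .']]
```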

Finally, we can bring everything together in a function:

In [45]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs: # add sentences to the dictionaries
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

For example:

In [46]:
input_lang, output_lang, pairs = prepareData('eng', 'fra')
print(random.choice(pairs))

Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
eng 2803
fra 4345
['we re not lost .', 'nous ne sommes pas perdues .']


Great! We can now move to...

## The model