# Translation with a sequence to sequence network and attention

We create a model that translates French to English, using a sequence to sequence network with attention.

Based on [this PyTorch tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
sys.path.append('../src')

In [None]:
from seq.seq2seq.model import (
    Lang, SOS_token, EOS_token, tensor_from_sentence, train,
    EncoderRNN, AttnDecoderRNN, train_iters,
)
#from seq.seq2seq.model_torch import EncoderRNN, AttnDecoderRNN, trainIters

from seq.utils.parse import normalize_string

In [None]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
from pathlib import Path
from tqdm import tqdm
import time 
import math

import torch

from torch import optim


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
device

In [None]:
data_path = Path('data/seq-to-seq/eng-fra.txt')

We will represent each **word** (instead of each letter) in a language as a one-hot vector. A language has far more words than letters, so these vectors are much larger. We will cheat and trim the data so that each language uses only a few thousand words.

The `Lang` helper class imported above keeps `word2index` and `index2word` dictionaries for one language.
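
The real `Lang` class lives in `seq.seq2seq.model` and is not shown in this notebook. As a rough idea of what such a helper might look like, here is a minimal sketch. The name `LangSketch`, the `one_hot` function, and the token indices 0 and 1 are assumptions made for illustration and may not match the imported implementation.

```python
# Hypothetical stand-in for the imported `Lang` class; the real one may differ.
SOS_INDEX = 0  # assumed index of the start-of-sentence marker
EOS_INDEX = 1  # assumed index of the end-of-sentence marker


class LangSketch:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {SOS_INDEX: "SOS", EOS_INDEX: "EOS"}
        self.n_words = 2  # SOS and EOS are already counted

    def add_word(self, word):
        # Assign the next free index to a word seen for the first time.
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)


def one_hot(index, size):
    # A one-hot vector is all zeros except for a single 1 at the word's index.
    vec = [0] * size
    vec[index] = 1
    return vec


lang = LangSketch("eng")
lang.add_sentence("i am cold")
lang.add_sentence("i am tired")
print(lang.word2index)
print(one_hot(lang.word2index["am"], lang.n_words))
```

In practice the one-hot vector is rarely built explicitly: an embedding layer takes the word index directly, which is equivalent to multiplying the one-hot vector by the embedding matrix.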

In [None]:
def read_langs(lang1, lang2, reverse=False):
    # read as UTF-8 explicitly: the French side contains accented characters
    lines = data_path.read_text(encoding='utf-8').strip().split('\n')
    
    # split every line into pairs and normalize
    pairs = [[normalize_string(s) for s in l.split('\t')] for l in lines]
    
    # reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)
    return input_lang, output_lang, pairs

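Trim the data to short and simple sentences: both sides must have fewer than `MAX_LENGTH` words, and the English side must start with one of a few common prefixes such as "i am" or "he is". This is just a tutorial, and the smaller dataset trains quickly.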

In [None]:
MAX_LENGTH = 10  # keep only sentences with fewer than this many space-separated tokens

eng_prefixes = (  # filter to sentences beginning with these prefixes
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filter_pair(p):
    # with reverse=True, p[0] is the French sentence and p[1] the English one
    return (len(p[0].split(' ')) < MAX_LENGTH
            and len(p[1].split(' ')) < MAX_LENGTH
            and p[1].startswith(eng_prefixes))

def filter_pairs(pairs):
    return [pair for pair in pairs if filter_pair(pair)]

def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1, lang2, reverse)
    pairs = filter_pairs(pairs)
    for pair in tqdm(pairs):
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])
    return input_lang, output_lang, pairs
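
To make the filtering rule concrete, the sketch below applies the same two conditions to a few made-up sentence pairs. The names `keep`, `MAX_WORDS`, `PREFIXES`, and `toy_pairs` are hypothetical and exist only for this illustration; the prefix list is shortened.

```python
MAX_WORDS = 10  # mirrors MAX_LENGTH above
PREFIXES = ("i am ", "i m ", "he is", "he s ")  # shortened, illustrative list


def keep(pair, max_words=MAX_WORDS, prefixes=PREFIXES):
    french, english = pair
    short_enough = all(len(s.split(" ")) < max_words for s in pair)
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return short_enough and english.startswith(prefixes)


toy_pairs = [
    # kept: short, and the English side starts with "i am "
    ["je suis fatigue .", "i am tired ."],
    # dropped: the English side matches no prefix
    ["il fait froid .", "it is cold ."],
    # dropped: the French side has 14 tokens
    ["je suis " + "tres " * 10 + "fatigue .", "i am very tired ."],
]
kept = [p for p in toy_pairs if keep(p)]
print(kept)
```

Only the first pair survives. Punctuation counts toward the length because sentences are split on spaces.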

In [None]:
# the data file is English -> French; reverse=True makes French the input language
input_lang, output_lang, pairs = prepare_data('eng', 'fra', reverse=True)

In [None]:
input_lang.name, input_lang.n_words

In [None]:
output_lang.name, output_lang.n_words

In [None]:
len(pairs)

In [None]:
random.choice(pairs)

## The Seq2Seq model

A seq2seq network, also known as an encoder-decoder network, consists of two RNNs called the encoder and the decoder. The encoder reads an input sequence and outputs a single vector. The decoder reads that vector to produce an output sequence.

<img src="../figures/encoder-decoder.png">

### The encoder

The encoder of a seq2seq network is an RNN that outputs a vector and a hidden state for every word in the input sentence. Each word is first mapped to an embedding, and the embedding is then passed to the RNN together with the previous hidden state.
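
The imported `EncoderRNN` is not shown in this notebook. As an illustration of the embedding-then-RNN idea, here is a minimal sketch that assumes a GRU as the recurrent unit and processes one word at a time. The class name `EncoderSketch`, the sizes, and the word indices are made up for the example.

```python
import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    """Hypothetical minimal encoder: an embedding followed by a GRU."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, word_index, hidden):
        # Reshape to (seq_len=1, batch=1, hidden_size), as nn.GRU expects.
        embedded = self.embedding(word_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size)


encoder_sketch = EncoderSketch(vocab_size=20, hidden_size=8)
sentence = torch.tensor([[3], [7], [5], [1]])  # made-up word indices
hidden = encoder_sketch.init_hidden()
outputs = []
with torch.no_grad():
    for word_index in sentence:
        output, hidden = encoder_sketch(word_index, hidden)
        outputs.append(output[0, 0])
encoder_outputs = torch.stack(outputs)  # one vector per input word
print(encoder_outputs.shape, hidden.shape)
```

The per-word outputs are what an attention mechanism later looks at, while the final `hidden` summarizes the whole sentence.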

<img src="../figures/encoder-seq2seq.png">

### The decoder

The decoder is another RNN that takes the encoder output vector and outputs a sequence of words to create a translation.

<img src="../figures/decoder-seq2seq.png">

The encoder's final hidden state is given to the decoder as its first hidden state. This vector is called the **context vector**, since it summarizes the entire input sequence.
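
The hand-off of the context vector can be sketched as follows. This is a simplified decoder without attention, assuming a GRU and greedy decoding; `DecoderSketch`, `greedy_decode`, the token indices, and the sizes are hypothetical, and a random tensor stands in for a real encoder state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS, EOS = 0, 1  # assumed special-token indices


class DecoderSketch(nn.Module):
    """Hypothetical minimal decoder: embedding, GRU, then a linear output layer."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, word_index, hidden):
        embedded = F.relu(self.embedding(word_index).view(1, 1, -1))
        output, hidden = self.gru(embedded, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden


def greedy_decode(decoder, context_vector, max_length=10):
    hidden = context_vector            # the decoder starts from the encoder's state
    word_index = torch.tensor([SOS])   # the first input is the start token
    decoded = []
    with torch.no_grad():
        for _ in range(max_length):
            log_probs, hidden = decoder(word_index, hidden)
            word_index = log_probs.argmax(dim=1)  # pick the most likely word
            if word_index.item() == EOS:
                break
            decoded.append(word_index.item())
    return decoded


torch.manual_seed(0)
decoder_sketch = DecoderSketch(hidden_size=8, vocab_size=20)
context_vector = torch.randn(1, 1, 8)  # stands in for the encoder's final hidden state
decoded = greedy_decode(decoder_sketch, context_vector)
print(decoded)
```

Because the decoder here is untrained, the printed indices are meaningless. The point is only the data flow: context vector in, one word index out per step, stopping at the end token or at `max_length`.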

### Attention decoder

<img src="../figures/attn-diag-seq2seq.png">

<img src="../figures/attn-seq2seq.png">
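
As a rough illustration of the idea behind the diagrams above: at each decoding step, attention computes one weight per encoder output and feeds the weighted sum of those outputs to the decoder. The sketch below uses dot-product scoring, an assumption chosen for brevity; the imported `AttnDecoderRNN` may compute its weights differently. All tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, hidden_size = 4, 8
encoder_outputs = torch.randn(seq_len, hidden_size)  # one vector per input word
decoder_hidden = torch.randn(hidden_size)            # current decoder state

# Score each encoder output against the decoder state (dot product).
scores = encoder_outputs @ decoder_hidden            # shape: (seq_len,)
# Normalize the scores into weights that sum to 1.
attn_weights = F.softmax(scores, dim=0)              # shape: (seq_len,)
# Weighted sum of encoder outputs: the context used at this decoding step.
context = attn_weights @ encoder_outputs             # shape: (hidden_size,)

print(attn_weights, context.shape)
```

Unlike the single context vector of the plain decoder, this context is recomputed at every step, so the decoder can focus on different input words for different output words.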

## Train the seq2seq + attention model

In [None]:
teacher_forcing_ratio = 0.5
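
How the imported training code uses this ratio is not shown in this notebook. A common pattern, assumed in the sketch below, is to decide at random for each training pair whether the decoder is fed the ground-truth target words (teacher forcing) or its own previous predictions. The function name `use_teacher_forcing` is hypothetical.

```python
import random


def use_teacher_forcing(ratio, rng):
    # True with probability `ratio`: feed ground-truth words to the decoder.
    # False otherwise: feed the decoder its own previous prediction.
    return rng.random() < ratio


rng = random.Random(0)  # seeded generator so the run is repeatable
choices = [use_teacher_forcing(0.5, rng) for _ in range(1000)]
fraction = sum(choices) / len(choices)
print(fraction)
```

With a ratio of 0.5, roughly half of the training pairs use teacher forcing. A ratio of 1.0 always uses it and 0.0 never does.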

In [None]:
hidden_size = 256
encoder = EncoderRNN(input_lang.n_words, hidden_size, device).to(device)
attn_decoder = AttnDecoderRNN(
    hidden_size, output_lang.n_words, MAX_LENGTH, device, dropout_p=0.1
).to(device)
losses = train_iters(
    pairs,
    encoder,
    attn_decoder,
    input_lang,
    output_lang,
    100,
    device,
    teacher_forcing_ratio,
    MAX_LENGTH,
    learning_rate=0.01,
)

In [None]:
losses