# Translating English to Romanian with an RNN
I'm trying to get a better understanding of RNNs before I move on to transformers, so I will be implementing an RNN that translates English to Romanian!  
I will be following this [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html), but will train the model on English-Romanian sentence pairs instead. Afterwards, I want to ask my model questions in English and have it respond in Japanese.

In [130]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cpu")

# Data Cleaning
Our data comes from https://www.manythings.org/anki/ as a text file. Each line is a tab-separated translation pair, for example `Hi.	もしもし`.

We will represent every word in a language as a one-hot vector, so we need a unique index per word to use as the inputs and targets of our network.  
Our `Lang` class keeps track of the word-to-index (`word2index`) and index-to-word (`index2word`) mappings, along with a count of each word (`word2count`) and the total number of words (`n_words`), which is also the next free index.

In [155]:
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
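
As a rough illustration of the indexing scheme, the sketch below builds a tiny vocabulary the same way `addWord` does and turns one word into a one-hot list. The helper names (`build_vocab`, `one_hot`) and the sample sentences are made up for this example and are not part of the notebook.

```python
def build_vocab(sentences):
    """Assign each new word the next free index, after SOS (0) and EOS (1)."""
    word2index = {}
    n_words = 2  # indices 0 and 1 are reserved for SOS and EOS
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                word2index[word] = n_words
                n_words += 1
    return word2index, n_words


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


word2index, n_words = build_vocab(["tom looked .", "tom privea ."])
vector = one_hot(word2index["looked"], n_words)
print(word2index)
print(vector)
```

In practice the network never builds the one-hot vector explicitly: `nn.Embedding` looks up a row by index directly, which is why only the indices are stored.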

The files are in unicode. To simplify the files, we will convert them to ASCII, make everything lowercase, and trim most of the punctuation.

In [132]:
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
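
To see what the normalization does to a Romanian sentence, here is a small standalone sketch. It repeats the two functions under snake_case names so it can run on its own; the raw sentences are hypothetical inputs written with Unicode escapes so the diacritics are explicit.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose accented characters, then drop the combining marks ('Mn')
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # any other run becomes one space
    return s


# Hypothetical raw sentences (not taken from the data file)
raw_a = "P\u0103ze\u0219te-mi spatele!"
raw_b = "Am pl\u0103tit aproape 30 de dolari."
print(normalize_string(raw_a))
print(normalize_string(raw_b))
```

Note that digits and hyphens are replaced by spaces as well, which would explain sentences in the results such as `am platit aproape de dolari .` where a number appears to be missing.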

## Filtering Data
There are a lot of example sentences, so we'll only keep the shorter ones.

We keep a pair only if both of its sentences contain fewer than `MAX_LENGTH` (7) words.

In [176]:
MAX_LENGTH = 7

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
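
A quick sketch of the length filter on a few sample pairs. The third pair is made up to show a sentence that is too long, and `filter_pair` is a standalone stand-in for `filterPair`.

```python
MAX_LENGTH = 7


def filter_pair(pair):
    # Keep the pair only if both sides have fewer than MAX_LENGTH words
    return all(len(side.split(' ')) < MAX_LENGTH for side in pair)


sample_pairs = [
    ["ninge ?", "is it snowing ?"],
    ["stii ce sa faci .", "you know what to do ."],
    ["a b c d e f g", "one two three"],  # made-up pair with a 7-word side
]
kept = [p for p in sample_pairs if filter_pair(p)]
print(len(kept))
```

Punctuation counts as a word here because normalization separates it with a space. Since an EOS token is appended later, a 6-word sentence becomes 7 tokens, which is exactly the number of rows reserved for the encoder outputs.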

## Reading the Data
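To read the file, we split it into lines and then split each line into a pair. Each line lists the English sentence first, so we also add a `reverse` flag that swaps every pair, making Romanian the input and English the target.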

In [177]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('%s-%s/ron.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into tab-separated fields, normalize them, and keep
    # only the first two (the sentence pair), dropping any extra columns
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    pairs = [p[:2] for p in pairs]
    
    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
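
The parsing step can be tried without the data file. The sketch below uses two made-up lines and assumes a layout of two sentences followed by an extra column, which is what the `p[:2]` slice in `readLangs` suggests; normalization is left out to keep the example short.

```python
# Two made-up lines: "English<TAB>other language<TAB>extra column"
raw_text = "Go.\tDu-te.\tattribution one\nHi.\tSalut.\tattribution two\n"

lines = raw_text.strip().split('\n')
# Keep only the first two tab-separated fields of every line
pairs = [line.split('\t')[:2] for line in lines]
# reverse=True corresponds to flipping each pair
reversed_pairs = [list(reversed(p)) for p in pairs]
print(pairs)
print(reversed_pairs)
```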

In [178]:
def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

In [233]:
input_lang, output_lang, pairs = prepare_data('ron', 'eng', reverse=True)

Reading lines...
Read 14237 sentence pairs
Trimmed to 5385 sentence pairs
Counting words...
Counted words:
eng 3778
ron 2931


# Seq2Seq Model
Seq2Seq models consist of two RNNs: an encoder and a decoder. The encoder reads a sequence and outputs a single vector; the decoder reads that vector to produce an output sequence.

When you translate words directly from one language to another, the meaning is sometimes lost because the words are in different orders. This means it's difficult to produce a correct translation from just a sequence of words.  
We feed the sequence into an encoder, which ideally encodes the *meaning* of the input sentence into a single vector.

## The Encoder
The encoder outputs some value for every word in the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state as input for the next input word.

In [180]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
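
To make the tensor shapes concrete, here is a minimal sketch of the same embedding-then-GRU step run over three token indices. The vocabulary size, hidden size, and token values are toy numbers chosen for this example.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 10, 8  # toy sizes chosen for this sketch
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

tokens = torch.tensor([[4], [7], [1]])   # three word indices, shape (3, 1)
hidden = torch.zeros(1, 1, hidden_size)  # same shape initHidden() returns

outputs = []
for token in tokens:
    # (1,) index -> (1, 1, hidden_size): sequence length 1, batch size 1
    embedded = embedding(token).view(1, 1, -1)
    output, hidden = gru(embedded, hidden)
    outputs.append(output[0, 0])

encoder_outputs = torch.stack(outputs)
print(encoder_outputs.shape)  # one row per input word
print(hidden.shape)
```

Because the GRU has a single layer and processes one step at a time, the output of each step equals the new hidden state; the outputs are kept per word for attention, while only the last hidden state is handed to the decoder.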

## The Decoder
The decoder takes the encoder's output vectors and produces a sequence of words to create the translation.

### Attention Decoder
When only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.  
Attention allows the decoder network to *focus* on a different part of the encoder's outputs at every step of the decoder's own outputs.  
First, we calculate a set of **attention weights**. These are multiplied by the encoder's output vectors to create a weighted combination. The result, `attn_applied`, should contain information about that specific part of the input sequence and help the decoder choose the right output words.

Calculating the attention weights is done with another feed-forward layer, `attn`, using the decoder's input and hidden state as input. Because the training data contains sentences of all sizes, we have to choose a maximum length (the input length, for encoder outputs) that the layer can apply to. Sentences of the maximum length use all the attention weights, while shorter sentences only use the first few.

In [231]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
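
The attention step on its own can be sketched as follows. Random tensors stand in for the output of `self.attn(...)` and for the encoder outputs, and the hidden size is a toy value; the point is only to show that `torch.bmm` here computes a weighted sum of the encoder output rows.

```python
import torch
import torch.nn.functional as F

max_length, hidden_size = 7, 8  # toy hidden size; max_length matches MAX_LENGTH
torch.manual_seed(0)

scores = torch.randn(1, max_length)                     # stand-in for self.attn(...)
encoder_outputs = torch.randn(max_length, hidden_size)  # one row per input position

attn_weights = F.softmax(scores, dim=1)                 # (1, max_length), sums to 1
attn_applied = torch.bmm(attn_weights.unsqueeze(0),     # (1, 1, max_length)
                         encoder_outputs.unsqueeze(0))  # (1, max_length, hidden_size)

# The batched matrix product is a weighted sum of the encoder output rows
manual = (attn_weights[0].unsqueeze(1) * encoder_outputs).sum(dim=0)
print(attn_applied.shape)
```

For a sentence shorter than `max_length`, the unused rows of `encoder_outputs` stay at zero, so whatever weight lands on them contributes nothing to `attn_applied`.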

# Training
## Preparing Training Data
For each pair, we need an input tensor (indexes of words from the input) and a target tensor (indexes of words from the target). While creating these tensors, we append the *EOS* token to the end of each tensor.

In [218]:
def idxFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = idxFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    output_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, output_tensor)
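
As a small worked example of the conversion, the sketch below turns one sentence into a column tensor of indices. The `word2index` dictionary is a hypothetical three-word vocabulary; the real one is built by `Lang.addSentence`.

```python
import torch

EOS_token = 1
# Hypothetical vocabulary for illustration only
word2index = {"tom": 2, "looked": 3, ".": 4}

indexes = [word2index[word] for word in "tom looked .".split(' ')]
indexes.append(EOS_token)
tensor = torch.tensor(indexes, dtype=torch.long).view(-1, 1)
print(tensor.shape)
print(tensor.flatten().tolist())
```

A word that is missing from `word2index` raises a `KeyError`, so this lookup only works for sentences made entirely of words seen while building the vocabulary.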

## Training 
To train the model, we run the input through the encoder and keep track of every output and the latest hidden state. Then the decoder is given the *SOS* token as its first input and the last hidden state of the encoder as the first hidden state.

**Teacher forcing** means using the real target outputs as each next input, instead of using the decoder's guess as the next input. This causes the network to converge faster, [but can cause some instability when the trained network is later used without the teacher.](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf)  

This means you can observe outputs with correct grammar but the wrong translation. Intuitively, the model has learned to represent the output grammar and can "pick up" the meaning once the teacher tells it the first few words, but it hasn't learned how to properly create the sentence from the translation in the first place.

We turn teacher forcing on or off at random for each training pair, with probability `teacher_forcing_ratio`.

In [219]:
teacher_forcing_ratio = 0.5

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
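
The difference between the two branches comes down to what is fed into the decoder at each step. The toy sketch below replaces the real decoder with a made-up function (`toy_decoder`, which just adds 2 to its input) so that the fed-in tokens are easy to follow; it leaves out the loss and the early stop on EOS.

```python
SOS, EOS = 0, 1


def toy_decoder(prev_token):
    # Stand-in for the real decoder: always "predicts" previous token + 2
    return prev_token + 2


def decoder_inputs(target, teacher_forcing):
    """Return the list of tokens fed to the decoder at each step."""
    fed = []
    decoder_input = SOS
    for target_token in target:
        fed.append(decoder_input)
        prediction = toy_decoder(decoder_input)
        # Teacher forcing feeds the ground truth; otherwise feed the guess
        decoder_input = target_token if teacher_forcing else prediction
    return fed


target = [5, 3, 9, EOS]
print(decoder_inputs(target, teacher_forcing=True))
print(decoder_inputs(target, teacher_forcing=False))
```

With teacher forcing the decoder always continues from the correct previous word, while without it one early mistake changes every later input.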

Helper function to print time elapsed and estimated time remaining.

In [220]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
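
The remaining-time estimate is a linear extrapolation: if a fraction `p` of the work took `s` seconds, the total is assumed to be `s / p`. A standalone sketch of that arithmetic, with `remaining` as a hypothetical helper that takes the elapsed time directly instead of reading the clock:

```python
import math


def as_minutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def remaining(elapsed, fraction_done):
    # Extrapolate the total run time linearly from the progress so far
    estimated_total = elapsed / fraction_done
    return estimated_total - elapsed


print(as_minutes(83))
print(as_minutes(remaining(90, 0.25)))
```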

## Training Process
The training process looks like this: 
* start a timer
* initialize optimizers and criterion
* create a set of training pairs
* start empty losses array for plotting
* call train many times and print the progress occasionally.

In [221]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0 # reset every print_every
    plot_loss_total = 0 # reset every plot_every
    
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs)) for i in range(n_iters)]
    criterion = nn.NLLLoss()
    
    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        
        loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss
        
        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))
        
        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    
    showPlot(plot_losses)
            

## Plotting Results

In [222]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    # plt.subplots() already creates the figure, so no separate plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

## Evaluation
Evaluation is mostly the same as training, but there are no targets, so we feed the decoder's predictions back to itself at each step. Every time it predicts a word, we add it to the output string, and when it predicts the *EOS* token we stop.

In [227]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]
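
The decoding loop in `evaluate` is a greedy search: at each step it takes the single highest-scoring word and feeds it back in. The sketch below shows that loop in isolation, with a made-up score table (`toy_scores`) standing in for the trained decoder.

```python
SOS, EOS = 0, 1
MAX_LENGTH = 7
index2word = {0: "SOS", 1: "EOS", 2: "tom", 3: "looked", 4: "."}

# Made-up lookup table standing in for the decoder: previous token -> scores
toy_scores = {
    0: [0.0, 0.1, 0.7, 0.1, 0.1],  # after SOS, "tom" scores highest
    2: [0.0, 0.1, 0.1, 0.7, 0.1],  # after "tom", "looked"
    3: [0.0, 0.1, 0.1, 0.1, 0.7],  # after "looked", "."
    4: [0.0, 0.7, 0.1, 0.1, 0.1],  # after ".", EOS
}


def greedy_decode():
    decoded = []
    token = SOS
    for _ in range(MAX_LENGTH):
        scores = toy_scores[token]
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == EOS:
            decoded.append('<EOS>')
            break
        decoded.append(index2word[token])
    return decoded


print(' '.join(greedy_decode()))
```

The `MAX_LENGTH` bound guarantees the loop ends even if the decoder never predicts EOS.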

In [235]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

In [225]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

1m 23s (- 19m 23s) (5000 6%) 3.9686
2m 45s (- 17m 56s) (10000 13%) 3.1147
4m 10s (- 16m 40s) (15000 20%) 2.5928
5m 33s (- 15m 16s) (20000 26%) 2.1787
6m 56s (- 13m 52s) (25000 33%) 1.7657
8m 19s (- 12m 29s) (30000 40%) 1.4654
9m 43s (- 11m 7s) (35000 46%) 1.1567
11m 8s (- 9m 44s) (40000 53%) 0.9576
12m 32s (- 8m 21s) (45000 60%) 0.7640
13m 56s (- 6m 58s) (50000 66%) 0.6039
15m 21s (- 5m 35s) (55000 73%) 0.4720
16m 46s (- 4m 11s) (60000 80%) 0.3631
18m 11s (- 2m 47s) (65000 86%) 0.3118
19m 36s (- 1m 24s) (70000 93%) 0.2532
21m 1s (- 0m 0s) (75000 100%) 0.2207


In [236]:
evaluateRandomly(encoder1, attn_decoder1)

> eu fac naveta cu trenul .
= i commute by train .
< i commute by train . <EOS>

> am platit aproape de dolari .
= i paid about bucks .
< i paid about bucks . <EOS>

> ninge ?
= is it snowing ?
< is it snowing ? <EOS>

> pazeste mi spatele .
= watch my back .
< watch my back . <EOS>

> ei m au ignorat .
= they ignored me .
< they ignored me . <EOS>

> stii ce sa faci .
= you know what to do .
< you know what to do . <EOS>

> tom nu isi lua notite .
= tom wasn t taking notes .
< tom wasn t taking notes . <EOS>

> ea a devenit asistenta .
= she became a nurse .
< she became a nurse . <EOS>

> intinde ti bratele .
= stretch out your arms .
< stretch out your arms . <EOS>

> tom privea .
= tom looked .
< tom looked . <EOS>

