# Translating English to Romanian with an RNN
I'm trying to get a better understanding of RNNs before I move on to transformers, so I will be implementing an RNN that translates English to Romanian!  
I will be following this [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html), but will train the model on English-Romanian sentence pairs instead. Afterwards, I want to ask my model questions in English and have it respond in Japanese.

In [58]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cpu")

# Data Cleaning
Our data is a text file from https://www.manythings.org/anki/. The file is a tab-separated list of translation pairs, one pair per line: `Hi.	もしもし`.

We will represent every word in a language as a one-hot vector, so we need a unique index per word to use as the inputs and targets of our network.  
Our `Lang` class keeps track of the word-to-index (`word2index`) and index-to-word (`index2word`) mappings, a count of each word (`word2count`), and the number of words seen so far (`n_words`), which doubles as the next free index. The counts could be used later to find and replace rare words.
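
To make the one-hot idea concrete, here is a minimal sketch. The vocabulary size, hidden size, and word index are made-up values for illustration. It also shows that looking up a row in an `nn.Embedding` gives the same result as multiplying the one-hot vector by the embedding weight matrix, which is why the model later works with indices instead of building one-hot vectors explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the printout small
vocab_size, hidden_size = 5, 3
word_index = torch.tensor([3])

# One-hot vector for word 3 in a 5-word vocabulary, shape (1, 5)
one_hot = F.one_hot(word_index, num_classes=vocab_size).float()

embedding = nn.Embedding(vocab_size, hidden_size)
via_lookup = embedding(word_index)        # index lookup, shape (1, 3)
via_matmul = one_hot @ embedding.weight   # one-hot times weights, shape (1, 3)

print(one_hot)
print(torch.allclose(via_lookup, via_matmul))
```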

In [59]:
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
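
As a quick check of how the indices get assigned, here is a small sketch on two made-up sentences. The class is repeated so the snippet runs on its own, and `indexes_from_sentence` is a hypothetical helper (not defined elsewhere in this notebook) that turns a sentence into a list of indices ending in `EOS_token`.

```python
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # SOS and EOS are already counted

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

def indexes_from_sentence(lang, sentence):
    # Hypothetical helper: map each word to its index and append EOS
    return [lang.word2index[w] for w in sentence.split(' ')] + [EOS_token]

demo = Lang("demo")
demo.addSentence("i am cold .")
demo.addSentence("i am tired .")

indexes = indexes_from_sentence(demo, "i am tired .")
print(demo.n_words)          # 7: SOS, EOS and five distinct tokens
print(demo.word2count["i"])  # 2
print(indexes)               # [2, 3, 6, 5, 1]
```

New words get indices in the order they are first seen, starting at 2 because 0 and 1 are reserved for `SOS` and `EOS`.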

The files are in Unicode. To simplify them, we will convert the text to ASCII, make everything lowercase, and trim most of the punctuation.

In [60]:
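def unicodeToAscii(s):
    # Decompose accented characters (NFD) and drop the combining marks ('Mn')
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Put a space before ., ! and ? so they become separate tokens
    s = re.sub(r"([.!?])", r" \1", s)
    # Collapse every run of other non-letter characters into a single space
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s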
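
Here is what the normalization does to a couple of sample strings. The inputs are made up for illustration, and the two functions are repeated so the snippet is self-contained.

```python
import re
import unicodedata

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

# Diacritics are stripped and the punctuation becomes its own token
print(normalizeString("Bună ziua!"))      # buna ziua !

# Digits are not letters, so they are dropped along with other symbols
print(normalizeString("I have 2 cats."))  # i have cats .
```

Note that stripping diacritics is lossy for Romanian: words that differ only by an accent end up as the same token.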

## Filtering Data
There are a lot of example sentences, so we'll only keep the shorter ones.

We filter so that both sentences in a pair contain fewer than 10 space-separated tokens. Punctuation marks count as tokens because normalization separates them from the words.

In [61]:
MAX_LENGTH = 10

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
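
A small sketch of the filter on two invented pairs (the sentences are placeholders, not lines from the dataset). The first pair is short enough to keep; the second has a 10-token side, so the strict `<` comparison drops it.

```python
MAX_LENGTH = 10

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

toy_pairs = [
    ["i am cold .", "mi e frig ."],                                # 4 and 4 tokens
    ["one two three four five six seven eight nine ten", "scurt"],  # 10 and 1 tokens
]

kept = filterPairs(toy_pairs)
print(len(kept))   # 1
print(kept[0][0])  # i am cold .
```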

## Reading the Data
To read the file, we'll split it into lines and then split each line into a pair. We'll also add a `reverse` flag that swaps the order of each pair.

In [69]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('%s-%s/ron.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    # Keep only the first two columns (the sentence pair)
    pairs = [p[:2] for p in pairs]
    
    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
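
The splitting and reversing steps can be sketched without touching the file system. The lines below are invented stand-ins, and the third column is an assumption about the file layout (extra trailing fields such as attribution text), which is what the `p[:2]` slice guards against. Normalization is left out here to keep the focus on the splitting.

```python
# Invented sample lines: source sentence, target sentence, extra column
raw_lines = [
    "Hi.\tSalut.\tattribution text",
    "Run!\tFugi!\tattribution text",
]

# Split on tabs and keep only the first two columns
pairs = [line.split('\t')[:2] for line in raw_lines]

# What reverse=True does: swap the two sides of every pair
reversed_pairs = [list(reversed(p)) for p in pairs]

print(pairs)           # [['Hi.', 'Salut.'], ['Run!', 'Fugi!']]
print(reversed_pairs)  # [['Salut.', 'Hi.'], ['Fugi!', 'Run!']]
```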

In [79]:
def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

In [80]:
input_lang, output_lang, pairs = prepare_data('ron', 'eng', reverse=True)

Reading lines...
Read 14237 sentence pairs
Trimmed to 11339 sentence pairs
Counting words...
Counted words:
eng 6602
ron 4717


# Seq2Seq Model
Seq2Seq models consist of two RNNs: an encoder and a decoder. The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence.

When you translate word by word from one language to another, the meaning is sometimes lost because the two languages order their words differently. This makes it difficult to produce a correct translation directly from the sequence of input words.  
Instead, we feed the sequence into an encoder, which ideally encodes the *meaning* of the input sentence into a single vector.

## The Encoder
The encoder outputs some value for every word in the input sentence. For every input word, the encoder outputs a vector and a hidden state, and it passes the hidden state along as input when it processes the next word.

In [82]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
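
To see the word-by-word loop in action, here is a minimal sketch that runs an untrained encoder over a sentence that has already been converted to indices. The vocabulary size, hidden size, and index values are made up, and the class is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

device = torch.device("cpu")

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

torch.manual_seed(0)
vocab_size, hidden_size = 12, 8  # hypothetical sizes
encoder = EncoderRNN(vocab_size, hidden_size).to(device)

# A made-up sentence as word indices, ending in EOS (index 1)
input_tensor = torch.tensor([2, 3, 6, 5, 1], device=device).view(-1, 1)

hidden = encoder.initHidden()
encoder_outputs = []
with torch.no_grad():
    for i in range(input_tensor.size(0)):
        # One word at a time; the hidden state carries over to the next word
        output, hidden = encoder(input_tensor[i], hidden)
        encoder_outputs.append(output[0, 0])
encoder_outputs = torch.stack(encoder_outputs)

print(encoder_outputs.shape)  # torch.Size([5, 8])
print(hidden.shape)           # torch.Size([1, 1, 8])
```

Because this is a single-layer GRU fed one step at a time, the last output vector and the final hidden state hold the same values. That final state is what gets handed to the decoder.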

## The Decoder
The decoder is another RNN that takes the encoder output vectors and outputs a sequence of words to create the translation.

### Simple Decoder
In the simplest Seq2Seq decoder, we only use the last output of the encoder, sometimes referred to as the *context vector* because it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.  

At every step of decoding, the decoder is given an input token and a hidden state. The initial input token is the *SOS* token and the initial hidden state is the *context vector*.

In [83]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
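
The decoding loop described above can be sketched as follows. This is only an illustration under a few assumptions: the decoder is untrained, the sizes are made up, a zero tensor stands in for the encoder's context vector, and the next token is picked greedily (the most likely word at each step), which is one simple choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS_token = 0
EOS_token = 1
MAX_LENGTH = 10
device = torch.device("cpu")

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

torch.manual_seed(0)
vocab_size, hidden_size = 12, 8  # hypothetical sizes
decoder = DecoderRNN(hidden_size, vocab_size).to(device)

# Stand-in for the encoder's final hidden state (the context vector)
context_vector = torch.zeros(1, 1, hidden_size, device=device)

decoder_input = torch.tensor([[SOS_token]], device=device)  # start with SOS
decoder_hidden = context_vector
decoded = []
with torch.no_grad():
    for _ in range(MAX_LENGTH):
        log_probs, decoder_hidden = decoder(decoder_input, decoder_hidden)
        next_token = log_probs.argmax(dim=1)  # greedy choice, shape (1,)
        decoded.append(next_token.item())
        if next_token.item() == EOS_token:
            break
        # Feed the predicted token back in as the next input
        decoder_input = next_token.view(1, 1)

print(decoded)
```

With random weights the output indices are meaningless. The point is the data flow: each step consumes the previous token and hidden state, and decoding stops at `EOS_token` or after `MAX_LENGTH` steps.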

# Training
