# Neural Translation Model
by Mac Brennan

## Overview

For this project we will build a neural translation model that takes a French sentence as input and outputs an English sentence. The model is an encoder-decoder network, which means we have two neural networks:

- The encoder, which extracts the meaning from the French sentence and represents it as a tensor of numbers.
- The decoder, which converts that tensor of numbers back into a sentence in English.

Our job is to train the encoder and decoder so that the English sentence output by the decoder has the same meaning as the input French sentence. To give a visual understanding of what is happening, the following illustration shows the model that will be built. Don't worry if it doesn't make complete sense yet; the details will be explained as we go. The goal is to give you a starting point for visualizing what is happening.

<p style='text-align: center !important;'>
 <img src='https://github.com/macbrennan90/macbrennan90.github.io/blob/master/images/encoder-decoder.png?raw=true'
      alt='Translation Model Summary'>
</p>


This project will be broken up into several parts as follows:

__Part 1:__ Preparing the words

+ Dataset
+ Word Embeddings

__Part 2:__ Building the Model

+ Bi-Directional LSTM Encoder
+ Decoder with Attention

__Part 3:__ Training the Model

__Part 4:__ Evaluation

__Part 5:__ Visualize Attention

## Part 1: Preparing the words

### Dataset

The dataset consists of two text files, one of English sentences and one of the corresponding French sentences.

Each sentence is on its own line, so each file will be split into a list of sentences.

In [45]:
# Before we get started we will load all the packages we will need
import os

# Pytorch
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np

#### Load the data
The data will be stored in two lists where each item is a sentence. The lists are:
+ english_sentences
+ french_sentences

In [2]:
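# Read both files as UTF-8 so the accented French characters decode correctly
with open('data/small_vocab_en', "r", encoding="utf-8") as f:
    data1 = f.read()
with open('data/small_vocab_fr', "r", encoding="utf-8") as f:
    data2 = f.read()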
    
# The data is just in a text file with each sentence on its own line
english_sentences = data1.split('\n')
french_sentences = data2.split('\n')

In [6]:
print('Number of English sentences:', len(english_sentences), 
      '\nNumber of French sentences:', len(french_sentences),'\n')
print('Example/Target pair:\n')
print('  '+english_sentences[2])
print('  '+french_sentences[2])

Number of English sentences: 137861 
Number of French sentences: 137861 

Example/Target pair:

  california is usually quiet during march , and it is usually hot in june .
  california est généralement calme en mars , et il est généralement chaud en juin .
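Since the two lists are aligned line by line, it can be handy to zip them into sentence pairs and drop any empty lines, for example one left behind by a trailing newline. The following is a minimal sketch that uses two short made-up strings in place of the real files; the helper name `make_pairs` is hypothetical and not part of the original notebook.

```python
def make_pairs(source_lines, target_lines):
    """Pair up aligned lines, skipping pairs where either side is empty."""
    if len(source_lines) != len(target_lines):
        raise ValueError("the two files must have the same number of lines")
    return [(s.strip(), t.strip())
            for s, t in zip(source_lines, target_lines)
            if s.strip() and t.strip()]

# Toy stand-ins for the real data, including the empty string that
# str.split('\n') produces when the text ends with a newline.
toy_english = "she likes apples .\nhe likes pears .\n".split('\n')
toy_french = "elle aime les pommes .\nil aime les poires .\n".split('\n')

pairs = make_pairs(toy_english, toy_french)
print(len(pairs))
print(pairs[0])
```

Keeping the sentences as pairs makes it harder for the two lists to drift out of alignment if one of them is filtered or shuffled later.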


#### Vocabulary
We need to count how often each word appears in the dataset. This will give us a clearer picture of our data.


In [46]:
english_sentences[0].split()

['new',
 'jersey',
 'is',
 'sometimes',
 'quiet',
 'during',
 'autumn',
 ',',
 'and',
 'it',
 'is',
 'snowy',
 'in',
 'april',
 '.']

In [15]:
from collections import Counter

# Count how many times each word appears in each language
en_word_count = Counter(word for sentence in english_sentences for word in sentence.split())
fr_word_count = Counter(word for sentence in french_sentences for word in sentence.split())


In [27]:
print('Number of English words:', len(en_word_count))
print('Number of French words:', len(fr_word_count))

Number of English words: 227
Number of French words: 355


In [41]:
def get_value(items_tuple):
    # each item is a (word, count) tuple; sort by the count
    return items_tuple[1]

sorted_en_words = sorted(en_word_count.items(), key=get_value, reverse=True)

In [42]:
sorted_en_words

[('is', 205858),
 (',', 140897),
 ('.', 129039),
 ('in', 75525),
 ('it', 75137),
 ('during', 74933),
 ('the', 67628),
 ('but', 63987),
 ('and', 59850),
 ('sometimes', 37746),
 ('usually', 37507),
 ('never', 37500),
 ('least', 27564),
 ('favorite', 27371),
 ('fruit', 27105),
 ('most', 14934),
 ('loved', 13666),
 ('liked', 13546),
 ('new', 12197),
 ('paris', 11334),
 ('india', 11277),
 ('united', 11270),
 ('states', 11270),
 ('california', 11250),
 ('jersey', 11225),
 ('france', 11170),
 ('china', 10953),
 ('he', 10786),
 ('she', 10786),
 ('grapefruit', 10118),
 ('your', 9734),
 ('my', 9700),
 ('his', 9700),
 ('her', 9700),
 ('fall', 9134),
 ('june', 9133),
 ('spring', 9102),
 ('january', 9090),
 ('winter', 9038),
 ('march', 9023),
 ('autumn', 9004),
 ('may', 8995),
 ('nice', 8984),
 ('september', 8958),
 ('july', 8956),
 ('april', 8954),
 ('november', 8951),
 ('summer', 8948),
 ('december', 8945),
 ('february', 8942),
 ('our', 8932),
 ('their', 8932),
 ('freezing', 8928),
 ('pleasant', 

In [43]:
sorted_fr_words = sorted(fr_word_count.items(), key=get_value, reverse=True)

In [44]:
sorted_fr_words

[('est', 196809),
 ('.', 135619),
 (',', 123135),
 ('en', 105768),
 ('il', 84079),
 ('les', 65255),
 ('mais', 63987),
 ('et', 59851),
 ('la', 49861),
 ('parfois', 37746),
 ('jamais', 37215),
 ('le', 35306),
 ("l'", 32917),
 ('généralement', 31292),
 ('moins', 27557),
 ('au', 25738),
 ('aimé', 24842),
 ('fruit', 23626),
 ('préféré', 22886),
 ('agréable', 17751),
 ('froid', 16794),
 ('son', 16496),
 ('chaud', 16405),
 ('de', 15070),
 ('plus', 14934),
 ('automne', 14727),
 ('mois', 14350),
 ('à', 13870),
 ('elle', 12056),
 ('citrons', 11679),
 ('paris', 11334),
 ('inde', 11277),
 ('états-unis', 11210),
 ('france', 11170),
 ('jersey', 11052),
 ('new', 11047),
 ('chine', 10936),
 ('pendant', 10741),
 ('pamplemousse', 10140),
 ('mon', 9403),
 ('votre', 9368),
 ('juin', 9133),
 ('printemps', 9100),
 ('janvier', 9090),
 ('hiver', 9038),
 ('mars', 9023),
 ('été', 8999),
 ('mai', 8995),
 ('septembre', 8958),
 ('juillet', 8956),
 ('avril', 8954),
 ('novembre', 8951),
 ('décembre', 8945),
 ('févri

So the dataset is pretty small in terms of vocabulary: only 227 distinct English words and 355 distinct French words. We may want to get a bigger dataset, but we'll see how this one does.

### Word Embeddings

In [82]:
# collect the first 100 words and their vectors (increase the range to load more)
en_words = []
en_vectors = []
with open('data/wiki.en.vec', "r", encoding="utf-8") as f:
    f.readline()  # skip the header line
    for _ in range(100):
        en_vecs = f.readline().split()
        word = en_vecs[0]
        vector = np.asarray(en_vecs[1:], dtype=np.float32)
        if word not in en_words:
            en_words.append(word)
            en_vectors.append(vector)

In [83]:
en_words

[',',
 '.',
 'the',
 '</s>',
 'of',
 '-',
 'in',
 'and',
 "'",
 ')',
 '(',
 'to',
 'a',
 'is',
 'was',
 'on',
 's',
 'for',
 'as',
 'by',
 'that',
 'it',
 'with',
 'from',
 'at',
 'he',
 'this',
 'be',
 'i',
 'an',
 'utc',
 'his',
 'not',
 '–',
 'are',
 'or',
 'talk',
 'which',
 'also',
 'has',
 'were',
 'but',
 'have',
 '#',
 'one',
 'rd',
 'new',
 'first',
 'page',
 'no',
 'you',
 'they',
 'had',
 'article',
 't',
 'who',
 '?',
 'all',
 'their',
 'there',
 'been',
 'made',
 'its',
 'people',
 'may',
 'after',
 '%',
 'other',
 'should',
 'two',
 'score',
 'her',
 'can',
 'would',
 'more',
 'if',
 'she',
 'about',
 'when',
 'time',
 'team',
 'american',
 'such',
 'th',
 'do',
 'discussion',
 'links',
 'only',
 'some',
 'up',
 'see',
 'united',
 'years',
 'into',
 '/',
 'school',
 'so',
 'world',
 'university',
 'during']
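The `.vec` files are plain text: the first line is a header (which the cell above skips with `f.readline()`), and each following line holds a word followed by its vector components. Below is a small sketch of that parsing step, run on a tiny made-up in-memory file instead of `data/wiki.en.vec`. The function names `load_vectors` and `cosine`, the 3-dimensional toy vectors, and the header contents are assumptions for illustration only.

```python
import io
import math

def load_vectors(handle, limit):
    """Read up to `limit` word vectors from a .vec-style text stream."""
    handle.readline()  # skip the header line
    word2vec = {}
    for _ in range(limit):
        line = handle.readline()
        if not line:  # reached the end of the file early
            break
        parts = line.rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if word not in word2vec:
            word2vec[word] = values
    return word2vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up file contents: a header line, then one word per line.
toy_file = io.StringIO(
    "3 3\n"
    "the 1.0 0.0 0.0\n"
    "a 0.9 0.1 0.0\n"
    "school 0.0 0.0 1.0\n"
)
vectors = load_vectors(toy_file, limit=100)
print(sorted(vectors))
print(cosine(vectors['the'], vectors['a']) > cosine(vectors['the'], vectors['school']))
```

Storing the vectors in a dict keeps the duplicate check cheap as the number of words grows, whereas `word not in en_words` on a list has to scan the whole list each time.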

In [79]:
# make a dict mapping the first 100 words to their vectors (increase the range to load more)
fr_word2vec = {}
with open('data/wiki.fr.vec', "r", encoding="utf-8") as f:
    f.readline()  # skip the header line
    for _ in range(100):
        fr_vecs = f.readline().split()
        word = fr_vecs[0]
        vector = np.asarray(fr_vecs[1:], dtype=np.float32)
        if word not in fr_word2vec:
            fr_word2vec[word] = vector

In [80]:
fr_word2vec

{'#': array([-0.25004  , -0.29135  , -0.26787  ,  0.0019247,  0.070013 ,
        -0.25941  , -0.27919  , -0.29752  ,  0.27169  ,  0.038949 ,
        -0.27929  , -0.046832 , -0.36144  ,  0.51858  , -0.13666  ,
        -0.1768   ,  0.35704  ,  0.52863  , -0.2925   , -0.23158  ,
         0.026159 ,  0.01266  , -0.053051 ,  0.17067  , -0.032786 ,
        -0.34851  , -0.28979  , -0.0099246, -0.27903  , -0.56615  ,
         0.1442   ,  0.049089 ,  0.031445 ,  0.032972 ,  0.18406  ,
         0.25526  , -0.043553 ,  0.1554   ,  0.063371 ,  0.3494   ,
         0.023651 , -0.14081  ,  0.0055515,  0.056418 , -0.047313 ,
         0.36913  , -0.19925  ,  0.55417  ,  0.091695 ,  0.27058  ,
         0.19865  ,  0.031936 , -0.098765 ,  0.27911  , -0.0063712,
        -0.067249 , -0.14099  ,  0.12199  ,  0.18928  , -0.39528  ,
         0.083221 ,  0.23665  ,  0.17292  ,  0.11505  , -0.20174  ,
         0.04011  ,  0.54257  ,  0.41884  ,  0.26678  , -0.061063 ,
        -0.18088  , -0.051561 , -0.1208   ,

#### Preprocess the data
The example sentences and label sentences need to be converted to sequences of ints.

['new',
 'jersey',
 'is',
 'sometimes',
 'quiet',
 'during',
 'autumn',
 ',',
 'and',
 'it',
 'is',
 'snowy',
 'in',
 'april',
 '.']
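One common way to do this conversion is to assign each distinct word an integer id and then look the ids up sentence by sentence. The sketch below is one possible approach, not necessarily the one this project ends up using. The helper names `build_vocab` and `encode` and the `<unk>` token for unseen words are assumptions.

```python
def build_vocab(sentences, unk_token='<unk>'):
    """Map each distinct word to an integer id; id 0 is reserved for unknown words."""
    word2idx = {unk_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2idx:
                word2idx[word] = len(word2idx)
    return word2idx

def encode(sentence, word2idx, unk_token='<unk>'):
    """Convert a sentence into a list of integer ids."""
    return [word2idx.get(word, word2idx[unk_token]) for word in sentence.split()]

# Two short made-up sentences built from words in the dataset
toy_sentences = ['new jersey is sometimes quiet during autumn ,',
                 'and it is snowy in april .']
word2idx = build_vocab(toy_sentences)

# 'paris' is not in the toy vocabulary, so it maps to the unknown id
ids = encode('it is quiet in paris .', word2idx)
print(ids)
```

Reserving an id for unknown words means a sentence containing a word outside the vocabulary can still be encoded instead of raising a `KeyError`.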

## Encoder: Bi-Directional LSTM

In [20]:
seq_len = 5
batch_size = 1
input_dim = 10
hidden_size = 3
hidden_layers = 1
inputs = autograd.Variable(torch.randn((seq_len, batch_size, input_dim)))  # a sequence of length 5, batch size 1, 10-dim input vectors

# initialize the hidden state and cell state; each hidden layer has 3 nodes
hidden = (autograd.Variable(torch.randn(hidden_layers, batch_size, hidden_size)),
          autograd.Variable(torch.randn((hidden_layers, batch_size, hidden_size))))

In [21]:

lstm = nn.LSTM(input_dim, hidden_size)

In [22]:
out, hidden = lstm(inputs, hidden)

In [23]:
# outputs of each hidden node(3) for each item in sequence(5)
print(out)

# final hidden state and final cell state of sequence; notice that the hidden state equals the final output
print(hidden)

Variable containing:
(0 ,.,.) = 
 -0.1632 -0.0587 -0.0624

(1 ,.,.) = 
 -0.0296 -0.1508  0.0113

(2 ,.,.) = 
  0.0398  0.2324 -0.0406

(3 ,.,.) = 
  0.1305 -0.2526 -0.0687

(4 ,.,.) = 
  0.1295  0.1184 -0.1268
[torch.FloatTensor of size 5x1x3]

(Variable containing:
(0 ,.,.) = 
  0.1295  0.1184 -0.1268
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
  0.3113  0.2014 -0.2873
[torch.FloatTensor of size 1x1x3]
)
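The relationship noted in the comment above (the final hidden state equals the last output) can be checked directly in code. This is a minimal sketch that assumes a recent PyTorch version, where plain tensors can be used without wrapping them in `autograd.Variable`. It also lets the LSTM start from its default zero state rather than a random one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, input_dim, hidden_size = 5, 1, 10, 3

lstm = nn.LSTM(input_dim, hidden_size)
inputs = torch.randn(seq_len, batch_size, input_dim)

# With no initial state given, PyTorch starts from zeros.
out, (h_n, c_n) = lstm(inputs)

print(out.shape)   # one hidden vector per time step
print(h_n.shape)   # final hidden state

# The last time step of `out` is the final hidden state.
same = torch.allclose(out[-1], h_n[0])
print(same)
```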


In [24]:
seq_len = 5
batch_size = 1
input_dim = 10
hidden_size = 3
hidden_layers = 1
num_dir = 2 # for bidirectional lstm
inputs = autograd.Variable(torch.randn((seq_len, batch_size, input_dim)))  # a sequence of length 5, batch size 1, 10-dim input vectors

# initialize the hidden state and cell state for both directions; each hidden layer has 3 nodes
hidden = (autograd.Variable(torch.randn(hidden_layers*num_dir, batch_size, hidden_size)),
          autograd.Variable(torch.randn((hidden_layers*num_dir, batch_size, hidden_size))))

In [25]:
lstm = nn.LSTM(input_dim, hidden_size, bidirectional=True)

In [26]:
out, hidden = lstm(inputs, hidden)

In [28]:
# outputs of each hidden node in both directions (3*2) for each item in the sequence (5)
print(out)

# final hidden and cell states of the model in both directions
# notice that the first 3 outputs of the last item equal the final hidden state of the forward direction,
# and the last 3 outputs of the first item equal the final hidden state of the backward direction
print(hidden)

Variable containing:
(0 ,.,.) = 
 -0.1693  0.7509 -0.0828 -0.5793  0.5285  0.6648

(1 ,.,.) = 
  0.0218  0.1844 -0.1121 -0.6127  0.2034  0.3170

(2 ,.,.) = 
  0.0731 -0.0683  0.0742 -0.0712  0.1552  0.0725

(3 ,.,.) = 
  0.0098 -0.0052 -0.1215 -0.0741  0.2897  0.0834

(4 ,.,.) = 
 -0.0263 -0.3019 -0.1845 -0.1061  0.4653  0.0832
[torch.FloatTensor of size 5x1x6]

(Variable containing:
(0 ,.,.) = 
 -0.0263 -0.3019 -0.1845

(1 ,.,.) = 
 -0.5793  0.5285  0.6648
[torch.FloatTensor of size 2x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.3161 -0.5384 -0.4495

(1 ,.,.) = 
 -0.7322  0.9997  0.9236
[torch.FloatTensor of size 2x1x3]
)
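The same kind of check works for the bidirectional case: the forward half of the output finishes at the last time step, while the backward half finishes at the first time step. The sketch below assumes a recent PyTorch version (plain tensors, default zero initial state); the variable names `forward_out` and `backward_out` are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, input_dim, hidden_size = 5, 1, 10, 3

bilstm = nn.LSTM(input_dim, hidden_size, bidirectional=True)
inputs = torch.randn(seq_len, batch_size, input_dim)
out, (h_n, c_n) = bilstm(inputs)

# Each time step concatenates [forward, backward] hidden vectors.
forward_out = out[:, :, :hidden_size]
backward_out = out[:, :, hidden_size:]

# The forward direction finishes at the last time step,
# the backward direction finishes at the first time step.
forward_matches = torch.allclose(forward_out[-1], h_n[0])
backward_matches = torch.allclose(backward_out[0], h_n[1])

print(out.shape, h_n.shape)
print(forward_matches, backward_matches)
```

This matters for the encoder: taking only `out[-1]` as a sentence summary would include a backward state that has seen just one word, so the two final states in `h_n` are usually the better summary.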


## Model

### Bi-Directional LSTM Encoder

In [None]:
class EncoderBiLSTM(nn.Module):
    def __init__(self,):
        super(EncoderBiLSTM, self).__init__()
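The class above is still a stub. As a rough idea of where it could go, here is a hypothetical sketch of a bidirectional LSTM encoder built from the pieces explored earlier. The class name `SketchEncoderBiLSTM`, the constructor arguments, and the layer sizes are all assumptions for illustration and are not the finished encoder for this project. The vocabulary size of 355 is the French word count found in Part 1.

```python
import torch
import torch.nn as nn

class SketchEncoderBiLSTM(nn.Module):
    """Hypothetical encoder: embeds word ids and runs a bidirectional LSTM."""

    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (seq_len, batch) tensor of integer word ids
        embedded = self.embedding(word_ids)        # (seq_len, batch, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)  # outputs: (seq_len, batch, 2 * hidden_size)
        return outputs, (h_n, c_n)

encoder = SketchEncoderBiLSTM(vocab_size=355, embed_dim=8, hidden_size=4)
word_ids = torch.randint(0, 355, (6, 1))  # one made-up sentence of 6 word ids
outputs, (h_n, c_n) = encoder(word_ids)
print(outputs.shape, h_n.shape)
```

Returning the per-word `outputs` as well as the final states is a deliberate choice here, since an attention-based decoder needs one encoder vector per source word.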

### LSTM Decoder with Attention

In [None]:
class AttnDecoderLSTM(nn.Module):
    def __init__(self,):
        super(AttnDecoderLSTM, self).__init__()
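The decoder class is also a stub at this point. To make the idea of attention concrete, the sketch below shows a single attention step using dot-product scoring, which is one common variant and not necessarily the one this project will use. The function name `dot_product_attention` and the tensor sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_hidden, encoder_outputs):
    """
    decoder_hidden:  (batch, hidden)          current decoder state
    encoder_outputs: (seq_len, batch, hidden) one vector per source word
    Returns the context vector (batch, hidden) and the weights (batch, seq_len).
    """
    # Score each source position by its dot product with the decoder state.
    scores = torch.einsum('sbh,bh->bs', encoder_outputs, decoder_hidden)
    weights = F.softmax(scores, dim=1)  # sums to 1 over source positions
    # Weighted average of the encoder outputs.
    context = torch.einsum('bs,sbh->bh', weights, encoder_outputs)
    return context, weights

torch.manual_seed(0)
encoder_outputs = torch.randn(5, 1, 6)  # e.g. 5 source words, 2 * 3 hidden units
decoder_hidden = torch.randn(1, 6)
context, weights = dot_product_attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)
print(weights.sum(dim=1))
```

The `weights` tensor is the quantity that gets plotted when visualizing attention: one weight per source word for each decoding step.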

## Training

## Visualizing Attention
