<h1>Neural Translation Model in PyTorch</h1>
by Mac Brennan

<p style='text-align: center !important;'>
 <img src='https://github.com/macbrennan90/macbrennan90.github.io/blob/master/images/encoder-decoder.png?raw=true'
      alt='Translation Model Summary'>
</p>

This project will be broken up into several parts as follows:

__Part 1:__ Preparing the words

+ Inspecting the Dataset
+ Using Word Embeddings
+ Organizing the Data

__Part 2:__ Building the Model

+ Bi-Directional LSTM Encoder
+ Building Attention
+ Decoder with Attention

__Part 3:__ Training the Model

+ Training Function
+ Training Loop

__Part 4:__ Using the Model for Evaluation

__Part 5:__ Visualize the Attention

In [1]:
# Before we get started we will load all the packages we will need

# Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import os.path
import time
import math
import random

# Use gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
device

device(type='cuda')

## Part 1: Preparing the Words

### Inspecting the Dataset

The dataset consists of two text files: one of English sentences and one of the corresponding French sentences.

Each sentence is on its own line, so each file will be split into a list of sentences.

#### Load the data
The data will be stored in two lists where each item is a sentence. The lists are:
+ english_sentences
+ french_sentences

In [60]:
with open('data/small_vocab_en', "r") as f:
    data1 = f.read()
with open('data/small_vocab_fr', "r") as f:
    data2 = f.read()
    
# The data is just in a text file with each sentence on its own line
english_sentences = data1.split('\n')
french_sentences = data2.split('\n')

In [61]:
print('Number of English sentences:', len(english_sentences), 
      '\nNumber of French sentences:', len(french_sentences),'\n')
print('Example/Target pair:\n')
print('  '+english_sentences[2])
print('  '+french_sentences[2])

Number of English sentences: 137861 
Number of French sentences: 137861 

Example/Target pair:

  california is usually quiet during march , and it is usually hot in june .
  california est généralement calme en mars , et il est généralement chaud en juin .


#### Vocabulary
Let's take a closer look at the dataset.


In [62]:
english_sentences[2].split()

['california',
 'is',
 'usually',
 'quiet',
 'during',
 'march',
 ',',
 'and',
 'it',
 'is',
 'usually',
 'hot',
 'in',
 'june',
 '.']

In [63]:
max_en_length = 0
for sentence in english_sentences:
    length = len(sentence.split())
    max_en_length = max(max_en_length, length)
print("The longest english sentence in our dataset is:", max_en_length)    

The longest english sentence in our dataset is: 17


In [64]:
max_fr_length = 0
for sentence in french_sentences:
    length = len(sentence.split())
    max_fr_length = max(max_fr_length, length)
print("The longest french sentence in our dataset is:", max_fr_length)  

The longest french sentence in our dataset is: 23


In [65]:
en_word_count = {}
fr_word_count = {}

for sentence in english_sentences:
    for word in sentence.split():
        if word in en_word_count:
            en_word_count[word] +=1
        else:
            en_word_count[word] = 1
            
for sentence in french_sentences:
    for word in sentence.split():
        if word in fr_word_count:
            fr_word_count[word] +=1
        else:
            fr_word_count[word] = 1


In [66]:
print('Number of unique English words:', len(en_word_count))
print('Number of unique French words:', len(fr_word_count))

Number of unique English words: 227
Number of unique French words: 355
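
The counting loops above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a made-up two-sentence list (`toy_sentences` is a hypothetical stand-in for `english_sentences`), not a replacement for the cell above:

```python
from collections import Counter

# Hypothetical stand-in for english_sentences
toy_sentences = [
    "california is usually quiet during march",
    "california is usually hot in june",
]

# Count every whitespace-separated token across all sentences
toy_word_count = Counter(word for sentence in toy_sentences
                         for word in sentence.split())

print('Number of unique words:', len(toy_word_count))
print(toy_word_count.most_common(3))
```

`Counter.most_common(n)` returns the `n` most frequent items, which would also cover the sorting step used for the top-10 lists.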


In [67]:
def get_value(items_tuple):
    return items_tuple[1]

sorted_en_words= sorted(en_word_count.items(), key=get_value, reverse=True)

In [68]:
sorted_en_words[:10]

[('is', 205858),
 (',', 140897),
 ('.', 129039),
 ('in', 75525),
 ('it', 75137),
 ('during', 74933),
 ('the', 67628),
 ('but', 63987),
 ('and', 59850),
 ('sometimes', 37746)]

In [69]:
sorted_fr_words = sorted(fr_word_count.items(), key=get_value, reverse=True)

In [70]:
sorted_fr_words[:10]

[('est', 196809),
 ('.', 135619),
 (',', 123135),
 ('en', 105768),
 ('il', 84079),
 ('les', 65255),
 ('mais', 63987),
 ('et', 59851),
 ('la', 49861),
 ('parfois', 37746)]

So the dataset is pretty small: only 227 unique English words and 355 unique French words. We may want to get a bigger dataset, but we'll see how this one does.

### Using Word Embeddings

Here we build an embedding matrix of pretrained word vectors. The word embeddings were downloaded from the fastText repository and have 300 dimensions. To start, we add a few token embeddings for our specific case: a token to signal the start of a sentence, a token for words that we do not have an embedding for, and a token to pad sentences so that all the sentences we use have the same length. Padding allows us to train the model on batches of sentences of different lengths, rather than on one sentence at a time.

After this step we will have a dictionary and an embedding matrix for each language. The dictionary maps each word to the index in the embedding matrix where its corresponding embedding vector is stored.
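
To make the role of the special tokens concrete, here is a small sketch of the word-to-index mapping and padding described above. The vocabulary `toy_words`, the random `toy_vectors`, and the helper `encode` are all hypothetical names used only for illustration; the real embeddings are loaded from the fastText files in the cells below.

```python
import numpy as np

embedding_dim = 300  # matches the fastText vectors described above

# Special tokens come first so that the padding index is 0
toy_words = ['<pad>', '<s>', '<unk>', 'california', 'is', 'quiet']
toy_word2idx = {word: index for index, word in enumerate(toy_words)}

# Random vectors stand in for the pretrained embeddings in this sketch
rng = np.random.default_rng(0)
toy_vectors = rng.standard_normal((len(toy_words), embedding_dim)).astype(np.float32)
toy_vectors[toy_word2idx['<pad>']] = 0  # padding vector is all zeros

def encode(sentence, word2idx, seq_length):
    """Map a sentence to a fixed-length list of indices, padding with <pad>."""
    indices = [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]
    indices = indices[:seq_length]
    return indices + [word2idx['<pad>']] * (seq_length - len(indices))

toy_indices = encode('california is never quiet', toy_word2idx, seq_length=6)
print(toy_indices)                     # [3, 4, 2, 5, 0, 0]
print(toy_vectors[toy_indices].shape)  # (6, 300)
```

The word "never" is not in the toy vocabulary, so it maps to the `<unk>` index, and the two trailing zeros are padding.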

#### Load Embeddings for the English data

In [71]:
# The data file containing the embeddings is very large, so once we have the embeddings we want
# we save them as numpy arrays. Loading these is much faster than re-reading
# the large embedding file
if os.path.exists('data/en_words.npy') and os.path.exists('data/en_vectors.npy'):
    en_words = np.load('data/en_words.npy')
    en_vectors = np.load('data/en_vectors.npy')
    print('Embeddings load from .npy file')
else:
    # make a dict with the top 100,000 words
    en_words = ['<pad>', # Padding Token
                '<s>', # Start of sentence token
                '<unk>'# Unknown word token
               ]

    en_vectors = list(np.random.randn(3, 300))
    en_vectors[0] *= 0 # make the padding vector zeros

    with open('data/wiki.en.vec', "r") as f:
        f.readline()
        for _ in range(100000):
            en_vecs = f.readline()
            word = en_vecs.split()[0]
            vector = np.float32(en_vecs.split()[1:])

            # skip lines that don't have 300 dim
            if len(vector) != 300:
                continue

            if word not in en_words:
                en_words.append(word)
                en_vectors.append(vector)
        print(word, vector[:10]) # Last word embedding read from the file

    # Save the arrays so we don't have to load the full word embedding file
    np.save('data/en_words.npy', en_words)
    np.save('data/en_vectors.npy', en_vectors)

Embeddings load from .npy file


In [72]:
en_word2idx = {word:index for index, word in enumerate(en_words)}

In [73]:
hemophilia_idx = en_word2idx['hemophilia']
print('index for word hemophilia:', hemophilia_idx, 
      '\nvector for word hemophilia:\n',en_vectors[hemophilia_idx][:10])

index for word hemophilia: 99996 
vector for word hemophilia:
 [ 0.16189    -0.056121   -0.65560001  0.21569    -0.11878    -0.02066
  0.37613001 -0.24117    -0.098989   -0.010058  ]


The word embedding for hemophilia matches the one read from the file, so it looks like everything worked properly.

#### Load Embeddings for the French data

In [74]:
if os.path.exists('data/fr_words.npy') and os.path.exists('data/fr_vectors.npy'):
    fr_words = np.load('data/fr_words.npy')
    fr_vectors = np.load('data/fr_vectors.npy')
    print('Embeddings load from .npy file')
else:
    # make a dict with the top 100,000 words
    fr_words = ['<pad>',
                '<s>',
                '<unk>']

    fr_vectors = list(np.random.randn(3, 300))
    fr_vectors[0] = np.zeros(300) # make the padding vector zeros

    with open('data/wiki.fr.vec', "r") as f:
        f.readline()
        for _ in range(100000):
            fr_vecs = f.readline()
            word = fr_vecs.split()[0]
            try:
                vector = np.float32(fr_vecs.split()[1:])
            except ValueError:
                continue

            # skip lines that don't have 300 dim
            if len(vector) != 300:
                continue

            if word not in fr_words:
                fr_words.append(word)
                fr_vectors.append(vector)
        print(word, vector[:10])
    # Save the arrays so we don't have to load the full word embedding file
    np.save('data/fr_words.npy', fr_words)
    np.save('data/fr_vectors.npy', fr_vectors)

Embeddings load from .npy file


In [75]:
fr_word2idx = {word:index for index, word in enumerate(fr_words)}

In [76]:
chabeuil_idx = fr_word2idx['chabeuil']
print('index for word chabeuil:', chabeuil_idx, 
      '\nvector for word chabeuil:\n',fr_vectors[chabeuil_idx][:10])

index for word chabeuil: 99783 
vector for word chabeuil:
 [-0.18058001 -0.24758001  0.075607    0.17299999  0.24116001 -0.11223
 -0.28173     0.27373999  0.37997001  0.48008999]


The word embedding for chabeuil matches as well, so everything worked correctly for the French vocabulary.

In [77]:
# Save our embedding vectors to numpy arrays
if os.path.exists('data/fr_words.npy') and os.path.exists('data/en_words.npy'):
    print('Arrays already saved')
else:
    np.save('data/fr_words.npy', fr_words)
    np.save('data/fr_vectors.npy', fr_vectors)
    np.save('data/en_words.npy', en_words)
    np.save('data/en_vectors.npy', en_vectors)

Arrays already saved


Ok, so we have all the pieces needed to take words and convert them into word embeddings. These word embeddings already have a lot of useful information about how words relate since we loaded the pre-trained word embeddings. Now we can build the translation model with the embedding matrices built in.
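
As a rough illustration of what "built in" means, an embedding matrix can be turned into a PyTorch layer with `nn.Embedding.from_pretrained`. This is a sketch on a hypothetical 4-word matrix (`toy_vectors`), and it is a different route from the `weight.data.copy_` approach the model classes in this notebook use:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical 4-word vocabulary with 300-dimensional vectors; row 0 is <pad>
toy_vectors = np.random.randn(4, 300).astype(np.float32)
toy_vectors[0] = 0

# Build the layer from the matrix; freeze=True keeps the weights fixed during training
toy_embedding = nn.Embedding.from_pretrained(torch.from_numpy(toy_vectors),
                                             freeze=True, padding_idx=0)

word_indices = torch.tensor([[1, 2, 3, 0]])  # (batch_size=1, seq_length=4)
embedded = toy_embedding(word_indices)
print(embedded.shape)  # torch.Size([1, 4, 300])
```

Setting `freeze=False` instead would let the pretrained vectors be fine-tuned along with the rest of the model.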

#### Visualizing the Word Embeddings

In [78]:
# use PCA and t-SNE to reduce the dimensionality of the word embeddings to 3 dimensions

# Plot 3-d representation of several word embeddings and the words in plotly
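
The cell above is only a placeholder. As a possible starting point, the PCA half of that plan could be sketched with plain NumPy as follows. The matrix `toy_vectors` is a random, hypothetical stand-in for `en_vectors`, and neither t-SNE nor the plotting step is covered here:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.standard_normal((50, 300))  # hypothetical stand-in for en_vectors

# PCA: center the data, then project onto the top 3 right singular vectors
centered = toy_vectors - toy_vectors.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ components[:3].T

print(reduced.shape)  # (50, 3)
```

Each row of `reduced` is a 3-dimensional point that could be passed to a 3-d scatter plot together with its word label.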

### Setting up PyTorch Dataset and Dataloader

Rather than organizing all the data from a file and storing it in a list or some other data structure, PyTorch allows us to create a dataset object. To get an example from a dataset, we just index the dataset object like we would a list. However, all our processing can be contained in the object's initialization or indexing process.

This will also make training easier when we want to iterate through batches.

In [79]:
class French2EnglishDataset(Dataset):
    '''
        French and associated English sentences.
    '''
    
    def __init__(self, fr_sentences, en_sentences, fr_word2idx, en_word2idx, seq_length):
        self.fr_sentences = fr_sentences
        self.en_sentences = en_sentences
        self.fr_word2idx = fr_word2idx
        self.en_word2idx = en_word2idx
        self.seq_length = seq_length
    
    def __len__(self):
        return len(self.fr_sentences)
    
    def __getitem__(self, idx):
        '''
            Returns a pair of tensors containing word indices
            for the specified sentence pair in the dataset.
        '''
        
        # init torch tensors, note that 0 is the padding index
        french_tensor = torch.zeros(self.seq_length, dtype=torch.long)
        english_tensor = torch.zeros(self.seq_length, dtype=torch.long)
        
        # Get sentence pair
        french_sentence = self.fr_sentences[idx].split()
        english_sentence = self.en_sentences[idx].split()
        
        # Add end-of-sentence tags
        french_sentence.append('</s>')
        english_sentence.append('</s>')
        
        # Load word indices, falling back to <unk> for out-of-vocabulary words
        for i, word in enumerate(french_sentence):
            if word in self.fr_word2idx:
                french_tensor[i] = self.fr_word2idx[word]
            else:
                french_tensor[i] = self.fr_word2idx['<unk>']
        
        for i, word in enumerate(english_sentence):
            if word in self.en_word2idx:
                english_tensor[i] = self.en_word2idx[word]
            else:
                english_tensor[i] = self.en_word2idx['<unk>']
            
        sample = {'french_tensor': french_tensor, 'french_sentence': self.fr_sentences[idx],
                  'english_tensor': english_tensor, 'english_sentence': self.en_sentences[idx]}
        return sample

In [80]:
french_english_dataset = French2EnglishDataset(french_sentences,
                                               english_sentences,
                                               fr_word2idx,
                                               en_word2idx,
                                               seq_length = 30)

#### Example output of dataset

In [24]:
test_sample = french_english_dataset[12] # get 13th item in dataset

In [25]:
print(test_sample['french_sentence'])
test_sample['french_tensor']

inde est pluvieux en juin , et il est parfois chaud en novembre .


tensor([  1582,     21,  26768,     16,    166,      4,     11,     24,
            21,    583,   5478,     16,    194,      7,      3,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0])

In [26]:
fr_word2idx['est']

21

In [27]:
# Build dataloader to check how the batching works
dataloader = DataLoader(french_english_dataset, batch_size=64,
                        shuffle=True, num_workers=4)

In [28]:
for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched['french_tensor'].shape,
          sample_batched['english_tensor'].shape)
    if i_batch == 10:
        break

0 torch.Size([64, 30]) torch.Size([64, 30])
1 torch.Size([64, 30]) torch.Size([64, 30])
2 torch.Size([64, 30]) torch.Size([64, 30])
3 torch.Size([64, 30]) torch.Size([64, 30])
4 torch.Size([64, 30]) torch.Size([64, 30])
5 torch.Size([64, 30]) torch.Size([64, 30])
6 torch.Size([64, 30]) torch.Size([64, 30])
7 torch.Size([64, 30]) torch.Size([64, 30])
8 torch.Size([64, 30]) torch.Size([64, 30])
9 torch.Size([64, 30]) torch.Size([64, 30])
10 torch.Size([64, 30]) torch.Size([64, 30])
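
The batch shapes above come from the `DataLoader` stacking the per-sample tensors along a new first dimension, key by key. The sketch below shows this on a hypothetical `ToyPairDataset` that returns all-zero tensors, with a small batch size so that the final, partial batch is visible:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPairDataset(Dataset):
    """Hypothetical dataset: each sample is a dict of two fixed-length index tensors."""
    def __init__(self, num_samples, seq_length):
        self.num_samples = num_samples
        self.seq_length = seq_length

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return {'french_tensor': torch.zeros(self.seq_length, dtype=torch.long),
                'english_tensor': torch.zeros(self.seq_length, dtype=torch.long)}

toy_loader = DataLoader(ToyPairDataset(num_samples=10, seq_length=30), batch_size=4)
toy_shapes = [batch['french_tensor'].shape for batch in toy_loader]
print(toy_shapes)  # three batches: 4, 4 and the remaining 2 samples
```

Passing `drop_last=True` to the `DataLoader` would discard the final partial batch, which can be convenient when the model assumes a fixed batch size.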


## Part 2: Building the Model

### Bi-Directional LSTM Encoder

In [81]:
class EncoderBiLSTM(nn.Module):
    def __init__(self, hidden_size, pretrained_embeddings):
        super(EncoderBiLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.embedding_dim = pretrained_embeddings.shape[1]
        self.vocab_size = pretrained_embeddings.shape[0]
        self.num_layers = 3
        self.dropout = 0.2 if self.num_layers > 1 else 0
        self.bidirectional = True
        
        
        # Construct the layers
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        #self.embedding.weight.requires_grad = False
        
        self.lstm = nn.LSTM(self.embedding_dim,
                            self.hidden_size,
                            self.num_layers,
                            batch_first = True,
                            dropout=self.dropout,
                            bidirectional=self.bidirectional)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input)
        output = self.lstm(embedded, hidden)
        return output
    
    def initHidden(self, batch_size):
        
        hidden_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),
                                   batch_size,
                                   self.hidden_size, 
                                   device=device)
        
        cell_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),
                                 batch_size,
                                 self.hidden_size, 
                                 device=device)
        
        return (hidden_state, cell_state)

In [29]:
class EncoderBiGRU(nn.Module):
    def __init__(self, hidden_size, pretrained_embeddings):
        super(EncoderBiGRU, self).__init__()
        self.hidden_size = hidden_size
        self.embedding_dim = pretrained_embeddings.shape[1]
        self.vocab_size = pretrained_embeddings.shape[0]
        self.num_layers = 3
        self.dropout = 0.2 if self.num_layers > 1 else 0
        self.bidirectional = True
        
        
        # Construct the layers
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        self.embedding.weight.requires_grad = False
        
        self.gru = nn.GRU(self.embedding_dim,
                            self.hidden_size,
                            self.num_layers,
                            batch_first = True,
                            dropout=self.dropout,
                            bidirectional=self.bidirectional)
    
    def forward(self, input, hidden):
        embedded = self.embedding(input)
        output = self.gru(embedded, hidden)
        return output
    
    def initHidden(self, batch_size):
        
        hidden_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),
                                   batch_size,
                                   self.hidden_size, 
                                   device=device)
        
        return hidden_state

#### Testing the Encoder

In [31]:
# Test the encoder on a sample input, input tensor has dimensions (batch_size, seq_length)

batch_size = 1
seq_length = 3
hidden_size = 5
encoder = EncoderBiLSTM(hidden_size, np.vstack(fr_vectors)).to(device)
hidden = encoder.initHidden(batch_size)

inputs = torch.randint(0, 50, (batch_size, seq_length), dtype=torch.long, device=device)

# Call the module directly rather than .forward() so that any registered hooks run
encoder_output, hidden_state = encoder(inputs, hidden)

print("The final output of the BiLSTM Encoder on our test input is: \n\n", encoder_output.shape)

print('\n\nEncoder output tensor: \n\n', encoder_output)

The final output of the BiLSTM Encoder on our test input is: 

 torch.Size([1, 3, 10])


Encoder output tensor: 

 tensor([[[ 0.1766,  0.1007,  0.1192, -0.0698,  0.0227,  0.2454, -0.1413,
          -0.0350,  0.1378,  0.1392],
         [ 0.1682,  0.1184,  0.1189, -0.1139,  0.0307,  0.1731, -0.0694,
          -0.1511,  0.0726,  0.0135],
         [ 0.1906,  0.1733,  0.0971, -0.0538,  0.1314,  0.1169, -0.0618,
          -0.1163,  0.0259, -0.0038]]], device='cuda:0')


In [94]:
hidden_state[0] # first dimension is num_layers * num_directions (one forward and one backward entry per layer)

tensor([[[-0.2974,  0.0582, -0.0034, -0.2880, -0.3093]],

        [[-0.2216, -0.1478, -0.0425, -0.4917,  0.1357]],

        [[-0.0060,  0.0157,  0.0172, -0.1617,  0.1047]],

        [[ 0.0475, -0.2047, -0.2755,  0.0709,  0.0173]]], device='cuda:0')

In [97]:
hidden_state[0][::2].shape

torch.Size([2, 1, 5])

### Attention
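Let's take a moment to go over how attention is being modeled. The attention weights have shape (batch_size, seq_length) and assign one weight to each position of the encoder output; a batch matrix multiplication then produces the weighted sum of the encoder outputs for each example in the batch.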

In [98]:
# Initialize attention weights to one
attn_weights = torch.ones((batch_size, seq_length),device=device)

# Set all weights except the weights associated with the first sequence item equal to zero
# This would represent full attention on the first word in the sequence
attn_weights[:, 1:] = 0

attn_weights.unsqueeze_(1) # Add dimension for batch matrix multiplication
attn_applied = torch.bmm(attn_weights, encoder_output)
attn_applied.squeeze_(1) # Remove extra dimension

print('Attention weights:\n', attn_weights)
print('\nEncoder Output after attention is applied: \n', attn_applied)
print('\n', attn_applied.shape)

Attention weights:
 tensor([[[ 1.,  0.,  0.,  0.]],

        [[ 1.,  0.,  0.,  0.]],

        [[ 1.,  0.,  0.,  0.]]], device='cuda:0')

Encoder Output after attention is applied: 
 tensor([[-0.0971, -0.0180,  0.2225,  0.1585,  0.1202,  0.0191],
        [-0.0593, -0.0422,  0.1827,  0.1368,  0.2152, -0.0195],
        [-0.0840, -0.0323,  0.1780,  0.1227,  0.2055, -0.0143]], device='cuda:0')

 torch.Size([3, 6])
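
The same idea can be checked on a small, self-contained example that runs on the CPU. In this sketch the tensor `toy_encoder_output` and the hand-picked `scores` are hypothetical; in the decoder below, the scores come from a learned linear layer instead:

```python
import torch
import torch.nn.functional as F

batch_size, seq_length, hidden_dim = 2, 3, 4

# Stand-in for the encoder output: (batch_size, seq_length, hidden_dim)
toy_encoder_output = torch.arange(batch_size * seq_length * hidden_dim,
                                  dtype=torch.float).view(batch_size, seq_length, hidden_dim)

# Unnormalized attention scores; softmax turns each row into weights that sum to 1
scores = torch.tensor([[10.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])
toy_attn_weights = F.softmax(scores, dim=1)  # (batch_size, seq_length)

# (batch, 1, seq) x (batch, seq, hidden) -> (batch, 1, hidden)
context = torch.bmm(toy_attn_weights.unsqueeze(1), toy_encoder_output).squeeze(1)

print(toy_attn_weights.sum(dim=1))  # each row sums to 1
print(context.shape)                # torch.Size([2, 4])
```

The first example puts almost all of its weight on the first position, so its context vector is close to the first encoder output. The second example has uniform weights, so its context vector is the mean of the encoder outputs over the sequence.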


### LSTM Decoder with Attention

In [83]:
class AttnDecoderLSTM(nn.Module):
    def __init__(self, decoder_hidden_size, pretrained_embeddings, seq_length, encoder_hidden_dim):
        super(AttnDecoderLSTM, self).__init__()
        # Embedding parameters
        self.embedding_dim = pretrained_embeddings.shape[1]
        self.output_vocab_size = pretrained_embeddings.shape[0]
        
        # LSTM parameters
        self.decoder_hidden_size = decoder_hidden_size
        self.num_layers = 3 # Potentially add more layers to LSTM later
        self.dropout = 0.2 if self.num_layers > 1 else 0 # Potentially add dropout later
        
        # Attention parameters
        self.seq_length = seq_length
        self.encoder_hidden_dim = encoder_hidden_dim
        
        # Construct embedding layer for output language
        self.embedding = nn.Embedding(self.output_vocab_size, self.embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        #self.embedding.weight.requires_grad = False # we don't want to train the embedding weights
        
        # Construct layer that calculates attentional weights
        self.attn = nn.Linear(2 * self.decoder_hidden_size + self.embedding_dim, self.seq_length)
        
        # Construct layer that compresses the combined matrix of the input embeddings
        # and the encoder inputs after attention has been applied
        self.attn_with_input = nn.Linear(self.embedding_dim + self.encoder_hidden_dim, self.embedding_dim)
        
        # LSTM for Decoder
        self.lstm = nn.LSTM(self.embedding_dim,
                            self.decoder_hidden_size,
                            self.num_layers,
                            dropout=self.dropout)
        
        # Output layer
        self.out = nn.Linear(self.decoder_hidden_size, self.output_vocab_size)
    
    def forward(self, input, hidden, encoder_output):
        # Input word indices, should have dim(1, batch_size), output will be (1, batch_size, embedding_dim)
        embedded = self.embedding(input)
        
        # Calculate Attention weights
        attn_weights = F.softmax(self.attn(torch.cat((hidden[1][0], hidden[0][0], embedded[0]), 1)), dim=1)
        attn_weights = attn_weights.unsqueeze(1) # Add dimension for batch matrix multiplication
        
        # Apply Attention weights
        attn_applied = torch.bmm(attn_weights, encoder_output)
        attn_applied = attn_applied.squeeze(1) # Remove extra dimension, dim are now (batch_size, encoder_hidden_size)
        
        # Prepare LSTM input tensor

        attn_combined = torch.cat((embedded[0], attn_applied), 1) # Combine embedding input and attn_applied,
        lstm_input = F.relu(self.attn_with_input(attn_combined)) # pass through fully connected with ReLU
        lstm_input = lstm_input.unsqueeze(0) # Add seq dimension so tensor has expected dimensions for lstm
        
        output, hidden = self.lstm(lstm_input, hidden) # Output dim = (1, batch_size, decoder_hidden_size)
        output = F.log_softmax(self.out(output[0]), dim=1) # softmax over all words in vocab
        
        #### These might get taken out
        output_idx = output.topk(1, dim=1)[1].view(1,-1) # Get the index of highest probability
        output_embedded = self.embedding(output_idx)
        ####
        
        return output, hidden, attn_weights
    
    def initHidden(self, batch_size):
        
        hidden_state = torch.zeros(self.num_layers,
                                   batch_size,
                                   self.decoder_hidden_size, 
                                   device=device)
        
        cell_state = torch.zeros(self.num_layers,
                                 batch_size,
                                 self.decoder_hidden_size, 
                                 device=device)
        
        return (hidden_state, cell_state)

In [None]:
class AttnDecoderLSTMEmbeddingLoss(nn.Module):
    def __init__(self, decoder_hidden_size, pretrained_embeddings, seq_length, encoder_hidden_dim):
        super(AttnDecoderLSTMEmbeddingLoss, self).__init__()
        # Embedding parameters
        self.embedding_dim = pretrained_embeddings.shape[1]
        self.output_vocab_size = pretrained_embeddings.shape[0]
        
        # LSTM parameters
        self.decoder_hidden_size = decoder_hidden_size
        self.num_layers = 3 # Potentially add more layers to LSTM later
        self.dropout = 0.2 if self.num_layers > 1 else 0 # Potentially add dropout later
        
        # Attention parameters
        self.seq_length = seq_length
        self.encoder_hidden_dim = encoder_hidden_dim
        
        # Construct embedding layer for output language
        self.embedding = nn.Embedding(self.output_vocab_size, self.embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        #self.embedding.weight.requires_grad = False # we don't want to train the embedding weights
        
        # Construct layer that calculates attentional weights
        self.attn = nn.Linear(2 * self.decoder_hidden_size + self.embedding_dim, self.seq_length)
        
        # Construct layer that compresses the combined matrix of the input embeddings
        # and the encoder inputs after attention has been applied
        self.attn_with_input = nn.Linear(self.embedding_dim + self.encoder_hidden_dim, self.embedding_dim)
        
        # LSTM for Decoder
        self.lstm = nn.LSTM(self.embedding_dim,
                            self.decoder_hidden_size,
                            self.num_layers,
                            dropout=self.dropout)
        
        # Output layer
        self.out = nn.Linear(self.decoder_hidden_size, self.embedding_dim)
    
    def forward(self, input, hidden, encoder_output):
        # Input word indices, should have dim(1, batch_size), output will be (1, batch_size, embedding_dim)
        embedded = self.embedding(input)
        
        # Calculate Attention weights
        attn_weights = F.softmax(self.attn(torch.cat((hidden[1][0], hidden[0][0], embedded[0]), 1)), dim=1)
        attn_weights = attn_weights.unsqueeze(1) # Add dimension for batch matrix multiplication
        
        # Apply Attention weights
        attn_applied = torch.bmm(attn_weights, encoder_output)
        attn_applied = attn_applied.squeeze(1) # Remove extra dimension, dim are now (batch_size, encoder_hidden_size)
        
        # Prepare LSTM input tensor

        attn_combined = torch.cat((embedded[0], attn_applied), 1) # Combine embedding input and attn_applied,
        lstm_input = F.relu(self.attn_with_input(attn_combined)) # pass through fully connected with ReLU
        lstm_input = lstm_input.unsqueeze(0) # Add seq dimension so tensor has expected dimensions for lstm
        
        output, hidden = self.lstm(lstm_input, hidden) # Output dim = (1, batch_size, decoder_hidden_size)
        output = self.out(output[0]) # outputs a 300 dimensional vector corresponding to a predicted word
        
        return output, hidden, attn_weights
    
    def initHidden(self, batch_size):
        
        hidden_state = torch.zeros(self.num_layers,
                                   batch_size,
                                   self.decoder_hidden_size, 
                                   device=device)
        
        cell_state = torch.zeros(self.num_layers,
                                 batch_size,
                                 self.decoder_hidden_size, 
                                 device=device)
        
        return (hidden_state, cell_state)

In [30]:
class AttnDecoderGRU(nn.Module):
    def __init__(self, decoder_hidden_size, pretrained_embeddings, seq_length, encoder_hidden_dim):
        super(AttnDecoderGRU, self).__init__()
        # Embedding parameters
        self.embedding_dim = pretrained_embeddings.shape[1]
        self.output_vocab_size = pretrained_embeddings.shape[0]
        
        # GRU parameters
        self.decoder_hidden_size = decoder_hidden_size
        self.num_layers = 3 # Potentially add more layers to LSTM later
        self.dropout = 0.2 if self.num_layers > 1 else 0 # Potentially add dropout later
        
        # Attention parameters
        self.seq_length = seq_length
        self.encoder_hidden_dim = encoder_hidden_dim
        
        # Construct embedding layer for output language
        self.embedding = nn.Embedding(self.output_vocab_size, self.embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        self.embedding.weight.requires_grad = False # we don't want to train the embedding weights
        
        # Construct layer that calculates attentional weights
        self.attn = nn.Linear(self.decoder_hidden_size + self.embedding_dim, self.seq_length)
        
        # Construct layer that compresses the combined matrix of the input embeddings
        # and the encoder inputs after attention has been applied
        self.attn_with_input = nn.Linear(self.embedding_dim + self.encoder_hidden_dim, self.embedding_dim)
        
        # gru for Decoder
        self.gru = nn.GRU(self.embedding_dim,
                            self.decoder_hidden_size,
                            self.num_layers,
                            dropout=self.dropout)
        
        # Output layer
        self.out = nn.Linear(self.decoder_hidden_size, self.output_vocab_size)
    
    def forward(self, input, hidden, encoder_output):
        # Input word indices, should have dim(1, batch_size), output will be (1, batch_size, embedding_dim)
        embedded = self.embedding(input)
        
        # Calculate Attention weights
        attn_weights = F.softmax(self.attn(torch.cat((hidden[0], embedded[0]), 1)), dim=1)
        attn_weights = attn_weights.unsqueeze(1) # Add dimension for batch matrix multiplication
        
        # Apply Attention weights
        attn_applied = torch.bmm(attn_weights, encoder_output)
        attn_applied = attn_applied.squeeze(1) # Remove extra dimension, dim are now (batch_size, encoder_hidden_size)
        
        # Prepare GRU input tensor

        attn_combined = torch.cat((embedded[0], attn_applied), 1) # Combine embedding input and attn_applied,
        gru_input = F.relu(self.attn_with_input(attn_combined)) # pass through fully connected with ReLU
        gru_input = gru_input.unsqueeze(0) # Add seq dimension so tensor has expected dimensions for gru
        
        output, hidden = self.gru(gru_input, hidden) # Output dim = (1, batch_size, decoder_hidden_size)
        output = F.log_softmax(self.out(output[0]), dim=1) # softmax over all words in vocab
        
        #### These might get taken out
        output_idx = output.topk(1, dim=1)[1].view(1,-1) # Get the index of highest probability
        output_embedded = self.embedding(output_idx)
        ####
        
        return output, hidden, attn_weights
    
    def initHidden(self, batch_size):
        
        hidden_state = torch.zeros(self.num_layers,
                                   batch_size,
                                   self.decoder_hidden_size, 
                                   device=device)
        
        return hidden_state

#### Testing the Decoder

In [37]:
# Test the decoder on sample inputs to check that the dimensions of everything are correct
decoder_hidden_size = 5

decoder = AttnDecoderLSTM(decoder_hidden_size, np.vstack(fr_vectors), seq_length, 2*hidden_size).to(device)

In [38]:
input_idx = torch.tensor([fr_word2idx['<s>']]*batch_size, dtype=torch.long, device=device)

In [39]:
input_idx.shape

torch.Size([32])

In [40]:
input_idx = input_idx.unsqueeze_(0)
decoder_hidden = decoder.initHidden(batch_size)

In [41]:
input_idx.shape

torch.Size([1, 32])

In [42]:
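# encoder_output comes from the encoder test cell above, so that cell must be run first.
# forward returns three values: log-probabilities, the new hidden state and the attention weights
o, h, a = decoder(input_idx, decoder_hidden, encoder_output)
print(o.shape)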

In [84]:
decoder_hidden[0].shape

torch.Size([2, 1, 256])

### How do LSTM layers work in PyTorch (old code)

In [158]:
seq_len = 5
batch_size = 3
embedding_dim = 10
hidden_size = 3
hidden_layers = 1
inputs = torch.randn((seq_len, batch_size, embedding_dim))  # a sequence of length 5, batch of 3, 10-dim input vectors

# initialize the hidden state. hidden layers have 3 nodes
hidden = (torch.randn(hidden_layers, batch_size, hidden_size),
            torch.randn((hidden_layers, batch_size, hidden_size)))

In [159]:

lstm = nn.LSTM(embedding_dim, hidden_size)

In [160]:
out, hidden = lstm(inputs, hidden)

In [161]:
# outputs of each hidden node(3) for each item in sequence(5)
print(out)
print(out.shape)

# final hidden state and final cell state of sequence; notice that the hidden state equals the final output
print(hidden[0].shape)

tensor([[[-2.5325e-01,  4.9366e-02,  4.0028e-01],
         [ 8.1580e-01, -5.9460e-01, -4.5211e-06],
         [ 5.9945e-01,  3.4448e-04,  1.5961e-01]],

        [[ 5.9029e-01,  8.7725e-04,  1.1217e-02],
         [ 7.1056e-03, -1.5894e-04, -1.4747e-04],
         [-2.7879e-03,  1.2264e-06, -1.4473e-02]],

        [[ 1.3057e-01,  3.2360e-03,  5.4227e-01],
         [-3.9118e-01,  7.1249e-02, -3.7470e-01],
         [ 4.5797e-05,  5.1406e-05,  2.7413e-02]],

        [[ 9.9916e-02,  4.7488e-01,  9.7807e-03],
         [-3.8806e-01, -8.1084e-05,  7.0315e-02],
         [ 1.5172e-01,  5.9603e-01, -1.1859e-04]],

        [[ 2.7766e-01,  4.5121e-02, -5.6821e-01],
         [-7.7895e-02, -4.1036e-01, -6.2259e-03],
         [ 9.4735e-02,  1.3790e-04, -1.2973e-04]]])
torch.Size([5, 3, 3])
torch.Size([1, 3, 3])


In [24]:
seq_len = 5
batch_size = 1
input_dim = 10
hidden_size = 3
hidden_layers = 1
num_dir = 2 # for bidirectional lstm
inputs = autograd.Variable(torch.randn((seq_len, batch_size, input_dim)))  # make a sequence of length 5, 1 batch, input 10dim vector

# initialize the hidden state. hidden layers have 3 nodes
hidden = (autograd.Variable(torch.randn(hidden_layers*num_dir, batch_size, hidden_size)),
          autograd.Variable(torch.randn((hidden_layers*num_dir, batch_size, hidden_size))))

In [25]:
lstm = nn.LSTM(input_dim, hidden_size, bidirectional=True)

In [26]:
out, hidden = lstm(inputs, hidden)

In [28]:
# outputs of each hidden node in both directions(3*2) for each item in sequence(5)
print(out)

# final hidden and cell states of the model in both directions
# notice that the first 3 outputs of the last item equal the first final hidden state (index 0),
# and the last 3 outputs of the first item equal the second final hidden state (index 1)
print(hidden)

Variable containing:
(0 ,.,.) = 
 -0.1693  0.7509 -0.0828 -0.5793  0.5285  0.6648

(1 ,.,.) = 
  0.0218  0.1844 -0.1121 -0.6127  0.2034  0.3170

(2 ,.,.) = 
  0.0731 -0.0683  0.0742 -0.0712  0.1552  0.0725

(3 ,.,.) = 
  0.0098 -0.0052 -0.1215 -0.0741  0.2897  0.0834

(4 ,.,.) = 
 -0.0263 -0.3019 -0.1845 -0.1061  0.4653  0.0832
[torch.FloatTensor of size 5x1x6]

(Variable containing:
(0 ,.,.) = 
 -0.0263 -0.3019 -0.1845

(1 ,.,.) = 
 -0.5793  0.5285  0.6648
[torch.FloatTensor of size 2x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.3161 -0.5384 -0.4495

(1 ,.,.) = 
 -0.7322  0.9997  0.9236
[torch.FloatTensor of size 2x1x3]
)
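The relationship noted in the comments above can be checked directly. The sketch below is written against current PyTorch tensors (no `autograd.Variable`) and uses random weights, so its numbers will not match the output above; the names `H`, `h_n`, and `c_n` are chosen here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, input_dim, H = 5, 1, 10, 3
lstm = nn.LSTM(input_dim, H, bidirectional=True)

inputs = torch.randn(seq_len, batch_size, input_dim)
out, (h_n, c_n) = lstm(inputs)  # initial states default to zeros when omitted

# out: (seq_len, batch_size, 2 * H), forward and backward outputs concatenated
# h_n: (2, batch_size, H), index 0 is the forward direction, index 1 the backward
forward_final = out[-1, :, :H]   # the forward pass finishes at the last time step
backward_final = out[0, :, H:]   # the backward pass finishes at the first time step

print(torch.allclose(forward_final, h_n[0]))
print(torch.allclose(backward_final, h_n[1]))
```

Both checks should print `True`, which is why the encoder's final states can be read either from `h_n` or from the two ends of `out`.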


## Part 3: Training the Model

### Example training iteration

Walk through one training iteration with a sample input then build a function.

In [122]:
# Set Hyperparameters
hidden_size = 256
batch_size = 32
seq_length = 30
learning_rate = 0.01

# Get sample input and target batch
dataloader = DataLoader(french_english_dataset, batch_size=batch_size,
                        shuffle=True, num_workers=4)

for batch in dataloader:
    input_tensor = batch['french_tensor'].to(device)
    target_tensor = batch['english_tensor'].transpose(1,0).to(device) # Change dimensions to (seq_length, batch_size)
    break

# get encoder, init encoder_hidden, and encoder_optimizer
encoder = EncoderBiLSTM(hidden_size, fr_vectors).to(device)
encoder_hidden = encoder.initHidden(batch_size) # init hidden layer

encoder_parameters = filter(lambda p: p.requires_grad, encoder.parameters())
encoder_optimizer = optim.SGD(encoder_parameters, lr=learning_rate)
encoder.embedding.weight.requires_grad = True

# run forward pass through encoder on entire sequence
encoder_output, encoder_hidden = encoder.forward(input_tensor, encoder_hidden)

# initialize decoder, decoder_input, decoder_hidden, and decoder optimizer
decoder = AttnDecoderLSTM(hidden_size, en_vectors, seq_length, encoder_output.shape[2]).to(device)
decoder_input =  torch.tensor([en_word2idx['<s>']]*batch_size, dtype=torch.long, device=device).unsqueeze(0)
decoder_hidden = decoder.initHidden(batch_size)
#(encoder_hidden[0][1].unsqueeze_(0), encoder_hidden[1][1].unsqueeze_(0))

decoder_parameters = filter(lambda p: p.requires_grad, decoder.parameters())
decoder_optimizer = optim.SGD(decoder_parameters, lr=learning_rate)
decoder.embedding.weight.requires_grad = True

# initialize loss and criterion function, we also want the target words as embedding vectors
loss = 0
criterion = F.cosine_similarity
target_embeddings = decoder.embedding(target_tensor)

# Iterate through the sequence length comparing the embeddings at each step
for di in range(seq_length):
    output, output_embedding, decoder_hidden, attn_weights = decoder.forward(decoder_input,
                                                                                    decoder_hidden,
                                                                                    encoder_output)
    decoder_input = output.detach()
    loss += torch.sum(-criterion(target_embeddings[di], output_embedding[0])) / batch_size

loss.backward()

encoder_optimizer.step()
decoder_optimizer.step()

print(loss.item())

-2.362576961517334


### Training Function

In [84]:
def train(input_tensor, target_tensor, encoder, decoder,
          encoder_optimizer, decoder_optimizer, criterion):
    
    batch_size = input_tensor.shape[0]
    encoder_hidden = encoder.initHidden(batch_size)

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    
    # run forward pass through encoder on entire sequence
    encoder_output, encoder_hidden = encoder(input_tensor, encoder_hidden)
    
    # start every sequence with the <s> token; dims are (1, batch_size)
    decoder_input = torch.tensor([en_word2idx['<s>']]*batch_size, dtype=torch.long, device=device).unsqueeze(0)
    # initialize the decoder with every other encoder (hidden, cell) state
    decoder_hidden = (encoder_hidden[0][::2].contiguous(), encoder_hidden[1][::2].contiguous())
    
    loss = 0
    
    # use teacher forcing for roughly half of the batches
    use_teacher_forcing = random.random() < 0.5

    # seq_length is the global padded sequence length
    for di in range(seq_length):
        output, decoder_hidden, attn_weights = decoder(decoder_input, decoder_hidden, encoder_output)
        if use_teacher_forcing:
            decoder_input = target_tensor[di].unsqueeze(0) # feed the ground-truth word
        else:
            decoder_input = output.topk(1)[1].view(1,-1).detach() # feed the model's own prediction
        loss += criterion(output, target_tensor[di])
    
    loss.backward()
    
    # clip gradients to a max norm of 15
    nn.utils.clip_grad_norm_(encoder.parameters(), 15)
    nn.utils.clip_grad_norm_(decoder.parameters(), 15)

    encoder_optimizer.step()
    decoder_optimizer.step()
    
    return loss.item()
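
The two decoding modes in `train` differ only in what is fed to the decoder at the next step: the ground-truth word (teacher forcing) or the model's own prediction. A toy sketch of that choice in plain Python, where `toy_decoder_step` is a hypothetical stand-in for the real decoder and `decode` is a name made up for this example:

```python
import random

def toy_decoder_step(token):
    """Hypothetical stand-in for one decoder step: 'predicts' token + 1 (mod 10)."""
    return (token + 1) % 10

def decode(targets, start_token, teacher_forcing_ratio=0.5):
    """Decode one sequence, choosing the feeding strategy once per sequence."""
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    decoder_input = start_token
    predictions = []
    for target in targets:
        prediction = toy_decoder_step(decoder_input)
        predictions.append(prediction)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own guess
        decoder_input = target if use_teacher_forcing else prediction
    return predictions, use_teacher_forcing

print(decode([5, 7, 2], start_token=0, teacher_forcing_ratio=1.0))  # always teacher forcing
print(decode([5, 7, 2], start_token=0, teacher_forcing_ratio=0.0))  # never teacher forcing
```

With teacher forcing each prediction depends on the previous target, so one early mistake does not derail the rest of the sequence; without it, errors feed forward, which is closer to what happens at evaluation time.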

In [None]:
def trainEmbeddingLoss(input_tensor, target_tensor, encoder, decoder,
          encoder_optimizer, decoder_optimizer, criterion):
    
    # the embedding-based loss is not implemented yet;
    # for now this runs exactly the same training step as train()
    return train(input_tensor, target_tensor, encoder, decoder,
                 encoder_optimizer, decoder_optimizer, criterion)

### Training Loop

In [85]:
def trainIters(encoder, decoder, dataloader, epochs, print_every_n_batches=100, learning_rate=0.01):
    
    # keep track of losses
    start = time.time()
    plot_losses = []

    # Initialize Encoder Optimizer
    encoder_parameters = filter(lambda p: p.requires_grad, encoder.parameters())
    encoder_optimizer = optim.SGD(encoder_parameters, lr=learning_rate)
    
    # Initialize Decoder Optimizer
    decoder_parameters = filter(lambda p: p.requires_grad, decoder.parameters())
    decoder_optimizer = optim.SGD(decoder_parameters, lr=learning_rate)

    
    criterion = nn.NLLLoss()
    
    # Cycle through epochs
    for epoch in range(epochs):
        
        # Cycle through batches
        for i, batch in enumerate(dataloader):
            
            input_tensor = batch['french_tensor'].to(device)
            target_tensor = batch['english_tensor'].transpose(1,0).to(device)
            

            loss = train(input_tensor, target_tensor, encoder, decoder,
                         encoder_optimizer, decoder_optimizer, criterion)
            if i % print_every_n_batches == 0:
                print(f'batch {i} loss: {loss}')
            plot_losses.append(loss)

        
        print(f'Epoch {epoch + 1}/{epochs}, loss on final batch: {loss}')

In [86]:
hidden_size = 512
seq_length = 30
batch_size = 64
learning_rate = 0.01
dataloader = DataLoader(french_english_dataset, batch_size=batch_size,
                        shuffle=True, num_workers=4) 

In [87]:
encoder = EncoderBiLSTM(hidden_size, fr_vectors).to(device)
decoder = AttnDecoderLSTM(hidden_size, en_vectors, seq_length, 2*hidden_size).to(device)

In [88]:
encoder.train()
decoder.train()
trainIters(encoder, decoder, dataloader, epochs=10, learning_rate = learning_rate)

batch 0 loss: 345.72991943359375
batch 100 loss: 113.1642837524414
batch 200 loss: 96.09139251708984
batch 300 loss: 80.20494842529297
batch 400 loss: 69.56568908691406
batch 500 loss: 63.800537109375
batch 600 loss: 64.2630615234375
batch 700 loss: 59.129756927490234
batch 800 loss: 60.50992965698242
batch 900 loss: 57.76666259765625
batch 1000 loss: 55.55705261230469
batch 1100 loss: 53.52363204956055
batch 1200 loss: 53.85311508178711
batch 1300 loss: 54.14085388183594
batch 1400 loss: 52.89352035522461
batch 1500 loss: 51.74748229980469
batch 1600 loss: 51.08420181274414
batch 1700 loss: 52.10051727294922
batch 1800 loss: 50.92216110229492
batch 1900 loss: 50.235633850097656
batch 2000 loss: 51.87394332885742
batch 2100 loss: 46.63319396972656
Epoch 1/10, loss on final batch: 53.807376861572266
batch 0 loss: 51.66504669189453
batch 100 loss: 49.43523406982422
batch 200 loss: 49.12508010864258
batch 300 loss: 50.18984603881836
batch 400 loss: 47.31230545043945
batch 500 loss: 49.384

KeyboardInterrupt: 
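
The loss values printed above are sums of `nn.NLLLoss` over all `seq_length` decoding steps (each step averaged over the batch, which is the default reduction), so dividing by `seq_length` gives a per-step value that is easier to interpret. A small sketch; the helper names are made up for illustration, and padded positions are counted like any other step, as they are in `train`:

```python
import math

def per_token_loss(summed_loss, seq_length):
    """Average negative log-likelihood per decoding step."""
    return summed_loss / seq_length

def perplexity(summed_loss, seq_length):
    """exp of the per-step loss; 1.0 would mean every word is predicted with certainty."""
    return math.exp(per_token_loss(summed_loss, seq_length))

seq_length = 30
# first batch and last batch of epoch 1 from the log above
for batch_loss in (345.72991943359375, 53.807376861572266):
    print(per_token_loss(batch_loss, seq_length), perplexity(batch_loss, seq_length))
```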

 Try a simple decoder? Maybe the attention model isn't working like you thought.

 Is your evaluation function working properly?

## Part 4: Using the Model for Evaluation

In [51]:
en_idx2word = {idx: word for word, idx in en_word2idx.items()}
fr_idx2word = {idx: word for word, idx in fr_word2idx.items()}
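
A small helper can turn a sequence of indices back into words using reverse dictionaries like these. The sketch below is one possible version: `indices_to_words` and the toy vocabulary are hypothetical, and stopping at `</s>` as well as `<pad>` is a choice made here, not something the notebook does.

```python
def indices_to_words(indices, idx2word, stop_tokens=('</s>', '<pad>')):
    """Map indices to words, stopping at the first stop token."""
    words = []
    for index in indices:
        word = idx2word[index]
        if word in stop_tokens:
            break
        words.append(word)
    return words

# Hypothetical toy vocabulary, for illustration only
toy_word2idx = {'<pad>': 0, '<s>': 1, '</s>': 2, 'california': 3, 'is': 4, 'cold': 5}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

sentence = ' '.join(indices_to_words([3, 4, 5, 2, 0, 0], toy_idx2word))
print(sentence)  # california is cold
```

For tensors, each index would first be converted with `.item()`, as in the loops in this part.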

In [52]:
for batch in dataloader:
    input_tensor = batch['french_tensor'].to(device)
    break
    

def evaluate(input_tensor):
    pass

In [53]:
input_tensor = batch['french_tensor'][10].unsqueeze_(0)
print(batch['english_sentence'][10])
for index in input_tensor[0]:
    word = fr_idx2word[index.item()]
    if word != '<pad>':
        print(word)
    else:
        break

california is wet during autumn , and it is sometimes cold in october .
californie
est
humide
à
<unk>
,
et
il
est
parfois
froid
en
octobre
.
</s>


In [54]:
with torch.no_grad():
    encoder_hidden = encoder.initHidden(1)
    encoder.eval()
    decoder.eval()

    encoder_output, encoder_hidden = encoder.forward(input_tensor.to(device), encoder_hidden)

    decoder_input = torch.tensor([en_word2idx['<s>']]*input_tensor.shape[0], dtype=torch.long, device=device).unsqueeze(0) # the decoder works on English indices
    decoder_hidden = (encoder_hidden[0][::2].contiguous(), encoder_hidden[1][::2].contiguous())
    

    output_list = []
    attn_weight_list = []
    for di in range(seq_length):
        output, decoder_hidden, attn_weights = decoder.forward(decoder_input,
                                                               decoder_hidden,
                                                               encoder_output)
        decoder_input = output.topk(1)[1].detach()
        output_list.append(output.topk(1)[1])
        attn_weight_list.append(attn_weights)

In [55]:
for index in output_list:
    word = en_idx2word[index.item()]
    if word != '<pad>':
        print(word)
    else:
        break

california
is
hot
during
fall
,
and
it
is
sometimes
cold
in
january
.
</s>


The model currently gets places and grammar right but confuses some words (e.g. mango and lemon).

Things to try:

+ Start with random word embeddings and train them using only the words in the vocab
+ Try a different dataset
+ Try a simple decoder

Use word embeddings in the loss function: instead of predicting a probability for each word in the language, the decoder can output a regression for a word embedding. Given the input sequence and the previous embedding, what is the next embedding? The loss is then 1 - cosine_similarity between the predicted and target embeddings, which we try to minimize.
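
A minimal sketch of such a loss, assuming the decoder output and the target are both `(batch_size, embedding_dim)` tensors; the function name `cosine_embedding_distance` and the toy vectors are made up for this example:

```python
import torch
import torch.nn.functional as F

def cosine_embedding_distance(predicted, target):
    """Mean of (1 - cosine similarity) over the batch.

    predicted, target: (batch_size, embedding_dim)
    """
    return (1 - F.cosine_similarity(predicted, target, dim=1)).mean()

target = torch.tensor([[1.0, 0.0], [0.0, 2.0]])
same_direction = torch.tensor([[3.0, 0.0], [0.0, 0.5]])  # same direction, different length
opposite = -target                                       # pointing the other way

print(cosine_embedding_distance(same_direction, target).item())
print(cosine_embedding_distance(opposite, target).item())
```

The loss ranges from 0 (same direction) to 2 (opposite direction) and ignores vector length, so only the direction of the predicted embedding is trained.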

To do this you would have to change the output size of the decoder's final layer to equal the embedding dimension.
You will also have to adjust the model to input this embedding to the next value in the sequence. Or you could 

## Part 5: Visualizing the Attention 

In [None]:
longest_sentence ='''If someone who doesn't know your background says that you sound like a native speaker, 
it means they probably noticed something about your speaking that made them realize you weren't a native speaker. 
In other words, you don't really sound like a native speaker.'''

longest_sentence.split()

In [None]:
longest