# Neural Machine Translation with Attention Using PyTorch
In this notebook we are going to perform machine translation using a deep learning based approach and attention mechanism. All the code is based on PyTorch and it was adopted from the tutorial provided on the official documentation of [TensorFlow](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb).

Specifically, we are going to train a sequence to sequence model for Spanish to English translation. If you are not familiar with sequence to sequence models, I have provided some references at the end of this tutorial to familiarize yourself with the concept. Even if you are not familiar with seq2seq models, you can still proceed with the coding exercise. I will explain tiny details that are important as we proceed. 

The tutorial is very brief and I encourage you to also take a look at the official TensorFlow [notebook](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb) for more detailed explanations. The purpose of this tutorial is to understand how to convert certain code blocks into a deep learning framework like PyTorch. You will soon realize that the frameworks are very similar to some extent. The data preparation part is slightly different so I would emphasize that you spend more time analyzing this part of the code. 

If you have questions you can also reach out to me at ellfae@gmail.com or Twitter ([@omarsar0](https://twitter.com/omarsar0)).

## Import libraries

In [0]:
!pip3 install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-linux_x86_64.whl 
!pip3 install torchvision

Collecting torch==0.4.1
[?25l  Downloading http://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-linux_x86_64.whl (483.0MB)
[K     |████████████████████████████████| 483.0MB 1.2MB/s 
[31mERROR: torchvision 0.4.2 has requirement torch==1.3.1, but you'll have torch 0.4.1 which is incompatible.[0m
[31mERROR: fastai 1.0.60 has requirement torch>=1.0.0, but you'll have torch 0.4.1 which is incompatible.[0m
[?25hInstalling collected packages: torch
  Found existing installation: torch 1.3.1
    Uninstalling torch-1.3.1:
      Successfully uninstalled torch-1.3.1
Successfully installed torch-0.4.1
Collecting torch==1.3.1
  Using cached https://files.pythonhosted.org/packages/88/95/90e8c4c31cfc67248bf944ba42029295b77159982f532c5689bcfe4e9108/torch-1.3.1-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: torch
  Found existing installation: torch 0.4.1
    Uninstalling torch-0.4.1:
      Successfully uninstalled torch-0.4.1
Successfully installed torch-1.3.1


In [0]:
import torch
import torch.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import unicodedata
import re
import time
import tensorflow as tf
import os
import math

print(torch.__version__)

1.3.1


## Import Data from Google Drive
I stored the data on my Google Drive, but you can also obtain it from [here](http://www.manythings.org/anki/) as well. 

In [0]:
from google.colab import drive
drive.mount('/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive


In [0]:
#f = open('/gdrive/My Drive/DAIR RESOURCES/PyTorch/Neural Machine Translation with PyTorch/spa.txt', encoding='UTF-8').read().strip().split('\n')  
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://download.tensorflow.org/data/spa-eng.zip', 
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
path_to_file = 'data1.txt'
f =  open(path_to_file, encoding='UTF-8').read().strip().split('\n') 

In [0]:
lines = f

In [0]:
# sample size (try with smaller sample size to reduce computation)
num_examples = 2000

# creates lists containing each pair
original_word_pairs = [[w for w in l.split('\t')] for l in lines[:num_examples]]

In [0]:
data = pd.DataFrame(original_word_pairs, columns=["eng", "es"])

In [0]:
data.head(90)

Unnamed: 0,eng,es
0,সিরিয়া ব্লগারদের ক্যামেরায় দেখা,سوريا عبر كاميرات المدونين
1,হোভিক এবং আব্দ উত্তর সিরিয়ার আলেপ্পো অন্চলের দ...,هوفيك وعبد هما صديقان من مدينة حلب في شمال سو...
2,তারা সিরিয়ার ব্লগোস্ফিয়ারের দুজন অগ্রনী ফটোব্ল...,"مدونتا هوفيك وعبد, المسميتان, سوريا تنظر وسور..."
3,হোভিক এবং আব্দের ফটোব্লগ দুটি সিরিয়া লুকস এবং ...,هذه رحلة صغيرة بين بعض مجموعاتهم الرائعة.
4,মৃত শহরগুলো,المدن الميتة
...,...,...
85,ইতিমধ্যে আমরা লিপিবদ্ধ করেছি প্রচুর ব্লগার এবং...,قمنا بتوثيق اعتقال واحتجاز العشرات من المدوني...
86,আমাদের কাভারেজের মধ্যে রয়েছে ২৫টি দেশের এরুপ ঘ...,تغطيتنا حتى الآن تضمنت تقارير عن 25 دولة، فضل...
87,বিশ্বজুড়ে অনলাইন কথোপকথনের বিরুদ্ধে হুমকি ও প্...,بالتزامن مع هذه المهمة من توثيق التهديدات على...
88,এছাড়াও আমরা বিশ্বের বিভিন্ন স্থানে এইসব হুমকি ...,الهدف من هذه الشبكة هو زيادة الوعي إلى قضايا ...


In [0]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    """
    Normalizes latin chars with accent to their canonical decomposition
    """
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    #w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    
    w = w.rstrip().strip()
    
    # adding a start and an end token to the sentence
    # so that the model know when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

## Data Exploration
Let's explore the dataset a bit.

In [0]:
# Now we do the preprocessing using pandas and lambdas
data["eng"] = data.eng.apply(lambda w: preprocess_sentence(w))
data["es"] = data.es.apply(lambda w: preprocess_sentence(w))
data.sample(100)


Unnamed: 0,eng,es
44,"<start> আমার মনে আছে , দযা টাইমস অফ ইনডিযা (বে...",<start> اتذكر في تلك الايام ان مجلة times of i...
897,<start> আর সব দল ২০০৫ থেকে তাদের কাজের কি পরমা...,<start> ماذا لدى كل الاحزاب المعنية لتظهر من ك...
560,<start> সৌদি আরব যৌন আবেদনমযী নারীপোষাক বিক...,<start> السعودية ملابس مثيرة للبيع <end>
45,<start> আমার মত আরআইদের (দেশে বসবাসরত ভারতীয) ...,<start> الشيء الذي يعطي الهنود المقيمين مثلي ف...
965,<start> এখন যা পাওযা যাচছে তা হল একদল ধনী লোক...,<start> العطار ورين جمال التناقض بين اسلوبي ال...
...,...,...
1643,<start> পরিশেষে ৩আবিরা তার পাঠকদের অনরোধ করেছ...,<start> سبحان الله على ما ال عليه حالنا ،، لو ...
1742,<start> কেন তারা পরো মখ ঢাকার বিধান করে বযাপা...,<start> لماذا لايطبقوا تغطية الوجه باكمله لننت...
931,<start> বলগের লেখা থেকে বই পরকাশের (মিশরে) এটি...,<start> 1- هذه ليست المرة الاولى التي يتم فيها...
1502,<start> আমি এমন একটা দেশে আছি যেখানের সব থেকে ...,<start> انا اعيش في بلد حيث تغتصب النساء في اك...


## Building Vocabulary Index
The class below is useful for creating the vocabular and index mappings which will be used to convert out inputs into indexed sequences. 

In [0]:
# This class creates a word -> index mapping (e.g,. "dad" -> 5) and vice-versa 
# (e.g., 5 -> "dad") for each language,
class LanguageIndex():
    def __init__(self, lang):
        """ lang are the list of phrases from each language"""
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        
        self.create_index()
        
    def create_index(self):
        for phrase in self.lang:
            # update with individual tokens
            self.vocab.update(phrase.split(' '))
            
        # sort the vocab
        self.vocab = sorted(self.vocab)

        # add a padding token with index 0
        self.word2idx['<pad>'] = 0
        
        # word to index mapping
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1 # +1 because of pad token
        
        # index to word mapping
        for word, index in self.word2idx.items():
            self.idx2word[index] = word

'''
def load_doc(filename):
  # open the file as read only
  file = open(filename, mode="r", encoding="utf-8")
  # read all text
  text = file.read()
  # close the file
  file.close()
  return text


text = load_doc("phonem_onlycharecters.txt")
splittedLines = text.splitlines()
#print(len(splittedLines))

count = 0
tar_word2idx = {}
tar_idx2word = {}
tar_word2idx['<pad>'] = 0
for i in range(0, len(splittedLines)):
  line = splittedLines[i]
  word = line.split("\t")
  if tar_word2idx.get(word[0]) == None:
      tar_word2idx[word[0]] = int(word[1])
  count = count + 1

tar_word2idx['<start>'] = count + 1
tar_word2idx['<end>'] = count + 2

for word, index in tar_word2idx.items():
  tar_idx2word[index] = word


text2 = load_doc("englishLetter.txt")
english_splittedLines = text2.splitlines()

count = 0
inp_word2idx = {}
inp_idx2word = {}
inp_word2idx['<pad>'] = 0
for i in range(0, len(english_splittedLines)):
  line = english_splittedLines[i]
  word = line.split("\t")

  if inp_word2idx.get(word[0]) == None:
      inp_word2idx[word[0]] = word[1]
  count = count + 1

inp_word2idx['<start>'] = count + 1
inp_word2idx['<end>'] = count + 2

for word, index in inp_word2idx.items():
  inp_idx2word[index] = word


def bangla_word_to_sequence_transformation(w):
  word = w
  for key in tar_word2idx:
      temp_word = word.replace(key, str(tar_word2idx[key])+",")
      word = temp_word
  # After sequence generation in case any bangla character remains
  # replace this character with
  word = re.sub('[^0-9,]', '0,', word)
  return word


def english_word_to_sequence_transformation(w):
  word = w
  for key in inp_word2idx:
      temp_word = word.replace(key, str(inp_word2idx[key])+",")
      #print(key, "\t",english_encountered_words[key],"\t",word)
      word = temp_word
  #a = 'l,kdfhi123,23,আম,soe78347834 (())&/&745  '
  #result = re.sub('[^0-9,]','0,', a)
  word = re.sub('[^0-9,]', '0,', word)
  return word


def preprocess_sentence(w):
  # w = unicode_to_ascii(w.lower().strip())
  #w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
#    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = english_word_to_sequence_transformation(w)

  w = w.rstrip().strip()

  # adding a start and an end token to the sentence
  # so that the model know when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w 


def preprocess_pair_word_sentence(w):
  # w = unicode_to_ascii(w.lower().strip())
  #w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  word_ls = w.split("\t")
  bangla_w = '<start> ' + \
      bangla_word_to_sequence_transformation(word_ls[0]) + ' <end>'
  banglish_w = '<start> ' + \
      english_word_to_sequence_transformation(word_ls[1]) + ' <end>'

  bangla_w = bangla_w.rstrip().strip()
  banglish_w = banglish_w.rstrip().strip()

  # adding a start and an end token to the sentence
  # so that the model know when to start and stop predicting.
  #w = '<start> ' + w + ' <end>'
  arr = []
  arr.append(bangla_w)
  arr.append(banglish_w)
  return arr


def create_dataset(path, num_examples):
  lines = open(path, encoding='UTF-8').read().strip().split('\n')
  word_pairs = [preprocess_pair_word_sentence(l) for l in lines[:num_examples]]

  # word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
  return word_pairs


########################### Utilities
def max_length(tensor):
    return max(len(t) for t in tensor)


def get_bangla_sequence(w):
  bangla_wlist = w.split(' ')
  #print(" bangla >>>", bangla_wlist[0],"   ",bangla_wlist[1],"   ",bangla_wlist[2])
  arr = []
  #arr.append(word2idx[bangla_wlist[0]])
  arr.append(tar_word2idx['<start>'])
  #seq = bangla_word_to_sequence_transformation(bangla_wlist[1])
  seq = bangla_wlist[1]
  seq_elems = seq.split(",")
  arr2 = []
  for i in range(0, len(seq_elems)):
    if seq_elems[i] == "":
      continue
    arr2.append(int(seq_elems[i]))

  #arr.extend(seq.split(","))
  arr.extend(arr2)
  #arr.append(word2idx[bangla_wlist[2]])
  arr.append(tar_word2idx['<end>'])
  #print(arr)
  return arr


def get_banglish_sequence(w):
  banglish_wlist = w.split(' ')
  #print(" banglish >>>", banglish_wlist[0],"   ",banglish_wlist[1],"   ",banglish_wlist[2])
  arr = []
  #arr.append(word2idx[banglish_wlist[0]])
  arr.append(inp_word2idx['<start>'])
  #seq = english_word_to_sequence_transformation(w).split(',')
  #seq = english_word_to_sequence_transformation(banglish_wlist[1])
  seq = banglish_wlist[1]
  seq_elems = seq.split(",")
  arr2 = []
  for i in range(0, len(seq_elems)):
    if seq_elems[i] == "":
      continue
    arr2.append(int(seq_elems[i]))

  #arr.extend(seq.split(","))
  arr.extend(arr2)
  #arr.append(word2idx[banglish_wlist[2]])
  arr.append(inp_word2idx['<end>'])
  return arr


def load_dataset(path, num_examples):
    # creating cleaned input, output pairs
    pairs = create_dataset(path, num_examples)

    # Spanish sentences
    #print("  *******  ",word2idx['1659,1671,1667'])
    #input_tensor = [[word2idx[s] for s in sp.split(' ')] for en, sp in pairs]
    #input_tensor = [[get_banglish_sequence(s) for s in sp.split(' ')] for en, sp in pairs]
    input_tensor = [get_banglish_sequence(sp) for en, sp in pairs]

    # English sentences
    #target_tensor = [[word2idx[s] for s in en.split(' ')] for en, sp in pairs]
    #target_tensor = [get_bangla_sequence(s) for s in en.split(' ') for en, sp in pairs]
    target_tensor = [get_bangla_sequence(en) for en, sp in pairs]

    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(
        input_tensor), max_length(target_tensor)

    # Padding the input and output tensor to the maximum length
   # input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor,
                                                             #    maxlen=max_length_inp,
                                                               #  padding='post')

  #  target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor,
                                                                 # maxlen=max_length_tar,
                                                                #  padding='post')

    return input_tensor, target_tensor

'''

'\ndef load_doc(filename):\n  # open the file as read only\n  file = open(filename, mode="r", encoding="utf-8")\n  # read all text\n  text = file.read()\n  # close the file\n  file.close()\n  return text\n\n\ntext = load_doc("phonem_onlycharecters.txt")\nsplittedLines = text.splitlines()\n#print(len(splittedLines))\n\ncount = 0\ntar_word2idx = {}\ntar_idx2word = {}\ntar_word2idx[\'<pad>\'] = 0\nfor i in range(0, len(splittedLines)):\n  line = splittedLines[i]\n  word = line.split("\t")\n  if tar_word2idx.get(word[0]) == None:\n      tar_word2idx[word[0]] = int(word[1])\n  count = count + 1\n\ntar_word2idx[\'<start>\'] = count + 1\ntar_word2idx[\'<end>\'] = count + 2\n\nfor word, index in tar_word2idx.items():\n  tar_idx2word[index] = word\n\n\ntext2 = load_doc("englishLetter.txt")\nenglish_splittedLines = text2.splitlines()\n\ncount = 0\ninp_word2idx = {}\ninp_idx2word = {}\ninp_word2idx[\'<pad>\'] = 0\nfor i in range(0, len(english_splittedLines)):\n  line = english_splittedLines[i]\n

In [0]:
# index language using the class above
inp_lang = LanguageIndex(data["es"].values.tolist())
targ_lang = LanguageIndex(data["eng"].values.tolist())
# Vectorize the input and target languages
input_tensor = [[inp_lang.word2idx[s] for s in es.split(' ')]  for es in data["es"].values.tolist()]
target_tensor = [[targ_lang.word2idx[s] for s in eng.split(' ')]  for eng in data["eng"].values.tolist()]
#input_tensor , target_tensor = load_dataset('mergedDataAll.txt',15000)
input_tensor[:10]

[[246, 6792, 7259, 7925, 3481, 245],
 [246,
  10080,
  10870,
  10065,
  7044,
  9576,
  9194,
  6172,
  7672,
  6972,
  6792,
  101,
  11131,
  4323,
  707,
  874,
  3632,
  10448,
  7672,
  6792,
  106,
  245],
 [246,
  9180,
  10080,
  10870,
  101,
  3581,
  101,
  6792,
  5776,
  10806,
  5626,
  101,
  10065,
  7021,
  7288,
  1189,
  7393,
  6792,
  106,
  245],
 [246, 10052, 6474, 7065, 5135, 4889, 9054, 2382, 106, 245],
 [246, 3466, 3839, 245],
 [246,
  8330,
  1321,
  9055,
  9199,
  9576,
  2710,
  8797,
  3762,
  1318,
  2967,
  9340,
  101,
  8452,
  6453,
  2095,
  106,
  245],
 [246, 9021, 635, 6706, 1837, 106, 245],
 [246, 4246, 635, 6977, 106, 245],
 [246, 3127, 2706, 245],
 [246,
  8337,
  4323,
  9055,
  7943,
  9576,
  2710,
  8296,
  10052,
  3127,
  2382,
  7672,
  6792,
  106,
  245]]

In [0]:
target_tensor[:10]

[[205, 8530, 5773, 1955, 4085, 204],
 [205, 8874, 1664, 891, 1357, 8533, 1039, 444, 3818, 5617, 204],
 [205, 3698, 8533, 5788, 3838, 386, 5408, 204],
 [205,
  8874,
  1664,
  895,
  5407,
  3840,
  8530,
  7439,
  1664,
  8530,
  1320,
  8648,
  8533,
  496,
  7801,
  204],
 [205, 6614, 7689, 204],
 [205,
  1497,
  3839,
  2190,
  2564,
  2074,
  6614,
  2797,
  8067,
  2995,
  786,
  6614,
  7688,
  4363,
  1499,
  1755,
  7993,
  204],
 [205, 8862, 8547, 7622, 5885, 4193, 204],
 [205, 421, 7731, 4193, 204],
 [205, 2062, 5016, 204],
 [205, 3677, 2188, 969, 7260, 8532, 584, 2062, 5017, 2797, 7825, 204]]

In [0]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [0]:
# calculate the max_length of input and output tensor
max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)

In [0]:
def pad_sequences(x, max_len):
    padded = np.zeros((max_len), dtype=np.int64)
    if len(x) > max_len: padded[:] = x[:max_len]
    else: padded[:len(x)] = x
    return padded

In [0]:
# inplace padding
input_tensor = [pad_sequences(x, max_length_inp) for x in input_tensor]
target_tensor = [pad_sequences(x, max_length_tar) for x in target_tensor]
len(target_tensor)

2000

In [0]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.1)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(1800, 1800, 200, 200)

## Load data into DataLoader for Batching
This is just preparing the dataset so that it can be efficiently fed into the model through batches.

In [0]:
from torch.utils.data import Dataset, DataLoader

In [0]:
# conver the data to tensors and pass to the Dataloader 
# to create an batch iterator

class MyData(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y
        # TODO: convert this into torch code is possible
        self.length = [ np.sum(1 - np.equal(x, 0)) for x in X]
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x_len = self.length[index]
        return x,y,x_len
    
    def __len__(self):
        return len(self.data)

## Parameters
Let's define the hyperparameters and other things we need for training our NMT model.

In [0]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

#vocab_inp_size = len(inp_word2idx)
#vocab_tar_size = len(tar_word2idx)

train_dataset = MyData(input_tensor_train, target_tensor_train)
val_dataset = MyData(input_tensor_val, target_tensor_val)

dataset = DataLoader(train_dataset, batch_size = BATCH_SIZE, 
                     drop_last=True,
                     shuffle=True)

In [0]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.gru = nn.GRU(self.embedding_dim, self.enc_units)
        
    def forward(self, x, lens, device):
        # x: batch_size, max_length 
        
        # x: batch_size, max_length, embedding_dim
        x = self.embedding(x) 
                
        # x transformed = max_len X batch_size X embedding_dim
        # x = x.permute(1,0,2)
        x = pack_padded_sequence(x, lens) # unpad
    
        self.hidden = self.initialize_hidden_state(device)
        
        # output: max_length, batch_size, enc_units
        # self.hidden: 1, batch_size, enc_units
        output, self.hidden = self.gru(x, self.hidden) # gru returns hidden state of all timesteps as well as hidden state at last timestep
        
        # pad the sequence to the max length in the batch
        output, _ = pad_packed_sequence(output)
        
        return output, self.hidden

    def initialize_hidden_state(self, device):
        return torch.zeros((1, self.batch_sz, self.enc_units)).to(device)

In [0]:
### sort batch function to be able to use with pad_packed_sequence
def sort_batch(X, y, lengths):
    lengths, indx = lengths.sort(dim=0, descending=True)
    X = X[indx]
    y = y[indx]
    return X.transpose(0,1), y, lengths # transpose (batch x seq) to (seq x batch)

## Testing the Encoder
Before proceeding with training, we should always try to test out model behavior such as the size of outputs just to make that things are going as expected. In PyTorch this can be done easily since everything comes in eager execution by default.

In [0]:
### Testing Encoder part
# TODO: put whether GPU is available or not
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

encoder.to(device)
# obtain one sample from the data iterator
it = iter(dataset)
x, y, x_len = next(it)

# sort the batch first to be able to use with pac_pack_sequence
xs, ys, lens = sort_batch(x, y, x_len)

enc_output, enc_hidden = encoder(xs.to(device), lens, device)

print(enc_output.size()) # max_length, batch_size, enc_units

torch.Size([56, 64, 1024])


### Decoder

Here, we'll implement an encoder-decoder model with attention which you can read about in the TensorFlow [Neural Machine Translation (seq2seq) tutorial](https://github.com/tensorflow/nmt). This example uses a more recent set of APIs. This notebook implements the [attention equations](https://github.com/tensorflow/nmt#background-on-the-attention-mechanism) from the seq2seq tutorial. The following diagram shows that each input word is assigned a weight by the attention mechanism which is then used by the decoder to predict the next word in the sentence.

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500" alt="attention mechanism">

The input is put through an encoder model which gives us the encoder output of shape *(batch_size, max_length, hidden_size)* and the encoder hidden state of shape *(batch_size, hidden_size)*. 

Here are the equations that are implemented:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" alt="attention equation 0" width="800">
<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" alt="attention equation 1" width="800">

We're using *Bahdanau attention*. Lets decide on notation before writing the simplified form:

* FC = Fully connected (dense) layer
* EO = Encoder output
* H = hidden state
* X = input to the decoder

And the pseudo-code:

* `score = FC(tanh(FC(EO) + FC(H)))`
* `attention weights = softmax(score, axis = 1)`. Softmax by default is applied on the last axis but here we want to apply it on the *1st axis*, since the shape of score is *(batch_size, max_length, 1)*. `Max_length` is the length of our input. Since we are trying to assign a weight to each input, softmax should be applied on that axis.
* `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
* `embedding output` = The input to the decoder X is passed through an embedding layer.
* `merged vector = concat(embedding output, context vector)`
* This merged vector is then given to the GRU
  
The shapes of all the vectors at each step have been specified in the comments in the code:

In [0]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, enc_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.enc_units = enc_units
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.gru = nn.GRU(self.embedding_dim + self.enc_units, 
                          self.dec_units,
                          batch_first=True)
        self.fc = nn.Linear(self.enc_units, self.vocab_size)
        
        # used for attention
        self.W1 = nn.Linear(self.enc_units, self.dec_units)
        self.W2 = nn.Linear(self.enc_units, self.dec_units)
        self.V = nn.Linear(self.enc_units, 1)
    
    def forward(self, x, hidden, enc_output):
        # enc_output original: (max_length, batch_size, enc_units)
        # enc_output converted == (batch_size, max_length, hidden_size)
        enc_output = enc_output.permute(1,0,2)
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        hidden_with_time_axis = hidden.permute(1, 0, 2)
        
        # score: (batch_size, max_length, hidden_size) # Bahdanaus's
        # we get 1 at the last axis because we are applying tanh(FC(EO) + FC(H)) to self.V
        # It doesn't matter which FC we pick for each of the inputs
        score = torch.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis))
        
        #score = torch.tanh(self.W2(hidden_with_time_axis) + self.W1(enc_output))
          
        # attention_weights shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = torch.softmax(self.V(score), dim=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = torch.sum(context_vector, dim=1)
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        # takes case of the right portion of the model above (illustrated in red)
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        #x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # ? Looks like attention vector in diagram of source
        x = torch.cat((context_vector.unsqueeze(1), x), -1)
        
        # passing the concatenated vector to the GRU
        # output: (batch_size, 1, hidden_size)
        output, state = self.gru(x)
        
        
        # output shape == (batch_size * 1, hidden_size)
        output =  output.view(-1, output.size(2))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state, attention_weights
    
    def initialize_hidden_state(self):
        return torch.zeros((1, self.batch_sz, self.dec_units))
      
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        self.hidden_size = hidden_size
        self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.rand(hidden_size))
        stdv = 1. / math.sqrt(self.v.size(0))
        self.v.data.normal_(mean=0, std=stdv)

    def forward(self, hidden, encoder_outputs, src_len=None):
        '''
        :param hidden: 
            previous hidden state of the decoder, in shape (layers*directions,B,H)
        :param encoder_outputs:
            encoder outputs from Encoder, in shape (T,B,H)
        :param src_len:
            used for masking. NoneType or tensor in shape (B) indicating sequence length
        :return
            attention energies in shape (B,T)
        '''
        print(encoder_outputs.data.shape)
        max_len = encoder_outputs.size(0)
        this_batch_size = encoder_outputs.size(1)
        H = hidden.repeat(max_len,1,1).transpose(0,1)
        encoder_outputs = encoder_outputs.transpose(0,1) # [B*T*H]
        print(encoder_outputs.data.shape)
        attn_energies = self.score(H,encoder_outputs) # compute attention score
        
        if src_len is not None:
            mask = []
            for b in range(src_len.size(0)):
                mask.append([0] * src_len[b].item() + [1] * (encoder_outputs.size(1) - src_len[b].item()))
            mask = cuda_(torch.ByteTensor(mask).unsqueeze(1)) # [B,1,T]
            attn_energies = attn_energies.masked_fill(mask, -1e18)
        
        return F.softmax(attn_energies).unsqueeze(1) # normalize with softmax

    def score(self, hidden, encoder_outputs):
        print(hidden.data.shape)
        print(encoder_outputs.data.shape)
        energy = nn.Tanh(self.attn(torch.cat((hidden, encoder_outputs), 2))) # [B*T*2H]->[B*T*H]
       
        energy = energy.transpose(2,1) # [B*H*T]
        #energy = energy.T
        v = self.v.repeat(encoder_outputs.data.shape[0],1).unsqueeze(1) #[B*1*H]
        energy = torch.bmm(v,energy) # [B*1*T]
        return energy.squeeze(1) #[B*T]

class BahdanauAttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size,  output_size,batch_size,embed_size=256, n_layers=1, dropout_p=0.1):
        super(BahdanauAttnDecoderRNN, self).__init__()
        # Define parameters
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p
        self.batch_sz = batch_size
        # Define layers
        self.embedding = nn.Embedding(output_size, self.embed_size)
        self.dropout = nn.Dropout(dropout_p)
        self.attn = Attn('concat', self.hidden_size)
        self.gru = nn.GRU(hidden_size + embed_size, hidden_size, n_layers, dropout=dropout_p,batch_first=True)
        self.attn_combine = nn.Linear(hidden_size + self.embed_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, word_input, last_hidden, encoder_outputs):
        '''
        :param word_input:
            word input for current time step, in shape (B)
        :param last_hidden:
            last hidden stat of the decoder, in shape (layers*direction*B*H)
        :param encoder_outputs:
            encoder outputs in shape (T*B*H)
        :return
            decoder output
        Note: we run this one step at a time i.e. you should use a outer loop 
            to process the whole sequence
        Tip(update):
        EncoderRNN may be bidirectional or have multiple layers, so the shape of hidden states can be 
        different from that of DecoderRNN
        You may have to manually guarantee that they have the same dimension outside this function,
        e.g, select the encoder hidden state of the foward/backward pass.
        '''
        # Get the embedding of the current input word (last output word)
        word_embedded = self.embedding(word_input.long()).view(1, word_input.size(0), -1) # (1,B,V)
       # self.embedding(word_input.long()).view(batch_size,1,-1)
        word_embedded = self.dropout(word_embedded)
        # Calculate attention weights and apply to encoder outputs
        attn_weights = self.attn(last_hidden[-1], encoder_outputs)
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))  # (B,1,V)
        context = context.transpose(0, 1)  # (1,B,V)
        # Combine embedded input word and attended context, run through RNN
        rnn_input = torch.cat((word_embedded, context), 2)
        rnn_input = self.attn_combine(rnn_input) # use it in case your size of rnn_input is different
        output, hidden = self.gru(rnn_input, last_hidden)
        output = output.squeeze(0)  # (1,B,V)->(B,V)
        # context = context.squeeze(0)
        # update: "context" input before final layer can be problematic.
        # output = F.log_softmax(self.out(torch.cat((output, context), 1)))
        output = F.log_softmax(self.out(output))
        # Return final output, hidden state
        return output, hidden,attn_weights
    
    def initialize_hidden_state(self):
        return torch.zeros((1, self.batch_sz, self.hidden_size))
      


## Testing the Decoder
Similarily, try to test the decoder.

In [0]:
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

encoder.to(device)
# obtain one sample from the data iterator
it = iter(dataset)
x, y, x_len = next(it)

print("Input: ", x.shape)
print("Output: ", y.shape)

# sort the batch first to be able to use with pac_pack_sequence
xs, ys, lens = sort_batch(x, y, x_len)

enc_output, enc_hidden = encoder(xs.to(device), lens, device)
print("Encoder Output: ", enc_output.shape) # batch_size X max_length X enc_units
print("Encoder Hidden: ", enc_hidden.shape) # batch_size X enc_units (corresponds to the last state)

#decoder = BahdanauAttnDecoderRNN(units,vocab_tar_size,BATCH_SIZE, embedding_dim)
decoder = Decoder(vocab_tar_size, embedding_dim, units, units, BATCH_SIZE)
decoder = decoder.to(device)

#print(enc_hidden.squeeze(0).shape)

dec_hidden = enc_hidden#.squeeze(0)
dec_input = torch.tensor([[targ_lang.word2idx['<start>']]] * BATCH_SIZE)
print("Decoder Input: ", dec_input.shape)
print("--------")

for t in range(1, y.size(1)):
    # enc_hidden: 1, batch_size, enc_units
    # output: max_length, batch_size, enc_units
    predictions, dec_hidden, _ = decoder(dec_input.to(device), 
                                         dec_hidden.to(device), 
                                         enc_output.to(device))
    
    print("Prediction: ", predictions.shape)
    print("Decoder Hidden: ", dec_hidden.shape)
    
    #loss += loss_function(y[:, t].to(device), predictions.to(device))
    
    dec_input = y[:, t].unsqueeze(1)
    print(dec_input.shape)
    break

Input:  torch.Size([64, 91])
Output:  torch.Size([64, 137])
Encoder Output:  torch.Size([49, 64, 1024])
Encoder Hidden:  torch.Size([1, 64, 1024])
Decoder Input:  torch.Size([64, 1])
--------
Prediction:  torch.Size([64, 9101])
Decoder Hidden:  torch.Size([1, 64, 1024])
torch.Size([64, 1])


In [0]:
criterion = nn.CrossEntropyLoss()

def loss_function(real, pred):
    """ Only consider non-zero inputs in the loss; mask needed """
    #mask = 1 - np.equal(real, 0) # assign 0 to all above 0 and 1 to all 0s
    #print(mask)
    mask = real.ge(1).type(torch.cuda.FloatTensor)
    
    loss_ = criterion(pred, real) * mask 
    return torch.mean(loss_)

In [0]:
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## TODO: Combine the encoder and decoder into one class
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, units, BATCH_SIZE)

encoder.to(device)
decoder.to(device)

optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), 
                       lr=0.001)

## Training
Now we start the training. We are only using 10 epochs but you can expand this to keep trainining the model for a longer period of time. Note that in this case we are teacher forcing during training. Find a more detailed explanation in the official TensorFlow [implementation](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb) of this notebook provided by the TensorFlow team. 

- Pass the input through the encoder which return encoder output and the encoder hidden state.
- The encoder output, encoder hidden state and the decoder input (which is the start token) is passed to the decoder.
- The decoder returns the predictions and the decoder hidden state.
- The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
- Use teacher forcing to decide the next input to the decoder.
- Teacher forcing is the technique where the target word is passed as the next input to the decoder.
- The final step is to calculate the gradients and apply it to the optimizer and backpropagate.

In [0]:
EPOCHS = 100

for epoch in range(EPOCHS):
    start = time.time()
    
    encoder.train()
    decoder.train()
    
    total_loss = 0
    
    for (batch, (inp, targ, inp_len)) in enumerate(dataset):
        loss = 0
        
        xs, ys, lens = sort_batch(inp, targ, inp_len)
        enc_output, enc_hidden = encoder(xs.to(device), lens, device)
        dec_hidden = enc_hidden
        
        # use teacher forcing - feeding the target as the next input (via dec_input)
        dec_input = torch.tensor([[targ_lang.word2idx['<start>']]] * BATCH_SIZE)
        
        # run code below for every timestep in the ys batch
        for t in range(1, ys.size(1)):
            predictions, dec_hidden, _ = decoder(dec_input.to(device), 
                                         dec_hidden.to(device), 
                                         enc_output.to(device))
            loss += loss_function(ys[:, t].to(device), predictions.to(device))
            #loss += loss_
            dec_input = ys[:, t].unsqueeze(1)
            
        
        batch_loss = (loss / int(ys.size(1)))
        total_loss += batch_loss
        
        optimizer.zero_grad()
        
        loss.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.detach().item()))
        
        
    ### TODO: Save checkpoint for model
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
    
    
    
    

Epoch 1 Batch 0 Loss 1.0746
Epoch 1 Loss 0.7605
Time taken for 1 epoch 91.58493709564209 sec

Epoch 2 Batch 0 Loss 0.8345
Epoch 2 Loss 0.6479
Time taken for 1 epoch 90.99389243125916 sec

Epoch 3 Batch 0 Loss 0.6690
Epoch 3 Loss 0.6003
Time taken for 1 epoch 90.67868375778198 sec

Epoch 4 Batch 0 Loss 0.6252
Epoch 4 Loss 0.5339
Time taken for 1 epoch 91.10711860656738 sec

Epoch 5 Batch 0 Loss 0.3946
Epoch 5 Loss 0.4563
Time taken for 1 epoch 91.44172215461731 sec

Epoch 6 Batch 0 Loss 0.3509
Epoch 6 Loss 0.3809
Time taken for 1 epoch 90.79493379592896 sec

Epoch 7 Batch 0 Loss 0.3342
Epoch 7 Loss 0.3193
Time taken for 1 epoch 90.32221794128418 sec

Epoch 8 Batch 0 Loss 0.2336
Epoch 8 Loss 0.2663
Time taken for 1 epoch 91.02484583854675 sec

Epoch 9 Batch 0 Loss 0.2045
Epoch 9 Loss 0.2163
Time taken for 1 epoch 90.35513043403625 sec

Epoch 10 Batch 0 Loss 0.1675
Epoch 10 Loss 0.1672
Time taken for 1 epoch 90.25397634506226 sec

Epoch 11 Batch 0 Loss 0.1291
Epoch 11 Loss 0.1241
Time tak

In [0]:
torch.save(encoder.state_dict(), '/gdrive/My Drive/nmt/encoder.dict')
torch.save(decoder.state_dict(), '/gdrive/My Drive/nmt/decoder.dict')
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, units, BATCH_SIZE)
encoder.load_state_dict(torch.load('/gdrive/My Drive/nmt/encoder.dict'))
decoder.load_state_dict(torch.load('/gdrive/My Drive/nmt/decoder.dict'))

<All keys matched successfully>

## Final Words
Notice that we only trained the model and that's it. In fact, this notebook is in experimental phase, so there could also be some bugs or something I missed during the process of converting code or training. Please comment your concerns here or submit it as an issue in the [GitHub version](https://github.com/omarsar/pytorch_neural_machine_translation_attention) of this notebook. I will appreciate it!

We didn't evaluate the model or analyzed it. To encourage you to practice what you have learned in the notebook, I will suggest that you try to convert the TensorFlow code used in the [original notebook](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb) and complete this notebook. I believe the code should be straightforward, the hard part was already done in this notebook. If you manage to complete it, please submit a PR on the GitHub version of this notebook. I will gladly accept your PR. Thanks for reading and hope this notebook was useful. Keep tuned for notebooks like this on my Twitter ([omarsar0](https://twitter.com/omarsar0)). 

## References

### Seq2Seq:
  - Sutskever et al. (2014) - [Sequence to Sequence Learning with Neural Networks](Sequence to Sequence Learning with Neural Networks)
  - [Sequence to sequence model: Introduction and concepts](https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d)
  - [Blog on seq2seq](https://guillaumegenthial.github.io/sequence-to-sequence.html)
  - [Bahdanau et al. (2016) NMT jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf)
  - [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)