[View in Colaboratory](https://colab.research.google.com/github/meethariprasad/phd/blob/master/NMT_Project_V3_Attention_decoder.ipynb)

# Neural Machine Translation from Hindi to English


Assignment is to build a Neural Machine Translation (NMT) model to translate Hindi Sentences into machine English. 

We will do this using by creating attention model as in Neural Machine Translation by Jointly Learning to Align and Translate: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio https://arxiv.org/pdf/1409.0473.pdf

We will be using following small parallel corpus "http://www.manythings.org/anki/hin-eng.zip"

Notes: In Appendix section at end we have given additional helper functions which can be used to improve model as future improvement effort.

## Loading Libraries.

In [1]:
# https://pypi.python.org/pypi/pydot
#Setting up graphviz (2.38.0-16ubuntu2) ...
#Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
try:
  !apt-get -qq install -y graphviz && pip install -q pydot
  import pydot
  import graphviz
except ModuleNotFoundError:
  print ("ModuleNotFoundError")

ModuleNotFoundError


In [0]:
# Install the PyDrive wrapper & import libraries. This is to upload model & weights to your GDrive.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
import os
os.getcwd()
os.listdir()

['datalab',
 'clean_pairs.pkl',
 '.config',
 '.ipython',
 '.rnd',
 '.keras',
 'hin.txt',
 '.nv',
 'enghindi.txt',
 'dataset.pkl',
 'attention_decoder.py',
 '.forever',
 '__pycache__',
 '.cache',
 '_about.txt',
 '.local']

In [4]:
!pip install keras==2.0.5



In [5]:
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
import random
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np
import gc
import os.path
#import pydot 
#import graphviz
 

#from faker import Faker
#import random
#from tqdm import tqdm
#from babel.dates import format_date
#from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


## 1 - Translating Source Language to Target Language

The model we will build here could be used to translate from from English to Hindi or any other parallel corpus. 

### 1.1 - Dataset

In this section we are going to download the dataset and prepare the dataset with Padding & Integer Encoding & One hot encoding.

In [0]:
if (os.path.isfile("hin-eng.zip")==False & os.path.isfile("enghindi.zip")==False ):
  import requests, zipfile, io, os
  #https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/hin.zip
  r = requests.get("http://www.manythings.org/anki/hin-eng.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
  #https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/enghindi.zip
  r = requests.get("https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/enghindi.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
#import requests, zipfile, io,os
#r = requests.get("http://www.manythings.org/anki/hin-eng.zip")
#z = zipfile.ZipFile(io.BytesIO(r.content))
#z.extractall()
#Verifying if the file hin.txt are downloaded properly.
#if os.path.isfile("hin.txt"):
#    print('hin.txt exists')

In [7]:
print("If you want dataset to be reconstructed using new sentence length, enter 1")
a = input()
print(type(int(a)))
a=int(a)
if((a==1)& os.path.isfile("dataset.pkl")==True):
  print("yeah. I will delete dataset.pkl")
  print(os.remove("dataset.pkl"))
else:
  print("cool. Either it is already deleted or I Will proceed with already created dataset.pkl")

If you want dataset to be reconstructed using new sentence length, enter 1
1
<class 'int'>
yeah. I will delete dataset.pkl
None


In [0]:
if (os.path.isfile("dataset.pkl")==False):
  file=open("hin.txt",'r',encoding='utf-8')
  content=file.read()
  file.close()
#Reading the file
#file=open("hin.txt",'r',encoding='utf-8')
#content=file.read()
#file.close()

In [0]:
if (os.path.isfile("dataset.pkl")==False):
  file=open("enghindi.txt",'r',encoding='utf-16')
  content2=file.read()
  file.close()

In [10]:
def save_clean_data(sentences, filename):
  dump(sentences, open(filename, 'wb'))
  print('Saved: %s' % filename)
  
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array
  
if (os.path.isfile("dataset.pkl")==False):
  # load doc into memory
  def load_doc(filename,encode_format):
    # open the file as read only
    file = open(filename, mode='rt', encoding=encode_format)
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

  # split a loaded document into sentences
  def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

  # clean a list of lines
  def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_punc = re.compile('[।%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    for pair in lines:
      clean_pair = list()
      for line in pair:
        # tokenize on white space
        line = line.split()
        # remove punctuation from each token
        line = [re_punc.sub('', w) for w in line]
        # remove tokens with numbers in them
        #line = [word for word in line if word.isalpha()]
        #line=re.sub('[।]', '', line)
        # store as string
        line=(' '.join(line))
        clean_pair.append(line.strip())
      cleaned.append(clean_pair)
    return array(cleaned)

  # save a list of clean sentences to file
  def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

  # load dataset
  filename = 'hin.txt'
  doc = load_doc(filename,encode_format='utf-8')
  # split into english-german pairs
  pairs = to_pairs(doc)
  del doc

  filename2 = 'enghindi.txt'
  doc2 = load_doc(filename2,encode_format='utf-16')
  # split into english-german pairs
  pairs2 = to_pairs(doc2)
  del doc2

  print(type(pairs),pairs[0:2],pairs2[0:2])
  # clean sentences
  clean_pairs = clean_pairs(pairs)
  lines=pairs2
  cleaned = list()
  re_punc = re.compile('[।%s]' % re.escape(string.punctuation))
  re_print = re.compile('[^%s]' % re.escape(string.printable))
  for pair in lines:
    clean_pair = list()
    for line in pair:
      line=line.strip()
      # tokenize on white space
      line = line.split()
      # remove punctuation from each token
      line = [re_punc.sub('', w) for w in line]
      # remove tokens with numbers in them
      #line = [word for word in line if word.isalpha()]
      #line=re.sub('[।]', '', line)
      # store as string
      line=(' '.join(line))
      clean_pair.append(line.strip())
    cleaned.append(clean_pair)
  clean_pairs2=(array(cleaned))
  print(type(clean_pairs2),clean_pairs2.shape,type(clean_pairs),clean_pairs.shape,(np.concatenate((clean_pairs2, clean_pairs))).shape)
  # save clean pairs to file
  print ("Number of clean pairs",clean_pairs.shape[0])
  # spot check
  for i in range(10):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
  clean_pairs=(np.concatenate((clean_pairs2, clean_pairs)))
  print(clean_pairs[1:2])
  save_clean_data(clean_pairs, 'clean_pairs.pkl')
  del clean_pairs,clean_pairs2,lines,re_punc,re_print,cleaned,pairs,pairs2

<class 'list'> [['Help!', 'बचाओ!'], ['Jump.', 'उछलो.']] [['fresh breath and shining teeth enhance your personality .', 'ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं ।'], ['your self-confidence also increases with teeth .', 'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है ।']]
<class 'numpy.ndarray'> (77145, 2) <class 'numpy.ndarray'> (2867, 2) (80012, 2)
Number of clean pairs 2867
[Help] => [बचाओ]
[Jump] => [उछलो]
[Jump] => [कूदो]
[Jump] => [छलांग]
[Hello] => [नमस्ते]
[Hello] => [नमस्कार]
[Cheers] => [वाहवाह]
[Cheers] => [चियर्स]
[Got it] => [समझे कि नहीं]
[Im OK] => [मैं ठीक हूँ]
[['your selfconfidence also increases with teeth'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है']]
Saved: clean_pairs.pkl


In [11]:
from pickle import load
from pickle import dump
from numpy.random import shuffle

if (os.path.isfile("dataset.pkl")==False):
  # load a clean dataset
  def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

  # save a list of clean sentences to file
  def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

  # load dataset
  raw_dataset = load_clean_sentences('clean_pairs.pkl')
  #del os.remove("clean_pairs.pkl")
  # reduce dataset size
  n_sentences = raw_dataset.shape[0]
  print (n_sentences)
  #Subsetting to 10000 records.
  #n_sentences=dataset.shape[0]
  # random shuffle
  random.Random(4).shuffle(raw_dataset)
  dataset = raw_dataset[:n_sentences, :]
  print(dataset.shape)
  random.Random(4).shuffle(dataset)
  del raw_dataset
# split into train/test
#train, test = dataset[:2800], dataset[2800:]
# save
#save_clean_data(dataset, 'dataset.pkl')
#save_clean_data(train, 'english-german-train.pkl')
#save_clean_data(test, 'english-german-test.pkl')

80012
(80012, 2)


In [0]:
#Trying to get sentences of length say hindi=12 words or less also with English words 10 or less.
#limited=dataset[:10]
#print((len(limited[2][0].split()) < 100))
#print((len(limited[2][1].split()) < 100))
#state1=(len(limited[2][0].split()) < 100)
#state2=(len(limited[2][1].split())<100)
#print(state1,state2,state1&state2)
#print((len(limited[2][0].split()) < 100 & len(limited[2][1].split())<100))

def get_sentences_subset(limited,source_len,target_len):
    indexes_list=[]
    for indexes in range(0,limited.shape[0]):
        #print(len(limited[i][0].split()),len(limited[i][1].split()))
        eng_len=len(limited[indexes][0].split()) 
        hin_len=len(limited[indexes][1].split())
        state1=(eng_len<=target_len)
        state2=(hin_len<=source_len)
        final=state2&state1
        #print(eng_len,hin_len,final)
        #print(state1,state2,final)
        if (final):
            indexes_list.append(indexes)
    #print(indexes_list,type(indexes_list))
    return(limited[indexes_list])

In [13]:
if (os.path.isfile("dataset.pkl")==False):
  print("Total sentences", dataset.shape[0])
  print("How many words long sentence in both language you need?")
  sentence_words=input()
  sentence_words=int(sentence_words)
  #print("How many words long sentence in Target Language you need?")
  #target_sentence_words=input()
  #target_sentence_words=int(target_sentence_words)
  raw_dataset_subset=get_sentences_subset(dataset,sentence_words,sentence_words)
  print("There are Number of Sentences are matching above criteria",raw_dataset_subset.shape)
  print("########################################################################")
  print("How many sentences you need?")
  n_sentences=input()
  n_sentences=int(n_sentences)
  #Subsetting to n sentences.
  #n_sentences=5000
  random.Random(4).shuffle(raw_dataset_subset)
  dataset = raw_dataset_subset[:n_sentences, :]
  random.Random(4).shuffle(dataset)
  print(raw_dataset_subset.shape,dataset,dataset.shape)
  save_clean_data(dataset, 'dataset.pkl')
#i=[0,2,3]
#print("Limited",limited,i)
#print("\n Indexed",limited[i],i)

Total sentences 80012
How many words long sentence in both language you need?
15
There are Number of Sentences are matching above criteria (41704, 2)
########################################################################
How many sentences you need?
40000
(41704, 2) [['fresh breath and shining teeth enhance your personality'
  'ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं']
 ['fresh breath and shining teeth enhance your personality'
  'ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं']
 ['fresh breath and shining teeth enhance your personality'
  'ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं']
 ...
 ['sleep apnia can also be fatal' 'स्लीप एप्रिया जानलेवा भी हो सकती है']
 ['hepatitisb is a worldwide disease which occurs due to hepatitisb virus lrbhbv rrb'
  'हेपेटाइटिसबी विश्\u200dवव्यापी बीमारी है जो हेपेटाइटिसबी वायरस के कारण होती है']
 ['patients are administered more anesthetic medicine which has its own harms'
  'मरीज को ज्यादा निश्चेतक दवा द

In [0]:
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))
dataset = load_clean_sentences('dataset.pkl')

In [0]:
raw_dataset_subset = None
raw_dataset =None
del raw_dataset_subset,raw_dataset

In [16]:
#Check the data sample
dataset.shape

(40000, 2)

In [0]:
#Converting it to tuples.
dataset_list=(list(tuple(map(tuple, dataset))))

In [0]:
#English Sentence List
english_sentences_list=list(dataset[:,0])

In [19]:
english_sentences_list[0]='Please make yourself at home'
english_sentences_list[0]

'Please make yourself at home'

In [0]:
#English Sentence Unique Word List and Length of Vocabulary
english_unique_words=set((' '.join(english_sentences_list)).split())
english_vocab_len=len(set((' '.join(english_sentences_list)).split()))

In [0]:
#Hindi Sentence List
hindi_sentences_list=list(dataset[:,1])
hindi_sentences_list[0]='इसको अपना घर ही समझो'

In [0]:
#Hindi Sentence Unique Word List and Length of Vocabulary
hindi_unique_words=set((' '.join(hindi_sentences_list)).split())
hindi_vocab_len=len(set((' '.join(hindi_sentences_list)).split()))

In [0]:
#Creating Dictionary with Unknown and Pad elements
english_dictionary=dict(zip(sorted(english_unique_words) + ['<unk>', '<pad>'], list(range(len(english_unique_words) + 2))))
hindi_dictionary=dict(zip(sorted(hindi_unique_words) + ['<unk>', '<pad>'], list(range(len(hindi_unique_words) + 2))))

In [0]:
#Making Sure that 0th Value is pad value to ensure masking.
def return_adjusted_dictionary(english_dictionary):
  zero_value=0
  Key_pad='<pad>'
  for Key, Value in english_dictionary.items():
    if Value == zero_value:
      Key_zero=Key
      #print (Key)
    if Key == Key_pad:
      value_pad=Value
      #print (Value)
  english_dictionary.update({Key_pad: 0, Key_zero: value_pad})
  return(english_dictionary)

In [0]:
hindi_dictionary=return_adjusted_dictionary(hindi_dictionary)
english_dictionary=return_adjusted_dictionary(english_dictionary)

In [0]:
#Reverse Dictionary for both languages
revere_dictionary_hindi=dict((v,k) for k,v in hindi_dictionary.items())
revere_dictionary_english=dict((v,k) for k,v in english_dictionary.items())

In [27]:
#Storing the index of padding value in variables to add it going ahead.
english_padding_value=english_dictionary['<pad>']
hindi_padding_value=hindi_dictionary['<pad>']
print(english_padding_value,hindi_padding_value)

0 0


In [28]:
#This going to be the global variable with maximum number of words found in a sentence
max_english_words=max(len(line.split()) for line in english_sentences_list)
max_hindi_words=max(len(line.split()) for line in hindi_sentences_list)
print(max_english_words,max_hindi_words)

15 15


In [0]:
def get_padded_encoding(sentences_list,language_dictionary,max_language_words):
    padding_value=language_dictionary['<pad>']
    language_array=[]
    #Iterate over List.
    for sentence in sentences_list:
        #Replaces English words with English Vocabulary Indexes and Hindi with Hindi Vocabulary Indexes.
        #logic: if a word not in dictionary enters, it will be replaced by unk key value.
        single_sentence_array=[]
        for word in sentence.split(): 
            try:
                #single_sentence_array=([language_dictionary[word] for word in sentence.split()])
                single_sentence_array.append(language_dictionary[word])
            except KeyError:
                unk='<unk>'
                single_sentence_array.append(language_dictionary[unk])
        #Find the length of english_single_sentence_array
        length_single_sentence=(len(single_sentence_array))
        #So how many times padding dictionary key needs to be appended, if we say maximum length of sentences to be considered is eng_max_len.
        if (max_language_words>length_single_sentence):
            padding_count=(max_language_words-length_single_sentence)
        else:
            padding_count=0
        if (padding_count>0):
            for pad in range(0,padding_count):
                single_sentence_array.append(padding_value)
        else:
            single_sentence_array=single_sentence_array[0:max_language_words]
        #Append to main array
        language_array.append(single_sentence_array)
    #Convert to Numpy array at the end
    language_array=np.array(language_array)
    return(language_array)

Instead of doing a padding over large sentence size, emperically it is found that it is better to do for a short sentences considering the limitation we are having with respect to corpus size.

In [30]:
#Get encoded sentences
hindi_encoding=get_padded_encoding(hindi_sentences_list,hindi_dictionary,max_english_words)
english_encoding=get_padded_encoding(english_sentences_list,english_dictionary,max_hindi_words)
print(hindi_encoding.shape,english_encoding.shape)

(40000, 15) (40000, 15)


In [31]:
#Verifying the encoding and decoding for a sample data.
print(english_sentences_list[1],hindi_sentences_list[1])
print(english_encoding[1],hindi_encoding[1])
#Check if encoding gives back the same answer
for key in english_encoding[1]:
    print(revere_dictionary_english[key])
for key in hindi_encoding[1]:
    print(revere_dictionary_hindi[key])
english_dictionary['<pad>']
hindi_dictionary['<pad>']

fresh breath and shining teeth enhance your personality ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं
[1821  594  194 4134 4600 1500 5176 3392    0    0    0    0    0    0
    0] [2533 5802  922 1794 2681  429 5408 1302 3024 6373    0    0    0    0
    0]
fresh
breath
and
shining
teeth
enhance
your
personality
<pad>
<pad>
<pad>
<pad>
<pad>
<pad>
<pad>
ताजा
साँसें
और
चमचमाते
दाँत
आपके
व्यक्तित्व
को
निखारते
हैं
<pad>
<pad>
<pad>
<pad>
<pad>


0

In [0]:
#We will convert the english and hindi encodings to one hot encodings.
#Please note Input is of the dimension (number of sentences,max_length_language(every column is a word))
#Output is (number of sentences,Max_length_language(every row is a word),length of vocabulary)
#Basically every row of the onehotcode matrix must be for one word.
#How=1 => 1 0 0
#Are=2 => 0 1 0
#You=3 => 0 0 1
#We are trying to translate hindi to english, so our X is Hindi and Y is English
X=hindi_encoding
Y=english_encoding
hindi_encoding=None
english_encoding=None
dataset=None
english_sentences_list=None
hindi_sentences_list=None
del hindi_encoding,english_encoding,dataset,english_sentences_list,hindi_sentences_list
#Note: Instead of one hot we can use word embeddings for Xoh

#Xoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(hindi_dictionary)), X)))
#Yoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y)))
#print("X.shape:", X.shape)
#print("Y.shape:", Y.shape)
#print("Xoh.shape:", Xoh.shape)
#print("Yoh.shape:", Yoh.shape)

In [0]:
#sample_size=X.shape[0]
#test_start_index=(sample_size-100)
#test_batch_size=sample_size-test_start_index-2
#testXoh=get_single_onehot_array(X,test_start_index,test_batch_size,hindi_dictionary)
#testYoh=get_single_onehot_array(Y,test_start_index,test_batch_size,english_dictionary)
#print(sample_size,test_start_index,test_batch_size)

In [0]:
#print(Y.shape,X.shape,len(english_dictionary),len(hindi_dictionary))
#from keras.callbacks import ModelCheckpoint
#import pandas

#sample_size=X.shape[0]
#test_start_index=(sample_size-100)
#test_batch_size=sample_size-test_start_index-2
#train_iteration_size=test_start_index
#print("train_iteration_size",train_iteration_size)
#gc.collect()

In [0]:
#Every call will return batch size sized Y or x
#Intention generate the small dataset and train the model 
#and retrain the model again for next index small dataset and so on

#start_index=0
#iteration=0
#if (iteration>0):
#    start_index=iteration+batch_size
def get_single_onehot_array(Y,start_index,batch_size,dictionary):
    english_dictionary=dictionary
    end_index=start_index+batch_size
    if end_index>Y.shape[0]:
        end_index=Y.shape[0]
    if start_index>end_index:
        start_index=(end_index-1)
    result_array=[]
    for i in range(start_index,end_index):
        Yoh_single_sentence=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y[i])))
        Yoh_single_sentence=np.swapaxes(Yoh_single_sentence,0,1)
        Yoh_single_sentence=np.reshape(Yoh_single_sentence,(Yoh_single_sentence.shape[1],Yoh_single_sentence.shape[2]))
        #Making fist column of timestep of one hot encoding as 0 as this column will have 1 only for the row with encoding 0, which is our pad value, which will be masked.
        #if i % 1000 == 0:
        #    print(i)
        #print(type(Yoh_single_sentence),Yoh_single_sentence.shape)
        result_array.append(Yoh_single_sentence)
        #result_array = np.append(result_array,Yoh_single_sentence, axis=0)
    Yoh = np.array(result_array)
    Yoh[:,:,0]=np.zeros(Yoh.shape[1])
    return Yoh

In [0]:
#For Train:start_index=0,train_iteration_size=train_sample_size
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)
#For Test:start_index=train_sample_size,train_iteration_size=(samples_size-train_sample_size-1)
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)

def batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary):
  start_index=start_index
  training_sample_count=train_iteration_size
  
  while start_index < X.shape[0]:
      batch_size=1
      #Remember that the items that will get trained is iteration_range*batchsize
      start_index=(start_index*batch_size)
      #print("Iteration",iteration)
      #print("start_index,iteration,train_iteration size:",start_index,iteration,train_iteration_size)
      #print("train_iteration_size",train_iteration_size)
      #gc.collect()
      if start_index< train_iteration_size:
        #print("Start Index",start_index)
        #print("Before Start Index Reset",start_index)
        #Xoh_batch=get_single_onehot_array(X,start_index,batch_size,hindi_dictionary)
        Xoh_batch=np.reshape(X[start_index],(1,X.shape[1]))
        Yoh_batch=get_single_onehot_array(Y,start_index,batch_size,english_dictionary)
        #Xoh_batch=np.concatenate((Xoh_batch,Xoh_batch),axis=0)
        #Yoh_batch=np.concatenate((Yoh_batch,Yoh_batch),axis=0)
        #train_outputs = list(Yoh_batch.swapaxes(0,1))
        #train_s0 = np.zeros((Xoh_batch.shape[0],n_s))
        #train_c0 = np.zeros((Xoh_batch.shape[0],n_s))
        #trainX=[Xoh_batch, train_s0, train_c0]
        #trainY=list(Yoh_batch.swapaxes(0,1))
        gc.collect()
      else:
        print("Before Start Index Reset",start_index)
        start_index=0
        print("After Start Index Reset",start_index)
        #Xoh_batch=get_single_onehot_array(X,start_index,batch_size,hindi_dictionary)
        Xoh_batch=np.reshape(X[start_index],(1,X.shape[1]))
        Yoh_batch=get_single_onehot_array(Y,start_index,batch_size,english_dictionary)
        #train_outputs = list(Yoh_batch.swapaxes(0,1))
        #train_s0 = np.zeros((Xoh_batch.shape[0], n_s))
        #train_c0 = np.zeros((Xoh_batch.shape[0],n_s))
        #trainX=[Xoh_batch, train_s0, train_c0]
        #trainY=list(Yoh_batch.swapaxes(0,1))
        #Xoh_batch=np.concatenate((Xoh_batch,Xoh_batch),axis=0)
        #Yoh_batch=np.concatenate((Yoh_batch,Yoh_batch),axis=0)
        gc.collect()
        continue
      start_index = start_index+1
      #print(Xoh_batch.shape,Yoh_batch.shape)
      yield (Xoh_batch,Yoh_batch)

In [37]:
a = np.random.rand(3,2)
b = np.random.rand(3,2)
print(a.shape,b.shape)
for x in range(len(a)):
  print (a[x].shape)

(3, 2) (3, 2)
(2,)
(2,)
(2,)


In [0]:
if (os.path.isfile("attention_decoder.zip")==False):
  r = requests.get("https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/attention_decoder.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
  del z,r

In [83]:
#del model
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder



# configure problem
n_features = len(english_dictionary)
input_features=len(hindi_dictionary)
n_timesteps_in = X.shape[1]
n_timesteps_out = Y.shape[1]
LSTM_Unitsize=150
input_embed_dimension=100
print(n_timesteps_in,n_timesteps_out,input_features,n_features,LSTM_Unitsize)

15 15 6473 5187 150


In [0]:
#import gc
#locals()
# train LSTM
#model.fit(Xoh, Yoh, epochs=1, verbose=2)
# define model
#del model2
import gc
gc.collect()
from keras.layers import Dropout,Masking,Embedding
if (os.path.isfile("main_model_weights_attn_new.h5")==False):
  model2 = Sequential()
  model2.add(Embedding(input_features, input_embed_dimension, input_length=n_timesteps_in,mask_zero=True))
  model2.add(Dropout(0.2))
  model2.add(LSTM(LSTM_Unitsize,return_sequences=True,activation='relu'))
  model2.add(Masking(mask_value=0.))
  model2.add(AttentionDecoder(LSTM_Unitsize, n_features))
  model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
  #model2.save("model2_compiled.hd5")

In [0]:
#from IPython.display import SVG
#from keras.utils.vis_utils import model_to_dot
#import pydot

#SVG(model_to_dot(model2).create(prog='dot', format='svg'))

In [86]:
#from IPython.display import SVG
#from keras.utils.vis_utils import model_to_dot
#import pydot

#SVG(model_to_dot(model2).create(prog='dot', format='svg'))
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 15, 100)           647300    
_________________________________________________________________
dropout_2 (Dropout)          (None, 15, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 15, 150)           150600    
_________________________________________________________________
masking_2 (Masking)          (None, 15, 150)           0         
_________________________________________________________________
AttentionDecoder (AttentionD (None, 15, 5187)          31003656  
Total params: 31,801,556
Trainable params: 31,801,556
Non-trainable params: 0
_________________________________________________________________


In [106]:
#Train & Validation Parameters
sample_size=X.shape[0]

# Or Adjust Samplesize if you want it to be small
#sample_size=20000

test_batch_size=100
test_start_index=(sample_size-test_batch_size)
test_end_index=sample_size
test_generator_batch_size=100
test_samples_per_epoc=test_batch_size
###############################################################


train_start_index=0
train_end_index=test_start_index
train_generator_batch_size=20
train_samples_per_epoc=100
###############################################################

print("Data Size",sample_size)
print("######################################")
print("train_start_index:",train_start_index)
print("train_end_index:",train_end_index)
print("train_generator_batch_size",train_generator_batch_size)
print("train_samples_per_epoc",train_samples_per_epoc)
print("######################################")

print("test_start_index:",test_start_index)
print("test_end_index:",test_end_index)
print("test_generator_batch_size:",test_generator_batch_size)
print("test_samples_per_epoc",test_samples_per_epoc)
print("######################################")

print("Approximate Epcs Required to Cover Samples given train_samples_per_epoc(sample_size/train_samples_per_epoc)",round(sample_size/train_samples_per_epoc))
epoc=round(sample_size/train_samples_per_epoc)*40
print("epoc:",epoc)
gc.collect()

Data Size 40000
######################################
train_start_index: 0
train_end_index: 39900
train_generator_batch_size 20
train_samples_per_epoc 100
######################################
test_start_index: 39900
test_end_index: 40000
test_generator_batch_size: 100
test_samples_per_epoc 100
######################################
Approximate Epcs Required to Cover Samples given train_samples_per_epoc(sample_size/train_samples_per_epoc)*30 16000
epoc: 16000


276

In [0]:
#Train_X=X[train_start_index:train_end_index]
#Train_Y=Y[train_start_index:train_end_index]
#Test_X=X[test_start_index:test_end_index]
#Test_Y=Y[test_start_index:test_end_index]

In [0]:
#For Train:start_index=0,train_iteration_size=train_sample_size
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)
#For Test:start_index=train_sample_size,train_iteration_size=(samples_size-train_sample_size-1)
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)

def batch_generator(label,train_start_index,train_end_index,X,Y,train_generator_batch_size,english_dictionary):
  Xoh_batch = np.zeros((train_generator_batch_size,X.shape[1]))
  Yoh_batch = np.zeros((train_generator_batch_size,Train_Y.shape[1],len(english_dictionary)))
  batch_size=1
  #LOG_EVERY_N = 1000
  print(label,"start_index",train_start_index)
  start_index=train_start_index
  
  while True:
    for ind in range(train_generator_batch_size):

      index=(random.sample(range(train_start_index, train_end_index), 1))[0]
      #print("index",index)

      Xoh_batch[ind]=np.reshape(X[index],(1,X.shape[1]))
      Yoh_batch[ind]=get_single_onehot_array(Y,index,batch_size,english_dictionary)
    #print("X shape, Y shape",Xoh_batch.shape,Yoh_batch.shape)
    gc.collect()
    yield (Xoh_batch,Yoh_batch)

With the generator above, if we define train_generator_batch_size = 10 , that means it will take out 10 samples from features and labels to feed into each epoch until an epoch hits train_samples_per_epoc  sample limit. Then fit_generator() destroys the used data and move on repeating the same process in new epoch.


In [109]:
train_label="train"
test_label="test"
train_batch_generator=batch_generator(train_label,train_start_index,train_end_index,X,Y,train_generator_batch_size,english_dictionary)
test_batch_generator=batch_generator(test_label,test_start_index,test_end_index,X,Y,test_generator_batch_size,english_dictionary)
print(train_batch_generator,test_batch_generator)

<generator object batch_generator at 0x7fb255f19db0> <generator object batch_generator at 0x7fb255f1f200>


In [110]:
#Run the model.fit. If only best validation model needs to be saved, then change save_best_only=True in checkpoint.
from keras.callbacks import ModelCheckpoint,EarlyStopping
checkpoint = ModelCheckpoint('best_model_new.h5', monitor='val_acc', verbose=2,save_best_only=True, mode='auto',save_weights_only=False)
early_stop=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=100,mode='auto')
import pandas
#pandas.DataFrame(model.fit(trainX, trainY, epochs=200, batch_size=100, validation_data=(testX, testY), callbacks=[checkpoint]).history).to_csv("history.csv")
#model.fit(trainX, trainY, epochs=200, batch_size=20, validation_data=(testX, testY), callbacks=[checkpoint])
model2.fit_generator(train_batch_generator,steps_per_epoch=train_samples_per_epoc
                     ,validation_data=test_batch_generator,validation_steps=test_samples_per_epoc
                     ,epochs=epoc
                     #,verbose=1
                     ,callbacks=[checkpoint,early_stop]
                    )

train start_index 0
Epoch 1/16000
Epoch 00000: val_acc improved from -inf to 0.09577, saving model to best_model_new.h5
Epoch 2/16000
Epoch 3/16000

Epoch 4/16000
Epoch 5/16000

Epoch 6/16000
Epoch 7/16000

Epoch 8/16000
Epoch 9/16000

Epoch 10/16000
Epoch 11/16000
Epoch 12/16000
  1/100 [..............................] - ETA: 31s - loss: 1.1812 - acc: 0.3300

Epoch 13/16000
Epoch 14/16000

Epoch 15/16000
Epoch 16/16000
Epoch 17/16000


Epoch 18/16000
Epoch 19/16000

Epoch 20/16000
Epoch 21/16000

Epoch 22/16000
Epoch 23/16000

Epoch 24/16000
Epoch 25/16000

Epoch 26/16000
Epoch 27/16000
Epoch 28/16000
  8/100 [=>............................] - ETA: 29s - loss: 0.9239 - acc: 0.4279

Epoch 29/16000
Epoch 30/16000

Epoch 31/16000
Epoch 32/16000
Epoch 33/16000


Epoch 34/16000
Epoch 35/16000
Epoch 36/16000


Epoch 37/16000
Epoch 38/16000

Epoch 39/16000
Epoch 40/16000
Epoch 41/16000


Epoch 42/16000
Epoch 43/16000

Epoch 44/16000
Epoch 45/16000

Epoch 46/16000
Epoch 47/16000
Epoch 48/16000
 10/100 [==>...........................] - ETA: 29s - loss: 0.5415 - acc: 0.5020

Epoch 49/16000
Epoch 50/16000

Epoch 51/16000
Epoch 52/16000
Epoch 53/16000
  7/100 [=>............................] - ETA: 30s - loss: 0.6838 - acc: 0.5000

Epoch 54/16000
Epoch 55/16000

Epoch 56/16000
Epoch 57/16000
Epoch 58/16000
  1/100 [..............................] - ETA: 30s - loss: 0.5410 - acc: 0.4733

Epoch 59/16000
Epoch 60/16000
Epoch 61/16000


Epoch 62/16000
Epoch 63/16000

Epoch 64/16000
Epoch 65/16000
Epoch 66/16000
  1/100 [..............................] - ETA: 32s - loss: 0.4372 - acc: 0.5433

Epoch 67/16000
Epoch 68/16000

Epoch 69/16000
Epoch 70/16000
Epoch 71/16000
  1/100 [..............................] - ETA: 31s - loss: 0.2993 - acc: 0.5833

Epoch 72/16000
Epoch 73/16000

Epoch 74/16000
Epoch 75/16000
Epoch 76/16000
  7/100 [=>............................] - ETA: 30s - loss: 0.4604 - acc: 0.5295

Epoch 77/16000
Epoch 78/16000

Epoch 79/16000
Epoch 80/16000
Epoch 81/16000
 10/100 [==>...........................] - ETA: 28s - loss: 0.3211 - acc: 0.5543

Epoch 82/16000
Epoch 83/16000
Epoch 84/16000
  2/100 [..............................] - ETA: 31s - loss: 0.6785 - acc: 0.5283

Epoch 85/16000
Epoch 86/16000
Epoch 87/16000
  1/100 [..............................] - ETA: 31s - loss: 0.1865 - acc: 0.5900

Epoch 88/16000
Epoch 89/16000

Epoch 90/16000
Epoch 91/16000
Epoch 92/16000
  1/100 [..............................] - ETA: 31s - loss: 0.2282 - acc: 0.6700

Epoch 93/16000
Epoch 94/16000

Epoch 95/16000
Epoch 96/16000
Epoch 97/16000
  8/100 [=>............................] - ETA: 29s - loss: 0.4130 - acc: 0.5283

Epoch 98/16000
Epoch 99/16000
Epoch 100/16000
  1/100 [..............................] - ETA: 32s - loss: 0.3229 - acc: 0.6633

Epoch 101/16000
Epoch 102/16000

Epoch 103/16000
Epoch 104/16000
Epoch 105/16000
  2/100 [..............................] - ETA: 31s - loss: 0.3981 - acc: 0.5817

Epoch 106/16000
Epoch 107/16000
Epoch 108/16000


Epoch 109/16000
Epoch 110/16000

Epoch 111/16000
Epoch 112/16000
Epoch 113/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.3188 - acc: 0.5307

Epoch 114/16000
Epoch 115/16000

Epoch 116/16000
Epoch 117/16000
Epoch 118/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.3076 - acc: 0.5467

Epoch 119/16000
Epoch 120/16000

Epoch 121/16000
Epoch 122/16000
Epoch 123/16000
  1/100 [..............................] - ETA: 31s - loss: 0.1426 - acc: 0.6200

Epoch 124/16000
Epoch 125/16000

Epoch 126/16000
Epoch 127/16000
Epoch 128/16000


Epoch 129/16000
Epoch 130/16000

Epoch 131/16000
Epoch 132/16000
Epoch 133/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.2156 - acc: 0.5741

Epoch 134/16000
Epoch 135/16000
Epoch 136/16000
  1/100 [..............................] - ETA: 31s - loss: 0.1646 - acc: 0.5933

Epoch 137/16000
Epoch 138/16000
Epoch 139/16000


Epoch 140/16000
Epoch 141/16000


Epoch 142/16000
Epoch 143/16000
Epoch 144/16000

Epoch 145/16000
Epoch 146/16000
Epoch 147/16000
  8/100 [=>............................] - ETA: 30s - loss: 0.1544 - acc: 0.5929

Epoch 148/16000
Epoch 149/16000
Epoch 150/16000
  1/100 [..............................] - ETA: 32s - loss: 0.4123 - acc: 0.4933

Epoch 151/16000
Epoch 152/16000
Epoch 153/16000


Epoch 154/16000
Epoch 155/16000
Epoch 156/16000


Epoch 157/16000
Epoch 158/16000

Epoch 159/16000
Epoch 160/16000
Epoch 161/16000
  2/100 [..............................] - ETA: 31s - loss: 0.0658 - acc: 0.6250

Epoch 162/16000
Epoch 163/16000
Epoch 164/16000


Epoch 165/16000
Epoch 166/16000

Epoch 167/16000
Epoch 168/16000
Epoch 169/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.2354 - acc: 0.5937

Epoch 170/16000
Epoch 171/16000

Epoch 172/16000
Epoch 173/16000
Epoch 174/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1219 - acc: 0.5722

Epoch 175/16000
Epoch 176/16000
Epoch 177/16000
  1/100 [..............................] - ETA: 31s - loss: 0.0834 - acc: 0.5933

Epoch 178/16000
Epoch 179/16000
Epoch 180/16000


Epoch 181/16000
Epoch 182/16000

Epoch 183/16000
Epoch 184/16000
Epoch 185/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.2439 - acc: 0.5944

Epoch 186/16000
Epoch 187/16000
Epoch 188/16000
  1/100 [..............................] - ETA: 30s - loss: 0.2536 - acc: 0.5267

Epoch 189/16000
Epoch 190/16000
Epoch 191/16000


Epoch 192/16000
Epoch 193/16000

Epoch 194/16000
Epoch 195/16000

Epoch 196/16000
Epoch 197/16000
Epoch 198/16000
 10/100 [==>...........................] - ETA: 29s - loss: 0.1361 - acc: 0.6123

Epoch 199/16000
Epoch 200/16000
Epoch 201/16000
  1/100 [..............................] - ETA: 30s - loss: 0.2499 - acc: 0.6167

Epoch 202/16000
Epoch 203/16000


Epoch 204/16000
Epoch 205/16000
Epoch 206/16000

Epoch 207/16000
Epoch 208/16000
Epoch 209/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1314 - acc: 0.5778

Epoch 210/16000
Epoch 211/16000

Epoch 212/16000
Epoch 213/16000
Epoch 214/16000
  1/100 [..............................] - ETA: 32s - loss: 0.2465 - acc: 0.5400

Epoch 215/16000
Epoch 216/16000

Epoch 217/16000
Epoch 218/16000
Epoch 219/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1038 - acc: 0.6396

Epoch 220/16000
Epoch 221/16000
Epoch 222/16000


Epoch 223/16000
Epoch 224/16000
Epoch 225/16000


Epoch 226/16000
Epoch 227/16000
Epoch 228/16000


Epoch 229/16000
Epoch 230/16000
Epoch 231/16000


Epoch 232/16000
Epoch 233/16000
Epoch 234/16000


Epoch 235/16000
Epoch 236/16000

Epoch 237/16000
Epoch 238/16000
Epoch 239/16000
  1/100 [..............................] - ETA: 31s - loss: 0.0217 - acc: 0.6300

Epoch 240/16000
Epoch 241/16000
Epoch 242/16000


Epoch 243/16000
Epoch 244/16000
Epoch 245/16000


Epoch 246/16000
Epoch 247/16000

Epoch 248/16000
Epoch 249/16000
Epoch 250/16000
  8/100 [=>............................] - ETA: 29s - loss: 0.1621 - acc: 0.6021

Epoch 251/16000
Epoch 252/16000
Epoch 253/16000
  1/100 [..............................] - ETA: 31s - loss: 0.1316 - acc: 0.6400

Epoch 254/16000
Epoch 255/16000
Epoch 256/16000


Epoch 257/16000
Epoch 258/16000

Epoch 259/16000
Epoch 260/16000
Epoch 261/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1024 - acc: 0.6148

Epoch 262/16000
Epoch 263/16000
Epoch 264/16000
  1/100 [..............................] - ETA: 32s - loss: 0.0926 - acc: 0.5833

Epoch 265/16000
Epoch 266/16000
Epoch 267/16000


Epoch 268/16000
Epoch 269/16000
Epoch 270/16000


Epoch 271/16000
Epoch 272/16000
Epoch 273/16000


Epoch 274/16000
Epoch 275/16000
Epoch 276/16000


Epoch 277/16000
Epoch 278/16000

Epoch 279/16000
Epoch 280/16000
Epoch 281/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1747 - acc: 0.5978

Epoch 282/16000
Epoch 283/16000
Epoch 284/16000
  1/100 [..............................] - ETA: 30s - loss: 0.1745 - acc: 0.6033

Epoch 285/16000
Epoch 286/16000
Epoch 287/16000


Epoch 288/16000
Epoch 289/16000


Epoch 290/16000
Epoch 291/16000
Epoch 292/16000

Epoch 293/16000
Epoch 294/16000
Epoch 295/16000
  8/100 [=>............................] - ETA: 30s - loss: 0.0662 - acc: 0.5767

Epoch 296/16000
Epoch 297/16000
Epoch 298/16000
  1/100 [..............................] - ETA: 33s - loss: 0.1173 - acc: 0.5200

Epoch 299/16000
Epoch 300/16000
Epoch 301/16000


Epoch 302/16000
Epoch 303/16000


Epoch 304/16000
Epoch 305/16000
Epoch 306/16000

Epoch 307/16000
Epoch 308/16000
Epoch 309/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.0604 - acc: 0.5885

Epoch 310/16000
Epoch 311/16000
Epoch 312/16000
  1/100 [..............................] - ETA: 31s - loss: 0.0845 - acc: 0.5767

Epoch 313/16000
Epoch 314/16000
Epoch 315/16000


Epoch 316/16000
Epoch 317/16000

Epoch 318/16000
Epoch 319/16000
Epoch 320/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1625 - acc: 0.5848

Epoch 321/16000
Epoch 322/16000
Epoch 323/16000
  1/100 [..............................] - ETA: 31s - loss: 0.1646 - acc: 0.5800

Epoch 324/16000
Epoch 325/16000
Epoch 326/16000


Epoch 327/16000
Epoch 328/16000
Epoch 329/16000


Epoch 330/16000
Epoch 331/16000
Epoch 332/16000


Epoch 333/16000
Epoch 334/16000

Epoch 335/16000
Epoch 336/16000
Epoch 337/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1114 - acc: 0.5881

Epoch 338/16000
Epoch 339/16000
Epoch 340/16000
  1/100 [..............................] - ETA: 32s - loss: 0.0843 - acc: 0.6400

Epoch 341/16000
Epoch 342/16000
Epoch 343/16000


Epoch 344/16000
Epoch 345/16000
Epoch 346/16000


Epoch 347/16000
Epoch 348/16000
Epoch 349/16000


Epoch 350/16000
Epoch 351/16000

Epoch 352/16000
Epoch 353/16000
Epoch 354/16000
  9/100 [=>............................] - ETA: 30s - loss: 0.0990 - acc: 0.5948

Epoch 355/16000
Epoch 356/16000
Epoch 357/16000
  1/100 [..............................] - ETA: 32s - loss: 0.1121 - acc: 0.5700

Epoch 358/16000
Epoch 359/16000
Epoch 360/16000


Epoch 361/16000
Epoch 362/16000

Epoch 363/16000
Epoch 364/16000
Epoch 365/16000
  2/100 [..............................] - ETA: 32s - loss: 0.1963 - acc: 0.6200

Epoch 366/16000
Epoch 367/16000

Epoch 368/16000
Epoch 369/16000
Epoch 370/16000
  9/100 [=>............................] - ETA: 30s - loss: 0.1079 - acc: 0.6137

Epoch 371/16000
Epoch 372/16000
Epoch 373/16000
  1/100 [..............................] - ETA: 30s - loss: 0.1110 - acc: 0.6533

Epoch 374/16000
Epoch 375/16000
Epoch 376/16000


Epoch 377/16000
Epoch 378/16000
Epoch 379/16000


Epoch 380/16000
Epoch 381/16000
Epoch 382/16000


Epoch 383/16000
Epoch 384/16000
Epoch 385/16000


Epoch 386/16000
Epoch 387/16000
Epoch 388/16000


Epoch 389/16000
Epoch 390/16000
Epoch 391/16000


Epoch 392/16000
Epoch 393/16000
Epoch 394/16000


Epoch 395/16000
Epoch 396/16000
Epoch 397/16000


Epoch 398/16000
Epoch 399/16000
Epoch 400/16000


Epoch 401/16000
Epoch 402/16000
Epoch 403/16000


Epoch 404/16000
Epoch 405/16000
Epoch 406/16000


Epoch 407/16000
Epoch 408/16000
Epoch 409/16000


Epoch 410/16000
Epoch 411/16000
Epoch 412/16000


Epoch 413/16000
Epoch 414/16000
Epoch 415/16000


Epoch 416/16000
Epoch 417/16000
Epoch 418/16000


Epoch 419/16000
Epoch 420/16000

Epoch 421/16000
Epoch 422/16000

Epoch 423/16000
Epoch 424/16000
Epoch 425/16000
 10/100 [==>...........................] - ETA: 29s - loss: 0.0890 - acc: 0.6113

Epoch 426/16000
Epoch 427/16000

Epoch 428/16000
Epoch 429/16000
Epoch 430/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.0951 - acc: 0.5959

Epoch 431/16000
Epoch 432/16000
Epoch 433/16000
  1/100 [..............................] - ETA: 30s - loss: 0.0160 - acc: 0.7133

Epoch 434/16000
Epoch 435/16000
Epoch 436/16000


Epoch 437/16000
Epoch 438/16000
Epoch 439/16000


Epoch 440/16000
Epoch 441/16000
Epoch 442/16000


Epoch 443/16000
Epoch 444/16000
Epoch 445/16000


Epoch 446/16000
Epoch 447/16000
Epoch 448/16000


Epoch 449/16000
Epoch 450/16000
Epoch 451/16000


Epoch 452/16000
Epoch 453/16000
Epoch 454/16000


Epoch 455/16000
Epoch 456/16000
Epoch 457/16000


Epoch 458/16000
Epoch 459/16000
Epoch 460/16000


Epoch 461/16000
Epoch 462/16000
Epoch 463/16000


Epoch 464/16000
Epoch 465/16000
Epoch 466/16000


Epoch 467/16000
Epoch 468/16000

Epoch 469/16000
Epoch 470/16000
Epoch 471/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.0554 - acc: 0.6233

Epoch 472/16000
Epoch 473/16000

Epoch 474/16000
Epoch 475/16000
Epoch 476/16000
  9/100 [=>............................] - ETA: 30s - loss: 0.0509 - acc: 0.6030

Epoch 477/16000
Epoch 478/16000
Epoch 479/16000


Epoch 480/16000
Epoch 481/16000
Epoch 482/16000


Epoch 483/16000
Epoch 484/16000
Epoch 485/16000


Epoch 486/16000
Epoch 487/16000

Epoch 488/16000
Epoch 489/16000
Epoch 490/16000
  2/100 [..............................] - ETA: 31s - loss: 0.1221 - acc: 0.5767

Epoch 491/16000
Epoch 492/16000
Epoch 493/16000


Epoch 494/16000
Epoch 495/16000

Epoch 496/16000
Epoch 497/16000
Epoch 498/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.1252 - acc: 0.6137

Epoch 499/16000
Epoch 500/16000
Epoch 501/16000
  1/100 [..............................] - ETA: 31s - loss: 0.0153 - acc: 0.5533

Epoch 502/16000
Epoch 503/16000
Epoch 504/16000


Epoch 505/16000
Epoch 506/16000
Epoch 507/16000


Epoch 508/16000
Epoch 509/16000
Epoch 510/16000


Epoch 511/16000
Epoch 512/16000

Epoch 513/16000
Epoch 514/16000
Epoch 515/16000
  9/100 [=>............................] - ETA: 29s - loss: 0.0657 - acc: 0.6044

Epoch 516/16000
Epoch 517/16000
Epoch 518/16000
  1/100 [..............................] - ETA: 33s - loss: 0.0793 - acc: 0.6200

Epoch 519/16000
Epoch 520/16000
Epoch 521/16000


Epoch 522/16000
Epoch 523/16000
Epoch 524/16000


Epoch 525/16000
Epoch 526/16000
Epoch 527/16000


Epoch 528/16000
Epoch 529/16000
Epoch 530/16000


Epoch 531/16000
Epoch 532/16000
Epoch 533/16000


Epoch 534/16000
Epoch 535/16000
Epoch 536/16000


Epoch 537/16000
Epoch 538/16000
Epoch 539/16000


Epoch 540/16000
Epoch 541/16000
Epoch 542/16000


Epoch 543/16000
Epoch 544/16000
Epoch 545/16000


Epoch 546/16000
Epoch 547/16000
Epoch 548/16000


Epoch 549/16000
Epoch 550/16000
Epoch 551/16000


Epoch 552/16000
Epoch 553/16000
Epoch 554/16000


Epoch 555/16000
Epoch 556/16000
Epoch 557/16000


Epoch 558/16000
Epoch 559/16000
Epoch 560/16000


Epoch 561/16000
Epoch 562/16000
Epoch 563/16000


Epoch 564/16000
Epoch 565/16000
Epoch 566/16000


Epoch 567/16000
Epoch 568/16000
Epoch 569/16000


Epoch 570/16000
Epoch 571/16000
Epoch 572/16000


Epoch 573/16000
Epoch 574/16000
Epoch 575/16000


Epoch 576/16000
Epoch 577/16000
Epoch 578/16000


Epoch 579/16000
Epoch 580/16000
Epoch 581/16000


Epoch 582/16000
Epoch 583/16000
Epoch 584/16000


Epoch 585/16000
Epoch 586/16000
Epoch 587/16000


Epoch 588/16000
Epoch 589/16000
Epoch 590/16000


Epoch 591/16000
Epoch 592/16000
Epoch 593/16000


Epoch 594/16000
Epoch 595/16000
Epoch 596/16000


Epoch 597/16000
Epoch 598/16000
Epoch 599/16000


Epoch 600/16000
Epoch 601/16000
Epoch 602/16000


Epoch 603/16000
Epoch 604/16000
Epoch 605/16000


Epoch 606/16000
Epoch 607/16000
Epoch 608/16000


Epoch 609/16000
Epoch 610/16000
Epoch 611/16000




<keras.callbacks.History at 0x7fb2574636a0>

In [111]:
import os
os.listdir()

['datalab',
 'clean_pairs.pkl',
 '.config',
 '.ipython',
 '.rnd',
 'model2.json',
 '.keras',
 'best_model_new.h5',
 'hin.txt',
 'model2_weights.h5',
 '.nv',
 'enghindi.txt',
 'dataset.pkl',
 'attention_decoder.py',
 '.forever',
 '__pycache__',
 '.cache',
 '_about.txt',
 '.local']

In [112]:
from keras.models import model_from_json
model_json = model2.to_json()
with open("model2.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model2.save_weights("model2_weights.h5")
print("Saved model to disk")
 
# later...
 
# load json and create model
json_file = open('model2.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json,custom_objects={'AttentionDecoder': AttentionDecoder(LSTM_Unitsize, n_features)})
print("Loaded model from disk")


Saved model to disk
Loaded model from disk


In [113]:
# load weights into new model
loaded_model.load_weights("model2_weights.h5")
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Loaded model from disk


In [114]:
import os
os.listdir()

['datalab',
 'clean_pairs.pkl',
 '.config',
 '.ipython',
 '.rnd',
 'model2.json',
 '.keras',
 'best_model_new.h5',
 'hin.txt',
 'model2_weights.h5',
 '.nv',
 'enghindi.txt',
 'dataset.pkl',
 'attention_decoder.py',
 '.forever',
 '__pycache__',
 '.cache',
 '_about.txt',
 '.local']

In [0]:
# Install the PyDrive wrapper & import libraries. This is to upload model & weights to your GDrive.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [117]:
# Create & upload a file.
modelname="model2.json"
uploaded = drive.CreateFile({'title': modelname})
uploaded.SetContentFile(modelname)
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

Uploaded file with ID 1WROTQHnDCdJptpXCLvKn5RqVupj83dui


In [118]:
# Create & upload a file.
modelname="model2_weights.h5"
uploaded = drive.CreateFile({'title': modelname})
uploaded.SetContentFile(modelname)
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))


Uploaded file with ID 1c7KUemDpcEnnTRMrEihZ5gjWJzAIBOzA


In [0]:
# Create & upload a file.
if (os.path.isfile("best_model_new.h5")==False):
  uploaded = drive.CreateFile({'title': modelname})
  uploaded.SetContentFile(modelname)
  uploaded.Upload()
  print('Uploaded file with ID {}'.format(uploaded.get('id')))

In [0]:
from keras.models import load_model
prediction_model = load_model('best_model_new.h5',custom_objects={'AttentionDecoder': AttentionDecoder(LSTM_Unitsize, n_features)})
#prediction_model=loaded_model

In [138]:
#Test
test_index=test_start_index+14
test=np.reshape(X[test_index],(1,X.shape[1]))
prediction=np.round(prediction_model.predict(test))
predicted_argmax=np.argmax(prediction,axis=2)
predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
Y[test_index],predicted_argmax

(array([2194, 4649, 2393, 3879,  296, 3229, 4321, 4644, 4533, 3179, 2239,
        1616,    0,    0,    0]),
 array([2194, 4649, 2393, 3879,  296, 3229, 4321, 4644, 4533, 3179, 2239,
        1616, 4719,    0,    0]))

In [139]:
#Predict for test sentence index which model has not seen yet.
print("Pure Testacases are from following sentence numbers in X",test_start_index)
#Given encoding matrix of sentence & dictionary, get the sentence
def return_sentences(X,Y,revere_dictionary_english,test_index,model2):
  
  def return_predicted_array(X,test_index,model2):
    test=np.reshape(X[test_index],(1,X.shape[1]))
    encoding_prediction=np.round(model2.predict(test))
    predicted_argmax=np.argmax(encoding_prediction,axis=2)
    predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
    return predicted_argmax
  
  encoding_prediction=return_predicted_array(X,test_index,prediction_model)
  
  encoding_actual=Y[test_index]
  
  def return_sentence_list(encoding,revere_dictionary_english,test_index):
    #print(test_index)
    sentence=list()
    for key in encoding:
      key=int(key)
      #print(type(int(key)))
      #print(revere_dictionary_english[key])
      sentence.append(revere_dictionary_english[key])
    return sentence

  def concatenate_list_data(list):
      result= ''
      for element in list:
          result += str(element)
          result += str(" ")
      return result
  actual_sentence=return_sentence_list(encoding_actual,revere_dictionary_english,test_index)
  actual_sentence=concatenate_list_data(actual_sentence)
  
  predicted_sentence=return_sentence_list(encoding_prediction,revere_dictionary_english,test_index)
  predicted_sentence=concatenate_list_data(predicted_sentence)
  return(actual_sentence,predicted_sentence)

#print(test_index)
test_sentences=10
for test_sentence_index in range(test_start_index,test_start_index+test_sentences):
  Actual,Predicted=return_sentences(X,Y,revere_dictionary_english,test_sentence_index,prediction_model)
  print("#############################")
  print("Actual Sentence is:")
  print(Actual)
  print("Predicted Sentence is:")
  print(Predicted)
  print("#############################")

Pure Testacases are from following sentence numbers in X 39900
#############################
Actual Sentence is:
rch kits are being provided directly to the districts by the government of india <pad> 
Predicted Sentence is:
rch kits are being provided directly to the districts by the government of india <pad> 
#############################
#############################
Actual Sentence is:
abhyantarvritti pranayama is extremely beneficial for the patients of asthma <pad> <pad> <pad> <pad> <pad> 
Predicted Sentence is:
abhyantarvritti pranayama is extremely beneficial for the patients of asthma patients <pad> <pad> of diabetes 
#############################
#############################
Actual Sentence is:
cylindrical glasses have to be worn always <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
Predicted Sentence is:
cylindrical glasses have to be worn always <pad> women be <pad> <pad> <pad> always <pad> 
#############################
#############################
Actual Sentence is:
i

[Click here to use google transliterate and copy the hindi sentence from there and paste in below cell when it asks for input.](https://www.google.co.in/inputtools/try/)

In [140]:
#Enter source language sentence from google transliterate https://www.google.co.in/inputtools/try/
print("Enter Sentences less than 15 words. As of now that is what is set.")
user_sentence=input()
print(type(user_sentence))

Enter Sentences less than 15 words. As of now that is what is set.
ताजा साँसें और चमचमाते दाँत
<class 'str'>


In [141]:
#Get Encoding
words=user_sentence.split()
user_encoding=[]
for word in words:
  try:
    #print(hindi_dictionary[word])
    user_encoding.append(hindi_dictionary[word])
  except KeyError:
    #print(hindi_dictionary['<unk>'])
    user_encoding.append(hindi_dictionary['<unk>'])
user_encoding
#print(X.shape[1],len(user_encoding))
if (X.shape[1]>len(user_encoding)):
  padding_count=X.shape[1]-len(user_encoding)
  for x in range(0,padding_count):
    user_encoding.append(0)
else:
  user_encoding=user_encoding[0:X.shape[1]]
user_encoding=np.array(user_encoding)
user_encoding

array([2533, 5802,  922, 1794, 2681,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0])

In [142]:
def return_predicted_array_user(user_encoding,model2):
  test=np.reshape(user_encoding,(1,X.shape[1]))
  encoding_prediction=np.round(model2.predict(test))
  predicted_argmax=np.argmax(encoding_prediction,axis=2)
  predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
  return predicted_argmax

#from keras.models import load_model
#bestmodel = load_model('complete_model_with_weigths.h5')

predicted_user=return_predicted_array_user(user_encoding,prediction_model)
print("Predicted Array",predicted_user)

def return_sentence_list(encoding,revere_dictionary_english):
  sentence=list()
  for ind in encoding:
    #print(ind)
    predicted_word=revere_dictionary_english[ind]
    #print(predicted_word)
    sentence.append(revere_dictionary_english[ind])
  return sentence
  
user_translation_list=return_sentence_list(predicted_user,revere_dictionary_english)

def concatenate_list_data(list):
  result= ''
  for element in list:
    result += str(element)
    result += str(" ")
  return result

predicted_sentence=concatenate_list_data(user_translation_list)
print("Predicted Sentence")
print("###################################")
print(predicted_sentence)

Predicted Array [1821  594  194 4134 4600 1500    0 3392 3392 1500 3392 3392 3392 3392
 3392]
Predicted Sentence
###################################
fresh breath and shining teeth enhance <pad> personality personality enhance personality personality personality personality personality 


## 3. Quantitative Analysis
While training you can see the loss as well as the accuracy on each of the positions of the output. The output snapshot below gives you an example of what the accuracies could be at 100th iteration in above settings: 

Epoch 100/100
2800/2810 [============================>.] - ETA: 0s - loss: 5.1898 - dense_2_loss_1: 0.7850 - dense_2_loss_2: 0.8572 - dense_2_loss_3: 0.8971 - dense_2_loss_4: 0.8881 - dense_2_loss_5: 0.9539 - dense_2_loss_6: 0.8085 - dense_2_acc_1: 0.7875 - dense_2_acc_2: 0.7571 - dense_2_acc_3: 0.7443 - dense_2_acc_4: 0.7479 - dense_2_acc_5: 0.7379 - dense_2_acc_6: 0.7975Epoch 00099: saving model to main_model_weights.h5
2810/2810 [==============================] - 17s - loss: 5.1868 - dense_2_loss_1: 0.7848 - dense_2_loss_2: 0.8570 - dense_2_loss_3: 0.8973 - dense_2_loss_4: 0.8891 - dense_2_loss_5: 0.9526 - dense_2_loss_6: 0.8061 - dense_2_acc_1: 0.7872 - dense_2_acc_2: 0.7569 - dense_2_acc_3: 0.7438 - dense_2_acc_4: 0.7480 - dense_2_acc_5: 0.7384 - dense_2_acc_6: 0.7982 - val_loss: 40.8334 - val_dense_2_loss_1: 4.3684 - val_dense_2_loss_2: 7.5746 - val_dense_2_loss_3: 6.9904 - val_dense_2_loss_4: 8.9484 - val_dense_2_loss_5: 6.6466 - val_dense_2_loss_6: 6.3049 - val_dense_2_acc_1: 0.4561 - val_dense_2_acc_2: 0.2807 - val_dense_2_acc_3: 0.1754 - val_dense_2_acc_4: 0.1579 - val_dense_2_acc_5: 0.2456 - val_dense_2_acc_6: 0.3684

######################################################################################################

<caption><left> Thus at 100-th iteration with unaltered settings above, `dense_2_acc_6: 0.7975` means that you are predicting the 6th word of the output correctly 79% of the time in the current batch of data. Also val_dense_2_acc_6: 0.3684 means the 6th digit prediction accuracy is 36%  </left></caption>
 

We can now see the results on new examples.

We should be able to see following results. We have first sentence from training example and another one from true test. We can see that there is a pretty good translation for the data from training and for true test it was able to predict first place pretty accurately but failed in following portions.

 ##### 

Hindi इसको अपना घर ही समझो
Expected: Please make yourself at home
Predicted output: Please yourself yourself home <pad> <pad>

 ##### 

Hindi हमने खरीदी
Predicted output: We leave up <pad> <pad> <pad>

## 5 References

Neural Machine Translation by Jointly Learning to Align and Translate: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio https://arxiv.org/pdf/1409.0473.pdf

https://machinelearningmastery.com

https://www.coursera.org/

https://www.udemy.com/

## Appendix

One thing we can do to improve the model is instead of one hot encodings of words of length vocabulary, get the word2vec vectors for each word with fixed length.

Another thing that can be done is train only short sentences.

In below section we will provide the functions to help to do the tasks.

In [0]:
#Converting input to word2vec.
def sentences_to_word2vec_input_format(language_sentences_list):
    word2vec_sentence_feed=list()
    for sentence in language_sentences_list:
        word2vec_sentence_feed.append(sentence.split())
    return(word2vec_sentence_feed)
english_sentences_w2v_format=sentences_to_word2vec_input_format(english_sentences_list)
hindi_sentences_w2v_format=sentences_to_word2vec_input_format(hindi_sentences_list)

In [0]:
from gensim.models import Word2Vec
# train model
english_model = Word2Vec(english_sentences_w2v_format, min_count=1)
english_words_vocab = list(english_model.wv.vocab)
hindi_model = Word2Vec(hindi_sentences_w2v_format, min_count=1)
english_words_vocab = list(hindi_model.wv.vocab)

In [0]:
def sentences_to_w2vec(language_encoding,revere_dictionary_language,language_model):
    import numpy as np
    sentence_level_w2vec_list=[]
    #arr = np.empty((2,), float)
    number_of_sentences=language_encoding.shape[0]
    for i in range(0,number_of_sentences):
        language_list_padded=[]
        #print (english_encoding[i])
        for key in language_encoding[i]:
            #print(revere_dictionary_english[key])
            word=(revere_dictionary_language[key])
            try:
                #print("Found word Shape of word vector",(english_model[word]).shape,arr.shape)
                language_list_padded.append(language_model[word])
            except KeyError:
                unk='<unk>'
                #print("not found! Assigning Unknown Vector",  (english_model[unk]).shape)
                language_list_padded.append(language_model[unk])
        #print(np.array(language_list_padded))
        sentence_level_w2vec_list.append((np.array(language_list_padded)))
    sentence_level_w2vec=np.array(sentence_level_w2vec_list)
    return(sentence_level_w2vec)

In [0]:
X=hindi_encoding
Y=english_encoding
#Y will remain the same.
Yoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y)))
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Yoh.shape:", Yoh.shape)

In [0]:
#Run this if you want word2vec instead of One hot encoding
#Naming it still as X0h and Yoh to avoid changes in too many places further.
#Yoh 
Xoh=sentences_to_w2vec(hindi_encoding,revere_dictionary_hindi,hindi_model)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

One might also like to get the sentences of only specific length from source as well as target, for example get all sentences which has maximum 5 words and in hindi maximum 8 words. Use below function and feed the length you need.

In [0]:
#dataset=Ndarray with following dimentions (sentence_length, 2)
#source_len is the length of language in dataset[0][1]
#target_len is the length of language in dataset[0][0]
def get_sentences_subset(dataset,source_len,target_len):
    limited=dataset
    indexes_list=[]
    for indexes in range(0,limited.shape[0]):
        #print(len(limited[i][0].split()),len(limited[i][1].split()))
        eng_len=len(limited[indexes][0].split()) 
        hin_len=len(limited[indexes][1].split())
        state1=(eng_len<=target_len)
        state2=(hin_len<=source_len)
        final=state2&state1
        #print(eng_len,hin_len,final)
        #print(state1,state2,final)
        if (final):
            indexes_list.append(indexes)
    #print(indexes_list,type(indexes_list))
    return(limited[indexes_list])