[View in Colaboratory](https://colab.research.google.com/github/meethariprasad/phd/blob/master/NMT_Project_V3_Attention_decoder.ipynb)

# Neural Machine Translation from Hindi to English


The assignment is to build a Neural Machine Translation (NMT) model that translates Hindi sentences into English.

We do this by building an attention model as described in "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio: https://arxiv.org/pdf/1409.0473.pdf
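To make the attention idea concrete, here is a minimal NumPy sketch of one decoder step of additive (Bahdanau-style) attention. It is an illustration only, not the `AttentionDecoder` layer used later in this notebook. The function name, the toy sizes, and the random weights are assumptions made for the example; `W_a`, `U_a`, and `v_a` follow the paper's notation.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One step of additive attention.

    s_prev: (n_dec,) previous decoder state
    H:      (T, n_enc) encoder annotations, one row per source word
    Returns the attention weights (T,) and the context vector (n_enc,).
    """
    # Alignment scores e_j = v_a^T tanh(W_a s_prev + U_a h_j)
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a
    # Softmax over source positions (shifted for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the annotations
    context = weights @ H
    return weights, context

rng = np.random.RandomState(0)
T, n_enc, n_dec, n_att = 5, 4, 3, 6   # toy sizes, chosen arbitrarily
H = rng.randn(T, n_enc)
s_prev = rng.randn(n_dec)
W_a = rng.randn(n_dec, n_att)
U_a = rng.randn(n_enc, n_att)
v_a = rng.randn(n_att)

weights, context = additive_attention(s_prev, H, W_a, U_a, v_a)
print(weights.round(3), weights.sum())
print(context.shape)
```

The weights form a probability distribution over the source words, so the context vector is a weighted average of the encoder annotations.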

We use the following small parallel corpus: http://www.manythings.org/anki/hin-eng.zip

Note: the Appendix at the end provides additional helper functions that can be used to improve the model in future work.

In [2]:
import os
os.listdir()

['datalab',
 '.config',
 '.keras',
 '.local',
 'dataset.pkl',
 '.forever',
 '.rnd',
 '.cache',
 '.gdfuse',
 '__pycache__',
 'drive',
 'clean_pairs.pkl',
 'best_model_new.h5',
 'attention_decoder.py',
 'hin.txt',
 'enghindi.txt',
 '.nv',
 '.ipython',
 '_about.txt']

## Loading Libraries.

In [3]:
# https://pypi.python.org/pypi/pydot
#Setting up graphviz (2.38.0-16ubuntu2) ...
#Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
try:
  !apt-get -qq install -y graphviz && pip install -q pydot
  import pydot
  import graphviz
except ModuleNotFoundError:
  print ("ModuleNotFoundError")

ModuleNotFoundError


In [4]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

gpg: keybox '/tmp/tmp37tzq9fw/pubring.gpg' created
gpg: /tmp/tmp37tzq9fw/trustdb.gpg: trustdb created
gpg: key AD5F235DF639B041: public key "Launchpad PPA for Alessandro Strada" imported
gpg: Total number processed: 1
gpg:               imported: 1
··········


In [5]:
!mkdir -p drive
!google-drive-ocamlfuse drive

fuse: mountpoint is not empty
fuse: if you are sure this is safe, use the 'nonempty' mount option


In [0]:
# Install the PyDrive wrapper & import libraries. This is to upload model & weights to your GDrive.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [9]:
!pip install keras==2.0.5



In [10]:
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
import random
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np
import gc
import os.path
#import pydot 
#import graphviz
 

#from faker import Faker
#import random
#from tqdm import tqdm
#from babel.dates import format_date
#from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


## 1 - Translating Source Language to Target Language

The model we build here could also be used to translate from English to Hindi, or be trained on any other parallel corpus.

### 1.1 - Dataset

In this section we download the dataset and prepare it with padding, integer encoding, and one-hot encoding.

In [0]:
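import requests, zipfile, io, os
#Download only when neither archive is present; plain not/and avoids the precedence trap of mixing == with &
if not os.path.isfile("hin-eng.zip") and not os.path.isfile("enghindi.zip"):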
  #https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/hin.zip
  r = requests.get("http://www.manythings.org/anki/hin-eng.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
  #https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/enghindi.zip
  r = requests.get("https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/enghindi.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
#import requests, zipfile, io,os
#r = requests.get("http://www.manythings.org/anki/hin-eng.zip")
#z = zipfile.ZipFile(io.BytesIO(r.content))
#z.extractall()
#Verifying if the file hin.txt are downloaded properly.
#if os.path.isfile("hin.txt"):
#    print('hin.txt exists')

In [89]:
print("If you want dataset to be reconstructed using new sentence length, enter 1")
a = input()
print(type(int(a)))
a=int(a)
if a == 1 and os.path.isfile("dataset.pkl"):
  print("yeah. I will delete dataset.pkl")
  print(os.remove("dataset.pkl"))
else:
  print("cool. Either it is already deleted or I Will proceed with already created dataset.pkl")

If you want dataset to be reconstructed using new sentence length, enter 1
1
<class 'int'>
yeah. I will delete dataset.pkl
None


In [0]:
if not os.path.isfile("dataset.pkl"):
  #Reading the file
  with open("hin.txt", 'r', encoding='utf-8') as file:
    content = file.read()

In [0]:
if not os.path.isfile("dataset.pkl"):
  with open("enghindi.txt", 'r', encoding='utf-16') as file:
    content2 = file.read()

In [0]:
#End of Sentence.
eos=" "+"</s>"

In [93]:
import string
import re
from pickle import dump, load
from unicodedata import normalize
from numpy import array

def save_clean_data(sentences, filename):
  # Pickle the cleaned sentence pairs to disk
  with open(filename, 'wb') as f:
    dump(sentences, f)
  print('Saved: %s' % filename)

def load_clean_sentences(filename):
  # Load pickled sentence pairs from disk
  with open(filename, 'rb') as f:
    return load(f)
  
if (os.path.isfile("dataset.pkl")==False):
  # load doc into memory
  def load_doc(filename,encode_format):
    # open the file as read only
    file = open(filename, mode='rt', encoding=encode_format)
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

  # split a loaded document into sentences
  def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

  # clean a list of lines
  def clean_pairs(lines,):
    cleaned = list()
    # prepare regex for char filtering
    re_punc = re.compile('[।%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    for pair in lines:
      clean_pair = list()
      for line in pair:
        # tokenize on white space
        line = line.split()
        # remove punctuation from each token
        line = [re_punc.sub('', w) for w in line]
        # remove tokens with numbers in them
        #line = [word for word in line if word.isalpha()]
        #line=re.sub('[।]', '', line)
        # store as string
        line=(' '.join(line))
        line=line+eos
        clean_pair.append(line.strip())
      cleaned.append(clean_pair)
    return array(cleaned)

  # save a list of clean sentences to file
  def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

  # load dataset
  filename = 'hin.txt'
  doc = load_doc(filename,encode_format='utf-8')
  # split into English-Hindi pairs
  pairs = to_pairs(doc)
  del doc

  filename2 = 'enghindi.txt'
  doc2 = load_doc(filename2,encode_format='utf-16')
  # split into English-Hindi pairs
  pairs2 = to_pairs(doc2)
  del doc2

  print(type(pairs),pairs[0:2],pairs2[0:2])
  # clean sentences
  clean_pairs = clean_pairs(pairs)
  lines=pairs2
  cleaned = list()
  re_punc = re.compile('[।%s]' % re.escape(string.punctuation))
  re_print = re.compile('[^%s]' % re.escape(string.printable))
  for pair in lines:
    clean_pair = list()
    for line in pair:
      line=line.strip()
      # tokenize on white space
      line = line.split()
      # remove punctuation from each token
      line = [re_punc.sub('', w) for w in line]
      # remove tokens with numbers in them
      #line = [word for word in line if word.isalpha()]
      #line=re.sub('[।]', '', line)
      # store as string
      line=(' '.join(line))
      line=line+eos
      clean_pair.append(line.strip())
    cleaned.append(clean_pair)
  clean_pairs2=(array(cleaned))
  print(type(clean_pairs2),clean_pairs2.shape,type(clean_pairs),clean_pairs.shape,(np.concatenate((clean_pairs2, clean_pairs))).shape)
  # save clean pairs to file
  print ("Number of clean pairs",clean_pairs.shape[0])
  # spot check
  for i in range(10):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
  clean_pairs=(np.concatenate((clean_pairs2, clean_pairs)))
  print(clean_pairs[1][1],len(((clean_pairs[1][1].split()))))
  save_clean_data(clean_pairs, 'clean_pairs.pkl')
  del clean_pairs,clean_pairs2,lines,re_punc,re_print,cleaned,pairs,pairs2

<class 'list'> [['Help!', 'बचाओ!'], ['Jump.', 'उछलो.']] [['fresh breath and shining teeth enhance your personality .', 'ताजा साँसें और चमचमाते दाँत आपके व्यक्तित्व को निखारते हैं ।'], ['your self-confidence also increases with teeth .', 'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है ।']]
<class 'numpy.ndarray'> (77145, 2) <class 'numpy.ndarray'> (2867, 2) (80012, 2)
Number of clean pairs 2867
[Help </s>] => [बचाओ </s>]
[Jump </s>] => [उछलो </s>]
[Jump </s>] => [कूदो </s>]
[Jump </s>] => [छलांग </s>]
[Hello </s>] => [नमस्ते </s>]
[Hello </s>] => [नमस्कार </s>]
[Cheers </s>] => [वाहवाह </s>]
[Cheers </s>] => [चियर्स </s>]
[Got it </s>] => [समझे कि नहीं </s>]
[Im OK </s>] => [मैं ठीक हूँ </s>]
दाँतों से आपका आत्मविश्‍वास भी बढ़ता है  </s> 8
Saved: clean_pairs.pkl
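The cleaning step can be tried on a few strings in isolation. The sketch below reuses the same punctuation pattern as the cell above; the helper name `strip_punctuation` is introduced here only for illustration.

```python
import re
import string

# Same pattern as in the cleaning cell: ASCII punctuation plus the Devanagari danda
re_punc = re.compile('[।%s]' % re.escape(string.punctuation))

def strip_punctuation(sentence):
    # Split on whitespace, drop punctuation inside each token, rejoin with spaces
    tokens = [re_punc.sub('', w) for w in sentence.split()]
    return ' '.join(tokens)

print(strip_punctuation('Help!'))
print(strip_punctuation("I'm OK."))
print(strip_punctuation('मैं ठीक हूँ।'))
print(repr(strip_punctuation('teeth .')))
```

A stand-alone punctuation token such as the final `.` in `teeth .` becomes an empty string, so the joined sentence keeps a trailing space. This is why some cleaned sentences show a double space before `</s>`.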


In [94]:
from pickle import load
from pickle import dump
from numpy.random import shuffle

if (os.path.isfile("dataset.pkl")==False):
  # load a clean dataset
  def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

  # save a list of clean sentences to file
  def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

  # load dataset
  raw_dataset = load_clean_sentences('clean_pairs.pkl')
  #del os.remove("clean_pairs.pkl")
  # reduce dataset size
  n_sentences = raw_dataset.shape[0]
  print (n_sentences)
  #Subsetting to 10000 records.
  #n_sentences=dataset.shape[0]
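  # random shuffle: use NumPy's shuffle, because random.shuffle swaps rows of a
  # 2-D array through views and can silently duplicate rows
  np.random.RandomState(4).shuffle(raw_dataset)
  dataset = raw_dataset[:n_sentences, :]
  print(dataset.shape)
  np.random.RandomState(4).shuffle(dataset)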
  del raw_dataset
# split into train/test
#train, test = dataset[:2800], dataset[2800:]
# save
#save_clean_data(dataset, 'dataset.pkl')
#save_clean_data(train, 'english-german-train.pkl')
#save_clean_data(test, 'english-german-test.pkl')

80012
(80012, 2)
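Shuffling the rows of a 2-D NumPy array deserves some care. The sketch below, on a toy array invented for the example, shows that swapping two rows with tuple assignment (the swap `random.shuffle` performs internally) duplicates a row, whereas NumPy's own shuffle permutes rows safely.

```python
import numpy as np

pairs = np.array([['a', 'A'], ['b', 'B'], ['c', 'C'], ['d', 'D']])

# Tuple-swapping two rows works on views: row 0 is overwritten
# before its old contents are read back into row 1.
broken = pairs.copy()
broken[0], broken[1] = broken[1], broken[0]
print(broken[:2])          # both rows now hold ['b', 'B']

# NumPy's own shuffle permutes rows along the first axis in place.
shuffled = pairs.copy()
np.random.RandomState(4).shuffle(shuffled)
print(sorted(map(tuple, shuffled)) == sorted(map(tuple, pairs)))
```

A sample dominated by one repeated sentence pair is a typical symptom of this pitfall.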


In [0]:
#Trying to get sentences of length say hindi=12 words or less also with English words 10 or less.
#limited=dataset[:10]
#print((len(limited[2][0].split()) < 100))
#print((len(limited[2][1].split()) < 100))
#state1=(len(limited[2][0].split()) < 100)
#state2=(len(limited[2][1].split())<100)
#print(state1,state2,state1&state2)
#print((len(limited[2][0].split()) < 100 & len(limited[2][1].split())<100))

def get_sentences_subset(limited,source_len,target_len):
    indexes_list=[]
    for indexes in range(0,limited.shape[0]):
        #print(len(limited[i][0].split()),len(limited[i][1].split()))
        eng_len=len(limited[indexes][0].split()) 
        hin_len=len(limited[indexes][1].split())
        state1=(eng_len<=target_len)
        state2=(hin_len<=source_len)
        final=state2&state1
        #print(eng_len,hin_len,final)
        #print(state1,state2,final)
        if (final):
            indexes_list.append(indexes)
    #print(indexes_list,type(indexes_list))
    return(limited[indexes_list])
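The same length filter can also be written with a boolean mask. This is an alternative sketch, not the function used in this notebook; the name `filter_by_length` and the toy pairs are assumptions made for the example.

```python
import numpy as np

def filter_by_length(pairs, max_source_words, max_target_words):
    # Column 0 holds the English (target) sentence, column 1 the Hindi (source) one
    target_len = np.array([len(s.split()) for s in pairs[:, 0]])
    source_len = np.array([len(s.split()) for s in pairs[:, 1]])
    mask = (target_len <= max_target_words) & (source_len <= max_source_words)
    return pairs[mask]

toy = np.array([
    ['Help </s>', 'बचाओ </s>'],
    ['Got it </s>', 'समझे कि नहीं </s>'],
    ['Im OK </s>', 'मैं ठीक हूँ </s>'],
])
print(filter_by_length(toy, 3, 3).shape)
print(filter_by_length(toy, 4, 3).shape)
```

Note that the `</s>` marker counts as a word, so a limit of 10 allows at most 9 real words per sentence.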

In [96]:
if (os.path.isfile("dataset.pkl")==False):
  print("Total sentences", dataset.shape[0])
  print("How many words long sentence in both language you need?")
  sentence_words=input()
  sentence_words=int(sentence_words)
  #print("How many words long sentence in Target Language you need?")
  #target_sentence_words=input()
  #target_sentence_words=int(target_sentence_words)
  raw_dataset_subset=get_sentences_subset(dataset,sentence_words,sentence_words)
  print("There are Number of Sentences are matching above criteria",raw_dataset_subset.shape)
  print("########################################################################")
  print("How many sentences you need?")
  n_sentences=input()
  n_sentences=int(n_sentences)
  #Subsetting to n sentences.
  #n_sentences=5000
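  #Use NumPy's shuffle: random.shuffle swaps rows of a 2-D array through views and can duplicate rows
  np.random.RandomState(4).shuffle(raw_dataset_subset)
  dataset = raw_dataset_subset[:n_sentences, :]
  np.random.RandomState(4).shuffle(dataset)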
  print(raw_dataset_subset.shape,dataset,dataset.shape)
  save_clean_data(dataset, 'dataset.pkl')
#i=[0,2,3]
#print("Limited",limited,i)
#print("\n Indexed",limited[i],i)

Total sentences 80012
How many words long sentence in both language you need?
10
There are Number of Sentences are matching above criteria (14211, 2)
########################################################################
How many sentences you need?
500
(14211, 2) [['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दाँतों से आपका आत्मविश्\u200dवास भी बढ़ता है  </s>']
 ['your selfconfidence also increases with teeth  </s>'
  'दा

In [0]:
def load_clean_sentences(filename):
  with open(filename, 'rb') as f:
    return load(f)
dataset = load_clean_sentences('dataset.pkl')

In [0]:
raw_dataset_subset = None
raw_dataset =None
del raw_dataset_subset,raw_dataset

In [99]:
#Check the data sample
dataset.shape

(500, 2)

In [0]:
#Converting it to tuples.
dataset_list=(list(tuple(map(tuple, dataset))))

In [0]:
#English Sentence List
english_sentences_list=list(dataset[:,0])

In [102]:
english_sentences_list[0]='Please make yourself at home'+eos
english_sentences_list[0]

'Please make yourself at home </s>'

In [0]:
#English Sentence Unique Word List and Length of Vocabulary
english_unique_words=set((' '.join(english_sentences_list)).split())
english_vocab_len=len(set((' '.join(english_sentences_list)).split()))

In [0]:
#Hindi Sentence List
hindi_sentences_list=list(dataset[:,1])
hindi_sentences_list[0]='इसको अपना घर ही समझो'+eos

In [0]:
#Hindi Sentence Unique Word List and Length of Vocabulary
hindi_unique_words=set((' '.join(hindi_sentences_list)).split())
hindi_vocab_len=len(set((' '.join(hindi_sentences_list)).split()))

In [0]:
#Creating Dictionary with Unknown and Pad elements
english_dictionary=dict(zip(sorted(english_unique_words) + ['<unk>', '<pad>'], list(range(len(english_unique_words) + 2))))
hindi_dictionary=dict(zip(sorted(hindi_unique_words) + ['<unk>', '<pad>'], list(range(len(hindi_unique_words) + 2))))

In [0]:
#Making Sure that 0th Value is pad value to ensure masking.
def return_adjusted_dictionary(english_dictionary):
  zero_value=0
  Key_pad='<pad>'
  for Key, Value in english_dictionary.items():
    if Value == zero_value:
      Key_zero=Key
      #print (Key)
    if Key == Key_pad:
      value_pad=Value
      #print (Value)
  english_dictionary.update({Key_pad: 0, Key_zero: value_pad})
  return(english_dictionary)

In [0]:
hindi_dictionary=return_adjusted_dictionary(hindi_dictionary)
english_dictionary=return_adjusted_dictionary(english_dictionary)

In [0]:
#Reverse Dictionary for both languages
revere_dictionary_hindi=dict((v,k) for k,v in hindi_dictionary.items())
revere_dictionary_english=dict((v,k) for k,v in english_dictionary.items())

In [110]:
#Storing the index of padding value in variables to add it going ahead.
english_padding_value=english_dictionary['<pad>']
hindi_padding_value=hindi_dictionary['<pad>']
print(english_padding_value,hindi_padding_value)

0 0
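An alternative to swapping entries after the fact is to reserve index 0 for `<pad>` while the vocabulary is built. The sketch below is one way to do that, not the approach taken in this notebook; `build_vocabulary` is a hypothetical helper, and it places `<unk>` at index 1 rather than at the end.

```python
def build_vocabulary(sentences):
    # Reserve index 0 for <pad> and index 1 for <unk>, then number the
    # sorted unique words from 2 upwards.
    words = sorted(set(' '.join(sentences).split()))
    vocab = {'<pad>': 0, '<unk>': 1}
    for word in words:
        vocab[word] = len(vocab)
    return vocab

toy_vocab = build_vocabulary(['Help </s>', 'Got it </s>'])
reverse_vocab = {index: word for word, index in toy_vocab.items()}
print(toy_vocab)
```

Because every word receives a distinct index, the reverse dictionary has the same number of entries as the forward one.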


In [111]:
#Global variables holding the maximum number of words found in a sentence of each language
max_english_words=max(len(line.split()) for line in english_sentences_list)
max_hindi_words=max(len(line.split()) for line in hindi_sentences_list)
print(max_english_words,max_hindi_words)

10 10


In [0]:
def get_padded_encoding(sentences_list,language_dictionary,max_language_words):
    padding_value=language_dictionary['<pad>']
    language_array=[]
    #Iterate over List.
    for sentence in sentences_list:
        #Replaces English words with English Vocabulary Indexes and Hindi with Hindi Vocabulary Indexes.
        #logic: if a word not in dictionary enters, it will be replaced by unk key value.
        single_sentence_array=[]
        for word in sentence.split(): 
            try:
                #single_sentence_array=([language_dictionary[word] for word in sentence.split()])
                single_sentence_array.append(language_dictionary[word])
            except KeyError:
                unk='<unk>'
                single_sentence_array.append(language_dictionary[unk])
        #Find the length of english_single_sentence_array
        length_single_sentence=(len(single_sentence_array))
        #So how many times padding dictionary key needs to be appended, if we say maximum length of sentences to be considered is eng_max_len.
        if (max_language_words>length_single_sentence):
            padding_count=(max_language_words-length_single_sentence)
        else:
            padding_count=0
        if (padding_count>0):
            for pad in range(0,padding_count):
                single_sentence_array.append(padding_value)
        else:
            single_sentence_array=single_sentence_array[0:max_language_words]
        #Append to main array
        language_array.append(single_sentence_array)
    #Convert to Numpy array at the end
    language_array=np.array(language_array)
    return(language_array)

Rather than padding to a large sentence length, we found empirically that it works better to restrict the data to short sentences, given the limited size of our corpus.
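One way to see the cost of a large maximum length is to measure how much of the encoded matrix would consist of `<pad>`. The sketch below uses a few toy sentences; `padding_fraction` is a hypothetical helper and the numbers illustrate the trend only, not the actual corpus.

```python
import numpy as np

def padding_fraction(sentences, max_words):
    # Share of positions in the (n_sentences, max_words) matrix filled with <pad>
    lengths = np.array([min(len(s.split()), max_words) for s in sentences])
    return 1.0 - lengths.sum() / (len(sentences) * max_words)

toy_sentences = ['Help </s>', 'Got it </s>', 'Im OK </s>',
                 'Please make yourself at home </s>']
for max_words in (6, 20):
    print(max_words, round(padding_fraction(toy_sentences, max_words), 3))
```

With short sentences, raising the cap mostly adds padding positions that carry no training signal.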

In [113]:
#Get encoded sentences
hindi_encoding=get_padded_encoding(hindi_sentences_list,hindi_dictionary,max_hindi_words)
english_encoding=get_padded_encoding(english_sentences_list,english_dictionary,max_english_words)
print(hindi_encoding.shape,english_encoding.shape)

(500, 10) (500, 10)


In [114]:
#Verifying the encoding and decoding for a sample data.
print(english_sentences_list[1],hindi_sentences_list[1])
print(english_encoding[1],hindi_encoding[1])
#Check if encoding gives back the same answer
for key in english_encoding[1]:
    print(revere_dictionary_english[key])
for key in hindi_encoding[1]:
    print(revere_dictionary_hindi[key])
english_dictionary['<pad>']
hindi_dictionary['<pad>']

your selfconfidence also increases with teeth  </s> दाँतों से आपका आत्मविश्‍वास भी बढ़ता है  </s>
[196 150   8  98 191 173 199   0   0   0] [ 86 191  14  10 142 131 203 210   0   0]
your
selfconfidence
also
increases
with
teeth
</s>
<pad>
<pad>
<pad>
दाँतों
से
आपका
आत्मविश्‍वास
भी
बढ़ता
है
</s>
<pad>
<pad>


0

In [0]:
#We will convert the english and hindi encodings to one hot encodings.
#Please note Input is of the dimension (number of sentences,max_length_language(every column is a word))
#Output is (number of sentences,Max_length_language(every row is a word),length of vocabulary)
#Basically every row of the onehotcode matrix must be for one word.
#How=1 => 1 0 0
#Are=2 => 0 1 0
#You=3 => 0 0 1
#We are trying to translate hindi to english, so our X is Hindi and Y is English
X=hindi_encoding
Y=english_encoding
hindi_encoding=None
english_encoding=None
dataset=None
english_sentences_list=None
hindi_sentences_list=None
del hindi_encoding,english_encoding,dataset,english_sentences_list,hindi_sentences_list
#Note: Instead of one hot we can use word embeddings for Xoh

#Xoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(hindi_dictionary)), X)))
#Yoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y)))
#print("X.shape:", X.shape)
#print("Y.shape:", Y.shape)
#print("Xoh.shape:", Xoh.shape)
#print("Yoh.shape:", Yoh.shape)
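The one-hot layout described in the comments above can be sketched with plain NumPy. The example assumes integer-encoded sentences with `<pad>` at index 0 and zeroes the pad column, mirroring what `get_single_onehot_array` does later in this notebook; the function name and toy data are illustrative.

```python
import numpy as np

def one_hot_with_masked_pad(encoded, vocab_size, pad_index=0):
    # encoded: (n_sentences, max_words) integer matrix
    one_hot = np.eye(vocab_size)[encoded]   # (n_sentences, max_words, vocab_size)
    one_hot[:, :, pad_index] = 0            # padded steps become all-zero rows
    return one_hot

toy_encoded = np.array([[1, 2, 3, 0],
                        [3, 1, 0, 0]])
toy_one_hot = one_hot_with_masked_pad(toy_encoded, vocab_size=4)
print(toy_one_hot.shape)
print(toy_one_hot[0])
```

Each row of `toy_one_hot[0]` corresponds to one word position, and the padded position at the end is a row of zeros.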

In [0]:
#sample_size=X.shape[0]
#test_start_index=(sample_size-100)
#test_batch_size=sample_size-test_start_index-2
#testXoh=get_single_onehot_array(X,test_start_index,test_batch_size,hindi_dictionary)
#testYoh=get_single_onehot_array(Y,test_start_index,test_batch_size,english_dictionary)
#print(sample_size,test_start_index,test_batch_size)

In [0]:
#print(Y.shape,X.shape,len(english_dictionary),len(hindi_dictionary))
#from keras.callbacks import ModelCheckpoint
#import pandas

#sample_size=X.shape[0]
#test_start_index=(sample_size-100)
#test_batch_size=sample_size-test_start_index-2
#train_iteration_size=test_start_index
#print("train_iteration_size",train_iteration_size)
#gc.collect()

In [0]:
#Every call will return batch size sized Y or x
#Intention generate the small dataset and train the model 
#and retrain the model again for next index small dataset and so on

#start_index=0
#iteration=0
#if (iteration>0):
#    start_index=iteration+batch_size
def get_single_onehot_array(Y,start_index,batch_size,dictionary):
    english_dictionary=dictionary
    end_index=start_index+batch_size
    if end_index>Y.shape[0]:
        end_index=Y.shape[0]
    if start_index>end_index:
        start_index=(end_index-1)
    result_array=[]
    for i in range(start_index,end_index):
        Yoh_single_sentence=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y[i])))
        Yoh_single_sentence=np.swapaxes(Yoh_single_sentence,0,1)
        Yoh_single_sentence=np.reshape(Yoh_single_sentence,(Yoh_single_sentence.shape[1],Yoh_single_sentence.shape[2]))
        #Column 0 of the one-hot encoding is zeroed after the loop: it is 1 only for index 0, our pad value, which is to be masked.
        #if i % 1000 == 0:
        #    print(i)
        #print(type(Yoh_single_sentence),Yoh_single_sentence.shape)
        result_array.append(Yoh_single_sentence)
        #result_array = np.append(result_array,Yoh_single_sentence, axis=0)
    Yoh = np.array(result_array)
    Yoh[:,:,0]=np.zeros(Yoh.shape[1])
    return Yoh

In [0]:
import requests, zipfile, io
if (os.path.isfile("attention_decoder.zip")==False):
  r = requests.get("https://github.com/meethariprasad/phd/raw/master/assignments/NLP/Translation/attention_decoder.zip")
  z = zipfile.ZipFile(io.BytesIO(r.content))
  z.extractall()
  os.listdir()
  del z,r

In [120]:
#del model
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder



# configure problem
n_features = len(english_dictionary)
input_features=len(hindi_dictionary)
n_timesteps_in = X.shape[1]
n_timesteps_out = Y.shape[1]
LSTM_Unitsize=150
input_embed_dimension=100
print(n_timesteps_in,n_timesteps_out,input_features,n_features,LSTM_Unitsize)
#Previous run: 20 20 8651 6768 150 model saved.
#model available to load and predict ready. 60% Validation Accuracy under 40000 data with 600 Epoch run on GPU

10 10 211 200 150


In [0]:
#Trying nce loss from tensorflow.
#nn.nce_loss
#del model2
#import keras.backend as K
import tensorflow as tf

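#Note: tf.nn.nce_loss also requires weights, biases and num_classes,
#so this wrapper raises a TypeError when Keras calls it with (y_true, y_pred).
def keras_nce_loss(tgt, pred):
    return tf.nn.nce_loss(labels=tgt,inputs=pred,num_sampled=100)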

In [157]:
#import gc
#locals()
# train LSTM
#model.fit(Xoh, Yoh, epochs=1, verbose=2)
# define model
#del model2
import gc
gc.collect()
from keras.layers import Dropout,Masking,Embedding
from attention_decoder import AttentionDecoder

if (os.path.isfile("main_model_weights_attn_new.h5")==False):
  model2 = Sequential()
  model2.add(Embedding(input_features, input_embed_dimension, input_length=n_timesteps_in,mask_zero=True))
  model2.add(Dropout(0.2))
  model2.add(LSTM(LSTM_Unitsize,return_sequences=True,activation='relu'))
  model2.add(Masking(mask_value=0.))
  model2.add(AttentionDecoder(LSTM_Unitsize, n_features))
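  #keras_nce_loss fails with a TypeError (nce_loss needs weights, biases and num_classes);
  #categorical cross-entropy fits the one-hot targets produced by the batch generator.
  model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
  #model2.save("model2_compiled.hd5")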

In [0]:
#from IPython.display import SVG
#from keras.utils.vis_utils import model_to_dot
#import pydot

#SVG(model_to_dot(model2).create(prog='dot', format='svg'))

In [124]:
#from IPython.display import SVG
#from keras.utils.vis_utils import model_to_dot
#import pydot

#SVG(model_to_dot(model2).create(prog='dot', format='svg'))
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 10, 100)           21100     
_________________________________________________________________
dropout_2 (Dropout)          (None, 10, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 10, 150)           150600    
_________________________________________________________________
masking_2 (Masking)          (None, 10, 150)           0         
_________________________________________________________________
AttentionDecoder (AttentionD (None, 10, 200)           393450    
Total params: 565,150
Trainable params: 565,150
Non-trainable params: 0
_________________________________________________________________
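The parameter counts in the summary above can be checked by hand. The sketch assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias). The `AttentionDecoder` count is taken from the summary as-is, since its formula depends on the custom layer's implementation. The helper names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

print(embedding_params(211, 100))   # 21100
print(lstm_params(100, 150))        # 150600
print(21100 + 150600 + 393450)      # 565150
```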


In [125]:
#Train & Validation Parameters
sample_size=X.shape[0]

# Or Adjust Samplesize if you want it to be small
#sample_size=20000

test_batch_size=100
test_start_index=(sample_size-test_batch_size)
test_end_index=sample_size
test_generator_batch_size=20
test_samples_per_epoc=test_batch_size
###############################################################


train_start_index=0
train_end_index=test_start_index
train_generator_batch_size=20
train_samples_per_epoc=100
###############################################################

print("Data Size",sample_size)
print("######################################")
print("train_start_index:",train_start_index)
print("train_end_index:",train_end_index)
print("train_generator_batch_size",train_generator_batch_size)
print("train_samples_per_epoc",train_samples_per_epoc)
print("######################################")

print("test_start_index:",test_start_index)
print("test_end_index:",test_end_index)
print("test_generator_batch_size:",test_generator_batch_size)
print("test_samples_per_epoc",test_samples_per_epoc)
print("######################################")

print("Approximate Epcs Required to Cover Samples given train_samples_per_epoc(sample_size/train_samples_per_epoc)",round(sample_size/train_samples_per_epoc))
epoc=round(sample_size/train_samples_per_epoc)*40
print("epoc:",epoc)
gc.collect()

Data Size 500
######################################
train_start_index: 0
train_end_index: 400
train_generator_batch_size 20
train_samples_per_epoc 100
######################################
test_start_index: 400
test_end_index: 500
test_generator_batch_size: 20
test_samples_per_epoc 100
######################################
Approximate Epcs Required to Cover Samples given train_samples_per_epoc(sample_size/train_samples_per_epoc) 5
epoc: 200


522

In [0]:
#Train_X=X[train_start_index:train_end_index]
#Train_Y=Y[train_start_index:train_end_index]
#Test_X=X[test_start_index:test_end_index]
#Test_Y=Y[test_start_index:test_end_index]

In [0]:
#For Train:start_index=0,train_iteration_size=train_sample_size
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)
#For Test:start_index=train_sample_size,train_iteration_size=(samples_size-train_sample_size-1)
#batch_generator(start_index,X,Y,train_iteration_size,hindi_dictionary,english_dictionary,n_s)
#Keras new version has timeseries generator which pretty much does same as the code below.

def batch_generator(label,train_start_index,train_end_index,X,Y,train_generator_batch_size,english_dictionary):
  Xoh_batch = np.zeros((train_generator_batch_size,X.shape[1]))
  Yoh_batch = np.zeros((train_generator_batch_size,Y.shape[1],len(english_dictionary)))
  batch_size=1
  #LOG_EVERY_N = 1000
  print(label,"start_index",train_start_index)
  start_index=train_start_index
  
  while True:
    for ind in range(train_generator_batch_size):
      if (start_index>=train_end_index):
        #Wrap around to the first example so that every slot of the batch is refilled
        if (label=="train"):
          print(label,"start_index before reset",start_index)
        start_index=train_start_index
      #For Sequential Index Generation
      index=start_index
      Xoh_batch[ind]=np.reshape(X[index],(1,X.shape[1]))
      Yoh_batch[ind]=get_single_onehot_array(Y,index,batch_size,english_dictionary)
      start_index=start_index+1
    #Yield copies so that batches already queued by Keras are not overwritten in place
    yield (Xoh_batch.copy(),Yoh_batch.copy())

With the generator above, if we set train_generator_batch_size = 10, each call yields a batch of 10 samples (features and labels). Within an epoch, fit_generator() keeps drawing batches until it reaches the limit passed as steps_per_epoch (here train_samples_per_epoc; Keras 2 counts this limit in batches, not samples). It then discards the used batches and repeats the process in the next epoch.
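The batching behaviour can be illustrated without Keras. The sketch below is a simplified stand-in for `batch_generator`, with a hypothetical name and toy data; it yields batches forever, wraps around at the end of the data, and then computes how many batches cover 400 training samples at a batch size of 20.

```python
import math
import numpy as np

def cyclic_batches(X, Y, batch_size):
    # Yield (X, Y) batches forever, wrapping around at the end of the data
    n = X.shape[0]
    start = 0
    while True:
        idx = [(start + k) % n for k in range(batch_size)]
        start = (start + batch_size) % n
        yield X[idx], Y[idx]

toy_X = np.arange(10).reshape(5, 2)
toy_Y = np.arange(5)
gen = cyclic_batches(toy_X, toy_Y, batch_size=2)
batches = [next(gen) for _ in range(3)]
print([y.tolist() for _, y in batches])   # [[0, 1], [2, 3], [4, 0]]

# Batches needed for one full pass over 400 samples in batches of 20
steps_for_full_pass = math.ceil(400 / 20)
print(steps_for_full_pass)                # 20
```

With 100 steps per epoch and 20 samples per batch, one epoch therefore passes over the 400 training samples about five times.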


In [128]:
train_label="train"
test_label="test"
train_batch_generator=batch_generator(train_label,train_start_index,train_end_index,X,Y,train_generator_batch_size,english_dictionary)
test_batch_generator=batch_generator(test_label,test_start_index,test_end_index,X,Y,test_generator_batch_size,english_dictionary)
print(train_batch_generator,test_batch_generator)

<generator object batch_generator at 0x7f214a8faca8> <generator object batch_generator at 0x7f214a8ca360>


In [129]:
#Run the model fit. save_best_only=True keeps only the model with the best val_acc; set it to False to save the model after every epoch.
from keras.callbacks import ModelCheckpoint,EarlyStopping
checkpoint = ModelCheckpoint('best_model_new.h5', monitor='val_acc', verbose=2,save_best_only=True, mode='auto',save_weights_only=False)
early_stop=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=50,mode='auto')
import pandas
gc.collect()
#pandas.DataFrame(model.fit(trainX, trainY, epochs=200, batch_size=100, validation_data=(testX, testY), callbacks=[checkpoint]).history).to_csv("history.csv")
#model.fit(trainX, trainY, epochs=200, batch_size=20, validation_data=(testX, testY), callbacks=[checkpoint])
model2.fit_generator(train_batch_generator,steps_per_epoch=train_samples_per_epoc
                     ,validation_data=test_batch_generator,validation_steps=test_samples_per_epoc
                     ,epochs=epoc
                     #,verbose=1
                     ,callbacks=[checkpoint,early_stop]
                     ,max_q_size=1
                     #,workers=5
                     ,initial_epoch=57
                    )

Epoch 58/200
train start_index 0
 19/100 [====>.........................] - ETA: 12s - loss: 3.4131 - acc: 0.1113train start_index before reset 400
test start_index 400
Epoch 00057: val_acc improved from -inf to 0.30520, saving model to best_model_new.h5
Epoch 59/200
 19/100 [====>.........................] - ETA: 6s - loss: 1.2721 - acc: 0.4087train start_index before reset 400

Epoch 00058: val_acc improved from 0.30520 to 0.44785, saving model to best_model_new.h5
Epoch 60/200
 19/100 [====>.........................] - ETA: 7s - loss: 0.4852 - acc: 0.5729train start_index before reset 400
Epoch 00059: val_acc improved from 0.44785 to 0.56685, saving model to best_model_new.h5
Epoch 61/200
 13/100 [==>...........................] - ETA: 7s - loss: 0.1294 - acc: 0.6462

 19/100 [====>.........................] - ETA: 7s - loss: 0.1859 - acc: 0.6434train start_index before reset 400
Epoch 00060: val_acc improved from 0.56685 to 0.59790, saving model to best_model_new.h5
Epoch 62/200
 20/100 [=====>........................] - ETA: 7s - loss: 0.1227 - acc: 0.6585train start_index before reset 400

Epoch 63/200
train start_index before reset 400
 20/100 [=====>........................] - ETA: 6s - loss: 0.0615 - acc: 0.6710train start_index before reset 400
Epoch 64/200
train start_index before reset 400
 14/100 [===>..........................] - ETA: 7s - loss: 0.0360 - acc: 0.6671

 20/100 [=====>........................] - ETA: 6s - loss: 0.0515 - acc: 0.6737train start_index before reset 400
Epoch 65/200
train start_index before reset 400
 20/100 [=====>........................] - ETA: 6s - loss: 0.0456 - acc: 0.6780train start_index before reset 400

Epoch 66/200
train start_index before reset 400
 21/100 [=====>........................] - ETA: 6s - loss: 0.0805 - acc: 0.6693train start_index before reset 400
Epoch 67/200
  1/100 [..............................] - ETA: 9s - loss: 0.0577 - acc: 0.6750train start_index before reset 400
 11/100 [==>...........................] - ETA: 7s - loss: 0.0121 - acc: 0.6768

 21/100 [=====>........................] - ETA: 6s - loss: 0.0152 - acc: 0.6814train start_index before reset 400
Epoch 68/200
  1/100 [..............................] - ETA: 9s - loss: 0.0148 - acc: 0.6850train start_index before reset 400
 21/100 [=====>........................] - ETA: 6s - loss: 0.0059 - acc: 0.6833train start_index before reset 400

Epoch 69/200
  1/100 [..............................] - ETA: 8s - loss: 0.0104 - acc: 0.6850train start_index before reset 400
 21/100 [=====>........................] - ETA: 6s - loss: 0.0040 - acc: 0.6833train start_index before reset 400
Epoch 70/200
  1/100 [..............................] - ETA: 8s - loss: 0.0027 - acc: 0.6800train start_index before reset 400
 10/100 [==>...........................] - ETA: 8s - loss: 0.0025 - acc: 0.6815

 22/100 [=====>........................] - ETA: 6s - loss: 0.0031 - acc: 0.6855train start_index before reset 400
Epoch 71/200
  2/100 [..............................] - ETA: 8s - loss: 0.0035 - acc: 0.7075train start_index before reset 400
 22/100 [=====>........................] - ETA: 6s - loss: 0.0025 - acc: 0.6850train start_index before reset 400

Epoch 72/200
  2/100 [..............................] - ETA: 8s - loss: 0.0031 - acc: 0.6950train start_index before reset 400
 22/100 [=====>........................] - ETA: 6s - loss: 0.0020 - acc: 0.6843train start_index before reset 400
Epoch 73/200
  2/100 [..............................] - ETA: 8s - loss: 0.0028 - acc: 0.6950train start_index before reset 400
 12/100 [==>...........................] - ETA: 7s - loss: 0.0014 - acc: 0.6833

 22/100 [=====>........................] - ETA: 6s - loss: 0.0017 - acc: 0.6843train start_index before reset 400
Epoch 74/200
  2/100 [..............................] - ETA: 8s - loss: 0.0015 - acc: 0.6950train start_index before reset 400
 23/100 [=====>........................] - ETA: 6s - loss: 0.0015 - acc: 0.6865train start_index before reset 400

Epoch 75/200
  3/100 [..............................] - ETA: 8s - loss: 0.1011 - acc: 0.6700train start_index before reset 400
 23/100 [=====>........................] - ETA: 6s - loss: 0.0655 - acc: 0.6680train start_index before reset 400
Epoch 76/200
  3/100 [..............................] - ETA: 8s - loss: 0.0075 - acc: 0.6967train start_index before reset 400
 12/100 [==>...........................] - ETA: 7s - loss: 0.0036 - acc: 0.6871

 23/100 [=====>........................] - ETA: 6s - loss: 0.0049 - acc: 0.6848train start_index before reset 400
Epoch 77/200
  3/100 [..............................] - ETA: 8s - loss: 0.0028 - acc: 0.6950train start_index before reset 400
 23/100 [=====>........................] - ETA: 6s - loss: 0.0020 - acc: 0.6848train start_index before reset 400

Epoch 78/200
  3/100 [..............................] - ETA: 8s - loss: 0.0015 - acc: 0.6967train start_index before reset 400
Epoch 79/200
  4/100 [>.............................] - ETA: 8s - loss: 0.0058 - acc: 0.7013train start_index before reset 400
 13/100 [==>...........................] - ETA: 7s - loss: 0.1771 - acc: 0.6765

Epoch 80/200
  4/100 [>.............................] - ETA: 8s - loss: 0.0027 - acc: 0.6975train start_index before reset 400

Epoch 81/200
  4/100 [>.............................] - ETA: 8s - loss: 0.0022 - acc: 0.6937train start_index before reset 400
Epoch 82/200
  4/100 [>.............................] - ETA: 8s - loss: 0.0016 - acc: 0.7012train start_index before reset 400
 12/100 [==>...........................] - ETA: 7s - loss: 0.2704 - acc: 0.6638

Epoch 83/200
  5/100 [>.............................] - ETA: 8s - loss: 0.0020 - acc: 0.7050train start_index before reset 400

Epoch 84/200
  5/100 [>.............................] - ETA: 8s - loss: 0.0966 - acc: 0.6870train start_index before reset 400
Epoch 85/200
  5/100 [>.............................] - ETA: 8s - loss: 0.0081 - acc: 0.7080train start_index before reset 400
 14/100 [===>..........................] - ETA: 7s - loss: 0.0041 - acc: 0.6914

Epoch 86/200
  5/100 [>.............................] - ETA: 8s - loss: 0.0018 - acc: 0.7060train start_index before reset 400

Epoch 87/200
  6/100 [>.............................] - ETA: 8s - loss: 0.0082 - acc: 0.7033train start_index before reset 400
Epoch 88/200
  6/100 [>.............................] - ETA: 8s - loss: 0.0048 - acc: 0.7050train start_index before reset 400
 13/100 [==>...........................] - ETA: 7s - loss: 0.0031 - acc: 0.6946

Epoch 89/200
  6/100 [>.............................] - ETA: 8s - loss: 0.0019 - acc: 0.7033train start_index before reset 400

Epoch 90/200
  6/100 [>.............................] - ETA: 7s - loss: 0.0012 - acc: 0.7025train start_index before reset 400
Epoch 91/200
  7/100 [=>............................] - ETA: 7s - loss: 0.0011 - acc: 0.7007train start_index before reset 400
 13/100 [==>...........................] - ETA: 7s - loss: 9.3043e-04 - acc: 0.6938

Epoch 92/200
  7/100 [=>............................] - ETA: 7s - loss: 9.0102e-04 - acc: 0.6993train start_index before reset 400


Epoch 93/200
  7/100 [=>............................] - ETA: 8s - loss: 7.8020e-04 - acc: 0.6971train start_index before reset 400
Epoch 94/200
  7/100 [=>............................] - ETA: 8s - loss: 6.2866e-04 - acc: 0.7007train start_index before reset 400
 12/100 [==>...........................] - ETA: 7s - loss: 5.6627e-04 - acc: 0.6942

Epoch 95/200
  8/100 [=>............................] - ETA: 8s - loss: 5.6500e-04 - acc: 0.6988train start_index before reset 400

Epoch 96/200
  8/100 [=>............................] - ETA: 7s - loss: 5.8005e-04 - acc: 0.6956train start_index before reset 400
Epoch 97/200
  8/100 [=>............................] - ETA: 8s - loss: 5.0781e-04 - acc: 0.6956train start_index before reset 400
 14/100 [===>..........................] - ETA: 7s - loss: 3.9993e-04 - acc: 0.6932

Epoch 98/200
  8/100 [=>............................] - ETA: 7s - loss: 4.1079e-04 - acc: 0.6950train start_index before reset 400

train start_index before reset 400
Epoch 99/200
  9/100 [=>............................] - ETA: 7s - loss: 3.9203e-04 - acc: 0.6956train start_index before reset 400
Epoch 100/200
  9/100 [=>............................] - ETA: 8s - loss: 3.5254e-04 - acc: 0.6944

train start_index before reset 400

KeyboardInterrupt: ignored

In [71]:
from keras.models import model_from_json
model_json = model2.to_json()
with open("model2.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model2.save_weights("model2_weights.h5")
print("Saved model to disk")
 
# later...
 
# load json and create model
with open('model2.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json,custom_objects={'AttentionDecoder': AttentionDecoder(LSTM_Unitsize, n_features)})
print("Loaded model from disk")


Saved model to disk
Loaded model from disk


In [72]:
# load weights into new model
loaded_model.load_weights("model2_weights.h5")
print("Loaded weights from disk")
 
# Compile the model
loaded_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Compile model")

Loaded weights from disk
Compile model


In [0]:
# Install the PyDrive wrapper & import libraries. This is to upload model & weights to your GDrive.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [74]:
# Create & upload a file.
modelname="model2.json"
uploaded = drive.CreateFile({'title': modelname})
uploaded.SetContentFile(modelname)
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

Uploaded file with ID 1onnsT2hFRDUfxnVi_oMoI99xVd20D6Ss


In [75]:
# Create & upload a file.
modelname="model2_weights.h5"
uploaded = drive.CreateFile({'title': modelname})
uploaded.SetContentFile(modelname)
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))


Uploaded file with ID 1oOQPd7_uUI7HXH6qMt-VhYc_f21deN6-


In [77]:
# Create & upload the best validation model, if the checkpoint saved one.
modelname="best_model_new.h5"
if os.path.isfile(modelname):
  uploaded = drive.CreateFile({'title': modelname})
  uploaded.SetContentFile(modelname)
  uploaded.Upload()
  print('Uploaded file with ID {}'.format(uploaded.get('id')))

Uploaded file with ID 1POtvelqieUz92ot9fjAcUMfZsAhwcB7_


In [0]:
from keras.models import load_model
#If you need best validation accuracy model
prediction_model = load_model('best_model_new.h5',custom_objects={'AttentionDecoder': AttentionDecoder(LSTM_Unitsize, n_features)})
#If you need overall model which was loaded with json, weights & compiled earlier
#prediction_model=loaded_model

In [79]:
#Test
test_index=test_start_index+14
test=np.reshape(X[test_index],(1,X.shape[1]))
prediction=np.round(prediction_model.predict(test))
predicted_argmax=np.argmax(prediction,axis=2)
predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
Y[test_index],predicted_argmax

(array([2654, 2480,  465,  249, 2491, 5521, 1165,  826, 4759,  249, 3673,
        3927, 4143, 4155, 2653, 6767,    0,    0,    0,    0]),
 array([2654, 2480,  465,  249, 2491, 5521, 1165,  826, 4759,  249, 3673,
        3927, 4143, 4155, 2653, 6767, 6767,    0, 6767, 6767]))
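The cell above rounds the predicted probabilities before taking the argmax. The sketch below uses made-up probabilities (not output from this model) to show how that can differ from an argmax on the raw probabilities. When no class reaches 0.5, rounding turns the whole row into zeros and argmax falls back to index 0, which the outputs in this notebook display as `<pad>`.

```python
import numpy as np

# Hypothetical softmax output: 1 sentence, 2 time steps, 3-word vocabulary.
probs = np.array([[[0.1, 0.7, 0.2],    # confident step: class 1
                   [0.3, 0.3, 0.4]]])  # uncertain step: best class is 2, but below 0.5

# Rounding first: the uncertain row becomes all zeros, so argmax returns 0.
rounded_ids = np.argmax(np.round(probs), axis=2)[0]

# Argmax on the raw probabilities keeps the most likely class.
direct_ids = np.argmax(probs, axis=2)[0]

print(rounded_ids, direct_ids)
```

If unexpected `<pad>` tokens appear in the middle of a prediction, comparing both variants on the same input is one way to check whether rounding is the cause.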

In [80]:
#Predict for test sentence index which model has not seen yet.
print("Pure Testacases are from following sentence numbers in X",test_start_index)
#Given encoding matrix of sentence & dictionary, get the sentence
def return_sentences(X,Y,revere_dictionary_english,test_index,model2):
  
  def return_predicted_array(X,test_index,model2):
    test=np.reshape(X[test_index],(1,X.shape[1]))
    encoding_prediction=np.round(model2.predict(test))
    predicted_argmax=np.argmax(encoding_prediction,axis=2)
    predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
    return predicted_argmax
  
  encoding_prediction=return_predicted_array(X,test_index,model2)
  
  encoding_actual=Y[test_index]
  
  def return_sentence_list(encoding,revere_dictionary_english,test_index):
    #print(test_index)
    sentence=list()
    for key in encoding:
      key=int(key)
      #print(type(int(key)))
      #print(revere_dictionary_english[key])
      sentence.append(revere_dictionary_english[key])
    return sentence

  def concatenate_list_data(words):
      #Join the words into one string, adding a space after each word
      result= ''
      for element in words:
          result += str(element)
          result += " "
      return result
  actual_sentence=return_sentence_list(encoding_actual,revere_dictionary_english,test_index)
  actual_sentence=concatenate_list_data(actual_sentence)
  
  predicted_sentence=return_sentence_list(encoding_prediction,revere_dictionary_english,test_index)
  predicted_sentence=concatenate_list_data(predicted_sentence)
  return(actual_sentence,predicted_sentence)

#print(test_index)
test_sentences=10
for test_sentence_index in range(test_start_index,test_start_index+test_sentences):
  Actual,Predicted=return_sentences(X,Y,revere_dictionary_english,test_sentence_index,prediction_model)
  print("#############################")
  print("Actual Sentence is:")
  print(Actual)
  print("Predicted Sentence is:")
  print(Predicted)
  print("#############################")

Pure Testacases are from following sentence numbers in X 54900
#############################
Actual Sentence is:
administer the dpt vaccine to the child </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
Predicted Sentence is:
administer the dpt vaccine to the child </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> 
#############################
#############################
Actual Sentence is:
do not clean deep wounds yourself </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
Predicted Sentence is:
do not clean deep wounds yourself </s> </s> </s> </s> </s> </s> <pad> <pad> </s> </s> </s> </s> </s> </s> 
#############################
#############################
Actual Sentence is:
clean the mouth after meal </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 
Predicted Sentence is:
clean the mouth after meal </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> 

[Click here to open Google Input Tools, type a Hindi sentence there, then copy it and paste it into the cell below when it asks for input.](https://www.google.co.in/inputtools/try/)

In [85]:
#Enter source language sentence from google transliterate https://www.google.co.in/inputtools/try/
print("Enter Sentences less than 15 words. As of now that is what is set.")
user_sentence=input()
user_sentence=user_sentence+eos
print(type(user_sentence))

Enter Sentences less than 15 words. As of now that is what is set.
आपका दांत सुन्दर है 
<class 'str'>


In [86]:
#Get Encoding: map each word to its id, falling back to <unk> for words missing from the dictionary
words=user_sentence.split()
user_encoding=[]
for word in words:
  user_encoding.append(hindi_dictionary.get(word,hindi_dictionary['<unk>']))
#Pad with 0 or truncate so the length matches the model input length X.shape[1]
if (X.shape[1]>len(user_encoding)):
  padding_count=X.shape[1]-len(user_encoding)
  user_encoding.extend([0]*padding_count)
else:
  user_encoding=user_encoding[0:X.shape[1]]
user_encoding=np.array(user_encoding)
user_encoding

array([ 555, 8649, 7981, 8512, 8650,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0])
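The encoding steps above (dictionary lookup, `<unk>` fallback, then padding or truncation) can be wrapped in one helper. The following is a minimal sketch, not the notebook's own code: `encode_sentence` and `toy_dictionary` are hypothetical names, and it assumes that id 0 is the padding id, as in the array shown above.

```python
def encode_sentence(sentence, dictionary, max_len, pad_id=0, unk_token='<unk>'):
    """Convert a sentence to a fixed-length list of word ids."""
    # Unknown words fall back to the id of the <unk> token.
    ids = [dictionary.get(word, dictionary[unk_token]) for word in sentence.split()]
    # Truncate long sentences, then right-pad short ones up to max_len.
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Toy dictionary for illustration only; the real one is hindi_dictionary.
toy_dictionary = {'<pad>': 0, '<unk>': 1, 'hello': 2, 'world': 3}
encoded = encode_sentence('hello there world', toy_dictionary, max_len=5)
print(encoded)  # [2, 1, 3, 0, 0]
```

In the notebook this would be called with `hindi_dictionary` and `X.shape[1]`, and the result wrapped in `np.array` before prediction.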

In [87]:
def return_predicted_array_user(user_encoding,model2):
  test=np.reshape(user_encoding,(1,X.shape[1]))
  encoding_prediction=np.round(model2.predict(test))
  predicted_argmax=np.argmax(encoding_prediction,axis=2)
  predicted_argmax=np.reshape(predicted_argmax,(predicted_argmax.shape[1],))
  return predicted_argmax

#from keras.models import load_model
#bestmodel = load_model('complete_model_with_weigths.h5')

predicted_user=return_predicted_array_user(user_encoding,model2)
print("Predicted Array",predicted_user)

def return_sentence_list(encoding,revere_dictionary_english):
  #Map each predicted id back to its English word
  sentence=list()
  for ind in encoding:
    sentence.append(revere_dictionary_english[ind])
  return sentence
  
user_translation_list=return_sentence_list(predicted_user,revere_dictionary_english)

def concatenate_list_data(words):
  #Join the words into one string, adding a space after each word
  result= ''
  for element in words:
    result += str(element)
    result += " "
  return result

predicted_sentence=concatenate_list_data(user_translation_list)
print("Predicted Sentence")
print("###################################")
print(predicted_sentence)

Predicted Array [3091    0    0 4135    0 6767 6767 6767    0    0    0    0    0 6767
 6767 6767 6767 6767 6767 6767]
Predicted Sentence
###################################
its <pad> <pad> of <pad> </s> </s> </s> <pad> <pad> <pad> <pad> <pad> </s> </s> </s> </s> </s> </s> </s> 


We can now see the results on new examples.

## 5 References

Neural Machine Translation by Jointly Learning to Align and Translate: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio https://arxiv.org/pdf/1409.0473.pdf

https://machinelearningmastery.com

https://www.coursera.org/

https://www.udemy.com/

## Appendix

One way to improve the model is to replace the one-hot word encodings, whose length equals the vocabulary size, with fixed-length word2vec vectors for each word.
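To make the size difference concrete, here is a small sketch comparing the two representations. The sizes are toy values, and `embedding_matrix` is a random stand-in for trained word2vec vectors, so the numbers themselves carry no meaning.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3   # toy sizes; a real vocabulary is far larger
token_ids = np.array([2, 5, 0])    # one encoded sentence of three words

# One-hot: each word becomes a vector as long as the vocabulary.
one_hot = np.eye(vocab_size)[token_ids]

# Dense vectors: each word becomes a short fixed-length vector.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))  # stand-in for word2vec
dense = embedding_matrix[token_ids]

print(one_hot.shape, dense.shape)  # (3, 6) (3, 3)
```

Looking up a row of the matrix gives the same result as multiplying the one-hot vector by the matrix, so the dense form carries the word identity in far fewer dimensions.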

Another option is to train only on short sentences.

The section below provides helper functions for both tasks.

In [0]:
#Converting input to word2vec.
def sentences_to_word2vec_input_format(language_sentences_list):
    word2vec_sentence_feed=list()
    for sentence in language_sentences_list:
        word2vec_sentence_feed.append(sentence.split())
    return(word2vec_sentence_feed)
english_sentences_w2v_format=sentences_to_word2vec_input_format(english_sentences_list)
hindi_sentences_w2v_format=sentences_to_word2vec_input_format(hindi_sentences_list)

In [0]:
from gensim.models import Word2Vec
# train model
english_model = Word2Vec(english_sentences_w2v_format, min_count=1)
english_words_vocab = list(english_model.wv.vocab)
hindi_model = Word2Vec(hindi_sentences_w2v_format, min_count=1)
hindi_words_vocab = list(hindi_model.wv.vocab)

In [0]:
def sentences_to_w2vec(language_encoding,revere_dictionary_language,language_model):
    import numpy as np
    sentence_level_w2vec_list=[]
    #arr = np.empty((2,), float)
    number_of_sentences=language_encoding.shape[0]
    for i in range(0,number_of_sentences):
        language_list_padded=[]
        #print (english_encoding[i])
        for key in language_encoding[i]:
            #print(revere_dictionary_english[key])
            word=(revere_dictionary_language[key])
            try:
                #Look up the trained vector for this word
                language_list_padded.append(language_model.wv[word])
            except KeyError:
                #Word is not in the word2vec vocabulary: fall back to the <unk> vector
                unk='<unk>'
                language_list_padded.append(language_model.wv[unk])
        #print(np.array(language_list_padded))
        sentence_level_w2vec_list.append((np.array(language_list_padded)))
    sentence_level_w2vec=np.array(sentence_level_w2vec_list)
    return(sentence_level_w2vec)

In [0]:
X=hindi_encoding
Y=english_encoding
#Y will remain the same.
Yoh=np.array(list(map(lambda x: to_categorical(x, num_classes=len(english_dictionary)), Y)))
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Yoh.shape:", Yoh.shape)

In [0]:
#Run this if you want word2vec instead of one-hot encoding.
#The names Xoh and Yoh are kept to avoid changes in too many places further on.
Xoh=sentences_to_w2vec(hindi_encoding,revere_dictionary_hindi,hindi_model)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

You may also want to keep only sentence pairs within specific length limits on both the source and the target side, for example pairs with at most 5 English words and at most 8 Hindi words. Use the function below and pass the limits you need.

In [0]:
#dataset=ndarray of shape (number_of_sentence_pairs, 2)
#source_len is the maximum number of words allowed in the source (Hindi) sentence, dataset[i][1]
#target_len is the maximum number of words allowed in the target (English) sentence, dataset[i][0]
def get_sentences_subset(dataset,source_len,target_len):
    limited=dataset
    indexes_list=[]
    for indexes in range(0,limited.shape[0]):
        #print(len(limited[i][0].split()),len(limited[i][1].split()))
        eng_len=len(limited[indexes][0].split()) 
        hin_len=len(limited[indexes][1].split())
        state1=(eng_len<=target_len)
        state2=(hin_len<=source_len)
        final=state2&state1
        #print(eng_len,hin_len,final)
        #print(state1,state2,final)
        if (final):
            indexes_list.append(indexes)
    #print(indexes_list,type(indexes_list))
    return(limited[indexes_list])
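The same filtering idea can be tried on a few toy pairs. The sketch below is a plain-Python variant with a hypothetical name (`filter_pairs_by_length`), and the second element of each pair is a placeholder token sequence rather than real Hindi text. It assumes the same column order as above: English first, Hindi second.

```python
def filter_pairs_by_length(pairs, source_len, target_len):
    """Keep pairs whose English side has at most target_len words
    and whose Hindi side has at most source_len words."""
    return [(eng, hin) for eng, hin in pairs
            if len(eng.split()) <= target_len and len(hin.split()) <= source_len]

# Placeholder data: letters stand in for Hindi words.
toy_pairs = [
    ("i am fine", "a b c d"),                            # 3 and 4 words: kept
    ("this is a much longer english sentence", "a b"),   # 7 English words: dropped
    ("hello", "a b c d e f g h i"),                      # 9 source words: dropped
]
short_pairs = filter_pairs_by_length(toy_pairs, source_len=8, target_len=5)
print(short_pairs)  # [('i am fine', 'a b c d')]
```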