# Neural Machine Translation - Assignment 2

In this task you will develop a neural machine translation (NMT) system to translate text from one language to another. For this, you wil need to chose the data to train the models, perform data processing and train a sequence2sequence neural model.


## Section 1- Data Collection and Preprocessing 


---


**Task 1  (5 marks)**

---

There are few datasets to train an NMT system available from Tatoeba Project (http://www.manythings.org/anki/) or OPUS project (http://opus.nlpl.eu/).

*  Download a langauge pair (preferably European language) and **extract** the file(s) and upload it to colab
*  Create a list of lines by splitting the text file at every occurance of '\n'
*  Print number of sentences
*  Limit the amount of senteces to 10,000 lines (but more than 5,000 lines)
*  Split the data into train and test [You can split validation set here or while training use kerase validation_split option]
*  Print 100th sentence in original script[ not unicode] for source and target language


In [0]:
#This code has been written to upload the file in collab. If some one is running the code in local then no need to run this code block.
from google.colab import files
uploaded = files.upload()
from nltk.tokenize import sent_tokenize

In [0]:
#This code has been written to read the file and then split the file to make a list of all the lines where "\n" occurs. It is giving me a list of list of
#all. Then I have flatten the list to get a single list having all the lines.
def linesSplitting(filename):
  route = []
  with open(filename, 'r') as french_file:
      lines=french_file.read().split('\n')
  #NE_flat_list = [item.strip() for sublist in route for item in sublist]
  return lines
lines = linesSplitting('fra.txt')

In [2]:
#To print the number of lines in the list for the lines and the 100th of the original script for source and target language
print(len(lines))
print(lines[100])

160873
Come on.	Viens !


In [0]:
#Here I have taken a sample of 10000 sentences from the lsi and renamed the list as sampled_list. This will be used to create the train and test dataset.
sample_list = lines[0:10000]

In [0]:
#To split the train and test from the sample_list
from sklearn.model_selection import train_test_split 
train,test = train_test_split(sample_list, test_size=0.20)

**Task 2 (5 marks)** 

---

* Add '\t' to denote begining of sentence and '\n'  or '<eos\>' to denote end of the sentence to the each target line.
* Preprocess (word tokenisation, lowecasing) the text.

In [0]:
#This code block has been written to split the sample list in to input text and target text. The total data is needed to make the vocabulary which will be
#passed to embedding layer to get the embedding of the words. These will then use in test as well as train data to train the model and to decode the 
#test data translation
#In the target sentence I have added "start_" to indicate the starting of the sentence and "_end" to denote end of the sentence.
#I have spliited the sentence based on tab seperator.
from collections import Counter
input_texts = list()
target_texts = list()
for line in sample_list[: min(20000, len(sample_list) - 1)]:
  input_text, target_text = line.split('\t')
  target_text = 'start_ ' + target_text.lower() + ' _end'
  input_texts.append(input_text.lower())
  target_texts.append(target_text)

In [0]:
#This code I have written to split the train and test data seperately into traget data and input data.
#Here I have done the same thing like previous adding start and end token in the target sentence.
#The seperator here also is tab seperator.
from collections import Counter
tar_count_train = Counter()
tar_count_test = Counter()
input_texts_train = list()
target_texts_train = list()
input_texts_test = list()
target_texts_test = list()

for line in train[: min(10000, len(train) - 1)]:
  input_text_train, target_text_train = line.split('\t')
  target_text_train = 'start_ ' + target_text_train.lower() + ' _end'
  input_texts_train.append(input_text_train.lower())
  target_texts_train.append(target_text_train)
  
  
for line in test[: min(10000, len(test) - 1)]:
  input_text_test, target_text_test = line.split('\t')
  target_text_test = 'start_ ' + target_text_test.lower() + ' _end'
  input_texts_test.append(input_text_test.lower())
  target_texts_test.append(target_text_test)


In [0]:
#This code has been written to download the nltk dependencies which will be used further. If some is using collab then this code needs to be run.
#If some on is running the code in local machine then no need to run this if nltk is already installed.
import nltk
nltk.download('punkt')
nltk.download('all')

In [0]:
#In this code I have tokenized the words of input text set and target text set in target and source vocabulary. I have used the nltk word tokenizer for this.
#The source vocabulary and target vocabulary consists all the unique words that are there in the sample list.
#Also I ahve made a list where I have put the source and target text data that will be used further.
import numpy as np
Source_vocabulary=set()
source_length_list=[]
for line in input_texts:
    source_length_list.append(len(nltk.word_tokenize(line)))
    for word in nltk.word_tokenize(line):
        if word not in Source_vocabulary:
          Source_vocabulary.add(word)
    
target_vocabulary=set()
target_length_list=[]
for line in target_texts:
    target_length_list.append(len(nltk.word_tokenize(line)))
    for word in nltk.word_tokenize(line):
        if word not in target_vocabulary:
            target_vocabulary.add(word)


In [0]:
#Here I am taking the length of the source and target vocabulary to get the number of tokens present which will then be used in encoder and decoder
#part.
num_source_tokens = len((Source_vocabulary))
num_target_tokens = len((target_vocabulary))
#This code has been written to get the maximum sequence of the source and data set.
max_source_seq_length = np.max(source_length_list)
max_target_seq_length = np.max(target_length_list)
#I am sorting the source and target vocabulary to get the final source and target vocabulary.
Source_vocabulary = sorted(list(Source_vocabulary))
target_vocabulary = sorted(list(target_vocabulary))

In [10]:
#Printing all the values 
print('Number of samples:', len(input_texts))
print('Number of unique source language tokens:', num_source_tokens)
print('Number of unique target language tokens:', num_target_tokens)
print('Max sequence length of source language:', max_source_seq_length)
print('Max sequence length of target language:', max_target_seq_length)
print("Source Vocabulary",Source_vocabulary)
print("Target Vocabulary",target_vocabulary)

Number of samples: 9999
Number of unique source language tokens: 2106
Number of unique target language tokens: 4585
Max sequence length of source language: 6
Max sequence length of target language: 13
Source Vocabulary ['!', '$', '%', '&', "'d", "'ll", "'m", "'re", "'s", "'ve", ',', '.', '100', '17', '19', '30', '3:30', '5', '50', '65', '8:30', '99', '?', 'a', 'abandon', 'aboard', 'about', 'above', 'absent', 'absurd', 'accept', 'ache', 'ached', 'aches', 'act', 'action', 'active', 'actor', 'acts', 'adaptable', 'addicted', 'adjust', 'admire', 'adopted', 'adorable', 'adore', 'adores', 'adult', 'adults', 'advice', 'afford', 'afraid', 'after', 'afternoon', 'again', 'against', 'age', 'agent', 'agree', 'agreed', 'agrees', 'ahead', 'ai', 'aim', 'air', 'airs', 'alert', 'alive', 'all', 'allow', 'almost', 'alone', 'along', 'already', 'alright', 'also', 'always', 'am', 'amazed', 'amazing', 'ambition', 'ambitious', 'ambush', 'american', 'amnesia', 'amuse', 'amused', 'amusing', 'an', 'and', 'angel',

** Task 3 (10 marks)**

---

*  Assign each unique word to an integer value (5 marks).
*  Create word embedding for your vocabulary using pretrained Glove embeddigns (5 marks) (http://nlp.stanford.edu/data/glove.6B.zip) [see Lab 7]
* Print the first line of the embeddings (see below) 

In [0]:
#Here I have assigned each unique character to an integer and mapped in such way that it can be treated as index.
#This has been done for both source and target vocabulary.
input_token_index = dict([(word,i) for i, word in enumerate(Source_vocabulary)])
target_token_index= dict([(word,i) for i, word in enumerate(target_vocabulary)])

In [0]:
#In this part I have built three matrix which are initiated with zeros.
#The input matrix has been made according to the lenth of the train data and the maximum sequence that is present in the train source and target data,
#The decoder target data is a 3 dimensional array and the other two matrices are 2 dimensional array
import numpy as np
encoder_input_data=np.zeros((len(input_texts_train),6),dtype='float32')
decoder_input_data=np.zeros((len(target_texts_train), 13),dtype='float32')
decoder_target_data=np.zeros((len(target_texts_train), 13, num_target_tokens),dtype='float32')

In [13]:
#This code has been written to download the glove vector from the standford side.
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2019-04-19 13:52:54--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-04-19 13:52:54--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-04-19 13:54:27 (8.83 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [14]:
#This code is to unzip the glove vector data
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [15]:
#Here I have made a dictionary to get the embeddings from the glove vector reading the 50 dimensional data
embeddings_index = dict()
f = open('./glove.6B.50d.txt', 'r', encoding='utf8', errors='ignore')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [0]:
#With the help of the input token index dictionary, I have extracted all the embeddings of the words from glove vector and put it into embedding matrix
embedding_matrix = np.zeros((num_source_tokens, 50))
for word,i in input_token_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [26]:
#print(#print first line of embeddings here)
print(embedding_matrix[0])
#print(target_embedding[15])

[-0.58402002  0.39030999  0.65281999 -0.34029999  0.19493    -0.83489001
  0.11929    -0.57291001 -0.56844002  0.72988999 -0.56975001  0.53435999
 -0.38034001  0.22471     0.98031002 -0.29660001  0.126       0.55221999
 -0.62737    -0.082242   -0.085359    0.31514999  0.96077001  0.31986001
  0.87878001 -1.51890004 -1.78310001  0.35639     0.96740001 -1.54970002
  2.33500004  0.84939998 -1.23710001  1.06229997 -1.4267     -0.49056
  0.85465002 -1.28779995  0.60203999 -0.35962999  0.28586    -0.052162
 -0.50818002 -0.63459003  0.33888999  0.28415999 -0.2034     -1.23380005
  0.46715     0.78858   ]


In [0]:
#In the below code I have taken the data of train input and train target data. Then put into two different set named as input text and target text.
#As I am doing based on the word prediction, I have taken the sequence of the words and put it into the input matrix with the defining time steps as well.
#In the target decoder input part the same has been done for the target data.
#Now as the last layer which will come out from the target output will be based on softmax to give the probability distibuition of the targte words,
#I have defined one hot vector in the decoder targte data matrix for the index position time steps and the words index.
#This matices will be passed into the model to train the model
for i, (input_text,target_text) in enumerate(zip(input_texts_train,target_texts_train)):
  for t, word in enumerate(nltk.word_tokenize(input_text)):
        encoder_input_data[i, t] = input_token_index[word]
  for t, word in enumerate(nltk.word_tokenize(target_text)):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t] = target_token_index[word]
        #print(decoder_input_data[i, t,target_token_index[word.lower()]])
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[word]] = 1.

## Section 2 Translation Model training



---



**Task 3 (20 marks)**
* Provide code for the encoder using Keras LSTM (5 marks)
* Provide code for the decoder using Keras LSTM (5 marks)
* Train the sequence2sequence (encoder-decoder) model (10 marks)


In [0]:
#Setting the hyper parameters for the model
batch_size=64
epochs= 10
latent_dim =50 # latent dimensionality of the encoding space

In [0]:
#In this code I have defined the encoder model. Here I have used the glove embedding for my words as weight matrix and then put it into the keras 
#embedding layer. By default keras goes to train the embedding model as well. To stop that I have freez the embedding layer making that non-trainable.
#The input length has been defined as the maximum source sequence length.
#A LSTM layer has been added in this model.
#Only the states has been kept and output has been ommitted as the states will be used in decoder.
import numpy as np
import keras, tensorflow
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.layers import Embedding
# encoder code goes here
encoder_inputs = Input(shape=(None,))
en_x=  Embedding(num_source_tokens,
                            50,
                            weights=[embedding_matrix],
                            input_length=max_source_seq_length,
                            trainable=False)(encoder_inputs)
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.


In [0]:
# decoder code goes here
# In this code I have written the model of the decoder. The Decoder will have an input where the source data will be fitted in the training times.
#The Embedding keras inbuild layer has been used as I do not have the glove vector for the source language so I have passed through the keras in built embedding
#layer. The Layers has one LTSM layer with return state and sequence as true. Then the outputs and states has been recorded with the initial state as encoder
#hidden state as it consists of the activation of the source words. Then a dense layer has been given with the softmax which will give hte output probability
#to predict which words should be taken when we will decode the sequence.
decoder_inputs = Input(shape=(None,))

dex=  Embedding(num_target_tokens, 50) #, mask_zero=True

final_dex= dex(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(final_dex,
                                     initial_state=encoder_states)

decoder_dense = Dense(num_target_tokens, activation='softmax')

decoder_outputs = decoder_dense(decoder_outputs)

In [0]:
#Model has been made with categorical crossentropy as the loss as it is predicting like classification probability and optimizer is used rmsprop.
#The accuracy matrix has been used to check the training set and validation set accuracy.
model= Model(inputs=[encoder_inputs, decoder_inputs],outputs=decoder_outputs)
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['acc'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 6, 50)        105300      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 50)     229250      input_2[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LS

In [0]:
#Here we are fidding the input and output of the encoder and decoder to the model and spliting the data into 5% percent for the validation.
#The epoch has been made as 50 to train the model
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=50,
          validation_split=.05)  
#model.save('seq2seq_source_target.h5')

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 7599 samples, validate on 400 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7efe45409470>

# Section 3 Testing


---


** Task 4 (20 marks) **


---



*   Use the trained model to translate the text from source into target language (10 marks). Use the test/evaluation set (see Section 1) and perform an automatic evaluation with the BLEU metric (10 marks). use the NLTK library to calculate BLEU.




In [0]:
#Your code goes here
#In this code I have made the dummy model with the encoder input and encoder states. This model will be used to predict the target sentence
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, None)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 6, 50)             105300    
_________________________________________________________________
lstm_1 (LSTM)                [(None, 50), (None, 50),  20200     
Total params: 125,500
Trainable params: 20,200
Non-trainable params: 105,300
_________________________________________________________________


In [0]:
#Here a dummy same model has been made with the decoder input shape and different states.
#A LSTM layer as been added in the decoder model. The output has been made with the help of LSTM. Then the model has been build with the states and the inputs.
#Lastly I have reversed the dictionary of the input token index and the target token index as depending on the sequence the words will be predicted
#with the help of this token index matrix. So the sequence will be key and the words will be value.
decoder_state_input_h = Input(shape=(50,))
decoder_state_input_c = Input(shape=(50,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

final_dex2= dex(decoder_inputs)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

In [0]:
#This code has been built to write the decode sequence model where input will be given of the test data set sequence and based on the enoder model it will
#preduct the input sequence word. The target first character has been hardcoded with start to define the start of the sentence and then with the help of 
#decoder model the output word and states will be predicted. The condition will run it will reach  the maximum sequence or the end of sentence token.
#The word prediction will be taken based on the argmax as softmax is giving probability distirbution of the target words.
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index['start_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '_end' or
           len(decoded_sentence) > max_target_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence

In [0]:
#This part is just to test whether decoder code is working or not with train data
for seq_index in range(10):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts_train[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)
    print('Original Sentence',target_texts_train[seq_index: seq_index + 1])
    

**Test Data Building and Test**

In [0]:
#Here I have build a zero matrix for the test dataset exactly same like training dataset
encoder_input_data_test=np.zeros((len(input_texts_test),6),dtype='float32')

In [0]:
#The matrix is populated with the seqeunce of the words from input token index which will be used latter to predict the next target sentence
for i, (input_text,target_text) in enumerate(zip(input_texts_test,target_texts_test)):
  for t, word in enumerate(nltk.word_tokenize(input_text)):
        encoder_input_data_test[i, t] = input_token_index[word]

In [0]:
#Here I am calling the decode function for the entire test set and giving the input the sequence of the words for test set.
#The model is giving the decoded output sequence from the same.
#I have used the NLTK library for the copus blue score calculation as a whole to be calucalted. This score is cumulative score.
#Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n and weighting them by calculating the weighted geometric mean.
#The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
actual, predicted, ss = list(), list(),list()
for seq_index in range(len(encoder_input_data_test)):
    input_seq = encoder_input_data_test[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts_test[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)
    decoded_sentence = 'start_ '+decoded_sentence+' _end'
    print('Original Sentence',target_texts_test[seq_index: seq_index + 1])
    #str(l)[1:-1]
    actual.append([nltk.word_tokenize(str(target_texts_test[seq_index: seq_index + 1])[1:-1])])
    predicted.append(nltk.word_tokenize(str(decoded_sentence)))
    # calculate BLEU score
print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

-
Input sentence: ['we did it.']
Decoded sentence:  nous avons besoin
Original Sentence ['start_ nous avons réussi\u202f! _end']
-
Input sentence: ['did someone die?']
Decoded sentence:  qui a fait mal
Original Sentence ["start_ quelqu'un est-il mort ? _end"]
-
Input sentence: ['you scare me.']
Decoded sentence:  vous me prie .
Original Sentence ['start_ tu me fais peur. _end']
-
Input sentence: ["i'm exhausted."]
Decoded sentence:  je suis crevé
Original Sentence ['start_ je suis vannée. _end']
-
Input sentence: ["i didn't do it."]
Decoded sentence:  je ne suis pas
Original Sentence ["start_ je ne l'ai pas fait. _end"]
-
Input sentence: ['i was sober.']
Decoded sentence:  j'étais en train
Original Sentence ["start_ j'étais sobre. _end"]
-
Input sentence: ['are you bald?']
Decoded sentence:  êtes-vous énervé
Original Sentence ['start_ êtes-vous chauve ? _end']
-
Input sentence: ['we kept quiet.']
Decoded sentence:  nous avons perdu
Original Sentence ['start_ nous restâmes silencieux. _

# Section 4 Attention

---



** Task 5 (40 Marks) **sequence2sequence

* Extend the existing Seq2Seq model with an attention mechanism [Discussed in Class]
* Create sequence2sequence model with attention (15 marks)
* Train the model with the same data from Section 1 (10 marks)
* Translate the evaluation using the sequence2sequence attention model (10 marks)
* Evaluate the translations made with the sequence2sequence attention model and compare it with the model without attention using BLEU (5 marks)

**Attention Model**
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. It is very effective for the long sentence while using in encoder decoder. It works like the real time translator where it is not the case to get all the hidden state  of encoding the source sentence and then decode depending on that state.

**Advantage**
1. For the long sentences it is very effective it will concentrate on the selective part of the sentence to focus on.
2. Depending on the weight that is given it is very easy to understand which parts of the sentence will be given more stress.
3. There can be local attention or global attention depending on which translation can be done from source to target languages.

**Mechanism**
There are many mechanism which can be found to implement on the encoder and decoder model. Among them I have found one paper where the attention mechanism has been described with Global dot product based attention mechanism based on the previous hidden state of encoder and decoder.  I have referenced the paper in the reference section.
The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector.
Here are the steps which are followed to get the attention of the words where to focus on while translating:-

1.   First I have taken the output of the decoder and the output of the decoder state. These two states will be used to make where to give the attention with weight. Now there are multiple ways to do the weights. According to the paper as we are taking the softmax of the dot product of the outputs. Bascially these will give a probability distirbution weights to the each of the encoder output and the decoder output. These can be treated as attention score which will then be feed into context vector.
2.   After getting the attention score we are feeding this to a context vector which is getting the attention we got in the previous step and the encoder output.
3. Now the context vector is being concatenate with the decoder output which has been then passed into timedistributed dense layer having activation function "tanh".
4. Then this data has been fitted into timedistributed softmax function which will calculate the probability distirbution of the output words sequence based on which we will predict the outut words of the french sentence and then will return the output sentence.

**Decoder part for Test Data**
Once the model has been trained, the input sequence of test data has been passed sentence wise to get the decoder sequence of the target data. the idea behind the decoding is:- 
1. The sentence wise test data sequence has been passed into decode_sequence_attention function.
2. A encoder input matrix has been made which is same as the input sequence only shape has been defined.
3. A decode sequence matrix carries the first word as start to denote that that is starting of the sentence and then predict the sequence of the output and taking the maximum of it.
4. Then putting all together in the same array of decode matrix we use the reverse matrix to get the word out of the sequence.

In this way my decode sequence is working.

**Comparison with encoder decoder according to bleu score**
Now once the model has been build I have translated my whole test data set to get the translated sentences and then defined the blue score for that. On comparison I can see the comparison of the blue scores of the encoder- decoder model and with attention

Without Attention:-
BLEU-1: 0.317933
BLEU-2: 0.183142
BLEU-3: 0.120772
BLEU-4: 0.056879

With Attention:-
BLEU-1: 0.389593
BLEU-2: 0.274304
BLEU-3: 0.168985
BLEU-4: 0.076661

We can see that the 1-gram bleu score is high in the Attention encode-decoder model than the without attention encoder-decoder model. That means the words are better when predicting the translation with attention. Also the bigram matches are in high number in attention encoder-decoder model.

On comparison we can see that the bleu score for every n-gram is better in attention model than the encoder-decoder model without attention.



In [0]:
#This code has been written to make the encoder model. Here I have used the keras embedding layer and fixed them with the glove embedding matrix.
#There are total 50 cells taken for the lstm and I have sliced the encoder to take the last state which will be the input in the decoder.
import numpy as np
import keras, tensorflow
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.layers import Embedding
# encoder code goes here
encoder_inputs = Input(shape=(None,))
en_x=  Embedding(num_source_tokens,
                            50,
                            weights=[embedding_matrix],
                            input_length=max_source_seq_length,
                            trainable=False)(encoder_inputs)
encoder = LSTM(50, return_sequences=True)(en_x)
encoder_last = encoder[:,-1,:]

In [0]:
#This code has been written to make the decoder model. I have passed the target language data through the keras inbuilt embedding layer as I do not
#have the embedding for the French sentence. I have putter mask_zero=true as a part of 0 padding. This very helpful when the length of the 
#sequences are not the same. The number of cells fot the lstm is 50 which has been kept. The initial state of the decoder is the last hidden state and
#the cell state of the encider which has been passed.
decoder_input = Input(shape=(None,))

dex=  Embedding(num_target_tokens, 50, mask_zero=True)

final_dex= dex(decoder_input)
decoder = LSTM(latent_dim, return_sequences=True)(final_dex,initial_state=[encoder_last, encoder_last])


In [32]:
#In this part I have introduced the concept of attention and context vector. The attention has been made with the dot product of the encoder and decoder
#output and then passed it through softmax. The softmax function has given the probability distribution which has been taken as attention weights.
#The attention then multiplied with the encoder to get the context vector. Conext vector is then concatenated with the decoder vector which will be put into
#decoder output. Then with the help of timedistribution layer of keras and activation function tanh I have trained the model and taking the highest probability 
#word with the help of softmax 
from keras.layers import Activation, dot, concatenate
from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense
from keras.models import Model, load_model


attention = dot([decoder, encoder], axes=[2, 2])
print(attention)
attention = Activation('softmax', name='attention')(attention)
print('attention', attention)

context = dot([attention, encoder], axes=[2,1])
print('context', context)

decoder_combined_context = concatenate([context, decoder])
print('decoder_combined_context', decoder_combined_context)

# Has another weight + tanh layer as described in equation (5) of the paper
output = TimeDistributed(Dense(50, activation="tanh"))(decoder_combined_context)
output = TimeDistributed(Dense(num_target_tokens, activation="softmax"))(output)
print('output', output)

Tensor("dot_3/MatMul:0", shape=(?, ?, ?), dtype=float32)
attention Tensor("attention_1/truediv:0", shape=(?, ?, ?), dtype=float32)
context Tensor("dot_4/MatMul:0", shape=(?, ?, 50), dtype=float32)
decoder_combined_context Tensor("concatenate_2/concat:0", shape=(?, ?, 100), dtype=float32)
output Tensor("time_distributed_4/Reshape_1:0", shape=(?, ?, 4585), dtype=float32)


In [33]:
#Here I am making the model with the encoder input and decoder input. The loss is categorical crossentropy and the optimizer is rmsprop.
#The accuract matrix has been taken to see the accuracy in training and validation
model= Model(inputs=[encoder_inputs, decoder_input],outputs=output)
model.compile(optimizer='rmsprop',loss='categorical_crossentropy', metrics=['acc'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, None, 50)     229250      input_4[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 6, 50)        105300      input_3[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LS

In [34]:
#Here I am fitting the data into model to train the model with the words
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64,
          epochs=50,
          validation_split=0.05) 

Train on 7599 samples, validate on 400 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7fb2ead92f98>

In [0]:
#To generate the words from the sequence I have written this reverse code
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

In [0]:
#This code I have written to get the decoded output sentence. The input is coming from the test data part and is fitting into enoced sequence matrix.
#The first word is hardcoded as "start_" to be predicted to denote the start of the sentence. Then with the help of trained model I am predicting the 
#sequence. The the decoder sequence input has been passed to get the words output till the word "_end" will come denoting the end of the sentence.
def decode_sequence_attention(input_seq):
    decoded_sentence = ''
    encoder_input = input_seq.reshape(1,max_source_seq_length)
    #print(encoder_input)
    decoder_input = np.zeros(shape=(len(encoder_input), max_target_seq_length))
    decoder_input[:,0] = target_token_index['start_']
    for i in range(1, max_target_seq_length):
        output = model.predict([encoder_input, decoder_input]).argmax(axis=2)
        decoder_input[:,i] = output[:,i]
    for k in decoder_input[0]:
        if (reverse_target_char_index[k] == '_end'):
            decoded_sentence += ' '+reverse_target_char_index[k]
            break
        decoded_sentence += ' '+reverse_target_char_index[k]
    return decoded_sentence

In [39]:
#Here I am calling the decode function for the entire test set and giving the input the sequence of the words for test set.
#The model is giving the decoded output sequence from the same.
#I have used the NLTK library for the copus blue score calculation as a whole to be calucalted. This score is cumulative score.
#Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n and weighting them by calculating the weighted geometric mean.
#The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
actual, predicted, ss = list(), list(),list()
for seq_index in range(len(encoder_input_data_test)):
    input_seq = encoder_input_data_test[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence_attention(input_seq)
    print('-')
    print('Input sentence:', input_texts_test[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)
    print('Original Sentence',target_texts_test[seq_index: seq_index + 1])
    actual.append([nltk.word_tokenize(str(target_texts_test[seq_index: seq_index + 1])[1:-1])])
    predicted.append(nltk.word_tokenize(str(decoded_sentence)))
    # calculate BLEU score
print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

-
Input sentence: ["let's pray."]
Decoded sentence:  start_ je le voir . _end
Original Sentence ['start_ prions ! _end']
-
Input sentence: ['when was that?']
Decoded sentence:  start_ est-il fait ? _end
Original Sentence ["start_ c'était quand\xa0? _end"]
-
Input sentence: ['i have no idea.']
Decoded sentence:  start_ je n'ai pas une . _end
Original Sentence ["start_ je n'en ai aucune idée. _end"]
-
Input sentence: ['be a man.']
Decoded sentence:  start_ soyez un voiture ! ! ! ! ! ! ! ! !
Original Sentence ['start_ sois un homme ! _end']
-
Input sentence: ["i'm so hungry."]
Decoded sentence:  start_ je suis très heureux . _end
Original Sentence ["start_ j'ai tellement faim. _end"]
-
Input sentence: ["i'm a child."]
Decoded sentence:  start_ je suis un chien . _end
Original Sentence ['start_ je suis un enfant. _end']
-
Input sentence: ['tom is weak.']
Decoded sentence:  start_ tom est en train de travailler . _end
Original Sentence ['start_ tom est faible. _end']
-
Input sentence: ["i'l

In [0]:
#This code I have written to show how the individual predicted sentences blue scores are. I have taken the sentence blue funtion of nltk and shown sentence
#wise. The bleu score is cumulative score.
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
actual, predicted, ss = list(), list(),list()
for seq_index in range(10):
    input_seq = encoder_input_data_test[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts_test[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)
    print('Original Sentence',target_texts_test[seq_index: seq_index + 1])
    actual.append([nltk.word_tokenize(str(target_texts_test[seq_index: seq_index + 1])[1:-1])])
    predicted.append(nltk.word_tokenize(str(decoded_sentence)))
    a = [nltk.word_tokenize(str(target_texts_test[seq_index: seq_index + 1])[1:-1])]
    b = nltk.word_tokenize(str(decoded_sentence))
    print(a)
    print(b)
    # calculate BLEU score
    print('BLEU-1: %f' % sentence_bleu(a, b, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % sentence_bleu(a, b, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % sentence_bleu(a, b, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % sentence_bleu(a, b, weights=(0.25, 0.25, 0.25, 0.25)))

-
Input sentence: ["i'll pray hard."]
Decoded sentence:  start_ je vous ai laissé tomber . _end
Original Sentence ['start_ je prierai de toutes mes forces. _end']
[["'start_", 'je', 'prierai', 'de', 'toutes', 'mes', 'forces', '.', '_end', "'"]]
['start_', 'je', 'vous', 'ai', 'laissé', 'tomber', '.', '_end']
BLEU-1: 0.292050
BLEU-2: 0.180257
BLEU-3: 0.323673
BLEU-4: 0.374679
-
Input sentence: ['come over!']
Decoded sentence:  start_ venez ! ! ! ! ! ! ! ! ! ! !
Original Sentence ['start_ venez\u202f! _end']
[["'start_", 'venez\\u202f', '!', '_end', "'"]]
['start_', 'venez', '!', '!', '!', '!', '!', '!', '!', '!', '!', '!', '!']
BLEU-1: 0.076923
BLEU-2: 0.277350
BLEU-3: 0.463252
BLEU-4: 0.526640


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


-
Input sentence: ['we were lucky.']
Decoded sentence:  start_ nous nous sommes nous sommes . _end
Original Sentence ['start_ nous étions chanceux. _end']
[["'start_", 'nous', 'étions', 'chanceux', '.', '_end', "'"]]
['start_', 'nous', 'nous', 'sommes', 'nous', 'sommes', '.', '_end']
BLEU-1: 0.375000
BLEU-2: 0.231455
BLEU-3: 0.415604
BLEU-4: 0.481098
-
Input sentence: ["i'm not strong."]
Decoded sentence:  start_ je ne suis pas un ? _end
Original Sentence ['start_ je ne suis pas forte. _end']
[["'start_", 'je', 'ne', 'suis', 'pas', 'forte', '.', '_end', "'"]]
['start_', 'je', 'ne', 'suis', 'pas', 'un', '?', '_end']
BLEU-1: 0.551561
BLEU-2: 0.456736
BLEU-3: 0.427511
BLEU-4: 0.322601
-
Input sentence: ["i'm well."]
Decoded sentence:  start_ je suis en train de travailler . _end
Original Sentence ['start_ je vais bien. _end']
[["'start_", 'je', 'vais', 'bien', '.', '_end', "'"]]
['start_', 'je', 'suis', 'en', 'train', 'de', 'travailler', '.', '_end']
BLEU-1: 0.333333
BLEU-2: 0.204124
BLEU

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


-
Input sentence: ["tom'll quit."]
Decoded sentence:  start_ tom va les chiens . _end
Original Sentence ['start_ tom va arrêter. _end']
[["'start_", 'tom', 'va', 'arrêter', '.', '_end', "'"]]
['start_', 'tom', 'va', 'les', 'chiens', '.', '_end']
BLEU-1: 0.571429
BLEU-2: 0.436436
BLEU-3: 0.608068
BLEU-4: 0.660633
-
Input sentence: ["i can't go out."]
Decoded sentence:  start_ je ne peux pas partir . _end
Original Sentence ['start_ je ne peux pas sortir. _end']
[["'start_", 'je', 'ne', 'peux', 'pas', 'sortir', '.', '_end', "'"]]
['start_', 'je', 'ne', 'peux', 'pas', 'partir', '.', '_end']
BLEU-1: 0.661873
BLEU-2: 0.577730
BLEU-3: 0.492248
BLEU-4: 0.362824
-
Input sentence: ['drive carefully.']
Decoded sentence:  start_ continuez à nouveau . _end
Original Sentence ['start_ conduis avec prudence. _end']
[["'start_", 'conduis', 'avec', 'prudence', '.', '_end', "'"]]
['start_', 'continuez', 'à', 'nouveau', '.', '_end']
BLEU-1: 0.282161
BLEU-2: 0.218561
BLEU-3: 0.375656
BLEU-4: 0.430125
-
Inp

**References**
1.https://arxiv.org/pdf/1508.04025.pdf

2.https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/

3.https://medium.com/@dev.elect.iitd/neural-machine-translation-using-word-level-seq2seq-model-47538cba8cd7

4.https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

5.Lab 5 and Lab 7 from the lectures

6.https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

7.https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/

8.https://wanasit.github.io/attention-based-sequence-to-sequence-in-keras.html