<h1>ENCoder Decoder based machine translation tool</h1>
In this project a structure based on encoder-decoder is developed for machine tranlation. The source language will be english and the target language will be once spanish and then persian. The main source that the following structure has been inspired from is shown as bellow:

https://towardsdatascience.com/word-level-english-to-marathi-neural-machine-translation-using-seq2seq-encoder-decoder-lstm-model-1a913f2dc4a7

<h3> Downloading the dataset from google drive and making it ready</h3>

In [0]:
from google.colab import drive
import numpy as np
drive.mount("/content/drive/")

!cp -rd /content/drive/My\ Drive/MSc_Projects/ANN-HW7/Dataset /content
!unzip /content/Dataset/SpEn.zip -d /content/dataset

file = open("/content/dataset/spa.txt", 'r+')

total_text = file.read()
texts_list = total_text.split("\n")[:-1]

print("This is just to check if the dataset has been downloaded properly:\n%s"%texts_list[np.random.randint(0, len(texts_list))])


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
Archive:  /content/Dataset/SpEn.zip
replace /content/dataset/_about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: This is just to check if the dataset has been downloaded properly:
I do not play tennis as much as I used to.	No juego tanto al tenis como antes.


<h1>Model implementation</h1>
The following cell contains the definition of the model. In this model two main steps has been implemented:
<ol>
  <li>training phase: regarding that the targets are available. Teacher forcing is used for training, thus the model is simple.</li>
  <li>Inference phase: in this phase the only available data is the test data. in this phase the current outputs of the decoder(LSTM) including cell, hidden state and output are used as the next inuts of the decoder. This process is done inside a while loop, untill the generated sequence reaches a maximum limit or the "\n", interpreted as the end of sentence, is generated.


In [0]:
from __future__ import print_function
from keras.callbacks import TensorBoard
from keras.models import Model, load_model
from keras.layers import Input, LSTM, Dense
from keras.optimizers import RMSprop
import numpy as np

batch_size = 64
epochs = 40
latent_dim = 256
num_samples = 30000
data_path = '/content/dataset/spa.txt'
learning_rate = 0.01

LOAD = False

In [11]:
from sklearn.model_selection import train_test_split

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])


labels = {}
for i in range(len(input_texts)):
  result = []
  for j in range(-10, 10):
    if i + j >= 0 and i + j < len(input_texts) and input_texts[i] == input_texts[i + j]:
      result.append(target_texts[i + j])
  labels[input_texts[i][1:-1]] = result
  
train_x, test_x, train_y, test_y = train_test_split(input_texts, target_texts, shuffle=False, random_state=12, train_size=0.8)

Number of samples: 30000
Number of unique input tokens: 76
Number of unique output tokens: 93
Max sequence length for inputs: 22
Max sequence length for outputs: 70


In [0]:
def embed_data(input_texts, target_texts):
  
  encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
  decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
  decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

  for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
      for t, char in enumerate(input_text):
          encoder_input_data[i, t, input_token_index[char]] = 1.
      for t, char in enumerate(target_text):
          decoder_input_data[i, t, target_token_index[char]] = 1.
          if t > 0:
              decoder_target_data[i, t - 1, target_token_index[char]] = 1.
              
  return encoder_input_data, decoder_input_data, decoder_target_data

In [13]:
tbc = TensorBoard(log_dir='/content/logs/layer-2/', histogram_freq=0, 
                                  write_graph=True, write_images=True)


encoder_input_data, decoder_input_data, decoder_target_data = embed_data(train_x, train_y)
                            
encoder_inputs = Input(shape=(None, num_encoder_tokens))

#layer 1
encoder = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h1, state_c1 = encoder(encoder_inputs)

#layer 2
encoder1 = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder1(encoder_outputs)

encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

if not LOAD:

  model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  print(model.summary())
  
  opt = RMSprop(lr=learning_rate)
  # Run training
  model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
  model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
            batch_size=batch_size,
            epochs=epochs,
            validation_split=0.2,
            callbacks = [tbc])
  # Save model
  
  
if LOAD:
  !cp /content/drive/My\ Drive/MSc_Projects/ANN-HW7/layers/2/layer-2.h5 /content/
  model = load_model("layer-2.h5")
  print(model.summary())


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, None, 76)     0                                            
__________________________________________________________________________________________________
lstm_8 (LSTM)                   [(None, None, 256),  340992      input_8[0][0]                    
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, None, 93)     0                                            
__________________________________________________________________________________________________
lstm_9 (LSTM)                   [(None, 256), (None, 525312      lstm_8[0][0]                     
__________________________________________________________________________________________________
lstm_10 (L

KeyboardInterrupt: ignored

In [16]:
if not LOAD:
  model.save('layer-2.h5')

encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence

decoded_sentences = []
reference_sents = []

for seq_index in range(5):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentences.append(decode_sequence(input_seq)[:-1])
    reference_sents.append(target_texts[seq_index][1:-1])
    print('-')
    print('Input sentence:', test_x[seq_index])
    print('Decoded sentence:', decoded_sentences[seq_index])
    print('reference sentence: ', reference_sents[seq_index])

  '. They will not be included '


-
Input sentence: I didn't ask for you.
Decoded sentence: Vaya.
reference sentence:  Ve.
-
Input sentence: I didn't believe you.
Decoded sentence: Vaya.
reference sentence:  Vete.
-
Input sentence: I didn't do anything.
Decoded sentence: Vaya.
reference sentence:  Vaya.
-
Input sentence: I didn't do anything.
Decoded sentence: Vaya.
reference sentence:  Váyase.
-
Input sentence: I didn't do it alone.
Decoded sentence: Hola.
reference sentence:  Hola.


In [17]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!cp /content/layer-2.h5  /content/drive/My\ Drive/MSc_Projects/ANN-HW7/layers/2/

Mounted at /content/drive


<h1> Calculating the BLEU metric</h1>


In [0]:
from nltk.tokenize import word_tokenize

def tokenize(sentences):
  temp = []
  for sent in sentences:
    temp.append(word_tokenize(sentence))
  return temp

In [19]:
from nltk.translate.bleu_score import sentence_bleu
import nltk
nltk.download("punkt")

encoder_input_data, _, _ = embed_data(test_x, test_y)
sentences_bleu = []

for i, sentence in enumerate(test_x[:1000]):
  if int(i % 200) == 0:
    print("%d%%"%(i *100 / len(test_x[:1000])))
  input_seq = encoder_input_data[i: i + 1]
  decoded_sentence = decode_sequence(input_seq)[:-1]
  temp = tokenize(labels[sentence[1:-1]])
  sentences_bleu.append(sentence_bleu(temp, word_tokenize(decoded_sentence)))

print("The BLEU value calculated: %.2f" % np.average(sentences_bleu))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
0%


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


20%
40%
60%
80%
The BLEU value calculated: 0.52


<h1>Ploting the losses using Tensorboard</h1>

In [0]:
# %load_ext tensorboard
!kill 2525
%tensorboard --logdir /content/logs