<a href="https://colab.research.google.com/github/kiranraou/Python-Projects/blob/main/Sequence_Learning_English_Hindi_Translation_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sequence Learning - Assignment 3**

**Problem Statement 2**

**Machine Translation is one of the widely used use-cases for the 
sequence to sequence learning models.
Using existing English-Hindi Translation data, build an encoderdecoder model to predict Hindi sentences from a given English 
sentence.**

**Dataset Description.**

Dataset contains two major columns - english_sentence & hindi_sentence,which are given translations. There 
are many sources available for this data - but you have to use "ted" for this assignment.

Dataset downloaded in google drive from below details

**!wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 
(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/81.0.4044.138 Safari/537.36" --header="Accept: 
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" -
-header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggledata-sets/200079/441417/bundle/archive.zip?GoogleAccessId=web-data@kaggle161607.iam.gserviceaccount.com&Expires=1589739508&Signature=oCMrIr4qYblbSyR28S%2F5pDf
V1hBlxa901ghZ8bKmasWC9msw%2BHmyAooohSK4f0y2hVLU9pmhnEq7%2FZ3ncWZJTjWd4NK0Hhygpk43fkAv
pvNhTQcAiExtojT%2FRfXrRR6ZR%2FzEyqH1nh1ywqcLnTqJRwjqzV0PbCgnmNIczOO533FVgkZJwZk59kNwJ
uIOU98NIA1zhSxd0q%2ByTGDATwNFNmYalISRyCCFlUtsyjP%2Fk8zUHeQ2gU5lFGLyQeE7584D1uzD6klWV6
Ng%2BhJzIWaq3laUZNOuoD7Sm4dxy2t4Ip%2Fc%2BPLIC5ZdZHU%2F9I8LGsbfTJUcdujy72%2BN1hBo2Ts3r
w%3D%3D&response-content-disposition=attachment%3B+filename%3Dhindienglishcorpora.zip" -c -O 'hindienglish-corpora.zip*

**Mounting Google Drive.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Import libraries**

In [7]:
import os
import string
import numpy as np
import pandas as pd
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import re
import logging
import tensorflow as tf
tf.executing_eagerly()
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
logging.getLogger('tensorflow').setLevel(logging.FATAL)
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import io
import time
import warnings
import sys

        
PATH = "/content/drive/MyDrive/NIT Warangal _Industry Project/Hindi_English_Truncated_Corpus.csv"

**Preprocess English and Hindi sentences.**

In [8]:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.rstrip().strip()
    return w

def hindi_preprocess_sentence(w):
    w = w.rstrip().strip()
    return w

In [9]:
def create_dataset(path=PATH):
    lines=pd.read_csv(path,encoding='utf-8')
    lines=lines.dropna()
    lines = lines[lines['source']=='ted']
    en = []
    hd = []
    for i, j in zip(lines['english_sentence'], lines['hindi_sentence']):
        en_1 = [preprocess_sentence(w) for w in i.split(' ')]
        en_1.append('<end>')
        en_1.insert(0, '<start>')
        hd_1 = [hindi_preprocess_sentence(w) for w in j.split(' ')]
        hd_1.append('<end>')
        hd_1.insert(0, '<start>')
        en.append(en_1)
        hd.append(hd_1)
    return hd, en

In [10]:
def max_length(tensor):
    return max(len(t) for t in tensor)

**Tokenization of the data.**

In [11]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(lang)
  tensor = lang_tokenizer.texts_to_sequences(lang)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,padding='post')
  return tensor, lang_tokenizer

In [12]:
def load_dataset(path=PATH):
    targ_lang, inp_lang = create_dataset(path)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [13]:
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(PATH)
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

**Create Train and Test dataset.**

In [14]:
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

31904 31904 7977 7977


In [15]:
def convert(lang, tensor):
  for t in tensor:
    if t!=0:
      print ("%d ----> %s" % (t, lang.index_word[t]))
    
print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
5 ----> to
269 ----> ask
71 ----> them
5 ----> to
215 ----> always
19 ----> have
3 ----> the
139 ----> right
569 ----> answer
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
77 ----> उन्हें
246 ----> हमेशा
165 ----> केवल
189 ----> सही
628 ----> उत्तर
255 ----> देने
4 ----> के
118 ----> लिये
2245 ----> प्रोत्साहित
47 ----> करने
2 ----> <end>


In [16]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 128
units = 256
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

Encoder Decoder with Attention Model

**Encoder**

In [17]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

**Attention Mechanism**

In [18]:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    hidden_with_time_axis = tf.expand_dims(query, 1)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))
    attention_weights = tf.nn.softmax(score, axis=1)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    return context_vector, attention_weights

**Decoder**

In [19]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    context_vector, attention_weights = self.attention(hidden, enc_output)
    x = self.embedding(x)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
    output, state = self.gru(x)
    output = tf.reshape(output, (-1, output.shape[2]))
    x = self.fc(output)
    return x, state, attention_weights

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

**Optimizer**

In [20]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)
  mask = tf.cast(mask, dtype=loss_.dtype)
#   print(type(mask))
  loss_ *= mask
  return tf.reduce_mean(loss_)

In [21]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)


**Training**

In [22]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
    # Teacher forcing
    for t in range(1, targ.shape[1]):
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
      loss += loss_function(targ[:, t], predictions)
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))
  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))      
  return batch_loss

In [23]:
EPOCHS = 20

for epoch in range(EPOCHS):
  start = time.time()
  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0
  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss
    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                     batch,
                                                     batch_loss.numpy()))
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 2.7969
Epoch 1 Batch 100 Loss 1.9372
Epoch 1 Batch 200 Loss 2.0015
Epoch 1 Batch 300 Loss 1.9136
Epoch 1 Batch 400 Loss 1.9209
Epoch 1 Loss 1.9469
Time taken for 1 epoch 204.45168614387512 sec

Epoch 2 Batch 0 Loss 1.6501
Epoch 2 Batch 100 Loss 1.8804
Epoch 2 Batch 200 Loss 1.5748
Epoch 2 Batch 300 Loss 1.5459
Epoch 2 Batch 400 Loss 1.6075
Epoch 2 Loss 1.7396
Time taken for 1 epoch 154.15194725990295 sec

Epoch 3 Batch 0 Loss 1.6908
Epoch 3 Batch 100 Loss 1.5453
Epoch 3 Batch 200 Loss 1.5844
Epoch 3 Batch 300 Loss 1.6004
Epoch 3 Batch 400 Loss 1.5142
Epoch 3 Loss 1.6322
Time taken for 1 epoch 154.0784261226654 sec

Epoch 4 Batch 0 Loss 1.4988
Epoch 4 Batch 100 Loss 1.4972
Epoch 4 Batch 200 Loss 1.5666
Epoch 4 Batch 300 Loss 1.4768
Epoch 4 Batch 400 Loss 1.4433
Epoch 4 Loss 1.5411
Time taken for 1 epoch 154.55977177619934 sec

Epoch 5 Batch 0 Loss 1.4095
Epoch 5 Batch 100 Loss 1.4374
Epoch 5 Batch 200 Loss 1.4059
Epoch 5 Batch 300 Loss 1.4108
Epoch 5 Batch 400 Loss 

In [24]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, sentence

In [25]:
def translate(sentence):
    result, sentence = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

In [26]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f87e7447a90>

In [27]:
translate(u'politicians do not have permission to do what needs to be done.')

Input: politicians do not have permission to do what needs to be done .
Predicted translation: जो कि कैसे उपलब्ध नहीं है जो कि पालन करना है जो कि पालन करना है जो कि पालन करना है जो कि पालन करना है जो कि पालन करना है जो कि 
