# Creating a mind map before getting into action

Having learned from the mistakes made before this notebook, let's plan everything we need to do so that the NMT model is built with all the proper components and correctly preprocessed data.

**Data Preprocessing Actions**:
1. Shift the target texts.
2. Pad the text with proper annotations.
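
To make "shifting" concrete, here is a minimal sketch of one common reading of it in teacher-forced training: the decoder input starts with a start token and the label sequence is the same text moved one position to the left, ending with an end token. The helper name `make_decoder_pair` is hypothetical and is not used elsewhere in this notebook.

```python
SOS, EOS = '<SOS>', '<EOS>'

def make_decoder_pair(target_sentence):
    """Return (decoder_input, decoder_label) token lists for one target sentence."""
    tokens = [SOS] + target_sentence.split() + [EOS]
    decoder_input = tokens[:-1]   # starts with <SOS>, drops <EOS>
    decoder_label = tokens[1:]    # drops <SOS>, ends with <EOS>
    return decoder_input, decoder_label

decoder_input, decoder_label = make_decoder_pair('नोकिया १०११')
print(decoder_input)   # ['<SOS>', 'नोकिया', '१०११']
print(decoder_label)   # ['नोकिया', '१०११', '<EOS>']
```

At every position the label is the token the decoder should predict next, given the input tokens up to that position.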

**Components of the model**:
1. Create *text vectors* of both source and target language separately using TensorFlow `TextVectorization` layer.
2. Create *embedding layers* of both source and target language separately using TensorFlow `Embedding` layer.
3. Create a class known as *Encoder* which inherits from the TensorFlow class `Layer`. In this subclassed layer, we are going to develop the encoder architecture of the NMT. (The architecture will be discussed while we are developing the layer.)
4. Create a class known as *Decoder* which inherits the properties of TensorFlow class `Layer`. In this sub-class method of creating a layer, we are going to develop the decoder architecture of the NMT. (The architecture will be discussed while we are developing the layer.)
5. After creating all the components, we are going to create a class called *EncoderDecoder* which will inherit from the TensorFlow class `Model`. In this class we will assemble all of the layers prepared in the steps above and then write a custom `call` function which lets us train the decoder the way we expect.
6. Create an instance of the *EncoderDecoder* class with all the parameters passed in its constructor and compile the model with `tf.keras.losses.SparseCategoricalCrossentropy()` as loss function and `tf.keras.optimizers.RMSprop()` as the optimizer.
7. We will then create the *train_dataset* and *valid_dataset* using TensorFlow `tf.data` API for better performance of the model training.
8. We will fit the model on the training data for 5 epochs, with the validation dataset and *callbacks* such as `tf.keras.callbacks.ModelCheckpoint`, `tf.keras.callbacks.EarlyStopping` and `tf.keras.callbacks.ReduceLROnPlateau`.
9. After training, we will evaluate the model using different metrics, then predict on unseen data and observe the quality of the predictions.
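
As a rough illustration of step 8, the three callbacks could be set up as below. This is only a sketch: the checkpoint filename `best_nmt.keras`, the monitored metric and the patience and factor values are assumptions chosen for the example, not settings taken from this notebook.

```python
import tensorflow as tf

callbacks = [
    # Keep only the best model seen so far (hypothetical filename)
    tf.keras.callbacks.ModelCheckpoint(filepath='best_nmt.keras',
                                       monitor='val_loss',
                                       save_best_only=True),
    # Stop when validation loss stops improving (patience is an assumed value)
    tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                     patience=3,
                                     restore_best_weights=True),
    # Halve the learning rate when validation loss plateaus (assumed values)
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                         factor=0.5,
                                         patience=2),
]
print([type(callback).__name__ for callback in callbacks])
```

The list would then be passed to `model.fit(..., callbacks=callbacks)` together with the validation dataset.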

# Importing all the necessary libraries

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import os
import tarfile
import random

# Start the project

Like every machine learning project, this one has two phases. We will first work on *Data Preprocessing* and then on *Model Development*.

## Data Preprocessing

In [2]:
# Constants for the project's starting path and the dataset locations derived from it
START_PATH = os.getcwd()
COMP_DATA_PATH = os.path.join(START_PATH, 'wiki-titles.tgz')
DATA_URL = 'https://www.statmt.org/wmt14/wiki-titles.tgz'
DATA_DIR = os.path.join(START_PATH, 'wiki/hi-en/wiki-titles.hi-en')

### Loading Data

In [3]:
if not os.path.exists(COMP_DATA_PATH):
    print(f'Downloading data from {DATA_URL}')
    !wget https://www.statmt.org/wmt14/wiki-titles.tgz
else:
    print("Data already downloaded")

Data already downloaded
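
The `!wget` line only works inside a notebook with `wget` installed. A portable alternative, sketched here with the standard library, is shown below. The helper name `download_if_missing` is hypothetical.

```python
import os
import urllib.request

def download_if_missing(url, destination):
    """Download `url` to `destination` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(destination):
        return False
    urllib.request.urlretrieve(url, destination)
    return True

# Intended usage with the constants defined above (commented out to avoid a network call here):
# download_if_missing(DATA_URL, COMP_DATA_PATH)
```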


In [4]:
if not os.path.exists(COMP_DATA_PATH):
    print("File does not exist")
else:
    print("The file exists")

The file exists


In [5]:
with tarfile.open(COMP_DATA_PATH, 'r') as tar_ref:
    tar_ref.extractall()
    print("File extracted")

File extracted


### Extracting data from file 

In [6]:
# Extracting the lines from the file of the dataset
with open(DATA_DIR, 'r', encoding='utf-8') as f:
    lines = f.readlines()

In [7]:
# Splitting the data into two lists, one for the source language and one for the target language
src_senteces = [line.split('|||')[1][1:-1] for line in lines]
trg_sentences = [line.split('|||')[0][:-1] for line in lines]
src_senteces[:10], trg_sentences[:10], len(src_senteces), len(trg_sentences)

(['January 0',
  'March 0',
  '1000',
  '1001',
  '1002',
  '1003',
  '1004',
  '1005',
  '1006',
  '1007'],
 ['० जनवरी',
  '० मार्च',
  '१०००',
  '१००१',
  '१००२',
  '१००३',
  '१००४',
  '१००५',
  '१००६',
  '१००७'],
 32863,
 32863)
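
The slicing above relies on every line having exactly one space around `|||` and a trailing newline. A slightly more defensive sketch, assuming the same `target ||| source` layout, strips whitespace instead and skips lines that do not split into two parts. The function name `parse_pairs` is hypothetical, and the sample lines are rebuilt from the output shown above.

```python
def parse_pairs(lines, separator='|||'):
    """Split 'target ||| source' lines into parallel source and target lists."""
    sources, targets = [], []
    for line in lines:
        parts = line.split(separator)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        target, source = (part.strip() for part in parts)
        sources.append(source)
        targets.append(target)
    return sources, targets

sample_lines = ['० जनवरी ||| January 0\n', '\n', '१००० ||| 1000\n']
sample_sources, sample_targets = parse_pairs(sample_lines)
print(sample_sources)  # ['January 0', '1000']
print(sample_targets)  # ['० जनवरी', '१०००']
```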

### Visualise the data

In [8]:
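def visualise_random_sentences(src_sent, trg_sent):
    # randrange excludes the upper bound, so the index is always valid
    # (random.randint(0, len(src_sent)) could return len(src_sent) and raise IndexError)
    random_idx = random.randrange(len(src_sent))

    print(f"Source sentence: {src_sent[random_idx]}")
    print(f"Target sentence: {trg_sent[random_idx]}")

visualise_random_sentences(src_sent=src_senteces,
                           trg_sent=trg_sentences)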

Source sentence: Jiangnan
Target sentence: जियांगनान


### Shift the target data with one token

In [9]:
trg_sentences_preprocessed = ['<SOS> ' + sentence + ' <EOS>' for sentence in trg_sentences]
visualise_random_sentences(src_sent= src_senteces,
                           trg_sent= trg_sentences_preprocessed)

Source sentence: Nokia 1011
Target sentence: <SOS> नोकिया १०११ <EOS>
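
One thing to watch once these tagged sentences reach `TextVectorization`: the layer's default standardization lowercases text and strips punctuation, which would turn `<SOS>` into `sos`. The sketch below assumes standardization is simply switched off with `standardize=None`. The two sample sentences come from the outputs above, and the sequence length of 17 mirrors the maximum target length computed in this notebook.

```python
import tensorflow as tf

# Two tagged target sentences in the same shape as trg_sentences_preprocessed
sample_targets = tf.constant(['<SOS> नोकिया १०११ <EOS>',
                              '<SOS> जियांगनान <EOS>'])

# standardize=None keeps '<SOS>' and '<EOS>' intact
target_vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,
    split='whitespace',
    output_mode='int',
    output_sequence_length=17,
)
target_vectorizer.adapt(sample_targets)

vocabulary = target_vectorizer.get_vocabulary()
token_ids = target_vectorizer(sample_targets)
print(vocabulary)
print(token_ids.shape)
```

With `output_sequence_length` set, the layer also pads (with id 0) or truncates every sentence to the same length.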


In [10]:
# Calculate the maximum sentence length (in words) for each language
src_word_per_sentence = [len(line.split()) for line in src_senteces]
trg_word_per_sentence = [len(line.split()) for line in trg_sentences_preprocessed]

max_src_len = max(src_word_per_sentence)
max_trg_len = max(trg_word_per_sentence)

print(f"Max source sentence length: {max_src_len}")
print(f"Max target sentence length: {max_trg_len}")

Max source sentence length: 13
Max target sentence length: 17


>**Note**: Padding of the data will be done after vectorising the text data
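
For reference, post-padding of already vectorised sequences amounts to the following. This is a plain-Python sketch with made-up token ids. The function name `pad_sequences_post` and the choice of `0` as the padding id are assumptions for illustration.

```python
def pad_sequences_post(sequences, max_len, pad_id=0):
    """Right-pad (or truncate) each list of token ids to exactly max_len entries."""
    return [seq[:max_len] + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_sequences_post([[5, 9], [5, 7, 8, 9], [5, 6, 7, 8, 9]], max_len=4)
print(padded)  # [[5, 9, 0, 0], [5, 7, 8, 9], [5, 6, 7, 8]]
```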

## Model development

We have completed the data preprocessing and are ready to develop the model. Let's go over the steps we are going to take:
1. Create individual components
2. Assemble the model
3. Create Data Pipeline
4. Fit the model
5. Evaluate the model
6. Predict using the model

### Creating Components

Before starting with the model, we first need to create some building blocks for the Encoder-Decoder architecture of the NMT. The components that we are going to make here are:
1. Encoder layer:
    * *Inheritance*: `tf.keras.layers.Layer`
    * *Constructor Input*: `units`, `embedding_size`, `dropout_rate`, `num_layers`
    * *Call function Input*: src_sentences
    * *Return*: Encoder output, RNN layer states
2. Decoder layer:
    * *Inheritance*: `tf.keras.layers.Layer`
    * *Constructor Input*: `units`, `num_layers`, `dropout_rate`
    * *Call function Input*: trg_sentences_preprocessed, context, encoder_states
    * *Return*: RNN output

In [94]:
# Creation of Encoder layer class
class Encoder(tf.keras.layers.Layer):
    '''
    This class creates a custom encoder layer based on Google's NMT research paper.
    For more reference please view https://arxiv.org/pdf/1609.08144.pdf
    
    This class inherits from `tf.keras.layers.Layer`. In this class, we create the constructor
    of the encoder layer and then define the call function, which sets out how the encoder layer
    works, written in the style of the Functional API.
    '''
    
    def __init__(self, units, embedding_size, dropout_rate, num_layers, **kwargs):
        '''
        Constructs the encoder
        
        Parameters:
            units: Number of neurons required per LSTM layer
            embedding_size: to give the first layer an input shape
            dropout_rate: to set the dropout rate
            num_layers: to set the number of encoding layer
        Returns:
            An instance of Encoder class which works as the encoder described in the research paper
        '''
        
        # calling the super method to initialise
        super().__init__(**kwargs)
        
        # initialising all the object variables
        self.units = units
        self.embedding_size = embedding_size
        self.dropout_rate = dropout_rate
        self.num_layers = num_layers
        
        # initialise the layers
        self.bi_lstm_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units= self.units // 2,
                                                                                return_sequences=True,
                                                                                return_state=True,
                                                                                dropout=self.dropout_rate),
                                                           name= 'encoder_bi_lstm_layer')
        num1 = self.embedding_size
        num2 = self.units
        
        self.lstm_recurrent_layers = []
        for i in range(self.num_layers):
            temp = num1 + num2
            self.lstm_recurrent_layers.append(tf.keras.layers.LSTM(units= temp,
                                                         return_sequences=True,
                                                         return_state=True,
                                                         dropout=self.dropout_rate,
                                                         name= f'encoder_lstm_recurrent_layer_{i + 1}'))
        self.concatenate_layer = tf.keras.layers.Concatenate(name= 'concatenate_layer')
    
    def call(self, inputs):
        print(inputs.shape)
        x, h1, c1, h2, c2 = self.bi_lstm_layer(inputs)
        print(self.bi_lstm_layer.name, x.shape)
        x = self.concatenate_layer([inputs, x])
        print(self.concatenate_layer.name, x.shape)
        encoder_memory, encoder_carry_state = None, None
        for layer in self.lstm_recurrent_layers:
            lstm_output, encoder_memory, encoder_carry_state = layer(x)
            print(layer.name, lstm_output.shape)
            x = self.concatenate_layer([x, lstm_output])
            print(self.concatenate_layer.name, x.shape)
        
        return x, encoder_memory, encoder_carry_state
    
encoder_layer = Encoder(units= 512, 
                        embedding_size= 128,
                        dropout_rate= 0.5,
                        name= 'encoder_layer',
                        num_layers= 8)
encoder_layer.get_config()

{'name': 'encoder_layer',
 'units': 512,
 'embedding_size': 128,
 'dropout_rate': 0.5,
 'num_layers': 8,
 'trainable': True,
 'dtype': 'float32'}

In [100]:
# Checking the functionality of the layer using dummy values
encoder_output, memory_state, carry_state = encoder_layer(tf.random.uniform(shape=(1, 10, 128)))

(1, 10, 128)
encoder_bi_lstm_layer (1, 10, 512)
concatenate_layer (1, 10, 640)
encoder_lstm_recurrent_layer_1 (1, 10, 640)
concatenate_layer (1, 10, 1280)
encoder_lstm_recurrent_layer_2 (1, 10, 640)
concatenate_layer (1, 10, 1920)
encoder_lstm_recurrent_layer_3 (1, 10, 640)
concatenate_layer (1, 10, 2560)
encoder_lstm_recurrent_layer_4 (1, 10, 640)
concatenate_layer (1, 10, 3200)
encoder_lstm_recurrent_layer_5 (1, 10, 640)
concatenate_layer (1, 10, 3840)
encoder_lstm_recurrent_layer_6 (1, 10, 640)
concatenate_layer (1, 10, 4480)
encoder_lstm_recurrent_layer_7 (1, 10, 640)
concatenate_layer (1, 10, 5120)
encoder_lstm_recurrent_layer_8 (1, 10, 640)
concatenate_layer (1, 10, 5760)
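
The growth of the last dimension follows directly from the concatenations in `call`. The bidirectional layer emits `units` features (two directions of `units // 2`), which are concatenated with the `embedding_size` inputs, and every stacked LSTM then adds another `embedding_size + units` features. The small sketch below reproduces the printed widths. The function name `encoder_widths` is hypothetical.

```python
def encoder_widths(embedding_size, units, num_layers):
    """Feature width after the first concatenation and after each stacked LSTM layer."""
    layer_units = embedding_size + units   # size of every stacked LSTM layer
    width = embedding_size + units         # inputs concatenated with the Bi-LSTM output
    widths = [width]
    for _ in range(num_layers):
        width += layer_units               # previous tensor concatenated with the LSTM output
        widths.append(width)
    return widths

widths = encoder_widths(embedding_size=128, units=512, num_layers=8)
print(widths)  # [640, 1280, 1920, 2560, 3200, 3840, 4480, 5120, 5760]
```

Note that the returned memory and carry states come from the last stacked LSTM, so their width is 640 rather than 512.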


In [98]:
# Create Decoder class
class Decoder(tf.keras.layers.Layer):
    '''
    This class creates a Decoder layer. It inherits from `tf.keras.layers.Layer`.
    We create the constructor and the call function, which contains the Functional-API-style
    computation applied to the inputs.
    
    This layer is based on the paper referenced in the Encoder docstring above.
    '''
    
    def __init__(self, units, num_layers, dropout_rate, **kwargs):
        '''
        Constructor of the decoder class which helps initialize the variables and create the layer
        Parameters:
            units: number of neurons in the LSTM layer
            num_layers: number of layers expected in the decoder layer
            dropout_rate: to set the dropout rate of the LSTM layer
            
        Returns:
            An instance of the decoder class which will act as a layer
        '''
        
        # calling the super function
        super().__init__(**kwargs)
        
        # initialising all the variables
        self.units = units
        self.num_layers = num_layers
        self.dropout_rate = dropout_rate
        
        # initialising the layers
        self.lstm_cell = tf.keras.layers.LSTMCell(units= self.units,
                                                  name= 'decoder_lstm_cell')
        # return_state=True is required because call() unpacks (output, h, c)
        self.lstm_layer = tf.keras.layers.LSTM(units= self.units,
                                              return_sequences=True,
                                              return_state=True,
                                              name= 'decoder_lstm_layer')
    
    def call(self, inputs, context):
        x, h, c = self.lstm_layer(inputs, initial_state= context)
        return self.lstm_cell(x, initial_state= [h, c])
        

In [99]:
decoder_layer = Decoder(units= 512,
                        num_layers= 8,
                        dropout_rate= 0.5)
decoder_layer.get_config()

{'name': 'decoder_1',
 'units': 512,
 'num_layers': 8,
 'dropout_rate': 0.5,
 'trainable': True,
 'dtype': 'float32'}

In [102]:
decoder_layer(encoder_output, context= [memory_state, carry_state])

1. The `call()` method of your layer may be crashing. Try to `__call__()` the layer eagerly on some test input first to see if it works. E.g. `x = np.random.random((3, 4)); y = layer(x)`
2. If the `call()` method is correct, then you may need to implement the `def build(self, input_shape)` method on your layer. It should create all variables used by the layer (e.g. by calling `layer.build()` on all its children layers).
Exception encountered: ''Exception encountered when calling LSTMCell.call().

Dimensions must be equal, but are 640 and 512 for '{{node decoder_lstm_layer_1/lstm_cell_1/MatMul_1}} = MatMul[T=DT_FLOAT, grad_a=false, grad_b=false, transpose_a=false, transpose_b=false](decoder_lstm_layer_1/lstm_cell_1/MatMul_1/a, decoder_lstm_layer_1/lstm_cell_1/Cast_1/mul_1)' with input shapes: [1,640], [512,2048].

Arguments received by LSTMCell.call():
  • inputs=tf.Tensor(shape=(1, 5760), dtype=float32)
  • states=('tf.Tensor(shape=(1, 640), dtype=float32)', 'tf.Tensor(shape=(1

InvalidArgumentError: Exception encountered when calling LSTMCell.call().

{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:CPU:0}} Matrix size-incompatible: In[0]: [1,640], In[1]: [512,2048] [Op:MatMul] name: 

Arguments received by LSTMCell.call():
  • inputs=tf.Tensor(shape=(1, 5760), dtype=float32)
  • states=('tf.Tensor(shape=(1, 640), dtype=float32)', 'tf.Tensor(shape=(1, 640), dtype=float32)')
  • training=False
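
The error comes from a width mismatch. The encoder returns memory and carry states of width 640 (`embedding_size + units` = 128 + 512), while the decoder LSTM was built with 512 units, so its recurrent kernel has shape `[512, 2048]` and cannot multiply a `[1, 640]` state. One possible remedy is to give the decoder 640 units. Another, sketched below, is to project the encoder states down to the decoder width with `Dense` layers. This is only a sketch under assumptions: the class name `SketchDecoder` and the projection layers are hypothetical, random tensors stand in for the real encoder outputs, and it leaves aside the `LSTMCell` step of the original `Decoder`, whose `call` expects a `states` argument rather than `initial_state`.

```python
import tensorflow as tf

class SketchDecoder(tf.keras.layers.Layer):
    """Hypothetical decoder that projects encoder states to its own width."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        # Dense projections so encoder states of any width fit an LSTM with `units` cells
        self.project_h = tf.keras.layers.Dense(units, name='project_memory_state')
        self.project_c = tf.keras.layers.Dense(units, name='project_carry_state')
        self.lstm_layer = tf.keras.layers.LSTM(units,
                                               return_sequences=True,
                                               return_state=True,
                                               name='sketch_decoder_lstm')

    def call(self, inputs, context):
        h, c = context
        h, c = self.project_h(h), self.project_c(c)
        output, h, c = self.lstm_layer(inputs, initial_state=[h, c])
        return output, h, c

# Dummy tensors with the same shapes as the encoder results above
dummy_encoder_output = tf.random.uniform(shape=(1, 10, 5760))
dummy_memory_state = tf.random.uniform(shape=(1, 640))
dummy_carry_state = tf.random.uniform(shape=(1, 640))

sketch_decoder = SketchDecoder(units=512)
output, h, c = sketch_decoder(dummy_encoder_output,
                              context=[dummy_memory_state, dummy_carry_state])
print(output.shape, h.shape, c.shape)
```

In a full model the decoder inputs would normally be the embedded, shifted target tokens rather than the encoder output; the call above only mirrors the shapes used in the failing test.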

### Assemble Components into Model

### Create Data Pipeline

### Fit the model

### Evaluate model Performance

### Making Predictions using the best model trained