# Creating a mind map before getting into action

Having learned from the mistakes made before this notebook, let's plan everything we need to do, so that the NMT (neural machine translation) model is built with all the proper components and correctly preprocessed data.

**Data Preprocessing Actions**:
1. Shift the target texts.
2. Pad the text with proper annotations.
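
To make the two preprocessing actions concrete, here is a minimal pure-Python sketch. It assumes that "shifting" refers to the usual teacher-forcing setup, where the decoder input is the target sentence with a start marker in front and the expected output is the same sentence moved one step to the left, ending in an end marker. The marker strings, the helper name `make_decoder_pair`, and the sample tokens are made up for illustration and are not taken from the notebook.

```python
# Hypothetical marker tokens; the notebook may use different ones.
START, END, PAD = "[start]", "[end]", "[pad]"


def make_decoder_pair(target_tokens, max_len):
    """Return (decoder_input, decoder_target), each padded or truncated to max_len."""
    full = [START] + list(target_tokens) + [END]
    decoder_input = full[:-1]   # begins with START and never contains END
    decoder_target = full[1:]   # the same sequence shifted one step left, ends with END

    def pad(seq):
        seq = seq[:max_len]
        return seq + [PAD] * (max_len - len(seq))

    return pad(decoder_input), pad(decoder_target)


dec_in, dec_out = make_decoder_pair(["guten", "morgen"], max_len=5)
print(dec_in)   # ['[start]', 'guten', 'morgen', '[pad]', '[pad]']
print(dec_out)  # ['guten', 'morgen', '[end]', '[pad]', '[pad]']
```

At every position the decoder receives the token at index `i` of `dec_in` and is trained to predict the token at index `i` of `dec_out`.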

**Components of the model**:
1. Create *text vectors* of both source and target language separately using TensorFlow `TextVectorization` layer.
2. Create *embedding layers* of both source and target language separately using TensorFlow `Embedding` layer.
3. Create a class known as *Encoder* which inherits from the TensorFlow class `Layer`. In this subclass we are going to develop the encoder architecture of the NMT. (The architecture will be discussed while we are developing the layer.)
4. Create a class known as *Decoder* which inherits from the TensorFlow class `Layer`. In this subclass we are going to develop the decoder architecture of the NMT. (The architecture will be discussed while we are developing the layer.)
5. After creating all the components, we are going to create a class called *EncoderDecoder* which will inherit from the TensorFlow class `Model`. In this class we will assemble all of the layers prepared in the steps above and then define a custom `call` method that lets us train the decoder the way we intend.
6. Create an instance of the *EncoderDecoder* class with all the parameters passed to its constructor and compile the model with `tf.keras.losses.SparseCategoricalCrossentropy()` as the loss function and `tf.keras.optimizers.RMSprop()` as the optimizer.
7. We will then create the *train_dataset* and *valid_dataset* using TensorFlow `tf.data` API for better performance of the model training.
8. We will fit the model on the training data for 5 epochs, using the validation dataset and *callbacks* such as `tf.keras.callbacks.ModelCheckpoint`, `tf.keras.callbacks.EarlyStopping`, and `tf.keras.callbacks.ReduceLROnPlateau`.
9. After training, we will evaluate the model with several metrics, run predictions on unseen data, and observe the quality of the predictions.
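
As a rough picture of what step 1 produces, the sketch below mimics the idea behind a text-vectorization layer in plain Python: build a vocabulary ordered by frequency, then map each sentence to a fixed-length list of integer ids. It follows the default convention of `TextVectorization` that id 0 is padding and id 1 stands for out-of-vocabulary tokens, but it is a simplification: it only lowercases and splits on whitespace, and its tie-breaking between equally frequent tokens is an arbitrary choice of this sketch. The function names and sample sentences are hypothetical.

```python
from collections import Counter

PAD_ID = 0  # id reserved for padding
UNK_ID = 1  # id reserved for out-of-vocabulary tokens


def build_vocab(sentences, max_tokens=None):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # Ties are broken alphabetically so the result is deterministic (sketch-only choice).
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    if max_tokens is not None:
        ordered = ordered[: max(max_tokens - 2, 0)]  # leave room for the two reserved ids
    return {tok: i + 2 for i, tok in enumerate(ordered)}


def vectorize(sentence, vocab, seq_len):
    """Convert a sentence to exactly seq_len ids, truncating or padding as needed."""
    ids = [vocab.get(tok, UNK_ID) for tok in sentence.lower().split()][:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))


vocab = build_vocab(["the cat sat", "the dog sat down"])
print(vectorize("the bird sat", vocab, seq_len=5))  # [3, 1, 2, 0, 0]
```

These integer ids are what an embedding layer (step 2) would then look up to obtain dense vectors.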

# Importing all the necessary libraries

In [5]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import os
import tarfile

# Start the project

Like every machine learning project, this one has two phases. We will first work on *Data Preprocessing* and then on *Model Development*.

## Data Preprocessing

In [15]:
# Constants for the project root, the local archive path, and the download URL
START_PATH = str(os.getcwd()) + '/'
COMP_DATA_PATH = os.path.join(START_PATH, 'wiki-titles.tgz')
DATA_URL = 'https://www.statmt.org/wmt14/wiki-titles.tgz'
COMP_DATA_PATH

'/Users/klsharma22/PycharmProjects/EncoderDecoderExp/wiki-titles.tgz'

In [4]:
!wget https://www.statmt.org/wmt14/wiki-titles.tgz

--2024-04-04 16:29:22--  https://www.statmt.org/wmt14/wiki-titles.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.32.28
Connecting to www.statmt.org (www.statmt.org)|129.215.32.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8168057 (7.8M) [application/x-gzip]
Saving to: ‘wiki-titles.tgz’


2024-04-04 16:29:43 (413 KB/s) - ‘wiki-titles.tgz’ saved [8168057/8168057]



In [22]:
if not os.path.exists(COMP_DATA_PATH):
    print("File does not exist")
else:
    print("The file exists")

The file exists
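
The existence check and the download can also be combined, so that re-running the notebook does not fetch the archive again. The following is only a sketch using the standard library instead of `wget`; the helper name `download_if_missing` and its `fetch` parameter are hypothetical and were introduced here to keep the function easy to test.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download url to dest_path unless the file already exists.

    Returns True if a download was performed, False if the file was already there.
    """
    if os.path.exists(dest_path):
        return False
    fetch(url, dest_path)
    return True


# Possible use with the constants defined earlier in this notebook:
# download_if_missing(DATA_URL, COMP_DATA_PATH)
```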


In [29]:
# Open the gzip-compressed archive and extract its contents into the project directory
with tarfile.open(COMP_DATA_PATH, 'r:gz') as tar:
    tar.extractall(path=START_PATH)