seq2seq architecture with Python's Keras library for deep learning

Bidirectional Encoder Representations from Transformers (BERT) is a text representation technique, like word embeddings. It is built on the Transformer encoder and reads each sentence in both directions at once, so the representation of a word depends on the words to its left and to its right.

# library

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert
import pandas as pd
import numpy as np
import re

# read data

In [2]:
movie_reviews = pd.read_csv("E:\\gitlab\\machine-learning\\NLP\\dataset\\IMDB Dataset.csv")
movie_reviews.isnull().values.any()
movie_reviews.shape

(50000, 2)

# text manipulation

In [3]:
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    # Remove single characters from the start
    sentence = re.sub(r'^[a-zA-Z]\s+', ' ', sentence) 
    # Remove spaces from the beginning
    sentence = re.sub(r"^\s+", "", sentence)
    # Remove spaces from the end
    sentence = re.sub(r"\s+$", "", sentence)
    return sentence

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

In [4]:
reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

In [5]:
print(movie_reviews.columns.values)

['review' 'sentiment']


In [6]:
movie_reviews.sentiment.unique()

array(['positive', 'negative'], dtype=object)

# labeling

In [7]:
y = movie_reviews['sentiment']
y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

# Creating a BERT Tokenizer

Before the reviews can be used as input to train a text classification model, they need to be tokenized. Tokenization refers to dividing a sentence into individual tokens. The BERT tokenizer splits text into words and, for words that are missing from its vocabulary, into smaller subword pieces. To tokenize our text, we will be using the BERT tokenizer.
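To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting, the general scheme behind WordPiece-style tokenizers. The function name `wordpiece_tokenize` and the six-entry `toy_vocab` are assumptions made up for illustration; the real tokenizer also normalizes text and uses a vocabulary of many thousands of entries, so treat this as a simplified picture rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that do not start the word carry a "##" prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the unknown token.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"don", "##t", "be", "so", "judgment", "##al"}

sentence = "dont be so judgmental"
pieces = [p for w in sentence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(pieces)  # ['don', '##t', 'be', 'so', 'judgment', '##al']
```

Four words become six pieces, which is why a tokenized review can contain more ids than it has words.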

# example BERT

In [11]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In the script above we:

1. Alias the FullTokenizer class from the bert.bert_tokenization module as BertTokenizer.
2. Create a BERT layer by loading the pretrained model through hub.KerasLayer.
    The trainable parameter is set to False, which means that we will not be training the BERT weights.
3. Read the path of the BERT vocabulary file from the layer; .numpy() returns it as a bytes value.
4. Read the do_lower_case flag, which tells the tokenizer whether to lowercase the text.
5. Pass the vocabulary_file and to_lower_case variables to the BertTokenizer constructor to build the tokenizer.

In [12]:
# tokenize a sample sentence
tokenizer.tokenize("dont be so judgmental")
# get the ids of the tokens with the tokenizer's convert_tokens_to_ids() method
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so judgmental"))

[2123, 2102, 2022, 2061, 8689, 2389]

In [21]:
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))

# tokenize all the reviews in the input dataset

In [22]:
tokenized_reviews = [tokenize_reviews(review) for review in reviews]

# Preparing Data For Training

To train the model, the input sentences in a batch must be of equal length. 
One way to achieve this is to pad every shorter sentence with 0s up to the length of the longest sentence in the dataset. 
However, this produces a sparse matrix containing a large number of 0s. 
The other way is to pad sentences within each batch. Since we will be training the model in batches, 
we can pad each batch locally, up to the length of its own longest sentence.
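The difference between the two padding strategies can be counted on a toy example. This is a minimal sketch under stated assumptions: the helper `padding_needed` and the six review lengths are made up for illustration and are not taken from the IMDB data.

```python
def padding_needed(lengths, batch_size):
    """Count the 0s added when each batch is padded to its own longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical token counts of six reviews.
lengths = [2, 3, 9, 2, 10, 3]

# Global padding: one batch holding everything, padded to the longest review.
global_padding = padding_needed(lengths, batch_size=len(lengths))

# Per-batch padding on length-sorted data, two reviews per batch.
batch_padding = padding_needed(sorted(lengths), batch_size=2)

print(global_padding, batch_padding)  # 31 1
```

Sorting by length before batching is what keeps reviews of similar size together, so most batches need little or no padding.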

In [23]:
# each entry holds the tokenized review, the label of the review
# and the length of the review
reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]

In [25]:
# shuffle the reviews
import random
random.shuffle(reviews_with_len)

In [26]:
# sort the data by the length of the reviews
# (the third element of each entry)
reviews_with_len.sort(key=lambda x: x[2])

In [27]:
#remove the length attribute from all the reviews
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]
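The shuffle-then-sort steps can be tried on a handful of made-up entries. The sketch below assumes hypothetical `(token_ids, label)` pairs in `toy`; the seed is fixed only so the run is repeatable. Because Python's sort is stable, entries of equal length keep the random order produced by the shuffle, so the result is ordered by length but mixed within each length.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable

# Hypothetical (token_ids, label) pairs standing in for tokenized reviews.
toy = [([1, 2], 1), ([3, 4], 0), ([5, 6, 7], 1), ([8, 9], 0), ([10, 11, 12], 0)]

# Attach the length, shuffle, sort by length, then drop the length again.
with_len = [[ids, label, len(ids)] for ids, label in toy]
random.shuffle(with_len)
with_len.sort(key=lambda row: row[2])
sorted_pairs = [(row[0], row[1]) for row in with_len]

lengths = [len(ids) for ids, _ in sorted_pairs]
print(lengths)  # [2, 2, 2, 3, 3]
```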

# convert the dataset for training TensorFlow 2.0 models

In [28]:
# build a tf.data.Dataset from the sorted (review, label) pairs
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

# pad the dataset for each batch

The batch size we are going to use is 32, which means that the weights of the neural network will be updated after every 32 reviews. The reviews are padded locally, with respect to their own batch.

In [29]:
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

In [30]:
# print the first batch and see how padding has been applied to it
next(iter(batched_dataset))

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,  2305,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2023,  3185,  2003,  6659,  2021,  2009,  2038,  2070,  2204,
    

The shape (32, 21) shows that the longest review in the first batch contains 21 tokens.

The shorter reviews, such as the first five, have 0s appended so that their total length is also 21. The padding for the next batch will be different, depending on the length of the longest review in that batch.

# divide the dataset into test and training sets

In [32]:
import math
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
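# shuffle() returns a new dataset, so the result must be kept; a fixed order
# across iterations keeps the take()/skip() split below disjoint
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)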
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

1. Find the total number of batches by dividing the total number of records by 32 and rounding up.
2. Set aside 10% of the batches for testing, using the take() method of the batched_dataset object to store them in the test_data variable.
3. Store the remaining batches in the train_data variable for training, using the skip() method.
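The split arithmetic can be checked by hand. This sketch uses only the 50,000 reviews reported by movie_reviews.shape and the batch size of 32; the variable names are chosen for illustration.

```python
import math

num_reviews = 50000  # from movie_reviews.shape
batch_size = 32

total_batches = math.ceil(num_reviews / batch_size)  # the last batch is partial
test_batches = total_batches // 10
train_batches = total_batches - test_batches

print(total_batches, test_batches, train_batches)  # 1563 156 1407
```

The 156 test batches correspond to the step count that Keras prints during evaluation.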

# Creating the Model

The model consists of three parallel 1D convolutional layers whose pooled outputs feed two dense layers.

In [33]:
class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        # In the constructor of the class, we initialize some attributes
        # with default values. These values will be replaced later on by
        # the values passed when the object of the TEXT_MODEL class is created.
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        # the three convolutional layers above use kernel sizes of
        # 2, 3, and 4, respectively
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")

    # inside the call() function, global max pooling is applied to
    # the output of each of the convolutional layers.
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 

        # the pooled outputs of the three convolutional layers are concatenated
        # and fed to the first dense layer. The second (last) dense layer
        # predicts the sentiment; with only 2 classes it has a single sigmoid unit.
        # If you have more classes in the output, update the
        # model_output_classes argument accordingly.
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

In the constructor of the class, we initialize some attributes with default values. These values will be replaced later on by the values passed when the object of the TEXT_MODEL class is created.

The three convolutional layers are initialized with kernel sizes of 2, 3, and 4, respectively.

Inside the call() function, global max pooling is applied to the output of each of the convolutional layers.

The pooled outputs of the three convolutional layers are concatenated and fed to the first dense layer.

The second dense layer predicts the output sentiment. Since there are only 2 classes, it has a single unit with a sigmoid activation.
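The shapes flowing through the three branches can be followed with a little bookkeeping. This is a minimal sketch, assuming a batch of 32 reviews padded to 21 tokens (the first batch printed earlier) and 100 filters per layer (the CNN_FILTERS value used in this notebook). The helper `conv1d_valid_length` is a name made up for illustration; it encodes the rule that a stride-1 convolution with "valid" padding shortens a sequence by kernel_size - 1.

```python
def conv1d_valid_length(sequence_length, kernel_size):
    """Output length of a stride-1 Conv1D with padding="valid"."""
    return sequence_length - kernel_size + 1


batch_size, sequence_length = 32, 21
cnn_filters = 100
kernel_sizes = (2, 3, 4)

# Each branch: (batch, steps, filters) after the convolution.
branch_shapes = [
    (batch_size, conv1d_valid_length(sequence_length, k), cnn_filters)
    for k in kernel_sizes
]

# Global max pooling keeps one value per filter, whatever the number of steps.
pooled_shapes = [(batch_size, cnn_filters) for _ in branch_shapes]

# Concatenating on the last axis gives 3 * cnn_filters features per review.
concatenated_shape = (batch_size, sum(shape[1] for shape in pooled_shapes))

print(branch_shapes)       # [(32, 20, 100), (32, 19, 100), (32, 18, 100)]
print(concatenated_shape)  # (32, 300)
```

Because the pooling removes the time axis, the dense layers see the same input width for every batch, even though each batch is padded to a different length.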

# hyperparameters

In [34]:
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

# create the model with the hyperparameters

In [37]:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

# compile model

In [38]:
if OUTPUT_CLASSES == 2: # binary classification
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else: # multiclass classification
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

# training data

In [39]:
history = text_model.fit(train_data, epochs=NB_EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x2259c557788>

# testing data / evaluate data

In [42]:
results = text_model.evaluate(test_data)
print(results)

156/Unknown - 7s 43ms/step - loss: 0.6046 - accuracy: 0.8874
[0.6045952109166254, 0.8874199]
