## Transformers For Text Classification

https://blog.paperspace.com/transformers-text-classification/

In [10]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import MultiHeadAttention, LayerNormalization, Dropout, Layer
from tensorflow.keras.layers import Embedding, Input, GlobalAveragePooling1D, Dense
from tensorflow.keras.datasets import imdb 
from tensorflow.keras.models import Sequential, Model

import numpy as np
import warnings
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)


**Creating Transformer blocks and positional embedding:**

We will now proceed to construct the transformer block, which follows a similar build as discussed in the previous section of the article. While I am going to utilize the multi-head attention layer that is available in the TensorFlow/Keras deep learning frameworks, you can modify the code and build your own custom multi-head layer to grant further control and access to the numerous parameters involved.

In the `__init__` method of the transformer block, we initialize the required components: the multi-head attention layer, the feed-forward network, two layer normalization layers, and two dropout layers. In the `call` method, we connect them as discussed in the architecture overview section of the transformer block: self-attention, dropout, and a residual connection followed by layer normalization, then the feed-forward network with the same dropout, residual, and normalization pattern.

In [19]:
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = Sequential(
            [Dense(ff_dim, activation="relu"), 
             Dense(embed_dim),]
        )
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
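
The `call` method above adds each sub-layer's output to its input and then layer-normalizes the sum. The NumPy sketch below illustrates only that pattern and is not the Keras implementation. The helper names `layer_norm` and `residual_block` are hypothetical, the learned scale and offset of `LayerNormalization` are omitted, and a simple doubling function stands in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def residual_block(x, sublayer):
    # Residual connection followed by normalization: LayerNorm(x + sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy input: batch of 2 sequences, 3 tokens each, 4 features per token.
x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
out = residual_block(x, lambda t: 2.0 * t)  # stand-in sub-layer
print(out.shape)  # (2, 3, 4)
```

The output keeps the input shape, which is what allows the block to be stacked or followed by pooling.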

In the next code block, we define another class that manages the token and positional embeddings. It creates two embedding layers: one for the tokens and one for the token index positions. Both are learned `Embedding` layers. The class has two methods. In `__init__`, we initialize the token and positional embeddings; in `call`, we look up the embedding of each token, look up the embedding of each position from 0 to the sequence length, and add the two together. With this step completed, we can proceed to prepare the dataset and develop the transformer model for text classification.

In [20]:
class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)

        return x + positions
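
To make the shapes in `call` concrete, here is a small NumPy sketch of the same lookup-and-add step. It is only an illustration under assumed toy sizes: the tables are random stand-ins for the learned embedding weights, and the names `token_table` and `pos_table` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 10, 4, 3  # toy sizes, not the article's

# Randomly initialized stand-ins for the two learned embedding tables.
token_table = rng.normal(size=(vocab_size, embed_dim))
pos_table = rng.normal(size=(maxlen, embed_dim))

x = np.array([[5, 2, 5, 0],
              [1, 1, 1, 1]])  # (batch, seq_len) token ids

tokens = token_table[x]                        # (batch, seq_len, embed_dim)
positions = pos_table[np.arange(x.shape[-1])]  # (seq_len, embed_dim)
out = tokens + positions                       # broadcast over the batch axis

print(out.shape)  # (2, 4, 3)
```

Token 5 appears at positions 0 and 2 of the first sequence. Its token vector is identical in both places, but the summed output differs because a different position vector is added, which is how the model can tell the two occurrences apart.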

**Preparing the data:**

For this particular task, we will use the IMDB dataset available with TensorFlow and Keras. The dataset contains a total of 50,000 reviews, split into 25,000 training sequences and 25,000 testing sequences, with an even split of 50% positive and 50% negative reviews. Each review must be represented as a sequence of integers before it can be fed to the transformer architecture. The Keras version of the dataset already provides this encoding, with each word replaced by an integer index, and by setting `num_words=vocab_size` we keep only the 20,000 most frequent words.

In [21]:
vocab_size = 20000
maxlen = 200

(x_train, y_train), (x_val, y_val) = imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")


25000 Training sequences
25000 Validation sequences


In the next code snippet, we look at the labels assigned to the first five testing sequences. These labels give us an idea of what to expect from the data. Later in this section, we will make a prediction on this data to check how well our model performs.

In [22]:
y_val[:5]

array([0, 1, 1, 0, 1])

After viewing the labels for the first five elements, we pad the sequences for both the training and validation data, as shown in the code block below, so that every review has exactly `maxlen` (200) tokens. Once this procedure is completed, we will start developing our transformer model.

In [23]:
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = tf.keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
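
With its default arguments, `pad_sequences` pads with zeros at the front of short sequences and drops items from the front of long ones. The pure-Python sketch below mimics that default behavior for a single sequence; `pad_pre` is a hypothetical helper for illustration, not part of Keras, and it assumes `maxlen` is a positive integer.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep only the last `maxlen` items (truncate from the front) ...
    seq = list(seq)[-maxlen:]
    # ... then left-pad with `value` until the length is `maxlen`.
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

For reviews longer than 200 tokens, this means the model sees only the final 200 tokens of the review.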

**Developing the model:**

There are several ways to implement a transformer model for the text classification task. It is common practice to use separate encoder and decoder classes. In this section, we will use a fairly simple method that relies on a single transformer block. We will declare the embedding dimension for each token, the number of attention heads to use, and the size of the hidden layer of the feed-forward network in the transformer. Then, with the help of the transformer block and the positional embedding class that we created earlier, we will develop the model.

Note that we are using both the Sequential and Functional APIs: the feed-forward network inside the transformer block is a Sequential model, while the overall classifier is built with the Functional API, which gives us more control over the model architecture. The input is a sequence of integer token indices representing a review, for which we create an embedding and pass it through a transformer block. Finally, we have a global average pooling layer, a dropout, and dense layers, the last of which returns the probabilities of the two classes. We can use the argmax function in NumPy to obtain the predicted class. Below is the code block to develop the model.

In [24]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer


inputs = Input(shape=(maxlen,))

embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)

transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)

x = transformer_block(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(20, activation='relu')(x)
x = Dropout(0.1)(x)

outputs = Dense(2, activation='softmax')(x)

model = Model(inputs=inputs, outputs=outputs)
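
The transformer block returns one vector per token, while the dense layers expect one vector per review. `GlobalAveragePooling1D` bridges the two by averaging over the token axis. A minimal NumPy sketch of that reduction, using made-up toy values:

```python
import numpy as np

# Toy transformer output: 1 sequence, 3 tokens, 2 features per token.
x = np.array([[[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]]])

pooled = x.mean(axis=1)  # average over the token axis
print(pooled)            # [[3. 4.]]
```

In the model above, the same reduction turns a `(batch, 200, 32)` tensor into a `(batch, 32)` tensor.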

**Compiling and fitting the model:**

In the next step, we compile the transformer model that we have constructed. We use the Adam optimizer and the sparse categorical cross-entropy loss function, which suits integer class labels such as the 0/1 labels here, and we track accuracy as the metric. We then fit the model and train it for two epochs. The code snippet below shows how these operations are performed.

In [25]:
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=64,
                    epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2
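
As a rough illustration of what the loss measures, the per-example value of sparse categorical cross-entropy is the negative log of the probability that the model assigns to the true class. The sketch below uses made-up probabilities and a hypothetical helper with the same name as the loss. It ignores details such as averaging over the batch and numerical clipping.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    # The loss for one example is the negative log of the probability
    # assigned to the true class index.
    return -math.log(probs[label])

confident_right = sparse_categorical_crossentropy(1, [0.1, 0.9])
confident_wrong = sparse_categorical_crossentropy(0, [0.1, 0.9])
print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```

A confident correct prediction gives a loss close to zero, while a confident wrong prediction is penalized heavily.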


Once training is complete, we can save the weights that were learned during the fitting process, as shown below.


In [26]:
model.save_weights("predict_class.h5")

**Evaluating the model:**

While the fitting process gives us a rough idea of how well the model trained, it is still essential to evaluate the model and analyze its performance on the testing data. Hence, we will evaluate the testing sequences alongside their labels. The model predicts a label for each testing sequence, and these predictions are compared to the original labels. We then receive the final loss and accuracy values of the model.

In [27]:
results = model.evaluate(x_val, y_val, verbose=2)

782/782 - 6s - loss: 0.3299 - accuracy: 0.8651 - 6s/epoch - 8ms/step


In [28]:
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

loss: 0.330
accuracy: 0.865


Recall that earlier in this section we printed the values of the first five testing labels. We can now make a prediction with our trained model by using the predict function. The code below predicts the class of the first testing sequence; the result from my trained model is 0, which matches the first label printed earlier.

In [29]:
np.argmax(model.predict(x_val[:1]))

0
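
The call above works because only one sequence is passed in. For several sequences at once, `np.argmax` needs an `axis` argument, otherwise it flattens the whole array and returns a single index. The sketch below shows the difference on hypothetical probabilities, which are placeholders and not output from the trained model.

```python
import numpy as np

# Hypothetical softmax outputs for three reviews (not real model output).
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.1, 0.9]])

# Without `axis`, np.argmax flattens the array and returns a single index.
print(np.argmax(probs))           # 5
# With axis=-1, it returns one predicted class per row.
print(np.argmax(probs, axis=-1))  # [0 1 1]
```

Applied to the model, `np.argmax(model.predict(x_val[:5]), axis=-1)` would give one predicted label per review, which can then be compared with `y_val[:5]`.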