This project builds a Transformer model that performs sentiment analysis on data made up of text sentences. The aim is to predict the sentiment class of each text; the dataset used here has three classes, labelled 0, 1 and 2.

# About Transformers
Transformers are deep learning models for processing sequential data. They can handle longer sequences than RNNs and their variants (LSTMs and GRUs) and are far less prone to the vanishing and exploding gradient problems. The reason is that an RNN processes the data at one point in the sequence with respect to the data at the previous point, whereas a Transformer analyses the data at each point with respect to all other points in the sequence, both earlier and later (i.e. in a sentence it finds the importance of each word relative to the words before and after it).<br>
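To make "each word with respect to all other words" concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. It is a simplification, not the Keras layer used later: the queries, keys and values are all the raw input `x` (a real layer applies learned projections first), and the input values are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the maximum for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)       # (seq_len, seq_len): one score per pair of positions
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # hypothetical sentence: 4 tokens, 8-dimensional vectors
out, weights = self_attention(x)
print(weights.round(2))  # row i: how much token i attends to every token, before and after it
```

Every row of `weights` covers the whole sequence, which is what lets a token draw on words on both sides of it in a single step.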
Transformers are most often used for sequence-to-sequence prediction, so they have an Encoder part and a Decoder part. During training, the vectorised word sequences are passed through an embedding layer with two stages: the normal input embedding, which converts the vectorised words into embedding vectors (the standard for natural language processing with deep learning), and a positional embedding, which marks the position of each word in the sequence so as to reveal context (i.e. how the position of a word affects the entire sentence). The result is then passed to the Encoder, in which the model finds the importance of each word in the input sequence relative to the other words in the sequence (this process is known as Attention). Moving to the Decoder, the model first uses a tweaked form of attention to find the importance of each word in the output sequence relative to the words that came before it, meaning that it blocks words in the later part of the sequence (this method is called "Masking" or "Masked Attention"; it is done so that during inference/testing the model doesn't expect any future word in the output and predicts each output word on its own). It then takes the result of this operation and uses normal attention to find the importance of each word in the input sequence (already processed by the Encoder) relative to each word in the output sequence (already processed by the Decoder's masked attention). With this it can establish the relationship between words in the input and output sequences simultaneously.<br>
Since we're doing simple text classification in this project, we use only the embedding layer and the Encoder part of the Transformer, whose output is then given to one or more dense layers. This is enough to learn the relationship between the words in the input sequences and their respective classes.
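Masked attention is not used by the model in this project, but it can be sketched in a few lines of NumPy. The sketch below assumes raw attention scores are already available (here a placeholder matrix of zeros) and hides every later position before the softmax.

```python
import numpy as np

def masked_attention_weights(scores):
    """Softmax over the scores after hiding every position later than the current one."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)  # later positions get a score of -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                        # exp(-inf) is 0, so later positions get no weight
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # hypothetical: every pair of positions has the same raw score
weights = masked_attention_weights(scores)
print(weights)
```

With equal raw scores, the first position can only attend to itself, the second splits its weight over the first two positions, and so on; nothing above the diagonal receives any weight.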
The model architecture for this project is divided into 4 parts:
1. **The embedding layers:** the standard and positional embedding layers, whose summed output feeds into the next layer.
2. **The multi-head attention layer:** this is where attention is performed on the embedded sequences. The layer contains more than one attention head in order to diversify how the model analyses each word relative to the other words in the input sequence. The results from the different heads are combined (concatenated and projected), and the final result is added to the original embedding input and normalised.
3. **The feed-forward layer:** this passes the processed sequence from the attention layer through a normal feed-forward network, which usually has a ReLU activation function.
4. **The final dense layer:** this takes the sequences from the previous feed-forward layer for final classification; it has either a sigmoid or a softmax activation function.
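Parts 2 and 3 can be sketched in NumPy to show how the shapes flow. This is an illustration under simplifying assumptions, not the Keras implementation: the weights are random rather than learned, and biases, dropout and the learned scale and offset of layer normalisation are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # normalise each token vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, num_heads, key_dim):
    embed_dim = x.shape[-1]
    heads = []
    for _ in range(num_heads):
        # each head has its own projections (random here, learned in a real model)
        wq, wk, wv = (rng.normal(scale=0.1, size=(embed_dim, key_dim)) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        weights = softmax(q @ k.T / np.sqrt(key_dim))
        heads.append(weights @ v)
    concat = np.concatenate(heads, axis=-1)  # (seq_len, num_heads * key_dim)
    wo = rng.normal(scale=0.1, size=(num_heads * key_dim, embed_dim))
    return concat @ wo                       # projected back to (seq_len, embed_dim)

def encoder_block(x, num_heads=3, key_dim=32, ff_dim=32):
    embed_dim = x.shape[-1]
    attn = multi_head_attention(x, num_heads, key_dim)
    out1 = layer_norm(x + attn)              # residual connection, then normalisation
    w1 = rng.normal(scale=0.1, size=(embed_dim, ff_dim))
    w2 = rng.normal(scale=0.1, size=(ff_dim, embed_dim))
    ffn = np.maximum(out1 @ w1, 0) @ w2      # feed-forward network with ReLU
    return layer_norm(out1 + ffn)            # second residual connection and normalisation

x = rng.normal(size=(5, 32))  # hypothetical: 5 embedded tokens of size 32
out = encoder_block(x)
print(out.shape)  # (5, 32)
```

The block returns the same shape it receives, which is why such blocks can be stacked or, as in this project, followed directly by pooling and dense layers.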

In [None]:
import pandas as pd, numpy as np # import necessary libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load the data (this data was made from 1.6 million tweets that were classified as either positive or negative)
data = pd.read_csv('drive/MyDrive/Portfolio resources/Sentiment analysis & explication dataset/product_review.csv')
data = data.dropna()
data = data.sample(frac=1)
data.head()

Unnamed: 0.1,Unnamed: 0,text,overall
110262,259810,After several months of use we're looking for ...,0
595029,85357,Need to contact Megan. Megan not replying. No ...,0
48944,2022-03-09 14:03:47,"The app is good, but I can't play my liked son...",2
371353,1507730,"I think Defender would have been better, I've ...",1
110056,106578,"Cheap and flimsy, Served the purpose, but I ca...",2


In [None]:
data['overall'].value_counts()

0    200000
2    199999
1    199999
Name: overall, dtype: int64

In [None]:
data.info() # data info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 599998 entries, 110262 to 575494
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  599998 non-null  object
 1   text        599998 non-null  object
 2   overall     599998 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 18.3+ MB


In [None]:
data.describe() # describe data

Unnamed: 0,overall
count,599998.0
mean,0.999998
std,0.816498
min,0.0
25%,0.0
50%,1.0
75%,2.0
max,2.0


In [None]:
# import libraries for model development
import tensorflow as tf,keras
from sklearn.model_selection import train_test_split
from keras import layers
from keras.callbacks import ModelCheckpoint
from keras.layers import TextVectorization

In [None]:
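import string
# every punctuation character except the apostrophe
stg = string.punctuation.replace("'",'')
def custom_standardization(input_string):
    lowercased = tf.strings.lower(input_string) # lowercase the text
    no_newlines = tf.strings.regex_replace(lowercased, "\n", " ") # replace newlines with spaces
    return tf.strings.regex_replace(no_newlines, f"([{stg}])", r"") # remove punctuation, keeping apostrophes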
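For a quick look at what this standardisation does to a review, here is a plain-Python sketch of the same three steps. It is an approximation for illustration only: the name `standardize` and the sample text are hypothetical, and it builds the character class with `re.escape`, which is a small variation on the pattern in the cell above.

```python
import re
import string

# every punctuation character except the apostrophe, as in the cell above
PUNCT = string.punctuation.replace("'", "")
PUNCT_RE = re.compile(f"[{re.escape(PUNCT)}]")

def standardize(text: str) -> str:
    lowered = text.lower()                   # lowercase
    no_newlines = lowered.replace("\n", " ")  # newlines become spaces
    return PUNCT_RE.sub("", no_newlines)     # drop punctuation, keep apostrophes

print(standardize("Great app!\nCan't stop using it..."))
# great app can't stop using it
```

Keeping the apostrophe means contractions such as "can't" stay as single tokens instead of collapsing into "cant".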

In [None]:
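import pickle
# load the saved TextVectorization config and weights from disk
with open("drive/MyDrive/Portfolio resources/product_rev_vectorizer_weights", "rb") as f:
    from_disk = pickle.load(f)
#del from_disk['config']['encoding']
new_v = TextVectorization.from_config(from_disk['config']) # rebuild the layer from its saved config
new_v.set_weights(from_disk['weights']) # restore the saved weights (the learned vocabulary)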

In [None]:
X = new_v(data['text'])
y = data['overall']
X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y, test_size=0.1, random_state=42)

In [None]:
# creating the Embedding layer
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim,**kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim) # standard (token) embedding layer
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim) # second embedding layer, used for the positional embedding

    def call(self, x):
        maxlen = tf.shape(x)[-1] # getting the sequence length
        positions = tf.range(start=0, limit=maxlen, delta=1) # position indices 0, 1, ..., sequence length - 1
        positions = self.pos_emb(positions) # looking up the embedding vector of each position
        x = self.token_emb(x) # looking up the embedding vector of each token in the vectorised sequences
        return x + positions # returning the sum of the token & positional embeddings
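The lookup-and-sum performed by this layer can be mimicked with NumPy indexing. The sizes, token ids and table values below are hypothetical toy values, and the tables are random rather than learned.

```python
import numpy as np

vocab_size, maxlen, embed_dim = 10, 4, 3  # hypothetical toy sizes
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token id
pos_table = rng.normal(size=(maxlen, embed_dim))        # one row per position

batch = np.array([[5, 2, 7, 0],
                  [1, 1, 3, 0]])  # two vectorised sequences of token ids

token_vectors = token_table[batch]                        # (2, 4, 3)
position_vectors = pos_table[np.arange(batch.shape[-1])]  # (4, 3)
embedded = token_vectors + position_vectors               # broadcast over the batch axis
print(embedded.shape)  # (2, 4, 3)
```

In the second sequence the token id 1 appears at positions 0 and 1; after the sum the two occurrences have different vectors, which is how position information reaches the attention layer.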


In [None]:
# creating the transformer layer
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1,**kwargs):
        super().__init__(**kwargs)
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim) # multi-head attention layer
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )# feed forward layer
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6) # first normalisation layer, applied to the sum of the attention output & the embedding layer output
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6) # second normalisation layer, applied to the sum of the feed forward output & its input
        self.dropout1 = layers.Dropout(rate) # dropout layer1
        self.dropout2 = layers.Dropout(rate) # dropout layer2

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs) # self-attention: the inputs serve as the query & the value (when no key is given, the layer uses the value as the key)
        attn_output = self.dropout1(attn_output, training=training) # dropout on the attention output to reduce overfitting (the training flag is passed through so dropout is only active during training)
        out1 = self.layernorm1(inputs + attn_output) # adding the attention output to its inputs (the embedding layer result) & then normalising the sum
        ffn_output = self.ffn(out1) # giving the normalised result to the feed forward layer
        ffn_output = self.dropout2(ffn_output, training=training) # dropout on the feed forward output to reduce overfitting (again only active during training)
        return self.layernorm2(out1 + ffn_output) # adding the feed forward output to its inputs (the normalised attention result) & then normalising the sum


In [None]:
# building the entire model
embed_dim = 32  # Embedding size for each token
num_heads = 3  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer
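maxlen=101  # Length of each vectorised input sequence
vocab_size=25000  # Size of the token vocabulary (input_dim of the token embedding)
inputs = layers.Input(shape=(maxlen,)) # input layer for the vectorised sequences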
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x) # averaging the transformer block output over the sequence positions
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x) # additional dense layer
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(3, activation="softmax")(x) # final dense layer
model = keras.Model(inputs=inputs, outputs=outputs)
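As a rough sanity check on the size of this model, the trainable parameters can be counted by hand from the hyperparameters above. The arithmetic assumes the usual Keras parameter layout (a bias on every dense projection, and a scale and an offset per layer normalisation); `model.summary()` remains the authoritative figure.

```python
embed_dim, num_heads, ff_dim = 32, 3, 32
maxlen, vocab_size = 101, 25000
key_dim = embed_dim  # as passed to MultiHeadAttention above

token_emb = vocab_size * embed_dim
pos_emb = maxlen * embed_dim
# query, key and value projections: embed_dim -> num_heads * key_dim, each with a bias
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
# output projection: num_heads * key_dim -> embed_dim, with a bias
attn_out = num_heads * key_dim * embed_dim + embed_dim
# the two dense layers of the feed forward network
ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
layer_norms = 2 * (2 * embed_dim)  # scale and offset for each normalisation layer
dense_20 = embed_dim * 20 + 20
dense_out = 20 * 3 + 3

total = token_emb + pos_emb + qkv + attn_out + ffn + layer_norms + dense_20 + dense_out
print(total)  # 818803
```

Almost all of the parameters sit in the token embedding table (800,000 of them), so the vocabulary size dominates the model size far more than the attention or dense layers do.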

In [None]:
# creating a model checkpoint callback
filepath="drive/MyDrive/Collab Models/review_sent_model.hdf5"
checkpoint = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]

In [None]:
# training the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    X_train, y_train, batch_size=512, epochs=5, validation_data=(X_test, y_test),callbacks=callbacks_list)

Epoch 1/5
Epoch 1: val_loss improved from inf to 0.62220, saving model to drive/MyDrive/Collab Models/review_sent_model.hdf5
Epoch 2/5
Epoch 2: val_loss did not improve from 0.62220
Epoch 3/5
Epoch 3: val_loss improved from 0.62220 to 0.61038, saving model to drive/MyDrive/Collab Models/review_sent_model.hdf5
Epoch 4/5
Epoch 4: val_loss did not improve from 0.61038
Epoch 5/5
Epoch 5: val_loss did not improve from 0.61038
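The split and batch sizes behind this training run follow from the numbers already shown: the 599,998 rows reported by `data.info()`, `test_size=0.1` and `batch_size=512`. A short sketch of the arithmetic, assuming scikit-learn's behaviour of rounding a fractional test size up:

```python
import math

n_rows = 599_998   # rows reported by data.info()
test_size = 0.1
batch_size = 512

n_test = math.ceil(n_rows * test_size)  # a fractional test size is rounded up
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller than 512
print(n_train, n_test, steps_per_epoch)  # 539998 60000 1055
```

The 60,000 test rows match the total support in the classification report further down.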


In [None]:
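# creating function for model prediction
def predict(dt,mdl):
    # a DataFrame holds raw text, so its 'text' column is vectorised first
    if isinstance(dt, pd.DataFrame):
        x = new_v(dt['text'].values)
        return mdl.predict(x)
    # anything else is assumed to be already vectorised
    return mdl.predict(dt)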

In [None]:
# predicting text values
pred = predict(X_test,model)
pred = np.argmax(pred,axis=1)
pred



array([0, 2, 2, ..., 0, 2, 1])

In [None]:
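# show prediction metric
from sklearn.metrics import classification_report
# note: classification_report expects (y_true, y_pred); with the predictions passed first,
# the precision and recall columns are swapped and support counts the predicted labels
print(classification_report(pred,y_test))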

              precision    recall  f1-score   support

           0       0.82      0.70      0.75     23507
           1       0.78      0.84      0.81     18703
           2       0.61      0.68      0.64     17790

    accuracy                           0.74     60000
   macro avg       0.74      0.74      0.74     60000
weighted avg       0.75      0.74      0.74     60000
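scikit-learn's `classification_report` takes the true labels first and the predictions second, while the cell above passes `pred` first. The plain-Python sketch below, on a handful of hypothetical labels that are not taken from the dataset, shows what reversing the order does: precision and recall trade places. The helper `precision_recall` is illustrative, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from raw label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the class was predicted
    actual = sum(t == label for t in y_true)     # how often the class really occurs
    return tp / predicted, tp / actual

# hypothetical labels, not taken from the dataset
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print(precision_recall(y_true, y_pred, label=0))  # correct order: precision 1.0, recall 2/3
print(precision_recall(y_pred, y_true, label=0))  # reversed order: the two values swap
```

Accuracy and the per-class F1 scores are unaffected by the order, so only the precision, recall and support columns of the report above need to be read with the swap in mind.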

