<a href="https://colab.research.google.com/github/packetech/Arcada/blob/main/sentiment_analysis_arcada_v14_ndifreke_okorie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TASK1

### Chosen topic and models: Sentiment Analysis using LSTM and Bi-directional LSTM models

# TASK2 

### Implementing the model

# [1] Introduction

Dataset: 
The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. It is a dataset for binary sentiment classification. We are going to build and train RNN models that predict whether a user's review is positive or negative.

Since it's a large dataset, we are going to use only 3000 samples to train our models, so that we can iterate several times and tune them faster.




#### Installing some modules 

In [1]:
# Install Tensorflow version greater than 2.0

#!pip uninstall tensorflow
#!pip install tensorflow==2.7.0

In [2]:
# Load libraries

import pandas as pd 
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.models import Model, Sequential
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import gc # python garbage collector
import timeit
import warnings
warnings.filterwarnings("ignore")
import io

In [3]:
print(tf.__version__)

2.7.0


# [2] EDA

In [4]:

sentimentDF = pd.read_csv('https://raw.githubusercontent.com/apomohab/imdb_dataset_csv/main/IMDB_Dataset.csv', delimiter=',')
sentimentDF.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
# check for null values in the dataset (we have no null value !!!)
print(sentimentDF.isnull().any(axis = 0))

review       False
sentiment    False
dtype: bool


In [6]:
# lets see our dataframe shape
sentimentDF.shape

(50000, 2)

In [7]:
# Lets convert the sentiment column values to 0(negative) and 1(positive)
labelReplace = {"sentiment": {"negative": 0, "positive": 1}}
sentimentDF.replace(labelReplace, inplace=True)
sentimentDF.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [8]:
# checking imbalance in label ( Data has no imbalance)
sentimentDF.groupby(sentimentDF["sentiment"]).count()

Unnamed: 0_level_0,review
sentiment,Unnamed: 1_level_1
0,25000
1,25000


In [9]:
# The runtime crashes with the full dataset (too many samples / large batch size), so I will use only a few samples
sentimentDF = sentimentDF.head(3000).copy()  # .copy() gives an independent DataFrame, so adding columns later is unambiguous

# checking to see that there is no imbalance in our sample sizes (sample is not greatly imbalanced)
sentimentDF.groupby(sentimentDF["sentiment"]).count()

Unnamed: 0_level_0,review
sentiment,Unnamed: 1_level_1
0,1492
1,1508
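
Taking the first 3000 rows happens to give a nearly balanced sample here. If an exactly balanced, shuffled subset is preferred, something like the following sketch could be used. It is plain Python rather than pandas, and `balanced_sample`, `per_class` and the toy rows are names made up for illustration, not part of the notebook.

```python
import random
from collections import defaultdict

def balanced_sample(rows, per_class, seed=123):
    """Draw `per_class` random rows for every label.

    `rows` is a list of (review, label) pairs.
    """
    by_label = defaultdict(list)
    for review, label in rows:
        by_label[label].append((review, label))

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sample)  # mix the classes again after sampling per label
    return sample

# Toy data: 6 negative (0) and 4 positive (1) reviews
toy_rows = [(f"review {i}", 0 if i < 6 else 1) for i in range(10)]
subset = balanced_sample(toy_rows, per_class=3)
label_counts = {label: sum(1 for _, l in subset if l == label) for label in (0, 1)}
print(label_counts)
```

On the toy data each label ends up with exactly three rows, regardless of how unbalanced the input was.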


#### Get the length of each review and find the maximum length
Reviews differ in length, so we need to pad our sequences to a common length. Here we measure length in characters, which is an upper bound on the number of tokens in a review.

In [10]:
# Get the length (in characters) of each review
sentimentDF['length'] = sentimentDF['review'].str.len()
sentimentDF.head()

Unnamed: 0,review,sentiment,length
0,One of the other reviewers has mentioned that ...,1,1761
1,A wonderful little production. <br /><br />The...,1,998
2,I thought this was a wonderful way to spend ti...,1,926
3,Basically there's a family where a little boy ...,0,748
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,1317


In [11]:
# Let's see the 3 longest reviews
sentimentDF.nlargest(3, ['length'])

Unnamed: 0,review,sentiment,length
1531,Warning: Does contain spoilers.<br /><br />Ope...,1,8180
556,I thought that ROTJ was clearly the best out o...,1,7134
310,"Some people drift through life, moving from on...",1,5862
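
The `length` column counts characters, while padding later operates on tokens, so the longest review's character count is a very generous padding length. A rough sketch of how a padding length could instead be chosen from token counts is shown below. It assumes simple whitespace tokenization as a stand-in for the Keras `Tokenizer`, and `token_count`, `suggest_maxlen` and `coverage` are hypothetical names.

```python
import math

def token_count(text):
    """Rough token count: whitespace-separated words."""
    return len(text.split())

def suggest_maxlen(texts, coverage=0.95):
    """Smallest token length that fits at least `coverage` of the texts untruncated."""
    lengths = sorted(token_count(t) for t in texts)
    index = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[index]

# Toy reviews with 1 to 10 tokens each
toy_reviews = [" ".join(["word"] * n) for n in range(1, 11)]

print(suggest_maxlen(toy_reviews, coverage=0.5))  # fits half of the reviews
print(suggest_maxlen(toy_reviews, coverage=1.0))  # fits every review
```

A shorter padding length trades a little truncation of the longest reviews for much shorter sequences, which may help with the runtime limits mentioned earlier.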


#### Set various parameters for preprocessing and the model

In [12]:

vocab_size = 10000
maxlen = 8180  ## Character length of the longest review (an upper bound on its token count, so no review is truncated)
embedding_dim = 32
epochs = 50
batch_size = 100
trunc_type='post'
oov_tok = "<OOV>"

#### Apply Keras Tokenizer to the 'review' column of dataset

In [13]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(sentimentDF['review'])
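
To make the roles of `num_words` and `oov_token` more concrete, here is a small plain-Python sketch of a frequency-ranked vocabulary. It is an approximation written for illustration, not the Keras implementation: it assumes index 0 is reserved for padding, the out-of-vocabulary token gets index 1, words are ranked by frequency, and only indices below `num_words` are kept. `fit_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids, most frequent first (id 1 is the OOV token)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace words by ids; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            ids.append(idx if idx is not None and idx < num_words else oov_id)
        sequences.append(ids)
    return sequences

toy_texts = ["good good movie", "bad movie"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts + ["great movie"], toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

In the toy run, `bad` is ranked too low for `num_words=4` and `great` was never seen, so both are mapped to the OOV id.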

#### Defining X and y for the model (i.e. the input sample and label)

In [14]:
X = tokenizer.texts_to_sequences(sentimentDF['review'])
X = pad_sequences(X, maxlen = maxlen, truncating=trunc_type)
y = np.asarray(sentimentDF['sentiment'])

print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])


Number of Samples: 3000
[   0    0    0 ...  117 4859  589]
Number of Labels:  3000
1
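
The leading zeros in the printed sample come from padding: `pad_sequences` pads at the front by default, while `truncating='post'` cuts overly long sequences at the end. The following plain-Python sketch mimics that behavior for a single sequence. `pad_sequence` is a hypothetical helper written for illustration, not the Keras function.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="post", value=0):
    """Pad or truncate one list of token ids to exactly `maxlen` entries."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'post' keeps the beginning, 'pre' keeps the end
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

print(pad_sequence([7, 8, 9], maxlen=5))                                # zeros in front
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4))                       # end is cut off
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4, truncating="pre"))     # start is cut off
```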


#### Decoding a padded sample back to text

In [15]:
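word_index = tokenizer.word_index

# Invert the mapping so integer ids can be turned back into words
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Ids missing from the index (e.g. the padding id 0) are shown as '?'
    return ' '.join(reverse_word_index.get(i, '?') for i in text)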

# print commented out, so as not to litter our notebook
#print(decode_review(X[0]))
#print("  ")
#print(sentimentDF['review'][0])

# [3] Modeling

In [16]:
# Split data

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 123)
#print(X_train.shape,y_train.shape)
#print(X_test.shape,y_test.shape)

In [17]:
# LSTM model building

model1 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=maxlen),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Flatten(),  # no-op here: LSTM(128) already outputs a flat (None, 128) tensor
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model1.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

# Let's print the model summary:
model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 8180, 32)          320000    
                                                                 
 lstm (LSTM)                 (None, 128)               82432     
                                                                 
 flatten (Flatten)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 402,561
Trainable params: 402,561
Non-trainable params: 0
_________________________________________________________________
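
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are made up for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

embedding_count = embedding_params(10000, 32)
lstm_count = lstm_params(32, 128)
dense_count = dense_params(128, 1)

print(embedding_count, lstm_count, dense_count)
print(embedding_count + lstm_count + dense_count)
```

The three values add up to the 402,561 total parameters reported above.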


In [18]:
# The following call plots the architecture of our neural network (commented out here):
from tensorflow.keras.utils import plot_model
#plot_model(model1, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

#### Fit model

In [19]:
# As the model summary above shows, the output layer is a single dense layer with 1 neuron (sigmoid). Let's now train our model:


history1 = model1.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0, validation_split=0.2)


#### Model Evaluation

In [20]:
# Let's now evaluate our model on the test set:


_, train_acc1 = model1.evaluate(X_train, y_train, verbose=0)
_, test_acc1 = model1.evaluate(X_test, y_test, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc1, test_acc1))


Train: 0.958, Test: 0.747


# TASK 3

### 1. Tuning the presented model

#### Early Stopping callback

In [21]:
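# Adding an EarlyStopping callback to stop training once validation accuracy stops improving

#callbacks = EarlyStopping(monitor='val_loss', mode='min', patience=10, verbose=1)

# NOTE: accuracy lies in [0, 1], so an improvement of min_delta=1 can never be reached;
# with patience=15 this setting always stops training at epoch 16.
# A small value such as min_delta=0.001 would be needed to track real improvements.
callbacks = EarlyStopping(monitor='val_accuracy', mode='max', min_delta=1, patience=15, verbose=1)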
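
The patience rule behind `EarlyStopping` can be illustrated with a small simulation. The sketch below is a simplified plain-Python re-implementation written for this note, not Keras code; `stopping_epoch` and the toy accuracy curve are hypothetical. It suggests why `min_delta=1` on an accuracy in [0, 1] behaves like a fixed cap of `patience + 1` epochs, whereas a small `min_delta` lets a steadily improving run continue.

```python
def stopping_epoch(val_accuracies, patience, min_delta):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy - min_delta > best:  # counts as an improvement
            best = accuracy
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Toy curve: validation accuracy rises by 0.005 per epoch for 50 epochs
toy_curve = [0.5 + 0.005 * i for i in range(50)]

capped = stopping_epoch(toy_curve, patience=15, min_delta=1)
tracked = stopping_epoch(toy_curve, patience=15, min_delta=0.001)
print(capped, tracked)
```

Even though the toy curve improves every epoch, the run with `min_delta=1` stops at epoch 16, while the run with `min_delta=0.001` never triggers the stop.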



In [22]:
# Adding dropout should help the model, as it works against overfitting


model2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=maxlen),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model2.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 8180, 32)          320000    
                                                                 
 lstm_1 (LSTM)               (None, 8180, 128)         82432     
                                                                 
 dense_1 (Dense)             (None, 8180, 128)         16512     
                                                                 
 dropout (Dropout)           (None, 8180, 128)         0         
                                                                 
 flatten_1 (Flatten)         (None, 1047040)           0         
                                                                 
 dense_2 (Dense)             (None, 1)                 1047041   
                                                                 
Total params: 1,465,985
Trainable params: 1,465,985
Non-trainable params: 0
_________________________________________________________________
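
With `return_sequences=True` the LSTM emits one 128-dimensional vector per time step, so `Flatten` produces 8180 × 128 features and the final dense layer dominates the parameter count. The arithmetic below reproduces the summary values and, purely as a hypothetical comparison, shows how small the output layer would be if the time dimension were pooled first (for example with the `GlobalMaxPool1D` layer that is imported but not used in this notebook). That pooled variant is an assumption for illustration, not a model that was trained here.

```python
def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

maxlen = 8180      # time steps after padding
lstm_units = 128   # features per time step

per_step_dense = dense_params(lstm_units, 128)       # Dense(128) applied at every time step
flattened_features = maxlen * lstm_units             # size of the Flatten output
flatten_head = dense_params(flattened_features, 1)   # final Dense(1) after Flatten
pooled_head = dense_params(128, 1)                   # hypothetical Dense(1) after pooling over time

print(per_step_dense, flattened_features, flatten_head, pooled_head)
```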

In [23]:
history2 = model2.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0, validation_split=0.2, callbacks=[callbacks])

Epoch 00016: early stopping


In [24]:
# Let's also now evaluate our tuned model on the test set:


_, train_acc2 = model2.evaluate(X_train, y_train, verbose=0)
_, test_acc2 = model2.evaluate(X_test, y_test, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc2, test_acc2))


Train: 0.958, Test: 0.820


### 2. Trying Another RNN Model (Bi-directional LSTM)

In [25]:
# Let's see what this model can do when we change nothing in the architecture except swapping the LSTM for a Bidirectional LSTM

model3 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=maxlen),
    tf.keras.layers.Bidirectional(LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model3.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model3.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 8180, 32)          320000    
                                                                 
 bidirectional (Bidirectiona  (None, 8180, 256)        164864    
 l)                                                              
                                                                 
 dense_3 (Dense)             (None, 8180, 128)         32896     
                                                                 
 dropout_1 (Dropout)         (None, 8180, 128)         0         
                                                                 
 flatten_2 (Flatten)         (None, 1047040)           0         
                                                                 
 dense_4 (Dense)             (None, 1)                 1047041   
                                                      

In [26]:
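# NOTE: unlike model2, validation here uses the test set instead of validation_split=0.2,
# so the model trains on all of X_train and the callback monitors the data used for the final evaluation
model3.fit(X_train, y_train, validation_data=(X_test,y_test), epochs = epochs, batch_size=batch_size, verbose = 0, callbacks=[callbacks])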

Epoch 00016: early stopping


<keras.callbacks.History at 0x7f221570fe10>

In [27]:

_, train_acc3 = model3.evaluate(X_train, y_train, verbose=0)
_, test_acc3 = model3.evaluate(X_test, y_test, verbose=0)
#print('Train: %.3f, Test: %.3f' % (train_acc3, test_acc3))

# save the model to disk
filename = 'bestmodel.h5'
tf.keras.models.save_model(model3, filename)

# [4] Results

In [28]:
print('LSTM Model Result')
print('Train: %.3f, Test: %.3f' % (train_acc1, test_acc1))
print('')

print('Tuned LSTM Model Result')
print('Train: %.3f, Test: %.3f' % (train_acc2, test_acc2))
print('')

print('Bi-directional LSTM Model Result')
print('Train: %.3f, Test: %.3f' % (train_acc3, test_acc3))


LSTM Model Result
Train: 0.958, Test: 0.747

Tuned LSTM Model Result
Train: 0.958, Test: 0.820

Bi-directional LSTM Model Result
Train: 1.000, Test: 0.832


# [5] Conclusions
1. For fast training, the dataset was reduced to its first 3000 samples.
2. The first model (LSTM) reached a test set accuracy of 74.7 % after 50 epochs.
3. The tuned model improved to 82.0 % after just 16 epochs (an improvement of 7.3 percentage points). The changes were an LSTM that returns the full sequence, an extra dense layer, dropout of 0.5 (which regularizes the model and counters overfitting) and early stopping, which ended training at epoch 16.
4. Using another RNN model (Bi-directional LSTM): performance got better than the tuned model, at 83.2 %, an improvement of 1.2 percentage points. A likely reason is that the bi-directional model reads each review in both the forward and the backward direction, which gives it more context for its prediction. Its training accuracy of 1.000 shows that it still overfits.
5. The projector diagram (https://github.com/packetech/Arcada/blob/main/projector_image.PNG) shows how the Bi-directional LSTM model groups words with several of their nearest neighbors.
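
The improvements quoted above are differences in accuracy, so they are best read as percentage points. The short sketch below recomputes them from the reported test accuracies; `percentage_points` and `relative_gain` are hypothetical helper names.

```python
def percentage_points(before, after):
    """Absolute change between two accuracies, in percentage points."""
    return round((after - before) * 100, 1)

def relative_gain(before, after):
    """Change relative to the starting accuracy, in percent."""
    return round((after - before) / before * 100, 1)

lstm_test, tuned_test, bilstm_test = 0.747, 0.820, 0.832

print(percentage_points(lstm_test, tuned_test))    # LSTM -> tuned LSTM
print(percentage_points(tuned_test, bilstm_test))  # tuned LSTM -> Bi-directional LSTM
print(relative_gain(lstm_test, tuned_test))        # same step, relative to 74.7 %
```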