<a href="https://colab.research.google.com/github/peenalGupta/Data-Analytics-3-Labs/blob/main/10_Sentiment_Analysis_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup the Environment

In [1]:
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, SimpleRNN, Activation, Dropout, Conv1D
from tensorflow.keras.layers import Embedding, Flatten, LSTM, GRU
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

import pandas as pd
import numpy as np
import spacy
from sklearn.metrics import classification_report

In [2]:
# Fix Colab bug: https://github.com/googlecolab/colabtools/issues/3409
import locale
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

## Exploratory Data Analysis

In [3]:
data = pd.read_csv("https://storage.googleapis.com/adsa-data/sentiment-analysis/tweeter.csv", header=None, encoding='latin-1')
data.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [4]:
# Check for missing values
data.isnull().any()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
5,False


## Preparing Data

We only need the tweet text and the tweet sentiment, which are stored in column 5 and column 0 of the dataset, respectively. The sentiment labels are one-hot encoded below, so that index 0 represents negative and index 1 represents positive.

We organize the data so that data_X contains all the tweet text and data_y contains the labels.

The tweet text in data_X is later converted into integer sequences that can be fed into the RNNs.

In [5]:
data_X = data[5]
print(data_X)

0        @switchfoot http://twitpic.com/2y1zl - Awww, t...
1        is upset that he can't update his Facebook by ...
2        @Kenichan I dived many times for the ball. Man...
3          my whole body feels itchy and like its on fire 
4        @nationwideclass no, it's not behaving at all....
                               ...                        
19995    Just woke up. Having no school is the best fee...
19996    TheWDB.com - Very cool to hear old Walt interv...
19997    Are you ready for your MoJo Makeover? Ask me f...
19998    Happy 38th Birthday to my boo of alll time!!! ...
19999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: 5, Length: 20000, dtype: object


#### Label:
*   0 -> NEGATIVE
*   2 -> NEUTRAL
*   4 -> POSITIVE
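
As a minimal sketch of what one-hot encoding does to such labels, the snippet below encodes a few toy labels in plain Python. The `one_hot` helper and the toy label list are illustrative assumptions, not part of the notebook. Like `pd.get_dummies`, it orders the columns by the sorted unique label values, so with only two classes present the first column is the negative class and the second the positive class.

```python
def one_hot(labels):
    """One-hot encode labels; columns follow the sorted unique label values.

    Hypothetical helper for illustration only.
    """
    classes = sorted(set(labels))
    rows = [[label == cls for cls in classes] for label in labels]
    return classes, rows


raw_labels = [0, 0, 4, 4, 0]  # toy labels, not taken from the dataset
classes, encoded = one_hot(raw_labels)

print("Column order:", classes)
for label, row in zip(raw_labels, encoded):
    print(label, "->", row)
```

This is also why the evaluation code can recover class indices with `argmax(axis=1)`.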

In [6]:
data_y = pd.get_dummies(data[0]).to_numpy()
print(data_y)

[[ True False]
 [ True False]
 [ True False]
 ...
 [False  True]
 [False  True]
 [False  True]]


## Splitting Data for Training

In [11]:
# TODO: Split data into train and valid sets
from sklearn.model_selection import train_test_split

# Split the dataset into training (80%) and validation (20%) sets
train_X, valid_X, train_y, valid_y = train_test_split(data_X, data_y, test_size=0.2, random_state=42)

# print("Train data size:", len(train_X))
# print("Validation data size:", len(valid_X))
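
The arithmetic behind this split can be checked by hand. The sketch below assumes the 20000 rows shown for data_X, the batch size of 120 used by the `fit` calls in this notebook, and the default `predict` batch size of 32 in Keras. The resulting step counts match the `134/134` and `125/125` progress counters in the training logs.

```python
import math

n_samples = 20000         # length of data_X printed above
test_size = 0.2           # fraction passed to train_test_split
batch_size = 120          # batch size used by fit() in this notebook
predict_batch_size = 32   # default batch size of predict() in Keras

n_valid = round(n_samples * test_size)
n_train = n_samples - n_valid

# The last batch may be smaller than the others, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
predict_steps = math.ceil(n_valid / predict_batch_size)

print(f"train: {n_train}, valid: {n_valid}")
print(f"training steps per epoch: {steps_per_epoch}")
print(f"prediction steps on the validation set: {predict_steps}")
```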

## Tokenization

In [9]:
MAX_VOCAB = 18000
MAX_LEN = 150
EMBED_SIZE = 200

In [12]:
# TODO: Tokenize inputs
tokenizer = Tokenizer(num_words=MAX_VOCAB)
tokenizer.fit_on_texts(train_X)

train_X = tokenizer.texts_to_sequences(train_X)
valid_X = tokenizer.texts_to_sequences(valid_X)

word_index = tokenizer.word_index

In [13]:
# TODO: Text padding
train_X = pad_sequences(train_X, maxlen=MAX_LEN)
valid_X = pad_sequences(valid_X, maxlen=MAX_LEN)

In [14]:
train_X

array([[   0,    0,    0, ...,  687, 2036,  337],
       [   0,    0,    0, ...,  780,  130,   36],
       [   0,    0,    0, ...,  688,  108,   96],
       ...,
       [   0,    0,    0, ..., 1190,  424,    9],
       [   0,    0,    0, ...,    2,  105,  257],
       [   0,    0,    0, ...,    2,  437,   14]], dtype=int32)

## Preparing Word Embeddings using the GloVe Model

In [26]:
import locale
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [22]:
!pip install gensim




In [27]:
import gensim.downloader as api

# Load the Twitter GloVe embeddings. The model was trained on 2 billion tweets
# (27 billion tokens) and has a vocabulary of 1.2 million words.
# This might take a while.
glove_model = api.load("glove-twitter-200")



In [28]:
# Calculate the number of words (plus one, because index 0 is reserved for padding)
nb_words = len(word_index) + 1
print('All words: ', nb_words)

# obtain the word embedding matrix
embedding_matrix = np.zeros((nb_words, EMBED_SIZE))
for word, i in word_index.items():
    if word in glove_model:
        embedding_matrix[i] = glove_model[word]

print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

All words:  26001
Null word embeddings: 10327
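
To put these two numbers in perspective, here is a rough coverage calculation. It assumes that row 0 (the padding index) accounts for exactly one of the null rows and that no real GloVe vector sums to exactly zero; both are assumptions, not something the notebook checks.

```python
all_words = 26001   # "All words" printed above, i.e. len(word_index) + 1
null_rows = 10327   # "Null word embeddings" printed above

# Row 0 is the padding index and is always zero, so exclude it from both counts
vocab_size = all_words - 1
missing_words = null_rows - 1

covered = vocab_size - missing_words
coverage = covered / vocab_size

print(f"{covered} of {vocab_size} words have a GloVe vector ({coverage:.1%})")
```

Under these assumptions, roughly 60% of the training vocabulary has a pre-trained vector; the remaining words are represented by all-zero rows.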


**Explanation of the steps performed till now**

Example tweet: "Is upset that he can't update his Facebook.."

Expected input to the RNN model: one 200-dimensional embedding vector per token (token ID in parentheses):

*   Is - Embeddings [200] (32)
*   upset - Embeddings [200] (450)
*   that - Embeddings [200] (43)
*   he - Embeddings [200] (56)

Steps:

1. Vocabulary of all tweets: 30257 unique tokens
2. Unique token IDs: each token gets an integer ID (1, 2, 3, 4, ... for all 30257 tokens)
3. Tweets are represented as sequences of IDs: [32 450 43 56 ...]
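
The three steps above can be sketched on a toy corpus in plain Python. This is a simplified illustration, not the Keras `Tokenizer`: it only splits on whitespace, and the tweets, the 3-dimensional embedding size, and the embedding values are made up for the example.

```python
from collections import Counter

tweets = ["is upset that he is late", "he is happy"]  # toy corpus

# Steps 1 and 2: build the vocabulary and give each token an ID.
# More frequent tokens get smaller IDs; 0 is reserved for padding.
counts = Counter(word for tweet in tweets for word in tweet.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Step 3: represent each tweet as a sequence of token IDs
sequences = [[word_index[word] for word in tweet.split()] for tweet in tweets]

# Embedding lookup: each ID selects one row of the embedding matrix
EMBED_SIZE = 3  # toy size; the notebook uses 200
embedding_matrix = [[0.0] * EMBED_SIZE]  # row 0: padding
embedding_matrix += [[float(i)] * EMBED_SIZE for i in range(1, len(word_index) + 1)]
embedded = [embedding_matrix[token_id] for token_id in sequences[1]]

print(word_index)
print(sequences)
print(embedded)
```

The RNN then consumes `embedded` one row (one token) at a time.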

Padding:
"Commonly in RNN's, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0's to the RNN before taking the final output (i.e. 'post' padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after this word.
So intuitively, this might be why pre-padding is more popular/effective." - [link](https://stackoverflow.com/questions/46298793/how-does-choosing-between-pre-and-post-zero-padding-of-sequences-impact-results)

Padding for RNNs - [Link](https://datascience.stackexchange.com/questions/49168/padding-sequences-for-neural-sequence-models-rnns)

[Paper](https://arxiv.org/abs/1903.07288)
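
To make the difference concrete, here is a small sketch of pre- and post-padding in plain Python. The `pad` helper is a hypothetical, simplified stand-in for `pad_sequences`, whose defaults are `padding='pre'` and `truncating='pre'`; it assumes `maxlen` is positive.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad or truncate one sequence to maxlen (simplified, hypothetical helper)."""
    trimmed = sequence[-maxlen:]  # keep the last maxlen tokens, like truncating='pre'
    fill = [value] * (maxlen - len(trimmed))
    return fill + trimmed if padding == "pre" else trimmed + fill


tokens = [32, 450, 43, 56]

print(pad(tokens, 6, "pre"))   # [0, 0, 32, 450, 43, 56]
print(pad(tokens, 6, "post"))  # [32, 450, 43, 56, 0, 0]
print(pad(tokens, 2, "pre"))   # [43, 56]
```

With pre-padding the real tokens sit at the end of the sequence, so the final hidden state is computed right after the last word of the tweet.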





## Training and Evaluation


Train and evaluate the SimpleRNN, LSTM, and GRU networks on our prepared dataset.

We use the pre-trained word embeddings from the glove.twitter.27B.200d.txt data (loaded above as glove-twitter-200) as frozen weights for the Embedding layer. Pre-trained embeddings typically lead to better results and faster convergence than embeddings that have to be learned from a dataset of this size.

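Each model is set to run for up to 20 epochs, with an EarlyStopping rule that halts training once the validation accuracy has not improved for 3 consecutive epochs, which helps prevent overfitting. The results of the SimpleRNN, LSTM, and GRU models can be seen below.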
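
The stopping rule can be simulated without Keras. The sketch below is a simplified re-implementation of a patience-based rule for a monitor in `max` mode, not the Keras source. The `early_stopping_epoch` helper is hypothetical, and the validation accuracies are the six values from the SimpleRNN training log in this notebook.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a max-mode monitor."""
    best = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training runs through all epochs
    return len(val_accuracies), best_epoch


val_acc = [0.7075, 0.7535, 0.7615, 0.6672, 0.7268, 0.7558]
stop_epoch, best_epoch = early_stopping_epoch(val_acc, patience=3)
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

Note that the best epoch and the last epoch differ here. Keras keeps the weights of the last epoch unless `restore_best_weights=True` is passed to `EarlyStopping`, which may be worth trying.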

In [29]:
model_rnn = Sequential()
# Embedding layer initialised with the GloVe vectors and kept frozen
model_rnn.add(Embedding(nb_words, EMBED_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=False))

# TODO: Add a SimpleRNN layer
model_rnn.add(SimpleRNN(64))
# Softmax output over the two classes (index 0: negative, index 1: positive)
model_rnn.add(Dense(2, activation='softmax'))
model_rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Stop once val_accuracy has not improved for 3 consecutive epochs
model_rnn.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])

# Classification report on the validation set
predictions = model_rnn.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))



Epoch 1/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 8s 36ms/step - accuracy: 0.6141 - loss: 0.6586 - val_accuracy: 0.7075 - val_loss: 0.5621
Epoch 2/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.7251 - loss: 0.5414 - val_accuracy: 0.7535 - val_loss: 0.5158
Epoch 3/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 5s 21ms/step - accuracy: 0.7541 - loss: 0.5037 - val_accuracy: 0.7615 - val_loss: 0.5054
Epoch 4/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 5s 21ms/step - accuracy: 0.7691 - loss: 0.4820 - val_accuracy: 0.6672 - val_loss: 0.6528
Epoch 5/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.7493 - loss: 0.5118 - val_accuracy: 0.7268 - val_loss: 0.5417
Epoch 6/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.7804 - loss: 0.4647 - val_accuracy: 0.7558 - val_loss: 0.5117
125/125 ━

## LSTM and GRUs

In [30]:
# TODO: Train an LSTM model by replacing the SimpleRNN layer with an LSTM layer
model_lstm = Sequential()
model_lstm.add(Embedding(nb_words, EMBED_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=False))

# Add an LSTM layer with 64 units
model_lstm.add(LSTM(64))

model_lstm.add(Dense(2, activation='softmax'))
model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model_lstm.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])

# TODO: Print a classification report for the model
predictions = model_lstm.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))

Epoch 1/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - accuracy: 0.6737 - loss: 0.5950 - val_accuracy: 0.7602 - val_loss: 0.5004
Epoch 2/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.7607 - loss: 0.4931 - val_accuracy: 0.7692 - val_loss: 0.4828
Epoch 3/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.7732 - loss: 0.4624 - val_accuracy: 0.7722 - val_loss: 0.4761
Epoch 4/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.7854 - loss: 0.4512 - val_accuracy: 0.7533 - val_loss: 0.4983
Epoch 5/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.8026 - loss: 0.4262 - val_accuracy: 0.7772 - val_loss: 0.4742
Epoch 6/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.8172 - loss: 0.4043 - val_accuracy: 0.7695 - val_loss: 0.4874
Epoch 7/20
134/134 ━━

In [31]:
# TODO: Train a GRU model by replacing the SimpleRNN layer with a GRU layer
model_gru = Sequential()
model_gru.add(Embedding(nb_words, EMBED_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=False))

# Add a GRU layer with 64 units
model_gru.add(GRU(64))

model_gru.add(Dense(2, activation='softmax'))
model_gru.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model_gru.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])

# TODO: Print a classification report for the model
predictions = model_gru.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))

Epoch 1/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.6340 - loss: 0.6220 - val_accuracy: 0.7602 - val_loss: 0.4956
Epoch 2/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.7640 - loss: 0.4868 - val_accuracy: 0.7692 - val_loss: 0.4799
Epoch 3/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.7837 - loss: 0.4636 - val_accuracy: 0.7540 - val_loss: 0.5099
Epoch 4/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.7886 - loss: 0.4462 - val_accuracy: 0.7753 - val_loss: 0.4737
Epoch 5/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.8005 - loss: 0.4285 - val_accuracy: 0.7797 - val_loss: 0.4751
Epoch 6/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.8082 - loss: 0.4111 - val_accuracy: 0.7750 - val_loss: 0.4727
Epoch 7/20
134/134 ━━━━━━

## Evaluation

In [32]:
import time

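def predict(model, text):
    """Return the class probabilities and the elapsed time for a single text."""
    start_at = time.time()
    # Tokenize the text and pad it to the length used during training
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_LEN)
    # Predict; the output has one row per input text, so take the first row
    score = model.predict(x_test)[0]

    return {"NEGATIVE": score[0], "POSITIVE": score[1],
            "elapsed_time": time.time() - start_at}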

In [33]:
# TODO: Try a few sentences to check the models
predict(model_lstm, "I feel not so good today")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 141ms/step

{'NEGATIVE': 0.9528242,
 'POSITIVE': 0.04717582,
 'elapsed_time': 0.19471144676208496}
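
If a single label is more convenient than two probabilities, the result dictionary can be reduced to one. The `decode_sentiment` helper and its 0.5 threshold below are illustrative assumptions, not part of the notebook; the input reuses the two LSTM scores printed above.

```python
def decode_sentiment(result, threshold=0.5):
    """Map a predict()-style result dict to a label (hypothetical helper)."""
    return "POSITIVE" if result["POSITIVE"] >= threshold else "NEGATIVE"


# Scores copied from the LSTM output above
lstm_result = {"NEGATIVE": 0.9528242, "POSITIVE": 0.04717582}
label = decode_sentiment(lstm_result)
print(label)
```

Raising the threshold makes the helper more conservative about calling a tweet positive.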

In [34]:
sentences = [
    "I feel not so good today",
    "This movie is absolutely fantastic!",
    "I'm having a terrible day",
    "The food was delicious and the service was excellent",
    "I'm feeling really happy and excited"
]

for sentence in sentences:
    print(f"Sentence: {sentence}")
    print("SimpleRNN Prediction:", predict(model_rnn, sentence))
    print("LSTM Prediction:", predict(model_lstm, sentence))
    print("GRU Prediction:", predict(model_gru, sentence))
    print("-" * 20)

Sentence: I feel not so good today
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 327ms/step
SimpleRNN Prediction: {'NEGATIVE': 0.71729815, 'POSITIVE': 0.28270188, 'elapsed_time': 0.37378668785095215}
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
LSTM Prediction: {'NEGATIVE': 0.9528242, 'POSITIVE': 0.04717582, 'elapsed_time': 0.05945849418640137}
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 91ms/step
GRU Prediction: {'NEGATIVE': 0.82278013, 'POSITIVE': 0.17721982, 'elapsed_time': 0.13068175315856934}
--------------------
Sentence: This movie is absolutely fantastic!
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
SimpleRNN Prediction: {'NEGATIVE': 0.34722605, 'POSITIVE': 0.6527739, 'elapsed_time': 0.05412936210632324}
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
LSTM Prediction: {'NEGATIVE': 0.13861674, 'POSITIVE': 0.8613832, 'elapsed_time': 0.05575966835021973}
1/1

## Pre-trained Word Embeddings

Train the RNN with an Embedding layer that does not use the pre-trained weights, and compare the results with the pre-trained model above.


In [35]:
model_rnn = Sequential()
# No pre-trained weights: the embeddings are randomly initialised. Because
# trainable=False, they also stay frozen at those random values; set
# trainable=True to learn them from the tweets instead.
model_rnn.add(Embedding(nb_words, EMBED_SIZE, input_length=MAX_LEN, trainable=False))

# TODO: Add a SimpleRNN layer
model_rnn.add(SimpleRNN(64))
model_rnn.add(Dense(2, activation='softmax'))
model_rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model_rnn.fit(train_X, train_y, epochs=20, batch_size=120,
          validation_data=(valid_X, valid_y),
          callbacks=[EarlyStopping(monitor='val_accuracy', mode='max', patience=3)])

predictions = model_rnn.predict(valid_X)
predictions = predictions.argmax(axis=1)
print(classification_report(valid_y.argmax(axis=1), predictions))

Epoch 1/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 7s 37ms/step - accuracy: 0.5156 - loss: 0.6970 - val_accuracy: 0.5412 - val_loss: 0.6873
Epoch 2/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 7s 23ms/step - accuracy: 0.5824 - loss: 0.6742 - val_accuracy: 0.6008 - val_loss: 0.6596
Epoch 3/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 5s 22ms/step - accuracy: 0.6224 - loss: 0.6469 - val_accuracy: 0.6033 - val_loss: 0.6632
Epoch 4/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 5s 21ms/step - accuracy: 0.6480 - loss: 0.6258 - val_accuracy: 0.6313 - val_loss: 0.6386
Epoch 5/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.6635 - loss: 0.6116 - val_accuracy: 0.6263 - val_loss: 0.6499
Epoch 6/20
134/134 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.6780 - loss: 0.5990 - val_accuracy: 0.6275 - val_loss: 0.6480
Epoch 7/20
134/134 ━