**Sentiment Analysis on Text using LSTM**

**Paper Reference:**
Sentiment analysis using deep learning approach: https://www.techscience.com/jai/v2n1/39512/pdf

***-> Use GPU runtime before running the code***

**Downloading the IMDB dataset from a Google Drive folder**

In [None]:
!gdown --id 10uJu7ap6dponwrdViRUn5Yl86GY9ZHD4

Downloading...
From: https://drive.google.com/uc?id=10uJu7ap6dponwrdViRUn5Yl86GY9ZHD4
To: /content/IMDB Dataset.csv
100% 66.2M/66.2M [00:00<00:00, 140MB/s] 


In [None]:
import numpy as np
import pandas as pd

**Importing the layers, model classes, and utilities required for this code**

In [None]:
from tensorflow.keras.layers import Dense, LSTM,Embedding, SpatialDropout1D
# from keras.utils.np_utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer #vectorizes a text corpus by turning each text into a sequence of integers
from tensorflow.keras.preprocessing.sequence import pad_sequences #pads or truncates sequences to a common length
from nltk.corpus import stopwords
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

In [None]:
import re   #for regular expression
from sklearn.model_selection import train_test_split
from collections import Counter

**Downloading stopwords from nltk library**


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Cell for GPU connection**

In [None]:
import tensorflow as tf
isGPU=0
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print('GPU device not found')
else:
  print('Found GPU at: {}'.format(device_name))
  isGPU=1

GPU device not found


**Reading the IMDB dataset**

In [None]:
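# Log the device each op is placed on (verbose: prints an "Executing op ..." line per op)
tf.debugging.set_log_device_placement(True)
# pandas reads the CSV on the CPU either way; the tf.device context does not affect read_csv
if isGPU:
  with tf.device('/GPU:0'):
    data=pd.read_csv('/content/IMDB Dataset.csv')
else:
  data=pd.read_csv('/content/IMDB Dataset.csv')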

**What the dataset looks like**

In [None]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**Checking the shape of the dataset i.e. no. of rows and no. of columns**



In [None]:
data.shape

(50000, 2)

**The dataset has the following columns**

In [None]:
data.columns

Index(['review', 'sentiment'], dtype='object')

**Preprocessing and cleaning the data**

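**The next step is to build a set of English stop words from NLTK's stop-word list. Stop words are common words that carry little information for the task, for example 'the', 'of' and 'for'. Note that NLTK's list also contains negations such as 'not', so these are removed as well.**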

In [None]:
english_stops = set(stopwords.words('english'))

**Now we'll split the columns and save them separately**

In [None]:
review = data['review']
sentiment = data['sentiment'] 

In [None]:
print(review.head())
print(sentiment.head())

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object
0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object


In [None]:
review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

**The next step is to remove markup and special characters from the reviews. First, HTML tags are deleted (replaced with an empty string).**

In [None]:
review = review.replace({'<.*?>': ''}, regex = True) 

In [None]:
review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

**Replacing every character that is not a letter with a space.**

In [None]:
review = review.replace({'[^A-Za-z]': ' '}, regex = True)
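
As a minimal sketch of the two cleaning steps above, the same patterns can be applied to a single string with Python's `re` module. The helper name `clean_review` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(text):
    """Delete HTML-like tags, then replace every non-letter with a space."""
    text = re.sub(r'<.*?>', '', text)        # non-greedy, so each tag is removed on its own
    text = re.sub(r'[^A-Za-z]', ' ', text)   # digits and punctuation become spaces
    return text

sample = "Great film!<br /><br />10/10"
cleaned = clean_review(sample)               # 'Great film' followed by six spaces
tokens = cleaned.split()                     # ['Great', 'film']

# Tags are deleted rather than replaced by a space, so words that are
# separated only by a tag end up joined together.
joined = clean_review("end<br />Start")      # 'endStart'
print(tokens, joined)
```

Replacing tags with a space instead of an empty string would avoid the joined-word case; that is a possible variation, not what the notebook does.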

**First review after removing non-alpha characters**

In [None]:
review[0]

'One of the other reviewers has mentioned that after watching just   Oz episode you ll be hooked  They are right  as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence  which set in right from the word GO  Trust me  this is not a show for the faint hearted or timid  This show pulls no punches with regards to drugs  sex or violence  Its is hardcore  in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary  It focuses mainly on Emerald City  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  Em City is home to many  Aryans  Muslims  gangstas  Latinos  Christians  Italians  Irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away I would say the main appeal of the show is due to the fact that it goes where other shows wo

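**A helper function that removes stop words and converts the remaining words to lowercase**

In [None]:
def remove_stop_word_and_lower(sent):
  # Note: not called later in this notebook; the same steps are applied inline.
  # Filtering happens before lowercasing, so capitalized stop words ("The", "I") are kept.
  tokens = sent.split()
  tokens = [w for w in tokens if w not in english_stops]
  tokens = [w.lower() for w in tokens]
  return " ".join(tokens)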


**Number of total reviews**

In [None]:
len(review.values)

50000

**Taking review as input data and sentiment as label for that review**

In [None]:
x_data=review
y_data=sentiment

In [None]:
x_data

0        One of the other reviewers has mentioned that ...
1        A wonderful little production  The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there s a family where a little boy ...
4        Petter Mattei s  Love in the Time of Money  is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot  bad dialogue  bad acting  idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I m going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

**Splitting each review into a list of words, removing stop words, and converting the remaining words to lowercase**

In [None]:
x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])
x_data = x_data.apply(lambda review: [w.lower() for w in review]) 
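
The order of these two steps matters. NLTK's English stop-word list is lowercase, so filtering before lowercasing lets capitalized stop words such as "The" or "I" through, which is why tokens like `the` and `i` still appear in the printed reviews further down. The sketch below illustrates the effect with a tiny made-up stop list (`STOPS` is an assumption for illustration, not NLTK's list); lowercasing first is shown only as a possible alternative.

```python
STOPS = {"the", "a", "i", "of", "was"}   # tiny illustrative stop list, not NLTK's

def filter_then_lower(text):
    """Order used in the notebook: drop stop words, then lowercase."""
    return [w.lower() for w in text.split() if w not in STOPS]

def lower_then_filter(text):
    """Alternative order: lowercase first, then drop stop words."""
    return [w for w in text.lower().split() if w not in STOPS]

text = "The plot of the film was thin"
as_in_notebook = filter_then_lower(text)   # ['the', 'plot', 'film', 'thin']
alternative = lower_then_filter(text)      # ['plot', 'film', 'thin']
print(as_in_notebook, alternative)
```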

In [None]:
review.values

array(['One of the other reviewers has mentioned that after watching just   Oz episode you ll be hooked  They are right  as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence  which set in right from the word GO  Trust me  this is not a show for the faint hearted or timid  This show pulls no punches with regards to drugs  sex or violence  Its is hardcore  in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary  It focuses mainly on Emerald City  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  Em City is home to many  Aryans  Muslims  gangstas  Latinos  Christians  Italians  Irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away I would say the main appeal of the show is due to the fact that it goes where other s

**Positive review as 1 and negative review as 0**

In [None]:
y_data = y_data.replace('positive', 1)
y_data = y_data.replace('negative', 0)

In [None]:
y_data

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64

In [None]:
type(x_data)

pandas.core.series.Series

In [None]:
print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


**Splitting the data into training and test sets in an 80:20 ratio**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

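# The first two prints show the inputs (x_train, x_test);
# the last two show the labels (y_train, y_test).
print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)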

Train Set
15297    [written, wind, irresistible, wonderfully, kin...
19407    [it, got, action, fantasy, mixed, together, wa...
29818    [visconti, masterpiece, i, admit, i, unfamilia...
2361     [this, movie, proof, go, redbox, read, descrip...
9685     [over, acted, heavy, handed, full, speeches, p...
                               ...                        
39460    [i, would, liked, write, story, i, would, like...
8247     [this, really, bad, waste, time, i, would, pro...
24467    [very, literate, intelligent, drama, group, in...
32326    [my, wife, i, saw, theater, first, came, there...
25630    [if, i, look, hard, enough, flaws, found, film...
Name: review, Length: 40000, dtype: object 

14537    [robert, duvall, direct, descendent, confedera...
20100    [if, real, story, early, baroque, painter, art...
9378     [while, i, never, fan, original, scooby, doo, ...
46634    [this, movie, surprised, the, box, misleading,...
32114    [i, always, found, betsy, drake, rather, creep...
 

In [None]:
print(len(x_train))
print(len(x_test))

40000
10000
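
`train_test_split` shuffles the rows before splitting, and since no `random_state` is passed the split differs from run to run. A rough pure-Python sketch of a seeded 80:20 split follows; `split_80_20` and its `seed` argument are hypothetical names, and the shuffling is a simplification rather than scikit-learn's actual implementation.

```python
import random

def split_80_20(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then carve off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Inputs and labels are picked with the same indices, so pairs stay aligned.
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

reviews = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
x_tr, x_te, y_tr, y_te = split_80_20(reviews, labels)
print(len(x_tr), len(x_te))   # 8 2
```

In the notebook itself, passing an integer `random_state` to `train_test_split` would make the split reproducible.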


**Finding the length of each review (the number of tokens in its list) and using the mean length, rounded up, as the padding length**

In [None]:
review_length = []
for sent in x_train:
  review_length.append(len(sent))
max_length =int(np.ceil(np.mean(review_length)))

In [None]:
max_length

130
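
Because the padding length is the mean rather than the maximum, every review longer than the mean loses its tail when it is truncated. A small sketch with made-up token counts (the `lengths` list is purely illustrative) shows how to check how many reviews are affected:

```python
import math

lengths = [3, 5, 8, 12, 40]          # hypothetical token counts per review

mean_len = math.ceil(sum(lengths) / len(lengths))    # ceil(13.6) = 14
max_len = max(lengths)                                # 40

truncated = sum(1 for n in lengths if n > mean_len)   # reviews that lose tokens
padded = sum(1 for n in lengths if n < mean_len)      # reviews that gain zeros

print(mean_len, max_len, truncated, padded)           # 14 40 1 4
```

Using the mean keeps the input short and training fast, at the cost of discarding the end of long reviews.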

**The Tokenizer maps each word to an integer index, so every list of words becomes a list of integers that the model can process.**

**x_train and x_test are converted to integer sequences using the texts_to_sequences method**

In [None]:
token = Tokenizer(lower=False)
token.fit_on_texts(x_train)

In [None]:
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)
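
To make the mapping concrete, here is a simplified pure-Python sketch of what a frequency-ranked word index does. The function names `fit_word_index` and `to_sequences` are hypothetical, and the sketch ignores Tokenizer options such as `num_words` and `filters`. Since the notebook sets no `oov_token`, words unseen during fitting are simply dropped, which the sketch mirrors.

```python
from collections import Counter

def fit_word_index(token_lists):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in tokens if w in word_index]
            for tokens in token_lists]

train = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
word_index = fit_word_index(train)
vocab_size = len(word_index) + 1     # +1 because index 0 is reserved for padding

encoded = to_sequences([["good", "unseen", "plot"]], word_index)
print(word_index)   # {'good': 1, 'movie': 2, 'bad': 3, 'plot': 4}
print(encoded)      # [[1, 4]]
```

The index is fitted on the training reviews only, so the test set cannot leak vocabulary into the model.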

**Reviews differ in length, so each sequence is padded with zeros or truncated to the same length (here, the mean review length) using pad_sequences**

In [None]:
x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')
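
The effect of `padding='post'` and `truncating='post'` can be sketched on plain Python lists. The helper `pad_post` below is a hypothetical stand-in for illustration, not the Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first maxlen items, then fill the remainder with `value`."""
    kept = seq[:maxlen]                             # truncating='post' drops the tail
    return kept + [value] * (maxlen - len(kept))    # padding='post' appends zeros

short = pad_post([5, 3], maxlen=4)                  # [5, 3, 0, 0]
long_ = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)      # [1, 2, 3, 4]
print(short, long_)
```

Keras defaults to `'pre'` for both arguments, so the notebook's explicit `'post'` settings are what place the zeros at the end.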

In [None]:
total_words = len(token.word_index) + 1

In [None]:
total_words

92433

In [None]:
print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[  303  1773  9223 ...     0     0     0]
 [    7    99   114 ...     0     0     0]
 [ 6270   828     1 ...   714  2774  3565]
 ...
 [  812 10880   995 ...     0     0     0]
 [  210   225     1 ...     0     0     0]
 [   55     1    77 ...   107  8793  1224]] 

Encoded X Test
 [[  517  6069  1377 ...     0     0     0]
 [   55    64    15 ...     0     0     0]
 [  366     1    40 ...     0     0     0]
 ...
 [  210    23  1281 ...  2240   454  1227]
 [  169  4787 18036 ...     0     0     0]
 [    1  5151   535 ...     0     0     0]] 

Maximum review length:  130


**Defining the LSTM model used for training (on the GPU when one is available)**


In [None]:
# EMBED_DIM = 32
# LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, 32, input_length = max_length))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))


**The optimizer we are using is Adam. Other optimizers such as SGD are available, but Adam is generally efficient and works well with its default settings.**

**Since the output is binary, we use binary cross-entropy as the loss function and accuracy as the evaluation metric.**
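
For intuition, binary cross-entropy can be written out in a few lines of plain Python. This is a sketch of the standard formula, averaged over samples; the function name and the `eps` clipping constant are assumptions for illustration and not taken from Keras.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)      # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])   # about 0.105
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])   # about 2.303
undecided = binary_crossentropy([1], [0.5])                 # ln(2), about 0.693
print(confident_right, confident_wrong, undecided)
```

The loss grows sharply when the model is confident and wrong, which is what pushes the sigmoid output toward the correct label during training.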

**Compiling the defined model**

In [None]:
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])


**Printing the description of the model**

In [None]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 130, 32)           2957856   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 2,982,753
Trainable params: 2,982,753
Non-trainable params: 0
_________________________________________________________________
None
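
The parameter counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are illustrative.

```python
vocab_size = 92433      # total_words from the notebook
embed_dim = 32          # embedding output size
lstm_units = 64         # LSTM hidden size

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with (input + recurrent) weights plus a bias per unit.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2957856 24832 65 2982753
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.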


**We use a ModelCheckpoint callback (named checkpoint) to save the model locally at the end of an epoch whenever its training accuracy improves on the best value seen so far.**

In [None]:
checkpoint = ModelCheckpoint(
    '/content/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)
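
The `save_best_only=True` behaviour can be sketched as a simple running maximum. The function `epochs_that_save` and the accuracy values are hypothetical and only illustrate the rule; they are not output from this notebook.

```python
def epochs_that_save(monitored_values):
    """Return the 1-based epochs at which a 'save best only' rule would save."""
    best = float("-inf")
    saved = []
    for epoch, value in enumerate(monitored_values, start=1):
        if value > best:          # strictly better than the best so far
            best = value
            saved.append(epoch)
    return saved

accuracies = [0.80, 0.90, 0.88, 0.95, 0.95]   # made-up per-epoch accuracies
saved_epochs = epochs_that_save(accuracies)   # [1, 2, 4]
print(saved_epochs)
```

Because the monitored quantity here is the training accuracy, the saved model is the one that fits the training data best. Monitoring `val_accuracy` with a validation split passed to `fit` is a common alternative when the goal is to keep the model that generalizes best.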

**TRAINING**

**For training, we only need to fit the model on x_train (the input) and y_train (the labels).**

**Training uses mini-batches with a batch_size of 128 and runs for 5 epochs.**

In [None]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])


<keras.callbacks.History at 0x7f7dfb1e4350>

**The training accuracy of the model turned out to be 0.9863**

**Now predicting on the test data**

In [None]:
# y_pred = model.predict(x_test, batch_size = 128)
y_pred=(model.predict(x_test) > 0.7).astype("int32")


In [None]:
y_pred

array([[1],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]], dtype=int32)

**Displaying the total number of correct and incorrect predictions the model made on the unseen test data (x_test)**

In [None]:
true = 0
for i, y in enumerate(y_test):
  if y == y_pred[i][0]:
    true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 8673
Wrong Prediction: 1327
Accuracy: 86.72999999999999


To evaluate the model, we predict the sentiment for the x_test data and compare the predictions with y_test (the expected output). The accuracy is the number of correct predictions divided by the total number of test samples. **The accuracy on the test data turns out to be 86.73%**
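
The choice of threshold affects the reported accuracy. The sketch below uses made-up probabilities (`probs` and `y_true` are illustrative, not model output) to show how the same predictions score under a 0.7 and a 0.5 cut-off, and then repeats the arithmetic for the counts printed above.

```python
def classify(probs, threshold):
    """Label a probability 1 if it is strictly above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the expected labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

probs = [0.95, 0.65, 0.30, 0.10]    # hypothetical sigmoid outputs
y_true = [1, 1, 0, 0]

acc_07 = accuracy(y_true, classify(probs, 0.7))   # 0.75: the 0.65 case is missed
acc_05 = accuracy(y_true, classify(probs, 0.5))   # 1.0

# Counts reported by the notebook: 8673 correct out of 10000.
reported = 8673 / 10000 * 100
print(acc_07, acc_05, reported)
```

A threshold of 0.7 makes the model more conservative about predicting "positive", so mildly positive reviews are labelled negative.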

In [None]:
# !gdown --id 1EkuKCXOWe7dtIo2fLW0ylyIG4vxgLpV_

**Loading the model that we saved earlier.**

In [None]:
loaded_model = load_model('/content/LSTM.h5')


**Now we will test our model on a sample review written by us to check its end-to-end behaviour.**

In [None]:
sample_review = "Hi, the movie was bad and i did not like the action scenes."

**Applying necessary cleaning and filtering.**

In [None]:
regex = re.compile(r'[^a-zA-Z\s]')
sample_review = regex.sub('', sample_review)
print('Cleaned: ', sample_review)

Cleaned:  Hi the movie was bad and i did not like the action scenes
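
Note that this cell deletes non-letter characters, whereas the training reviews had them replaced with a space. For the sample above the result is the same, but the two rules can tokenize other text differently. The sketch below compares them; both function names are hypothetical and the test sentence is made up.

```python
import re

def clean_like_training(text):
    """Training-time rule: delete tags, replace non-letters with a space."""
    return re.sub(r'[^A-Za-z]', ' ', re.sub(r'<.*?>', '', text))

def clean_like_sample_cell(text):
    """Rule used for the sample review: delete everything but letters and whitespace."""
    return re.sub(r'[^a-zA-Z\s]', '', text)

text = "A well-made film, isn't it?"
training_tokens = clean_like_training(text).split()
sample_tokens = clean_like_sample_cell(text).split()

print(training_tokens)   # ['A', 'well', 'made', 'film', 'isn', 't', 'it']
print(sample_tokens)     # ['A', 'wellmade', 'film', 'isnt', 'it']
```

Tokens such as `wellmade` would not be in the fitted vocabulary, so reusing the training-time cleaning at inference time is the safer choice.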


In [None]:
words = sample_review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]
print('Filtered: ', filtered)

Filtered:  ['hi movie bad like action scenes']


**Converting text to tokens.**

In [None]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[5175    3   19    6  114   60    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0]]


**Predicting whether the review is positive or negative using our trained model. We use a threshold of 0.7: predictions above 0.7 are labelled 1 and all others are labelled 0.**

In [None]:
result = (loaded_model.predict(tokenize_words) > 0.7).astype("int32")
# result = loaded_model.predict(tokenize_words)


**If the thresholded result is 0, the review is classified as negative; if it is 1, the review is classified as positive.**

In [None]:
if result[0][0] == 1:
  print('positive')
else:
  print('negative')

negative


Hence, the review is correctly predicted as negative.