<a href="https://colab.research.google.com/github/mishra-atul5001/Data-Science-and-ML-insights-Projects/blob/master/Sentiment_Analysis_Keras_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hi!

Welcome to another of my **Weekend Projects**. This time the problem statement is: **Perform Sentiment Analysis on the Keras IMDB dataset using an RNN model**.

First, I would like to thank [Sentiment Analysis with RNN from Susan](https://github.com/susanli2016/NLP-with-Python/blob/master/Sentiment%20Analysis%20with%20RNN.ipynb), as that notebook helped me take the first step. I will mostly replicate it at first, but to get better performance I'll also experiment with **regex, stop words, and lemmatization**, which filter the text more thoroughly.

This should help, and if it doesn't, no worries, because we'll be trying out multiple models too!

Let's begin!

In [39]:
import numpy as np
import pandas as pd
import re,os
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Since this is an NLP problem, let's try classical ML models too. I'll define the approach as well.

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, GRU

In [40]:
# Keep only the 6000 most frequent words as the vocabulary; rarer words are replaced by an out-of-vocabulary index. We'll also check the max and min review lengths.

vocab_size = 6000
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=vocab_size)

print('X Train Size:', X_train.shape)
print('y Train Size:', y_train.shape)
print('X Test Size:', X_test.shape)
print('y Test Size:', y_test.shape)

X Train Size: (25000,)
y Train Size: (25000,)
X Test Size: (25000,)
y Test Size: (25000,)


In [41]:
# So we have 25K reviews in each split. Let's print one review.

print(X_train[24])

[1, 4, 204, 2, 20, 16, 93, 11, 2, 19, 2, 4390, 6, 55, 52, 22, 849, 4227, 119, 7, 5259, 961, 178, 6, 1018, 221, 20, 1184, 2, 2, 29, 7, 265, 16, 530, 17, 29, 220, 210, 468, 8, 30, 11, 32, 7, 27, 102, 5910, 3634, 17, 3278, 1881, 16, 6, 2, 7, 1262, 190, 4, 20, 122, 2353, 8, 79, 6, 117, 196, 11, 1370, 12, 127, 24, 847, 33, 4, 1062, 7, 4, 2, 310, 131, 12, 9, 6, 253, 20, 15, 144, 30, 110, 33, 222, 280]


In [42]:
# Label for the review at index 24
print(y_train[24])

1


But we only have numbers. What do they mean?

Each word has already been replaced by an integer index for us. So we can load the word index, invert it, and print the **review as words**.

In [43]:
# Word -> index mapping shipped with the dataset
context_word = imdb.get_word_index()
# Invert it to index -> word so we can look words up by number
review_we_need = {indx: word for word, indx in context_word.items()}
review_list = []
print(review_we_need)



In [44]:
for i in X_train[24]:
  review_list.append(review_we_need.get(i,' '))

print('Review ->')
print(review_list)
print('It"s Label ->')
print(y_train[24])

Review ->
['the', 'of', "i've", 'and', 'on', 'with', 'way', 'this', 'and', 'film', 'and', 'mann', 'is', 'time', 'very', 'you', 'de', '3rd', 'did', 'br', 'phil', 'total', 'want', 'is', 'married', 'done', 'on', 'project', 'and', 'and', 'all', 'br', 'screen', 'with', 'themselves', 'movie', 'all', 'family', 'point', 'turn', 'in', 'at', 'this', 'an', 'br', 'be', 'characters', 'kolchak', 'glover', 'movie', 'bo', 'moon', 'with', 'is', 'and', 'br', 'frank', 'take', 'of', 'on', 'off', 'roy', 'in', 'also', 'is', 'over', 'both', 'this', 'details', 'that', 'end', 'his', 'learn', 'they', 'of', "'the", 'br', 'of', 'and', 'house', 'these', 'that', 'it', 'is', 'played', 'on', 'for', 'real', 'at', 'life', 'they', "there's", 'true']
It"s Label ->
1
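
One caveat about the decoded words above: by default `imdb.load_data` shifts every word index by 3 (`index_from=3`) and reserves 0 for padding, 1 for the start of a sequence, and 2 for out-of-vocabulary words, while `imdb.get_word_index()` returns the unshifted indices. A direct lookup is therefore off by three, which is why the review above doesn't read like a sentence. Here is a minimal sketch of an offset-aware decoder; `decode_review` and the toy `toy_word_index` are made up for illustration, and the placeholder tokens are my own choice.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer ids back to words, undoing the index shift."""
    # Reverse mapping: shifted id -> word
    id_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # Ids below index_from are reserved (assumed Keras defaults)
    id_to_word[0] = "<pad>"
    id_to_word[1] = "<start>"
    id_to_word[2] = "<unk>"
    return " ".join(id_to_word.get(i, "<unk>") for i in encoded)

# Toy word index for illustration only
toy_word_index = {"the": 1, "and": 2, "movie": 17}
print(decode_review([1, 4, 20, 2, 5], toy_word_index))
# <start> the movie <unk> and
```

With the real data this would be called as `decode_review(X_train[24], imdb.get_word_index())`.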


In [45]:
# Cool, right? Let's check the maximum and minimum review length.

max_len = (len(max((X_train + X_test),key = len)))
min_len = (len(min((X_train + X_test),key = len)))

print('Max Len of Review: ', max_len)
print('Min Len of Review: ', min_len)

Max Len of Review:  2697
Min Len of Review:  70
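
A note on the numbers above: `X_train` and `X_test` are NumPy object arrays whose elements are Python lists, so `X_train + X_test` adds them elementwise and joins the i-th training review with the i-th test review. The reported lengths therefore most likely describe those joined pairs rather than single reviews. A minimal sketch that measures each review on its own, using a hypothetical helper `review_length_range` and toy data:

```python
from itertools import chain

def review_length_range(*splits):
    """Return (min_len, max_len) over every review in the given splits."""
    lengths = [len(review) for review in chain(*splits)]
    return min(lengths), max(lengths)

# Toy data standing in for X_train and X_test
toy_train = [[1, 14, 22], [1, 5]]
toy_test = [[1, 9, 8, 7, 6], [1, 3, 2, 4]]
print(review_length_range(toy_train, toy_test))  # (2, 5)
```

On the real data the call would be `review_length_range(X_train, X_test)`, run before padding.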


In [46]:
# Now we'll pad the reviews so they all have the same length: shorter reviews get 0s added and longer reviews are truncated to max_words.
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
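
To make the padding step concrete, here is a plain-Python sketch that mimics what `pad_sequences` does with its default settings as I understand them (`padding='pre'`, `truncating='pre'`, `value=0`): zeros are added at the front of short reviews, and long reviews keep only their last `maxlen` tokens. `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the tail of long ones."""
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]               # truncate from the front
        padding = [value] * (maxlen - len(tail))
        padded.append(padding + tail)            # pad at the front
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 6], [3, 4, 5, 6]]
```

So with `max_words = 500`, a 70-token review gets 430 leading zeros, and a very long review loses its opening tokens rather than its ending.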

Now we'll build the RNN using an LSTM. We haven't yet prepared the data by removing **stop words** or reducing words to their **root form**. An LSTM can often learn to give little weight to uninformative words during training, but this preprocessing is still worth trying to see whether it helps.

In [47]:
rnn_model = Sequential()
rnn_model.add(Embedding(vocab_size,32,input_length=max_words))
rnn_model.add(LSTM(100))
rnn_model.add(Dense(1,activation = 'sigmoid'))

print(rnn_model.summary())

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 500, 32)           192000    
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 245,301
Trainable params: 245,301
Non-trainable params: 0
_________________________________________________________________
None


To Summarize:


1.   1 Embedding Layer
2.   1 LSTM Layer
3.   1 Dense output layer with a sigmoid activation
4.   245,301 Total Parameters!
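
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard formulas: an embedding stores one vector per vocabulary entry, an LSTM has four gate blocks that each see the input and the previous hidden state plus a bias, and a dense layer has one weight per input plus a bias. The helper names are mine.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gate blocks, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit
    return input_dim * units + units

counts = [embedding_params(6000, 32), lstm_params(32, 100), dense_params(100, 1)]
print(counts, sum(counts))  # [192000, 53200, 101] 245301
```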

Let's train the model!



In [48]:
rnn_model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])
batch_size = 64
num_epochs = 3

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

rnn_model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x7f4159cfcc88>
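
The validation set above is just the first `batch_size` reviews, so only 64 samples decide the validation accuracy, which makes it a noisy signal. A sketch of an alternative, assuming a shuffled hold-out of around 10% is acceptable; `holdout_indices`, the fraction, and the seed are illustrative choices, not values from this notebook:

```python
import random

def holdout_indices(n_samples, valid_fraction=0.1, seed=42):
    """Shuffle indices once and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_samples * valid_fraction)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = holdout_indices(25000)
print(len(train_idx), len(valid_idx))  # 22500 2500
```

The index lists can then select rows, for example `X_train[valid_idx]` and `y_train[valid_idx]`, and be passed to `fit` through `validation_data`. `train_test_split`, already imported at the top, would do the same job.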

Hyperparameters I want to try out:

*   Changing the optimizer to **nadam**, which is Adam with Nesterov momentum.
*   Increasing the epochs and changing the batch size.
*   Increasing the embedding size and switching to a **GRU** model.
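
One way to keep these experiments organised is to enumerate the configurations up front and build a fresh model for each one. The sketch below only generates the configurations; the option values are the ones used in the cells of this notebook, while `experiment_grid` and the dictionary keys are hypothetical names.

```python
from itertools import product

def experiment_grid(options):
    """Yield one dict per combination of the given hyperparameter options."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))

options = {
    "cell": ["LSTM", "GRU"],
    "optimizer": ["adam", "nadam"],
    "batch_size": [60, 64],
    "epochs": [3, 10],
}
configs = list(experiment_grid(options))
print(len(configs))  # 16
print(configs[0])
```

Running all 16 combinations is probably more than a weekend project needs, so picking a handful from the list is a reasonable compromise.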



In [49]:
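# Switch the optimizer to Nadam, batch size to 60, and epochs to 10.
# Note: compile() keeps the weights learned above, so this continues training the same model.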

rnn_model.compile(loss='binary_crossentropy', 
             optimizer='nadam', 
             metrics=['accuracy'])
batch_size = 60
num_epochs = 10

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

rnn_model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 24940 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f41502e0f60>

In [50]:
# GRU Model

rnn_model_gru = Sequential()
rnn_model_gru.add(Embedding(vocab_size,32,input_length=max_words))
rnn_model_gru.add(GRU(100))
rnn_model_gru.add(Dense(1,activation = 'sigmoid'))

print(rnn_model_gru.summary())

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 500, 32)           192000    
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               39900     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 232,001
Trainable params: 232,001
Non-trainable params: 0
_________________________________________________________________
None
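
The GRU parameter count in the summary can be checked by hand. A GRU has three gate blocks, each with input weights, recurrent weights, and a bias. The sketch assumes one bias vector per gate, which matches the 39,900 reported above; newer Keras versions default to `reset_after=True`, which keeps two bias vectors per gate and reports a slightly larger number. `gru_params` is a hypothetical helper.

```python
def gru_params(input_dim, units, biases_per_gate=1):
    # 3 gate blocks, each with input weights, recurrent weights and bias vector(s)
    return 3 * ((input_dim + units) * units + biases_per_gate * units)

print(gru_params(32, 100))                     # 39900
print(gru_params(32, 100, biases_per_gate=2))  # 40200
```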


In [51]:
# Nadam optimizer, batch size 60, 10 epochs

rnn_model_gru.compile(loss='binary_crossentropy', 
             optimizer='nadam', 
             metrics=['accuracy'])
batch_size = 60
num_epochs = 10

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

rnn_model_gru.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 24940 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f413c72af60>

In [52]:
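# Same GRU model again with batch size 64 and 3 epochs. The weights from the previous run are kept,
# so this trains further rather than from scratch.
rnn_model_gru.compile(loss='binary_crossentropy', 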
             optimizer='nadam', 
             metrics=['accuracy'])
batch_size = 64
num_epochs = 3

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

rnn_model_gru.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x7f4140e8feb8>