# Sentiment Analysis from movie reviews using RNN
We will use the IMDb dataset, which consists of user movie reviews, each labelled by whether the user liked the movie or not.
The network reads each review word by word, keeps what it has read in memory over time, and predicts the sentiment of the user for that review.
We use LSTM cells because we don't want the network to forget words too quickly: words early in a review can still affect the meaning of the whole review significantly.

In [0]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.datasets import imdb

Let's load our train and test data. We keep only the 50000 most frequent words; any rarer word is replaced by an out-of-vocabulary index. The dataset comes pre-split into 25000 training reviews and 25000 testing reviews.

In [2]:
(x_train, y_train),(x_test,y_test) = imdb.load_data(num_words=50000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


Let's look at what the data looks like.

In [3]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

Well, that's just numbers :P. Keras has already converted each unique word to a number based on how frequent the word is, and we have limited those numbers to 50000, so we only work with the most popular words to determine the sentiment.
We cannot read these reviews directly, but it saves a lot of time during preprocessing.
Let's look at the corresponding y_train. It is 0 or 1, which corresponds to a dislike or a like of that movie.

In [4]:
y_train[0]

1
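
The integer encoding described above can be reversed to make a review readable again. The sketch below assumes the default settings of `imdb.load_data` (index 0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and real word ranks shifted by 3). The helper `decode_review` and the tiny `toy_word_index` are hypothetical names made up for illustration; in the notebook the real mapping would come from `imdb.get_word_index()`.

```python
# Keras reserves the first indices: 0 = padding, 1 = start of review,
# 2 = out-of-vocabulary word. Real word ranks are shifted by index_from=3.
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer ids back into a readable string."""
    # Invert the word -> rank mapping and apply the same shift Keras uses.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word.update(SPECIAL_TOKENS)
    # Any id missing from the mapping is shown as out-of-vocabulary.
    return " ".join(id_to_word.get(i, "<oov>") for i in encoded)


# Toy vocabulary, made up for illustration; the real mapping comes from
# imdb.get_word_index() and has tens of thousands of entries.
toy_word_index = {"this": 11, "film": 19, "was": 13, "just": 40, "brilliant": 527}

decoded = decode_review([1, 14, 22, 16, 43, 530], toy_word_index)
print(decoded)
```

With the real word index, the same function could be applied to `x_train[0]` before padding to read the full review.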

Now we are going to limit each review to 150 words, as RNNs are very resource intensive. Shorter reviews are padded with zeros so that every sequence has the same length.

In [0]:
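# Pad or truncate every review to exactly 150 tokens. With the default settings,
# pad_sequences keeps the last 150 tokens and pads shorter reviews with leading zeros.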
x_train = sequence.pad_sequences(x_train, maxlen=150)
x_test = sequence.pad_sequences(x_test, maxlen=150)
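
To make the padding and truncation behaviour concrete, here is a small pure-Python sketch that mimics what `pad_sequences` does with its default arguments (`padding='pre'`, `truncating='pre'`). The function name `pad_or_truncate` and the short example reviews are assumptions for illustration only; the notebook itself uses the Keras function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: padding='pre', truncating='pre'."""
    if len(seq) >= maxlen:
        # Too long: drop tokens from the front and keep the last `maxlen`.
        return list(seq[-maxlen:])
    # Too short: fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_or_truncate(short_review, maxlen=5)
trimmed = pad_or_truncate(long_review, maxlen=5)
print(padded)   # [0, 0, 1, 14, 22]
print(trimmed)  # [22, 16, 43, 530, 973]
```

Note that the long review loses its first tokens, not its last ones. If keeping the beginning of each review is preferred, `pad_sequences` accepts `truncating='post'`.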

Let's build the model and take a look at how it is put together.
We start with an Embedding layer, which converts the input data into dense vectors of fixed size that are better suited to a neural network. Each of the 150 word indices in a review (drawn from the 50000 most frequent words) is mapped to a dense vector of 128 values.
Then we set up our RNN: an LSTM with 128 recurrent units and a dropout of 20% (on both the inputs and the recurrent connections) to prevent overfitting.
Finally, we have a single output neuron with a sigmoid activation, which gives our binary sentiment classification of 0 or 1.

In [6]:
model = Sequential()
model.add(Embedding(50000, 128))  # Map each of the 50000 word indices to a 128-dimensional vector
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # LSTM with 128 recurrent units and 20% dropout
model.add(Dense(1, activation='sigmoid'))  # Output layer: a single sigmoid neuron
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 128)         6400000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
=================================================================
Total params: 6,531,713
Trainable params: 6,531,713
Non-trainable params: 0
_________________________________________________________________
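
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The variable names are chosen here for illustration.

```python
vocab_size = 50000   # num_words passed to imdb.load_data
embed_dim = 128      # output size of the Embedding layer
units = 128          # number of LSTM units

# One 128-dimensional vector per word index.
embedding_params = vocab_size * embed_dim

# 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embed_dim + units) * units + units)

# One weight per LSTM output plus one bias for the single output neuron.
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 6400000 131584 129 6531713
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size.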


Since it's a binary classification problem, we use the binary_crossentropy loss function and the adam optimizer.

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
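
As a rough illustration of what this loss measures, the sketch below computes binary cross-entropy for a single prediction in plain Python. The functions `sigmoid` and `binary_crossentropy` are written here for illustration and are not the Keras implementations; the clipping constant `eps` is an assumption that keeps the logarithm finite.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one example: small when y_prob agrees with y_true."""
    # Clip the probability so that log(0) can never occur.
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# The output neuron produces a raw score of 2.0, i.e. "probably positive".
p = sigmoid(2.0)
loss_if_positive = binary_crossentropy(1, p)  # label agrees with prediction
loss_if_negative = binary_crossentropy(0, p)  # label disagrees
print(round(p, 4), round(loss_if_positive, 4), round(loss_if_negative, 4))
```

The loss is small when the label agrees with the prediction and grows quickly when the model is confidently wrong, which is why it suits a 0/1 target.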

Let's train the model. This will take a lot of time (roughly four minutes per epoch in the run below).
We use a batch size of 32 and 15 epochs.

In [9]:
model.fit(x_train, y_train, batch_size=32, epochs=15, verbose=2, validation_data=(x_test,y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
25000/25000 - 258s - loss: 0.4631 - acc: 0.7823 - val_loss: 0.4802 - val_acc: 0.7729
Epoch 2/15
25000/25000 - 255s - loss: 0.2853 - acc: 0.8877 - val_loss: 0.3894 - val_acc: 0.8394
Epoch 3/15
25000/25000 - 255s - loss: 0.1749 - acc: 0.9366 - val_loss: 0.4098 - val_acc: 0.8494
Epoch 4/15
25000/25000 - 254s - loss: 0.1199 - acc: 0.9573 - val_loss: 0.4580 - val_acc: 0.8314
Epoch 5/15
25000/25000 - 251s - loss: 0.0785 - acc: 0.9728 - val_loss: 0.5004 - val_acc: 0.8413
Epoch 6/15
25000/25000 - 252s - loss: 0.0491 - acc: 0.9837 - val_loss: 0.5894 - val_acc: 0.8394
Epoch 7/15
25000/25000 - 258s - loss: 0.0356 - acc: 0.9881 - val_loss: 0.6645 - val_acc: 0.8417
Epoch 8/15
25000/25000 - 254s - loss: 0.0257 - acc: 0.9920 - val_loss: 0.7935 - val_acc: 0.8312
Epoch 9/15
25000/25000 - 256s - loss: 0.0210 - acc: 0.9930 - val_loss: 0.7761 - val_acc: 0.8356
Epoch 10/15
25000/25000 - 250s - loss: 0.0154 - acc: 0.9949 - val_loss: 0.8851 - val_a

<tensorflow.python.keras.callbacks.History at 0x7fed4f798438>

Let's evaluate the model on our test data.

In [10]:
score, acc = model.evaluate(x_test, y_test, batch_size=32, verbose=2)
print('Test Score: ',score)
print('Test Accuracy: ', acc)

25000/25000 - 28s - loss: 1.0146 - acc: 0.8298
Test Score:  1.014570419074893
Test Accuracy:  0.82976
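
The training log above suggests the model starts to overfit early. As a quick check, the sketch below copies the loss values for the ten epochs that are visible in the log and finds the epoch with the lowest validation loss. The list names are chosen here for illustration; nothing beyond the logged numbers is assumed.

```python
# Values copied from the training log above (epochs 1 to 10).
train_loss = [0.4631, 0.2853, 0.1749, 0.1199, 0.0785,
              0.0491, 0.0356, 0.0257, 0.0210, 0.0154]
val_loss = [0.4802, 0.3894, 0.4098, 0.4580, 0.5004,
            0.5894, 0.6645, 0.7935, 0.7761, 0.8851]

# Epoch numbers are 1-based, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_val_loss = val_loss[best_epoch - 1]
print(best_epoch, best_val_loss)  # 2 0.3894

# Training loss keeps falling after the best epoch while validation loss rises.
still_improving_on_train = train_loss[-1] < train_loss[best_epoch - 1]
print(still_improving_on_train)  # True
```

Validation loss is lowest at epoch 2, while the final test loss is about 1.01, so training for fewer epochs would likely have given a lower test loss at a similar accuracy.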
