## Sentiment Analysis of Movie Reviews from the IMDB Dataset

### Recurrent Neural Network with Keras

#### We will use the IMDB dataset, which consists of user-written movie reviews, each labeled by whether the user liked the movie, based on the rating attached to the review.

**We are using an RNN to do sentiment analysis on full-text movie reviews!**

**Steps to follow:**

The goal is to train an artificial neural network to "read" movie reviews and guess from the text whether the author liked the movie or not.

We need a recurrent neural network because it keeps a "memory" of the words that came before as it "reads" a sentence over time; understanding written language depends on keeping track of the earlier words in a sentence.

Words used early in a sentence can affect its meaning, but a plain RNN tends to lose track of them over long sequences. To mitigate this we will use LSTM (Long Short-Term Memory) cells.
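
To make the idea of "memory" a bit more concrete, here is a minimal sketch of a single LSTM step on scalar values. It is only an illustration of the gating idea, not how Keras implements its `LSTM` layer (which works on vectors with learned weight matrices); the function name `lstm_step` and the weight dictionary are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar input, hidden state, and cell state."""
    # Each gate sees the current input and the previous hidden state.
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new
    h = o * math.tanh(c)     # hidden state passed on to the next time step
    return h, c

# Hypothetical weights (all zero), used only to exercise the function.
zero_weights = {k: 0.0 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                                 "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, w=zero_weights)
print(h, c)
```

With all-zero weights every gate is 0.5, so half of the previous cell state is carried forward. In a trained network the forget gate can learn to stay close to 1 for information worth remembering.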

### Importing Packages

In [46]:
import tensorflow
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
from keras import optimizers


Now import our training and testing data. We specify that we only care about the 20000 most frequent words in the dataset in order to keep things somewhat manageable. The data includes 25000 training reviews and 25000 testing reviews.

In [47]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=20000,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
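
As a rough illustration of what `num_words` and `oov_char` do, the sketch below replaces every index outside the kept vocabulary with the out-of-vocabulary marker. The helper name `cap_vocabulary` is hypothetical and the input list is made up; this is a simplified pure-Python approximation of the filtering step, not the library code itself.

```python
def cap_vocabulary(encoded, num_words, oov_char=2, skip_top=0):
    """Replace indices outside [skip_top, num_words) with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in encoded]

# Made-up encoded review containing two indices that are too large.
capped = cap_vocabulary([1, 14, 25000, 19999, 20000], num_words=20000)
print(capped)
```

This is why the value 2 can show up inside an encoded review: it stands for a word that fell outside the 20000 kept.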

Let's get an idea of what this data looks like. Each training feature should represent a written movie review.

In [48]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,

That does not look like a movie review, but the dataset has spared us a lot of trouble: the words have already been converted to integer indices. The actual letters that make up a word don't really matter as far as our model is concerned. What matters is the words themselves, and our model needs numbers to work with, not letters.
Hence each number represents a specific word in the training features. It's a bummer that we can't just read the reviews in English as a gut check to see whether sentiment analysis is really working, though.
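
That said, the indices can be mapped back to words. The sketch below shows the idea with a tiny hand-written word index; in the notebook one would presumably pass the dictionary returned by `imdb.get_word_index()` instead. The helper `decode_review` and the toy dictionary are assumptions for illustration. The offset of 3 corresponds to the `index_from=3` argument used when loading the data, with 0, 1, and 2 reserved for padding, start, and out-of-vocabulary markers.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer indices back into a readable string."""
    reserved = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    # The stored indices are shifted by index_from relative to word_index.
    index_to_word = {idx + index_from: word for word, idx in word_index.items()}
    words = [reserved.get(i, index_to_word.get(i, "<?>")) for i in encoded]
    return " ".join(words)

# Toy word index for illustration only (a real one has tens of thousands of entries).
toy_word_index = {"this": 11, "film": 19, "was": 13, "just": 40, "brilliant": 527}
decoded = decode_review([1, 14, 22, 16, 43, 530, 2], toy_word_index)
print(decoded)
```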

In [49]:
y_train[0]

1

The labels are just 0 or 1, indicating whether the reviewer liked the movie or not. So we have a bunch of movie reviews converted into vectors of words represented by integers, plus a binary sentiment classification to learn from. RNNs can become expensive quickly as sequences get longer, so again, to keep things manageable on our little PC, let's limit each review to 100 words. Note that with its default settings `pad_sequences` keeps the last 100 words of a longer review and pads shorter reviews with zeros at the front.


In [51]:
from keras.preprocessing import sequence
from keras.models import Sequential
x_train = sequence.pad_sequences(x_train, maxlen=100)
x_test = sequence.pad_sequences(x_test, maxlen=100)
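
The following pure-Python sketch mimics what `pad_sequences` does to a single review under its default settings (`padding='pre'`, `truncating='pre'`). The helper name `pad_or_truncate` is made up for this example, and it assumes `maxlen` is a positive integer.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last maxlen items; left-pad shorter sequences with value."""
    seq = list(seq)[-maxlen:]                   # drop items from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

long_review = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
short_review = pad_or_truncate([7, 8], maxlen=4)
print(long_review, short_review)
```

So a review longer than `maxlen` loses its opening words, not its closing ones, unless `truncating='post'` is passed.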

Let's set up our neural network model! We will start with an embedding layer.

This step converts the input data into dense vectors of fixed size, which are better suited to a neural network. You generally see it used with index-based text data like we have here. The 20000 is the vocabulary size (remember, we kept only the top 20000 words) and 128 is the size of the vector produced for each word.

Next we set up an LSTM layer for the RNN itself. It's that easy: we specify 128 units (chosen here to equal the output size of the embedding layer, although the two need not match) and add dropout terms to reduce overfitting, to which RNNs are particularly prone.

Finally we boil it down to a single neuron with a sigmoid activation function to produce our binary sentiment classification of 0 or 1.

In [52]:
review = Sequential()
review.add(Embedding(20000,128))
review.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
review.add(Dense(1, activation='sigmoid'))
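
To get a feel for the size of this model, the arithmetic below counts its trainable parameters by hand, assuming the standard LSTM formulation with four gates, each with input weights, recurrent weights, and a bias. The variable names are illustrative; the total can be compared with what `review.summary()` reports.

```python
vocab_size = 20000
embedding_dim = 128
lstm_units = 128

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, which is one reason the vocabulary is capped at 20000 words.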


As this is a binary classification problem, we'll use the binary_crossentropy loss function.

In [53]:
review.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
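
For intuition about this loss, here is a small sketch of the sigmoid and the binary cross-entropy for a single prediction. It is a simplified stand-in for what the framework computes over a whole batch; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

loss_unsure = binary_crossentropy(1, sigmoid(0.0))  # probability 0.5
loss_confident_right = binary_crossentropy(1, 0.99)
loss_confident_wrong = binary_crossentropy(1, 0.01)
print(loss_unsure, loss_confident_right, loss_confident_wrong)
```

A confident wrong answer is penalized far more than an unsure one, which is what pushes the sigmoid output toward the correct label during training.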

Now it's time to train our model, `review`. RNNs are resource heavy, so keeping the batch size relatively small is a practical way to run them on a PC.

In [54]:
review.fit(x_train, y_train, batch_size = 32, epochs = 15, verbose = 2, validation_data = (x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
 - 261s - loss: 0.4662 - acc: 0.7798 - val_loss: 0.3977 - val_acc: 0.8188
Epoch 2/15
 - 272s - loss: 0.3125 - acc: 0.8731 - val_loss: 0.3657 - val_acc: 0.8471
Epoch 3/15
 - 255s - loss: 0.2232 - acc: 0.9121 - val_loss: 0.3822 - val_acc: 0.8357
Epoch 4/15
 - 245s - loss: 0.1617 - acc: 0.9392 - val_loss: 0.4182 - val_acc: 0.8395
Epoch 5/15
 - 246s - loss: 0.1147 - acc: 0.9577 - val_loss: 0.5426 - val_acc: 0.8331
Epoch 6/15
 - 245s - loss: 0.0893 - acc: 0.9688 - val_loss: 0.5729 - val_acc: 0.8314
Epoch 7/15
 - 244s - loss: 0.0605 - acc: 0.9796 - val_loss: 0.6191 - val_acc: 0.8299
Epoch 8/15
 - 255s - loss: 0.0448 - acc: 0.9847 - val_loss: 0.7623 - val_acc: 0.8294
Epoch 9/15
 - 273s - loss: 0.0366 - acc: 0.9874 - val_loss: 0.8108 - val_acc: 0.8221
Epoch 10/15
 - 259s - loss: 0.0332 - acc: 0.9896 - val_loss: 0.8003 - val_acc: 0.8286
Epoch 11/15
 - 245s - loss: 0.0230 - acc: 0.9928 - val_loss: 0.8536 - val_acc: 0.8284
Epoch 12/15
 

<keras.callbacks.History at 0x2b4a504def0>
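
The log above can also be read for signs of overfitting. The sketch below copies the validation numbers of the eleven completed epochs from that output and finds the epoch with the lowest validation loss; the variable names are illustrative.

```python
# Validation metrics copied from the training log (epochs 1 to 11).
val_loss = [0.3977, 0.3657, 0.3822, 0.4182, 0.5426, 0.5729,
            0.6191, 0.7623, 0.8108, 0.8003, 0.8536]
val_acc = [0.8188, 0.8471, 0.8357, 0.8395, 0.8331, 0.8314,
           0.8299, 0.8294, 0.8221, 0.8286, 0.8284]

# Index of the smallest validation loss; epochs are numbered from 1.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1
print(best_epoch, val_loss[best_index], val_acc[best_index])
```

In this log the validation loss is lowest after epoch 2 and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting. Keras provides an `EarlyStopping` callback that can stop training based on a monitored quantity such as `val_loss`; whether it would improve the final score here has not been tested.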

OK, let's evaluate our model's accuracy:

In [55]:
score, acc = review.evaluate(x_test, y_test, batch_size=32, verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 1.1509044100207277
Test accuracy: 0.82256
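
As a reminder of what the accuracy figure means, the sketch below computes binary accuracy from sigmoid outputs by thresholding at 0.5. The helper name `binary_accuracy` and the sample values are made up; the notebook's metric is computed by Keras over all 25000 test reviews.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the 0/1 labels after thresholding."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(preds, y_true))
    return correct / len(y_true)

# Made-up labels and predicted probabilities.
example_acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(example_acc)
```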


## Conclusion

Initially, when I started working on this, I did not set `maxlen` in `sequence.pad_sequences` and got an accuracy close to 83%. I was worried the data would blow up without any limit, so I set it to 120 and eventually to 100, which seemed optimal to me. Adam is usually a good optimizer, so I decided to go with it. Choosing the number of epochs and running the code took me forever. I changed the number of epochs, observed the results, and compared them: from 15 to 20, from 20 to 50, and lastly from 50 back to 15. Changing the number of epochs did not produce a major change in accuracy, so I decided to go with 15 epochs and reached a test accuracy of 82.256%.

**Considering we limited ourselves to 100 words of each review, the accuracy is not bad.
With such a small amount of code, we have a neural network that can read reviews and deduce whether the author liked the movie or not based on the text. And it takes the context of each word and its position in the review into account. It's pretty incredible what we can do with Keras.**
