# Analyzing IMDB Data in Keras

In [94]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

## 0. About
Goal: build, train, and evaluate a neural network in Keras on real data consisting of thousands of IMDB movie reviews. The challenge is to predict the sentiment of a review
based on the words in it. Use methods such as dropout,
regularisation, and optimisers to reach a model accuracy of
at least 85%.

The data is a set of IMDB reviews obtained from the Keras dataset list https://keras.io/datasets (25,000 for training and 25,000 for testing, as the shapes printed in section 1 show). Each review carries a label:
- Label 0 - Negative Review
- Label 1 - Positive Review

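The input is already preprocessed: each review is
encoded as a sequence (https://keras.io/preprocessing/sequence/) of integer indexes, one per word in the review. Words are indexed
by frequency, so index 1 is the most frequent word (e.g. "the"), index 2 is the second most frequent word, and index 0
is reserved for unknown words. Note that `load_data`, as called in section 1, shifts these indexes: with `index_from=3`, index 1 (`start_char`) marks the start of a review and index 2 (`oov_char`) stands for unknown words.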

A sentence is converted into a vector by concatenating the integer indexes of its words. For example, given the sentence "to be or not to be", the indexes of the words are:
- "to": 5
- "be": 8
- "or": 21
- "not": 3

... and the sentence is encoded as the vector [5, 8, 21, 3, 5, 8].
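
The encoding described above can be sketched in a few lines of plain Python. The `word_index` dictionary and the `encode` helper are hypothetical and hold only the four example indexes; the real dataset comes with its own, much larger index.

```python
# Toy word-to-index lookup using the indexes from the example above.
# (Hypothetical: the real IMDB word index is much larger.)
word_index = {"to": 5, "be": 8, "or": 21, "not": 3}

def encode(sentence, index):
    """Replace each whitespace-separated word with its integer index."""
    return [index[word] for word in sentence.lower().split()]

encoded = encode("to be or not to be", word_index)
print(encoded)  # [5, 8, 21, 3, 5, 8]
```

Repeated words simply repeat their index, so the encoded vector is as long as the sentence.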

## 1. Loading the data
This dataset comes preloaded with Keras. The following command loads the reviews and their labels, already split into training and testing sets. The `num_words` parameter controls how many words we want to look at. We've set it to 1000, but feel free to experiment.

Note that:
- `num_words` is the number of most frequent words to keep (a lower value excludes obscure words like 'ultracrepidarian')
- `skip_top` is the number of most frequent words to ignore (e.g. to ignore the word "the", which adds little value to a review, set `skip_top` to a value of 2 or higher)
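
As a rough illustration of how these two parameters interact, the sketch below filters one already-encoded review. It assumes that any index outside the range `skip_top <= index < num_words` is replaced by the out-of-vocabulary marker (the `oov_char=2` argument in the next cell). `filter_sequence` is a hypothetical helper, not a Keras function, and the sample review is made up.

```python
def filter_sequence(seq, num_words, skip_top=0, oov_char=2):
    """Keep indexes in [skip_top, num_words); replace the rest with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in seq]

# Made-up encoded review containing two rare words (1622 and 1385).
review = [1, 14, 22, 1622, 43, 530, 973, 1385]

# Only indexes below 1000 survive; rarer words become 2.
kept = filter_sequence(review, num_words=1000)
print(kept)     # [1, 14, 22, 2, 43, 530, 973, 2]

# Additionally ignore every index below 20.
skipped = filter_sequence(review, num_words=1000, skip_top=20)
print(skipped)  # [2, 2, 22, 2, 43, 530, 973, 2]
```

Ignored words are not deleted from the review; they are replaced, so the review keeps its original length.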

In [95]:
# Loading the data (it's preloaded in Keras)
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
    num_words=1000,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3)

print(x_train.shape)
print(x_test.shape)

(25000,)
(25000,)


## 2. Examining the data
Notice that the data has already been pre-processed: every word has been replaced by a number, and each review comes in as a vector of the numbers of the words it contains. For example, if the word 'the' is the first one in our dictionary and a review contains the word 'the', then a 1 appears in the corresponding vector.

The output comes as a vector of 1's and 0's, where 1 is a positive sentiment for the review, and 0 is negative.

In [96]:
print(x_train[0])
print(y_train[0])

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
1
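
To see why the review above starts with 1 and contains so many 2s, it helps to map indexes back to tokens. The sketch below assumes the `load_data` arguments from section 1 (`start_char=1`, `oov_char=2`, `index_from=3`), so a word with frequency rank r is stored as r + 3. The `toy_ranks` table is a hypothetical stand-in for the real word index, and its ranks are illustrative only.

```python
INDEX_FROM = 3                          # offset added to every word rank
SPECIAL = {1: "<start>", 2: "<oov>"}    # reserved marker indexes

# Hypothetical rank table (rank 1 = most frequent word).
toy_ranks = {"this": 11, "film": 19, "was": 13, "just": 40}

# Invert the table, shifting each rank by INDEX_FROM to get the stored index.
index_to_word = {rank + INDEX_FROM: word for word, rank in toy_ranks.items()}
index_to_word.update(SPECIAL)

def decode(seq):
    """Map stored indexes back to tokens; unmapped indexes show as <?>."""
    return " ".join(index_to_word.get(i, "<?>") for i in seq)

decoded = decode([1, 14, 22, 16, 43, 2])
print(decoded)  # <start> this film was just <oov>
```

Every 2 in the printed review is a word that fell outside the 1000 most frequent ones and was replaced by the out-of-vocabulary marker.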


## 3. One-hot encoding the input and output
Here, we'll pre-process the data by turning the input vectors into (0,1)-vectors. 

Example 1: if the pre-processed vector contains the number 14, then in the processed vector, the 14th entry will be 1.

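Example 2: if we have 10 words in our vocabulary, and the vector is (4,1,8), we'll turn it into the vector (1,0,0,1,0,0,0,1,0,0), counting positions from 1. (`sequences_to_matrix` itself uses 0-based columns, so index 4 sets column 4.)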
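
A minimal pure-Python sketch of this binary encoding follows. `multi_hot` is a hypothetical helper that mimics what `sequences_to_matrix(..., mode='binary')` does for a single review. It uses 0-based columns, so index 4 sets column 4; this is why its output differs from a count-from-1 reading of Example 2.

```python
def multi_hot(seq, num_words):
    """Return a 0/1 list with a 1 at every index that occurs in seq."""
    row = [0] * num_words
    for i in seq:
        if 0 <= i < num_words:  # ignore indexes outside the vocabulary
            row[i] = 1
    return row

vec = multi_hot([4, 1, 8], num_words=10)
print(vec)  # [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Word order and repetition leave no trace in the result.
same = multi_hot([8, 4, 4, 1], num_words=10)
print(same == vec)  # True
```

The encoding records only which words occur, so every review becomes a fixed-length vector regardless of how long the review is.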

In [97]:
# One-hot encoding the input into binary vectors, each of length 1000
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])

[ 0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  1.  0.  1.  1.  0.  0.  1.  1.  0.  1.  0.  1.  0.  1.  1.  0.  1.
  1.  0.  1.  1.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  1.  1.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  1.  0.  1.  1.
  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.  0.
  1.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0

And we'll also one-hot encode the output.

In [98]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
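
For intuition, the label encoding can be sketched in plain Python. `one_hot_labels` is a hypothetical helper; it produces the same rows as `to_categorical`, except that Keras returns a float array rather than lists of integers.

```python
def one_hot_labels(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    return [[1 if c == label else 0 for c in range(num_classes)]
            for label in labels]

encoded_labels = one_hot_labels([1, 0, 0, 1], num_classes=2)
print(encoded_labels)  # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

A negative review (label 0) becomes `[1, 0]` and a positive review (label 1) becomes `[0, 1]`, which matches the two-unit softmax output layer used in the next section.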


## 4. Building the model architecture
Follow this guide: https://keras.io/getting-started/sequential-model-guide/. Build a model here using `Sequential`. Feel free to experiment with different layers and sizes! Also try adding dropout to reduce overfitting.
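
Dropout randomly silences a fraction of a layer's units during training, so the network cannot rely on any single unit. The sketch below shows the common "inverted dropout" formulation in plain Python. It is an illustration of the idea under that assumption, not Keras's actual implementation, and `dropout` here is a hypothetical function.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors.

    Requires 0 <= rate < 1. Dividing by (1 - rate) keeps the expected
    sum of the activations unchanged.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 8

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or 2.0

unchanged = dropout(activations, rate=0.0, rng=rng)
print(unchanged == activations)  # True
```

Dropout applies only while training; at evaluation time all units are used.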

In [99]:
# Build the model architecture
model = Sequential()

# Add layers one at a time with model.add()
# The first layer needs the input shape; subsequent layers infer theirs
model.add(Dense(512, activation='relu', input_dim=1000))
model.add(Dropout(0.5))
# model.add(Activation('tanh'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

# Multi-class classification config.
# Compile the model using a loss (error) function and an optimizer.
# Configure learning process before training the model
# Optimisers: https://keras.io/optimizers/

# # Option #1 - RMSprop with default settings
# model.compile(loss='categorical_crossentropy',
#               optimizer='rmsprop',
#               metrics=['accuracy'])

# Option #2 - RMSprop with explicit hyperparameters
rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
model.compile(loss='categorical_crossentropy', 
              optimizer=rmsprop,
              metrics=['accuracy'])

# # Option #3 - Stochastic gradient descent optimizer.
# # All parameter gradients clipped to max of 1
# sgd = keras.optimizers.SGD(lr=0.01, 
#                      decay=1e-6, 
#                      momentum=0.9, 
#                      nesterov=True,
#                      clipvalue=1.
#                     )
# model.compile(loss='mean_squared_error', optimizer=sgd)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_25 (Dense)             (None, 512)               512512    
_________________________________________________________________
dropout_13 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_26 (Dense)             (None, 2)                 1026      
Total params: 513,538
Trainable params: 513,538
Non-trainable params: 0
_________________________________________________________________
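
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(1000, 512)  # first Dense layer: 1000 inputs, 512 units
output = dense_params(512, 2)     # output layer: 512 inputs, 2 classes
total = hidden + output
print(hidden, output, total)  # 512512 1026 513538
```

The Dropout layer contributes no parameters, which is why the summary lists 0 for it.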


## 5. Training the model
Run the model here. Experiment with different values of `batch_size` and different numbers of epochs!

In [100]:
# Train model iterating on the data in batches of 32 samples
# Fit: https://keras.io/models/sequential/
hist = model.fit(x_train, y_train,
          batch_size=32,
          epochs=2,
          validation_data=(x_test, y_test), 
          verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
9s - loss: 0.3966 - acc: 0.8249 - val_loss: 0.3421 - val_acc: 0.8552
Epoch 2/2
8s - loss: 0.3341 - acc: 0.8671 - val_loss: 0.3427 - val_acc: 0.8632
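
To get a feel for what `batch_size` and `epochs` control, the arithmetic below counts the weight updates in the run above. It assumes one update per batch and that the final, smaller batch of each epoch is still used.

```python
import math

n_samples, batch_size, epochs = 25000, 32, 2

steps_per_epoch = math.ceil(n_samples / batch_size)           # batches per epoch
last_batch = n_samples - (steps_per_epoch - 1) * batch_size   # size of final batch
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch, total_updates)  # 782 8 1564
```

A larger `batch_size` means fewer but smoother updates per epoch; more epochs mean more passes over the same 25,000 reviews.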


## 6. Evaluating the model
This will give you the accuracy of the model, as evaluated on the testing set. Can you get something over 85%?

In [101]:
# Evaluate once and reuse the result
score = model.evaluate(x_test, y_test, verbose=0)
print(type(score))
print("Model metric names: ", model.metrics_names)
print(score)
print("Accuracy: ", score[1])

<class 'list'>
Model metric names:  ['loss', 'acc']
[0.3426602667713165, 0.86316000000000004]
Accuracy:  0.86316
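
The two numbers in `score` are the loss and the accuracy. A minimal sketch of how both could be computed from softmax outputs and one-hot labels is shown below. The `evaluate` function and the four predictions are hypothetical, and the loss is assumed to be the mean categorical cross-entropy over the samples.

```python
import math

def evaluate(probs, targets):
    """Return (mean categorical cross-entropy, accuracy)."""
    losses, correct = [], 0
    for p, t in zip(probs, targets):
        true_class = t.index(1)                  # position of the 1 in the label
        losses.append(-math.log(p[true_class]))  # penalise low confidence in it
        predicted = p.index(max(p))              # most probable class
        correct += predicted == true_class
    return sum(losses) / len(losses), correct / len(probs)

# Hypothetical softmax outputs for four reviews and their one-hot labels.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
targets = [[1, 0], [0, 1], [0, 1], [0, 1]]

loss, accuracy = evaluate(probs, targets)
print(accuracy)  # 0.75
```

By the same definition, an accuracy of 0.86316 on 25,000 test reviews corresponds to 21,579 correctly classified reviews.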
