# Analyzing IMDB Data in Keras

In [1]:
# Imports
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

Using TensorFlow backend.


## 1. Loading the data
This dataset comes preloaded with Keras, so one simple command will get us training and testing data. There is a parameter for how many words we want to look at. We've set it at 1000, but feel free to experiment.

In [2]:
# Loading the data (it's preloaded in Keras)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)

print(x_train.shape)
print(x_test.shape)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
(25000,)
(25000,)


## 2. Examining the data
Notice that the data has been already pre-processed, where all the words have numbers, and the reviews come in as a vector with the words that the review contains. For example, if the word 'the' is the first one in our dictionary, and a review contains the word 'the', then there is a 1 in the corresponding vector.

The output comes as a vector of 1's and 0's, where 1 is a positive sentiment for the review, and 0 is negative.

In [3]:
print(x_train[1])
print(y_train[1])

[1, 48, 25, 942, 72, 4, 86, 31, 16, 66, 128, 31, 168, 33, 2, 2, 2, 59, 9, 147, 384, 2, 250, 168, 33, 2, 2, 59, 9, 43, 117, 2, 2, 187, 59, 9, 164, 84, 92, 2, 41, 333, 2, 16, 2, 5, 893, 11, 86, 20, 150, 29, 9, 896, 393, 65, 9, 24, 15, 52, 5, 13, 81, 24, 391, 138, 161, 36, 97, 14, 31, 86, 12, 9, 4, 454, 2, 156, 164, 19, 65, 14, 9, 24, 2, 14, 9, 395, 86, 31, 47, 128, 156, 128, 65, 5, 94, 384, 13, 104, 15, 4, 228, 9, 128, 11, 2, 2, 300, 5, 4, 228, 9, 128, 11, 2, 2, 342, 12, 9, 24, 4, 249, 20, 13, 219, 21, 11, 2, 19, 86, 31, 94, 31, 194, 194, 194, 164]
0


## 3. One-hot encoding the output
Here, we'll turn the input vectors into (0,1)-vectors. For example, if the pre-processed vector contains the number 14, then in the processed vector, the 14th entry will be 1.

In [4]:
# One-hot encoding the output into vector mode, each of length 1000
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])

[ 0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  0.  1.  1.  1.  1.  0.  1.  1.  0.  1.  1.  0.  0.  1.  1.  1.
  1.  1.  0.  1.  1.  0.  0.  0.  1.  0.  1.  1.  1.  1.  1.  0.  1.  0.
  1.  1.  1.  0.  1.  0.  0.  1.  1.  0.  0.  1.  1.  0.  1.  1.  1.  0.
  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  1.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  1.  1.  0.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  1.  0.  0.  1.  1.  1.  0.  1.
  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  0.  0.  1.  0.
  1.  0.  1.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  1.  0.  1.  0.  0.
  1.  0.  0.  0.  0.  1.  0.  1.  1.  0.  1.  0.  1.  0.  0.  1.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  1.  1.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0

And we'll also one-hot encode the output.

In [5]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)


## 4. Building the  model architecture
Build a model here using sequential. Feel free to experiment with different layers and sizes! Also, experiment adding dropout to reduce overfitting.

In [96]:
from keras import optimizers

sentiment = Sequential()

sentiment.add(Dense(80,input_dim=1000))
sentiment.add(Dropout(0.25))
sentiment.add(Activation("sigmoid"))
sentiment.add(Dense(2))
sentiment.compile(loss = 'categorical_crossentropy',optimizer = "RMSprop", metrics = ['accuracy'])

# TODO: Build the model architecture

# TODO: Compile the model using a loss function and an optimizer.


## 5. Training the model
Run the model here. Experiment with different batch_size, and number of epochs!

In [99]:
hist = sentiment.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          validation_data=(x_test, y_test), 
          verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/15
 - 3s - loss: 0.8801 - acc: 0.4085 - val_loss: 0.4004 - val_acc: 0.1502
Epoch 2/15
 - 3s - loss: 1.1538 - acc: 0.2313 - val_loss: 0.5791 - val_acc: 0.7898
Epoch 3/15
 - 3s - loss: 0.5780 - acc: 0.8361 - val_loss: 0.3953 - val_acc: 0.8573
Epoch 4/15
 - 3s - loss: 0.9961 - acc: 0.7579 - val_loss: 0.4725 - val_acc: 0.1806
Epoch 5/15
 - 3s - loss: 0.6709 - acc: 0.1636 - val_loss: 0.4643 - val_acc: 0.1432
Epoch 6/15
 - 3s - loss: 1.3422 - acc: 0.2383 - val_loss: 0.5519 - val_acc: 0.8046
Epoch 7/15
 - 3s - loss: 0.6539 - acc: 0.8396 - val_loss: 0.4074 - val_acc: 0.8573
Epoch 8/15
 - 3s - loss: 1.0460 - acc: 0.6241 - val_loss: 0.4407 - val_acc: 0.1522
Epoch 9/15
 - 3s - loss: 1.2363 - acc: 0.4454 - val_loss: 0.3773 - val_acc: 0.8546
Epoch 10/15
 - 3s - loss: 0.8094 - acc: 0.8598 - val_loss: 2.6323 - val_acc: 0.8546
Epoch 11/15
 - 3s - loss: 0.9651 - acc: 0.2819 - val_loss: 0.4162 - val_acc: 0.1438
Epoch 12/15
 - 3s - loss: 0.6652 - 

## 6. Evaluating the model
This will give you the accuracy of the model, as evaluated on the testing set. Can you get something over 85%?

In [98]:
score = sentiment.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.85292
