In [1]:
import numpy as np
import pandas as pd

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Conv1D, Dense, Dropout, Activation
from keras.layers import Embedding, SpatialDropout1D
from keras.layers import GlobalMaxPooling1D
from keras.datasets import imdb

Using TensorFlow backend.


## Data Preparation

### Some example reviews

In [2]:
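# Map word IDs back to words. The raw .npz file stores plain word IDs,
# without the offset of 3 that imdb.load_data adds by default.
id_to_word = {i: w for (w, i) in imdb.get_word_index().items()}
# Assumes the dataset is already in the Keras cache; allow_pickle=True is
# needed on NumPy >= 1.16.3 because the reviews are stored as object arrays.
f = np.load('/home/helen/.keras/datasets/imdb.npz', allow_pickle=True)
for i in range(3):
    print(' '.join(id_to_word[word_id] for word_id in f['x_train'][i]) + '\n')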

bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't

homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most pe
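
The sequences returned by `imdb.load_data` (used in the next cell) are encoded differently from the raw file: with the default settings every word ID is shifted by `index_from=3`, and IDs 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary markers. A minimal sketch of a decoder for that encoding follows. The vocabulary `toy_word_index` and the marker strings are hypothetical stand-ins; in the notebook `imdb.get_word_index()` would take the vocabulary's place.

```python
def decode_review(ids, word_index, index_from=3):
    """Turn a sequence encoded like imdb.load_data's output back into text."""
    # Reserved IDs under the load_data defaults; the marker strings are arbitrary.
    special = {0: '<pad>', 1: '<start>', 2: '<unk>'}
    # Shift every vocabulary ID by index_from to match the encoded sequences.
    id_to_word = {i + index_from: w for w, i in word_index.items()}
    id_to_word.update(special)
    return ' '.join(id_to_word.get(i, '<unk>') for i in ids)

# Hypothetical toy vocabulary standing in for imdb.get_word_index()
toy_word_index = {'the': 1, 'movie': 2, 'great': 3}
print(decode_review([1, 4, 5, 2, 6], toy_word_index))
```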

### Load Pre-processed IMDB data

In [3]:
max_features = 5000
print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


### Pad sequences with 0s

Shorter reviews are padded with 0s and longer ones are truncated, so that all sequences have the same length (`maxlen`).

In [4]:
maxlen = 400

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Pad sequences (samples x time)
X_train shape: (25000, 400)
X_test shape: (25000, 400)
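
By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so the end of a long review is what survives. The behaviour can be sketched in plain NumPy; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to length maxlen."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]        # keep only the last maxlen items
        if kept:
            out[row, -len(kept):] = kept  # right-align; padding stays on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```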


In [5]:
np.vstack([X_train, X_test]).shape

(50000, 400)

## Convolutional Neural Networks for sentiment classification

In [6]:
model = Sequential()

# Start with an embedding layer that maps each of the
# max_features vocabulary indices to a 50-dimensional vector
model.add(Embedding(input_dim=max_features, output_dim=50, input_length=maxlen))

# Add a Conv1D layer that learns 250 filters,
# each spanning a window of 3 consecutive words:
model.add(Conv1D(filters=250, kernel_size=3, activation='relu', padding='valid'))

# Global max pooling keeps the largest response of each filter over the whole sequence:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(250))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))
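
As a rough sanity check, the layer sizes above determine the number of trainable parameters and the length of the convolution output. The arithmetic below is a sketch that assumes the usual layout of these layers (one bias per filter or unit); `model.summary()` should report the same figures. The variable names `embedding_dim`, `filters`, `kernel_size` and `hidden` are introduced here only for readability.

```python
max_features, embedding_dim, maxlen = 5000, 50, 400
filters, kernel_size, hidden = 250, 3, 250

# With 'valid' padding and stride 1, the kernel fits in maxlen - kernel_size + 1 positions.
conv_steps = maxlen - kernel_size + 1

params = {
    'embedding': max_features * embedding_dim,                  # one vector per word ID
    'conv1d': kernel_size * embedding_dim * filters + filters,  # kernels + biases
    'dense_hidden': filters * hidden + hidden,                  # pooled vector -> hidden layer
    'dense_output': hidden * 1 + 1,                             # hidden layer -> single logit
}
total_params = sum(params.values())
print(conv_steps, params, total_params)
```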

### Visualization of the CNN model
<img src="http://www.samyzaf.com/ML/imdb/cnn4.png"/>

1. The input layer is a sequence of word IDs.
2. The first hidden layer is a word embedding layer. Like word2vec, it encodes each word as a real vector (here of size 50), but its weights are learned jointly with the rest of the network rather than pre-trained.
3. The next layer is a 1D convolutional layer, in which the convolutional kernel slides over the words.
4. Next, the max pooling layer selects the most notable feature (across the output sequence) for each feature map.
5. The next layer is a fully connected layer, as in a regular feed forward network. The output of this layer is a vector that encodes the whole review.
   - Steps 1-5 together form a sentence-to-vector model.
6. A logistic regression layer is applied on top for the final classification.
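
Steps 3 and 4 can be illustrated with a small NumPy sketch of a 1D convolution followed by global max pooling. The sizes and random weights below are illustrative assumptions, far smaller than the model's 400 x 50 input and 250 filters, and the code only mimics the forward pass for a single review.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, k = 10, 4, 3, 3   # small illustrative sizes

x = rng.normal(size=(seq_len, emb_dim))        # embedded words of one review
W = rng.normal(size=(k, emb_dim, n_filters))   # convolution kernels
b = np.zeros(n_filters)                        # one bias per filter

# Slide a window of k words over the sequence ('valid' padding, stride 1).
windows = np.stack([x[t:t + k] for t in range(seq_len - k + 1)])

# Each filter produces one response per window; ReLU clips negatives to 0.
feature_maps = np.maximum(0, np.tensordot(windows, W, axes=([1, 2], [0, 1])) + b)

# Global max pooling keeps the strongest response of each filter.
pooled = feature_maps.max(axis=0)
print(windows.shape, feature_maps.shape, pooled.shape)
```

Whatever the review length, the pooled vector has one entry per filter, which is what lets the fully connected layer in step 5 work on a fixed-size input.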

### Train model

In [7]:
batch_size = 32
epochs = 2

In [8]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, y_test));

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
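
For reference, the two quantities tracked during training can be sketched in NumPy. This is an approximation of what `binary_crossentropy` and the `accuracy` metric compute for a sigmoid output, not Keras's own implementation; the clipping constant `eps`, the 0.5 threshold and the sample values are assumptions made for the example.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predicted = np.asarray(y_prob) > threshold
    return float(np.mean(predicted == np.asarray(y_true).astype(bool)))

y_true = [1, 0, 1, 1]            # made-up labels
y_prob = [0.9, 0.2, 0.6, 0.4]    # made-up sigmoid outputs
print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```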


## Retrain the model on the full dataset

In [9]:
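# Continue training the already-fitted model on the train and test sets combined.
# After this step there is no held-out data left to evaluate the model on.
X = np.concatenate((X_train, X_test))
y = np.concatenate((y_train, y_test))
model.fit(X, y, batch_size=batch_size, epochs=epochs);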

Epoch 1/2
Epoch 2/2


## Save the model

In [10]:
model.save('model.h5')
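
To score new text with the saved model, the text first has to be encoded the same way as the training data. The sketch below mirrors the default conventions of `imdb.load_data` (start marker 1, out-of-vocabulary marker 2, IDs shifted by 3, vocabulary capped at `num_words`) and the left padding used above. The helper `encode_review`, its regex tokenizer and `toy_word_index` are assumptions for illustration; the original reviews may have been tokenized differently.

```python
import re

def encode_review(text, word_index, num_words=5000, maxlen=400,
                  start_char=1, oov_char=2, index_from=3):
    """Encode raw text as a fixed-length list of word IDs."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # simplistic tokenizer (assumption)
    ids = [start_char]
    for tok in tokens:
        idx = word_index.get(tok)
        idx = idx + index_from if idx is not None else oov_char
        ids.append(idx if idx < num_words else oov_char)  # cap the vocabulary
    ids = ids[-maxlen:]                       # truncate from the front
    return [0] * (maxlen - len(ids)) + ids    # pad with zeros on the left

# Hypothetical toy vocabulary standing in for imdb.get_word_index()
toy_word_index = {'the': 1, 'movie': 2, 'great': 3}
encoded = encode_review('The movie was GREAT', toy_word_index, maxlen=8)
print(encoded)
```

With the real vocabulary, an array of shape `(1, 400)` built from such a list could be passed to `predict` on a model restored with `keras.models.load_model('model.h5')`.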

## Conclusions
1. With Convolutional Neural Networks (CNNs), we obtained a very good movie review sentiment classifier (89% accuracy), even with a relatively small dataset of just 25,000 training examples.
2. CNNs are powerful, not just in Computer Vision but also in Text Understanding.