## Convolutional Neural Networks for text classification

### IMDB Movie sentiment dataset

This example demonstrates the use of the `Conv1D` (`Convolution1D`) layer for text classification. It is taken from the examples in the Keras GitHub repository.
The original code claims to reach 0.89 test accuracy after 2 epochs.
Reported training time is 90 s/epoch on an Intel i5 2.4 GHz CPU and 10 s/epoch on a Tesla K40 GPU.


### Additional Task!

Running the following code may give you a different test accuracy, possibly lower than the figure claimed for the original code.

Investigate how the performance changes when you vary these hyperparameters:

* epochs
* maxlen
* max_features
* kernel_size
* batch_size

and when you alter the architecture:

* comment out a dropout layer
* change the non-linear function in the activation layer

Then discuss why these changes influence the accuracy on the test set.
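
One way to organise these experiments is to enumerate the configurations before training anything. The sketch below is only an illustration: the candidate values and the `make_grid` helper are assumptions, not part of the original notebook, and each resulting dictionary would still have to be fed into the model-building and training cells further down.

```python
from itertools import product

# Hypothetical candidate values; adjust them to your compute budget.
search_space = {
    "epochs": [2, 4],
    "maxlen": [200, 400],
    "kernel_size": [3, 5],
}

def make_grid(space):
    """Return one dict per combination of the candidate values."""
    names = list(space)
    combos = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

grid = make_grid(search_space)
print(len(grid), "configurations")
print(grid[0])
```

Changing one setting at a time, rather than running the full grid, makes it easier to attribute a change in test accuracy to a single cause.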

In [1]:
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
from keras.callbacks import Callback

Using TensorFlow backend.


In [2]:
import time
import _pickle as cPickle

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# plotly.plotly (online mode) is not used below; plotly >= 4 moved it to the chart_studio package
import plotly.graph_objs as go
from plotly import tools

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import matplotlib.cm as cm

In [3]:
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2

In [4]:
class TrainingHistory(Callback):

    def on_train_begin(self, logs=None):
        # metrics recorded after every batch
        self.loss = []
        self.acc = []
        # metrics recorded after every epoch
        self.ep_loss = []
        self.ep_acc = []
        self.ep_val_loss = []
        self.ep_val_acc = []
        self.i = 0
        self.i_ep = 0

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # the keys are 'acc' / 'val_acc' in this Keras version;
        # newer releases report 'accuracy' / 'val_accuracy'
        self.ep_loss.append(logs.get('loss'))
        self.ep_acc.append(logs.get('acc'))
        self.ep_val_loss.append(logs.get('val_loss'))
        self.ep_val_acc.append(logs.get('val_acc'))
        self.i_ep += 1

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.loss.append(logs.get('loss'))
        self.acc.append(logs.get('acc'))
        self.i += 1

In [5]:
history = TrainingHistory()

### 1. Read data

#### IMDB Movie reviews sentiment classification

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

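As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data`, 0 is reserved for padding, 1 marks the start of each sequence, 2 encodes any unknown (out-of-vocabulary) word, and the indexes of actual words are shifted by 3.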
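
To make the frequency-based filtering concrete, here is a toy sketch in plain Python. It is not how Keras implements it: the `filter_by_rank` helper and the sample ranks are made up for illustration, and filtered-out words are simply dropped here, whereas a loader may instead replace them with a reserved placeholder index.

```python
def filter_by_rank(ranks, num_words, skip_top):
    """Keep only frequency ranks r with skip_top < r <= num_words."""
    return [r for r in ranks if skip_top < r <= num_words]

# Hypothetical review, written as frequency ranks (1 = most frequent word).
review = [3, 25, 14999, 7, 120, 10000, 10001]

# "Only consider the top 10,000 words, but eliminate the top 20."
print(filter_by_rank(review, num_words=10000, skip_top=20))
```

Because the index of a word is its frequency rank, the filter needs only two integer comparisons per word and no dictionary lookup.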

In [6]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


`imdb.load_data` returns two tuples:

* x_train, x_test: lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
* y_train, y_test: lists of integer labels (1 or 0).

In [8]:
x_train.shape

(25000,)

In [9]:
y_train.shape

(25000,)

In [10]:
x_test.shape

(25000,)

In [11]:
y_test.shape

(25000,)

In [7]:
x_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

Class label 0 is `negative` sentiment, while 1 is `positive` sentiment.

In [12]:
y_train[0]

1

### Look-up dictionary

In [13]:
word_index = imdb.get_word_index()

In [14]:
len(word_index)

88584

In [15]:
list(word_index.items())[:5]

[('asshats', 62855),
 ('death', 338),
 ('jesues', 68543),
 ('scandalously', 73137),
 ('nyman', 38432)]

In [16]:
reversed_word_index = {v: k for k, v in word_index.items()}

In [17]:
list(reversed_word_index.items())[:5]

[(1, 'the'), (2, 'and'), (3, 'a'), (4, 'of'), (5, 'to')]
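
The reversed dictionary can be used to turn an encoded review back into words. The sketch below assumes the data was loaded with the default `index_from=3` of `imdb.load_data`, so every word index in a sequence is shifted by 3 relative to the dictionary. The `decode` helper, the toy dictionary and the `<pad>`/`<start>`/`<unk>` token names are illustrative choices, not part of Keras.

```python
# Tiny stand-in for the real reversed_word_index (hypothetical subset).
toy_reversed_index = {1: "the", 2: "and", 3: "a", 4: "of", 5: "to"}

INDEX_FROM = 3  # assumed: the default offset applied by imdb.load_data
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # illustrative token names

def decode(encoded, reversed_index):
    """Map an encoded sequence back to a space-separated string."""
    words = []
    for i in encoded:
        if i in SPECIAL:
            words.append(SPECIAL[i])
        else:
            # undo the offset before looking the word up
            words.append(reversed_index.get(i - INDEX_FROM, "<unk>"))
    return " ".join(words)

print(decode([1, 4, 7, 2, 8], toy_reversed_index))
```

With the real data, `decode(x_train[0], reversed_word_index)` should be called before the padding step below, since padding overwrites `x_train` with an array.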

### Padding integer sequences to fixed-length vectors

(We will feed these fixed-length vectors to our model as input.)

In [18]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


Pad sequences (samples x time)
x_train shape: (25000, 400)
x_test shape: (25000, 400)
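
By default, `pad_sequences` pads with zeros at the front of short sequences and truncates long sequences from the front, keeping their last `maxlen` items. The plain-Python helper below, `pad_pre`, is a hypothetical stand-in that mimics this default for a single sequence; it assumes `maxlen` is at least 1 and is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Truncate from the front, then left-pad with `value` up to maxlen."""
    kept = list(seq)[-maxlen:]  # keep the last maxlen items
    return [value] * (maxlen - len(kept)) + kept

print(pad_pre([5, 6, 7], maxlen=5))           # short sequence: zeros in front
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # long sequence: start is cut off
```

A smaller `maxlen` therefore discards the beginning of long reviews, which is one reason this setting can affect test accuracy.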


### Conv-Nets model

In [24]:
print('Build model...')
model = Sequential()
# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen, name='embedding_layer'))
model.add(Dropout(0.2, name='dropout_1'))

# we add a Conv1D layer, which will learn `filters`
# word-group filters, each spanning kernel_size consecutive words:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1, name='conv_layer'))
# we use max pooling:
model.add(GlobalMaxPooling1D(name='max_pooling'))
# We add a vanilla hidden layer:
model.add(Dense(hidden_dims, name='dense_layer'))
model.add(Dropout(0.2, name='dropout_2'))
model.add(Activation('relu', name='relu_dense'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1, name='prediction_layer'))
model.add(Activation('sigmoid', name='sigm_prediction'))

Build model...
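
To see what the convolution and the global max pooling do, here is a single-filter, single-channel sketch in plain Python. It is a simplification under stated assumptions: the real layer slides 250 filters over 50-dimensional embeddings, and the function name `conv1d_valid` and the toy numbers are made up for this illustration.

```python
def conv1d_valid(xs, kernel, bias=0.0):
    """'valid' 1D convolution (no padding, stride 1) followed by ReLU."""
    k = len(kernel)
    out = []
    for t in range(len(xs) - k + 1):
        window = xs[t:t + k]
        z = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, z))  # ReLU
    return out

signal = [0.0, 1.0, 2.0, -1.0, 0.5]  # toy input of length 5
kernel = [1.0, 0.0, -1.0]            # toy filter of size 3

feature_map = conv1d_valid(signal, kernel)
pooled = max(feature_map)  # global max pooling: one number per filter

print(feature_map)
print(pooled)
```

The feature map has `len(signal) - len(kernel) + 1` entries, and pooling keeps only the strongest response of the filter, regardless of where in the sequence it occurred.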


In [25]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
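
The sigmoid output and the `binary_crossentropy` loss can be written out in a few lines of plain Python. This is a sketch for intuition, not the Keras implementation; the clipping constant `eps` is an assumed value that only serves to avoid `log(0)`.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

print(sigmoid(0.0))
print(binary_crossentropy([1, 0], [0.9, 0.2]))
```

Confident wrong predictions are penalised heavily, which is why the loss can rise on the test set even while accuracy stays roughly constant.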

In [26]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 400, 50)           250000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv_layer (Conv1D)          (None, 398, 250)          37750     
_________________________________________________________________
max_pooling (GlobalMaxPoolin (None, 250)               0         
_________________________________________________________________
dense_layer (Dense)          (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
_________________________________________________________________
relu_dense (Activation)      (None, 250)               0         
__________
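
The output length of the convolution and the parameter counts shown in the summary can be re-derived from the settings in the parameter cell. The arithmetic below is a quick check in plain Python; the variable names ending in `_params` are introduced here for illustration only.

```python
max_features, maxlen = 5000, 400
embedding_dims, filters, kernel_size, hidden_dims = 50, 250, 3, 250

# Embedding: one vector of size embedding_dims per vocabulary entry.
embedding_params = max_features * embedding_dims

# 'valid' convolution with stride 1 shortens the sequence by kernel_size - 1.
conv_steps = maxlen - kernel_size + 1

# Conv1D: kernel_size x embedding_dims weights per filter, plus one bias each.
conv_params = kernel_size * embedding_dims * filters + filters

# Dense: one weight per (input, unit) pair, plus one bias per unit.
dense_params = filters * hidden_dims + hidden_dims

print(embedding_params, conv_steps, conv_params, dense_params)
```

This makes it easy to predict how changing `max_features`, `kernel_size` or `filters` changes the size of the model.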

In [29]:
start = time.time() # Start time

In [30]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test),
          callbacks=[history]
         )

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f67c9312908>

In [31]:
end = time.time() # End time
elapsed = end - start
print("Training time: %.2f seconds" %elapsed)

Training time: 1733.23 seconds


In [32]:
import os

HISTORY_PATH = 'history'
# create the output directory if it does not exist yet
os.makedirs(HISTORY_PATH, exist_ok=True)

In [33]:
# save an object to a file in pickle format
def savePickle(dataToWrite, pickleFilename):
    with open(pickleFilename, 'wb') as f:
        cPickle.dump(dataToWrite, f)

In [34]:
# read an object back from a file in pickle format
def readPickle(pickleFilename):
    with open(pickleFilename, 'rb') as f:
        return cPickle.load(f)

In [35]:
# Save training history
savePickle(history.loss, os.path.join(HISTORY_PATH,'imdb_cnn_batch_loss'))
savePickle(history.acc, os.path.join(HISTORY_PATH,'imdb_cnn_batch_accuracy'))
savePickle(history.ep_loss, os.path.join(HISTORY_PATH,'imdb_cnn_epoch_loss'))
savePickle(history.ep_acc, os.path.join(HISTORY_PATH,'imdb_cnn_epoch_accuracy'))
savePickle(history.ep_val_loss, os.path.join(HISTORY_PATH,'imdb_cnn_epoch_val_loss'))
savePickle(history.ep_val_acc, os.path.join(HISTORY_PATH,'imdb_cnn_epoch_val_accuracy'))

In [36]:
batch_loss = readPickle(os.path.join(HISTORY_PATH,'imdb_cnn_batch_loss'))
batch_accuracy  = readPickle(os.path.join(HISTORY_PATH,'imdb_cnn_batch_accuracy'))

In [37]:
X_batch_loss = list(range(len(batch_loss)))
Y_batch_loss = batch_loss

In [38]:
X_batch_accuracy = list(range(len(batch_accuracy)))
Y_batch_accuracy = batch_accuracy
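
Per-batch curves are usually noisy. If the plots are hard to read, one option is to smooth the values with a trailing moving average before plotting. The `moving_average` helper below is a hypothetical addition, not part of the original notebook, and the window size is an arbitrary choice.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))
```

For example, `moving_average(batch_loss, window=50)` could be passed as the `y` values of the loss trace instead of the raw list.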

In [39]:
loss = go.Scatter(
    x=X_batch_loss,
    y=Y_batch_loss,
    mode='lines'
)

In [40]:
acc = go.Scatter(
    x=X_batch_accuracy,
    y=Y_batch_accuracy,
    mode='lines'
)


In [41]:
data_loss = [loss]

In [42]:
data_acc = [acc]

In [43]:
layout_loss = go.Layout(
    title='Training error loss',
    xaxis=dict(title='Batch-epoch'),
    yaxis=dict(title='Error')
)

In [44]:
layout_acc = go.Layout(
    title='Training accuracy',
    xaxis=dict(title='Batch-epoch'),
    yaxis=dict(title='Accuracy')
)

In [45]:
fig_loss = go.Figure(data=data_loss, layout=layout_loss)

In [None]:
# if you want to save image file (offline version), un-comment this
# plot(fig_loss, filename= os.path.join(HISTORY_PATH,'imdb_cnn_batch_loss.html'), image='png')

In [49]:
init_notebook_mode(connected=True)
iplot(fig_loss, filename='imdb_cnn_batch_loss')

In [50]:
fig_acc = go.Figure(data=data_acc, layout=layout_acc)

In [None]:
# if you want to save image file (offline version), un-comment this
# plot(fig_acc, filename= os.path.join(HISTORY_PATH,'imdb_cnn_batch_accuracy.html'), image='png')

In [51]:
init_notebook_mode(connected=True)
iplot(fig_acc, filename='imdb_cnn_batch_accuracy')

In [56]:
# use test_acc so that the `acc` plot trace defined above is not overwritten
error_score, test_acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', error_score)
print('Test accuracy:', test_acc)

Test score: 0.262518066916
Test accuracy: 0.89004
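
The reported accuracy is the fraction of reviews whose predicted class matches the label. As a sketch of that computation, assuming a decision threshold of 0.5 on the sigmoid output, the hypothetical `accuracy` helper below works on plain Python lists of labels and predicted probabilities.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples where (p > threshold) agrees with the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

labels = [1, 0, 1, 0]          # toy labels
probs = [0.9, 0.2, 0.4, 0.6]   # toy sigmoid outputs
print(accuracy(labels, probs))
```

Unlike the loss, accuracy ignores how far a probability is from the threshold, so the two numbers printed above can move in different directions between runs.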
