# Text Classification using CNN

Example modified from
https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py

In this tutorial we use Keras (keras.io), a high-level deep learning framework.

## 1. Data 
### 1.1 Data Description

The data set consists of movie reviews labeled as positive or negative. It was originally published as a binary sentiment classification task.

The full description of the data set can be found here:

http://ai.stanford.edu/~amaas/data/sentiment/

### 1.2 Data Loading
The data can be loaded directly using Keras. The text has already been preprocessed and transformed into integer indices, where each integer represents a unique word in the data set. Words are indexed by overall frequency, so lower integers correspond to more frequent words, and every occurrence of a given word is replaced by the same integer.

In [1]:
# Import the function
from keras.datasets import imdb

# max_num_words is the maximum number of unique words we want in the data set
max_num_words = 5000

# Read the data set. It is already split into training and testing
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_num_words)

Using TensorFlow backend.
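
To make the integer encoding concrete, here is a minimal sketch of a frequency-ranked word index built on a toy corpus. The corpus and the names `corpus`, `word_index`, and `encoded` are hypothetical; this only illustrates the idea and is not how Keras builds the IMDB index internally.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data).
corpus = [
    "the movie was great",
    "the movie was bad",
    "great acting great story",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in corpus for word in review.split())

# Rank words by descending frequency (ties broken alphabetically),
# so that index 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of integer indices.
encoded = [[word_index[w] for w in review.split()] for review in corpus]
print(word_index)
print(encoded)
```

Limiting the vocabulary, as `num_words` does above, then amounts to keeping only the lowest indices.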


Let's see how many samples we have for training and testing.

In [2]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

25000 train sequences
25000 test sequences


### 1.3 Data inspection
Now let's familiarize ourselves with the data. 

In [3]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 2,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 2,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 2,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 2,
 18,
 51,
 36,
 

As mentioned before, each sample is composed of integer indices. Now let's retrieve the dictionary to see the actual words.

In [4]:
dictionary = {index:word for word, index in imdb.get_word_index().items()}

Let's create an auxiliary function to help us see all the words in a given sample

In [5]:
def get_sentence(sample_indices, dictionary):
    sentence = []
    for index in sample_indices:
        sentence.append(dictionary.get(index))
    return sentence

In [6]:
# the words in the sample above
get_sentence(x_train[0], dictionary)

['the',
 'as',
 'you',
 'with',
 'out',
 'themselves',
 'powerful',
 'lets',
 'loves',
 'their',
 'becomes',
 'reaching',
 'had',
 'journalist',
 'of',
 'lot',
 'from',
 'anyone',
 'to',
 'have',
 'after',
 'out',
 'atmosphere',
 'never',
 'more',
 'room',
 'and',
 'it',
 'so',
 'heart',
 'shows',
 'to',
 'years',
 'of',
 'every',
 'never',
 'going',
 'and',
 'help',
 'moments',
 'or',
 'of',
 'every',
 'chest',
 'visual',
 'movie',
 'except',
 'her',
 'was',
 'several',
 'of',
 'enough',
 'more',
 'with',
 'is',
 'now',
 'current',
 'film',
 'as',
 'you',
 'of',
 'mine',
 'potentially',
 'unfortunately',
 'of',
 'you',
 'than',
 'him',
 'that',
 'with',
 'out',
 'themselves',
 'her',
 'get',
 'for',
 'was',
 'camp',
 'of',
 'you',
 'movie',
 'sometimes',
 'movie',
 'that',
 'with',
 'scary',
 'but',
 'and',
 'to',
 'story',
 'wonderful',
 'that',
 'in',
 'seeing',
 'in',
 'character',
 'to',
 'of',
 '70s',
 'and',
 'with',
 'heart',
 'had',
 'shadows',
 'they',
 'of',
 'here',
 'that'
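
The words printed above do not read as a sentence. A likely reason is that `imdb.load_data` reserves the first indices by default (0 for padding, 1 for the start of a sequence, 2 for out-of-vocabulary words) and shifts every real word by `index_from=3`, while `imdb.get_word_index()` returns the unshifted ranks. The sketch below shows the offset on a hypothetical toy index; `toy_word_index`, `build_decoder`, and `decode` are illustrative names and not part of Keras.

```python
# Hypothetical miniature of what imdb.get_word_index() returns:
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Assumed load_data defaults: 0 = padding, 1 = start of sequence,
# 2 = out-of-vocabulary, real words shifted by index_from = 3.
INDEX_FROM = 3
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<unk>"}

def build_decoder(word_index, index_from=INDEX_FROM):
    # Shift every rank by the offset, then add the reserved tokens.
    decoder = {rank + index_from: word for word, rank in word_index.items()}
    decoder.update(SPECIAL_TOKENS)
    return decoder

def decode(sample_indices, decoder):
    # Unknown indices fall back to the out-of-vocabulary marker.
    return [decoder.get(i, "<unk>") for i in sample_indices]

decoder = build_decoder(toy_word_index)
decoded = decode([1, 4, 5, 6, 2, 7], decoder)
print(decoded)
```

Applying the same shift to the real dictionary should give a more readable decoding of `x_train[0]`.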

The labels are 0/1: a positive review is assigned 1 and a negative review is assigned 0.

In [7]:
# Number of samples in each category
import collections
collections.Counter(y_train)

Counter({0: 12500, 1: 12500})

We need to transform the labels into one-hot vectors so that they match the two-unit softmax output.

In [8]:
from keras.utils import to_categorical
num_classes = 2
# Convert each 0/1 label into a one-hot vector of length num_classes
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
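
As an illustration of this transformation, the following NumPy sketch builds the same kind of one-hot matrix by hand. The helper `one_hot` is a hypothetical name used only here; it is not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([0, 1, 1], num_classes=2)
print(example)
```

A negative review (label 0) becomes `[1, 0]` and a positive review (label 1) becomes `[0, 1]`.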

### 1.4 Data pre-processing
Since reviews have different lengths, we need to bring them all to the same length. This procedure is known as padding.
If we decide that the maximum length of a review is 400 words, any shorter review has zeros added at the beginning, and any longer review is truncated.

In [9]:
max_length = 400
from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train, maxlen=max_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_length)

Visualize the new shape of training and testing data

In [10]:
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 400)
x_test shape: (25000, 400)


In [11]:
# same sample but with zeros added
x_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   
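
The padding rule can be sketched in plain Python. This assumes the default behaviour of `pad_sequences`, which pads and truncates at the beginning of the sequence; `pad_or_truncate` is a hypothetical helper written for illustration and expects a positive `maxlen`.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    # Keep only the last `maxlen` items (truncate from the front).
    trimmed = list(sequence)[-maxlen:]
    # Prepend `value` until the list has exactly `maxlen` items.
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_or_truncate([5, 8, 3], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long_)
```

Under this assumption a long review keeps its final 400 words rather than its first 400.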

## 2. Model Building
In Keras it is necessary to specify the model architecture (topology) before feeding data into it. The model is built by adding layers one by one. The layer added first is executed first, and each subsequent layer, in general, takes the output of the previous layer as its input.

In [12]:
# we start by defining a Sequential model,
# which is the empty canvas where all the layers are stacked
from keras.models import Sequential
model = Sequential()

### 2.1 Embeddings Layer
The first layer will be an embedding layer. This layer maps each word to a real-valued vector. The size of the vector is set by embedding_dims.

In [13]:
from keras.layers import Embedding
embedding_dims = 50
model.add(Embedding(max_num_words,
                    embedding_dims,
                    input_length=max_length))
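
Conceptually, the embedding layer is a lookup table with one row per word index. The NumPy sketch below uses a random matrix as a stand-in for the learned weights; `embedding_matrix`, `review`, and `embedded` are illustrative names, and the real layer learns its values during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dims, max_length = 5000, 50, 400

# One row per word index; random here, learned in the real layer.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dims))

# A padded review: 400 integer indices in [0, vocab_size).
review = rng.integers(0, vocab_size, size=max_length)

# The lookup replaces each index with its row of the matrix.
embedded = embedding_matrix[review]
print(embedded.shape)
```

Each review of 400 indices therefore becomes a 400 x 50 matrix before it reaches the convolution.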

### 2.2 Convolution Layer
This layer performs the convolution over the sequence of word embeddings. We need to specify the number of filters and the size of each one. In this case all the filters have the same size.

In [14]:
from keras.layers import Conv1D
filters = 50
kernel_size = 3
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
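
To see what the convolution computes, here is a rough NumPy sketch of a 1D convolution with 'valid' padding, stride 1, and ReLU. The function `conv1d_valid_relu` and the random weights are assumptions for illustration only; Keras uses an optimized implementation and learned kernels.

```python
import numpy as np

def conv1d_valid_relu(inputs, kernels, bias):
    """inputs: (length, channels); kernels: (kernel_size, channels, filters)."""
    kernel_size, _, filters = kernels.shape
    out_length = inputs.shape[0] - kernel_size + 1  # 'valid' padding, stride 1
    output = np.empty((out_length, filters))
    for t in range(out_length):
        window = inputs[t:t + kernel_size]  # (kernel_size, channels)
        # Sum over window positions and channels for every filter at once.
        output[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(output, 0.0)  # ReLU

rng = np.random.default_rng(0)
embedded = rng.normal(size=(400, 50))   # one embedded review
kernels = rng.normal(size=(3, 50, 50))  # kernel_size=3, 50 filters
features = conv1d_valid_relu(embedded, kernels, np.zeros(50))
print(features.shape)
```

With 'valid' padding the sequence shrinks from 400 to 400 - 3 + 1 = 398 positions, each holding 50 filter responses.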

### 2.3 Pooling Layer
This layer performs max pooling over the output of each filter: for each filter, only the maximum value across the whole sequence is kept.

In [15]:
from keras.layers import GlobalMaxPooling1D
model.add(GlobalMaxPooling1D())
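
A small NumPy illustration of global max pooling, using a made-up feature map with 3 time steps and 3 filters (the values are hypothetical):

```python
import numpy as np

# Rows are time steps, columns are filters (toy values).
features = np.array([[0.1, 2.0, 0.0],
                     [1.5, 0.3, 0.0],
                     [0.7, 0.9, 0.4]])

# Keep the largest response of each filter across all time steps.
pooled = features.max(axis=0)
print(pooled)
```

In the model above this reduces each review to a single vector with one value per filter.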

### 2.4 Fully Connected Hidden Layer
This layer is what a traditional neural network looks like. In Keras it is called Dense, since it is a fully connected layer. This layer asks for the dimension of the output. The dimensions of the input are inferred from the previous layer.

In [16]:
from keras.layers import Dense, Activation
# we create a fully connected layer whose output dimension
# equals the dimension of its input
hidden_neurons = filters
model.add(Dense(hidden_neurons))
# we add the activation of the output neurons as a separate layer
model.add(Activation('relu'))
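
A fully connected layer followed by ReLU is a matrix product plus a bias, with negative results set to zero. The sketch below uses small hand-picked numbers; `dense_relu` and its weights are hypothetical and only illustrate the computation.

```python
import numpy as np

def dense_relu(x, weights, bias):
    # Affine transformation followed by element-wise ReLU.
    return np.maximum(x @ weights + bias, 0.0)

x = np.array([1.0, -2.0, 0.5])       # 3 inputs
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]])     # 3 inputs -> 2 outputs
bias = np.array([0.0, 0.5])
hidden = dense_relu(x, weights, bias)
print(hidden)

# Number of trainable parameters for 50 inputs and 50 outputs.
n_params = 50 * 50 + 50
print(n_params)
```

The second output is clipped to zero because its pre-activation value is negative.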

### 2.5 Fully Connected Output Layer
This is the final layer, and we use softmax as the output activation.

In [17]:
model.add(Dense(2))
model.add(Activation('softmax'))
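
Softmax turns the two raw outputs into probabilities that sum to one. A minimal NumPy sketch, with made-up input values:

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 0.0]))
print(probs)
```

The predicted class is the position of the largest probability.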

### 2.6 Compiling the model
Finally, we compile the model and check that everything is ok. In the compilation step we specify the loss function, the optimizer, and the metric we want to monitor.

In [18]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
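
A note on the loss: with one-hot labels and a two-unit softmax, the binary cross-entropy averaged over the two outputs should match the categorical cross-entropy, because the two probabilities sum to one. The NumPy check below works through one hypothetical prediction; it ignores the small clipping that Keras applies for numerical stability.

```python
import numpy as np

y_true = np.array([0.0, 1.0])  # one-hot label for a positive review
y_pred = np.array([0.2, 0.8])  # softmax output (sums to 1)

# Categorical cross-entropy: minus the log-probability of the true class.
categorical = -np.sum(y_true * np.log(y_pred))

# Binary cross-entropy applied to each output, then averaged.
binary = np.mean(-(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred)))

print(categorical, binary)
```

For problems with more than two classes the two losses differ, and a categorical cross-entropy is the usual choice with softmax.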

## 3. Training and Testing the model
This is where we finally train and test the model. Since the Adam optimizer is a variant of mini-batch stochastic gradient descent, we need to specify the batch size. One epoch is completed when all the training data has been used once to update the weights, which takes sample_size/batch_size iterations (rounded up).

In [20]:
batch_size = 32
epochs = 2
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f885a912550>
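
The number of weight updates implied by these settings can be worked out directly. This is simple arithmetic on the values used above; the variable names `steps_per_epoch` and `total_updates` are illustrative.

```python
import math

num_samples = 25000
batch_size = 32
epochs = 2

# The last batch of each epoch is smaller, hence the rounding up.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```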