# Text Classification with Neural Networks

The goal of this project is to develop a **classification model to predict the positive/negative labels** of movie reviews.

We'll be using the **Large Movie Review Dataset**, https://ai.stanford.edu/~amaas/data/sentiment/, compiled by Maas et al. (https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf). This dataset can be loaded directly via the Keras `imdb.load_data()` function.

#### 1. Perform initial imports

In [1]:
import keras
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.layers import Embedding
from keras.callbacks import ModelCheckpoint

import os

from sklearn.metrics import roc_auc_score, roc_curve

import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


#### 2. Load data

In [117]:
# values used in Maas et al.:
#"We build a fixed dictionary of the 5,000 most frequent tokens, 
#but ignore the 50 most frequent terms from the original full vocabulary."

n_unique_words = 5000 #number of most frequent words to consider
n_words_to_skip = 0 #50 #number of most frequent words to ignore

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=n_unique_words, 
                                                      skip_top=n_words_to_skip)

#### 3. Check data

In [7]:
#check 3 first reviews of the training data

x_train[0:3]

array([list([2, 2, 2, 2, 2, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 2, 173, 2, 256, 2, 2, 100, 2, 838, 112, 50, 670, 2, 2, 2, 480, 284, 2, 150, 2, 172, 112, 167, 2, 336, 385, 2, 2, 172, 4536, 1111, 2, 546, 2, 2, 447, 2, 192, 50, 2, 2, 147, 2025, 2, 2, 2, 2, 1920, 4613, 469, 2, 2, 71, 87, 2, 2, 2, 530, 2, 76, 2, 2, 1247, 2, 2, 2, 515, 2, 2, 2, 626, 2, 2, 2, 62, 386, 2, 2, 316, 2, 106, 2, 2, 2223, 2, 2, 480, 66, 3785, 2, 2, 130, 2, 2, 2, 619, 2, 2, 124, 51, 2, 135, 2, 2, 1415, 2, 2, 2, 2, 215, 2, 77, 52, 2, 2, 407, 2, 82, 2, 2, 2, 107, 117, 2, 2, 256, 2, 2, 2, 3766, 2, 723, 2, 71, 2, 530, 476, 2, 400, 317, 2, 2, 2, 2, 1029, 2, 104, 88, 2, 381, 2, 297, 98, 2, 2071, 56, 2, 141, 2, 194, 2, 2, 2, 226, 2, 2, 134, 476, 2, 480, 2, 144, 2, 2, 2, 51, 2, 2, 224, 92, 2, 104, 2, 226, 65, 2, 2, 1334, 88, 2, 2, 283, 2, 2, 4472, 113, 103, 2, 2, 2, 2, 2, 178, 2]),
       list([2, 194, 1153, 194, 2, 78, 228, 2, 2, 1463, 4369, 2, 134, 2, 2, 715, 2, 118, 1634, 2, 394, 2, 2, 119, 954, 189, 102, 2, 20

Each token is represented by an integer, following this convention:
* **0** is the **padding token**
* **1** is the **starting token**, indicating the beginning of a review
* **2** is the **unknown token**, used to identify the out-of-vocabulary (OOV) words 
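* **3** is **not used** (with the default settings of `imdb.load_data()`, word indices are offset by 3)
* **4** is the **most frequent word** in the corpus
* **5** is the **second most frequent word** in the corpus, and so on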
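The convention above can be illustrated with a small pure-Python sketch. It is a simplified model of how frequency ranks might be turned into token ids, not the Keras source code: the function name `encode_review` is hypothetical, and it assumes that the `num_words` / `skip_top` filter is applied to the ids after they have been shifted by the offset.

```python
def encode_review(ranks, num_words=None, skip_top=0,
                  index_from=3, start_char=1, oov_char=2):
    """Map word frequency ranks (1 = most frequent) to token ids.

    Assumption: the vocabulary filter is applied to the shifted ids,
    so the start token can itself be replaced when skip_top > 1.
    """
    # prepend the start token and shift every rank by the offset
    ids = [start_char] + [rank + index_from for rank in ranks]
    if num_words is None:
        num_words = max(ids) + 1
    # anything outside [skip_top, num_words) becomes the unknown token
    return [i if skip_top <= i < num_words else oov_char for i in ids]


# ranks 1 and 2 are the two most frequent words; rank 6000 is rare
print(encode_review([1, 2, 6000], num_words=5000))
print(encode_review([1, 100], num_words=5000, skip_top=50))
```

Under these assumptions the most frequent word receives id 4, id 3 is never produced, and a non-zero `skip_top` also turns the start token into the unknown token.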

In [120]:
# integer 3 is not used
n_3 = sum(review.count(3) for review in x_train)
n_4 = sum(review.count(4) for review in x_train)

print(n_3, n_4)

0 336148


In [9]:
# check length of the 3 first reviews of the training data

for x in x_train[0:3]:
    print(len(x))

218
189
141


As expected, the reviews have different lengths.

In [11]:
# check labels of the 3 first reviews of the training data

y_train[0:3]

array([1, 0, 0], dtype=int64)

The first review is positive and the second and third reviews are negative.

In [15]:
# check length of the training and test set

len(x_train), len(x_test)

(25000, 25000)

We have 25000 reviews in the training set and 25000 reviews in the test set.

#### 4. Check reviews as a sequence of words (and not integers)

Instead of viewing each review as a sequence of integers, we can recover its original words using the Keras `imdb.get_word_index()` function.

In [131]:
word_index = imdb.get_word_index()

for key, value in word_index.items():
    if value in (0, 1, 2):
        print(key, value)

the 1
and 2


In [132]:
print(min(word_index, key=word_index.get), word_index[min(word_index, key=word_index.get)])

the 1


As we can see, the raw word index does not reserve the first integers for the special tokens mentioned above: its values start at 1. We therefore shift every index by 3 and add the special tokens ourselves.

In [133]:
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["PAD"] = 0
word_index["START"] = 1
word_index["UNK"] = 2
#word_index["<UNUSED>"] = 3

In [134]:
# 3 is not used!!!
for key, value in word_index.items():
    if value == 3:
        print(key, value)

In [136]:
# the most common word is "the"
word_index['the']

4

In [140]:
# inverting the word_index dictionary

index_word = {v:k for k,v in word_index.items()}

index_word[4]

'the'

In [141]:
# first review of the training set as a sequence of words

' '.join(index_word[id] for id in x_train[0])

"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert UNK is an amazing actor and now the same being director UNK father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for UNK and would recommend it to everyone to watch and the fly UNK was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also UNK to the two little UNK that played the UNK of norman and paul they were just brilliant children are often left out of the UNK list i think because the stars that play them all grown up are such a big UNK for the whole film but these children are amazing and should be UNK for what they have done don't you 

This is the first review of the training set. Since we've excluded some words with the parameters `num_words` and `skip_top` when loading the reviews, those words are identified by the string 'UNK'.
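The decoding step can be sketched on a toy vocabulary. The helper names `build_index_word` and `decode_review` and the three-word vocabulary are hypothetical, introduced only for illustration; the offset of 3 and the special tokens follow the cells above. Using `dict.get` with a fallback is a design choice that avoids a `KeyError` for ids that have no entry, such as the unused id 3.

```python
def build_index_word(word_rank, index_from=3):
    """Shift raw ranks by the offset, add special tokens, and invert."""
    word_index = {word: rank + index_from for word, rank in word_rank.items()}
    word_index["PAD"] = 0
    word_index["START"] = 1
    word_index["UNK"] = 2
    return {index: word for word, index in word_index.items()}


def decode_review(token_ids, index_word):
    """Join the words of a review; ids without an entry fall back to 'UNK'."""
    return " ".join(index_word.get(token_id, "UNK") for token_id in token_ids)


# toy vocabulary: raw ranks start at 1, as in the real word index
toy_index_word = build_index_word({"the": 1, "and": 2, "film": 3})
print(decode_review([1, 4, 6, 2, 5, 4, 6], toy_index_word))
```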

Let's now view the original, unfiltered version of this first review.

#### 5. Check original reviews as a sequence of words

In [142]:
(original_x_train,_), (_,_) = imdb.load_data()

In [143]:
' '.join(index_word[id] for id in original_x_train[0])

"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and shou

#### 6. Preprocess data

In [144]:
# make each review the same length using 0 as the padding value

max_review_length = 100 #maximum review length of 100 words
pad_type = 'pre' #add padding characters to the start of every review < 100 words
trunc_type = 'pre' #remove words from the beginning of every review > 100 words

x_train = pad_sequences(x_train, maxlen=max_review_length, 
                        padding=pad_type, truncating=trunc_type, value=0)

x_test = pad_sequences(x_test, maxlen=max_review_length,
                       padding=pad_type, truncating=trunc_type, value=0)

In [145]:
# check length of the 3 first reviews of the training data

for x in x_train[0:3]:
    print(len(x))

100
100
100


All the reviews now have the **same length of 100 words**.

In [147]:
# first review of the training set as a sequence of words after preprocessing

' '.join(index_word[id] for id in x_train[0])

"cry at a film it must have been good and this definitely was also UNK to the two little UNK that played the UNK of norman and paul they were just brilliant children are often left out of the UNK list i think because the stars that play them all grown up are such a big UNK for the whole film but these children are amazing and should be UNK for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was UNK with us all"

As we can see, this review was truncated in order to keep the last 100 words.
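The effect of `'pre'` padding and `'pre'` truncation can be reproduced on plain Python lists. This is a minimal sketch of the behaviour described above, not the Keras implementation; the helper name `pad_pre` is hypothetical and `maxlen` is assumed to be positive.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`.

    Assumes maxlen > 0.
    """
    kept = list(sequence)[-maxlen:]                # 'pre' truncation: drop items from the start
    return [value] * (maxlen - len(kept)) + kept   # 'pre' padding: add zeros at the start


print(pad_pre([11, 12, 13], maxlen=5))              # shorter than maxlen
print(pad_pre([11, 12, 13, 14, 15, 16], maxlen=4))  # longer than maxlen
```

Keeping the end of a review rather than its beginning is a modelling choice; here it means the model sees the reviewer's closing words.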

#### 7. Design a dense neural network

In [193]:
# output directory name:
output_dir = 'text_classification_NN/dense'

# training:
epochs = 4
batch_size = 128

n_dim = 64 #number of dimensions of our word-vector space

# neural network architecture: 
n_dense = 64 #number of neurons in dense layer
dropout = 0.5

model = Sequential(name='model_dense')
model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid')) #single output neuron; we use sigmoid because we only have 2 classes

In [194]:
model.summary()

Model: "model_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 100, 64)           320000    
_________________________________________________________________
flatten_6 (Flatten)          (None, 6400)              0         
_________________________________________________________________
dense_11 (Dense)             (None, 64)                409664    
_________________________________________________________________
dropout_6 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 65        
Total params: 729,729
Trainable params: 729,729
Non-trainable params: 0
_________________________________________________________________
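The parameter counts in the summary can be checked by hand. The short calculation below uses only the hyperparameters defined in the cell above; the variable names for the intermediate results are ours.

```python
n_unique_words = 5000     # vocabulary size
n_dim = 64                # embedding dimensions
max_review_length = 100   # tokens per review
n_dense = 64              # neurons in the hidden dense layer

embedding_params = n_unique_words * n_dim        # one vector per word
flatten_size = max_review_length * n_dim         # 100 x 64 values flattened
dense_params = flatten_size * n_dense + n_dense  # weights + biases
output_params = n_dense * 1 + 1                  # weights + bias of the sigmoid neuron
total_params = embedding_params + dense_params + output_params

print(embedding_params, flatten_size, dense_params, output_params, total_params)
```

Most of the parameters sit in the first dense layer, because flattening the embedded review produces 6,400 inputs.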


We have designed our model. It's now time to compile it!

#### 8. Compile model and save model parameters

In [199]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Our model is now compiled. We've used `binary_crossentropy` as our loss function since we have a binary classifier.
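To make the loss more concrete, here is a minimal sketch of the binary cross-entropy formula, averaged over a batch, written with the standard library only. It is an illustration of the formula and not the Keras implementation; the function name and the clipping constant `eps` are assumptions.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# confident and correct predictions give a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))
# ... confident and wrong predictions give a large one
print(binary_crossentropy([1, 0], [0.1, 0.9]))
```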

In [200]:
# save model parameters after each epoch of training

modelcheckpoint = ModelCheckpoint(filepath=output_dir+"/weights.{epoch:02d}.hdf5")

In [201]:
# create folder if it does not exist

os.makedirs(output_dir, exist_ok=True)
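The `{epoch:02d}` placeholder in the checkpoint path is an ordinary Python format field that is filled in with the epoch number, zero-padded to two digits. The sketch below shows how such a template expands; the helper name `checkpoint_path` is hypothetical and only mimics the naming scheme.

```python
def checkpoint_path(output_dir, epoch):
    """Expand the checkpoint filename template for a given epoch number."""
    filename = "weights.{epoch:02d}.hdf5".format(epoch=epoch)
    return output_dir + "/" + filename


for epoch in range(1, 5):
    print(checkpoint_path("text_classification_NN/dense", epoch))
```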

#### 9. Train model

In [202]:
model.fit(x_train, y_train, 
          batch_size=batch_size, epochs=epochs, verbose=1, 
          validation_data=(x_test, y_test), 
          callbacks=[modelcheckpoint])

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.callbacks.History at 0x181d3bb5c08>

The **highest validation accuracy** and **lowest validation loss** are achieved in **epoch 2**.

#### 10. Make predictions with the test set

In [203]:
# load parameters of epoch 2

model.load_weights(output_dir+"/weights.02.hdf5")

In [205]:
# in Keras, for this specific case, we can use both predict() and predict_proba() to get the predicted probabilities

predictions = model.predict(x_test)
predictions_proba = model.predict_proba(x_test)

In [206]:
predictions[0:3]

array([[0.03830723],
       [0.97975916],
       [0.8893984 ]], dtype=float32)

In [207]:
predictions_proba[0:3]

array([[0.03830723],
       [0.97975916],
       [0.8893984 ]], dtype=float32)

In [208]:
# if we want to get the classes directly, we can use predict_classes()

predictions_class = model.predict_classes(x_test)

In [209]:
predictions_class[0:3]

array([[0],
       [1],
       [1]])
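For a single sigmoid output, the class labels are simply the predicted probabilities thresholded at 0.5. This matters because `predict_proba()` and `predict_classes()` are not available in more recent Keras/TensorFlow releases, where only `predict()` remains. A minimal sketch on plain Python lists, with a hypothetical helper name `to_classes`:

```python
def to_classes(probabilities, threshold=0.5):
    """Turn rows of sigmoid probabilities into rows of 0/1 labels."""
    return [[1 if p > threshold else 0 for p in row] for row in probabilities]


# the first three predicted probabilities shown above
first_predictions = [[0.03830723], [0.97975916], [0.8893984]]
print(to_classes(first_predictions))
```

With a NumPy array returned by `predict()`, the equivalent expression would be `(predictions > 0.5).astype('int32')`.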

#### 11. Evaluate model

We can start by comparing the first 10 predictions with the real labels.

In [212]:
predictions_class[0:10].T

array([[0, 1, 1, 0, 1, 1, 1, 0, 1, 1]])

In [210]:
y_test[0:10]

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], dtype=int64)

For the first 10 reviews, only one (the ninth) is misclassified: a negative review predicted as positive.

In [228]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [229]:
# confusion matrix

print(confusion_matrix(y_test, predictions_class))

[[10273  2227]
 [ 1732 10768]]
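The summary metrics can be derived directly from this confusion matrix. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the four cells are true negatives, false positives, false negatives, and true positives. The calculation below uses the values printed above; the variable names are ours.

```python
# rows = true class (0, 1), columns = predicted class (0, 1)
matrix = [[10273, 2227],
          [1732, 10768]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of the reviews predicted positive, how many are positive
recall_pos = tp / (tp + fn)     # of the positive reviews, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(round(accuracy, 5))
print(round(precision_pos, 2), round(recall_pos, 2))
print(round(precision_neg, 2), round(recall_neg, 2))
```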


In [230]:
# classification report

print(classification_report(y_test, predictions_class))

              precision    recall  f1-score   support

           0       0.86      0.82      0.84     12500
           1       0.83      0.86      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000



In [231]:
# accuracy score

print(accuracy_score(y_test, predictions_class))

0.84164


Consistent with the validation accuracy reported during training, our dense model correctly classifies **84.2%** of the test reviews as positive or negative.