# Catching the sentiment

Let's see how well deep learning handles text classification, using sentiment analysis of movie reviews as the task.


Load the IMDB sentiment dataset:

In [4]:
from keras.datasets import imdb
from keras.preprocessing import sequence


(X_train, y_train), (X_test, y_test) = imdb.load_data()


Let's examine a document:


In [5]:
print(X_train[10])

[1, 785, 189, 438, 47, 110, 142, 7, 6, 7475, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 30432, 883, 10789, 23, 8, 4, 136, 13772, 11631, 4, 7475, 43, 1076, 21, 1407, 419, 5, 5202, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 13772, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 17547, 34, 4, 321, 487, 5, 116, 15, 6584, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 14799, 1412, 8, 1172, 18, 7865, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 5341, 2140, 20250, 648, 1430, 10136, 8914, 5, 27, 3000, 1432, 7130, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 7740, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 14040, 103, 19100, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448,

Not quite what we expected... Keras has already replaced each word with its index.
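To make the encoding less opaque, here is a minimal sketch of how such an index list could be mapped back to words. It assumes the default settings of `imdb.load_data()`, where 0 is used for padding, 1 marks the start of a review, 2 stands for an unknown word, and every word rank is shifted by 3. The vocabulary `toy_word_index` and the helper `decode_review` are made up for illustration; the real mapping would come from `imdb.get_word_index()`.

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent),
# shaped like the dictionary returned by imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Assumed defaults of load_data(): 0 = padding, 1 = start of review,
# 2 = unknown word, and word ranks shifted by index_from = 3.
INDEX_FROM = 3
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}


def decode_review(indices, word_index, index_from=INDEX_FROM):
    """Turn a list of integer indices back into a readable string."""
    rank_to_word = {rank: word for word, rank in word_index.items()}
    words = []
    for i in indices:
        if i in SPECIAL:
            words.append(SPECIAL[i])
        else:
            # Undo the shift before looking the word up.
            words.append(rank_to_word.get(i - index_from, "<unk>"))
    return " ".join(words)


decoded = decode_review([1, 4, 5, 6, 7], toy_word_index)
print(decoded)
```

With the real word index, the same helper could be applied to `X_train[10]` to read the review.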

Since TensorFlow and Keras do not support dynamic graphs (yet?), every document in a batch must have the same length, so we have to pad the shorter documents and truncate the longer ones:

In [8]:
# num_words -> consider only the top 10000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)
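What `pad_sequences` does with its default settings can be sketched in a few lines of plain Python. The helper `pad_pre` below is a hypothetical stand-in, not the Keras implementation; it assumes the defaults `padding='pre'` and `truncating='pre'`, so zeros are added at the front and long documents keep only their last `maxlen` words.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for one sequence (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value


short_doc = pad_pre([1, 785, 189], maxlen=5)
long_doc = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_doc)
print(long_doc)
```

This is why the padded document printed next starts with a run of zeros rather than ending with one.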

In [9]:
print(X_train[10])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    1  785  189  438   47  110
  142    7    6 7475  120    4  236  378    7  153   19   87  108  141
   17 1004    5    2  883    2   23    8    4  136    2    2    4 7475
   43 1076   21 1407  419    5 5202  120   91  682  189 2818    5    9
 1348   31    7    4  118  785  189  108  126   93    2   16  540  324
   23    6  364  352   21   14    9   93   56   18   11  230   53  771
   74   31   34    4 2834    7    4   22    5   14   11  471    9    2
   34    4  321  487    5  116   15 6584    4   22    9    6 2286    4
  114 2679   23  107  293 1008 1172    5  328 1236    4 1375  109    9
    6  132  773    2 1412    8 1172   18 7865   29    9  276   11    6
 2768   19  289  409    4 5341 2140    2  648 1430    2 8914    5   27
 3000 
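Notice that some indices from the first print, such as 30432 and 10789, now appear as 2. This is the effect of `num_words=10000`: any index at or above that limit is replaced by the out-of-vocabulary index. A small sketch of that rule, assuming the default `oov_char=2` (the helper name `cap_vocabulary` is ours):

```python
def cap_vocabulary(seq, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_char for w in seq]


# A slice of the document printed earlier, before and after capping.
original = [17, 1004, 5, 30432, 883, 10789, 23]
capped = cap_vocabulary(original, num_words=10000)
print(capped)
```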

Now we are ready to extract the sentiment from the documents. We will use a simple MLP on top of word embeddings for the classification:


In [10]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, AveragePooling1D
from keras.optimizers import Adam


model = Sequential()
# Number of unique words, embedding dimension, number of words per document
model.add(Embedding(10000, 32, input_length=500))
# Just flatten the embedded sequence (does not take the padding into account!)
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 500)               8000500   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 501       
Total params: 8,321,001
Trainable params: 8,321,001
Non-trainable params: 0
_________________________________________________________________
None


In [11]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f43abe27470>

Usually, using just the mean embedding vector works equally well!

In [12]:
from keras.layers import GlobalAveragePooling1D
model = Sequential()
model.add(Embedding(10000, 32, input_length=500))

# Calculate the mean embedding
model.add(GlobalAveragePooling1D())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 500)               16500     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 501       
Total params: 337,001
Trainable params: 337,001
Non-trainable params: 0
_________________________________________________________________
None
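The parameter counts in the two summaries can be reproduced by hand, which also shows where the savings come from. The helper names below are ours; the formulas are the usual ones (an embedding table has one row per word, and a dense layer has one weight per input-output pair plus one bias per output).

```python
def embedding_params(vocab_size, dim):
    # One dim-dimensional vector per word in the vocabulary.
    return vocab_size * dim


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


vocab, dim, seq_len, hidden = 10000, 32, 500, 500

# Flatten model: the first dense layer sees all 500 * 32 values.
flatten_total = (embedding_params(vocab, dim)
                 + dense_params(seq_len * dim, hidden)
                 + dense_params(hidden, 1))

# Mean-embedding model: the first dense layer sees only 32 values.
mean_total = (embedding_params(vocab, dim)
              + dense_params(dim, hidden)
              + dense_params(hidden, 1))

print(flatten_total, mean_total)
```

Almost all of the difference sits in the first dense layer, whose input shrinks from 16000 values to 32.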


The number of parameters is greatly reduced. Let's examine the performance of the model.

In [13]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f439d0e10f0>

It actually works better. This is expected, since flattening keeps a lot of positional information that this MLP cannot exploit. Next, let's try to ignore the padded words (masking):

In [14]:
from keras.layers import Masking

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(500,)))
model.add(Embedding(10000, 32, input_length=500))

# Calculate the mean embedding
model.add(AveragePooling1D(pool_size=500))

model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_1 (Masking)          (None, 500)               0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
average_pooling1d_1 (Average (None, 1, 32)             0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 500)               16500     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 501       
Total params: 337,001
Trainable params: 337,001
Non-trainable params: 0
_________________________________________________________________
None

<keras.callbacks.History at 0x7f439ee98940>
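To make the idea of masking concrete, here is a NumPy sketch that compares a plain mean over all positions with a mean over the non-padded positions only. The tiny embedding table and the document are invented for illustration. As far as we know, Keras normally produces a mask through `Embedding(..., mask_zero=True)`, and not every layer consumes it, so treat this as an illustration of the concept rather than of what the cell above computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny embedding table: 10 words, 4 dimensions.
embedding_matrix = rng.normal(size=(10, 4))
# Give the padding index a nonzero vector so its influence is visible.
embedding_matrix[0] = 1.0

# One padded document: three padding positions, then three real words.
doc = np.array([0, 0, 0, 5, 7, 2])
vectors = embedding_matrix[doc]            # shape (6, 4)

# Plain mean: the padding vectors are averaged in.
plain_mean = vectors.mean(axis=0)

# Masked mean: only the positions holding real words contribute.
mask = doc != 0
masked_mean = vectors[mask].mean(axis=0)

print(plain_mean)
print(masked_mean)
```

The shorter the document relative to the padded length, the more the plain mean is dominated by the padding vector.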

Masking does not seem to significantly impact the performance of the model. We can also use a CNN for text classification!

In [15]:
from keras.layers import Conv1D, GlobalAveragePooling1D, GlobalMaxPool1D, Dropout

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(500,)))
model.add(Embedding(10000, 32, input_length=500))
model.add(Dropout(0.3))
model.add(Conv1D(filters=32, kernel_size=3))
model.add(GlobalMaxPool1D())

model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=1)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_2 (Masking)          (None, 500)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 498, 32)           3104      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 500)               16500     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 501       
Total para

<keras.callbacks.History at 0x7f43a36acd30>
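The shapes and parameter counts of the convolutional layer in the summary can be checked with a little arithmetic. The sketch below assumes the `Conv1D` defaults of stride 1 and `'valid'` padding (no padding at the borders); the helper names are ours.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # With 'valid' padding the kernel must fit entirely inside the sequence.
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    # Each filter spans kernel_size steps of all input channels, plus a bias.
    return kernel_size * in_channels * filters + filters


out_len = conv1d_output_length(500, kernel_size=3)
n_params = conv1d_params(in_channels=32, filters=32, kernel_size=3)
print(out_len, n_params)
```

Global max pooling then keeps, for each of the 32 filters, only its strongest response over the 498 positions, which is why the dense layer again receives 32 values.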

Recurrent models, such as LSTMs and GRUs, can also be used very easily!

In [16]:
# CuDNNLSTM only runs on a CUDA-capable GPU; keras.layers.LSTM is the CPU-compatible alternative
from keras.layers import CuDNNLSTM

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(500,)))
model.add(Embedding(10000, 32, input_length=500))
model.add(Dropout(0.3))
model.add(CuDNNLSTM(128))
model.add(Dropout(0.3))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=1)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_3 (Masking)          (None, 500)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 500, 32)           320000    
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 128)               82944     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 500)               64500     
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 501       
Total para

<keras.callbacks.History at 0x7f439c545fd0>
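The 82,944 parameters reported for the recurrent layer can also be derived by hand. An LSTM has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias. The sketch below assumes that the cuDNN variant stores two bias vectors per gate (one for the input side and one for the recurrent side), which is what makes the count match the summary; the helper `lstm_params` is ours.

```python
def lstm_params(input_dim, units, bias_vectors=1):
    """Parameter count of one LSTM layer (four gates)."""
    per_gate = input_dim * units + units * units + bias_vectors * units
    return 4 * per_gate


cudnn_count = lstm_params(32, 128, bias_vectors=2)   # two biases per gate
plain_count = lstm_params(32, 128)                   # single bias per gate
print(cudnn_count, plain_count)
```

Under the same assumptions, a layer with a single bias per gate would have slightly fewer parameters, so a small difference between the two layer types in a summary is expected.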

With some hyper-parameter tuning you might be able to further increase the accuracy.