## Sentiment Analysis with Stanford IMDB dataset
Using deep neural networks to predict whether the sentiment of IMDB movie reviews is positive or negative.

## Import statements

In [4]:
from __future__ import unicode_literals
from keras.models import Model, Sequential
from keras.layers import Convolution1D, merge, Input, Embedding, MaxPooling1D
from keras.optimizers import Adam, RMSprop
from keras.layers import Dense, Dropout, Flatten
from keras.datasets import imdb
from keras.utils.data_utils import get_file
import pickle
import numpy as np
from qrnn import QRNN

## Load data file

In [5]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Use a context manager so the file is closed once the data is loaded.
with open(path, 'rb') as f:
    (x_train, y_train), (x_test, y_test) = pickle.load(f)

## Explore the dataset

In [6]:
idx = imdb.get_word_index()

In [7]:
idxArr = sorted(idx, key=idx.get)

In [8]:
idxArr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

In [9]:
idx2word = {v: k for k,v in idx.items()}

In [10]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

In [11]:
' '.join([idx2word[o] for o in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

In [12]:
idx2word[1]

'the'
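
The index-to-word lookup above can be wrapped in a small helper. This is a minimal sketch on a toy word index rather than the real `imdb.get_word_index()` result; `decode`, `toy_idx` and the `<unk>` placeholder are hypothetical names used only for illustration.

```python
# Toy stand-in for imdb.get_word_index(): word -> rank (1 = most frequent).
toy_idx = {'the': 1, 'and': 2, 'a': 3, 'of': 4}

# Invert the mapping so a rank can be turned back into its word.
toy_idx2word = {v: k for k, v in toy_idx.items()}

def decode(seq, idx2word, unknown='<unk>'):
    """Turn a list of word ranks back into text; unseen ranks become `unknown`."""
    return ' '.join(idx2word.get(i, unknown) for i in seq)

decoded = decode([1, 4, 3, 99], toy_idx2word)
print(decoded)
```

Using `dict.get` with a fallback avoids a `KeyError` when a sequence contains an index that is missing from the mapping.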

In [13]:
x_train[0]

[23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215]

## Data preparation

We limit the vocab size to 5000 for this exercise.

In [14]:
vocab_size = 5000

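Cap the word indices in both the train and test sets: every index of `vocab_size - 1` or above is replaced with `vocab_size - 1`, so all rare words share a single token.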

In [15]:
trn = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in s]) for s in x_train]
test = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in s]) for s in x_test]
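
The capping step can be checked on a toy example. The sketch below assumes a toy `vocab_size` of 10 instead of 5000 and uses `np.minimum` as an equivalent, vectorized form of the list comprehension; the variable names are illustrative only.

```python
import numpy as np

vocab_size = 10  # toy value; the notebook uses 5000

# One toy "review": ranks 3 and 9 are in range, 12 and 250 are too rare.
review = np.array([3, 12, 9, 250])

# Same rule as the list comprehension: anything >= vocab_size - 1 becomes vocab_size - 1.
clipped = np.minimum(review, vocab_size - 1)

# The comprehension from the notebook, applied to the same review, for comparison.
reference = [i if i < vocab_size - 1 else vocab_size - 1 for i in review]
print(clipped)
```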

Setting max length of each review sequence to 500.

In [16]:
seq_len = 500

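Pad reviews shorter than 500 tokens with 0s and truncate longer ones to 500. By default, `pad_sequences` pads and truncates at the start of each sequence.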

In [17]:
from keras.preprocessing import sequence

trn = sequence.pad_sequences(trn, maxlen=seq_len, value = 0)
test = sequence.pad_sequences(test, maxlen=seq_len, value = 0)
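
To make the padding behaviour concrete, here is a pure-Python sketch of what `pad_sequences` does to a single sequence under its default settings (`padding='pre'`, `truncating='pre'`). `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate `seq` so that it has exactly `maxlen` items."""
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq     # fill the front with `value`

short = pad_pre([5, 6, 7], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long_)
```

Short reviews therefore get their zeros at the front, and long reviews keep only their final `maxlen` tokens.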

In [18]:
y_train[0:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

### Convolutional Neural Network (CNN) model for Sentiment Analysis
We build a 1D convolutional model with ReLU activations as a first attempt at sentiment analysis.
The model applies several filter sizes (3, 4 and 5) in parallel branches and concatenates their outputs.

In [17]:
# The graph is applied to the embedding output, which has seq_len time steps (not vocab_size).
graph_input = Input((seq_len, 32))
conv_layers = []
for filter_sz in range(3, 6):
    x = Convolution1D(64, filter_sz, border_mode='same', activation='relu')(graph_input)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    conv_layers.append(x)
out = merge((conv_layers), mode='concat')
graph = Model(graph_input, out)
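
The size of the concatenated feature vector follows from simple arithmetic once the graph is applied to the embedding output of shape `(seq_len, 32)`. The sketch below assumes the Keras 1.x defaults for `MaxPooling1D()` (a pool length of 2 with a matching stride); the variable names are illustrative.

```python
seq_len = 500                 # time steps per review, as in the notebook
n_filters = 64                # filters per convolution branch
pool = 2                      # assumed: MaxPooling1D() default pool length
filter_sizes = range(3, 6)    # branches with filter sizes 3, 4 and 5

# 'same' padding keeps the sequence length, pooling halves it,
# and Flatten multiplies the remaining steps by the number of filters.
per_branch = (seq_len // pool) * n_filters
merged = per_branch * len(filter_sizes)
print(per_branch, merged)
```

Each branch contributes the same number of features because `border_mode='same'` makes the output length independent of the filter size.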

In [74]:
conv1 = Sequential()
conv1.add(Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2))
conv1.add(graph)
conv1.add(Dropout(0.5))
conv1.add(Dense(128, activation='relu'))
conv1.add(Dropout(0.7))
conv1.add(Dense(1, activation='sigmoid'))

Compile and train model

In [75]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [78]:
conv1.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=2, batch_size=16)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1b9df3ef780>

Lower the learning rate to fine-tune the model.

In [83]:
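from keras import backend as K
K.set_value(conv1.optimizer.lr, 1e-5)  # update the lr variable in place so the compiled model sees it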

In [84]:
conv1.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=2, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1b9dfbb1320>

### Using a basic Long Short-Term Memory (LSTM) model for Sentiment Analysis
We now try an LSTM model instead of a CNN.

In [21]:
from keras.layers import LSTM

rnn1 = Sequential()
rnn1.add(Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2))
rnn1.add(LSTM(100, input_dim=32, return_sequences=False, consume_less='gpu'))
rnn1.add(Dropout(0.2))
rnn1.add(Dense(1, activation='sigmoid'))

Compile and train model.

In [28]:
rnn1.compile(loss='binary_crossentropy', optimizer=RMSprop(lr=1e-4, epsilon=1e-7), metrics=['accuracy'])

In [29]:
rnn1.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=1, batch_size=16)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x269117c8f60>

In [22]:
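from keras import backend as K
K.set_value(rnn1.optimizer.lr, 1e-5)  # update the lr variable in place so the compiled model sees it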
rnn1.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=1, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x269799af9e8>

#### Discussion
From this, we can see that LSTMs take much longer to train than the CNN. This makes sense: each LSTM step depends on the state from the previous step, so the time steps cannot be computed in parallel on the GPU.

## Using GloVe embeddings
Previously we started from a randomly initialized embedding layer and trained it along with the rest of the model. 
We now try a pre-trained embedding layer instead, as a form of transfer learning.
Refer to [GloVe](https://nlp.stanford.edu/projects/glove/) for more info.

In [19]:
import os

embeddings_index = {}
with open(os.path.join('D:/GloVe', 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its 50 vector components.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
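
The parsing loop relies on the GloVe text layout of one word per line followed by its vector components. A minimal sketch on two made-up lines (the words and numbers below are invented for illustration and are not real GloVe values):

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word, then its vector components.
fake_file = io.StringIO("the 0.1 -0.2 0.3\nfilm 0.4 0.5 -0.6\n")

vectors = {}
for line in fake_file:
    values = line.split()
    # The first token is the word; the rest are parsed as 32-bit floats.
    vectors[values[0]] = np.asarray(values[1:], dtype='float32')

print(vectors['film'])
```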


Create word embedding in Keras using GloVe word vectors.

In [20]:
def create_emb():
    n_fact = 50
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1, len(emb)):
        word = idx2word[i]
        embedding_vec = embeddings_index.get(word)
        if embedding_vec is not None:
            emb[i] = embedding_vec
        else:
            emb[i] = np.random.normal(scale=0.6, size=(n_fact,))
    
    emb[-1] = np.random.normal(scale=0.6, size=(n_fact,))
    return emb
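
The same idea can be shown on a toy vocabulary. The sketch below assumes a three-word index and three-dimensional vectors, all invented for illustration; it keeps row 0 at zero for the padding token and draws random vectors for words that have no pre-trained entry.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the sketch is repeatable
n_fact = 3                              # toy vector size; the notebook uses 50
toy_idx2word = {1: 'the', 2: 'film', 3: 'zzyzx'}
toy_vectors = {'the': np.array([0.1, 0.2, 0.3]),
               'film': np.array([0.4, 0.5, 0.6])}

toy_emb = np.zeros((len(toy_idx2word) + 1, n_fact))   # row 0 stays zero for padding
for i, word in toy_idx2word.items():
    vec = toy_vectors.get(word)
    # Known words copy their pre-trained vector; unknown ones get random noise.
    toy_emb[i] = vec if vec is not None else rng.normal(scale=0.6, size=n_fact)

print(toy_emb.shape)
```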

In [21]:
emb = create_emb()

## CNN with pre-trained GloVe embeddings
Rebuild the CNN model, again with multiple filter sizes in parallel, on top of the new word embeddings.

In [28]:
# The graph is applied to the embedding output, which has seq_len time steps (not vocab_size).
graph_input2 = Input((seq_len, 50))
conv_layers2 = []
for filter_sz in range(3, 6):
    x = Convolution1D(64, filter_sz, border_mode='same', activation='relu')(graph_input2)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    conv_layers2.append(x)
out2 = merge((conv_layers2), mode='concat')
graph2 = Model(graph_input2, out2)

In [29]:
cnn_2 = Sequential()
cnn_2.add(Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]))
cnn_2.add(Dropout(0.2))
cnn_2.add(graph2)
cnn_2.add(Dropout(0.5))
cnn_2.add(Dense(100, activation='relu'))
cnn_2.add(Dropout(0.7))
cnn_2.add(Dense(1, activation='sigmoid'))

Compile and fit new model.

In [30]:
cnn_2.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [31]:
cnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=2, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x27b6c803da0>

In [78]:
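from keras import backend as K
K.set_value(cnn_2.optimizer.lr, 1e-6)  # update the lr variable in place so the compiled model sees it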

In [79]:
cnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=5, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x269267f3160>

### Discussion
From this, we can see that using pre-trained embeddings gives a much higher train accuracy than before. However, the validation accuracy remains the same, which points to overfitting and suggests more aggressive dropout, perhaps after the conv layers.

## Building the RNN model with pre-trained embeddings
Now we try an LSTM on top of the same pre-trained embeddings.

In [36]:
rnn_2 = Sequential()
rnn_2.add(Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]))
rnn_2.add(Dropout(0.2))
rnn_2.add(LSTM(100, input_dim=50, return_sequences=False))
rnn_2.add(Dropout(0.2))
rnn_2.add(Dense(1, activation='sigmoid'))

Compile and run.

In [45]:
rnn_2.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-6), 
              metrics=['accuracy'])

In [47]:
rnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=1, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x1ce9a248828>

As the RNN trains too slowly, I did not train it for long; the increase in accuracy would not be worth the training time. A stronger GPU setup may be needed to train the RNN.

### Quasi-Recurrent Neural Network (QRNN)
I also tried a QRNN model, which approximates the time dependency of an RNN using convolutional filters. This should give a better training time, as the convolutions can be parallelized while still taking the time steps before each input into account.

Credits to [Ding Ke](https://github.com/DingKe/qrnn) for the QRNN port to Keras 

Reference: [Quasi-Recurrent Neural Networks](https://arxiv.org/pdf/1611.01576v2.pdf)
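
To illustrate the idea, here is a minimal NumPy sketch of a single QRNN layer with the "f-pooling" recurrence described in the referenced paper. It is not the `qrnn.py` port used in this notebook; `causal_conv` and `qrnn_f_pool` are hypothetical helper names, biases are omitted, and the weights are random.

```python
import numpy as np

def causal_conv(x, w):
    """x: (T, d_in), w: (k, d_in, d_out). Output step t only sees x[t-k+1 .. t]."""
    k = w.shape[0]
    padded = np.vstack([np.zeros((k - 1, x.shape[1])), x])   # pad the past with zeros
    return np.stack([np.tensordot(padded[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def qrnn_f_pool(x, w_z, w_f):
    z = np.tanh(causal_conv(x, w_z))                  # candidates for every step
    f = 1.0 / (1.0 + np.exp(-causal_conv(x, w_f)))    # forget gates for every step
    h = np.zeros(z.shape[1])
    out = []
    for z_t, f_t in zip(z, f):                        # only this elementwise loop is sequential
        h = f_t * h + (1.0 - f_t) * z_t
        out.append(h)
    return np.stack(out)

rng = np.random.RandomState(0)
x = rng.normal(size=(6, 4))          # 6 time steps, 4 input features
w_z = rng.normal(size=(3, 4, 5))     # window size 3, 5 hidden units
w_f = rng.normal(size=(3, 4, 5))
h = qrnn_f_pool(x, w_z, w_f)
print(h.shape)
```

The two convolutions do not depend on the hidden state, so they can be computed for all time steps at once; only the cheap elementwise update runs step by step, which is where the speed-up over an LSTM is expected to come from.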

In [37]:
from keras.regularizers import l2
from keras.constraints import maxnorm

qrnn_model = Sequential()
qrnn_model.add(Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]))
qrnn_model.add(Dropout(0.2))
qrnn_model.add(QRNN(128, window_size=3, return_sequences=True, dropout=0.2, 
                    W_regularizer=l2(1e-4), b_regularizer=l2(1e-4), 
                    W_constraint=maxnorm(10), b_constraint=maxnorm(10)))
qrnn_model.add(QRNN(128, window_size=3, dropout=0.2, 
                    W_regularizer=l2(1e-4), b_regularizer=l2(1e-4), 
                    W_constraint=maxnorm(10), b_constraint=maxnorm(10)))
qrnn_model.add(Dense(1, activation='sigmoid'))



In [39]:
qrnn_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [40]:
qrnn_model.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=1, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x27b01feeeb8>

In [77]:
from keras.regularizers import l2
from keras.constraints import maxnorm

qrnn_2 = Sequential()
qrnn_2.add(Dropout(0.1, input_shape=(seq_len,)))
qrnn_2.add(Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]))
qrnn_2.add(Dropout(0.2))
qrnn_2.add(QRNN(128, window_size=4, dropout=0.7, 
                    W_regularizer=l2(1e-4), b_regularizer=l2(1e-4), 
                    W_constraint=maxnorm(10), b_constraint=maxnorm(10)))
qrnn_2.add(Dense(1, activation='sigmoid'))



In [74]:
qrnn_2.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dropout_21 (Dropout)             (None, 5000)          0           dropout_input_2[0][0]            
____________________________________________________________________________________________________
embedding_21 (Embedding)         (None, 500, 50)       250000      dropout_21[0][0]                 
____________________________________________________________________________________________________
dropout_22 (Dropout)             (None, 500, 50)       0           embedding_21[0][0]               
____________________________________________________________________________________________________
qrnn_27 (QRNN)                   (None, 128)           77184       dropout_22[0][0]                 
___________________________________________________________________________________________

Compile and train model as usual.

In [78]:
qrnn_2.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [79]:
qrnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=1, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x27b977b4438>

In [82]:
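from keras import backend as K
K.set_value(qrnn_2.optimizer.lr, 1e-6)  # update the lr variable in place so the compiled model sees it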

In [83]:
qrnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=2, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x27b97984400>

In [66]:
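from keras import backend as K
K.set_value(qrnn_2.optimizer.lr, 1e-6)  # update the lr variable in place so the compiled model sees it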

In [68]:
qrnn_2.fit(trn, y_train, validation_data=(test, y_test), nb_epoch=3, batch_size=16)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x27b91f7f320>

### Discussion
QRNN trains much faster than LSTM while still taking earlier time steps into account, which suggests that QRNN may be a better choice than LSTM in terms of both accuracy and computation. The train accuracy of QRNN is also better than that of the CNN, but each epoch takes much longer than a CNN epoch.

## Conclusion
From this, we can see that QRNN trains much faster than LSTM and can still fit to a reasonable train accuracy. Its train accuracy is better than the CNN's, but at a much longer time per epoch.

However, the QRNN shows the same diverging train and validation loss as the CNN, which may indicate a need for more dropout at the embedding layer, since both models show the same behaviour.

Overall the choice would be between QRNN and CNN.