# Sentiment Analysis on IMDB

This notebook explores several models for sentiment analysis on the IMDB dataset. I have tried the following models:

- Basic Linear Model
- Simple Convolution Model
- Model With Pre-Trained Word Embeddings
- Recurrent Neural Networks
    - Simple LSTM Model
    - Convolution with LSTM Model
    - Simple GRU Model
    - Convolution with GRU Model
    - Convolution with Stacked GRU Model
         
Reference : This notebook was developed during the Deep Learning Course by Fast.ai and as such is heavily influenced by it (https://github.com/fastai/courses/blob/master/deeplearning1/nbs/lesson5.ipynb)

## Imports

In [73]:
from __future__ import division,print_function
from PIL import Image
import gc,re

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input
import numpy as np

from keras.models import Model
from keras.layers import Flatten, Dense, Dropout, Input, LSTM, GRU, Embedding, Convolution1D, MaxPooling1D, MaxPool1D
from keras.optimizers import Adam, RMSprop
from keras.layers.normalization import BatchNormalization
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint

from importlib import reload
from keras import backend as K
from keras.datasets import imdb

np.random.seed(7)

import bcolz
from IPython.display import FileLink
import os, json
from glob import glob
import numpy as np
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt

## Constants

In [15]:
VOCAB_SIZE = 5000
SEQ_LEN = 500
EMBEDDING_LEN_50 = 50
EMBEDDING_LEN_100 = 100
EMBEDDING_LEN_200 = 200
EMBEDDING_LEN_300 = 300

PATH = 'data/imdb/'
MODELS = PATH + 'models/'
GLOVE_DIRECTORY = 'data/wordembeddings/'

GLOVE_50_DIM = GLOVE_DIRECTORY + 'glove.6B.50d.txt'
GLOVE_100_DIM = GLOVE_DIRECTORY + 'glove.6B.100d.txt'
GLOVE_200_DIM = GLOVE_DIRECTORY + 'glove.6B.200d.txt'
GLOVE_300_DIM = GLOVE_DIRECTORY + 'glove.6B.300d.txt'


In [17]:
if not os.path.exists(MODELS):
    os.mkdir(MODELS)

In [19]:
%ls $PATH

[0m[01;34mmodels[0m/


## Quick look at the data

One of the good things about working with this particular dataset is that it is already present within the Keras library. As such, we can directly load this dataset and start working with it. 

Reference : https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

In [20]:
(X_train, Y_train), (X_test, Y_test) = imdb.load_data()

The IMDB dataset comes with a dictionary that maps every unique word in the dataset to its index.

In [29]:
word2index = imdb.get_word_index()
print(word2index['and'])
print(len(word2index))

2
88584


As we see above, there are 88584 unique words in the IMDB dataset.

We will use this dictionary to create a reverse mapping from the index to the word which we will use later

In [25]:
index2word = {v : k for k,v in word2index.items()}
index2word[2]

'and'

Let us now see what one of the reviews looks like.

Each review is stored as a list of the indices of the words it contains.

In [27]:
print(X_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [28]:
' '.join(index2word[i] for i in X_train[0])

"the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but wh

Let us see the label of this review.

1 is for positive and 0 is for a negative review

In [30]:
Y_train[0]

1

Now, let us look at a negative review

In [33]:
' '.join(index2word[i] for i in X_train[1])

"the thought solid thought senator do making to is spot nomination assumed while he of jack in where picked as getting on was did hands fact characters to always life thrillers not as me can't in at are br of sure your way of little it strongly random to view of love it so principles of guy it used producer of where it of here icon film of outside to don't all unique some like of direction it if out her imagination below keep of queen he diverse to makes this stretch stefan of solid it thought begins br senator machinations budget worthwhile though ok brokedown awaiting for ever better were lugia diverse for budget look kicked any to of making it out bosworth's follows for effects show to show cast this family us scenes more it severe making senator to levant's finds tv tend to of emerged these thing wants but fuher an beckinsale cult as it is video do you david see scenery it in few those are of ship for with of wild to one is very work dark they don't do dvd with those them"

In [34]:
Y_train[1]

0
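The two decoded reviews above read as scrambled text. A likely cause is that `imdb.load_data()` by default reserves the indices 0, 1 and 2 for padding, start-of-sequence and out-of-vocabulary markers and shifts every real word index up by 3 (its `index_from` argument), while `imdb.get_word_index()` returns the unshifted indices. The sketch below shows an offset-aware decoder under that assumption; `decode_review`, the placeholder token names and the toy vocabulary are illustrative and not part of the original notebook.

```python
# Assumption: load_data() was called with its defaults, so real word indices
# are shifted up by 3 and the values 0, 1, 2 are reserved markers.
INDEX_FROM = 3
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}  # hypothetical names


def decode_review(encoded, word2index, index_from=INDEX_FROM):
    """Map a list of shifted word indices back to a readable string."""
    index2word = {idx + index_from: word for word, idx in word2index.items()}
    index2word.update(SPECIAL_TOKENS)
    return " ".join(index2word.get(i, "<unk>") for i in encoded)


# Toy vocabulary standing in for imdb.get_word_index() (illustrative values).
toy_word2index = {"the": 1, "and": 2, "a": 3, "film": 19}
toy_review = [1, 4, 22, 5, 6]

decoded = decode_review(toy_review, toy_word2index)
print(decoded)
```

With the real data, the equivalent call would be `decode_review(X_train[0], word2index)`.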

As we saw earlier, the dataset contains almost 89k distinct words. The load_data() method of the Keras imdb dataset lets us keep only the top n most frequent words if we so desire.

Here, we will load the dataset with the top 5000 words.

In [51]:
# oov_char must stay below VOCAB_SIZE so that Embedding(VOCAB_SIZE, ...) can look it up
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(num_words=VOCAB_SIZE, oov_char=VOCAB_SIZE - 1)

Let us now see the max, the mean and the minimum review lengths in the train and test sets.

In [36]:
len_train_reviews = [len(i) for i in X_train]
len_test_reviews = [len(i) for i in X_test]

In [37]:
print(max(len_train_reviews), np.mean(len_train_reviews), min(len_train_reviews))

2494 238.71364 11


In [38]:
print(max(len_test_reviews), np.mean(len_test_reviews), min(len_test_reviews))

2315 230.8042 7


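The average review is around 230-240 words long, while the longest ones run to well over 2000 words. So we fix the length of every review at 500 tokens (roughly double the average length). Longer reviews get truncated and shorter ones get padded with 0s. By default, pad_sequences pads and truncates at the start of the sequence.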

In [39]:
X_train.shape

(25000,)

In [40]:
from keras.preprocessing import sequence

In [50]:
train = sequence.pad_sequences(X_train,SEQ_LEN)
test = sequence.pad_sequences(X_test,SEQ_LEN)

In [42]:
train.shape

(25000, 500)

In [43]:
test.shape

(25000, 500)

In [44]:
train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,
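The leading zeros in the output above come from the default `padding='pre'` and `truncating='pre'` settings of `pad_sequences`. As a minimal sketch of that behaviour, the hypothetical helper `pad_pre` below re-implements it for a single sequence in plain Python; it is an illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value`."""
    trimmed = list(seq)[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([11, 12, 13], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 11, 12, 13]
print(long_)  # [3, 4, 5, 6, 7]
```

Padding at the front keeps the actual words at the end of the sequence, which is where a recurrent layer reads them last.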

Next, we convert the integer labels to one-hot (categorical) targets, since the models below end in a 2-unit softmax layer.

In [52]:
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
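For a binary label, the conversion above turns 0 into `[1, 0]` and 1 into `[0, 1]`. The sketch below reproduces this with NumPy only; `one_hot` is a hypothetical helper written for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]


encoded = one_hot([1, 0, 1], num_classes=2)
print(encoded)
```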

## Basic Linear Model

For text processing we use an embedding layer. This layer represents our words as vectors of some particular length (here EMBEDDING_LEN_50) in some higher dimensional space. We do this to help capture the semantic relationship between the words.

References : https://www.tensorflow.org/tutorials/word2vec

References : https://keras.io/layers/embeddings/
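Conceptually, an embedding layer is a trainable lookup table with one row per word index. The NumPy sketch below only illustrates the shapes involved; the random table and the sample batch are made up, and in the real layer the table is learned during training.

```python
import numpy as np

VOCAB_SIZE, EMBEDDING_LEN = 5000, 50

rng = np.random.default_rng(7)
# One row of EMBEDDING_LEN numbers per word index (random here, learned in Keras).
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBEDDING_LEN)).astype("float32")

# A toy batch of two padded reviews, each four tokens long.
batch = np.array([[0, 0, 14, 22],
                  [1, 4, 4999, 2]])

# Fancy indexing replaces every index by its row of the table.
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 4, 50)
```

With `SEQ_LEN = 500`, each review therefore enters the rest of the network as a 500 x 50 matrix.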

In [56]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
x = Dropout(0.2)(emb)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.5)(x)
x = BatchNormalization()(x)
preds = Dense(2, activation = 'softmax')(x)

linear_model = Model(inputs=inp, outputs=preds)
linear_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
linear_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 500)               0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 500, 50)           250000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 500, 50)           0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 25000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 100)               2500100   
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 100)               400       
__________
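The parameter counts in the summary above can be checked by hand. The helpers below are a small sketch using the standard formulas for these layer types; the function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per word index.
    return vocab_size * dim


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    # gamma, beta, moving mean and moving variance per feature.
    return 4 * n_features


emb = embedding_params(5000, 50)    # 250000
hidden = dense_params(500 * 50, 100)  # 2500100
bn = batchnorm_params(100)          # 400
out = dense_params(100, 2)          # 202
print(emb, hidden, bn, out)
```

Almost all of the weights sit in the first dense layer, because flattening the 500 x 50 embedding output produces 25,000 inputs.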

In [57]:
linear_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9b01990b00>

In [58]:
linear_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9b014ebc50>

We see that the linear model reaches almost 86% validation accuracy, but it is overfitting heavily. Let us see next what a more complex model is able to achieve.

In [60]:
import gc
del linear_model
for i in range(0,5):
    gc.collect()

## Simple Convolution Model

Convolutions are good at finding spatial relationships. As such, it is intuitive that they might work for **embedded** text data too, since they can find relations among the vectors of neighbouring words in the sequence.
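To make the idea concrete, here is a minimal NumPy sketch of a 1D convolution sliding over an embedded review. It assumes zero padding that keeps the sequence length unchanged, with the extra padding step placed on the right for an even kernel size; `conv1d_same` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np


def conv1d_same(x, kernels, bias):
    """x: (steps, in_ch); kernels: (k, in_ch, filters); bias: (filters,)."""
    k = kernels.shape[0]
    left = (k - 1) // 2
    right = (k - 1) - left
    padded = np.pad(x, ((left, right), (0, 0)), mode="constant")
    # Each output step looks at a window of k consecutive word vectors.
    out = np.stack([
        np.tensordot(padded[t:t + k], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ]) + bias
    return np.maximum(out, 0.0)  # ReLU activation


rng = np.random.default_rng(7)
review = rng.normal(size=(500, 50))           # SEQ_LEN x EMBEDDING_LEN_50
kernels = rng.normal(size=(4, 50, 64)) * 0.1  # kernel_size=4, filters=64
bias = np.zeros(64)

features = conv1d_same(review, kernels, bias)
print(features.shape)               # (500, 64)
print(kernels.size + bias.size)     # 12864
```

Each of the 64 filters sees four consecutive word vectors at a time, so it can react to short phrases rather than to single words.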

In [67]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
conv = Convolution1D(filters=64, kernel_size=4, padding='same', activation='relu')(emb)
conv = Convolution1D(filters=64, kernel_size=4, padding='same', activation='relu')(conv)
pool = MaxPooling1D(pool_size=2)(conv) 
x = Dropout(0.4)(pool)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.6)(x)
preds = Dense(2, activation = 'softmax')(x)

conv_model = Model(inputs=inp, outputs=preds)
conv_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
conv_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 500)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 500, 50)           250000    
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 500, 64)           12864     
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 500, 64)           16448     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 64)           0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 250, 64)           0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 16000)             0         
__________

In [68]:
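# K.set_value updates the optimizer's learning-rate variable in place; plain attribute
# assignment may be ignored once the training function has been built
K.set_value(conv_model.optimizer.lr, 1e-3)
conv_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9b02da2c18>

In [69]:
K.set_value(conv_model.optimizer.lr, 1e-5)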
conv_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=6)

Train on 20000 samples, validate on 5000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f9b030bb358>

As we see, this simple convolutional model is performing quite well.

## Pre-trained Word Embeddings

We have seen before with images that we can remove the top layers of a VGG/ResNet model and fit our own layers on top. We did this to reuse the earlier layers of those models, since images share a general structure. For example, the earlier layers learn to detect edges, which is needed for all image recognition tasks.

The same idea applies to embeddings. As we discussed previously, we are representing each word in some higher dimensional vector space. It turns out that people have already trained models on billions of tokens and stored the resulting word representations. We can simply use those for our purposes.

For this notebook, we will be using the GloVe word embeddings developed at Stanford.

Reference : https://nlp.stanford.edu/projects/glove/

Reference : https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

I have already downloaded the embeddings. It is now time to use them. The download contains embeddings for 50, 100, 200, and 300 dimensional spaces. We will be working with 50 dimensions for now.

Each line in the GloVe file has the word at the beginning, followed by d numbers that make up its embedding vector.

In [70]:
embeddings_glove = {}
f = open(GLOVE_50_DIM)
for line in f:
    value = line.split()
    word = value[0]
    vec = value[1:]
    embeddings_glove[word] = vec
f.close()

In [72]:
print(embeddings_glove['the'])

['0.418', '0.24968', '-0.41242', '0.1217', '0.34527', '-0.044457', '-0.49688', '-0.17862', '-0.00066023', '-0.6566', '0.27843', '-0.14767', '-0.55677', '0.14658', '-0.0095095', '0.011658', '0.10204', '-0.12792', '-0.8443', '-0.12181', '-0.016801', '-0.33279', '-0.1552', '-0.23131', '-0.19181', '-1.8823', '-0.76746', '0.099051', '-0.42125', '-0.19526', '4.0071', '-0.18594', '-0.52287', '-0.31681', '0.00059213', '0.0074449', '0.17778', '-0.15897', '0.012041', '-0.054223', '-0.29871', '-0.15749', '-0.34758', '-0.045637', '-0.44251', '0.18785', '0.0027849', '-0.18411', '-0.11514', '-0.78581']
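As the output above shows, the values are still strings at this point; NumPy converts them when they are later written into a float array. A variant that parses each line into a float32 vector right away could look like the sketch below. `load_glove` is a hypothetical helper, and the two sample lines are shortened to three dimensions purely for illustration.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse 'word v1 v2 ... vd' lines into a dict of float32 vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


# In-memory stand-in for the GloVe text file (toy 3-dimensional vectors).
sample_file = io.StringIO("the 0.418 0.24968 -0.41242\nfilm 0.1 -0.2 0.3\n")
glove = load_glove(sample_file)
print(glove["the"])
```

For the real file this would be used as `with open(GLOVE_50_DIM, encoding='utf-8') as f: embeddings_glove = load_glove(f)`, which also closes the file if parsing fails.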


The word indices in our index2word dict follow a different order from the one used by GloVe. So we need to map between them and create our embedding matrix.

In [74]:
embedding_matrix=np.zeros((VOCAB_SIZE, EMBEDDING_LEN_50))

for i in range(1,len(embedding_matrix)):    #index2word starts with index 1
    word = index2word[i]
    if word and re.match(r"^[a-zA-Z0-9\-]*$", word): #regex to see if word can be used as key for embeddings_glove
        embedding_matrix[i] = embeddings_glove[word] 
    

In [75]:
embedding_matrix[3]

array([ 0.217 ,  0.4652, -0.4676,  0.1008,  1.0135,  0.7484, -0.531 , -0.2626,  0.1681,  0.1318,
       -0.2491, -0.4419, -0.2174,  0.51  ,  0.1345, -0.4314, -0.0312,  0.2067, -0.7814, -0.2015,
       -0.0974,  0.1609, -0.6184, -0.185 , -0.1246, -2.2526, -0.2232,  0.5043,  0.3226,  0.1531,
        3.9636, -0.7137, -0.6701,  0.2839,  0.2174,  0.1443,  0.2593,  0.2343,  0.4274, -0.4445,
        0.1381,  0.3697, -0.6429,  0.0241, -0.0393, -0.2604,  0.1202, -0.0438,  0.4101,  0.1796])
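Two details of the loop above are worth a second look. It indexes `embeddings_glove[word]` directly, which raises a `KeyError` for any word that GloVe does not contain, and it places the vector for `index2word[i]` in row `i`. If the training data was loaded with the default `index_from=3` of `imdb.load_data()`, the token `i` in a review actually refers to the word ranked `i - 3`. The sketch below builds the matrix under that assumption and skips missing words; `build_embedding_matrix` and the toy inputs are hypothetical.

```python
import numpy as np

INDEX_FROM = 3  # assumption: default offset applied by imdb.load_data()


def build_embedding_matrix(word2index, glove, vocab_size, dim,
                           index_from=INDEX_FROM):
    """Row (rank + index_from) holds the GloVe vector of the word with that rank."""
    matrix = np.zeros((vocab_size, dim), dtype="float32")
    for word, rank in word2index.items():
        row = rank + index_from
        vec = glove.get(word)  # None when GloVe has no entry for the word
        if row < vocab_size and vec is not None:
            matrix[row] = vec
    return matrix


# Toy stand-ins for imdb.get_word_index() and the parsed GloVe vectors.
toy_word2index = {"the": 1, "and": 2, "zzyzx": 3}
toy_glove = {"the": [0.1, 0.2], "and": [0.3, 0.4]}

matrix = build_embedding_matrix(toy_word2index, toy_glove, vocab_size=7, dim=2)
print(matrix)
```

Rows without a GloVe vector, including the reserved indices 0 to 2, stay at zero and can be learned from scratch if the embedding layer is trainable.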

We are now ready to use these prebuilt weights.


In [98]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, \
                input_length=SEQ_LEN, weights=[embedding_matrix], trainable=True)(inp)
conv = Convolution1D(filters=64, kernel_size=4, padding='same', activation='relu')(emb)
conv = Convolution1D(filters=64, kernel_size=4, padding='same', activation='relu')(conv)
pool = MaxPooling1D(pool_size=2)(conv) 
x = Dropout(0.2)(pool)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.3)(x)
preds = Dense(2, activation = 'softmax')(x)

pretrained_model = Model(inputs=inp, outputs=preds)
pretrained_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
pretrained_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_18 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_16 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
conv1d_36 (Conv1D)           (None, 500, 64)           12864     
_________________________________________________________________
conv1d_37 (Conv1D)           (None, 500, 64)           16448     
_________________________________________________________________
max_pooling1d_14 (MaxPooling (None, 250, 64)           0         
_________________________________________________________________
dropout_20 (Dropout)         (None, 250, 64)           0         
_________________________________________________________________
flatten_11 (Flatten)         (None, 16000)             0         
__________

In [99]:
pretrained_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9a43ba7fd0>

In [100]:
K.set_value(pretrained_model.optimizer.lr, 1e-4)
pretrained_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=12)

Train on 20000 samples, validate on 5000 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7f9a453325c0>

You might have noticed that we started with the embedding layer set to trainable and are now making it non-trainable.

I don't know why, but this gave better accuracies during experiments.

In [101]:
pretrained_model.layers[1].trainable= False
pretrained_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

In [102]:
K.set_value(pretrained_model.optimizer.lr, 1e-6)
pretrained_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=8)

Train on 20000 samples, validate on 5000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f9a42401eb8>

As we can see, we can use pre-trained word embeddings in our embedding layer.

## RNN --> LSTM and GRU

One of the recent and exciting developments has been the use of RNNs (Recurrent Neural Networks) for text processing. RNNs enable us to find temporal information present in the data. They help store information over time.

Here we will be trying two different types of RNNs: LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units).

References : https://keras.io/layers/recurrent/

A very good explanation of LSTMs and how they work can be found in the following blog post:

References : http://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE : One thing we will notice is that RNNs in general take longer to train.

### Simple LSTM

First, let us try to use a very simple model using the LSTM layer. We will NOT be working with the pre-trained embeddings for now

In [104]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
lstm = LSTM(100)(emb)
preds = Dense(2, activation = 'softmax')(lstm)

simple_lstm_model = Model(inputs=inp, outputs=preds)
simple_lstm_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
simple_lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_20 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_18 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dense_21 (Dense)             (None, 2)                 202       
Total params: 310,602
Trainable params: 310,602
Non-trainable params: 0
_________________________________________________________________
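The 60,400 parameters reported for the LSTM layer follow from its four gates, each with an input weight matrix, a recurrent weight matrix and a bias. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and the GRU formula assumes the classic variant with a single bias per gate.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # 3 gates with one bias each (classic GRU formulation).
    return 3 * (units * (input_dim + units) + units)


print(lstm_params(50, 100))  # 60400
print(lstm_params(32, 100))  # 53200
print(gru_params(50, 100))   # 45300
print(gru_params(32, 100))   # 39900
```

A GRU has three gates instead of four, so at the same width it has about a quarter fewer recurrent parameters than an LSTM. Newer Keras releases default to a GRU variant with a second bias vector per gate, which gives slightly higher counts.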


In [105]:
simple_lstm_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9a3cccbc50>

In [106]:
simple_lstm_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9a3cba5630>

In [107]:
simple_lstm_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9a3cb699e8>

An extremely simple LSTM gave us a validation accuracy of almost 87%.

### Conv-LSTM model

We can also work with convolution and LSTM layers in a model.

In [108]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
x = Convolution1D(filters=32,kernel_size=4,padding='same')(emb)
x = MaxPooling1D(pool_size=(2))(x)
lstm = LSTM(100, dropout=0.2, recurrent_dropout=0.2)(x)
preds = Dense(2, activation = 'softmax')(lstm)

conv_lstm_model = Model(inputs=inp, outputs=preds)
conv_lstm_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
conv_lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_21 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_19 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
conv1d_38 (Conv1D)           (None, 500, 32)           6432      
_________________________________________________________________
max_pooling1d_15 (MaxPooling (None, 250, 32)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_22 (Dense)             (None, 2)                 202       
Total params: 309,834
Trainable params: 309,834
Non-trainable params: 0
_________________________________________________________________


If you look closely, we have used dropout directly inside the LSTM layer. The layer takes two dropout arguments:

     dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.

     recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
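As a rough illustration of what "fraction of the units to drop" means, the sketch below draws one dropout mask for the inputs of the recurrent layer and one for its recurrent state. It uses the common inverted-dropout scaling, which is an assumption made for this example and not a description of the exact Keras internals; `dropout_mask` is a hypothetical helper.

```python
import numpy as np


def dropout_mask(shape, rate, rng):
    """Zero out roughly `rate` of the entries and rescale the survivors."""
    keep = rng.random(shape) >= rate
    return keep.astype("float32") / (1.0 - rate)


rng = np.random.default_rng(7)
input_mask = dropout_mask((32,), rate=0.2, rng=rng)       # one entry per input feature
recurrent_mask = dropout_mask((100,), rate=0.2, rng=rng)  # one entry per hidden unit

print(input_mask[:8])
print(float((recurrent_mask == 0).mean()))  # fraction of dropped units
```

Multiplying the inputs by `input_mask` and the previous hidden state by `recurrent_mask` during training regularises both paths into the recurrent cell.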

In [109]:
conv_lstm_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9a3b4ddcf8>

With this model, within 2 epochs we have reached a validation accuracy of almost 85%. It also trains faster, because max pooling halves the number of time steps the LSTM has to process (from 500 to 250).
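The halving can be seen in a small NumPy sketch of non-overlapping max pooling along the time axis. `max_pool_1d` is a hypothetical helper written for illustration, assuming a pool size of 2 and a stride equal to the pool size.

```python
import numpy as np


def max_pool_1d(x, pool_size=2):
    """x: (steps, channels). Take the max over each block of `pool_size` steps."""
    steps, channels = x.shape
    usable = (steps // pool_size) * pool_size  # drop a trailing partial block
    return x[:usable].reshape(-1, pool_size, channels).max(axis=1)


toy = np.arange(8).reshape(4, 2)  # 4 time steps, 2 channels
pooled = max_pool_1d(toy)
print(pooled)  # [[2 3] [6 7]]

conv_output = np.zeros((500, 32))
print(max_pool_1d(conv_output).shape)  # (250, 32)
```

Since a recurrent layer processes its input one step at a time, feeding it 250 steps instead of 500 roughly halves the work per review.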

In [110]:
K.set_value(conv_lstm_model.optimizer.lr, 1e-4)
conv_lstm_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9a3b3d54a8>

In just a few epochs we have reached decent accuracies on the validation set.

### Simple GRU Model

Creating a GRU-based model is as simple as replacing the LSTM layer above with a GRU layer.

In [111]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
gru = GRU(100)(emb)
preds = Dense(2, activation = 'softmax')(gru)

simple_gru_model = Model(inputs=inp, outputs=preds)
simple_gru_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
simple_gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_22 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_20 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               45300     
_________________________________________________________________
dense_23 (Dense)             (None, 2)                 202       
Total params: 295,502
Trainable params: 295,502
Non-trainable params: 0
_________________________________________________________________


In [112]:
simple_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9a38f3ea58>

In [113]:
simple_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=3)

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9a393eb940>

An extremely simple GRU model gave a validation accuracy of almost 87.2%.

### Conv-GRU Model

In [114]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
x = Convolution1D(filters=32,kernel_size=4,padding='same')(emb)
x = MaxPooling1D(pool_size=(2))(x)
gru = GRU(100, dropout=0.2, recurrent_dropout=0.2)(x)
preds = Dense(2, activation = 'softmax')(gru)

conv_gru_model = Model(inputs=inp, outputs=preds)
conv_gru_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
conv_gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_23 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_21 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
conv1d_39 (Conv1D)           (None, 500, 32)           6432      
_________________________________________________________________
max_pooling1d_16 (MaxPooling (None, 250, 32)           0         
_________________________________________________________________
gru_2 (GRU)                  (None, 100)               39900     
_________________________________________________________________
dense_24 (Dense)             (None, 2)                 202       
Total params: 296,534
Trainable params: 296,534
Non-trainable params: 0
_________________________________________________________________


In [115]:
conv_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9a3725fcc0>

In [116]:
conv_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9a37090f98>

### Conv-Stacked-GRU Model

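Let us build a slightly more complex model with one GRU layer stacked on top of another, i.e. the output of one GRU layer feeds into the next GRU layer. For this to work, the first GRU layer has to return its output at every time step (return_sequences=True) rather than only the final one.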
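The shape bookkeeping behind stacking can be sketched with a toy recurrent layer. To keep the example short it uses a plain tanh RNN cell instead of a GRU, so it illustrates only the difference between returning the full sequence and returning the last state; `simple_rnn` and the random weights are hypothetical.

```python
import numpy as np


def simple_rnn(x, w_in, w_rec, return_sequences=False):
    """x: (steps, in_dim). Returns (steps, units) or just the last state (units,)."""
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)  # plain RNN update, not a GRU
        outputs.append(h)
    return np.stack(outputs) if return_sequences else h


rng = np.random.default_rng(7)
x = rng.normal(size=(250, 32))  # pooled convolution output for one review

w_in1, w_rec1 = rng.normal(size=(32, 100)) * 0.1, rng.normal(size=(100, 100)) * 0.1
w_in2, w_rec2 = rng.normal(size=(100, 100)) * 0.1, rng.normal(size=(100, 100)) * 0.1

seq = simple_rnn(x, w_in1, w_rec1, return_sequences=True)  # one output per step
final = simple_rnn(seq, w_in2, w_rec2)                     # only the last state
print(seq.shape, final.shape)  # (250, 100) (100,)
```

The second layer needs a sequence as input, which is why only the last recurrent layer in a stack can drop the per-step outputs.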

In [123]:
inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB_SIZE, EMBEDDING_LEN_50, input_length=SEQ_LEN)(inp)
x = Convolution1D(filters=32,kernel_size=4,padding='same')(emb)
x = MaxPooling1D(pool_size=(2))(x)
x = Dropout(0.3)(x)
x = GRU(100, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)(x)
x = GRU(100, dropout=0.4, recurrent_dropout=0.4)(x)
x = Dropout(0.4)(x)
preds = Dense(2, activation = 'softmax')(x)

conv_stacked_gru_model = Model(inputs=inp, outputs=preds)
conv_stacked_gru_model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
conv_stacked_gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_28 (InputLayer)        (None, 500)               0         
_________________________________________________________________
embedding_26 (Embedding)     (None, 500, 50)           250000    
_________________________________________________________________
conv1d_44 (Conv1D)           (None, 500, 32)           6432      
_________________________________________________________________
max_pooling1d_21 (MaxPooling (None, 250, 32)           0         
_________________________________________________________________
dropout_28 (Dropout)         (None, 250, 32)           0         
_________________________________________________________________
gru_8 (GRU)                  (None, 250, 100)          39900     
_________________________________________________________________
gru_9 (GRU)                  (None, 100)               60300     
__________

In [124]:
conv_stacked_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=4)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f9a2b527cc0>

In [125]:
K.set_value(conv_stacked_gru_model.optimizer.lr, 1e-4)
conv_stacked_gru_model.fit(train,Y_train,validation_split=0.2,batch_size=512,epochs=6)

Train on 20000 samples, validate on 5000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f9a2b213d68>