###### Importing the dependencies and the IMDB datasets


In [3]:
# Importing tflearn and other helper functions
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences

# tflearn has a pre-processed dataset of IMDB movie ratings
from tflearn.datasets import imdb

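##### The command below loads the IMDB dataset and splits it into training and validation sets. The vocabulary is limited to the 10,000 most frequent words, and 10% of the training data is held out for validation (hence valid_portion=0.1)

Note: the load_data function downloads the data from the web. It returns three splits (train, validation, test); the variable named test below receives the 10% validation split, and the third split is discarded.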

In [4]:
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=10000,valid_portion=0.1)

##### Each split is unpacked into reviews and labels: trainX, trainY and testX, testY

In [6]:
trainX, trainY = train
testX, testY = test

##### Visualizing the data
As seen in the cell below, each review is represented as a list of integers. Each integer is the index of a word in the dataset's vocabulary, not a Word2Vec vector; dense word vectors are learned later by the embedding layer of the network.
For background on word vectors, refer to: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

In [9]:
trainX[0:5] # integer-encoded reviews (one word index per token)

[[17, 25, 10, 406, 26, 14, 56, 61, 62, 323, 4],
 [16, 586, 32, 885, 17, 39, 68, 31, 2994, 2389, 328, 4],
 [1, 2, 1, 139, 6, 130, 1, 5, 6, 25, 105, 4730, 40],
 [6691, 1, 10, 333, 10, 17, 27, 4, 34, 181, 6, 1418, 256, 4],
 [30, 287, 142, 2216, 707, 3763, 20, 68, 57, 30, 37, 309, 14, 4]]
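
Integer lists like the ones above are typically produced by ranking the vocabulary by frequency and replacing each word with its rank. The sketch below illustrates that idea on a toy corpus. The corpus, the helper names build_vocab and encode, and the convention of reserving 0 for padding and 1 for unknown words are assumptions made for illustration, not the actual preprocessing behind imdb.pkl.

```python
from collections import Counter

# Hypothetical toy corpus; the real IMDB vocabulary is not reproduced here.
reviews = [
    "the movie was great",
    "the movie was bad",
    "great acting and a great plot",
]

def build_vocab(texts, n_words):
    """Assign ids 2..n_words-1 to the most frequent words.

    Assumption for this sketch: id 0 is reserved for padding, id 1 for unknown words.
    """
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ranked[: n_words - 2], start=2)}

def encode(text, vocab, unknown_id=1):
    """Replace every word with its id, or with unknown_id if it is outside the vocabulary."""
    return [vocab.get(word, unknown_id) for word in text.split()]

vocab = build_vocab(reviews, n_words=6)
encoded = [encode(text, vocab) for text in reviews]
print(encoded)  # [[4, 3, 5, 2], [4, 3, 5, 1], [2, 1, 1, 1, 2, 1]]
```

Capping the vocabulary (n_words) keeps the embedding table small; every rarer word collapses onto the same unknown id.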

In [10]:
trainY[0:5] # Labels

[0, 0, 0, 1, 0]

In [17]:
import numpy
# dtype=object is needed because the reviews have different lengths (ragged lists)
print("Shape of the train dataset reviews is: ", numpy.array(trainX, dtype=object).shape)
print("Shape of the train dataset labels is: ", numpy.array(trainY).shape)

print("Shape of the test dataset reviews is: ", numpy.array(testX, dtype=object).shape)
print("Shape of the test dataset labels is: ", numpy.array(testY).shape)

Shape of the train dataset reviews is:  (22500,)
Shape of the train dataset labels is:  (22500,)
Shape of the test dataset reviews is:  (2500,)
Shape of the test dataset labels is:  (2500,)
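
The 22,500 / 2,500 sizes printed above are what a 10% hold-out of 25,000 labelled reviews gives. The sketch below shows one way such a split can be made. The function name holdout_split, the fixed seed, and the placeholder data are assumptions for illustration; this is not tflearn's own implementation.

```python
import random

def holdout_split(samples, labels, valid_portion, seed=0):
    """Shuffle the indices and hold out a fraction of the data for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(len(samples) * valid_portion))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    valid = ([samples[i] for i in valid_idx], [labels[i] for i in valid_idx])
    return train, valid

# Placeholder data standing in for 25,000 encoded reviews and their 0/1 labels.
samples = [[i] for i in range(25000)]
labels = [i % 2 for i in range(25000)]

(train_x, train_y), (valid_x, valid_y) = holdout_split(samples, labels, valid_portion=0.1)
print(len(train_x), len(valid_x))  # 22500 2500
```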


Data preprocessing:

    1) Sequence padding - making sure every input review sequence has the same length.
        Any review longer than the maximum length is truncated
        Any review shorter than the maximum length is padded with zeroes
    2) Converting the labels into binary vectors

In [18]:
#Pad Sequences using helper function of tflearn - pad_sequences
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)
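
What the padding step does can be sketched in plain Python. The helper pad_or_truncate below is hypothetical; it pads at the end, which matches the padded output shown in this notebook, and it also truncates at the end, which is an assumption about the default behaviour.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Return copies of the sequences that all have exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                          # drop anything beyond maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the rest with `value`
    return padded

toy = [[17, 25, 10], [16, 586, 32, 885, 17, 39]]
result = pad_or_truncate(toy, maxlen=5)
print(result)  # [[17, 25, 10, 0, 0], [16, 586, 32, 885, 17]]
```

A fixed length is needed because the input layer of the network expects every sample to have the same number of features.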

Visualizing the review features after padding with zeroes

In [20]:
trainX[0:2]

array([[  17,   25,   10,  406,   26,   14,   56,   61,   62,  323,    4,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0],
       [  16,  586,   32,  885,   17,   39,   68,   31, 2994, 2389,  328,
           4,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    

In [22]:
# Convert the labels to binary (one-hot) vectors with tflearn's to_categorical helper
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)

Visualizing the labels after binary vector conversion

Note: As seen in the cell below, label 0 is represented as the vector [1 0] and label 1 as [0 1]. This one-hot encoding is required for tflearn to handle these labels with the categorical cross-entropy loss used in the model.

In [23]:
trainY[0:5]

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.]])
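
The conversion shown above can be sketched without tflearn. The helper name one_hot is hypothetical; it only illustrates the mapping from a class index to a binary vector.

```python
def one_hot(labels, nb_classes):
    """Turn each integer label into a vector with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded_labels = one_hot([0, 0, 0, 1, 0], nb_classes=2)
print(encoded_labels)
# [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```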

Building the Network Model

        Each line defines one layer of the neural network
        Each layer's output is the next layer's input
        Every layer has hyperparameters that need to be defined

In [24]:
# None lets the batch size vary; 100 is the padded sequence length (words per review)
net1 = tflearn.input_data([None, 100])
# Map each of the 10,000 word indices to a learned 128-dimensional vector
net2 = tflearn.embedding(net1, input_dim=10000, output_dim=128)
# LSTM with 128 units; in tflearn, dropout=0.8 is the keep probability
net3 = tflearn.lstm(net2, 128, dropout=0.8)
# Softmax over the two sentiment classes
net4 = tflearn.fully_connected(net3, 2, activation='softmax')
net5 = tflearn.regression(net4, optimizer='adam', learning_rate=0.0001, loss='categorical_crossentropy')
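
To get a feel for the size of this network, the sketch below estimates the number of trainable parameters per layer from the hyperparameters above. It assumes textbook layer definitions (a standard LSTM with four gates and one bias vector per gate, a dense layer with bias); the exact count in tflearn may differ slightly, and the helper names are hypothetical. The tensor shapes follow the layer order: (batch, 100) indices, then (batch, 100, 128) embeddings, then (batch, 128) from the LSTM's last step, then (batch, 2) class probabilities.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10000, 128),
    "lstm": lstm_params(128, 128),
    "fully_connected": dense_params(128, 2),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>16}: {n:>9,}")
print(f"{'total':>16}: {total:>9,}")  # total: 1,411,842
```

Under these assumptions the embedding table accounts for roughly 90% of the parameters, which is why the vocabulary cap matters.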

Model Training

In [25]:
nn_model = tflearn.DNN(net5, tensorboard_verbose=0)
# Train on the padded reviews; testX/testY serve as the validation set
nn_model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
             batch_size=32)

Training Step: 7534  | total loss: 0.07482 | time: 157.543s
| Adam_0 | epoch: 011 | loss: 0.07482 - acc: 0.9875 -- iter: 22496/22500
Training Step: 7535  | total loss: 0.07349 | time: 163.992s
| Adam_0 | epoch: 011 | loss: 0.07349 - acc: 0.9856 | val_loss: 0.69186 - val_acc: 0.8096 -- iter: 22500/22500
--


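The model fits the training data very well but does noticeably worse on the held-out data, which suggests overfitting:
Accuracy on the training dataset is 98.75% (training loss about 0.07)
Accuracy on the testing dataset is 80.96% (validation loss about 0.69)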
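
The training log reports val_loss 0.69186 together with val_acc 0.8096. A loss of about 0.693 (ln 2) is what a two-class model gets by always predicting 50/50, so a value this high alongside 81% accuracy is consistent with a model that is very confident and sometimes confidently wrong. The sketch below illustrates that effect with made-up predictions; the numbers and helper names are assumptions, not outputs of the trained model.

```python
import math

def categorical_crossentropy(probs, targets, eps=1e-12):
    """Mean over samples of -sum(target * log(predicted probability))."""
    total = 0.0
    for p_row, t_row in zip(probs, targets):
        total -= sum(t * math.log(max(p, eps)) for p, t in zip(p_row, t_row))
    return total / len(probs)

def accuracy(probs, targets):
    """Fraction of samples whose highest-probability class matches the target."""
    hits = sum(
        p_row.index(max(p_row)) == t_row.index(max(t_row))
        for p_row, t_row in zip(probs, targets)
    )
    return hits / len(probs)

# Hypothetical one-hot targets: five samples, all of class 0.
targets = [[1.0, 0.0]] * 5

# Four confident correct predictions and one confident wrong prediction.
confident = [[0.99, 0.01]] * 4 + [[0.01, 0.99]]
# A model that always predicts 50/50.
uniform = [[0.5, 0.5]] * 5

confident_loss = categorical_crossentropy(confident, targets)
uniform_loss = categorical_crossentropy(uniform, targets)

print("confident model: acc", accuracy(confident, targets), "loss", round(confident_loss, 4))
print("uniform model: loss", round(uniform_loss, 4))
```

In this toy case the confident model reaches 80% accuracy yet has a higher loss than the 50/50 baseline, because a single confident mistake is penalised heavily.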