In this notebook, we use the TensorFlow Keras packages to classify movie reviews from the IMDB dataset.
TensorFlow provides a preprocessed [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) consisting of 50,000 movie reviews, split evenly between positive and negative. Each review is already encoded as a sequence of integers, where each integer stands for a word.

In [39]:
# Required import statements
import tensorflow as tf
import keras
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb

**1. Download the IMDB dataset from keras datasets.**

In [40]:
# Download the imdb dataset
'''
load_data loads the preprocessed IMDB dataset, already split into train and test sets.
The num_words argument keeps only the most frequent words in the training data.
Here num_words = 1000 keeps the top 1000 most frequent words;
rarer words are replaced by the out-of-vocabulary index 2.
'''
top_words = 1000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
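
The effect of `num_words` can be illustrated without downloading anything. The following is a minimal sketch, assuming the documented behaviour that word indices at or above `num_words` are replaced by the out-of-vocabulary index 2. The helper `cap_vocabulary` and the sample review are hypothetical and are not part of Keras.

```python
def cap_vocabulary(review, num_words, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in review]


# Hypothetical encoded review: 1622 and 1385 fall outside a 1000-word vocabulary.
raw_review = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
capped = cap_vocabulary(raw_review, num_words=1000)
print(capped)
```

This is why the encoded reviews printed later contain many 2s: with only 1000 words kept, a large share of each review falls outside the vocabulary.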

**2. Review the size of Train and Test data**

In [41]:
# The size of train data
X_train.shape

(25000,)

In [42]:
# The size of test data
X_test.shape

(25000,)

In [43]:
# In the preprocessed data, each integer represents a word in the dictionary.
# The first review looks like this:
print(X_train[0])

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]


**3. Map the sequence of integers back to the original words**

In [44]:
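'''
Map a movie review from its sequence of integers back to words.
Note: imdb.load_data shifts every word index by 3 by default (index_from=3)
and reserves 0 for padding, 1 for the start marker and 2 for unknown words,
so this direct lookup does not reproduce the original review text.
'''
word_integer = imdb.get_word_index()

integer_word = {i: word for word, i in word_integer.items()}

print([integer_word.get(i, ' ') for i in X_train[0]])

'''
Helper that converts an encoded review into a list of words
'''
def review_to_word(sentence):
    return [integer_word.get(i, ' ') for i in sentence]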

['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'and', 'and', 'their', 'becomes', 'and', 'had', 'and', 'of', 'lot', 'from', 'anyone', 'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it', 'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and', 'help', 'moments', 'or', 'of', 'every', 'and', 'and', 'movie', 'except', 'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'and', 'film', 'as', 'you', 'of', 'and', 'and', 'unfortunately', 'of', 'you', 'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was', 'and', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but', 'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to', 'of', 'and', 'and', 'with', 'heart', 'had', 'and', 'they', 'of', 'here', 'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why', 'what', 'have', 'and', 'they', 'is', 'you', 'that', "isn't", 'one', 'wi

In [45]:
# Using the review_to_word helper

print(review_to_word(X_train[0]))

['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'and', 'and', 'their', 'becomes', 'and', 'had', 'and', 'of', 'lot', 'from', 'anyone', 'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it', 'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and', 'help', 'moments', 'or', 'of', 'every', 'and', 'and', 'movie', 'except', 'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'and', 'film', 'as', 'you', 'of', 'and', 'and', 'unfortunately', 'of', 'you', 'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was', 'and', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but', 'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to', 'of', 'and', 'and', 'with', 'heart', 'had', 'and', 'they', 'of', 'here', 'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why', 'what', 'have', 'and', 'they', 'is', 'you', 'that', "isn't", 'one', 'wi
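
The decoded list above reads oddly because `imdb.load_data` offsets every word index by 3 by default and reserves 0, 1 and 2 for padding, start-of-sequence and unknown words. Below is a minimal sketch of an offset-aware decoder, assuming those defaults. The toy `word_index` stands in for `imdb.get_word_index()`, and the placeholder labels and the name `decode_review` are illustrative choices, not Keras names.

```python
# Toy vocabulary standing in for imdb.get_word_index() (hypothetical data).
word_index = {"the": 1, "and": 2, "a": 3, "of": 4, "to": 5}

INDEX_FROM = 3  # default offset applied by imdb.load_data (assumed here)
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # labels chosen for this sketch

# Shift the vocabulary by the same offset that was applied to the encoded reviews.
reverse_index = {i + INDEX_FROM: word for word, i in word_index.items()}


def decode_review(encoded):
    """Turn a list of encoded integers back into words, honouring the offset."""
    return [SPECIAL[i] if i in SPECIAL else reverse_index.get(i, "<unk>")
            for i in encoded]


encoded = [1, 4, 7, 6, 2, 0]
decoded = decode_review(encoded)
print(decoded)
```

With the real dataset, the same idea would amount to looking up `i - 3` in the reversed word index for every `i >= 3`.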

In [46]:
'''
The reviews in the dataset have different lengths and are stored as arrays of integers.
A neural network expects inputs of the same length, so we use the pad_sequences function from keras.preprocessing.
The equal-length integer reviews can then be stacked into a tensor for input to the neural network.
padding='post' appends zeros to the end of reviews shorter than maxlen.
Reviews longer than maxlen are truncated from the front (the default truncating='pre').
'''

max_review_words = 250

train_encode = sequence.pad_sequences(X_train, 
                                      padding='post', 
                                      maxlen=max_review_words)
test_encode = sequence.pad_sequences(X_test, 
                                     padding='post', 
                                     maxlen=max_review_words)
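
To make the padding rule concrete, here is a small pure-Python sketch that mimics what `pad_sequences` does with `padding='post'` and the default `truncating='pre'`. The function `pad_post` is a hypothetical stand-in written for illustration, and it assumes `maxlen` is a positive integer.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad short sequences at the end and cut long ones from the front."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                # truncating='pre': keep the last maxlen items
        kept += [value] * (maxlen - len(kept))    # padding='post': zeros go at the end
        padded.append(kept)
    return padded


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530]
result = pad_post([short_review, long_review], maxlen=4)
print(result)
```

Note that a long review loses its opening words, including the start marker 1, while a short review keeps everything and gains trailing zeros.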

In [47]:
# The length of reviews
len(train_encode[0]), len(test_encode[1])

(250, 250)

In [50]:
# The first movie review after padding
print(train_encode[0])

[  1  14  22  16  43 530 973   2   2  65 458   2  66   2   4 173  36 256
   5  25 100  43 838 112  50 670   2   9  35 480 284   5 150   4 172 112
 167   2 336 385  39   4 172   2   2  17 546  38  13 447   4 192  50  16
   6 147   2  19  14  22   4   2   2 469   4  22  71  87  12  16  43 530
  38  76  15  13   2   4  22  17 515  17  12  16 626  18   2   5  62 386
  12   8 316   8 106   5   4   2   2  16 480  66   2  33   4 130  12  16
  38 619   5  25 124  51  36 135  48  25   2  33   6  22  12 215  28  77
  52   5  14 407  16  82   2   8   4 107 117   2  15 256   4   2   7   2
   5 723  36  71  43 530 476  26 400 317  46   7   4   2   2  13 104  88
   4 381  15 297  98  32   2  56  26 141   6 194   2  18   4 226  22  21
 134 476  26 480   5 144  30   2  18  51  36  28 224  92  25 104   4 226
  65  16  38   2  88  12  16 283   5  16   2 113 103  32  15  16   2  19
 178  32   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   

**4. Building Model and Training**

In [51]:
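'''
To build a neural network we need to decide how many layers to use in the model
and how many hidden units each layer has.
'''

# defining parameters
vocab_size = 5000  # must be at least top_words (1000) so every word index has an embedding row
Max_Words_Review = 250  # same value as max_review_words used for padding
embedding_vector_length = 32 
epochs = 20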


model = Sequential()
# The first layer is an Embedding layer that maps each word index to a 32-dimensional vector.
model.add(Embedding(vocab_size, embedding_vector_length, input_length = Max_Words_Review))
# Add a convolutional layer with 32 feature maps that
# reads the embedded review 3 consecutive word vectors at a time.
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(LSTM(150))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
# The last layer is densely connected with a single output node.
# The sigmoid activation outputs a probability (a float value between 0 and 1).
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Loss function and optimizer
# Only the learning rate is changed; the other Adam settings keep their default values.
opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', 
                                               min_delta=0.0001, patience= 10, 
                                               verbose=0, mode='auto', 
                                               baseline=None, 
                                               restore_best_weights=True)

# Hold out the first 10,000 training reviews for validation; train on the rest.
x_val = train_encode[:10000]
train_set_x = train_encode[10000:]

y_val = y_train[:10000]
train_set_y = y_train[10000:]

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 250, 32)           160000    
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 250, 32)           3104      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 125, 32)           0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 125, 32)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 150)               109800    
_________________________________________________________________
dropout_7 (Dropout)          (None, 150)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)               
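
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, Conv1D and LSTM layers (with biases), with the layer sizes taken from the model definition above; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter spans kernel_size positions over all input channels, plus one bias.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


emb = embedding_params(vocab_size=5000, dim=32)
conv = conv1d_params(kernel_size=3, in_channels=32, filters=32)
lstm = lstm_params(input_dim=32, units=150)
print(emb, conv, lstm)
```

These values agree with the `Param #` column shown for `embedding_4`, `conv1d_3` and `lstm_3`.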

**5. Train the model for sentiment analysis**

In [52]:
# Train on the remaining 15,000 reviews and monitor val_loss on the held-out 10,000.
model.fit(x=np.array(train_set_x), y=np.array(train_set_y),
          epochs=epochs, 
          validation_data=(x_val, y_val),
          callbacks=[early_stopping], batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f0b61f2ae48>
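
The role of `patience` and `min_delta` in the `EarlyStopping` callback can be sketched in plain Python. This is a simplified model of the idea, not the Keras implementation: training stops once the validation loss has failed to improve by more than `min_delta` for `patience` consecutive epochs. The function name and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return (epoch at which training would stop, best loss seen).

    The epoch is None if the patience budget is never exhausted.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch, best
    return None, best


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48]
stopped_at, best_loss = early_stop_epoch(losses, patience=3, min_delta=0.0001)
print(stopped_at, best_loss)
```

With `patience=10` and only 20 epochs, as configured above, stopping requires ten consecutive epochs without improvement; `restore_best_weights=True` then rolls the model back to the epoch with the best `val_loss`.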

**6. Evaluate the model**

In [36]:
# evaluate returns [loss, accuracy] on the test set
results = model.evaluate(test_encode, y_test)

print(results)

[0.328000009059906, 0.8580800294876099]
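
The two numbers are the binary cross-entropy loss and the accuracy on the test set. As a rough illustration of how such values arise from sigmoid outputs, here is a pure-Python sketch on made-up labels and probabilities; it assumes the usual 0.5 decision threshold, and the function names and data are hypothetical.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and sigmoid outputs for four reviews.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.8]
loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print(loss, acc)
```

Read this way, the result above corresponds to a test loss of about 0.328 and a test accuracy of about 85.8%.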
