Notes taken from:
    
Learning Path: Deep Dive into Python Machine Learning, presented by Eder Santana, Chapter 4
    
by Ankita Thakur - Curator

Published by Packt Publishing, 2016

Learn what Deep Learning is. Recurrent Neural Networks: training a sentiment analysis model for text, using Keras

In [1]:
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility

from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.datasets import imdb  # import dataset

from theano import function

Using Theano backend.


In [2]:
'''
This code was borrowed and modified from https://github.com/fchollet

Train an LSTM on the IMDB sentiment classification task.
The dataset is actually too small for an LSTM to be of any advantage compared to simpler, much faster methods such as TF-IDF + LogReg.
Notes:
    - RNNs are tricky. Choice of batch size is important,
      choice of loss and optimizer is critical, etc.
      Some configurations won't converge.
    - LSTM loss decrease patterns during training can be quite different from what you see with CNNs/MLPs/etc.
    GPU command:
        THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python imdb
        
'''
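
The docstring above names TF-IDF + LogReg as a simpler baseline. As a rough illustration of the TF-IDF half only, here is a minimal standard-library sketch. The toy documents, the helper name tfidf, and the exact weighting (term frequency times log(N / df), with no smoothing) are assumptions made for illustration; libraries such as scikit-learn use smoothed variants.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document.

    weight = (count of word in doc / doc length) * log(N / df),
    where df is the number of documents that contain the word.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({w: (float(c) / len(toks)) * math.log(float(n_docs) / df[w])
                        for w, c in counts.items()})
    return weights

# Hypothetical toy reviews, not taken from the IMDB data.
toy_docs = ["a very good movie", "a very bad movie", "a good plot"]
toy_weights = tfidf(toy_docs)
```

A word that appears in every document (here "a") gets weight zero, while a word unique to one document (here "bad") gets the largest weight. A logistic regression would then be trained on these weights.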

In [3]:
max_features = 20000
maxlen = 100 # cut texts after this number of words (among top max_features)
batch_size = 32
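test_split = 0.2  # not used below; see the note on load_data

print("Loading data...")
# Passing test_split=0.2 did not work with the installed Keras version.
# Without it, load_data returns a 50/50 train/test split (25000 reviews each).
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)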
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


In [4]:
print("Pad sequences (Samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Pad sequences (Samples x time)
X_train shape: (25000, 100)
X_test shape: (25000, 100)
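
To make the padding step concrete, here is a small standard-library sketch of what sequence.pad_sequences does with its default settings: reviews longer than maxlen lose their earliest words, and shorter reviews are left-padded with zeros. The helper name pad_pre and the toy sequences are hypothetical; this is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Hypothetical toy sequences of word IDs.
toy = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
toy_padded = pad_pre(toy, maxlen=4)
```

Every row ends up with the same length, which is why X_train and X_test can be stored as (25000, 100) arrays.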


In [5]:
'''
Sample reviews from the full IMDb movie reviews dataset.

Negative review examples:
* Unfortunately it stays absurd the WHOLE time with no general narrative
* Even those from the era should be turned off.
* The cryptic dialogue would make Shakespeaere seem easy to a third grader

Positive review examples:
* I didn't know this came from Canada, but it is very good. Very good!
* I liked this movie a lot.  It really intrigued me how Deanna and Alicia
* When I saw the elaborate DVD box for this and the dreadful Red Queen f
  I felt certain I was in for a big disappointment, but surprise, surprise
  
'''

In [6]:
X_train[0]  # each word is encoded as an integer ID (a scalar index into the vocabulary)

array([ 1415,    33,     6,    22,    12,   215,    28,    77,    52,
           5,    14,   407,    16,    82, 10311,     8,     4,   107,
         117,  5952,    15,   256,     4,     2,     7,  3766,     5,
         723,    36,    71,    43,   530,   476,    26,   400,   317,
          46,     7,     4, 12118,  1029,    13,   104,    88,     4,
         381,    15,   297,    98,    32,  2071,    56,    26,   141,
           6,   194,  7486,    18,     4,   226,    22,    21,   134,
         476,    26,   480,     5,   144,    30,  5535,    18,    51,
          36,    28,   224,    92,    25,   104,     4,   226,    65,
          16,    38,  1334,    88,    12,    16,   283,     5,    16,
        4472,   113,   103,    32,    15,    16,  5345,    19,   178,    32], dtype=int32)
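
The IDs above are frequency ranks: in the Keras IMDB data a lower ID means a more frequent word. The sketch below shows one way such an encoding can be built. The helper names, the toy corpus, and the reserved IDs (0 for padding, 1 for start of review, 2 for out-of-vocabulary words) are assumptions modelled on the Keras defaults, not the dataset's actual preprocessing code.

```python
from collections import Counter

def build_word_index(texts, max_words):
    """Map the max_words most frequent words to IDs 3, 4, 5, ..."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 3 for i, w in enumerate(ranked[:max_words])}

def encode(text, word_index, start_id=1, oov_id=2):
    """Turn a text into a list of word IDs, starting with start_id."""
    return [start_id] + [word_index.get(w, oov_id) for w in text.lower().split()]

# Hypothetical toy corpus.
toy_texts = ["the movie was good", "the movie was bad", "the plot"]
toy_index = build_word_index(toy_texts, max_words=3)
toy_ids = encode("The movie was great", toy_index)
```

Words outside the kept vocabulary collapse to the single out-of-vocabulary ID, which is the role nb_words=max_features plays when loading the data.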

In [7]:
# Each word ID needs a vector representation; the Embedding layer learns one per word

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))

# Other options for word-to-vector representations:
# see piskvorky/gensim on GitHub (Python API for word to vector)
# see stanfordnlp/GloVe on GitHub

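model.add(LSTM(128))              # reads the embedded words in order, returns one 128-dim vector
model.add(Dropout(0.5))           # randomly zeroes half of the LSTM outputs during training
model.add(Dense(1))               # a single output unit for the binary sentiment label
model.add(Activation('sigmoid'))  # squashes the output into a probability in (0, 1)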


Build model...
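
As a sanity check on the size of the model built above, the number of trainable weights can be worked out by hand. The sketch assumes the standard LSTM layout (four gate/candidate blocks, each with input weights, recurrent weights, and a bias); the helper names are hypothetical, and Dropout and Activation contribute no weights.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four blocks (input, forget, output gates and the candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

param_counts = {
    "embedding": embedding_params(20000, 128),
    "lstm": lstm_params(128, 128),
    "dense": dense_params(128, 1),
}
total_params = sum(param_counts.values())
```

Under these assumptions the embedding table dominates: 2,560,000 of roughly 2.69 million weights.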


In [14]:
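# class_mode is a legacy argument from older Keras releases;
# drop it if the installed version rejects or warns about it.
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              class_mode="binary",
              metrics=["accuracy"])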
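
The loss chosen above, binary cross-entropy on a sigmoid output, can be written out in a few lines. This is a plain-Python sketch for a single example, with a hypothetical clipping constant eps to avoid log(0); it is not the backend implementation Keras uses.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# A positive review predicted confidently right vs. confidently wrong.
loss_right = binary_crossentropy(1, 0.9)
loss_wrong = binary_crossentropy(1, 0.1)
```

Confident wrong predictions are penalised much more heavily than confident right ones, which is what drives the sigmoid output towards the correct label.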



In [15]:
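# Compile a Theano function that maps the model input (word IDs)
# to the output of layer 0, the Embedding layer.
inp = model.input
embedding = model.layers[0].output
F = function([inp], embedding, allow_input_downcast=True)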
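
Conceptually, the function F compiled above is a table lookup: each word ID selects one row of the embedding matrix, so a (samples, time) batch of IDs becomes a (samples, time, dim) batch of vectors. The sketch below mimics that with nested lists. The table size, the random initial values, and the helper names are all hypothetical.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One small random vector per word ID (arbitrary illustrative values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every word ID in the batch with its row of the table."""
    return [[table[word_id] for word_id in row] for row in batch]

toy_table = make_embedding_table(vocab_size=10, dim=4)
toy_batch = [[1, 5, 5, 2]]           # shape (1, 4): one review, four word IDs
toy_out = embed(toy_batch, toy_table)  # shape (1, 4, 4)
```

The same word ID always maps to the same vector; during training it is the rows of the table that get updated.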

In [16]:
print(">> Input:")
print(X_train[:1])
print(">> Input shape:")
print(X_train[:1].shape)
print(">> Embedding:")
print(F(X_train[:1]))
print(">> Embedding shape:")
print(F(X_train[:1]).shape)

>> Input:
[[ 1415    33     6    22    12   215    28    77    52     5    14   407
     16    82 10311     8     4   107   117  5952    15   256     4     2
      7  3766     5   723    36    71    43   530   476    26   400   317
     46     7     4 12118  1029    13   104    88     4   381    15   297
     98    32  2071    56    26   141     6   194  7486    18     4   226
     22    21   134   476    26   480     5   144    30  5535    18    51
     36    28   224    92    25   104     4   226    65    16    38  1334
     88    12    16   283     5    16  4472   113   103    32    15    16
   5345    19   178    32]]
>> Input shape:
(1, 100)
>> Embedding:
[[[ 0.00810685  0.09325636 -0.07744114 ...,  0.07493792  0.09282391
   -0.02799342]
  [-0.02806837  0.00239264 -0.02356007 ...,  0.03474515  0.01536963
   -0.07854667]
  [-0.05315669 -0.03417091 -0.02350992 ..., -0.02584297 -0.03199266
   -0.0728862 ]
  ..., 
  [-0.0020438  -0.02712085 -0.03039971 ..., -0.0190105   0.04001851
   

In [17]:
print("Train...")
model.fit(X_train, y_train, batch_size=batch_size,
         nb_epoch=4, validation_data=(X_test, y_test))

score, acc = model.evaluate(X_test, y_test,
                           batch_size=batch_size)

print('Test score:', score)
print('Test accuracy:', acc)

# The final test results are for the held-out test set (the same data passed as validation_data above)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Test score: 0.862169817462
Test accuracy: 0.82844
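
The reported accuracy is the fraction of test reviews whose predicted probability falls on the correct side of a threshold. The sketch below assumes a threshold of 0.5 and uses made-up labels and probabilities; binary_accuracy here is a hypothetical helper, not the Keras metric itself.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples where (probability > threshold) matches the label."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if (p > threshold) == bool(t))
    return float(hits) / len(y_true)

# Made-up labels and predicted probabilities.
toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.7, 0.6]
toy_acc = binary_accuracy(toy_labels, toy_probs)

# Arithmetic on the figures reported above: 0.82844 of 25000 test reviews.
correct_reviews = int(round(0.82844 * 25000))
```

Read this way, a test accuracy of 0.82844 on 25000 reviews corresponds to 20711 correctly classified reviews.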


In [None]:
# See the IMDB scripts in the Keras examples folder for more examples