# Sentiment Analysis in Keras
In this notebook we train a neural network in Keras to predict the sentiment of a movie review, i.e. whether it is positive or negative. For this purpose we use the IMDB movie review dataset that ships with Keras. This problem has been widely studied and is one of the example use cases of Keras, which makes it useful for experimenting and learning.
- We rely on the examples provided by the Coursera course https://www.coursera.org/learn/ai/lecture/hQYsN/recurrent-neural-networks, https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ and http://www.samyzaf.com/ML/imdb/imdb.html
- Embeddings are used to represent words and provide them as input to the neural network: https://en.wikipedia.org/wiki/Embedding, https://en.wikipedia.org/wiki/Word_embedding, https://keras.io/layers/embeddings/
- Convolutional neural networks have been shown to give good results despite a small network structure: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/, https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/docs/1408.5882v2.pdf, http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- A more complete overview is available at http://cs224d.stanford.edu/syllabus.html

In [1]:
from keras.datasets import imdb

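# keep only the max_features most frequent words of the vocabulary
max_features = 20000
top_words = max_features
# passed as maxlen below: load_data discards longer reviews,
# which is why fewer than 25000 test reviews remain
max_review_length = 500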
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=top_words,
                                                      skip_top=0,
                                                      maxlen=max_review_length,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)

Using TensorFlow backend.


In [2]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(25000,)
(25000,)
(20947,)
(20947,)
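
The arguments passed to load_data above fix how each review is turned into a list of integers. The following is a simplified sketch of that convention as we understand it, not the Keras implementation: the word ranks in toy_word_rank are made up for illustration, and encode is a hypothetical helper.

```python
# Toy illustration of the index convention used by imdb.load_data.
# The word ranks below are invented; the real ones come from the dataset.
START_CHAR = 1   # marks the beginning of every review
OOV_CHAR = 2     # replaces words outside the kept vocabulary
INDEX_FROM = 3   # offset added to every word rank

toy_word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4, "sublime": 5}


def encode(words, num_words):
    """Encode a tokenised review, keeping only indices below num_words."""
    encoded = [START_CHAR]
    for word in words:
        index = toy_word_rank[word] + INDEX_FROM
        encoded.append(index if index < num_words else OOV_CHAR)
    return encoded


review = ["the", "movie", "was", "sublime"]
print(encode(review, num_words=20000))  # every word is kept
print(encode(review, num_words=8))      # "sublime" (index 8) becomes OOV_CHAR
```

With num_words=20000 as in this notebook, only words ranked beyond the kept vocabulary are replaced by the out-of-vocabulary index.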


Since the reviews have different lengths and our network expects a fixed-size input, we use padding to bring every sequence to the same length: a sequence longer than the set size is truncated, while a shorter one is filled with zeros up to the set size.

In [3]:
from keras.preprocessing import sequence

x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
print(x_train.shape)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)
print(x_test.shape)

(25000, 500)
(20947, 500)
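
To make the padding behaviour concrete, here is a minimal pure-Python sketch. It assumes the pad_sequences defaults, which pad and truncate at the start of the sequence; pad_pre is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a list of ints to exactly maxlen items."""
    if len(seq) >= maxlen:
        # keep the last maxlen tokens, dropping the beginning of the review
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_pre(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, maxlen=5))   # [22, 16, 43, 530, 973]
```

Padding at the start keeps the actual words next to the end of the sequence, which is where the LSTM produces the state used for classification.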


For the model we use:
1. An input Embedding layer using 128-length vectors to represent each word.
2. An LSTM layer with 128 memory units.
3. A Dense output layer with a single neuron, whose sigmoid activation gives a value between 0 and 1 that determines the class of the review (1 for positive, 0 for negative).
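
As an illustration of the last step, the sketch below shows how a single sigmoid output can be turned into a class label. The 0.5 threshold and the helper names are assumptions made for this example; the model itself only outputs the value between 0 and 1.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Return 1 (positive) if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 3.0):
    print("z = %+.1f  sigmoid = %.3f  label = %d" % (z, sigmoid(z), predict_label(z)))
```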

In [4]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_vector_length = 128

model = Sequential()
# input layer: maps each word index to a vector of length embedding_vector_length
model.add(Embedding(input_dim=max_features,
                    output_dim=embedding_vector_length))
# hidden layer
model.add(LSTM(units=128
               # optional regularisation, disabled here:
               #dropout=0.2,
               #recurrent_dropout=0.2
              ))

# output layer
model.add(Dense(1,
                activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              #optimizer='adam',
              optimizer='sgd',
              metrics=['accuracy'])

# summary() prints the table itself and returns None, so no print is needed
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
=================================================================
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
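
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM formulation with four weight blocks (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias; the variable names are chosen for this example.

```python
vocab_size = 20000    # max_features
embedding_dim = 128   # length of each word vector
lstm_units = 128

# one embedding_dim-long vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# four blocks, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print("embedding = %d, lstm = %d, dense = %d" % (embedding_params, lstm_params, dense_params))
print("total = %d" % total_params)
```

Almost all of the parameters sit in the embedding layer, so the size of the vocabulary dominates the size of the model.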


Set the seed and train the model:

In [5]:
# fix random seed for reproducibility
# (note: the model was already built in the previous cell,
# so this call does not control its initial weights)
import numpy
numpy.random.seed(7)

# define the batch size, i.e. the number of observations processed between two weight updates
# this is a tradeoff: larger batches give less noisy gradient estimates
# but need more memory per update,
# whereas smaller batches are lighter to handle and give more frequent, noisier updates
batch_size = 32

model.fit(x_train,
          y_train,
          validation_data=(x_test, y_test),
          epochs=1,
          batch_size=batch_size)

Train on 25000 samples, validate on 20947 samples
Epoch 1/1


<keras.callbacks.History at 0x11c94d5d0>
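
To get a feeling for what the batch size means in practice, the following sketch counts how many weight updates one epoch over the 25000 training reviews involves. updates_per_epoch is a hypothetical helper, and the alternative batch sizes are only there for comparison.

```python
def updates_per_epoch(n_samples, size):
    """Number of batches (weight updates) needed to see every sample once."""
    return (n_samples + size - 1) // size  # ceiling division


n_train = 25000
for size in (16, 32, 128):
    print("batch size = %d  ->  %d updates per epoch"
          % (size, updates_per_epoch(n_train, size)))
```

With the batch size of 32 used here, a single epoch corresponds to 782 updates, the last one on a smaller batch.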

Evaluate the model (evaluate returns the loss followed by the metrics passed to compile, here the accuracy):

In [None]:
loss, accuracy = model.evaluate(x_train, y_train, batch_size=batch_size)
print("Training: accuracy = %f  ;  loss = %f" % (accuracy, loss))

In [6]:
loss, accuracy = model.evaluate(x_test, y_test, batch_size=batch_size)
print("Testing: accuracy = %f  ;  loss = %f" % (accuracy, loss))



[0.6927631945605949, 0.5335847615438776]
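
Reading the output above as [loss, accuracy], a useful reference point is the loss of a model that always outputs 0.5. The sketch below computes it with a hand-written binary cross-entropy; the function is defined only for this illustration.

```python
import math


def binary_crossentropy(y_true, p):
    """Cross-entropy of one predicted probability p for a label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# an uninformative model outputs 0.5 regardless of the review
chance_loss = binary_crossentropy(1, 0.5)
print("loss when always predicting 0.5: %f" % chance_loss)  # 0.693147
```

The reported loss of about 0.6928 is very close to this value (ln 2), and the accuracy of about 0.53 is close to what guessing would give on roughly balanced classes. This suggests that a single epoch with the sgd optimizer has barely moved the model away from chance, so more epochs or the commented-out adam optimizer are probably worth trying.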

Save the model:

In [7]:
model.save("imdb.h5")
!ls

imdb.h5       imdb_sa.ipynb


The saved model can then be reloaded as:

In [8]:
from keras.models import load_model
model = load_model("imdb.h5")