### LSTM with Keras: sentiment analysis

Our network takes in a sentence (a sequence of words) and outputs a sentiment value (positive or
negative). Our training set is a dataset of about 7,000 short sentences from the UMICH SI650 sentiment
classification competition on Kaggle (https://inclass.kaggle.com/c/si650winter11). Each sentence is labeled 1 or
0 for positive or negative sentiment respectively, which our network will learn to predict.

In [1]:
from keras.layers.core import Activation, Dense, Dropout, SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import codecs

#For TensorBoard
from tensorflow.keras.callbacks import TensorBoard
from time import gmtime, strftime
import datetime
import tensorflow as tf
import time

Using TensorFlow backend.


**Create folder for TensorBoard**

In [2]:
NAME = "lstmumich{}".format(int(time.time()))
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))

**Read Data and Generate vocabulary**

In [3]:
INPUT_FILE = "data/umich-sentiment-train.txt"
ftrain = codecs.open(INPUT_FILE, "r", encoding='ascii', errors='ignore')

# Read training data and generate vocabulary
maxlen = 0
word_freqs = collections.Counter()
num_recs = 0

for line in ftrain:
    labels, sentence = line.strip().split("\t")
    words = nltk.word_tokenize(sentence.lower())
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        word_freqs[word] += 1
    num_recs += 1
ftrain.close()

In [172]:
num_recs

7086

**Estimates for our Corpus**

In [134]:
## Get some information about our corpus
print(maxlen)            # 42
print(len(word_freqs))   # 2311

42
2311


Using the number of unique words _len(word_freqs)_ as a guide, we fix our vocabulary size and
treat all remaining words as **out of vocabulary (OOV) words**, replacing them with the pseudo-word
UNK (for unknown). At prediction time, this also lets us handle previously unseen words as
OOV words.

The maximum number of words in a sentence (maxlen) guides our choice of a fixed sequence length: we zero-pad
shorter sentences and truncate longer sentences to that length as appropriate. Even though RNNs
handle variable sequence lengths, this is usually achieved either by padding and truncating as above,
or by grouping the inputs in different batches by sequence length. We will use the former approach
here. For the latter approach, Keras recommends using batches of size one (for more information
refer to: https://github.com/fchollet/keras/issues/40).

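**Based on the preceding estimates,** we set our vocabulary size (_vocab_size_ in the code) to 2002. This is 2000 words from our
vocabulary (_MAX_FEATURES_) plus the UNK pseudo-word and the PAD pseudo-word (used for padding sentences to a
fixed number of words). The fixed sentence length is 40 words, given by _MAX_SENTENCE_LENGTH_; since the longest
sentence has 42 words, sentences longer than 40 words are truncated.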

In [4]:
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

Next we need a pair of lookup tables. Each row of input to the RNN is a sequence of word indices,
where the indices are ordered from most frequent to least frequent word in the training set. The two
lookup tables allow us to look up an index given the word and the word given the index. This includes
the PAD and UNK pseudo-words as well.

In [5]:
vocab_size = min(MAX_FEATURES, len(word_freqs)) + 2
word2index = {x[0]: i+2 for i, x in
              enumerate(word_freqs.most_common(MAX_FEATURES))}
word2index["PAD"] = 0
word2index["UNK"] = 1
index2word = {v:k for k, v in word2index.items()}
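
To make the role of the two lookup tables concrete, here is a minimal sketch on a toy corpus. The names `toy_sentences`, `TOY_MAX_FEATURES`, and `encode` are hypothetical and exist only for this illustration, and plain whitespace splitting stands in for `nltk.word_tokenize`. It shows how a word outside the retained vocabulary falls back to the UNK index.

```python
import collections

# Toy corpus standing in for the training sentences (illustrative only).
toy_sentences = ["i love this movie", "i love harry potter", "i hate this"]
TOY_MAX_FEATURES = 3  # hypothetical cap, analogous to MAX_FEATURES

toy_freqs = collections.Counter()
for s in toy_sentences:
    toy_freqs.update(s.split())

# Indices 0 and 1 are reserved for PAD and UNK, so real words start at 2.
toy_word2index = {w: i + 2
                  for i, (w, _) in enumerate(toy_freqs.most_common(TOY_MAX_FEATURES))}
toy_word2index["PAD"] = 0
toy_word2index["UNK"] = 1
toy_index2word = {v: k for k, v in toy_word2index.items()}


def encode(sentence, word2index):
    """Map each whitespace-separated token to its index, falling back to UNK."""
    return [word2index.get(w, word2index["UNK"]) for w in sentence.lower().split()]


encoded = encode("I love dinosaurs", toy_word2index)
decoded = [toy_index2word[i] for i in encoded]
print(encoded, decoded)
```

The most frequent word ("i") receives index 2, while "dinosaurs", which never appears in the toy corpus, is encoded as 1 and decoded back as UNK.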

Next, we convert our input sentences to word index sequences and pad them to MAX_SENTENCE_LENGTH
words. Since our output label in this case is binary (positive or negative sentiment), we don't need to
process the labels:

In [6]:
# convert sentences to sequences
X = np.empty((num_recs, ), dtype=list)
y = np.zeros((num_recs, ))
i = 0
ftrain = codecs.open(INPUT_FILE, "r", encoding='ascii', errors='ignore')
for line in ftrain:
    label, sentence = line.strip().split("\t")
    words = nltk.word_tokenize(sentence.lower())
    seqs = []
    for word in words:
        # Words outside the vocabulary map to the UNK index
        seqs.append(word2index.get(word, word2index["UNK"]))
    X[i] = seqs
    y[i] = int(label)
    i += 1
ftrain.close()


**Pad the sequences**

In [7]:
# Pad the sequences (left padded with zeros)
X = sequence.pad_sequences(X, maxlen=MAX_SENTENCE_LENGTH)
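
With its default arguments, `pad_sequences` pads on the left and, for sequences that are too long, drops items from the front. The following pure-Python sketch mimics that default behavior on toy data so the effect is easy to see. The helper name `pad_or_truncate` is hypothetical and is not part of Keras.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])
    return [pad_value] * (maxlen - len(seq)) + list(seq)


toy = [[5, 8, 2], [7, 3, 9, 4, 6, 1]]
padded = [pad_or_truncate(s, maxlen=4) for s in toy]
print(padded)  # [[0, 5, 8, 2], [9, 4, 6, 1]]
```

Because index 0 is reserved for PAD, the padding never collides with a real word index.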

**Split input into training and test**

In [8]:
# Split input into training and test
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, 
                                                random_state=42)

print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)

(5668, 40) (1418, 40) (5668,) (1418,)


The following diagram shows the structure of our RNN:
<img src="lstm_struct_model.JPG">

The **input for each row** is a sequence of word indices. The sequence length is given by
MAX_SENTENCE_LENGTH. The **first dimension of the tensor** is set to _None_ to indicate that the batch size (the
number of records fed to the network each time) is unknown at definition time; it is
specified at run time using the batch_size parameter. So assuming an as-yet undetermined batch
size, the shape of **the input tensor** is (None, MAX_SENTENCE_LENGTH). These tensors are fed into an
**embedding layer** of size EMBEDDING_SIZE whose **weights are initialized** with small random values and
learned during training. The embedding layer transforms the tensor to shape (None, MAX_SENTENCE_LENGTH,
EMBEDDING_SIZE). The **output of the embedding layer** is fed into an LSTM with sequence length
MAX_SENTENCE_LENGTH and output layer size HIDDEN_LAYER_SIZE. With return_sequences=True, **the output of the LSTM**
would be a tensor of shape (None, MAX_SENTENCE_LENGTH, HIDDEN_LAYER_SIZE). By default (return_sequences=False),
the LSTM outputs only a single tensor of shape (None, HIDDEN_LAYER_SIZE) from its last time step. This is fed **to a dense layer with
output size** of 1 with a sigmoid activation function, so it outputs a value between 0 and 1, which we round to
0 (negative review) or 1 (positive review).
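
As a sanity check on the architecture, the sketch below traces the tensor shapes and counts the trainable parameters by hand, using this notebook's constants (a vocabulary of 2002 entries, sentence length 40, embedding size 128, hidden size 64). It assumes the Embedding layer receives a 2-D tensor of integer indices and that the LSTM uses the standard four-gate parameterization with biases. The variable names are chosen for this illustration only.

```python
# Values taken from this notebook's constants and corpus statistics.
VOCAB_SIZE = 2002
MAX_SENTENCE_LENGTH = 40
EMBEDDING_SIZE = 128
HIDDEN_LAYER_SIZE = 64

# Shape of the tensor leaving each stage (None is the batch dimension).
shapes = [
    ("input", (None, MAX_SENTENCE_LENGTH)),
    ("embedding", (None, MAX_SENTENCE_LENGTH, EMBEDDING_SIZE)),
    ("lstm", (None, HIDDEN_LAYER_SIZE)),  # return_sequences=False
    ("dense", (None, 1)),
]

# One embedding vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_SIZE
# Four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (HIDDEN_LAYER_SIZE * (EMBEDDING_SIZE + HIDDEN_LAYER_SIZE)
                   + HIDDEN_LAYER_SIZE)
# One weight per hidden unit plus one bias.
dense_params = HIDDEN_LAYER_SIZE * 1 + 1
total_params = embedding_params + lstm_params + dense_params

for name, shape in shapes:
    print(name, shape)
print(embedding_params, lstm_params, dense_params, total_params)
```

These hand-computed counts can be compared against the output of `model.summary()` once the model has been built.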


**Constants of the Model**

In [9]:
EMBEDDING_SIZE = 128
HIDDEN_LAYER_SIZE = 64
BATCH_SIZE = 32
NUM_EPOCHS = 10

**Model**

We compile the model using the binary cross-entropy loss function since it predicts a binary value,
and the Adam optimizer, a good general purpose optimizer. Note that the hyperparameters
EMBEDDING_SIZE, HIDDEN_LAYER_SIZE, BATCH_SIZE and NUM_EPOCHS were tuned
experimentally over several runs:

In [10]:
model = Sequential()

model.add(Embedding(vocab_size, 
                    EMBEDDING_SIZE,
                    input_length=MAX_SENTENCE_LENGTH))

model.add(SpatialDropout1D(0.2))

model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1))
model.add(Activation("sigmoid"))

model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
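
To illustrate what the binary cross-entropy loss measures, here is a small pure-Python sketch that averages the loss over a batch of sigmoid outputs. The function name and the clipping constant `eps` are assumptions made for this example; this is not the Keras implementation itself.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip to avoid log(0) for predictions of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


confident = binary_crossentropy([1, 0], [0.9, 0.1])  # correct and confident
wrong = binary_crossentropy([1, 0], [0.1, 0.9])      # confident but wrong
print(round(confident, 4), round(wrong, 4))
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily, which is why this loss suits a single sigmoid output.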



**Train the model**

We then train the network for 10 epochs (NUM_EPOCHS) with a batch size of 32 (BATCH_SIZE). At each epoch we
validate the model using the test data:

In [12]:
history = model.fit(Xtrain, ytrain, 
                    batch_size=BATCH_SIZE, 
                    epochs=NUM_EPOCHS,
                    callbacks=[tensorboard],                #for plot in TensorBoard
                    validation_data=(Xtest, ytest))

Train on 5668 samples, validate on 1418 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Validation plots for Loss and Accuracy** 
<img src="lstm_umich.JPG">

**Structure of the Neural Network Model**
<img src="network_lstm.JPG">

As you can see from the results, we get close to 99% accuracy on the test set. The predictions the model makes
for the random sample shown here match the labels exactly, although this is not the case for all predictions:

In [23]:
# evaluate
score, acc = model.evaluate(Xtest, ytest, batch_size=BATCH_SIZE)
print("Test score: %.3f, accuracy: %.3f" % (score, acc))

for i in range(30):
    idx = np.random.randint(len(Xtest))
    xtest = Xtest[idx].reshape(1,MAX_SENTENCE_LENGTH)
    ylabel = ytest[idx]
    ypred = model.predict(xtest)[0][0]
    sent = " ".join([index2word[x] for x in xtest[0].tolist() if x != 0])
    print("%.0f\t%d\t%s" % (ypred, ylabel, sent))

Test score: 0.066, accuracy: 0.991
1	1	da vinci code was an awesome movie ...
0	0	da vinci code = up , up , down , down , left , right , left , right , b , a , suck !
1	1	very da vinci code slash amazing race .
1	1	harry potter is awesome i do n't care if anyone says differently ! ..
1	1	i love harry potter..
1	1	i either love brokeback mountain or think it 's great that homosexuality is becoming more acceptable ! :
0	0	this quiz sucks and harry potter sucks ok bye..
0	0	then we drove to bayers lake for the da vinci code , which as expected , tom hanks sucks ass in that movie , but the dramatic last 2 minutes were good .
1	1	i love harry potter .
1	1	i want to be here because i love harry potter , and i really want a place where people take it serious , but it is still so much fun .
1	1	i love harry potter .
1	1	so as felicia 's mom is cleaning the table , felicia grabs my keys and we dash out like freakin mission impossible .
1	1	and i like brokeback mountain .
1	1	the da vinci code i