# Deep Learning for Sentiment Analysis

This notebook shows how a Recurrent Neural Network (RNN) with the Long Short-Term Memory (LSTM) architecture can be implemented using [Keras](https://keras.io/). Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.

In this notebook, we experiment with an LSTM to perform sentiment analysis on movie reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), better known as the IMDB dataset.

In this task, given a movie review, the model attempts to predict whether it is positive or negative. This is a binary classification task.

In [1]:
import numpy as np
# from keras.datasets import imdb
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Activation, Embedding, Bidirectional, LSTM, Input, merge, SpatialDropout1D, Lambda, Layer
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import BatchNormalization
# from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from nltk.corpus import stopwords
from collections import Counter
from keras import backend as K
from keras.initializers import Constant
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import pandas as pd
import re
import copy
import time
import random
import nltk

Using TensorFlow backend.




## Data

As previously mentioned, we shall train an LSTM recurrent neural network on the Large Movie Review Dataset.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

The data was collected by Stanford researchers and was used in a 2011 paper in which the data was split 50-50 into training and test sets. **An accuracy of 88.89% was achieved.**

### Keras advantages:
As stated earlier, Keras was built with a focus on fast experimentation and prototyping. Hence, Keras provides built-in access to the IMDB dataset!

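The **imdb.load_data()** function allows you to load the dataset in a format that is ready for use in neural network and deep learning models. (In the cells below that loader is commented out; the reviews are instead read from CSV files and encoded by hand.)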

The words have been replaced by integers that indicate the ordered frequency of each word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.
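
As a rough illustration of this frequency-rank encoding, here is a minimal sketch in plain Python. The helper names `build_rank` and `encode` and the toy reviews are assumptions made for this example; they are not part of Keras or of this notebook's pipeline.

```python
from collections import Counter

def build_rank(texts, top_words=None):
    """Map each word to its frequency rank (0 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() orders words by descending count
    ranked = counts.most_common(top_words)
    return {word: i for i, (word, _) in enumerate(ranked)}

def encode(text, rank):
    """Replace known words by their rank and skip words outside the vocabulary."""
    return [rank[w] for w in text.split() if w in rank]

# Toy data (hypothetical): "the" occurs 3 times, "was" twice, the rest once
reviews = ["the movie was great", "the plot was thin", "the end"]
rank = build_rank(reviews, top_words=2)
print(rank)                                  # {'the': 0, 'was': 1}
print(encode("the movie was great", rank))   # [0, 1]
```

A review thus becomes a list of small integers, with the most frequent words receiving the smallest values.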

In [2]:
def clean_review(raw_review, remove_symbol = True, remove_stopwords = False, output_format = "string"):
    """
    Input:
            raw_review: raw text of a movie review
            remove_symbol: a boolean variable to indicate whether to replace non-letter characters with spaces
            remove_stopwords: a boolean variable to indicate whether to remove stop words
            output_format: if "string", return a cleaned string 
                           if "list", a list of words extracted from cleaned string.
    Output:
            Cleaned string or list.
    """
    
    # Remove HTML markup; naming the parser explicitly avoids the bs4 "no parser specified" warning
    text = BeautifulSoup(raw_review, "lxml").get_text()
    
    # Keep only letters
    if remove_symbol:
        text = re.sub("[^a-zA-Z]", " ", text)
    
    # Split words and store to list
    text = text.lower().split()
    
    if remove_stopwords:
    
        # Use set as it has O(1) lookup time
        stops = set(stopwords.words("english"))
        words = [w for w in text if w not in stops]
    
    else:
        words = text
    
    # Return a cleaned string or list
    if output_format == "string":
        return " ".join(words)
        
    elif output_format == "list":
        return words
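
To see what the cleaning step does without installing BeautifulSoup or NLTK, the following standard-library sketch approximates it. The names `_TextExtractor`, `strip_html` and `simple_clean` are hypothetical helpers written for this illustration, and the sample review is made up; stop-word removal is left out.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

def simple_clean(raw):
    # Replace everything that is not a letter with a space, then normalise case and spacing
    text = re.sub("[^a-zA-Z]", " ", strip_html(raw))
    return " ".join(text.lower().split())

print(simple_clean("Great film!<br /><br />Loved it, 10/10."))  # great film loved it
```

Digits and punctuation disappear along with the markup, so a rating such as "10/10" written inside the review text is not visible to the model.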

In [3]:
# # fix random seed for reproducibility
# np.random.seed(7)
# # load the dataset but only keep the top n words, zero the rest
# top_words = 5000
# (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [4]:
train_data = pd.read_csv("TrainSet.csv", header=0, delimiter=",", quoting=0)
# print(train_data[0])

In [5]:
train_list = []
test_list = []
train_rate = []
test_rate = []
word2vec_input = []
pred = []

train_data = pd.read_csv("TrainSet1.csv", header=0, delimiter=",", quoting=0)
test_data = pd.read_csv("TestSet1.csv", header=0, delimiter=",", quoting=0)

# vector_type = "Word2vec"
# Extract words from reviews
for i in range(0, len(train_data.review)):

    # Append raw texts rather than lists as Count/TFIDF vectorizers take raw texts as inputs
    train_list.append(clean_review(train_data.review[i]))
    train_rate.append(train_data.rating[i])
    if i%1000 == 0:
        print ("Cleaning training review", i)

for i in range(0, len(test_data.review)):

    # Append raw texts rather than lists as Count/TFIDF vectorizers take raw texts as inputs
    test_list.append(clean_review(test_data.review[i]))
    test_rate.append(test_data.rating[i])
    if i%1000 == 0:
        print ("Cleaning test review", i)








Cleaning training review 0
Cleaning training review 1000
Cleaning training review 2000
Cleaning training review 3000
Cleaning training review 4000
Cleaning test review 0
Cleaning test review 1000


In [6]:
top_words = 5000
unigram = []
for i in range(len(train_list)):
    unigram += train_list[i].split()
unigram = Counter(unigram)

# print(unigram)
unigram = sorted(unigram.items(),key = lambda x:x[1],reverse = True)

rank5000 = {k:i for i,(k,v) in enumerate(unigram) if i < top_words}
rank = {k:i for i,(k,v) in enumerate(unigram)}
# print(rank[defencei])

In [7]:
print(train_list[0])



### Each word is represented by its POS tag if it is one of the top 5000 words

In [8]:
def sent2tags(sentence):
    """ Returns a vector of tag-classes from a given
        sentence.
    """
    tags = word_tokenize(sentence)
    tags = nltk.pos_tag(tags)
    out = []

    for _, tag in tags[:500]:
        if tag in postags:
            out.append(t2k.get(tag))
        else:
            out.append(t2k.get("UKN"))
    return out

In [9]:
postags = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS",
           "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP",
           "PRP$", "RB", "RBR", "RBS", "RP", "TO", "UH", "VB", "VBD", "VBG",
           "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "UKN"]
nb_postags = len(postags)

t2k = dict([(v,k) for k, v in enumerate(postags)])
k2t = dict([(k,v) for k, v in enumerate(postags)])

print(nb_postags)

36
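
The tag-to-index lookup with its "UKN" fallback can be sketched on its own, without NLTK. The five-tag list below is a hypothetical subset chosen for brevity, and `tags_to_indices` is an illustrative helper rather than a function used elsewhere in this notebook.

```python
# Hypothetical subset of the tag list; "UKN" is the catch-all entry
postags_demo = ["DT", "JJ", "NN", "VB", "UKN"]
tag_to_index = {tag: i for i, tag in enumerate(postags_demo)}

def tags_to_indices(tags, table, unknown="UKN"):
    """Look up each tag, falling back to the index of the unknown tag."""
    return [table.get(tag, table[unknown]) for tag in tags]

# "SYM" is not in the table, so it maps to the index of "UKN"
print(tags_to_indices(["DT", "NN", "SYM"], tag_to_index))  # [0, 2, 4]
```

Passing the fallback as the default of `dict.get` avoids the separate membership test used in `sent2tags`.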


In [10]:
word2tag = {}
# for i,(k,v) in enumerate(rank5000.items()):
for i,(k,v) in enumerate(rank.items()):
#     print(k)
    tags = nltk.pos_tag([k])
#     print(i,tags)
    for _, tag in tags:
        if tag in postags:
            word2tag[k] = t2k.get(tag)
        else:
            word2tag[k] = t2k.get("UKN")
# print(word2tag)

### Each word is represented by its index in the term-frequency ranking

In [11]:
temp_train_list = copy.deepcopy(train_list)
tag_train_list = [[] for i in range(len(train_list))]
for i in range(len(temp_train_list)):
    temp_train_list[i] = temp_train_list[i].split()
    for j in range(len(temp_train_list[i])):
        try:
            temp = temp_train_list[i][j]
            temp_train_list[i][j] = rank5000[temp_train_list[i][j]]
            tag_train_list[i].append(word2tag[temp])
        except KeyError:
            temp = temp_train_list[i][j]
            temp_train_list[i][j] = 5001
            tag_train_list[i].append(word2tag[temp])
    temp_train_list[i] = list(filter(lambda x:x<=5000,temp_train_list[i]))
print(temp_train_list[0])
print(len(temp_train_list[0]))
print(tag_train_list[0])
print(len(tag_train_list[0]))

[1108, 1641, 6, 356, 150, 18, 42, 1013, 42, 135, 0, 209, 700, 9, 44, 140, 72, 3, 2, 216, 1, 254, 4, 528, 34, 413, 27, 969, 8, 2382, 3, 717, 230, 2462, 5, 22, 102, 12, 3, 849, 2462, 7, 1, 1267, 33, 42, 92, 9, 6, 21, 4, 528, 397, 13, 0, 97, 236, 37, 0, 3172, 1, 180, 3027, 167, 0, 12, 7, 33, 1034, 41, 7, 1619, 134, 243, 16, 142, 626, 13, 0, 2383, 9, 0, 134, 5, 36, 11, 1569, 11, 2, 1619, 8, 25, 453, 313, 28, 0, 437, 386, 8, 3, 60, 1, 773, 42, 82, 234, 9, 93, 629, 62, 187, 37, 6, 122, 1672, 1, 1939, 221, 9, 210, 8, 85, 19, 59, 44, 210, 438, 4, 5, 0, 367, 1, 23, 9, 38, 1, 2706, 0, 97, 27, 2756, 2707, 5, 183, 75, 1039, 9, 264, 5, 454, 757, 1344, 288, 13, 7, 1877, 49, 118, 1, 28, 638, 147, 0, 5, 345, 4, 1313, 6, 15, 264, 5, 165, 1261, 757, 18, 21, 507, 1940, 21, 270, 206, 235, 4766, 430, 78, 139, 96, 357, 9, 4, 80, 226, 910, 21, 358, 1722, 8, 2, 23, 14, 602, 6, 7, 178, 4, 2193, 2, 703, 393, 10, 91, 6, 2158, 170, 4382, 1673, 55, 8, 28, 568, 3614, 1570, 4, 4280, 1354, 242, 352, 3446, 24, 149, 17

In [12]:
temp_test_list = copy.deepcopy(test_list)
tag_test_list = [[] for i in range(len(test_list))]
for i in range(len(temp_test_list)):
    temp_test_list[i] = temp_test_list[i].split()
    for j in range(len(temp_test_list[i])):
        try:
            temp = temp_test_list[i][j]
            temp_test_list[i][j] = rank5000[temp_test_list[i][j]]
            tag_test_list[i].append(word2tag[temp])
        except KeyError:
            temp = temp_test_list[i][j]
            temp_test_list[i][j] = 5001
#             print(i,j)
            # Words never seen in training have no tag entry; fall back to "UKN"
            tag_test_list[i].append(word2tag.get(temp, t2k["UKN"]))
                
    temp_test_list[i] = list(filter(lambda x:x<=5000,temp_test_list[i]))
print(temp_test_list[0])
print(len(temp_test_list[0]))
print(tag_test_list[0])
print(len(tag_test_list[0]))

[42, 47, 50, 1346, 2486, 520, 3, 36, 1236, 2486, 340, 4, 4305, 4849, 21, 4305, 391, 91, 446, 21, 55, 1559, 9, 0, 200, 1380, 1191, 260, 81, 149, 1380, 1191, 67, 18, 3, 222, 66, 104, 1, 0, 780, 3, 1251, 4712, 1, 1191, 496, 0, 2863, 10, 28, 150, 18, 697, 42, 954, 9, 137, 1380, 1191, 6, 50, 17, 152, 700, 4, 185, 4, 68, 81, 4585, 2496, 617, 8, 33, 59, 74, 140, 617, 137, 1900, 0, 12, 4, 17, 8, 2435, 4, 12, 883, 14, 5, 206, 934, 2009, 167, 619, 0, 1380, 1191, 67, 14, 22, 5, 206, 22, 1, 0, 3, 46, 21, 26, 20, 186, 28, 150, 213, 21, 130, 17, 4288, 4, 4251, 93, 629, 3, 1352, 60, 313, 0, 103, 164, 1, 14, 22, 11, 6, 7, 183, 2463, 1220, 50, 17, 2, 1150, 128, 185, 8, 0, 103, 272, 5, 143, 3115, 382, 21, 140, 29, 1624, 207, 104, 1380, 286, 8, 0, 12, 7, 1367, 1808, 266, 4, 4478, 22, 1192, 3546, 170, 184, 0, 416, 23, 5, 77, 32, 40, 2, 10, 1380, 4, 64, 141, 28, 3546, 1877, 8, 2, 1125, 866, 8, 14, 1511, 803, 2624, 20, 20, 20, 15, 340, 34, 2, 58, 3, 1083, 3405, 59, 213, 31, 148, 363, 4, 17, 4396, 406, 4, 2,

In [13]:
print(rank5000["not"])

20


## Word Embeddings

The vocabulary of words in all the reviews is very large. Mere one-hot encoding of individual words will lead to an extremely sparse dataset.

Hence we will use word embeddings, a technique where words are encoded as dense real-valued vectors, such that similarity in meaning between words translates to closeness in the vector space. This reduces the sparsity of the data and gives each word a richer representation than a mere present-or-absent flag. For more information on word embeddings, check out my experiments with Word Embeddings on Harry Potter and Game of Thrones [here](https://github.com/darshanbagul/Word_Embeddings)!

In this notebook we won't have to use Gensim or create an embedding network from scratch. This is another advantage of using Keras: we just include another layer after the input in our model to generate embeddings of the input words. Keras provides a convenient way to convert positive integer representations of words into word embeddings via an **Embedding layer.**
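
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The sketch below mimics only the lookup in plain Python; the random initial values and the helper names `make_embedding_table` and `embed` are assumptions for illustration, and no training takes place here.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One small random vector per word index (illustrative initial values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(word_ids, table):
    """Replace every integer id by its row in the table."""
    return [table[i] for i in word_ids]

table = make_embedding_table(vocab_size=5000, dim=32)
vectors = embed([1108, 1641, 6], table)   # first three ids of a review
print(len(vectors), len(vectors[0]))      # 3 32
```

A review of 500 word ids therefore turns into a 500 x 32 matrix, which is what the convolutional and recurrent layers consume.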

We will map each word onto a **32-dimensional real-valued vector**. We will also limit the vocabulary to the **5000 most frequent words and drop the rest** (the preprocessing above filters out-of-vocabulary words). Finally, the sequence length (number of words) in each review varies, so we will **constrain each review to 500 words,** truncating longer reviews and padding shorter ones with zeros. (An alternative would be a dynamic RNN, but that is something I shall experiment with later.)
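
The truncate-and-pad step can be sketched in a few lines. This mirrors the default behaviour of Keras `pad_sequences`, which pads and truncates at the front of the sequence; `pad_pre` is a hypothetical helper written for this illustration and assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]            # truncate from the front
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([1, 2, 3], 5))             # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))    # [3, 4, 5, 6]
```

One thing to keep in mind: in this notebook the most frequent word also has index 0, so the padding value and that word share the same integer.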

In [14]:
X_train = temp_train_list
X_test = temp_test_list

tag_X_train = tag_train_list
tag_X_test = tag_test_list

In [15]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

tag_X_train = sequence.pad_sequences(tag_X_train, maxlen=max_review_length)
tag_X_test = sequence.pad_sequences(tag_X_test, maxlen=max_review_length)

## Models

Now that we have preprocessed the dataset and split it into training and test sets, let us explore the model architecture. The notebook implements the following model:

CNN + LSTM with Dropout regularization

### LSTM and Convolutional Neural Network with dropout

Convolutional neural networks (CNNs) generally excel at learning the spatial structure in input data. Hence they are widely used on data that consists of highly correlated spatial structures, for example images. Images are unstructured data points in which groups of pixels represent a particular structure. The presence or absence of such structures helps us classify images into a particular category.

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.

Adding a 1-D Convolutional layer followed by Max pooling is elementary in Keras. 

Here we add a 1-D convolutional layer and a max pooling layer after the Embedding layer, which then feed the consolidated features to the LSTM. We use 32 feature maps (convolutional filters) with a kernel size of 3. The pooling layer uses the standard length of 2 to halve the feature map size.
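
The layer sizes described above can be checked with simple arithmetic. The sketch below is plain Python with hypothetical helper names; it assumes a convolution with `'same'` padding (length unchanged) and non-overlapping pooling windows.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim                      # one vector per word

def batchnorm_params(channels):
    return 4 * channels                          # gamma, beta, moving mean, moving variance

def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters   # weights + biases

def pooled_length(length, pool_size):
    return length // pool_size                   # non-overlapping windows

print(embedding_params(5000, 32))   # 160000
print(batchnorm_params(32))         # 128
print(conv1d_params(32, 32, 3))     # 3104
print(pooled_length(500, 2))        # 250
```

These figures agree with the parameter counts and output shapes in the model summary printed further down.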

### Binary model

In [16]:
# # create the model
# max_review_length = 500
# embedding_vecor_length = 32
# model = Sequential()
# model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
# model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
# model.add(MaxPooling1D(pool_size=2))
# model.add(Dropout(0.2))
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(1, activation='sigmoid'))
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [17]:
# # create the model
# # max_review_length = 500
# MAX_LEN = max_review_length
# embedding_vecor_length = 32
# model = Sequential()
# main_input = Input(shape=(MAX_LEN,), dtype='int32', name='main_input')
# model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True))
# model.add(SpatialDropout1D(0.2))
# model.add(BatchNormalization())
# model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
# model.add(MaxPooling1D(pool_size=5))
# model.add(Dropout(0.2))
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# # model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
# model.add(Dense(2, activation='sigmoid'))
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.summary()

### MultiLossLayer

### 3 tasks

In [18]:
# create the model
# max_review_length = 500
MAX_LEN = max_review_length
embedding_vecor_length = 32
model = Sequential()
main_input = Input(shape=(MAX_LEN,), dtype='int32', name='main_input')
x = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(main_input)
x = SpatialDropout1D(0.2)(x)
x = BatchNormalization()(x)
x = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(x)


sec_input_tag = Input(shape=(MAX_LEN,), dtype='float32', name='sec_input_tag')
y = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(sec_input_tag)
y = SpatialDropout1D(0.2)(y)
y = BatchNormalization()(y)
y = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(y)
y = MaxPooling1D(pool_size=2)(y)
y = Dropout(0.2)(y)
y = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(y)

# `merge(..., mode='concat')` is the legacy Keras 1 call; `concatenate` is its Keras 2 equivalent
from keras.layers import concatenate
xy = concatenate([x, y])
xy = Dense(30, activation='tanh')(xy)
xy = Dropout(0.5)(xy)


sec_input_rate = Input(shape=(MAX_LEN,), dtype='float32', name='sec_input_rating')
z = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(sec_input_rate)
z = SpatialDropout1D(0.2)(z)
z = BatchNormalization()(z)
z = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(z)
z = MaxPooling1D(pool_size=2)(z)
z = Dropout(0.2)(z)
z = LSTM(100, dropout=0.2, recurrent_dropout=0.2)(z)

xz = concatenate([x, z])
# xz = Dense(30, activation='tanh', )(xz)
# xz = Dropout(0.5)(xz)

task0_output = Dense(5, activation='sigmoid', name='sec_output_tag')(x)
task1_output = Dense(5, activation='softmax', name='main_output')(xy)
task2_output = Dense(5, activation='softmax', name='sec_output_rating')(xz)


model_task0 = Model(inputs=[main_input], outputs=[task0_output])
model_task1 = Model(inputs=[main_input, sec_input_tag], outputs=[task1_output])
model_task2 = Model(inputs=[main_input, sec_input_rate], outputs=[task2_output])

model_task0.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])
model_task1.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])
model_task2.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])

model_task0.summary()
model_task1.summary()
model_task2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
main_input (InputLayer)      (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 500, 32)           0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 500, 32)           128       
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
dro



### copy

In [19]:
# # create the model
# # max_review_length = 500
# MAX_LEN = max_review_length
# embedding_vecor_length = 32
# model = Sequential()
# main_input = Input(shape=(MAX_LEN,), dtype='int32', name='main_input')
# x = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(main_input)
# x = SpatialDropout1D(0.2)(x)
# x = BatchNormalization()(x)

# x = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
# x = MaxPooling1D(pool_size=2)(x)
# x = Dropout(0.2)(x)

# x = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(x)
# task1_output = Dense(5, activation='softmax', name='main_output')(x)
# task2_output = Dense(2, activation='softmax', name='aux_output')(x)


# model_task1 = Model(input=[main_input], output=[task1_output])
# model_task2 = Model(input=[main_input], output=[task2_output])

# model_task1.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])
# model_task2.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])

# model_task1.summary()
# model_task2.summary()

### map the train and test score into 5 classes

In [20]:
temp_y_train = copy.deepcopy(train_rate)
temp_y_train_binary = copy.deepcopy(train_rate)
for i in range(len(temp_y_train)):
    if 0 <= temp_y_train[i] <= 0.5:  # assumed midpoint: ratings lie in [0, 1], so a bound of 5 labelled every review 0
        temp_y_train_binary[i] = 0
    else:
        temp_y_train_binary[i] = 1
    
    if 0 <= temp_y_train[i] <= 0.2:
        temp_y_train[i] = 1
    elif 0.2 < temp_y_train[i] <= 0.4:
        temp_y_train[i] = 2
    elif 0.4 < temp_y_train[i] <= 0.6:
        temp_y_train[i] = 3
    elif 0.6 < temp_y_train[i] <= 0.8:
        temp_y_train[i] = 4
    elif 0.8 < temp_y_train[i] <= 1:
        temp_y_train[i] = 5

print(len(temp_y_train))
print(set(temp_y_train))
# print(temp_y_train)

4004
{1, 2, 3, 4, 5}


In [21]:
temp_y_test = copy.deepcopy(test_rate)
temp_y_test_binary = copy.deepcopy(test_rate)
for i in range(len(temp_y_test)):
    if 0 <= temp_y_test[i] <= 0.5:  # assumed midpoint: ratings lie in [0, 1], so a bound of 5 labelled every review 0
        temp_y_test_binary[i] = 0
    else:
        temp_y_test_binary[i] = 1
    
    if 0 <= temp_y_test[i] <= 0.2:
        temp_y_test[i] = 1
    elif 0.2 < temp_y_test[i] <= 0.4:
        temp_y_test[i] = 2
    elif 0.4 < temp_y_test[i] <= 0.6:
        temp_y_test[i] = 3
    elif 0.6 < temp_y_test[i] <= 0.8:
        temp_y_test[i] = 4
    elif 0.8 < temp_y_test[i] <= 1:
        temp_y_test[i] = 5

print(len(temp_y_test))
print(set(temp_y_test))

1002
{1, 2, 3, 4, 5}
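
The five-way binning above can also be written compactly with `bisect`. This is a sketch that assumes ratings are normalised to the range [0, 1], as the thresholds in the cells above suggest; `EDGES` and `rating_to_class` are names introduced for this example.

```python
from bisect import bisect_left

# Upper edges of classes 1-4; anything above 0.8 falls into class 5
EDGES = [0.2, 0.4, 0.6, 0.8]

def rating_to_class(rating):
    """Map a rating in [0, 1] to a class in 1..5, with upper bounds inclusive."""
    if not 0 <= rating <= 1:
        raise ValueError("rating must lie in [0, 1]")
    return bisect_left(EDGES, rating) + 1

print([rating_to_class(r) for r in (0.0, 0.2, 0.21, 0.6, 0.81, 1.0)])
# [1, 1, 2, 3, 5, 5]
```

`bisect_left` keeps the upper bound of each bin inclusive, matching the `<=` comparisons in the loop version.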


In [22]:
print(rank5000["neutral"])

2001


In [23]:
temp_train_rating = copy.deepcopy(train_rate)
train_rating = [[] for i in range(len(temp_train_rating))]
for i in range(len(temp_train_rating)):
#     if temp_y_train[i] == 1:
#         train_rating[i] += [rank5000["very"], rank5000["positive"]]
#     elif temp_y_train[i] == 2:
#         train_rating[i] += [rank5000["positive"]]
#     elif temp_y_train[i] == 3:
#         train_rating[i] += [rank5000["neutral"]]#[rank5000["not"],rank5000["positive"],rank5000["not"],rank5000["negative"]]
#     elif temp_y_train[i] == 4:
#         train_rating[i] += [rank5000["negative"]]
#     elif temp_y_train[i] == 5:
#         train_rating[i] += [rank5000["very"], rank5000["negative"]]

    if temp_y_train[i] == 1:
        train_rating[i] += [1]
    elif temp_y_train[i] == 2:
        train_rating[i] += [2]
    elif temp_y_train[i] == 3:
        train_rating[i] += [3]
    elif temp_y_train[i] == 4:
        train_rating[i] += [4]
    elif temp_y_train[i] == 5:
        train_rating[i] += [5]
# print(train_rating)

In [24]:
temp_test_rating = copy.deepcopy(test_rate)
test_rating = [[] for i in range(len(temp_test_rating))]
for i in range(len(temp_test_rating)):
#     if temp_y_train[i] == 1:
#         test_rating[i] += [rank5000["very"], rank5000["positive"]]
#     elif temp_y_train[i] == 2:
#         test_rating[i] += [rank5000["positive"]]
#     elif temp_y_train[i] == 3:
#         test_rating[i] += [rank5000["neutral"]]#[rank5000["not"],rank5000["positive"],rank5000["not"],rank5000["negative"]]
#     elif temp_y_train[i] == 4:
#         test_rating[i] += [rank5000["negative"]]
#     elif temp_y_train[i] == 5:
#         test_rating[i] += [rank5000["very"], rank5000["negative"]]
        
    if temp_y_test[i] == 1:
        test_rating[i] += [1]
    elif temp_y_test[i] == 2:
        test_rating[i] += [2]
    elif temp_y_test[i] == 3:
        test_rating[i] += [3]
    elif temp_y_test[i] == 4:
        test_rating[i] += [4]
    elif temp_y_test[i] == 5:
        test_rating[i] += [5]
# print(test_rating)

In [25]:
from sklearn import utils
class_weights = utils.compute_class_weight('balanced', classes=np.unique(temp_y_train), y=temp_y_train)
class_weights= {class_id:class_weight for class_id, class_weight in zip(list(range(5)), class_weights)}
# class_weights[0] = 0
# class_weights = {k:v for k,v in sorted(class_weights.items(),key=lambda x:x[0])}
print(class_weights)

{0: 5.976119402985074, 1: 0.9789731051344743, 2: 0.5488690884167238, 3: 0.6071266110689917, 4: 2.9226277372262772}
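
The `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`, so rare classes receive larger weights. A minimal plain-Python sketch of that formula follows; `balanced_class_weights` is a hypothetical helper and the label list is toy data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy labels: class 1 appears three times, class 2 once
print(balanced_class_weights([1, 1, 1, 2]))
```

Applied to the output above, the first weight of about 5.98 corresponds to 134 of the 4004 training reviews falling into the lowest class, since 4004 / (5 * 134) ≈ 5.976.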


In [26]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=[1,2,3,4,5])
mlb_binary = MultiLabelBinarizer(classes=[0,1])
y_train = mlb.fit_transform([[y] for y in temp_y_train])
# y_train_binary = mlb_binary.fit_transform([[y] for y in temp_y_train_binary])
print(y_train)

y_test = mlb.fit_transform([[y] for y in temp_y_test])
# y_test_binary = mlb_binary.fit_transform([[y] for y in temp_y_test_binary])
print(y_test)

[[0 0 0 1 0]
 [0 0 0 1 0]
 [0 0 0 1 0]
 ...
 [0 0 1 0 0]
 [0 0 0 1 0]
 [0 0 1 0 0]]
[[0 0 1 0 0]
 [0 0 0 1 0]
 [0 0 0 0 1]
 ...
 [0 0 0 1 0]
 [0 1 0 0 0]
 [0 0 0 1 0]]
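
Each row of the matrices above is a one-hot vector over the classes 1 to 5. The same encoding can be sketched without scikit-learn; `one_hot` is a hypothetical helper written for this illustration.

```python
def one_hot(label, classes):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    if label not in classes:
        raise ValueError("unknown label: %r" % (label,))
    return [1 if c == label else 0 for c in classes]

classes = [1, 2, 3, 4, 5]
print(one_hot(4, classes))  # [0, 0, 0, 1, 0]
print(one_hot(3, classes))  # [0, 0, 1, 0, 0]
```

Position `k` in the vector stands for class `k + 1`, which is why the class-weight dictionary above is keyed 0 to 4.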


In [None]:
rate_X_train = sequence.pad_sequences(train_rating, maxlen=max_review_length)
rate_X_test = sequence.pad_sequences(test_rating, maxlen=max_review_length)
print(rate_X_train[0])
# rate_Y_train = sequence.pad_sequences([[y] for y in temp_y_train], maxlen=max_review_length)
# rate_Y_test = sequence.pad_sequences([[y] for y in temp_y_test], maxlen=max_review_length)
# print(rate_Y_train[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [None]:
# Custom loss layer
class CustomMultiLossLayer(Layer):
    def __init__(self, nb_outputs=2, **kwargs):
        self.nb_outputs = nb_outputs
        self.is_placeholder = True
        super(CustomMultiLossLayer, self).__init__(**kwargs)
        
    def build(self, input_shape=None):
        # initialise log_vars
        self.log_vars = []
        for i in range(self.nb_outputs):
            self.log_vars += [self.add_weight(name='log_var' + str(i), shape=(1,),
                                              initializer=Constant(0.), trainable=True)]
        super(CustomMultiLossLayer, self).build(input_shape)

    def multi_loss(self, ys_true, ys_pred):
#         print(self.nb_outputs)
#         print(len(ys_true))
#         print(len(ys_pred))
        assert len(ys_true) == self.nb_outputs and len(ys_pred) == self.nb_outputs
        loss = 0
        for y_true, y_pred, log_var in zip(ys_true, ys_pred, self.log_vars):
            precision = K.exp(-log_var[0])
            loss += K.sum(precision * (y_true - y_pred)**2. + log_var[0], -1)
        return K.mean(loss)

    def call(self, inputs):
        ys_true = inputs[:self.nb_outputs]
        ys_pred = inputs[self.nb_outputs:]
        loss = self.multi_loss(ys_true, ys_pred)
        self.add_loss(loss, inputs=inputs)
        # We won't actually use the output.
        return K.concatenate(inputs, -1)
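
To make the arithmetic inside `multi_loss` easier to follow, here is a plain-Python sketch of the same computation: every task `i` has a learned log-variance `s_i`, each squared error is scaled by `exp(-s_i)`, and `s_i` itself is added as a penalty. The function name `multi_task_loss` and the toy inputs are assumptions for illustration; the real layer operates on Keras tensors.

```python
import math

def multi_task_loss(ys_true, ys_pred, log_vars):
    """ys_true / ys_pred: one batch per task, each batch a list of per-sample vectors."""
    batch_size = len(ys_true[0])
    per_sample = [0.0] * batch_size
    for y_true, y_pred, s in zip(ys_true, ys_pred, log_vars):
        precision = math.exp(-s)
        for n in range(batch_size):
            # sum over the class dimension, as K.sum(..., -1) does
            per_sample[n] += sum(precision * (t - p) ** 2 + s
                                 for t, p in zip(y_true[n], y_pred[n]))
    return sum(per_sample) / batch_size       # mean over the batch

# One sample, two tasks, two classes each (toy values)
y_true = [[[1.0, 0.0]], [[0.0, 1.0]]]
y_pred = [[[0.5, 0.5]], [[0.5, 0.5]]]
print(multi_task_loss(y_true, y_pred, log_vars=[0.0, 0.0]))  # 1.0
```

With all log-variances at zero the loss reduces to a plain sum of squared errors. Raising `s_i` lowers the weight of task `i` but adds `s_i` to the loss, which keeps the model from ignoring a task for free.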

In [None]:
MAX_LEN = 500
def get_prediction_model():
    main_input = Input(shape=(MAX_LEN,), dtype='float32', name='main_input')
    sec_input = Input(shape=(MAX_LEN,), dtype='float32', name='sec_input')
    x = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(main_input)
    x = SpatialDropout1D(0.2)(x)
    x = BatchNormalization()(x)
    x = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
    x = MaxPooling1D(pool_size=2)(x)
    x = Dropout(0.2)(x)
    x = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(x)
    
    y = Embedding(top_words, embedding_vecor_length, input_length=max_review_length,trainable=True)(sec_input)
    y = SpatialDropout1D(0.2)(y)
    y = BatchNormalization()(y)
    y = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(y)
    y = MaxPooling1D(pool_size=2)(y)
    y = Dropout(0.2)(y)
    y = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2))(y)
    
    task1_output = Dense(5, activation='softmax', name='word_output')(x)
    task2_output = Dense(5, activation='softmax', name='tag_output')(y)
    return Model([main_input, sec_input], [task1_output, task2_output])

def get_trainable_model(prediction_model):
    main_input = Input(shape=(MAX_LEN,), dtype='float32', name='main_input')
    sec_input = Input(shape=(MAX_LEN,), dtype='float32', name='sec_input')
    task1_output, task2_output = prediction_model([main_input, sec_input])

    task1_input = Input(shape=(5,), dtype='float32', name='task1_input')
    task2_input = Input(shape=(5,), dtype='float32', name='task2_input')
    out = CustomMultiLossLayer(nb_outputs=2)([task1_input, task2_input, task1_output, task2_output])
    
    xz = merge([task1_output, task1_output], mode='concat')
    task_output = Dense(5, activation='softmax', name='sec_output_rating')(xz)
    
    return Model([main_input, sec_input, task1_input, task2_input], out)

prediction_model = get_prediction_model()
trainable_model = get_trainable_model(prediction_model)
trainable_model.compile(optimizer='RMSprop', loss=None)
trainable_model.compile(optimizer='RMSprop', loss=None, metrics=['accuracy'])

In [None]:
hist = trainable_model.fit([X_train, tag_X_train, y_train, y_train], epochs=2, batch_size=64)

In [None]:
# print(hist)
scores = trainable_model.evaluate([X_test, tag_X_test,y_test,y_test], None, verbose=1)
# print("Epoch: %d, Accuracy: %.2f%%" % (i, scores[1]*100))
# print(scores)

In [None]:
import pylab
%matplotlib inline
pylab.plot(hist.history['loss'])

### model 3 for tag only

In [None]:
BATCH_SIZE = 128
results = []
for batch in range(60):
    t=time.time()
    if random.random() < 0.8:
        sample = np.random.randint(0, len(y_train), BATCH_SIZE)
        x_sampled, y_sampled = tag_X_train[sample], y_train[sample]
        model_task1.train_on_batch(x_sampled, y_sampled)#, class_weight=class_weights,sample_weight=None)
    print(batch,time.time()-t)

In [None]:
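# evaluate() (not fit()) returns [loss, accuracy]
# Note: model_task1 as built above takes [words, tags]; this cell assumes a tag-only model
scores = model_task1.evaluate(tag_X_test, y_test, verbose=0)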
print("Accuracy: %.2f%%" % (scores[1]*100))

### model 1 for sentence
#### task1: top 5000 words
#### task2: tag

In [None]:
# BATCH_SIZE = 128
# results = []
# for batch in range(60):
#     t=time.time()
#     if random.random() < 0.8:
#         sample = np.random.randint(0, len(y_train), BATCH_SIZE)
#         x_sampled, y_sampled = tag_X_train[sample], y_train[sample]
#         model_task1.train_on_batch(x_sampled, y_sampled)#, class_weight=class_weights,sample_weight=None)
#     print(batch,time.time()-t)

In [None]:
model_task1.fit([X_train, tag_X_train], y_train, epochs=10, batch_size=64)
scores = model_task1.evaluate([X_test, tag_X_test], y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
 768/4004 [====>.........................] - ETA: 58s - loss: 1.0611 - acc: 0.5729

### model 2 for sentence
#### task1: top 5000 words
#### task2: rating

In [None]:
model_task2.fit([X_train, rate_X_train], y_train, epochs=10, batch_size=64)
scores = model_task2.evaluate([X_test, rate_X_test], y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

### model 0
#### single task on word

In [None]:
model_task0.fit(X_train, y_train, epochs=10, batch_size=64)
scores = model_task0.evaluate(X_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
# BATCH_SIZE = 128
# results = []
# for batch in range(60):
#     t=time.time()
#     if random.random() < 0.8:
#         sample = np.random.randint(0, len(y_train), BATCH_SIZE)
#         x_sampled, y_sampled = X_train[sample], y_train[sample]
#         model_task1.train_on_batch(x_sampled, y_sampled)#, class_weight=class_weights,sample_weight=None)
#     print(batch,time.time()-t)

In [None]:
# scores = model_task1.evaluate(X_test, y_test, verbose=0)
# print("Accuracy: %.2f%%" % (scores[1]*100))

#### model2 for sentence

In [None]:
# BATCH_SIZE = 128
# results = []
# for batch in range(60):
#     t=time.time()
#     if random.random() < 0.8:
#         sample = np.random.randint(0, len(y_train), BATCH_SIZE)
#         x_sampled, y_sampled = X_train[sample], y_train_binary[sample]
#         model_task2.train_on_batch(x_sampled, y_sampled)#, class_weight=class_weights,sample_weight=None)
#     print(batch,time.time()-t)

In [None]:
# scores = model_task2.evaluate(X_test, y_test_binary, verbose=0)
# print("Accuracy: %.2f%%" % (scores[1]*100))

### model 0 for sentence binary

In [None]:
model.fit(X_train, y_train_binary, epochs=4, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test_binary, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
model.fit(X_train, y_train, epochs=1, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

## Results

As we can see, with a mixture of a **Convolutional Neural Network + LSTM**, we are able to achieve an accuracy close to the one reported for this dataset in the 2011 paper. As stated earlier, the Stanford researchers achieved an accuracy of 88.89%, and **we have been able to reach 88.18%** (I even achieved **88.41%** with a different seed earlier!)

Some reflections:

1. The CNN and max pooling layers after the Embedding layer are able to pick out invariant features for good and bad sentiment. These learned spatial features are then learned as sequences by an LSTM layer.
2. We have fewer weights and a faster training time!

## Summary

In this notebook we implemented LSTM network models for a sequence classification problem, specifically sentiment classification of movie reviews from the **Large Movie Review Dataset.**

1. Implemented a simple single-layer LSTM model for the IMDB movie review sentiment classification problem.
2. Extended the LSTM model with LSTM-specific dropout to combat overfitting.
3. Combined the spatial structure learning properties of a Convolutional Neural Network (CNN) with the sequence learning of an LSTM.