# Recurrent Neural Networks for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb


## The IMDb Movie Review Dataset

In this section, we will train a simple recurrent neural network (an LSTM) to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews taken from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative reviews. For simplicity, I assembled the reviews in a single CSV file.


## 0. Import libraries

In [1]:
# Import libraries
import re

import numpy as np
import pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## 1. Reading Data Using pandas

In [12]:
# Load the IMDb movie review dataset
df = pd.read_csv('shuffled_movie_data.csv')  # the file exceeds GitHub's file size limit
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


## 2. Prepare Data

In [16]:
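# Lowercase each review, then strip everything except letters, digits, and whitespace
df['review'] = df['review'].apply(lambda x: x.lower())
df['review'] = df['review'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Restrict the vocabulary to the most frequent words, convert each review
# to a sequence of word indices, and pad all sequences to the same length
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(df['review'].values)
X = tokenizer.texts_to_sequences(df['review'].values)
X = pad_sequences(X)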
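
To make the encoding and padding steps concrete, here is a minimal pure-Python sketch applied to two toy reviews. The helpers `build_word_index`, `encode`, and `pad_pre` are hypothetical names that only approximate what the Keras `Tokenizer` and `pad_sequences` do (frequency-ranked indices starting at 1, words ranked at or beyond `num_words` dropped, zeros added at the front); they are not the library's implementation.

```python
from collections import Counter


def build_word_index(texts, num_words):
    """Map words to 1-based frequency ranks, keeping ranks below num_words."""
    counts = Counter(word for text in texts for word in text.split())
    # Ties are broken alphabetically here for determinism
    # (an assumption of this sketch).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}


def encode(text, word_index):
    """Replace each known word by its index; drop out-of-vocabulary words."""
    return [word_index[w] for w in text.split() if w in word_index]


def pad_pre(sequences):
    """Left-pad every sequence with zeros up to the longest sequence."""
    maxlen = max(len(s) for s in sequences)
    return [[0] * (maxlen - len(s)) + s for s in sequences]


texts = ["a great great movie", "a bad movie"]      # toy reviews
word_index = build_word_index(texts, num_words=4)   # "bad" falls outside the vocabulary
sequences = [encode(t, word_index) for t in texts]
padded = pad_pre(sequences)
print(padded)  # [[1, 2, 2, 3], [0, 0, 1, 3]]
```

Because the padded width equals the length of the longest review, a single very long review determines the second dimension of `X` for the whole dataset.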

In [27]:
# One-hot encode the labels (two columns, one per class), then split the data
Y = to_categorical(df['sentiment'].values)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(33500, 1862) (33500, 2)
(16500, 1862) (16500, 2)
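
The second dimension of 2 in the label shapes reflects one-hot encoding: each 0/1 label becomes a two-element row, so it lines up with a two-unit softmax output. A minimal NumPy sketch of that encoding, using a hypothetical toy label array rather than the real data:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])          # toy sentiment labels (hypothetical)
one_hot = np.eye(2, dtype=int)[labels]   # row i of the identity matrix encodes class i
recovered = one_hot.argmax(axis=1)       # argmax inverts the encoding

print(one_hot.shape)  # (4, 2)
print(recovered)      # [0 1 1 0]
```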


## 3. Create the model

In [26]:
# Build the RNN: embedding -> spatial dropout -> LSTM -> softmax over the two classes
model = Sequential()
model.add(Embedding(max_fatures, 128,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 1862, 128)         256000    
_________________________________________________________________
spatial_dropout1d_8 (Spatial (None, 1862, 128)         0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 256)               394240    
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 514       
Total params: 650,754
Trainable params: 650,754
Non-trainable params: 0
_________________________________________________________________
None
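
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the standard LSTM parameterization (four weight blocks: input, forget, and output gates plus the cell candidate, each with input weights, recurrent weights, and a bias); the variable names are illustrative only.

```python
vocab_size, embed_dim, lstm_units, n_classes = 2000, 128, 256, 2

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: four weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (input, output) pair plus one bias per output.
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 256000 394240 514 650754
```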


## 4. Train

In [31]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
 - 20s - loss: 0.4382 - acc: 0.8168
Epoch 2/7
 - 19s - loss: 0.3214 - acc: 0.8648
Epoch 3/7
 - 19s - loss: 0.2836 - acc: 0.8808
Epoch 4/7
 - 19s - loss: 0.2578 - acc: 0.8958
Epoch 5/7
 - 19s - loss: 0.2288 - acc: 0.9036
Epoch 6/7
 - 19s - loss: 0.2103 - acc: 0.9158
Epoch 7/7
 - 19s - loss: 0.1959 - acc: 0.9239
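
For a sense of how much work each epoch represents, the number of weight updates follows directly from the training set size and the batch size. This is a small arithmetic sketch, assuming the final partial batch is kept rather than dropped:

```python
import math

n_train, batch_size, epochs = 33500, 32, 7

steps_per_epoch = math.ceil(n_train / batch_size)           # batches per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size   # size of the final, smaller batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)  # 1047 28 7329
```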


## 5. Validate

In [30]:
# Hold out the last 1,500 test samples as a validation set and evaluate on the remaining test samples
val_size = 1500
X_validate = X_test[-val_size:]
Y_validate = Y_test[-val_size:]
X_test = X_test[:-val_size]
Y_test = Y_test[:-val_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.40
acc: 0.84
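
The two numbers printed above are the loss (`score`) and the accuracy on the evaluated samples. As a rough illustration of how such values arise from one-hot targets and softmax outputs, here is a NumPy sketch on hypothetical toy values; it mirrors the usual definitions of categorical cross-entropy and accuracy rather than reproducing the Keras internals.

```python
import numpy as np

# Toy one-hot targets and softmax outputs (hypothetical values, not model output).
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

# Accuracy: fraction of rows where the most probable class matches the target.
acc = np.mean(y_prob.argmax(axis=1) == y_true.argmax(axis=1))

# Loss: mean negative log-probability assigned to the true class.
loss = -np.mean(np.log(np.sum(y_true * y_prob, axis=1)))

print("score: %.2f" % loss)
print("acc: %.2f" % acc)
```

A confident wrong prediction raises the loss sharply while costing accuracy only one sample, so the two metrics are worth reading together.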
