# IMDB Movie Reviews Sentiment Analysis Using Bidirectional LSTM


This is a short implementation of sentiment analysis on IMDB movie reviews. It is a binary classification task with a set of 25,000 highly polar movie reviews for training and 25,000 for testing. I used the original dataset released by Stanford (Large Movie Review Dataset v1.0) and trained a Bidirectional LSTM on it, reaching about 94% training accuracy and about 90% validation accuracy.



The implementation uses Keras, Pandas, and NLTK. First, we import all the libraries we are going to use.

In [1]:
import pandas as pd
import numpy as np
import re
import os
from IPython.display import HTML

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , Input , LSTM , Embedding, Dropout , Activation, GRU, Flatten
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras import initializers, regularizers, constraints, optimizers, layers


import nltk
# nltk.download('words')
# nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import words
from nltk.corpus import wordnet 
allEnglishWords = words.words() + [w for w in wordnet.words()]
allEnglishWords = np.unique([x.lower() for x in allEnglishWords])
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


Next we load the dataset. The main dataset has 2 folders, one for the training set and one for the test set. Each of them contains two subfolders, for positive and negative reviews, and every review is stored in its own .txt file. To use the dataset we have to combine the files from these folders into a single dataset.
Here I list the review files in all 4 folders.

In [2]:
path = "/Users/ved/Desktop/Int_prep/aclImdb/"
positiveSamples = [x for x in os.listdir(path+"train/pos/") if x.endswith(".txt")]
negativeSamples = [x for x in os.listdir(path+"train/neg/") if x.endswith(".txt")]
pos_test_samples = [x for x in os.listdir(path+"test/pos/") if x.endswith(".txt")]
neg_test_samples = [x for x in os.listdir(path+"test/neg/") if x.endswith(".txt")]


Here I read the text of every file into 4 different lists, one per folder.

In [119]:
positiveReviews, negativeReviews, pos_test_Reviews, neg_test_Reviews = [], [], [], []
for pfile in positiveSamples:
    with open(path+"train/pos/"+pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeSamples:
    with open(path+"train/neg/"+nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
for tfile in pos_test_samples:
    with open(path+"test/pos/"+tfile, encoding="latin1") as f:
        pos_test_Reviews.append(f.read())
for tfile in neg_test_samples:
    with open(path+"test/neg/"+tfile, encoding="latin1") as f:
        neg_test_Reviews.append(f.read())
print(len(pos_test_Reviews))

12500
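The four loops above all follow the same pattern, so they could be folded into one helper. The following is only a sketch: `read_reviews` and the temporary demo folder are hypothetical names introduced for illustration, and the demo folder stands in for a directory such as `train/pos/` so the snippet runs without the real dataset.

```python
import os
import tempfile

def read_reviews(folder, encoding="latin1"):
    """Return the contents of every .txt file in `folder`, in sorted filename order."""
    reviews = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue  # skip anything that is not a review file
        with open(os.path.join(folder, name), encoding=encoding) as f:
            reviews.append(f.read())
    return reviews

# Tiny stand-in for a folder of reviews (hypothetical file names and contents).
demo_dir = tempfile.mkdtemp()
for name, text in [("0_9.txt", "Great film."), ("1_8.txt", "Loved it."), ("notes.md", "ignored")]:
    with open(os.path.join(demo_dir, name), "w", encoding="latin1") as f:
        f.write(text)

demo_reviews = read_reviews(demo_dir)
print(len(demo_reviews))
```

With the real dataset, the same helper would be called once per folder, for example `read_reviews(path + "train/pos/")`.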


After that I combined these lists into a single pandas dataframe and shuffled its rows. The dataframe has 2 columns, one for the review text and one for the label, and 50,000 rows in total.


In [133]:
df1 = pd.concat([
    pd.DataFrame({"review":positiveReviews, "label":1}),
    pd.DataFrame({"review":negativeReviews, "label":0}),
    pd.DataFrame({"review":pos_test_Reviews, "label":1}),
    pd.DataFrame({"review":neg_test_Reviews, "label":0})
], ignore_index=True).sample(frac=1, random_state=1)


len(df1)

50000

Before using the text we need to preprocess it: strip punctuation, convert to lowercase, lemmatize each word, and remove stopwords. Here we create a new column, Processed_Reviews, in our dataframe df1 to hold the reviews after preprocessing.

In [134]:
stop_words = set(stopwords.words("english")) 
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Strip punctuation. flags must be passed by keyword, because the
    # fourth positional argument of re.sub is count, not flags.
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
    text = text.lower()
    # Lemmatize each token as a noun first, then as a verb.
    text = [lemmatizer.lemmatize(token) for token in text.split(" ")]
    text = [lemmatizer.lemmatize(token, "v") for token in text]
    text = [word for word in text if word not in stop_words]
    text = " ".join(text)
    return text

df1['Processed_Reviews'] = df1.review.apply(clean_text)
#df2['Processed_Reviews'] = df2.review.apply(lambda x: clean_text(x))
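One detail of `re.sub` is easy to miss when stripping punctuation: its fourth positional parameter is `count`, so a flag passed in that position is read as a cap on the number of substitutions. The sketch below uses a made-up string to show the difference. The cap is written out explicitly as `count=int(re.UNICODE)`, which is 32.

```python
import re

# 40 punctuation marks, more than int(re.UNICODE) == 32
text = "wow!" * 40

# What a positionally passed re.UNICODE ends up meaning: at most 32 substitutions.
capped = re.sub(r"[^\w\s]", "", text, count=int(re.UNICODE))

# Passing the flag by keyword removes every punctuation mark.
cleaned = re.sub(r"[^\w\s]", "", text, flags=re.UNICODE)

print(capped.count("!"))   # punctuation marks left after the capped call
print(cleaned.count("!"))  # punctuation marks left after the keyword call
```

On long reviews the capped version would leave most punctuation in place, which then ends up glued to tokens.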

Now comes the main part of the task: tokenizing the text and building the model. I converted each review into a sequence of token ids, passed it through an embedding layer to get a vector for each token, and then through a Bidirectional LSTM layer. A Bidirectional LSTM is really just two independent LSTMs put together: the input sequence is fed in normal time order to one network and in reverse time order to the other, and their outputs are combined. Using a Bidirectional LSTM improved accuracy by a good amount. We also have a dropout layer to add some regularization to the network. I also used the LSTM layer's built-in regularizers (L2 on the kernel, L1 on the activity); as a result the model performs well and is not badly overfitted.
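To make the "two independent networks, one reading in reverse" idea concrete, here is a minimal NumPy sketch. It is not what Keras runs internally: a plain tanh RNN stands in for the LSTM cell to keep the code short, and all names and sizes (`simple_rnn`, 5 timesteps, 4 features, 3 units) are assumptions chosen for illustration.

```python
import numpy as np

def simple_rnn(inputs, w_in, w_rec):
    """Run a plain tanh RNN over `inputs` (timesteps x features); return all hidden states."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, features, units = 5, 4, 3
sequence = rng.normal(size=(timesteps, features))

# Two independent sets of weights, one per direction.
fwd_w_in, fwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))
bwd_w_in, bwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))

forward_states = simple_rnn(sequence, fwd_w_in, fwd_w_rec)
# Feed the reversed sequence, then flip the outputs back so timesteps line up.
backward_states = simple_rnn(sequence[::-1], bwd_w_in, bwd_w_rec)[::-1]

# Concatenate per timestep: shape (timesteps, 2 * units).
bidirectional = np.concatenate([forward_states, backward_states], axis=-1)

# Max over the time axis, as a global max pooling layer would do.
pooled = bidirectional.max(axis=0)
print(bidirectional.shape, pooled.shape)
```

The concatenated output has twice as many features per timestep as a single direction. That is why a bidirectional layer with 32 units per direction hands 64 features to the pooling layer that follows it.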


Apart from that there are 2 dense layers. For optimization I used the Adam optimizer, which combines momentum with per-parameter adaptive learning rates. The hidden dense layer uses a ReLU activation, which helps against vanishing gradients and helps the network converge faster. The LSTM itself keeps its default Keras activations, and the output layer uses a sigmoid for the binary label.
Batch size is 500 and the number of epochs is 20.
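Before the model is built, the next cell turns each processed review into a fixed-length list of integer ids. As a rough illustration of what that tokenize-and-pad step does, here is a plain-Python sketch. It is a simplification rather than the Keras implementation, and the helper names and toy texts are hypothetical. It mirrors two Keras behaviours: only words ranked below `num_words` are kept, and padding and truncation happen at the front of the sequence by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank, 1 being the most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words with their ranks, dropping words whose rank is not below num_words."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids of longer sequences."""
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

texts = [
    "great movie great cast",
    "bad movie",
    "great plot bad end really bad great",
]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_left(sequences, maxlen=3)
print(padded)
```

Every row of `padded` has the same length, which is what lets the reviews be stacked into a single array and fed to the embedding layer.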

In [143]:
max_features = 8000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(df1['Processed_Reviews'])
list_tokenized_train = tokenizer.texts_to_sequences(df1['Processed_Reviews'])

maxlen = 130
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
y = df1['label']

embed_size = 128
model = Sequential()
model.add(Embedding(max_features, embed_size))
model.add(Bidirectional(LSTM(32, return_sequences = True, kernel_regularizer=regularizers.l2(0.001),
                activity_regularizer=regularizers.l1(0.001))))
model.add(GlobalMaxPool1D())
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.05))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 500
epochs = 20
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.2)


Train on 40000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1a4c384da0>

So here we get a training accuracy of 94.33% and a validation accuracy of 90.30% in just 20 epochs, which is pretty good. The model is slightly overfitted, which we can improve with proper hyperparameter tuning. By tuning the regularization parameters and running for a few more epochs, we may get better results for both training and validation.