#### Name : Subodh Dharmadhikari
### PRN : 18030142043
### Course : Text Analytics
# Sentiment analysis using Bidirectional LSTM on IMDB Reviews dataset

#### Importing required packages

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import os
import urllib.request
import tarfile
import warnings
warnings.filterwarnings("ignore")

Using TensorFlow backend.


#### Downloading the dataset from the Stanford AI Lab server (the download can take up to 4 minutes)

In [2]:

url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filePath = "aclImdb_v1.tar.gz"

if not os.path.isfile(filePath):
    result = urllib.request.urlretrieve(url, filePath)
    print('downloaded: ', result)

if not os.path.exists("aclImdb"):
    # Extract into the current directory; this creates the aclImdb/ folder
    with tarfile.open(filePath, 'r:gz') as tempTarFile:
        tempTarFile.extractall('.')

#### Using regex to clean the data

In [3]:
import re

def remove_tags(text):
    regular_expression_tag = re.compile(r'<[^>]+>')
    return regular_expression_tag.sub('',text)
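
As a quick check of what the pattern does, the sketch below applies the same `remove_tags` logic to a made-up review string (the sample text is an assumption, not taken from the dataset). Substituting the empty string joins the words on either side of a tag; `remove_tags_spaced` is a hypothetical variant, not part of the notebook, that inserts a space instead.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]+>')  # same pattern as in remove_tags

def remove_tags(text):
    # Delete every HTML tag, leaving nothing in its place
    return TAG_PATTERN.sub('', text)

def remove_tags_spaced(text):
    # Hypothetical variant: put a space where each tag was,
    # then collapse runs of whitespace into single spaces
    return ' '.join(TAG_PATTERN.sub(' ', text).split())

sample = "Great film.<br /><br />Loved it."  # made-up example text
print(remove_tags(sample))         # Great film.Loved it.
print(remove_tags_spaced(sample))  # Great film. Loved it.
```

Whether the merged words matter depends on the tokenizer; with a 2000-word vocabulary, a fused token such as `film.Loved` would most likely be dropped as out of vocabulary.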

#### Read positive and negative files

In [4]:
import os

def read_files(file_type):
    """Return (labels, texts) for the 'train' or 'test' split; 1 = positive, 0 = negative."""
    path = "aclImdb/"
    all_labels = []
    all_texts = []

    # Positive reviews are read first, then negative, so labels line up with texts
    for label, folder in ((1, "/pos/"), (0, "/neg/")):
        folder_path = path + file_type + folder
        for f in os.listdir(folder_path):
            with open(folder_path + f, encoding='utf8') as file_input:
                # Join the lines of the review and remove HTML tags
                all_texts.append(remove_tags(" ".join(file_input.readlines())))
            # Label each file as it is read instead of assuming 12500 per class
            all_labels.append(label)

    print('read', file_type, 'files: ', len(all_texts))

    return all_labels, all_texts

In [5]:
y_train, x_train_text = read_files("train")

read train files:  25000


In [6]:
y_test, x_test_text = read_files("test")

read test files:  25000


#### Tokenize the reviews and pad the sequences

In [7]:
token = Tokenizer( num_words=2000 )
token.fit_on_texts(x_train_text)

x_train_seq = token.texts_to_sequences(x_train_text)
x_test_seq = token.texts_to_sequences(x_test_text)

x_train_final = sequence.pad_sequences( x_train_seq, maxlen=100 )
x_test_final = sequence.pad_sequences( x_test_seq, maxlen=100 )
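
To illustrate what `maxlen=100` does to each review, here is a minimal pure-Python sketch of the default behaviour of `pad_sequences` for a single sequence, assuming the defaults `padding='pre'` and `truncating='pre'` with a padding value of 0. The name `pad_pre` is hypothetical and the inputs are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Imitate the assumed pad_sequences defaults for one sequence."""
    seq = seq[-maxlen:]                          # long input: keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # short input: left-pad with zeros

print(pad_pre([5, 8, 3], 5))           # [0, 0, 5, 8, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With these defaults, a review longer than 100 tokens keeps only its final 100 tokens, so the model sees the end of long reviews rather than the beginning.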

#### Import the model and layer classes

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Bidirectional

#### Import RNN

In [9]:
from keras.layers import LSTM

In [10]:
model = Sequential()

#### Add Embedding layer

In [11]:
model.add( Embedding(output_dim=32, input_dim=2000, input_length=100))
model.add( Dropout(0.35) )




#### Add LSTM and other layers

In [12]:
model.add( Bidirectional(LSTM(units=16)))

In [13]:
model.add( Dense(units=256, activation='relu') )
model.add( Dropout(0.5) )

In [14]:
model.add( Dense(units=1, activation='sigmoid') )

In [15]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32)                6272      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 78,977
Trainable params: 78,977
Non-trainable params: 0
_________________________________________________________________
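
The parameter counts in the summary can be reproduced by hand. The sketch below recomputes them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative only.

```python
vocab_size, embed_dim = 2000, 32     # Embedding(input_dim=2000, output_dim=32)
lstm_units, dense_units = 16, 256    # LSTM(units=16), Dense(units=256)

embedding = vocab_size * embed_dim
# One LSTM direction: 4 gates x (input weights + recurrent weights + bias) x units
one_lstm = 4 * (embed_dim + lstm_units + 1) * lstm_units
bidirectional = 2 * one_lstm         # forward and backward copies
# The Bidirectional layer concatenates both directions: 2 * 16 = 32 inputs
dense_1 = (2 * lstm_units) * dense_units + dense_units
dense_2 = dense_units * 1 + 1
total = embedding + bidirectional + dense_1 + dense_2

print(embedding, bidirectional, dense_1, dense_2, total)
# 64000 6272 8448 257 78977
```

The embedding table accounts for roughly 81% of the weights, so the vocabulary size is the main driver of model size here.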


#### Compile the model with the Adam optimizer

In [16]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])




#### Train the bidirectional LSTM model

In [17]:
train_history = model.fit( x_train_final, y_train, batch_size=200, epochs=30, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/30
 - 11s - loss: 0.5850 - acc: 0.6809 - val_loss: 0.4556 - val_acc: 0.8174
Epoch 2/30
 - 9s - loss: 0.3550 - acc: 0.8473 - val_loss: 0.6556 - val_acc: 0.7068
Epoch 3/30
 - 9s - loss: 0.3108 - acc: 0.8714 - val_loss: 0.5445 - val_acc: 0.7650
Epoch 4/30
 - 9s - loss: 0.2950 - acc: 0.8786 - val_loss: 0.6204 - val_acc: 0.7196
Epoch 5/30
 - 9s - loss: 0.2781 - acc: 0.8883 - val_loss: 0.4802 - val_acc: 0.7910
Epoch 6/30
 - 8s - loss: 0.2678 - acc: 0.8920 - val_loss: 0.5641 - val_acc: 0.7736
Epoch 7/30
 - 9s - loss: 0.2564 - acc: 0.8958 - val_loss: 0.5219 - val_acc: 0.7924
Epoch 8/30
 - 8s - loss: 0.2478 - acc: 0.9006 - val_loss: 0.4788 - val_acc: 0.7968
Epoch 9/30
 - 8s - loss: 0.2350 - acc: 0.9051 - val_loss: 0.5823 - val_acc: 0.7724
Epoch 10/30
 - 8s - loss: 0.2248 - acc: 0.9087 - val_loss: 0.6516 - val_acc: 0.7242
Epoch 11/30
 - 8s - loss: 0.2198 - acc: 0.9106 - val_loss: 0.7051 - val_acc: 0.7418
Epoch 12/30
 - 9s - loss: 0.2125 - 
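
In the log above, training loss keeps falling while validation loss is lowest in the first epoch, which suggests overfitting. The sketch below finds the best epoch from the logged validation losses and shows where a patience-based stopping rule would have halted. `best_epoch` and `stop_epoch` are hypothetical helpers written for illustration; in practice Keras offers an `EarlyStopping` callback with `monitor` and `patience` arguments for this purpose.

```python
# Validation losses for epochs 1-11, copied from the training log
val_loss = [0.4556, 0.6556, 0.5445, 0.6204, 0.4802, 0.5641,
            0.5219, 0.4788, 0.5823, 0.6516, 0.7051]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def stop_epoch(losses, patience):
    """Epoch at which training would stop if it halts after `patience`
    consecutive epochs without a new lowest loss."""
    best, waited = float('inf'), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)

print(best_epoch(val_loss))     # 1
print(stop_epoch(val_loss, 3))  # 4
```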

#### Evaluating the accuracy

In [18]:
scores = model.evaluate(x_test_final, y_test, verbose=1)
print(scores[1])

0.81972


In [19]:
predict = model.predict_classes(x_test_final)
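
For a model with a single sigmoid output, `predict_classes` amounts to thresholding the predicted probability at 0.5. A minimal sketch of that step in plain Python, using made-up probabilities and a hypothetical helper `to_classes`, can be useful on Keras versions where `predict_classes` is not available:

```python
def to_classes(probabilities, threshold=0.5):
    """Map sigmoid outputs to 0/1 labels; values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]

probs = [0.91, 0.50, 0.07, 0.63]  # made-up sigmoid outputs
print(to_classes(probs))  # [1, 0, 0, 1]
```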

In [20]:
predict[:10]

array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1]], dtype=int32)

In [21]:
predict_classes = predict.reshape(-1)
predict_classes[:10]

array([1, 1, 1, 1, 1, 1, 0, 1, 0, 1], dtype=int32)
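
Because `read_files` reads the positive reviews before the negative ones, the first ten test labels are all 1. The sketch below compares them with the ten predictions shown above; `accuracy` is a hypothetical helper, and ten samples are far too few to say anything about the model beyond illustrating the calculation.

```python
predicted = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]  # first ten predictions shown above
actual = [1] * 10                            # positive reviews come first in y_test

def accuracy(pred, true):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

print(accuracy(predicted, actual))  # 0.8
```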

In [22]:
SentimentDict = {1:'positive', 0:'negative'}

def display_test_Sentiment(i):
    """Print test review i with its true and predicted sentiment."""
    print(x_test_text[i])
    print('Ground truth:', SentimentDict[y_test[i]])
    print('Predict result:', SentimentDict[predict_classes[i]] + '\n\n')





In [23]:
for i in range(1,20):
    display_test_Sentiment(i)

This is a gem. As a Film Four production - the anticipated quality was indeed delivered. Shot with great style that reminded me some Errol Morris films, well arranged and simply gripping. It's long yet horrifying to the point it's excruciating. We know something bad happened (one can guess by the lack of participation of a person in the interviews) but we are compelled to see it, a bit like a car accident in slow motion. The story spans most conceivable aspects and unlike some documentaries did not try and refrain from showing the grimmer sides of the stories, as also dealing with the guilt of the people Don left behind him, wondering why they didn't stop him in time. It took me a few hours to get out of the melancholy that gripped me after seeing this very-well made documentary.
Ground truth: positive
Predict result: positive


I really like this show. It has drama, romance, and comedy all rolled into one. I am 28 and I am a married mother, so I can identify both with Lorelei's and Ro