Sentiment analysis of IMDB movie reviews, using the dataset released with the ACL 2011 paper;
see http://ai.stanford.edu/~amaas/data/sentiment/
This notebook uses neural network models.

The plan is to compare a variety of hyperparameters, vectorization techniques, and neural network architectures:
* dense neural network with bag of words
* dense neural network with fixed size input and words mapped to integers
* LSTM
* CNN


### Table of Contents<a class="anchor" id="table"></a>
* [Load data](#load)
* [Train different architectures](#train)
    * [Train NN 50 - 10 - 1](#train1)
    * [Train NN 256 - 128 - 1](#train2)
    * [Train NN with K-Fold cross validation](#kfold)
    * [Train RNN](#rnn)
* [Optimize](#opti)
    * [Optimize on dropouts](#opti_d)
        * no dropout
        * low dropout on 1 layer
        * high dropout on 1 layer
        * low dropout on 2 layers
        * high dropout on 2 layers
        * [Observation](#opti_d_o)

In [1]:
!pip install wget

Collecting wget
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x107294550>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)': /simple/wget/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x1072942d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)': /simple/wget/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x107294310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)': /simple/wget

In [9]:
import numpy as np
import os
import os.path
import glob
import time

import pandas as pd
import matplotlib.pyplot as plt

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import nltk
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import layers
from tensorflow.keras import models
nltk.download('punkt')
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to /Users/swami/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
from sklearn.model_selection import KFold

In [11]:
import wget
import tarfile

# By checking if the directory exists first, we allow people to delete the tarfile without the notebook re-downloading it
if os.path.isdir('aclImdb'):
    print("Dataset directory exists, taking no action")
else:    
    if not os.path.isfile('aclImdb_v1.tar.gz'):
        print("Downloading dataset")
        #!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
        wget.download('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
    else:
        print("Dataset already downloaded")
    
    print("Unpacking dataset")
    #!tar -xf aclImdb_v1.tar.gz 
    tar = tarfile.open("aclImdb_v1.tar.gz")
    tar.extractall()
    tar.close()
    print("Dataset unpacked in aclImdb")

Dataset directory exists, taking no action


In [12]:
# configuration
SAMPLE_SIZE=1000

In [13]:
def hasGPU():
  device_name = tf.test.gpu_device_name()
  if device_name != '/device:GPU:0':
    print('No GPU found')
    # Python booleans are capitalized: False / True
    return False
  else:
    print('Found GPU at: {}'.format(device_name))
    return True

In [14]:
hasGPU()

No GPU found


<a href='#table'>Back</a>
# Load data<a class="anchor" id="load"></a>

## Create a dense vector from reviews 

In [6]:
time_beginning_of_notebook = time.time()
positive_file_list = glob.glob(os.path.join('aclImdb/train/pos', "*.txt"))
positive_sample_file_list = positive_file_list[:SAMPLE_SIZE]

negative_file_list = glob.glob(os.path.join('aclImdb/train/neg', "*.txt"))
negative_sample_file_list = negative_file_list[:SAMPLE_SIZE]

import io
import re

# load doc into memory
# regex to clean markup elements 
def load_doc(filename):
    # io.open accepts the encoding argument on both Python 2 and Python 3;
    # the with-block closes the file even if reading fails
    with io.open(filename, 'r', encoding='utf8') as file:
        # read all text, replacing markup tags with a space
        text = re.sub('<[^>]*>', ' ', file.read())
    return text

In [7]:
positive_strings = [load_doc(x) for x in positive_sample_file_list]
negative_strings = [load_doc(x) for x in negative_sample_file_list]

positive_tokenized = [word_tokenize(s) for s in positive_strings]
negative_tokenized = [word_tokenize(s) for s in negative_strings]

In [None]:
from collections import Counter
import numpy as np

In [None]:
total_counts = Counter()
all_reviews = positive_tokenized + negative_tokenized
for r in all_reviews:
    for word in r:
        total_counts[word] += 1

In [None]:
vocab = set(total_counts.keys())

In [None]:
vocab_size = len(vocab)
print(vocab_size)

In [None]:
# Create a dictionary of words in the vocabulary mapped to index positions
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i

In [None]:
print("ID of 'movie' = {}".format(word2index['movie']))

In [None]:
def convert_to_bag(review):
    bag = np.zeros(vocab_size)
    for word in review:
        i = word2index[word]
        bag[i]+=1
    return bag
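
The bag-of-words encoding above can be checked on a toy example. The sketch below is a minimal, standard-library-only illustration and not the notebook's exact code: the names `build_word2index`, `to_bag`, and `toy_reviews` are hypothetical, the vocabulary is sorted so that indices are reproducible (the notebook iterates over a `set`, whose order is arbitrary), and unknown words are skipped instead of raising a `KeyError`.

```python
from collections import Counter

def build_word2index(tokenized_reviews):
    """Map every distinct token to an integer index (sorted for reproducibility)."""
    vocab = sorted({word for review in tokenized_reviews for word in review})
    return {word: i for i, word in enumerate(vocab)}

def to_bag(review, word2index):
    """Return a count vector for one tokenized review.

    Assumption: tokens missing from the vocabulary are ignored.
    """
    bag = [0] * len(word2index)
    for word, count in Counter(review).items():
        i = word2index.get(word)
        if i is not None:
            bag[i] = count
    return bag

# Toy data, made up for illustration only
toy_reviews = [["good", "movie", "good"], ["bad", "movie"]]
toy_index = build_word2index(toy_reviews)
toy_bags = [to_bag(r, toy_index) for r in toy_reviews]
print(toy_index)
print(toy_bags)
```

Each vector has one slot per vocabulary word, so word order is lost and only the counts remain.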

In [None]:
test_bag = convert_to_bag(all_reviews[0])


In [None]:
all_reviews_encoded = [convert_to_bag(x) for x in all_reviews]

In [None]:
all_reviews_encoded[0].shape

In [None]:
#all_reviews_trunc = np.trunc()

In [None]:
# display the map of words to indices
# print("word indexes = {}".format(word2index))

In [None]:
import random

positive_labels = []
for i in range(len(positive_tokenized)):
    positive_labels.append('POSITIVE')
negative_labels = []
for i in range(len(negative_tokenized)):
    negative_labels.append('NEGATIVE')
   

In [None]:
labels = positive_labels + negative_labels

num_labels = []

for val in labels:
    if val == 'POSITIVE':
        num_labels.append(1)
    else:
        num_labels.append(0)


In [None]:
reviews_and_labels = list(zip(all_reviews_encoded, num_labels))
random.shuffle(reviews_and_labels)
reviews, labels = zip(*reviews_and_labels)

In [None]:
labels = np.array(labels)

## Create a sparse matrix from reviews (where we keep the order of the words)

In [None]:
positive_strings[0]

In [None]:
# Use a separate name so that the shuffled bag-of-words `reviews` and `labels`
# built above are not overwritten before the dense models are trained
text_reviews=[]
for sentence in positive_strings:
    text_reviews.append([sentence,1])
for sentence in negative_strings:
    text_reviews.append([sentence,0])
random.shuffle(text_reviews)


In [None]:
tokenizer = keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
                                   lower=True, split=' ', char_level=False, oov_token=None, document_count=0)


In [None]:
len(text_reviews)
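
The `Tokenizer` above is created but never fitted in this notebook. As a rough illustration of what an order-preserving, fixed-size integer encoding looks like, here is a plain-Python sketch. The helpers `fit_index` and `encode` and the toy texts are hypothetical, and the padding convention is an assumption: this sketch pads and truncates at the end for readability, whereas, as far as I know, Keras `pad_sequences` defaults to padding and truncating at the front.

```python
def fit_index(texts):
    """Assign integer ids starting at 1, most frequent word first.

    Id 0 is reserved for padding. Ties are broken alphabetically.
    """
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode(text, index, maxlen):
    """Turn a text into a fixed-length list of word ids, keeping word order."""
    ids = [index[w] for w in text.lower().split() if w in index]
    ids = ids[:maxlen]                       # truncate long reviews
    return ids + [0] * (maxlen - len(ids))   # pad short reviews with 0

# Toy data, made up for illustration only
toy_texts = ["a great great film", "a dull film"]
toy_word_index = fit_index(toy_texts)
toy_sequences = [encode(t, toy_word_index, maxlen=5) for t in toy_texts]
print(toy_word_index)
print(toy_sequences)
```

Unlike the bag-of-words vector, each sequence keeps the position of every word, which is what recurrent and convolutional models rely on.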

<a href='#table'>Back</a>
# Train models<a class="anchor" id="train"></a>
## Train NN 50 - 10 - 1 <a class="anchor" id="train1"></a>


In [None]:
X_train, X_test, y_train, y_test = train_test_split(np.array(reviews), np.array(labels), test_size=0.25)

In [None]:
model = keras.Sequential()
model.add(layers.Dense(50, activation = "relu", input_shape=(vocab_size, )))
model.add(layers.Dense(10, activation = "relu"))
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()

In [None]:
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)

In [None]:
results = model.fit(
    X_train, y_train,
    epochs=20,
    validation_data=(X_test, y_test),
    batch_size=500
)

## Train NN 256 - 128 - 1 <a class="anchor" id="train2"></a>

In [None]:
model = keras.Sequential([
    layers.Dense(256, activation = "relu", input_shape=(vocab_size, )),
    layers.Dense(128, activation = "relu"),
    layers.Dense(1, activation = "sigmoid")
])

In [None]:
model.summary()

In [None]:
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)

In [None]:
results = model.fit(
    X_train, y_train,
    epochs=1,
    validation_data=(X_test, y_test),
    batch_size=500
)

In [None]:
results.history

## Train NN with K-Fold cross validation <a class="anchor" id="kfold"></a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(np.array(reviews), np.array(labels), test_size=0.25)

In [None]:
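# keyword arguments: recent scikit-learn versions reject positional shuffle/random_state
kfold = KFold(n_splits=3, shuffle=True, random_state=1)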
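
To make explicit what a 3-fold split hands to the training loop, here is a minimal standard-library sketch of K-fold index generation. It is an illustration under stated assumptions, not scikit-learn's implementation: `kfold_indices` is a hypothetical helper, and its shuffle order will differ from `KFold` even with the same seed. The fold sizing (the first `n_samples % n_splits` folds get one extra sample) follows the behaviour documented for `KFold`.

```python
import random

def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_indices, test_indices) pairs for shuffled K-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves once as the validation set; the others form the training set
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in splits:
    print(len(train), len(test))
```

Every sample lands in exactly one validation fold, so averaging the per-fold validation accuracy uses all of the training data for evaluation once.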

In [None]:
train_data = list(zip(X_train, y_train))

In [None]:
train_data[0][0].shape

In [None]:
histories=[]
for train_indices, test_indices in kfold.split(X_train,y=y_train):
    model = keras.Sequential([
    layers.Dense(256, activation = "relu", input_shape=(vocab_size, )),
    layers.Dense(128, activation = "relu"),
    layers.Dense(1, activation = "sigmoid")
    ])
    model.compile(
     optimizer = "adam",
     loss = "binary_crossentropy",
     metrics = ["accuracy"]
    )
    K_X_train = X_train[train_indices]
    K_y_train = y_train[train_indices]
    K_X_test = X_train[test_indices]
    K_y_test = y_train[test_indices]
    results=model.fit(
        K_X_train, K_y_train,
        epochs= 5,
        validation_data=(K_X_test, K_y_test),
        batch_size=1000
    )
    histories.append(results.history)

In [None]:

df = pd.DataFrame(data=histories)
for col in df.columns:
    df[col] =  df[col].apply(lambda x: x[-1])
plot=df[["acc","val_acc"]].plot()
plot.set_ylim([0,1])

means=df[["acc","val_acc"]].mean()
print("mean acc: {}, mean val_acc: {}".format(means["acc"],means["val_acc"]))
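
One caveat about the cell above: the history keys depend on the Keras version. Older `tf.keras` releases report `acc` / `val_acc` for `metrics=["accuracy"]`, while TensorFlow 2.x reports `accuracy` / `val_accuracy`. The sketch below is a hedged, version-tolerant way to average the last-epoch metrics over folds; `final_metric`, `mean_final_metrics`, and the toy histories are hypothetical and the numbers are made up, not results from this notebook.

```python
def final_metric(history, *candidates):
    """Return the last-epoch value of the first candidate key found in history."""
    for key in candidates:
        if key in history:
            return history[key][-1]
    raise KeyError("none of {} found in history".format(candidates))

def mean_final_metrics(histories):
    """Average last-epoch training and validation accuracy over all folds."""
    n = len(histories)
    acc = sum(final_metric(h, "acc", "accuracy") for h in histories) / n
    val_acc = sum(final_metric(h, "val_acc", "val_accuracy") for h in histories) / n
    return {"acc": acc, "val_acc": val_acc}

# Made-up histories in both naming styles, for illustration only
toy_histories = [
    {"accuracy": [0.6, 0.8], "val_accuracy": [0.5, 0.7]},
    {"acc": [0.7, 0.9], "val_acc": [0.6, 0.8]},
]
toy_means = mean_final_metrics(toy_histories)
print(toy_means)
```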


## Train RNN <a class="anchor" id="rnn"></a>

In [None]:
X_train

<a href='#table'>Back</a>
# Optimize<a class="anchor" id="opti"></a>

## Optimize on dropout<a class="anchor" id="opti_d"></a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(np.array(reviews), np.array(labels), test_size=0.25)

In [None]:
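# keyword arguments: recent scikit-learn versions reject positional shuffle/random_state
kfold = KFold(n_splits=3, shuffle=True, random_state=1)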

In [None]:
train_data = list(zip(X_train, y_train))

In [None]:
# train_data[0][0].shape

In [None]:
from pdb import set_trace

def getMeansFromResultsHistory(histories):
  df = pd.DataFrame(data=histories)
  for col in df.columns:
      df[col] =  df[col].apply(lambda x: x[-1])
  means=df[["acc","val_acc"]].mean()
  return means

def trainModelWithDropoutOn1Layer(epochs_nb=5,rate=0.0):
  histories=[]
  for train_indices, test_indices in kfold.split(X_train,y=y_train):
      model = keras.Sequential([
      layers.Dense(256, activation = "relu", input_shape=(vocab_size, )),
      layers.Dropout(rate),
      layers.Dense(128, activation = "relu"),
      layers.Dense(1, activation = "sigmoid")
      ])
      model.compile(
       optimizer = "adam",
       loss = "binary_crossentropy",
       metrics = ["accuracy"]
      )
      K_X_train = X_train[train_indices]
      K_y_train = y_train[train_indices]
      K_X_test = X_train[test_indices]
      K_y_test = y_train[test_indices]
      results=model.fit(
          K_X_train, K_y_train,
          epochs= epochs_nb,
          validation_data=(K_X_test, K_y_test),
          batch_size=1000
      )
      histories.append(results.history)
#       set_trace()
  
  means= getMeansFromResultsHistory(histories)
  print(means) 
  return means


def trainModelWithDropoutOn2Layers(epochs_nb=5,rate=0.0):
  histories=[]
  for train_indices, test_indices in kfold.split(X_train,y=y_train):
      model = keras.Sequential([
      layers.Dense(256, activation = "relu", input_shape=(vocab_size, )),
      layers.Dropout(rate),
      layers.Dense(128, activation = "relu"),
      layers.Dropout(rate),
      layers.Dense(1, activation = "sigmoid")
      ])
      model.compile(
       optimizer = "adam",
       loss = "binary_crossentropy",
       metrics = ["accuracy"]
      )
      K_X_train = X_train[train_indices]
      K_y_train = y_train[train_indices]
      K_X_test = X_train[test_indices]
      K_y_test = y_train[test_indices]
      results=model.fit(
          K_X_train, K_y_train,
          epochs= epochs_nb,
          validation_data=(K_X_test, K_y_test),
          batch_size=1000
      )
      histories.append(results.history)
#       set_trace()
  
  means= getMeansFromResultsHistory(histories)
  print(means) 
  return means
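
For readers unfamiliar with what `layers.Dropout(rate)` does during training, here is a minimal plain-Python sketch of inverted dropout. It is an illustration only, not Keras code: the function name `dropout` and the toy inputs are hypothetical. Each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so that the expected sum is unchanged, which matches the behaviour described in the Keras documentation.

```python
import random

def dropout(values, rate, rng):
    """Apply inverted dropout to a list of activations (training-time behaviour)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Drop each value with probability `rate`; rescale the ones that are kept
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.4, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(mean_after)
```

At inference time the layer is a no-op, which is why the rescaling is applied during training instead.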

In [None]:
dropout_means=[]

### No dropout


In [None]:
rate=0.0
means=trainModelWithDropoutOn1Layer(epochs_nb=5,rate=rate)
dropout_means.append([means["acc"],means["val_acc"], rate,0])

### Low dropout on 1 layer

In [None]:

rate=0.2
means=trainModelWithDropoutOn1Layer(epochs_nb=5,rate=rate)
dropout_means.append([means["acc"],means["val_acc"], rate,1])

### High dropout on 1 layer


In [None]:
rate=0.4
means=trainModelWithDropoutOn1Layer(epochs_nb=5,rate=rate)
dropout_means.append([means["acc"],means["val_acc"], rate,1])

### Low dropout on 2 layers

In [None]:

rate=0.2
means=trainModelWithDropoutOn2Layers(epochs_nb=5,rate=rate)
dropout_means.append([means["acc"],means["val_acc"], rate,2])

### High dropout on 2 layers


In [None]:
rate=0.4
means=trainModelWithDropoutOn2Layers(epochs_nb=5,rate=rate)
dropout_means.append([means["acc"],means["val_acc"], rate,2])

Plot results

In [None]:

df = pd.DataFrame(data=dropout_means,columns=['acc','val_acc','rate','nb_layers'])
plt.rcParams["figure.figsize"] = [17,2]
plot=df[["acc","val_acc"]].plot()
plot.set_ylim([0.7,1])
plot.grid()

plt.rcParams["figure.figsize"] = [17,1.5]
plot=df[["rate"]].plot()
plot.set_ylim([0,0.5])
plot.grid()

plt.rcParams["figure.figsize"] = [17,1.5]
plot=df[["nb_layers"]].plot()
plot.set_ylim([0,2.2])
plot.grid()

# means=df[["acc","val_acc"]].mean()
# print("mean acc: {}, mean val_acc: {}".format(means["acc"],means["val_acc"]))


In [None]:
dropout_means

### Observation<a class="anchor" id="opti_d_o"></a>
The results are similar across configurations, but low dropout on all layers gave a higher test accuracy and less overfitting (the training and test accuracies are closer).
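
The comparison behind this observation can be made explicit by computing the gap between training and validation accuracy for each configuration. The sketch below assumes rows shaped like the entries of `dropout_means`, that is `[acc, val_acc, rate, nb_layers]`; the helper `summarize` is hypothetical and the numbers in `toy_rows` are invented placeholders, not results from this notebook.

```python
def summarize(rows):
    """Compute the train/validation gap for each [acc, val_acc, rate, nb_layers] row."""
    summary = []
    for acc, val_acc, rate, nb_layers in rows:
        summary.append({
            "rate": rate,
            "nb_layers": nb_layers,
            "val_acc": val_acc,
            "gap": acc - val_acc,  # a larger gap suggests more overfitting
        })
    return summary

# Invented placeholder values, for illustration only
toy_rows = [
    [0.99, 0.80, 0.0, 0],
    [0.95, 0.85, 0.2, 2],
]
summary = summarize(toy_rows)
best = max(summary, key=lambda row: row["val_acc"])
print(best)
```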