# Module 5 Assignment Ian Feekes

This notebook covers the Module 5 assignment for Ian Feekes. I can be contacted at ifeekes@sandiego.edu (916-333-9381).

Please feel free to contact me if this work does not meet the rubric or expectations, and I will promptly and gratefully make the necessary adjustments.

The summary of my findings can be found at the bottom of this colab notebook, and all input and output files will be placed in the same [google drive folder](https://drive.google.com/drive/folders/1URUqCUnSGwVxGla79RQrdoi1867IuWdr?usp=sharing) in which this file resides.

## Initial Prompt

Sentiment analysis or uncovering sentiment from natural language data is a common application of natural language processing. In this assignment, you will build a Recurrent Neural network to predict whether a given film review is a positive one or a negative one.
Build a Recurrent Neural Network (RNN) based classifier that deciphers whether the sentiment of a movie review is positive or negative.
Instructions:
*   Import the dataset. Dataset – ‘imdb’ dataset comes prepackaged with the keras.datasets module; to import the dataset use: from keras.datasets import imdb
*   Data pre-processing to standardize the review length
*   Build the RNN model
*   Compile the RNN model
*   Split the dataset into train/validation sets
*   Fit the training dataset to the RNN model
*   Use the validation dataset to generate predictions
*   Visualize the actual data in conjunction with the predictions
*   Summarize your findings

Rubric:
The code to prepare the data, building the neural net and validating the accuracy of the model has been successfully executed and all the steps are well documented. In addition, the findings are clearly summarized.

## Initial Configuration

### Initial Imports

In [1]:
from keras.models import Sequential
# Embedding is imported from keras.layers; the keras.layers.embeddings path is not available in every Keras version
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing import sequence

import numpy as np

### Global Variables

In [2]:
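maxWords = 5000              # Vocabulary size: only the most frequent words are kept
maxReview = 500              # Every review is padded or truncated to this many tokens
embeddingVectorLength = 32   # Each word is mapped to a dense vector of 32 values
numEpochs = 5
batchSize = 64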

## Import Dataset

In [3]:
from keras.datasets import imdb

## Split the Dataset into Train/Validation Sets

In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=maxWords)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


### Exploratory Analysis

In [5]:
len(X_train)

25000

In [6]:
len(X_test)

25000

In [7]:
print("Categories:", np.unique(y_train))
print("Number of unique words:", len(np.unique(np.hstack(X_train))))

Categories: [0 1]
Number of unique words: 4998


## Data Pre-processing to Standardize the Review Length

The cell below standardizes the review length with `pad_sequences`. With the default settings, any sequence longer than `maxReview` is truncated to that length (keeping its last `maxReview` tokens), and any shorter sequence is padded with 0s at the front.

In [8]:
X_train = sequence.pad_sequences(X_train, maxlen = maxReview)
X_test = sequence.pad_sequences(X_test, maxlen = maxReview)
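
To make the padding behaviour concrete, here is a small NumPy sketch that mimics what `pad_sequences` does with its default settings: zeros are added at the front, and over-long sequences lose their earliest tokens. The helper name `pad_pre` and the toy reviews are hypothetical and are not part of Keras or of this notebook's data.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Illustrative re-implementation of the default pad_sequences behaviour."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        trimmed = list(seq)[-maxlen:]            # keep only the last maxlen tokens
        if trimmed:
            out[row, -len(trimmed):] = trimmed   # right-align; zeros stay on the left
    return out

# Hypothetical toy "reviews": one shorter and one longer than maxlen
toy_reviews = [[5, 6], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_reviews, maxlen=4)
print(padded)
```

The short review gains leading zeros, which matches the leading zeros visible in the padded `X_test` array further down.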

In [9]:
allData = np.concatenate((X_train, X_test), axis = 0)
allLabels = np.concatenate((y_train, y_test), axis = 0)

assert(len(allData) == len(allLabels))

len(allLabels)

50000

In [10]:
# Split the inputs 80:20: the first 40,000 rows for training, the last 10,000 held out
X_train = allData[0: -10000]
X_test = allData[-10000:]

# Split the labels the same way so they stay aligned with the inputs
y_train = allLabels[0: -10000]
y_test = allLabels[-10000:]

# Stop here if the split is not 80:20 or if inputs and labels do not line up
assert len(X_train) == len(y_train) == 40000
assert len(X_test) == len(y_test) == 10000
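
Note that the slices above take the last 10,000 rows of the concatenated array without shuffling, so the held-out set comes entirely from the original test partition. If a random 80:20 split is preferred, one option is sketched below. This is an alternative, not what the notebook ran: the function name `shuffled_split`, the seed, and the toy arrays standing in for `allData` and `allLabels` are all assumptions.

```python
import numpy as np

def shuffled_split(data, labels, test_fraction=0.2, seed=42):
    """Shuffle rows and labels together, then hold out test_fraction of them."""
    assert len(data) == len(labels)
    rng = np.random.default_rng(seed)            # fixed seed for repeatability
    order = rng.permutation(len(data))
    n_train = len(data) - int(round(len(data) * test_fraction))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

# Toy stand-ins for allData / allLabels: the label is the row value modulo 2
toy_data = np.arange(10).reshape(10, 1)
toy_labels = toy_data[:, 0] % 2

X_tr, X_te, y_tr, y_te = shuffled_split(toy_data, toy_labels)
print(X_tr.shape, X_te.shape)
```

Because the same index permutation is applied to both arrays, each row keeps its own label after shuffling.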

In [11]:
X_test

array([[   0,    0,    0, ...,  393,    7,   14],
       [   0,    0,    0, ...,  790,    7,  158],
       [   0,    0,    0, ...,  253,    5,  737],
       ...,
       [   0,    0,    0, ...,   21,  846,    2],
       [   0,    0,    0, ..., 2302,    7,  470],
       [   0,    0,    0, ...,   34, 2005, 2643]], dtype=int32)

In [12]:
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()]) 
decoded = " ".join( [reverse_index.get(i - 3, "#") for i in X_train[0]] )
print(decoded) 

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this fil
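
The `i - 3` in the cell above undoes the offset that `imdb.load_data` applies by default: ids 0, 1, and 2 are reserved for padding, start-of-sequence, and out-of-vocabulary tokens, and every word's rank is shifted up by 3. The sketch below spells this out with a hypothetical four-word `toy_index` standing in for `imdb.get_word_index()`; the names `RESERVED` and `decode` are illustrative only.

```python
# Hypothetical miniature word index (word -> frequency rank, starting at 1)
toy_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

OFFSET = 3  # default index_from used by imdb.load_data
RESERVED = {0: "<pad>", 1: "<start>", 2: "<oov>"}

reverse_index = {rank: word for word, rank in toy_index.items()}

def decode(encoded):
    """Map encoded ids back to tokens, naming the reserved ids explicitly."""
    tokens = []
    for i in encoded:
        if i in RESERVED:
            tokens.append(RESERVED[i])
        else:
            # Undo the offset before looking the rank up
            tokens.append(reverse_index.get(i - OFFSET, "<oov>"))
    return " ".join(tokens)

decoded = decode([0, 0, 1, 4, 5, 6, 2, 7])
print(decoded)  # <pad> <pad> <start> the film was <oov> brilliant
```

In the notebook's output, all three reserved ids are rendered as `#`, which is why the decoded review starts with a long run of `#` characters for the padding.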

## Build the RNN Model

In [13]:
model = Sequential()

In [14]:
model.add(Embedding(maxWords, embeddingVectorLength, input_length=maxReview))
model.add(LSTM(100))

# Output Layer: sigmoid activation function for positive or negative classification
model.add(Dense(1, activation='sigmoid'))

## Compile the RNN Model

In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
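
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard counting formulas; the helper names are hypothetical, and the constants mirror the notebook's global variables.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output
    return input_dim * outputs + outputs

counts = {
    "embedding": embedding_params(5000, 32),  # maxWords x embeddingVectorLength
    "lstm": lstm_params(32, 100),             # 32-dim inputs, 100 units
    "dense": dense_params(100, 1),            # 100 LSTM outputs -> 1 sigmoid unit
}
total = sum(counts.values())
print(counts, total)
```

The three values come out to 160,000, 53,200, and 101, for a total of 213,301, matching the summary table.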


## Fit the Training Dataset to the RNN Model

In [17]:
model.fit(X_train, y_train, epochs=numEpochs, batch_size = batchSize)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f56d8cdc790>

In [18]:
model.save("baseline.h5")

## Use the Validation Dataset to Generate Predictions



In [19]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Model accuracy on the IMDb dataset: {0:.2f}%".format(scores[1]*100))

Model accuracy on the IMDb dataset: 85.76%
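
The accuracy reported by `model.evaluate` is the fraction of reviews whose sigmoid output, thresholded at 0.5, matches the label. To compare actual labels with predictions in more detail, a confusion matrix can be tallied from the predicted probabilities. The sketch below uses hypothetical toy arrays in place of `y_test` and `model.predict(X_test)`; the function name `confusion_counts` is an assumption.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return a 2x2 matrix indexed [actual, predicted] and the accuracy."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = (np.asarray(y_prob).ravel() > threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual, predicted] += 1
    accuracy = np.trace(matrix) / matrix.sum()   # correct predictions lie on the diagonal
    return matrix, accuracy

# Toy stand-ins: labels and (n, 1)-shaped probabilities like those Keras returns
toy_labels = np.array([1, 0, 1, 0, 1, 0])
toy_probs = np.array([[0.85], [0.41], [0.30], [0.10], [0.95], [0.70]])

matrix, accuracy = confusion_counts(toy_labels, toy_probs)
print(matrix)
print("accuracy: {0:.2f}".format(accuracy))
```

Row 0 holds the negative reviews and row 1 the positive ones, so the off-diagonal cells show false positives and false negatives separately, which a single accuracy figure hides.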


## Visualize the Actual Data in Conjunction with Predictions

In [26]:
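# Encode a raw review string with the IMDb word index, pad it, and print the model's prediction.
# Note: ranks from imdb.get_word_index() are used as-is, without the +3 offset that imdb.load_data applies.
def user_input_processing(review):
    vec = []
    for word in review.split(" "):
        # Drop a trailing period so "film." matches the index entry "film"
        if word.endswith("."):
            word = word[:-1]
        vec.append(index[word.lower()])  # Raises KeyError for words missing from the index
    vec_padded = sequence.pad_sequences([vec], maxlen=maxReview)
    print(review, model.predict(vec_padded))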

In [30]:
user_input_processing("Quite a nice film. I really enjoyed the ending.")

Quite a nice film. I really enjoyed the ending. [[0.8481015]]


In [31]:
user_input_processing("That was absolutely terrible. The worst film I've seen in my life")

That was absolutely terrible. The worst film I've seen in my life [[0.40989488]]
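
One caveat with `user_input_processing`: the training data was encoded with a start token, an offset of 3 on every word rank, and a vocabulary cap of `maxWords`, while the function feeds raw ranks to the model. A sketch of an encoder that mirrors the default `imdb.load_data` encoding follows. The function name `encode_review` and the miniature `toy_index` are hypothetical; in the notebook, `index` from `imdb.get_word_index()` would be passed instead.

```python
import string

# Hypothetical miniature word index (word -> frequency rank)
toy_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4, "masterpiece": 9000}

def encode_review(review, word_index, max_words=5000, offset=3, start_id=1, oov_id=2):
    """Encode text the way imdb.load_data encodes reviews by default."""
    encoded = [start_id]
    for raw in review.lower().split():
        word = raw.strip(string.punctuation)     # "brilliant," -> "brilliant"
        if not word:
            continue
        rank = word_index.get(word)
        token = rank + offset if rank is not None else oov_id
        # Ids at or above max_words fall outside the embedding table
        encoded.append(token if token < max_words else oov_id)
    return encoded

encoded = encode_review("The film was BRILLIANT, a masterpiece.", toy_index)
print(encoded)  # [1, 4, 5, 6, 7, 2, 2]
```

The resulting list can be passed to `sequence.pad_sequences([encoded], maxlen=maxReview)` before calling `model.predict`. Unknown or too-rare words map to the out-of-vocabulary id instead of raising a `KeyError`.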


## Summarize Findings

The model is a stack of three layers, each passing its output to the next. The embedding layer takes the padded integer sequences as input, the LSTM layer reads the embedded sequence, and the final dense layer with a sigmoid activation outputs the probability that the review is positive.

The embedding layer converts each word index into a dense vector. A vector length of 32 worked best in training and was kept as a safe choice.

Increasing the number of LSTM units also improved accuracy significantly. The baseline was initially evaluated with 10 units; increasing the LSTM to 100 units gave an additional improvement of about 2% in prediction accuracy.

The model predicts the held-out test data with an accuracy of 85.76%. It also behaves sensibly on newly written reviews: the positive example scored 0.85 and the negative example scored 0.41. Tuning the number of epochs to allow for more training improved the final model's performance by about 5%.

This accuracy could be improved further by stacking LSTM layers back to back. Increasing the vocabulary size or the amount of training data may also improve the model's classification performance.