<a href="https://colab.research.google.com/github/lmodahl/PSYCH-755-Final/blob/main/psych755_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In this notebook, I use the IMDB sentiment analysis dataset that ships with keras to build a neural network that classifies movie reviews as positive or negative. The dataset contains 50,000 IMDB reviews, each labeled as positive or negative, split evenly into 25,000 training and 25,000 test reviews.

# Libraries

In [51]:
import tensorflow as tf # TensorFlow, which bundles keras
from tensorflow import keras # keras includes the dataset and other neural network tools we're using
from keras.datasets import imdb # imdb dataset
import numpy as np # for defining arrays
from keras.models import Sequential # for modeling
from keras.wrappers.scikit_learn import KerasClassifier # scikit-learn style wrapper for the model (deprecated in newer keras releases)
from keras import layers # for defining layers in model
from keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense # layer types used in the model
from sklearn.model_selection import ParameterGrid # for creating the model parameter grid
from keras.callbacks import EarlyStopping # stops training early when validation accuracy stops improving
from sklearn.metrics import accuracy_score # for calculating test accuracy

# Splitting data into train and test sets

In [7]:
(X_train, y_train), (X_test, y_test) = imdb.load_data()

# Decoding the reviews
This dataset can be confusing at first because the text of each review is integer-encoded: each review is stored as a sequence of integers, and each integer stands for a specific word in the dataset's vocabulary. To read the reviews, we need to map the integers back to words, which we do below.
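As a toy illustration of integer encoding, the sketch below builds a tiny vocabulary and converts a sentence to integers and back. The vocabulary, the helper names (`toy_encode`, `toy_decode`), and the sentence are made up for this example; only the convention of shifting word ranks by 3 to make room for the `<PAD>`, `<START>`, and `<UNK>` tokens mirrors what the IMDB dataset does.

```python
# Hypothetical toy vocabulary (not the real IMDB index): word -> frequency rank
toy_word_index = {"the": 1, "film": 2, "great": 3}

toy_index_from = 3  # ranks are shifted by 3 to make room for the special tokens
toy_encoder = {word: rank + toy_index_from for word, rank in toy_word_index.items()}
toy_encoder.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
toy_decoder = {index: word for word, index in toy_encoder.items()}

def toy_encode(text):
    """Turn a whitespace-separated string into integers, using <UNK> for unknown words."""
    unk = toy_encoder["<UNK>"]
    return [toy_encoder["<START>"]] + [toy_encoder.get(w, unk) for w in text.split()]

def toy_decode(indices):
    """Turn a list of integers back into a string."""
    return " ".join(toy_decoder.get(i, "?") for i in indices)

toy_encoded = toy_encode("the film was great")
toy_decoded = toy_decode(toy_encoded)
print(toy_encoded)  # [1, 4, 5, 2, 6]
print(toy_decoded)  # <START> the film <UNK> great
```

The word "was" is not in the toy vocabulary, so it is encoded as 2 and decodes to `<UNK>`.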

## Example review

In [3]:
print(X_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


## Decoding from integer to text

In [4]:
word_index = imdb.get_word_index()
index_from = 3
word_index = {key:(value+index_from) for key,value in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2

reverse_word_index = {value:key for key, value in word_index.items()}

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# source: https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json

## Same review as above, but now decoded

In [8]:
decode_review(X_train[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and sh

# Defining train, validation, and test sequences

In [11]:
total_words = 2500 # Only using the 2500 most common words (you can use any amount, I just decided to use 2500)

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words= total_words)

# hold out the last 2500 training reviews as a validation set
X_train, X_val = X_train[:-2500], X_train[-2500:]
y_train, y_val = y_train[:-2500], y_train[-2500:]

print(len(X_train)) # number of training sequences
print(len(X_val)) # number of validation sequences
print(len(X_test)) # number of test sequences

22500
2500
25000
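As far as I understand it, the num_words argument works by replacing every integer at or above the cutoff with the `<UNK>` index (2). The sketch below imitates that rule in plain Python on the first twelve integers of the example review; the helper `cap_vocabulary` is a hypothetical name for illustration, not part of keras.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index at or above num_words with the <UNK> index."""
    return [i if i < num_words else oov_index for i in sequence]

# First twelve integers of the example review printed earlier in the notebook
first_tokens = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468]
capped_tokens = cap_vocabulary(first_tokens, num_words=2500)
print(capped_tokens)  # [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 2]
```

Only the last integer (4468) exceeds the cutoff, so it is the only one that becomes `<UNK>`.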


## Same review as before, but now with less common words replaced by `<UNK>`

In [12]:
decode_review(X_train[0])

"<START> this film was just brilliant casting location scenery story direction <UNK> really <UNK> the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same <UNK> island as myself so i loved the fact there was a real connection with this film the witty <UNK> throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really <UNK> at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of <UNK> and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> for the whole film but these children are amazing and should be <UNK> for what they have done 

# Standardizing the length of reviews
Here, we use the pad_sequences function from keras to standardize the length of the reviews in this dataset. Review lengths vary greatly, while the network expects inputs of a fixed length, so shorter reviews are padded with `<PAD>` and longer ones are truncated to the limit.

In [25]:
max_sequence_length = 500 # setting a 500 word limit for all reviews

X_train = keras.preprocessing.sequence.pad_sequences(X_train, value= word_index["<PAD>"], padding= 'post', maxlen= max_sequence_length)
X_val = keras.preprocessing.sequence.pad_sequences(X_val, value= word_index["<PAD>"], padding= 'post', maxlen= max_sequence_length)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, value= word_index["<PAD>"], padding= 'post', maxlen= max_sequence_length)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

# source: https://note.com/mlai/n/ndd0643e2a843

(22500, 500)
(2500, 500)
(25000, 500)
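To make the effect of padding concrete, here is a small plain-Python sketch of what happens to a single review. It assumes the settings used above: padding is added at the end (padding='post'), and reviews that are too long lose tokens from the beginning, which is the keras default (truncating='pre'). The helper `pad_post` is a hypothetical stand-in, not the keras implementation.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Pad at the end; if too long, keep only the last maxlen items."""
    trimmed = sequence[-maxlen:]  # drops items from the front when too long
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_review = pad_post([1, 14, 22], maxlen=5)
long_review = pad_post([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short_review)  # [1, 14, 22, 0, 0]
print(long_review)   # [22, 16, 43, 530, 973]
```

Note that under these settings a long review loses its opening words, including the `<START>` token, rather than its ending.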


# Modeling

In [27]:
embedding_dim = 16

def create_model(filters = 64, kernel_size = 3, strides=1, units = 256,
                 optimizer='adam', rate = 0.25, kernel_initializer ='glorot_uniform'): # tunable hyperparameters
    model = Sequential()

    model.add(Embedding(total_words, embedding_dim, input_length= max_sequence_length)) # embedding layer

    model.add(Dropout(rate)) # dropout on the embeddings
    model.add(Conv1D(filters = filters, kernel_size = kernel_size, strides= strides,
                     padding='same', activation= 'relu')) # convolutional layer

    model.add(GlobalMaxPooling1D()) # max pooling

    model.add(Dense(units = units, activation= 'relu', kernel_initializer= kernel_initializer)) # dense layer
    model.add(Dropout(rate)) # dropout

    model.add(Dense(1, activation= 'sigmoid')) # output layer

    model.compile(loss='binary_crossentropy',
                  optimizer= optimizer,
                  metrics=['accuracy']) # compile the model
    return model

model = KerasClassifier(build_fn= create_model)
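To get a feel for the size of this network, the sketch below tallies the trainable parameters of each layer by hand for the default arguments of create_model. The helper `count_parameters` is my own illustration using the usual weights-plus-biases rules, not output from keras; the Dropout and GlobalMaxPooling1D layers have no trainable parameters and are left out.

```python
def count_parameters(vocab_size=2500, embedding_dim=16, filters=64,
                     kernel_size=3, units=256):
    """Hand-computed trainable parameter counts per layer (hypothetical helper)."""
    counts = {
        # one embedding vector per word in the vocabulary
        "embedding": vocab_size * embedding_dim,
        # each filter spans kernel_size positions x embedding_dim channels, plus a bias
        "conv1d": kernel_size * embedding_dim * filters + filters,
        # global max pooling leaves one value per filter as input to the dense layer
        "dense": filters * units + units,
        # single sigmoid output unit
        "output": units + 1,
    }
    counts["total"] = sum(counts.values())
    return counts

default_counts = count_parameters()
print(default_counts)
# {'embedding': 40000, 'conv1d': 3136, 'dense': 16640, 'output': 257, 'total': 60033}
```

Most of the parameters sit in the embedding layer, which is one reason the vocabulary size and embedding dimension matter for model size.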



## Tuning hyperparameters

In [32]:
# hyperparameters
filters = [128]
kernel_size = [5]
strides = [1]
Dense_units = [128, 512]
kernel_initializer = ['TruncatedNormal']
rate_dropouts = [0.25]
optimizers = ['adam']
epochs = [5]
batches = [64]

# parameter grid search
param_grid = dict(optimizer = optimizers, epochs = epochs, batch_size = batches,
                  filters = filters, kernel_size = kernel_size, strides = strides,
                  units = Dense_units, kernel_initializer = kernel_initializer, rate = rate_dropouts)

grid = ParameterGrid(param_grid)
param_sets = list(grid)

param_scores = []
for params in grid:

    print(params)
    model.set_params(**params)

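    # stop as soon as validation accuracy fails to improve (patience = 0)
    earlystopper = EarlyStopping(monitor='val_accuracy', patience= 0, verbose=1)

    history = model.fit(X_train, y_train,
                        shuffle= True,
                        validation_data=(X_val, y_val),
                        callbacks= [earlystopper])

    # score this parameter set by the validation accuracy of its final epoch
    param_score = history.history['val_accuracy']
    param_scores.append(param_score[-1])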
    print('+-'*50)

print('param_scores:', param_scores)

# Choose best parameters
p = np.argmax(np.array(param_scores))
best_params = param_sets[p]
print("best parameter set", best_params)

{'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 128}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 5: early stopping
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
{'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 512}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 4: early stopping
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
param_scores: [0.8831999897956848, 0.8795999884605408]
best parameter set {'batch_size': 64, 'epochs': 5, 'filters': 128, 'kernel_initializer': 'TruncatedNormal', 'kernel_size': 5, 'optimizer': 'adam', 'rate': 0.25, 'strides': 1, 'units': 128}
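The grid search above tries two parameter sets because only the number of dense units has more than one candidate value. The sketch below shows roughly how a dictionary of candidate lists expands into every combination, using itertools.product on a shortened toy grid. The helper `expand_grid` is a hypothetical stand-in for illustration, not the scikit-learn implementation.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of the candidate values."""
    keys = sorted(param_grid)  # fixed key order so the results are reproducible
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

# Shortened toy grid: only 'units' has more than one candidate
toy_grid = {"units": [128, 512], "filters": [128], "epochs": [5]}
toy_param_sets = expand_grid(toy_grid)
for toy_params in toy_param_sets:
    print(toy_params)
# {'epochs': 5, 'filters': 128, 'units': 128}
# {'epochs': 5, 'filters': 128, 'units': 512}
```

The number of combinations is the product of the list lengths, so adding a second candidate to just two more hyperparameters would already mean eight models to train.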


## Running model with best parameters

In [33]:
# refit on the combined training and validation data using the best parameters
model.set_params(**best_params)
model.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7abf7d9bba00>

## Test accuracy of model

In [52]:
print("Test accuracy = %f%%" % (accuracy_score(y_test, model.predict(X_test))*100))

Test accuracy = 81.876000%
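For reference, accuracy is simply the share of reviews whose predicted label matches the true label. The sketch below walks through that calculation on made-up numbers: the probabilities and labels are hypothetical, and the 0.5 cutoff for turning a sigmoid output into a class label is the conventional choice that I am assuming here.

```python
# Hypothetical sigmoid outputs and true labels for four reviews
toy_probabilities = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 0]

# A probability above 0.5 is treated as a positive review (class 1)
toy_predictions = [1 if p > 0.5 else 0 for p in toy_probabilities]

# Accuracy = number of correct predictions / number of reviews
toy_correct = sum(pred == label for pred, label in zip(toy_predictions, toy_labels))
toy_accuracy = toy_correct / len(toy_labels)

print(toy_predictions)  # [1, 0, 1, 0]
print("Toy accuracy = %f%%" % (toy_accuracy * 100))  # Toy accuracy = 75.000000%
```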


# Conclusion
Although the code for this project does not relate directly to the Healthy Minds project I've been working on, I can see potential for neural network models with the Healthy Minds data in the future. For example, Healthy Minds could use a neural network to predict whether participants have high or low ACIP (Awareness, Connection, Insight, Purpose) scores from their responses to certain survey items. Alternatively, Healthy Minds could use a similar kind of sentiment analysis to quickly tally negative vs. positive reviews of their app. Finally, this was great neural network practice, since I expect these kinds of models to come up in my work in the near future.