In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Preparing our Data
First we read in the data and go through the sentiment column, assigning an integer to each review: 1 for positive and 0 for negative. We then separate the data into two series, one for the reviews and one for the sentiments. The integer labels are the targets our model will learn to predict.

In [2]:
imdb_data = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

# Encode the labels as integers: positive -> 1, negative -> 0.
imdb_data.loc[imdb_data['sentiment'] == 'positive', 'sentiment'] = 1
imdb_data.loc[imdb_data['sentiment'] == 'negative', 'sentiment'] = 0

imdb_data.head()

# Separate the review text from the labels.
reviews, sentiments = imdb_data['review'], imdb_data['sentiment']

Having broken up the data, we now split it into training and testing sets. We will train our model with 90% of our total data, and validate our model with the remaining 10%.

In [3]:
train_reviews = np.array(reviews[0:45000])
# np.int was only an alias for the built-in int and has been removed from recent NumPy releases.
train_sentiments = np.array(sentiments[0:45000]).astype(int)

test_reviews = np.array(reviews[45000:50000])
test_sentiments = np.array(sentiments[45000:50000]).astype(int)
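
The slice indices above assume the dataset has exactly 50,000 rows. As a minimal sketch, the same 90/10 split can be derived from the length of the data instead. The names `split_train_test` and `train_fraction` are hypothetical and not part of the original notebook, and a toy list stands in for the reviews.

```python
def split_train_test(items, train_fraction=0.9):
    """Split a sequence into a leading training part and a trailing test part."""
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Toy stand-in for the review column.
toy_reviews = [f"review {i}" for i in range(50)]
train_part, test_part = split_train_test(toy_reviews)
print(len(train_part), len(test_part))
```

Like the original cell, this keeps the rows in file order and does not shuffle them.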

## Tokenization
Tokenization turns each review into a sequence of integers, where each integer is the index of a token in a dictionary built from the vocabulary of our training reviews. Because we set `num_words` to 1500, only the most frequent words are kept when the reviews are converted to sequences, and rarer words are dropped. Punctuation is also removed in this step, so our model will only be concerned with words.

In [4]:
input_size, output_size, max_length = 1500, 128, 120

tokenizer = Tokenizer(num_words = input_size)
tokenizer.fit_on_texts(train_reviews)
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(train_reviews)
testing_sequences = tokenizer.texts_to_sequences(test_reviews)
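
To make the mapping concrete, here is a rough pure-Python sketch of what the tokenizer does. It only loosely mimics the Keras `Tokenizer` defaults (lowercasing, punctuation filtering, frequency-ranked indices starting at 1, and a `num_words` cutoff). The helper names `simple_tokenize`, `build_word_index` and `to_sequences` are illustrative assumptions, not Keras functions.

```python
from collections import Counter

# Characters treated as punctuation (assumed to match the Keras default filter set).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Convert texts to index lists, dropping words ranked num_words or lower."""
    return [[word_index[w] for w in simple_tokenize(t)
             if w in word_index and word_index[w] < num_words]
            for t in texts]


toy_texts = ["A great movie, great cast!", "A bad movie."]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value in the next step.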

## Padding
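We pad our input sequences here. This step is crucial in that it gives all of our varying input reviews a consistent length. Review sequences that are longer than our max length of 120 are truncated, and those which are shorter are filled with zeroes at the front (the `pad_sequences` default) until they reach our desired length. Note that the training sequences are truncated at the end (`truncating = 'post'`), while the testing sequences use the default and are truncated at the start.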

In [5]:
training_padded = pad_sequences(training_sequences, maxlen = max_length, truncating = 'post')
testing_padded = pad_sequences(testing_sequences, maxlen = max_length)
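
The effect of the `truncating` argument is easiest to see on a small example. The following is a simplified pure-Python sketch of padding and truncation for a single sequence. `pad_sequence` is a hypothetical helper written for illustration, with defaults chosen to mirror the `'pre'` defaults of `pad_sequences`.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one list of token indices to exactly maxlen items (maxlen > 0)."""
    if truncating == "pre":
        seq = seq[-maxlen:]   # drop items from the start, keep the end
    else:
        seq = seq[:maxlen]    # drop items from the end, keep the start
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


short_review = [1, 2, 3]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_sequence(short_review, 5))                     # zeroes added at the front
print(pad_sequence(long_review, 5, truncating="post"))   # keeps the first 5 tokens
print(pad_sequence(long_review, 5))                      # keeps the last 5 tokens
```

With the settings used in the cell above, a long training review keeps its opening words, while a long test review keeps its closing words.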

# Building the Model
Now that our preparation is complete, we can begin to actually build our model. We landed on the Sequential model, for which Keras provides a template. After an Embedding layer and a Flatten layer, the main part of our model is a series of Dense and Dropout layers. The Dense layers use the tanh activation function, which gave the best validation accuracy of the three activations we compared, and the final single-unit layer uses ReLU. The Dropout layers help our model avoid overfitting and narrow the gap between training accuracy and validation accuracy. The results for each activation were:

1. Tanh: 0.9693 accuracy, 0.8234 validation accuracy
1. ReLU: 0.9351 accuracy, 0.8206 validation accuracy
1. Sigmoid: 0.9691 accuracy, 0.8122 validation accuracy
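
As a rough illustration of what a Dropout layer does during training, here is a pure-Python sketch of inverted dropout: each value is zeroed with probability `rate`, and the surviving values are scaled by `1 / (1 - rate)` so the expected sum stays the same. The `dropout` function below is a hypothetical stand-in, not the Keras implementation. At inference time the layer passes its input through unchanged.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the values that are kept."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, 0.1, rng)

print(dropped.count(0.0))              # roughly 10% of the units are zeroed
print(sum(dropped) / len(dropped))     # mean stays close to 1.0
```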

In [6]:
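model = tf.keras.Sequential([
    # Map each token index to an output_size-dimensional vector.
    tf.keras.layers.Embedding(input_size, output_size, input_length = max_length),
    # Flatten the (max_length, output_size) embeddings into one vector per review.
    tf.keras.layers.Flatten(),
])

# Stack Dense + Dropout blocks, halving the width each time: 128, 64, 32, 16, 8.
in_size = 128

while in_size > 7:
    model.add(tf.keras.layers.Dense(in_size, activation='tanh'))
    model.add(tf.keras.layers.Dropout(0.1))
    in_size //= 2  # integer division keeps the layer width an int

# Single output unit for the predicted sentiment.
# Note: sigmoid is the more common output activation with binary cross-entropy.
model.add(tf.keras.layers.Dense(1, activation='relu'))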



We tested a number of different optimizers to determine which was the best option for our model. Here are the resulting accuracies and validation accuracies for each.
1. RMSprop: 0.8777 accuracy, 0.7856 validation accuracy
2. Adam: 0.7500 accuracy, 0.7578 validation accuracy
3. SGD: 0.5005 accuracy, 0.4940 validation accuracy
4. FTRL: 0.4993 accuracy, 0.5060 validation accuracy

We chose to go with RMSprop, as it gave us both the best training accuracy and the best validation accuracy.

As for the loss function, we tested Binary Cross Entropy, Hinge, MSLE, Huber, and Poisson functions to see which would pair best with the RMSprop optimizer. 
1. BCE: 0.8989 accuracy, 0.7898 validation accuracy
1. Hinge: 0.8911 accuracy, 0.7836 validation accuracy
1. Poisson: 0.9641 accuracy, 0.7558 validation accuracy
1. Huber: 0.9679 accuracy, 0.7820 validation accuracy
1. MSLE: 0.4993 accuracy, 0.5060 validation accuracy

In the end, we determined that the combination of RMSprop and Binary Cross Entropy was the best option, as it gave the highest validation accuracy.
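
For reference, binary cross entropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over the examples, where `y` is the 0/1 label and `p` is the predicted probability. The sketch below computes it in plain Python. The clipping constant `eps` is an assumption made here so that `log` is always defined, and `binary_cross_entropy` is an illustrative helper rather than the Keras loss itself.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() never sees 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_cross_entropy(labels, confident_right))   # small loss
print(binary_cross_entropy(labels, confident_wrong))   # much larger loss
```

The loss grows quickly as a prediction moves confidently toward the wrong label, which is what pushes the model toward well-separated outputs.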

In [7]:
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 128)          192000    
_________________________________________________________________
flatten (Flatten)            (None, 15360)             0         
_________________________________________________________________
dense (Dense)                (None, 128)               1966208   
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2
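
The parameter counts in the complete rows of the summary can be checked by hand: an embedding layer has one vector per vocabulary entry, and a dense layer has one weight per input-output pair plus one bias per output. The short sketch below reproduces that arithmetic. `dense_params` is a helper name introduced here for illustration.

```python
vocab_size, embed_dim, seq_len = 1500, 128, 120   # input_size, output_size, max_length


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# Layer widths produced by the halving loop in the model cell.
widths = []
width = 128
while width > 7:
    widths.append(width)
    width //= 2

embedding_params = vocab_size * embed_dim   # one embed_dim vector per token index
flat_size = seq_len * embed_dim             # size of the Flatten output

print(widths)                                  # [128, 64, 32, 16, 8]
print(embedding_params)                        # 192000
print(flat_size)                               # 15360
print(dense_params(flat_size, widths[0]))      # 1966208
print(dense_params(widths[0], widths[1]))      # 8256
```

Almost all of the model's parameters sit in the first Dense layer, because it is connected to every one of the 15360 flattened embedding values.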

# Running the Model

In [8]:
num_epochs = 8
model.fit(training_padded, train_sentiments, epochs = num_epochs, validation_data = (testing_padded, test_sentiments))



Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fa1ab7bf950>

Our final accuracy:

In [9]:
test_loss, test_accuracy = model.evaluate(testing_padded, test_sentiments)
test_accuracy



0.8091999888420105
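
For a single-output binary model like this one, the reported accuracy is, to the best of our understanding, the fraction of reviews whose prediction falls on the correct side of a 0.5 threshold. The sketch below shows that calculation on made-up numbers. `binary_accuracy` here is a plain-Python illustration and the sample predictions are invented, not taken from the model.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if (p > threshold) == bool(t))
    return hits / len(y_true)


sample_labels = [1, 0, 1, 0, 1]
sample_predictions = [0.9, 0.2, 0.4, 0.6, 0.8]   # invented values for illustration

print(binary_accuracy(sample_labels, sample_predictions))   # 3 of 5 correct
```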