In [1]:
# Paul-Jason Mello
# Professor Shim
# CMPE 257
# May 5th, 2022

# Recurrent Neural Networks and Long Short Term Memory

In [2]:
import numpy as np
import pandas as pd

from tensorflow import keras
from tensorflow.keras import layers

## 1. What is meant by Recurrent Neural Networks?

In [3]:
# Recurrent Neural Networks (RNN)
# 
# An RNN is a neural network that processes a sequence one step at a time, feeding the hidden state it produced at the
# previous step back in as input alongside the current element. Because the same weights are reused at every step, each
# output depends on what came before rather than being independent of it. RNNs are traditionally used to carry
# information forward from past states. In this regard they remind me of reinforcement learning models and their
# "state, action, reward" loop, where the current state also depends on what happened earlier.
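
As a rough illustration of the recurrence described above, here is a minimal NumPy sketch of a simple (Elman-style) RNN forward pass. The function name `rnn_forward`, the weight names, and the sizes are assumptions made for this example; it is not how Keras implements its layers.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    hidden = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:
        # The same weights are reused at every step; only the hidden state changes.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4  # illustrative sizes
inputs = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4): one hidden state per time step
```

Feeding the sequence without its first element gives a different hidden state for the same input, which is the "not independent of the last" behavior in action.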

## 2. What is meant by vanishing and exploding gradient and why is that a problem in RNN?

In [4]:
# Exploding and Vanishing Gradient (EVGP)
# 
# The vanishing and exploding gradient problem arises when we backpropagate through many layers or time steps to
# update the weights. The gradient is a product of one factor per step, so it can shrink toward zero (vanish) or grow
# without bound (explode). In either case this ruins the model's ability to train and generalize properly. RNNs are
# especially prone to this because the same recurrent weights and saturating activations (sigmoid and tanh) are
# applied at every time step, so the effect compounds over the length of the sequence.
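
A small numerical sketch of why this happens: backpropagating through time multiplies the gradient by the recurrent weight matrix once per step. The example below assumes a deliberately simplified case (a scaled identity matrix, no activation function) and a hypothetical helper name, so it only shows the trend, not a real training run.

```python
import numpy as np

def gradient_norm_through_time(recurrent_scale, steps=50, hidden_dim=4):
    """Norm of a gradient after being pushed back through `steps` time steps."""
    W_hh = recurrent_scale * np.eye(hidden_dim)  # simplified recurrent weights
    grad = np.ones(hidden_dim)                   # gradient arriving at the last step
    for _ in range(steps):
        grad = W_hh.T @ grad                     # one step of backpropagation through time
    return np.linalg.norm(grad)

small = gradient_norm_through_time(0.5)  # factors below 1: gradient vanishes
large = gradient_norm_through_time(1.5)  # factors above 1: gradient explodes
print(f"scale 0.5 after 50 steps: {small:.3e}")
print(f"scale 1.5 after 50 steps: {large:.3e}")
```

In a real RNN each step also multiplies by the derivative of the activation, which is at most 1 for tanh and at most 0.25 for sigmoid, so the shrinking case tends to be even more pronounced.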

## 3. What is meant by Long Short Term Memory?

In [5]:
# Long Short Term Memory (LSTM)
# 
# Long Short Term Memory mitigates the vanishing gradient problem in RNNs. Each unit keeps a memory cell whose state
# is carried from step to step and changed only through gates: an input gate, a forget gate, and an output gate
# regulate what is written, kept, and exposed, much like gates in a circuit regulate the flow of a signal. It is an
# upgrade to the plain RNN that keeps the same recurrent structure and swaps in a gated cell.
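
To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step using the standard gate equations. The function name `lstm_step`, the stacked parameter layout (input, forget, output, candidate), and the sizes are assumptions for this example rather than the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the four gate blocks."""
    z = W @ x_t + U @ h_prev + b           # shape (4 * hidden_dim,)
    z_i, z_f, z_o, z_g = np.split(z, 4)
    i = sigmoid(z_i)                       # input gate: how much new content to write
    f = sigmoid(z_f)                       # forget gate: how much old memory to keep
    o = sigmoid(z_o)                       # output gate: how much memory to expose
    g = np.tanh(z_g)                       # candidate content
    c_t = f * c_prev + i * g               # additive update of the memory cell
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # illustrative sizes
W = rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.5, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is copied forward almost unchanged, which is what gives gradients a path that does not shrink at every step.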

## 4. What is meant by Gated Recurrent Unit?

In [6]:
# Gated Recurrent Unit (GRU)
# 
# A Gated Recurrent Unit, like the LSTM, attempts to mitigate the vanishing gradient problem in RNNs. Two gates, an
# update gate and a reset gate, control what information is passed on and what is forgotten between time steps of the
# hidden state. Because the update gate can copy the previous state forward almost unchanged, gradients have a more
# direct path back through time, much like the memory cell in an LSTM.
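
The two gates can be sketched in a few lines of NumPy. The function name `gru_step` and the parameter dictionary layout are assumptions for this example, and the final interpolation uses one common convention (update gate near 1 keeps the old state); some references swap the roles of `z` and `1 - z`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; `p` is a dict of weight matrices and biases (hypothetical layout)."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    # The reset gate decides how much of the previous state feeds the candidate.
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # The update gate blends the previous state with the candidate.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # illustrative sizes
p = {}
for gate in "zrh":
    p[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
    p[f"U_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
    p[f"b_{gate}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, p)
print(h.shape)  # (4,)
```

Unlike the LSTM, there is no separate memory cell: the hidden state itself is what gets carried forward and blended.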

## 5. Train a bi-directional LSTM on the IMDB movie sentiment dataset from Keras (tutorial available on its website, follow that tutorial) (https://keras.io/examples/nlp/bidirectional_lstm_imdb/)

In [7]:
(X_train, Y_train), (X_valid, Y_valid) = keras.datasets.imdb.load_data(num_words = 15000)



In [8]:
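# Variable-length sequences of integer word indices.
dataInput = keras.Input(shape = (None,), dtype = "int32")

# Map each of the 15,000 word indices to a 128-dimensional vector.
val = layers.Embedding(15000, 128)(dataInput)
# Two stacked bidirectional LSTMs; the first returns the full sequence so the second can consume it.
val = layers.Bidirectional(layers.LSTM(64, return_sequences = True))(val)
val = layers.Bidirectional(layers.LSTM(64))(val)

# Single sigmoid unit for binary (positive or negative) sentiment.
dataOutput = layers.Dense(1, activation = "sigmoid")(val)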

In [9]:
model = keras.Model(dataInput, dataOutput)
model.compile("Adamax", "binary_crossentropy", metrics = ["accuracy"])
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         1920000   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         98816     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,117,761
Trainable params: 2,117,761
Non-trainable params: 0
_________________________________________________________________
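
The parameter counts in the summary above can be checked by hand. The short sketch below reproduces them with the standard LSTM parameter formula; the variable and function names are chosen for this example only.

```python
vocab_size, embed_dim, lstm_units = 15000, 128, 64

# Embedding: one 128-dimensional vector per word index.
embedding_params = vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gate blocks, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

# Bidirectional wraps two LSTMs (forward and backward), doubling the count.
bilstm_1 = 2 * lstm_params(embed_dim, lstm_units)
# The second layer receives the concatenated forward/backward outputs: 2 * 64 = 128.
bilstm_2 = 2 * lstm_params(2 * lstm_units, lstm_units)
# Dense: 128 weights plus one bias.
dense_params = 2 * lstm_units * 1 + 1

total = embedding_params + bilstm_1 + bilstm_2 + dense_params
print(embedding_params, bilstm_1, bilstm_2, dense_params, total)
# 1920000 98816 98816 129 2117761
```

Both bidirectional layers have the same count because the embedding width (128) happens to equal the concatenated output width of the first bidirectional layer (2 * 64).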


In [10]:
print(len(X_train))
print(len(X_valid))

25000
25000


In [11]:
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen = 150)
X_valid = keras.preprocessing.sequence.pad_sequences(X_valid, maxlen = 150)
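
For intuition about what the padding step does, here is a NumPy sketch that mimics the default behavior of `pad_sequences` as I understand it: zeros are added at the front of short sequences, and long sequences keep only their last `maxlen` entries. The helper `pad_pre` is a stand-in written for this example, not the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                            # leave an all-padding row
        trimmed = seq[-maxlen:]                 # truncate from the front
        padded[row, -len(trimmed):] = trimmed   # right-align, padding on the left
    return padded

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
# [[0 1 2 3]
#  [6 7 8 9]]
```

With `maxlen = 150`, any review longer than 150 tokens therefore loses its beginning rather than its ending.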

In [12]:
model.fit(X_train, Y_train, batch_size = 64, epochs = 5, validation_data = (X_valid, Y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1541cc160a0>

In [13]:
# We can see that from the third epoch onward the model actually gets worse in terms of validation loss. As a result,
# the best model, for generalization at least, is the one from epoch 2. Any training past that point overfits the
# training set. Even so, the model never achieved better than 90% accuracy, which is likely the result of my decision
# to use smaller inputs (a 15,000-word vocabulary and reviews padded to 150 tokens) and a small layer count.
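
Picking the best epoch does not have to be done by eye. `model.fit` returns a `History` object whose `.history` attribute is a dict of per-epoch metrics, so a small helper can find the epoch with the lowest validation loss. The helper name `best_epoch` is hypothetical, and the numbers below are made up for illustration; they are not the results of this run.

```python
import numpy as np

def best_epoch(history_dict, monitor="val_loss"):
    """Return the 1-based epoch with the lowest value of the monitored metric."""
    values = np.asarray(history_dict[monitor], dtype=float)
    return int(np.argmin(values)) + 1

# Made-up numbers for illustration only; NOT the results of the run above.
example_history = {"val_loss": [0.40, 0.33, 0.36, 0.41, 0.47]}
print(best_epoch(example_history))  # 2
```

In practice one would pass the dict from the `History` object returned by `model.fit`. Keras also provides the `keras.callbacks.EarlyStopping` callback, which can stop training once the monitored metric stops improving.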