# Assignment 3: LSTM and GRU vs. Multiplicative Variations

### Maxim Ryabinov (U02204083)
### CAP4641: Natural Language Processing 
### Instructor: Dr. Ankur Mali 
### University of South Florida (Spring 2025)

---

# Description

In this assignment, I implemented four recurrent architectures: a standard LSTM, a standard GRU, a multiplicative LSTM, and a multiplicative GRU. The notebook uses TensorFlow as its machine learning library.

Below is an implementation of each recurrent neural network architecture, each following the set of model equations that the architecture is based on.

Lastly, to show robustness and highlight the differences in gating mechanisms between the architectures, each model is run through the Copy Task on sequences of randomly generated characters of varying lengths (sequence lengths: `{100, 200, 500, 1000}`).
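
As a rough illustration of the Copy Task setup described above, the sketch below builds one-hot encoded sequences with a single trailing delimiter token and uses a copy of the input as the target. The helper name `make_copy_data`, the one-hot encoding, and the single delimiter step are assumptions made for illustration; they may differ from the data generation used later in this notebook.

```python
import numpy as np

def make_copy_data(seq_count, seq_length, vocab_size, seed=0):
    """Hypothetical helper: one-hot copy-task data with one trailing delimiter."""
    rng = np.random.default_rng(seed)
    delimiter = vocab_size  # extra token id marking the end of the sequence
    tokens = rng.integers(0, vocab_size, size=(seq_count, seq_length))
    # Append the delimiter id as a final time step.
    tokens = np.concatenate([tokens, np.full((seq_count, 1), delimiter)], axis=1)
    # One-hot encode: [seq_count, seq_length + 1, vocab_size + 1].
    X = np.eye(vocab_size + 1, dtype=np.float32)[tokens]
    Y = X.copy()  # the target is the input itself
    return X, Y

X, Y = make_copy_data(seq_count=4, seq_length=100, vocab_size=10)
print(X.shape)  # (4, 101, 11)
```

With one-hot targets, taking `argmax` over the last axis recovers the token id at each time step, which is what a per-token accuracy measurement would compare.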



---

# 1. Initial Setup

In [1]:
import tensorflow as tf
import time
import numpy as np
from sklearn.metrics import precision_score, recall_score

tf.random.set_seed(123)
np.random.seed(123)



# 2. Defining each RNN Architecture

### Standard LSTM RNN Implementation

In [2]:
tf.random.set_seed(123)
np.random.seed(123)

class StandardLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Input gate weights
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        i_t = tf.sigmoid(tf.matmul(x_t, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_t, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_t, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_t, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

# ------- Higher-level TF RNN that unrolls over time -------
class StandardLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = StandardLSTMCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
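
For reference, the update computed by `StandardLSTMCell.call` above can be written as the equations below. Here $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and inputs are treated as row vectors to match the `tf.matmul(x_t, W)` convention in the code.

```latex
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
\hat{c}_t &= \tanh(x_t W_c + h_{t-1} U_c + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```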

### Multiplicative LSTM RNN Implementation

In [3]:
tf.random.set_seed(123)
np.random.seed(123)

class MultiplicativeLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Input gate weights and bias
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights and bias
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights and bias
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights and bias
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

        # Multiplicative extension weights and bias
        self.W_m = self.add_weight(shape=(input_size, input_size), initializer="random_normal", trainable=True)
        self.U_m = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_m = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        # Multiplicative Extension
        m_t = tf.matmul(x_t, self.W_m) + tf.matmul(h_prev, self.U_m) + self.b_m
        x_cap = m_t * x_t

        i_t = tf.sigmoid(tf.matmul(x_cap, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_cap, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_cap, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_cap, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

class MultiplicativeLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = MultiplicativeLSTMCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
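
As implemented in `MultiplicativeLSTMCell.call` above, the only change from the standard cell is a multiplicative extension applied to the input before the gates. Written with the same row-vector convention as the code:

```latex
\begin{aligned}
m_t &= x_t W_m + h_{t-1} U_m + b_m \\
\hat{x}_t &= m_t \odot x_t
\end{aligned}
```

The gate and candidate equations are then the same as in the standard LSTM, with $\hat{x}_t$ (`x_cap` in the code) used in place of $x_t$.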

### Standard GRU RNN Implementation

In [4]:
tf.random.set_seed(123)
np.random.seed(123)

class StandardGRUCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Update gate weights
        self.W_z = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_z = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_z = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Reset gate weights
        self.W_r = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_r = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_r = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Candidate hidden state weights
        self.W_h = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_h = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_h = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev):
        # Update gate
        z_t = tf.sigmoid(tf.matmul(x_t, self.W_z) + tf.matmul(h_prev, self.U_z) + self.b_z)
        
        # Reset gate
        r_t = tf.sigmoid(tf.matmul(x_t, self.W_r) + tf.matmul(h_prev, self.U_r) + self.b_r)
        
        # Candidate hidden state
        h_hat = tf.tanh(tf.matmul(x_t, self.W_h) + tf.matmul(r_t * h_prev, self.U_h) + self.b_h)
        
        # New hidden state
        h_t = (1 - z_t) * h_prev + z_t * h_hat
        
        return h_t

# ------- Higher-level TF RNN that unrolls over time -------
class StandardGRU(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.gru_cell = StandardGRUCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h = self.gru_cell(x_t, h)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
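
For reference, the update computed by `StandardGRUCell.call` above corresponds to the following equations, using the same row-vector convention as the code. Note that in this implementation $z_t$ weights the candidate state and $(1 - z_t)$ weights the previous state.

```latex
\begin{aligned}
z_t &= \sigma(x_t W_z + h_{t-1} U_z + b_z) \\
r_t &= \sigma(x_t W_r + h_{t-1} U_r + b_r) \\
\hat{h}_t &= \tanh(x_t W_h + (r_t \odot h_{t-1}) U_h + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}
```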

### Multiplicative GRU RNN Implementation

In [5]:
tf.random.set_seed(123)
np.random.seed(123)

class MultiplicativeGRUCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Update gate weights
        self.W_z = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_z = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_z = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Reset gate weights
        self.W_r = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_r = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_r = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Candidate hidden state weights
        self.W_h = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_h = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_h = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

        # Multiplicative extension weights and bias
        self.W_m = self.add_weight(shape=(input_size, input_size), initializer="random_normal", trainable=True)
        self.U_m = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_m = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev):
        # Multiplicative extension
        m_t = tf.matmul(x_t, self.W_m) + tf.matmul(h_prev, self.U_m) + self.b_m
        x_cap = m_t * x_t

        # Update gate
        z_t = tf.sigmoid(tf.matmul(x_cap, self.W_z) + tf.matmul(h_prev, self.U_z) + self.b_z)
        
        # Reset gate
        r_t = tf.sigmoid(tf.matmul(x_cap, self.W_r) + tf.matmul(h_prev, self.U_r) + self.b_r)
        
        # Candidate hidden state
        h_hat = tf.tanh(tf.matmul(x_cap, self.W_h) + tf.matmul(r_t * h_prev, self.U_h) + self.b_h)
        
        # New hidden state
        h_t = (1 - z_t) * h_prev + z_t * h_hat
        
        return h_t

# ------- Higher-level TF RNN that unrolls over time -------
class MultiplicativeGRU(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.gru_cell = MultiplicativeGRUCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h = self.gru_cell(x_t, h)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
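
To compare model sizes, the sketch below counts the trainable parameters implied by the weight shapes declared in the four classes above. The helper names are hypothetical, and the values `input_size=11` and `hidden_size=128` are taken from the benchmark settings used later in this notebook.

```python
def gate_params(input_size, hidden_size):
    # One gate or candidate block: W (input x hidden), U (hidden x hidden), b (hidden).
    return input_size * hidden_size + hidden_size * hidden_size + hidden_size

def multiplicative_params(input_size, hidden_size):
    # Multiplicative extension: W_m (input x input), U_m (hidden x input), b_m (input).
    return input_size * input_size + hidden_size * input_size + input_size

def count_params(input_size, hidden_size):
    proj = hidden_size * input_size + input_size  # W_out and b_out
    lstm = 4 * gate_params(input_size, hidden_size) + proj  # i, f, o, c blocks
    gru = 3 * gate_params(input_size, hidden_size) + proj   # z, r, h blocks
    extra = multiplicative_params(input_size, hidden_size)
    return {"lstm": lstm, "mult_lstm": lstm + extra,
            "gru": gru, "mult_gru": gru + extra}

counts = count_params(input_size=11, hidden_size=128)
print(counts)
# {'lstm': 73099, 'mult_lstm': 74639, 'gru': 55179, 'mult_gru': 56719}
```

Under these shapes the multiplicative extension adds only 1,540 parameters, so any difference in behavior between a standard model and its multiplicative variant is unlikely to come from capacity alone.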

# 3. Train, Test, and Validation Splits

# 4. Copy Task

In [69]:
def benchmark_model(model, X_train, Y_train, X_val, Y_val, X_test, Y_test, epochs=10, batch_size=32, lr=0.01):
    X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)
    Y_train = tf.convert_to_tensor(Y_train, dtype=tf.float32)
    X_val = tf.convert_to_tensor(X_val, dtype=tf.float32)
    Y_val = tf.convert_to_tensor(Y_val, dtype=tf.float32)
    X_test = tf.convert_to_tensor(X_test, dtype=tf.float32)
    Y_test = tf.convert_to_tensor(Y_test, dtype=tf.float32)

    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanSquaredError()

    start_time = time.time()
    
    for epoch in range(epochs):
        # Shuffle training data at the start of each epoch.
        indices = tf.range(X_train.shape[0])
        indices = tf.random.shuffle(indices)
        X_train = tf.gather(X_train, indices)
        Y_train = tf.gather(Y_train, indices)
        
        epoch_loss = 0
        num_batches = int(np.ceil(X_train.shape[0] / batch_size))

        for i in range(num_batches):
            start = i * batch_size
            end = min((i+1) * batch_size, X_train.shape[0])
            X_batch = X_train[start:end]
            Y_batch = Y_train[start:end]
            
            with tf.GradientTape() as tape:
                output = model(X_batch)
                batch_loss = loss_fn(Y_batch, output)
            gradients = tape.gradient(batch_loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))
            
            epoch_loss += batch_loss.numpy()

        epoch_loss /= num_batches
        
        # Calculate training accuracy (uses only the last batch of the epoch)
        train_preds = np.argmax(output.numpy(), axis=-1).flatten()
        true_train = np.argmax(Y_batch.numpy(), axis=-1).flatten()
        train_accuracy = np.mean(train_preds == true_train)
        
        # Calculate validation loss and accuracy
        val_output = model(X_val)
        val_loss = loss_fn(Y_val, val_output).numpy()
        val_preds = np.argmax(val_output.numpy(), axis=-1)
        true_val = np.argmax(Y_val.numpy(), axis=-1)
        val_accuracy = np.mean(val_preds == true_val)
        
        print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
              f"Training Accuracy: {train_accuracy:.4f} | Val Accuracy: {val_accuracy:.4f}")
    
    test_output = model(X_test)
    test_loss = loss_fn(Y_test, test_output).numpy()
    test_preds = np.argmax(test_output.numpy(), axis=-1)
    true_test = np.argmax(Y_test.numpy(), axis=-1)
    test_accuracy = np.mean(test_preds == true_test)
    
    print(f"\nTest Loss: {test_loss:.4f} | Test Accuracy: {test_accuracy:.4f}")
    
    return time.time() - start_time


############################
# Main Run
############################
def run_benchmark():
    seq_lengths = [100, 200, 500, 1000]
    seq_length = seq_lengths[1]
    seq_count = 100
    vocab_size = 10
    delimiter = 10
    
    # Uniform hyper-parameters used across each model
    input_size = vocab_size + 1 # Vocabulary tokens + delimiter token
    total_delimiters = 3 # Intended number of end-of-sequence delimiter steps (currently unused; a single delimiter step is appended)
    hidden_size = 128
    num_epochs = 10
    batch_size = 32
    learning_rate = 0.01

    # TODO: set all random seeds, currently results still unpredictable
    tf.random.set_seed(123)
    np.random.seed(123)
    
    # X_train = np.random.rand(seq_count, seq_length, input_size).astype(np.float32)
    # Y_train = X_train.copy()
    # X_val = np.random.rand(seq_count, seq_length, input_size).astype(np.float32)
    # Y_val = X_train.copy()
    # X_test = np.random.rand(seq_count, seq_length, input_size).astype(np.float32)
    # Y_test = X_test.copy()
    
    # Each feature is an independent random integer in [0, vocab_size); targets are exact copies of the inputs.
    X_train = np.random.randint(0, vocab_size, size=(seq_count, seq_length, input_size)).astype(np.float32)
    Y_train = X_train.copy()
    X_val = np.random.randint(0, vocab_size, size=(seq_count, seq_length, input_size)).astype(np.float32)
    Y_val = X_val.copy()
    X_test = np.random.randint(0, vocab_size, size=(seq_count, seq_length, input_size)).astype(np.float32)
    Y_test = X_test.copy()
    
    # Add delimiters to the sequences
    X_train = np.concatenate([X_train, np.full((seq_count, 1, input_size), delimiter)], axis=1)
    Y_train = np.concatenate([Y_train, np.full((seq_count, 1, input_size), delimiter)], axis=1)

    X_val = np.concatenate([X_val, np.full((seq_count, 1, input_size), delimiter)], axis=1)
    Y_val = np.concatenate([Y_val, np.full((seq_count, 1, input_size), delimiter)], axis=1)

    X_test = np.concatenate([X_test, np.full((seq_count, 1, input_size), delimiter)], axis=1)
    Y_test = np.concatenate([Y_test, np.full((seq_count, 1, input_size), delimiter)], axis=1)

    
    # Pass the shared hyper-parameters explicitly instead of relying on the function defaults.
    lstm_time = benchmark_model(StandardLSTM(input_size, hidden_size), X_train, Y_train, X_val, Y_val, X_test, Y_test,
                                epochs=num_epochs, batch_size=batch_size, lr=learning_rate)
    print("=======================================================================================================")
    mul_lstm_time = benchmark_model(MultiplicativeLSTM(input_size, hidden_size), X_train, Y_train, X_val, Y_val, X_test, Y_test,
                                    epochs=num_epochs, batch_size=batch_size, lr=learning_rate)
    print("=======================================================================================================")
    gru_time = benchmark_model(StandardGRU(input_size, hidden_size), X_train, Y_train, X_val, Y_val, X_test, Y_test,
                               epochs=num_epochs, batch_size=batch_size, lr=learning_rate)
    print("=======================================================================================================")
    mul_gru_time = benchmark_model(MultiplicativeGRU(input_size, hidden_size), X_train, Y_train, X_val, Y_val, X_test, Y_test,
                                   epochs=num_epochs, batch_size=batch_size, lr=learning_rate)
    print("=======================================================================================================\n")

    print(f"Standard LSTM Time: {lstm_time:.4f} seconds")
    print(f"Multiplicative LSTM Time: {mul_lstm_time:.4f} seconds")
    print(f"Standard GRU Time: {gru_time:.4f} seconds")
    print(f"Multiplicative GRU Time: {mul_gru_time:.4f} seconds")
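
The benchmark above trains with mean squared error and measures accuracy with `argmax`. One alternative worth considering is to treat the model outputs as logits over tokens and use softmax cross-entropy against one-hot targets. The NumPy sketch below shows that computation on a tiny example; the helper names are hypothetical, and it assumes one-hot targets, which is not how the data in this notebook is currently generated.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_targets):
    """Mean cross-entropy over all batch and time positions (illustrative helper)."""
    # Subtract the per-position max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(one_hot_targets * log_probs).sum(axis=-1).mean())

def token_accuracy(logits, one_hot_targets):
    """Fraction of positions where the arg-max prediction matches the target token."""
    return float(np.mean(logits.argmax(axis=-1) == one_hot_targets.argmax(axis=-1)))

# Tiny example: batch of 1, two time steps, three tokens.
targets = np.eye(3)[np.array([[0, 2]])]  # shape (1, 2, 3)
logits = np.array([[[5.0, 0.0, 0.0], [0.0, 3.0, 1.0]]])
print(token_accuracy(logits, targets))  # 0.5
print(softmax_cross_entropy(logits, targets))
```

In TensorFlow, the corresponding loss would be `tf.keras.losses.CategoricalCrossentropy(from_logits=True)`, used in place of `MeanSquaredError`.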

In [70]:
run_benchmark()

Epoch 01 | Training Loss: 19.9935 | Val Loss: 9.8051 | Trainning Accuracy: 0.0858 | Val Accuracy: 0.1064
Epoch 02 | Training Loss: 9.0754 | Val Loss: 9.6939 | Trainning Accuracy: 0.0597 | Val Accuracy: 0.1097
Epoch 03 | Training Loss: 9.7622 | Val Loss: 9.1968 | Trainning Accuracy: 0.1953 | Val Accuracy: 0.1769
Epoch 04 | Training Loss: 8.8216 | Val Loss: 8.4217 | Trainning Accuracy: 0.1592 | Val Accuracy: 0.1340
Epoch 05 | Training Loss: 8.4987 | Val Loss: 8.5831 | Trainning Accuracy: 0.1468 | Val Accuracy: 0.1316
Epoch 06 | Training Loss: 8.5270 | Val Loss: 8.3713 | Trainning Accuracy: 0.1032 | Val Accuracy: 0.1063
Epoch 07 | Training Loss: 8.3459 | Val Loss: 8.2600 | Trainning Accuracy: 0.1269 | Val Accuracy: 0.1120
Epoch 08 | Training Loss: 8.2469 | Val Loss: 8.1885 | Trainning Accuracy: 0.1803 | Val Accuracy: 0.2094
Epoch 09 | Training Loss: 8.1941 | Val Loss: 8.0972 | Trainning Accuracy: 0.1555 | Val Accuracy: 0.2038
Epoch 10 | Training Loss: 8.0254 | Val Loss: 7.8851 | Trainning

KeyboardInterrupt: 

# TODO

- Add accuracy measurement
- Look into vanishing/exploding gradients and how they can be displayed (needed for the analysis later)
- Do the copy task; figure out what a delimiter is
- Remember: three copy-task runs are needed for each model so that mean accuracy and standard error can be measured
- Look into cross-entropy
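
For the repeated-runs item, a minimal sketch of how mean accuracy and standard error could be computed from three runs is shown below. The helper name is hypothetical, and the accuracy values are placeholders, not measured results.

```python
import numpy as np

def mean_and_standard_error(values):
    """Return the sample mean and the standard error of the mean."""
    values = np.asarray(values, dtype=np.float64)
    mean = values.mean()
    # ddof=1 gives the sample standard deviation; divide by sqrt(n) for the standard error.
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return float(mean), float(sem)

# Placeholder accuracies for three hypothetical runs (not measured results).
run_accuracies = [0.10, 0.20, 0.30]
mean_acc, sem_acc = mean_and_standard_error(run_accuracies)
print(f"accuracy: {mean_acc:.4f} +/- {sem_acc:.4f}")
```

For this to be meaningful, each run would use a different random seed, and `benchmark_model` would need to return the test accuracy alongside the elapsed time.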