# Assignment 3: LSTM and GRU vs. Multiplicative Variations

### Maxim Ryabinov (U02204083)
### CAP4641: Natural Language Processing 
### Instructor: Dr. Ankur Mali 
### University of South Florida (Spring 2025)

---

# Description

In this assignment, I implemented four recurrent architectures: a standard LSTM, a standard GRU, a multiplicative LSTM, and a multiplicative GRU. All of them are written in TensorFlow.

Below, you will find an implementation of each recurrent neural network architecture, each following the set of model equations that the architecture is based on.

Lastly, to test robustness and highlight the differences in gating mechanisms between the architectures, each model is run on the Copy Task. The models are trained on sequences of length 100 and then tested on sequences of randomly generated characters of varying lengths (sequence lengths: `{100, 200, 500, 1000}`).



---

# 1. Initial Setup

In [1]:
import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt
import time
import os
import random

# Set seeds for reproducibility
seed = 123

tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)

# # Make TensorFlow deterministic
# os.environ['TF_DETERMINISTIC_OPS'] = '1'  # Force TensorFlow to use deterministic ops
# os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # Ensure cuDNN is deterministic



# 2. Defining each RNN Architecture

### Standard LSTM RNN Implementation

In [2]:
class StandardLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Input gate weights
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        i_t = tf.sigmoid(tf.matmul(x_t, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_t, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_t, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_t, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

# ------- Higher-level TF RNN that unrolls over time -------
class StandardLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size, epochs, batch_size, learning_rate, optimizer, loss_fn):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm_cell = StandardLSTMCell(input_size, hidden_size)
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X, initial_state = None):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        
        if initial_state is None:
            h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
            c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        else:
            h, c = initial_state

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1), (h, c)  # [batch_size, seq_length, input_size] and h, c
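
For reference, the update computed by `StandardLSTMCell.call` can be written out as below. This is transcribed from the code above using its row-vector convention (inputs multiply the weights from the left); the notation in the assignment handout may differ.

$$
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
\hat{c}_t &= \tanh(x_t W_c + h_{t-1} U_c + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.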

### Multiplicative LSTM RNN Implementation

In [3]:
class MultiplicativeLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Input gate weights and bias
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights and bias
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights and bias
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights and bias
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

        # Multiplicative extension weights and bias
        self.W_m = self.add_weight(shape=(input_size, input_size), initializer="random_normal", trainable=True)
        self.U_m = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_m = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        # Multiplicative Extension
        m_t = tf.matmul(x_t, self.W_m) + tf.matmul(h_prev, self.U_m) + self.b_m
        x_cap = m_t * x_t

        i_t = tf.sigmoid(tf.matmul(x_cap, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_cap, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_cap, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_cap, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

class MultiplicativeLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size, epochs, batch_size, learning_rate, optimizer, loss_fn):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.lstm_cell = MultiplicativeLSTMCell(input_size, hidden_size)
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X, initial_state=None):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]

        if initial_state is None:
            h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
            c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        else:
            h, c = initial_state

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1), (h, c)  # [batch_size, seq_length, input_size] and h, c
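
The multiplicative extension in `MultiplicativeLSTMCell.call` can be summarized as follows. Again, this is only a transcription of the code above: $m_t$ has the same width as the input, carries no nonlinearity, and rescales $x_t$ element-wise before the gates see it. The symbol $\hat{x}_t$ corresponds to `x_cap`.

$$
\begin{aligned}
m_t &= x_t W_m + h_{t-1} U_m + b_m \\
\hat{x}_t &= m_t \odot x_t \\
i_t &= \sigma(\hat{x}_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(\hat{x}_t W_f + h_{t-1} U_f + b_f) \\
o_t &= \sigma(\hat{x}_t W_o + h_{t-1} U_o + b_o) \\
\hat{c}_t &= \tanh(\hat{x}_t W_c + h_{t-1} U_c + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The cell and hidden state updates in the last two lines are unchanged from the standard LSTM; only the input to the gates differs.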

### Standard GRU RNN Implementation

In [4]:
class StandardGRUCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Update gate weights
        self.W_z = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_z = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_z = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Reset gate weights
        self.W_r = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_r = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_r = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Candidate hidden state weights
        self.W_h = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_h = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_h = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev):
        # Update gate
        z_t = tf.sigmoid(tf.matmul(x_t, self.W_z) + tf.matmul(h_prev, self.U_z) + self.b_z)
        
        # Reset gate
        r_t = tf.sigmoid(tf.matmul(x_t, self.W_r) + tf.matmul(h_prev, self.U_r) + self.b_r)
        
        # Candidate hidden state
        h_hat = tf.tanh(tf.matmul(x_t, self.W_h) + tf.matmul(r_t * h_prev, self.U_h) + self.b_h)
        
        # New hidden state
        h_t = (1 - z_t) * h_prev + z_t * h_hat
        
        return h_t

# ------- Higher-level TF RNN that unrolls over time -------
class StandardGRU(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size, epochs, batch_size, learning_rate, optimizer, loss_fn):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.gru_cell = StandardGRUCell(input_size, hidden_size)
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.optimizer = optimizer
        self.loss_fn = loss_fn

        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X, initial_state=None):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        
        if initial_state is None:
            h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        else:
            h = initial_state

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h = self.gru_cell(x_t, h)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1), h  # [batch_size, seq_length, input_size] and h
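
Written out, the update in `StandardGRUCell.call` is the following (transcribed from the code above, row-vector convention). Note that in this implementation $z_t$ weights the candidate state and $1 - z_t$ weights the previous state; some references use the opposite convention.

$$
\begin{aligned}
z_t &= \sigma(x_t W_z + h_{t-1} U_z + b_z) \\
r_t &= \sigma(x_t W_r + h_{t-1} U_r + b_r) \\
\hat{h}_t &= \tanh\big(x_t W_h + (r_t \odot h_{t-1}) U_h + b_h\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}
$$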

### Multiplicative GRU RNN Implementation

In [5]:
class MultiplicativeGRUCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Update gate weights
        self.W_z = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_z = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_z = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Reset gate weights
        self.W_r = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_r = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_r = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Candidate hidden state weights
        self.W_h = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_h = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_h = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

        # Multiplicative extension weights and bias
        self.W_m = self.add_weight(shape=(input_size, input_size), initializer="random_normal", trainable=True)
        self.U_m = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_m = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev):
        # Multiplicative extension
        m_t = tf.matmul(x_t, self.W_m) + tf.matmul(h_prev, self.U_m) + self.b_m
        x_cap = m_t * x_t

        # Update gate
        z_t = tf.sigmoid(tf.matmul(x_cap, self.W_z) + tf.matmul(h_prev, self.U_z) + self.b_z)
        
        # Reset gate
        r_t = tf.sigmoid(tf.matmul(x_cap, self.W_r) + tf.matmul(h_prev, self.U_r) + self.b_r)
        
        # Candidate hidden state
        h_hat = tf.tanh(tf.matmul(x_cap, self.W_h) + tf.matmul(r_t * h_prev, self.U_h) + self.b_h)
        
        # New hidden state
        h_t = (1 - z_t) * h_prev + z_t * h_hat
        
        return h_t

# ------- Higher-level TF RNN that unrolls over time -------
class MultiplicativeGRU(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size, epochs, batch_size, learning_rate, optimizer, loss_fn):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.gru_cell = MultiplicativeGRUCell(input_size, hidden_size)
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X, initial_state=None):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]

        if initial_state is None:
            h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        else:
            h = initial_state

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h = self.gru_cell(x_t, h)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1), h  # [batch_size, seq_length, input_size] and h
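
To get a feel for how much the multiplicative extension adds, the trainable parameters of each cell can be counted directly from the weight shapes declared above. The sketch below is a back-of-the-envelope helper, not part of the assignment code; the function names are hypothetical, and the sizes `11` and `128` are the `input_size` and `hidden_size` set later in `run_benchmark`. The output projection (`W_out`, `b_out`) is the same for all four models and is left out.

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one (W, U, b) triple: W is (I, H), U is (H, H), b is (H,)."""
    return input_size * hidden_size + hidden_size * hidden_size + hidden_size


def multiplicative_params(input_size, hidden_size):
    """Parameters in W_m (I, I), U_m (H, I), and b_m (I,)."""
    return input_size * input_size + hidden_size * input_size + input_size


def cell_params(kind, input_size, hidden_size, multiplicative=False):
    """Total trainable parameters of a cell; an LSTM has 4 gate blocks, a GRU has 3."""
    blocks = {"lstm": 4, "gru": 3}[kind]
    total = blocks * gate_block_params(input_size, hidden_size)
    if multiplicative:
        total += multiplicative_params(input_size, hidden_size)
    return total


if __name__ == "__main__":
    I, H = 11, 128  # input_size and hidden_size from run_benchmark
    for kind in ("lstm", "gru"):
        for mult in (False, True):
            label = ("Multiplicative " if mult else "Standard ") + kind.upper()
            print(f"{label:<22} {cell_params(kind, I, H, mult)}")
```

With these sizes the extension adds 1,540 parameters to either cell, which is small next to the 17,920 parameters in a single gate block.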

# 3. Running the Copy Task
### Train, Test, and Validation Splits

In [6]:
def generate_dataset_splits(sequence_count, input_size, training_length, sequence_length, delimiter, total_delimiters):
    X_train = np.random.randint(0, 10, size=(sequence_count, training_length, input_size)).astype(np.float32)
    X_val = np.random.randint(0, 10, size=(sequence_count, training_length, input_size)).astype(np.float32)
    X_test = np.random.randint(0, 10, size=(sequence_count, sequence_length, input_size)).astype(np.float32)
    delimiters = np.full((sequence_count, total_delimiters, input_size), delimiter, dtype=np.float32)
    
    X_train = np.concatenate([X_train, delimiters], axis=1)
    X_val = np.concatenate([X_val, delimiters], axis=1)
    X_test = np.concatenate([X_test, delimiters], axis=1)
    
    Y_train = X_train.copy()
    Y_val = X_val.copy()
    Y_test = X_test.copy()
    
    return X_train, X_val, X_test, Y_train, Y_val, Y_test
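
Note that `np.random.randint(0, 10, size=(..., input_size))` draws an independent integer for every feature, so the arrays produced above are dense integer vectors rather than one-hot rows, and an all-`10` delimiter row has its argmax at index 0. If a one-hot encoding is what is intended, one possible sketch is shown below. The helper name `one_hot_sequences` and its signature are hypothetical, and it uses NumPy's `default_rng` generator rather than the global seed set in the setup cell.

```python
import numpy as np


def one_hot_sequences(sequence_count, length, vocab_size, delimiter, total_delimiters, rng):
    """Return (one_hot, ids): random token ids followed by delimiter ids, one-hot encoded."""
    num_classes = vocab_size + 1  # vocabulary tokens + delimiter token
    tokens = rng.integers(0, vocab_size, size=(sequence_count, length))
    tail = np.full((sequence_count, total_delimiters), delimiter)
    ids = np.concatenate([tokens, tail], axis=1)
    # Indexing an identity matrix with integer ids yields one one-hot row per token.
    one_hot = np.eye(num_classes, dtype=np.float32)[ids]
    return one_hot, ids


rng = np.random.default_rng(123)
X, ids = one_hot_sequences(sequence_count=4, length=6, vocab_size=10,
                           delimiter=10, total_delimiters=3, rng=rng)
print(X.shape)  # (4, 9, 11)
```

With this encoding, `np.argmax(X, axis=-1)` recovers the token ids exactly, including the delimiter class.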

### Training and Validating the Model (Training Loop)

In [7]:
def train_model(model, X_train, Y_train, X_val, Y_val):
    X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)
    Y_train = tf.convert_to_tensor(Y_train, dtype=tf.float32)
    X_val = tf.convert_to_tensor(X_val, dtype=tf.float32)
    Y_val = tf.convert_to_tensor(Y_val, dtype=tf.float32)
    
    # Used for plotting the graphs 
    train_loss_data = []
    val_loss_data = []
    train_accuracy_data = []
    val_accuracy_data = []
    
    for epoch in range(model.epochs):
        # Shuffle training data at the start of each epoch.
        indices = tf.range(X_train.shape[0])
        indices = tf.random.shuffle(indices)
        X_train = tf.gather(X_train, indices)
        Y_train = tf.gather(Y_train, indices)
        
        epoch_loss = 0
        num_batches = int(np.ceil(X_train.shape[0] / model.batch_size))

        for i in range(num_batches):
            start = i * model.batch_size
            end = min((i+1) * model.batch_size, X_train.shape[0])
            X_batch = X_train[start:end]
            Y_batch = Y_train[start:end]
            
            # Reduce the targets to integer class labels by taking the argmax over the feature axis
            Y_batch_labels = tf.argmax(Y_batch, axis=-1)  # Shape will be (batch_size, seq_length)
            
            with tf.GradientTape() as tape:
                output, _ = model(X_batch)
                batch_loss = model.loss_fn(Y_batch_labels, output)
            gradients = tape.gradient(batch_loss, model.trainable_variables)
            model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))
            
            epoch_loss += batch_loss.numpy()

        epoch_loss /= num_batches
        
        # Calculate training accuracy (uses only the last mini-batch of the epoch)
        train_preds = np.argmax(output.numpy(), axis=-1).flatten()
        true_train = Y_batch_labels.numpy().flatten()
        train_accuracy = np.mean(train_preds == true_train)
        
        # Calculate validation loss and accuracy
        val_output, _ = model(X_val)
        val_loss = model.loss_fn(tf.argmax(Y_val, axis=-1), val_output).numpy()
        
        val_preds = np.argmax(val_output.numpy(), axis=-1)
        true_val = tf.argmax(Y_val, axis=-1).numpy()
        val_accuracy = np.mean(val_preds == true_val)
        
        # Store metrics for plotting
        train_loss_data.append(epoch_loss)
        val_loss_data.append(val_loss)
        train_accuracy_data.append(train_accuracy)
        val_accuracy_data.append(val_accuracy)
        
        print(f"Epoch {epoch+1:02d} | Training Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f} | "
              f"Training Accuracy: {train_accuracy:.4f} | Val Accuracy: {val_accuracy:.4f}")
        
    # # After training loop
    # plt.figure(figsize=(10, 6))
    # # Training data
    # plt.plot(train_loss_data, label="Training Loss", marker=".", color="#1f77b4")
    # plt.plot(train_accuracy_data, label="Training Accuracy", marker=".", color="#66c2ff")
    
    # # Validation data
    # plt.plot(val_loss_data, label="Validation Loss", marker=".", color="#d62728")
    # plt.plot(val_accuracy_data, label="Validation Accuracy", marker=".", color="#ff7f7f")

    # plt.xticks(range(len(train_loss_data))) # To display all epochs on x axis.

    # plt.xlabel("Epoch")
    # plt.ylabel("Loss and Accuracy")
    # plt.title("Training and Validation Loss/Accuracy")
    # plt.legend()
    # plt.show()
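
The training accuracy printed by `train_model` is computed from the last mini-batch of each epoch. If an accuracy over the whole epoch is preferred, the counts can be accumulated batch by batch. The class below is a minimal sketch of that idea; `RunningAccuracy` is a hypothetical helper, and the logits and labels in the usage lines are made-up values for illustration only.

```python
import numpy as np


class RunningAccuracy:
    """Accumulates correct/total prediction counts across mini-batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, logits, labels):
        # logits: [batch, seq_length, num_classes]; labels: [batch, seq_length]
        preds = np.argmax(logits, axis=-1)
        self.correct += int(np.sum(preds == labels))
        self.total += int(labels.size)

    def result(self):
        return self.correct / self.total if self.total else 0.0


# Usage with two tiny made-up batches
acc = RunningAccuracy()
acc.update(np.array([[[0.1, 0.9], [0.8, 0.2]]]), np.array([[1, 1]]))  # 1 of 2 correct
acc.update(np.array([[[0.3, 0.7]]]), np.array([[1]]))                 # 1 of 1 correct
print(f"{acc.result():.4f}")
```

Inside the batch loop, `update` would be called with `output.numpy()` and `Y_batch_labels.numpy()`, and `result()` read once per epoch.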

### Evaluating the Model (Test Loop)

In [8]:
def test_model(model, X_test, Y_test, sequence_length, training_length=100):
    X_test = tf.convert_to_tensor(X_test, dtype=tf.float32)
    Y_test = tf.convert_to_tensor(Y_test, dtype=tf.int32)
    
    test_loss = 0
    hidden_state = None
    num_test_segments = sequence_length // training_length

    all_test_preds = []  # Collect all predictions for final accuracy
    all_true_test = []   # Collect all true labels for final accuracy

    for i in range(num_test_segments):
        start = i * training_length
        end = start + training_length
        
        X_segment = X_test[:, start:end, :]
        Y_segment = tf.argmax(Y_test[:, start:end, :], axis=-1)

        if hidden_state is None:
            output, hidden_state = model(X_segment)  # First segment
        else:
            output, hidden_state = model(X_segment, initial_state=hidden_state)  # Keep state

        segment_loss = model.loss_fn(Y_segment, output).numpy()
        test_loss += segment_loss

        test_preds = np.argmax(output.numpy(), axis=-1)
        true_test = Y_segment.numpy()

        # Collect for final accuracy calculation
        all_test_preds.append(test_preds)
        all_true_test.append(true_test)
        
        if num_test_segments > 1:
            segment_accuracy = np.mean(test_preds == true_test)
            print(f"Segment {i + 1:02d}/{num_test_segments:02d} | Loss: {segment_loss:.4f} | Accuracy: {segment_accuracy:.4f}")

    # Final test loss
    test_loss /= num_test_segments

    # Combine all predictions and labels for final accuracy
    all_test_preds = np.concatenate(all_test_preds, axis=1)
    all_true_test = np.concatenate(all_true_test, axis=1)
    test_accuracy = np.mean(all_test_preds == all_true_test)
    
    print(f"\nFinal Test Loss: {test_loss:.4f} | Final Test Accuracy: {test_accuracy:.4f}")
    
    return (test_loss, test_accuracy)
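
`test_model` splits a long test sequence into segments of `training_length` steps and passes the final state of one segment in as the `initial_state` of the next. The toy example below illustrates why this is equivalent to a single pass over the covered steps. It uses a made-up scalar recurrence (`run_rnn`) in place of the real cells, so it is only a sketch of the state-carrying idea. As in `test_model`, any steps beyond the last full segment are dropped.

```python
import math


def run_rnn(xs, h=0.0):
    """Toy scalar recurrence standing in for an RNN cell; returns outputs and final state."""
    outputs = []
    for x in xs:
        h = math.tanh(0.5 * h + x)
        outputs.append(h)
    return outputs, h


def run_in_segments(xs, segment_length):
    """Process xs in full segments, carrying the state across segment boundaries."""
    outputs, h = [], 0.0
    for i in range(len(xs) // segment_length):
        start = i * segment_length
        segment_outputs, h = run_rnn(xs[start:start + segment_length], h)
        outputs.extend(segment_outputs)
    return outputs


xs = [0.1 * k for k in range(10)]
full_outputs, _ = run_rnn(xs)
segmented_outputs = run_in_segments(xs, segment_length=5)
print(segmented_outputs == full_outputs)
```

If the state were reset to zero at every segment boundary instead, the outputs of the second segment would generally differ from the single-pass outputs.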
    

### Setting up the Copy Task Experiment

In [9]:
def run_benchmark():
    vocabulary = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    delimiter = 10
    sequence_lengths = [100, 200, 500, 1000]
    sequence_count = 100
    total_delimiters = 3 # Adds a few delimiters to make it clear that it's the end of the sequence.
    
    # Hyper-parameters used across each model
    input_size = len(vocabulary) + 1 # Vocabulary tokens + delimiter token
    hidden_size = 128
    training_length = 100
    total_epochs = 10
    batch_size = 32
    learning_rate = 0.01
    
    # Storing and printing the training and testing status of each trial along with some analytical information.
    total_trials = 3
    metrics = {"Standard LSTM": [], "Multiplicative LSTM": [], "Standard GRU": [], "Multiplicative GRU": []}
    
    # One result list per (model, sequence length) pair
    for model_name in metrics:
        for _ in sequence_lengths:
            metrics[model_name].append([])
    
    for trial in range(total_trials):
        print(f"Running Trial {trial+1}/{total_trials} ================================================================================")
        for i, sequence_length in enumerate(sequence_lengths):
            models = {
                "Standard LSTM": StandardLSTM(input_size=input_size,
                                            hidden_size=hidden_size,
                                            epochs=total_epochs,
                                            batch_size=batch_size,
                                            learning_rate=learning_rate,
                                            optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                            loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)),
                
                "Multiplicative LSTM": MultiplicativeLSTM(input_size=input_size,
                                            hidden_size=hidden_size,
                                            epochs=total_epochs,
                                            batch_size=batch_size,
                                            learning_rate=learning_rate,
                                            optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                            loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)),
                
                "Standard GRU": StandardGRU(input_size=input_size,
                                            hidden_size=hidden_size,
                                            epochs=total_epochs,
                                            batch_size=batch_size,
                                            learning_rate=learning_rate,
                                            optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                            loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)),
                
                "Multiplicative GRU": MultiplicativeGRU(input_size=input_size,
                                            hidden_size=hidden_size,
                                            epochs=total_epochs,
                                            batch_size=batch_size,
                                            learning_rate=learning_rate,
                                            optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                            loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
            }
            
            for model_name, model in models.items():
                print(f"Training {model_name}:")
                X_train, X_val, X_test, Y_train, Y_val, Y_test = generate_dataset_splits(sequence_count,
                                                                                         input_size,
                                                                                         training_length,
                                                                                         sequence_length,
                                                                                         delimiter,
                                                                                         total_delimiters)
                train_model(model, X_train, Y_train, X_val, Y_val)
                print(f"\nTest Sequence Length: {sequence_length}")
                print(f'Testing {model_name}:')
                result_metrics = test_model(model, X_test, Y_test, sequence_length, training_length)
                metrics[model_name][i].append(result_metrics) # stores (loss, accuracy) pair
                print("-------------------------------------------------------------------------------------------------------")
        
            print(f"*** Final metrics report for Trial {trial+1}/{total_trials} on sequence length of {sequence_length} ***")
            for model_name, model_metrics in metrics.items():
                print(model_name)
                print(f"Final Test Loss: {model_metrics[i][trial][0]:.4f}")
                print(f"Final Test Accuracy: {model_metrics[i][trial][1]:.4f}\n")

            print("=======================================================================================================")
            
    # Reports the mean performance (accuracy) and standard error of each model for each sequence across the total trials.
    print("\n===================================== Final Metrics Report =====================================\n")

    # Define a format for the output
    header_format = "{:<22} {:<22} {:<22} {:<22}"  # Adjust the width for better formatting
    data_format = "{:<22} {:<22} {:<22.4f} {:<22.4f}"  # Ensure enough space for standard error

    # Print header for the report
    print(header_format.format("Model", "Sequence Length", "Mean Accuracy", "Standard Error"))

    # Loop through the models and their metrics
    for model_name, model_metrics in metrics.items():
        for i, sequence_length in enumerate(sequence_lengths):
            accuracies = [trial[1] for trial in model_metrics[i]]  # Extract accuracy values
            mean_accuracy = np.mean(accuracies)
            std_error = np.std(accuracies) / np.sqrt(len(accuracies))  # Standard error = population std deviation (ddof=0) / sqrt(n)

            # Print the model and its corresponding metrics for each sequence length
            print(data_format.format(model_name, sequence_length, mean_accuracy, std_error))
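
The report divides the standard deviation by `sqrt(n)`, and `np.std` defaults to the population standard deviation (`ddof=0`). With only three trials, the sample standard deviation (`ddof=1`) gives a noticeably larger standard error. The sketch below shows both variants side by side; `mean_and_standard_error` is a hypothetical helper, and the accuracy values are made up for illustration, not results from this notebook.

```python
import math
import statistics


def mean_and_standard_error(values, ddof=0):
    """Return (mean, standard error); ddof=0 matches np.std's default, ddof=1 is the sample estimate."""
    n = len(values)
    mean = statistics.fmean(values)
    if n - ddof <= 0:
        return mean, float("nan")
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return mean, math.sqrt(variance) / math.sqrt(n)


example_accuracies = [0.30, 0.35, 0.40]  # made-up values
mean0, se0 = mean_and_standard_error(example_accuracies, ddof=0)
mean1, se1 = mean_and_standard_error(example_accuracies, ddof=1)
print(f"ddof=0: mean={mean0:.4f}, standard error={se0:.4f}")
print(f"ddof=1: mean={mean1:.4f}, standard error={se1:.4f}")
```

Which variant is appropriate depends on how the assignment defines standard error; the code above keeps the notebook's `ddof=0` behaviour as the default.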

### Running Copy Task and Analyzing the Results

In [10]:
run_benchmark()

Training Standard LSTM:


Epoch 01 | Training Loss: 2.2894 | Val Loss: 2.1263 | Training Accuracy: 0.3374 | Val Accuracy: 0.3740

Test Sequence Length: 100
Testing Standard LSTM:

Final Test Loss: 2.1490 | Final Test Accuracy: 0.3499
-------------------------------------------------------------------------------------------------------
Training Multiplicative LSTM:
Epoch 01 | Training Loss: 2.2814 | Val Loss: 2.1318 | Training Accuracy: 0.3180 | Val Accuracy: 0.3246

Test Sequence Length: 100
Testing Multiplicative LSTM:

Final Test Loss: 2.1491 | Final Test Accuracy: 0.3059
-------------------------------------------------------------------------------------------------------
Training Standard GRU:
Epoch 01 | Training Loss: 2.3749 | Val Loss: 2.0742 | Training Accuracy: 0.2937 | Val Accuracy: 0.3863

Test Sequence Length: 100
Testing Standard GRU:

Final Test Loss: 2.0735 | Final Test Accuracy: 0.3934
-------------------------------------------------------------------------------------------------------
Traini

# 4. Analyzing Copy Task Results
### Analysis of Results
### Observations on Multiplicative Effects