# Tutorial 3: Intro to Recurrent Neural Networks: Math, Training, and the Copy Task

# Instructor: Dr. Ankur Mali
# University of South Florida (Spring 2025)
### In this tutorial we build RNNs directly from their equations and compare three popular frameworks (JAX, TensorFlow, and PyTorch)

## Vanilla RNN -- For a more in-depth explanation, refer to your slides

### Forward Pass (Inference) -- Stage 1
Given an input at time $t$:
\begin{aligned}
\mathbf{x}_t \in \mathbb{R}^{d_{\text{in}}},\quad \mathbf{h}_{t-1} \in \mathbb{R}^{d_{\text{hid}}}
\end{aligned}
we define RNN parameters:
\begin{aligned}
\mathbf{W}_x \in \mathbb{R}^{d_{\text{in}} \times d_{\text{hid}}}, \quad
\mathbf{W}_h \in \mathbb{R}^{d_{\text{hid}} \times d_{\text{hid}}}, \quad
\mathbf{b}_h \in \mathbb{R}^{d_{\text{hid}}}.
\end{aligned}

The hidden state update:
\begin{aligned}
\mathbf{h}_t = \tanh\Bigl(\mathbf{x}_t\,\mathbf{W}_x \;+\;\mathbf{h}_{t-1}\,\mathbf{W}_h \;+\;\mathbf{b}_h\Bigr).
\end{aligned}
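
As a quick sanity check on the shapes, the single update can be sketched in NumPy. The sizes, the random seed, and the helper name `rnn_step` are assumptions chosen for illustration; they are not part of the tutorial's TensorFlow code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b_h):
    """One vanilla RNN update: h_t = tanh(x_t W_x + h_{t-1} W_h + b_h)."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b_h)

# Illustrative sizes (assumptions, not the benchmark settings).
d_in, d_hid, batch = 3, 4, 2
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d_in, d_hid))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

x_t = rng.normal(size=(batch, d_in))   # row-vector convention: one row per example
h_prev = np.zeros((batch, d_hid))      # h_0 = 0
h_t = rnn_step(x_t, h_prev, W_x, W_h, b_h)
print(h_t.shape)  # (2, 4)
```

Note the row-vector convention: inputs multiply the weights from the left, which is why $\mathbf{W}_x$ has shape $d_{\text{in}} \times d_{\text{hid}}$.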

Over a sequence $(\mathbf{x}_1, \dots, \mathbf{x}_T)$, we unroll:
\begin{aligned}
\mathbf{h}_0 = \mathbf{0},\quad
\mathbf{h}_1 = \tanh(\mathbf{x}_1 \mathbf{W}_x + \mathbf{h}_0 \mathbf{W}_h + \mathbf{b}_h),\,\dots,\,
\mathbf{h}_T = \tanh(\mathbf{x}_T \mathbf{W}_x + \mathbf{h}_{T-1} \mathbf{W}_h + \mathbf{b}_h).
\end{aligned}

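Optionally, each hidden state $\mathbf{h}_t$ can be projected to the output dimension, which for the copy task equals the input dimension $d_{\text{in}}$, using $\mathbf{W}_{\text{out}} \in \mathbb{R}^{d_{\text{hid}} \times d_{\text{in}}}$ and $\mathbf{b}_{\text{out}} \in \mathbb{R}^{d_{\text{in}}}$: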
\begin{aligned}
\mathbf{\hat{y}}_t = \mathbf{h}_t \mathbf{W}_{\text{out}} + \mathbf{b}_{\text{out}}
\end{aligned}
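
The unrolled forward pass with the output projection can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes; `rnn_forward` and the `params` dictionary are hypothetical names, not the tutorial's TensorFlow classes.

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b_h, W_out, b_out):
    """Unroll a vanilla RNN over X [batch, T, d_in] and project every h_t.

    Returns predictions of shape [batch, T, d_out].
    """
    batch, T, _ = X.shape
    h = np.zeros((batch, W_h.shape[0]))                 # h_0 = 0
    outputs = []
    for t in range(T):
        h = np.tanh(X[:, t, :] @ W_x + h @ W_h + b_h)   # hidden state update
        outputs.append(h @ W_out + b_out)               # y_hat_t
    return np.stack(outputs, axis=1)

# Illustrative sizes (assumptions).
batch, T, d_in, d_hid = 2, 5, 3, 4
rng = np.random.default_rng(0)
params = dict(
    W_x=rng.normal(scale=0.1, size=(d_in, d_hid)),
    W_h=rng.normal(scale=0.1, size=(d_hid, d_hid)),
    b_h=np.zeros(d_hid),
    W_out=rng.normal(scale=0.1, size=(d_hid, d_in)),
    b_out=np.zeros(d_in),
)
X = rng.normal(size=(batch, T, d_in))
Y_hat = rnn_forward(X, **params)
print(Y_hat.shape)  # (2, 5, 3)
```

Because $\mathbf{h}_t$ depends only on inputs up to time $t$, changing later inputs leaves earlier outputs untouched.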



### Remaining Stages
We define a loss (Stage 2) over all time steps, for instance:
\begin{aligned}
\mathbf{L} = \frac{1}{T} \sum_{t=1}^T \left\|\,\mathbf{\hat{y}}_t - \mathbf{y}_t\,\right\|^2,
\end{aligned}
and use Backpropagation Through Time (BPTT) to compute its gradients (Stage 3). An optimizer (e.g., Adam) then updates the parameters (Stage 4); in its simplest gradient-descent form the update is:
\begin{aligned}
\theta \,\leftarrow\, \theta \;-\; \eta \,\nabla_\theta \,\mathbf{L}.
\end{aligned}
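
The loss above sums the squared error over the feature dimension and averages over time. To our understanding, `tf.keras.losses.MeanSquaredError` with its default reduction instead averages over every element, so the two differ by a constant factor equal to the output dimension. The sketch below illustrates this on made-up numbers; the function names are hypothetical.

```python
import numpy as np

def sequence_mse(Y_hat, Y):
    """L = (1/T) * sum_t ||y_hat_t - y_t||^2 for one sequence of shape [T, d]."""
    T = Y.shape[0]
    return np.sum((Y_hat - Y) ** 2) / T

def elementwise_mse(Y_hat, Y):
    """Mean squared error averaged over every element (time and features)."""
    return np.mean((Y_hat - Y) ** 2)

T, d = 5, 3
Y = np.arange(T * d, dtype=float).reshape(T, d)
Y_hat = Y + 0.5                    # every element is off by exactly 0.5

print(sequence_mse(Y_hat, Y))      # 0.75  (= d * 0.25)
print(elementwise_mse(Y_hat, Y))   # 0.25
```

A constant factor only rescales the gradient, so either form leads to the same minimizer; it mainly matters when comparing loss values or choosing a learning rate.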

---

## GRU

### Forward Pass (Inference)
A Gated Recurrent Unit includes reset $\mathbf{r}_t$ and update $\mathbf{z}_t$ gates:

\begin{aligned}
\mathbf{z}_t &= \sigma\!\bigl(\mathbf{x}_t \mathbf{W}_z + \mathbf{h}_{t-1}\,\mathbf{U}_z + \mathbf{b}_z\bigr), \\
\mathbf{r}_t &= \sigma\!\bigl(\mathbf{x}_t \mathbf{W}_r + \mathbf{h}_{t-1}\,\mathbf{U}_r + \mathbf{b}_r\bigr), \\
\tilde{\mathbf{h}}_t &= \tanh\!\bigl(\mathbf{x}_t \mathbf{W}_h + (\mathbf{r}_t \odot \mathbf{h}_{t-1})\,\mathbf{U}_h + \mathbf{b}_h\bigr), \\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} \;+\; \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}

where $\sigma$ is the sigmoid function, and $\odot$ denotes elementwise multiplication.
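
The four GRU equations translate almost line by line into NumPy. The following is a minimal sketch under assumed sizes; the parameter dictionary `p` and the helper names are hypothetical and only serve the illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update; p is a dict holding W_*, U_*, b_* for the z, r, h blocks."""
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])              # update gate
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])              # reset gate
    h_tilde = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

# Illustrative sizes (assumptions).
d_in, d_hid, batch = 3, 4, 2
rng = np.random.default_rng(0)
p = {}
for gate in ("z", "r", "h"):
    p[f"W_{gate}"] = rng.normal(scale=0.1, size=(d_in, d_hid))
    p[f"U_{gate}"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    p[f"b_{gate}"] = np.zeros(d_hid)

x_t = rng.normal(size=(batch, d_in))
h_prev = rng.uniform(-1.0, 1.0, size=(batch, d_hid))
h_t = gru_step(x_t, h_prev, p)
print(h_t.shape)  # (2, 4)

# A strongly negative update-gate bias drives z_t toward 0, so the state is carried over.
p_keep = dict(p, b_z=np.full(d_hid, -50.0))
print(np.allclose(gru_step(x_t, h_prev, p_keep), h_prev))  # True
```

With this convention, $\mathbf{z}_t \approx \mathbf{0}$ copies the previous state and $\mathbf{z}_t \approx \mathbf{1}$ overwrites it with the candidate, which is what lets a GRU hold information across many steps.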

### Remaining Stages
As in the vanilla RNN, define a loss $\mathbf{L}$ (e.g. MSE). The same BPTT logic applies, but the derivatives now include the GRU gating operations. Parameters (e.g., $\mathbf{W}_z, \mathbf{U}_z, \ldots$ ) are updated by any gradient-based optimizer.

---

## Optimizer
A typical training loop includes:

1. **Forward pass**: compute model outputs $\mathbf{\hat{y}}_t$.
2. **Loss computation**: $\mathbf{L}(\mathbf{\hat{y}}_t, \mathbf{y}_t)$.
3. **Backward pass**: compute $\nabla_\theta \mathbf{L}$ via BPTT.
4. **Parameter update**:
   \begin{aligned}
   \theta \leftarrow \theta - \eta \;\nabla_\theta \,\mathbf{L}.
   \end{aligned}
   (For example, using Adam, SGD, RMSProp, etc.)
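
To make step 4 concrete for Adam, here is a minimal NumPy sketch of its standard update rule, applied to a toy quadratic instead of an RNN. The function `adam_step`, the toy objective, and the hyperparameter values are assumptions for illustration; in the tutorial's code the update is handled by `tf.keras.optimizers.Adam`.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (assumption): L(theta) = ||theta - target||^2, gradient 2 * (theta - target).
target = np.array([1.0, -2.0])
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * (theta - target)               # stands in for the BPTT gradient
    theta, m, v = adam_step(theta, grad, m, v, t)

print(theta)  # should end up close to target
```

Replacing the body of `adam_step` with `theta - lr * grad` gives plain gradient descent, the form written in step 4.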

---

## The Copy Task
The **copy task** is a simple sequence-to-sequence challenge:

- **Input**: a sequence of random vectors $\{\mathbf{x}_1, \dots, \mathbf{x}_T\}$.
- **Target**: the **same** sequence $\{\mathbf{x}_1, \dots, \mathbf{x}_T\}$.

Thus, the model should learn to produce $\mathbf{\hat{y}}_t \approx \mathbf{x}_t$ at each time step $t$. It is a straightforward yet revealing test of a model's capacity to retain and reproduce a sequence, and it is particularly sensitive to the model's ability to **remember** information over time.
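
A useful reference point when reading the loss values is a trivial predictor. The sketch below assumes inputs drawn uniformly from $[0, 1)$ with the same array sizes as the benchmark code in this tutorial, and measures the element-wise MSE of always predicting 0.5. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, T, d_in = 1000, 20, 10                         # sizes mirror the benchmark cells
X = rng.random((n_seq, T, d_in)).astype(np.float32)   # inputs in [0, 1)
Y = X.copy()                                          # copy task: target equals input

baseline = np.full_like(Y, 0.5)                       # ignore the input, predict the mean
baseline_mse = np.mean((baseline - Y) ** 2)
print(f"{baseline_mse:.4f}")                          # approximately 1/12 (about 0.083)
```

The value $1/12$ is the variance of a uniform variable on $[0, 1)$. A model that has learned to copy should fall well below it; a loss that plateaus near it suggests the network is mostly predicting the mean.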


In [8]:
import tensorflow as tf
import time
import numpy as np

In [10]:
########################################
# Custom RNN Cell (Core Computation)
########################################

########################################
# TensorFlow Implementation
########################################

# ------- Single-Step RNN Cell -------
class RNNCellTF(tf.keras.layers.Layer):
    """
    A single-step RNN cell in TensorFlow.
    h_t = tanh( x_t * W_x + h_{t-1} * W_h + b )
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_x = self.add_weight(
            shape=(input_size, hidden_size), initializer="random_normal", trainable=True
        )
        self.W_h = self.add_weight(
            shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True
        )
        self.b_h = self.add_weight(
            shape=(hidden_size,), initializer="zeros", trainable=True
        )

    def call(self, x_t, h_prev):
        h_t = tf.math.tanh(
            tf.matmul(x_t, self.W_x) + tf.matmul(h_prev, self.W_h) + self.b_h
        )
        return h_t

# ------- Higher-level TF RNN that unrolls over time -------
class RNNTF(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = RNNCellTF(input_size, hidden_size)
        # Output projection
        self.W_out = self.add_weight(
            shape=(hidden_size, input_size), initializer="random_normal", trainable=True
        )
        self.b_out = self.add_weight(
            shape=(input_size,), initializer="zeros", trainable=True
        )

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        outputs = []
        for t in range(seq_length):
            x_t = X[:, t, :]
            h = self.rnn_cell(x_t, h)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]



########################################
# Training / Benchmark
########################################

# -------------- TensorFlow Benchmark --------------
def benchmark_tensorflow(input_size, hidden_size, X_train, Y_train, epochs=10, lr=0.01):
    model = RNNTF(input_size, hidden_size)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanSquaredError()

    X_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
    Y_tf = tf.convert_to_tensor(Y_train, dtype=tf.float32)

    start_time = time.time()
    for epoch in range(epochs):
        with tf.GradientTape() as tape:
            output = model(X_tf)
            loss = loss_fn(Y_tf, output)  # Keras losses take (y_true, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        #print(f"Epoch {epoch} | Loss TF: {loss.numpy():.6f}")
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    return time.time() - start_time


############################
# Main Run
############################
def run_benchmark():
    seq_length = 20
    batch_size = 32  # not used below: each epoch is one full-batch update on all 1000 sequences
    input_size = 10
    hidden_size = 128
    num_epochs = 10

    np.random.seed(42)
    X_train = np.random.rand(1000, seq_length, input_size).astype(np.float32)
    Y_train = X_train.copy()

    # TensorFlow
    tensorflow_time = benchmark_tensorflow(input_size, hidden_size, X_train, Y_train, num_epochs)


    print(f"TensorFlow Time: {tensorflow_time:.4f} s")

run_benchmark()
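
The benchmark above defines `batch_size` but trains on the full data set in every epoch. If mini-batch updates are wanted, one possible way to split the data is sketched below in NumPy. `iterate_minibatches` is a hypothetical helper, not part of the tutorial; each yielded pair would be fed through the same forward, loss, and update steps.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs that cover the data exactly once."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]

rng = np.random.default_rng(0)
X = rng.random((1000, 20, 10)).astype(np.float32)
Y = X.copy()

batches = list(iterate_minibatches(X, Y, batch_size=32, rng=rng))
print(len(batches))           # 32 (31 full batches and one of 8 sequences)
print(batches[0][0].shape)    # (32, 20, 10)
print(batches[-1][0].shape)   # (8, 20, 10)
```

With mini-batches, one epoch performs 32 parameter updates instead of one, so timings are no longer directly comparable to the full-batch numbers.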

# Standard LSTM RNN Implementation
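
For reference, the cell implemented in the next code block corresponds to the following equations, written in the same row-vector convention as the vanilla RNN and GRU sections. The symbol $\hat{\mathbf{c}}_t$ is our notation for the variable `c_hat` in the code.

\begin{aligned}
\mathbf{i}_t &= \sigma\!\bigl(\mathbf{x}_t \mathbf{W}_i + \mathbf{h}_{t-1}\,\mathbf{U}_i + \mathbf{b}_i\bigr), \\
\mathbf{f}_t &= \sigma\!\bigl(\mathbf{x}_t \mathbf{W}_f + \mathbf{h}_{t-1}\,\mathbf{U}_f + \mathbf{b}_f\bigr), \\
\mathbf{o}_t &= \sigma\!\bigl(\mathbf{x}_t \mathbf{W}_o + \mathbf{h}_{t-1}\,\mathbf{U}_o + \mathbf{b}_o\bigr), \\
\hat{\mathbf{c}}_t &= \tanh\!\bigl(\mathbf{x}_t \mathbf{W}_c + \mathbf{h}_{t-1}\,\mathbf{U}_c + \mathbf{b}_c\bigr), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} \;+\; \mathbf{i}_t \odot \hat{\mathbf{c}}_t, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}

Here $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the input, forget, and output gates, and $\mathbf{c}_t$ is the cell state, initialized to $\mathbf{c}_0 = \mathbf{0}$ alongside $\mathbf{h}_0 = \mathbf{0}$.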

In [11]:
class StandardLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Input gate weights
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        i_t = tf.sigmoid(tf.matmul(x_t, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_t, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_t, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_t, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

# ------- Higher-level TF RNN that unrolls over time -------
class StandardLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = StandardLSTMCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
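
Part of the cost difference between the vanilla RNN and the LSTM comes from the number of trainable parameters: the LSTM cell holds four blocks of the same shape as the single block in the RNN cell. The small calculation below uses the benchmark sizes `input_size = 10` and `hidden_size = 128`; the function names are hypothetical.

```python
def rnn_param_count(d_in, d_hid):
    """W_x, W_h and b_h of one vanilla RNN cell."""
    return d_in * d_hid + d_hid * d_hid + d_hid

def lstm_param_count(d_in, d_hid):
    """Four (W, U, b) blocks: input gate, forget gate, output gate, cell candidate."""
    return 4 * (d_in * d_hid + d_hid * d_hid + d_hid)

def projection_param_count(d_hid, d_out):
    """W_out and b_out of the output projection."""
    return d_hid * d_out + d_out

d_in, d_hid = 10, 128
print(rnn_param_count(d_in, d_hid) + projection_param_count(d_hid, d_in))   # 19082
print(lstm_param_count(d_in, d_hid) + projection_param_count(d_hid, d_in))  # 72458
```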

# Multiplicative LSTM RNN Implementation
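
The variant implemented in the next code block first computes a memory term $\mathbf{m}_t$ and uses it to rescale the input elementwise before the usual LSTM gates are applied. Written out from the code, with $\hat{\mathbf{x}}_t$ standing for `x_cap` and $\hat{\mathbf{c}}_t$ for `c_hat`:

\begin{aligned}
\mathbf{m}_t &= \mathbf{x}_t \mathbf{W}_m + \mathbf{h}_{t-1}\,\mathbf{U}_m + \mathbf{b}_m, \\
\hat{\mathbf{x}}_t &= \mathbf{m}_t \odot \mathbf{x}_t, \\
\mathbf{i}_t &= \sigma\!\bigl(\hat{\mathbf{x}}_t \mathbf{W}_i + \mathbf{h}_{t-1}\,\mathbf{U}_i + \mathbf{b}_i\bigr), \\
\mathbf{f}_t &= \sigma\!\bigl(\hat{\mathbf{x}}_t \mathbf{W}_f + \mathbf{h}_{t-1}\,\mathbf{U}_f + \mathbf{b}_f\bigr), \\
\mathbf{o}_t &= \sigma\!\bigl(\hat{\mathbf{x}}_t \mathbf{W}_o + \mathbf{h}_{t-1}\,\mathbf{U}_o + \mathbf{b}_o\bigr), \\
\hat{\mathbf{c}}_t &= \tanh\!\bigl(\hat{\mathbf{x}}_t \mathbf{W}_c + \mathbf{h}_{t-1}\,\mathbf{U}_c + \mathbf{b}_c\bigr), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} \;+\; \mathbf{i}_t \odot \hat{\mathbf{c}}_t, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}

For the elementwise product $\mathbf{m}_t \odot \mathbf{x}_t$ to be defined, $\mathbf{m}_t$ must have the same dimension as $\mathbf{x}_t$, which requires $\mathbf{W}_m \in \mathbb{R}^{d_{\text{in}} \times d_{\text{in}}}$, $\mathbf{U}_m \in \mathbb{R}^{d_{\text{hid}} \times d_{\text{in}}}$, and $\mathbf{b}_m \in \mathbb{R}^{d_{\text{in}}}$.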

In [15]:
class MultiplicativeLSTMCell(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

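        # Learnable memory weights and bias.
        # m_t is multiplied elementwise with x_t in call(), so it must have
        # input_size features; projecting to hidden_size would fail whenever
        # input_size != hidden_size (as in the benchmark: 10 vs. 128).
        self.W_m = self.add_weight(shape=(input_size, input_size), initializer="random_normal", trainable=True)
        self.U_m = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_m = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)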

        # Input gate weights and bias
        self.W_i = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_i = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_i = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Forget gate weights and bias
        self.W_f = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_f = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_f = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Output gate weights and bias
        self.W_o = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_o = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_o = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)
        
        # Cell candidate weights and bias
        self.W_c = self.add_weight(shape=(input_size, hidden_size), initializer="random_normal", trainable=True)
        self.U_c = self.add_weight(shape=(hidden_size, hidden_size), initializer="random_normal", trainable=True)
        self.b_c = self.add_weight(shape=(hidden_size,), initializer="zeros", trainable=True)

    def call(self, x_t, h_prev, c_prev):
        # Multiplicative Extension
        m_t = tf.matmul(x_t, self.W_m) + tf.matmul(h_prev, self.U_m) + self.b_m
        x_cap = m_t * x_t

        i_t = tf.sigmoid(tf.matmul(x_cap, self.W_i) + tf.matmul(h_prev, self.U_i) + self.b_i)
        f_t = tf.sigmoid(tf.matmul(x_cap, self.W_f) + tf.matmul(h_prev, self.U_f) + self.b_f)
        o_t = tf.sigmoid(tf.matmul(x_cap, self.W_o) + tf.matmul(h_prev, self.U_o) + self.b_o)
        c_hat = tf.tanh(tf.matmul(x_cap, self.W_c) + tf.matmul(h_prev, self.U_c) + self.b_c)
        
        c_t = f_t * c_prev + i_t * c_hat
        h_t = o_t * tf.tanh(c_t)
        
        return h_t, c_t

class MultiplicativeLSTM(tf.keras.layers.Layer):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = MultiplicativeLSTMCell(input_size, hidden_size)
        
        # Output projection
        self.W_out = self.add_weight(shape=(hidden_size, input_size), initializer="random_normal", trainable=True)
        self.b_out = self.add_weight(shape=(input_size,), initializer="zeros", trainable=True)

    def call(self, X):
        # X: [batch_size, seq_length, input_size]
        batch_size = tf.shape(X)[0]
        seq_length = tf.shape(X)[1]
        h = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)
        c = tf.zeros((batch_size, self.hidden_size), dtype=X.dtype)

        outputs = []
        
        for t in range(seq_length):
            x_t = X[:, t, :]
            h, c = self.lstm_cell(x_t, h, c)
            out_t = tf.matmul(h, self.W_out) + self.b_out
            outputs.append(tf.expand_dims(out_t, axis=1))
        return tf.concat(outputs, axis=1)  # [batch_size, seq_length, input_size]
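
The shapes in the multiplicative cell are easy to get wrong, so a framework-free check can help. The NumPy sketch below assumes the memory projection maps to `input_size` features, so that `m_t` and `x_t` share a shape, and runs one step with the benchmark sizes. The helper `mlstm_step` and the dictionary `p` are hypothetical names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlstm_step(x_t, h_prev, c_prev, p):
    """One step of the multiplicative LSTM variant used in this tutorial."""
    m_t = x_t @ p["W_m"] + h_prev @ p["U_m"] + p["b_m"]    # [batch, d_in]
    x_cap = m_t * x_t                                      # rescaled input, [batch, d_in]
    i_t = sigmoid(x_cap @ p["W_i"] + h_prev @ p["U_i"] + p["b_i"])
    f_t = sigmoid(x_cap @ p["W_f"] + h_prev @ p["U_f"] + p["b_f"])
    o_t = sigmoid(x_cap @ p["W_o"] + h_prev @ p["U_o"] + p["b_o"])
    c_hat = np.tanh(x_cap @ p["W_c"] + h_prev @ p["U_c"] + p["b_c"])
    c_t = f_t * c_prev + i_t * c_hat
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_hid, batch = 10, 128, 4
rng = np.random.default_rng(0)
p = {
    "W_m": rng.normal(scale=0.05, size=(d_in, d_in)),      # assumption: project to d_in
    "U_m": rng.normal(scale=0.05, size=(d_hid, d_in)),
    "b_m": np.zeros(d_in),
}
for gate in ("i", "f", "o", "c"):
    p[f"W_{gate}"] = rng.normal(scale=0.05, size=(d_in, d_hid))
    p[f"U_{gate}"] = rng.normal(scale=0.05, size=(d_hid, d_hid))
    p[f"b_{gate}"] = np.zeros(d_hid)

x_t = rng.random((batch, d_in))
h, c = np.zeros((batch, d_hid)), np.zeros((batch, d_hid))
h, c = mlstm_step(x_t, h, c, p)
print(h.shape, c.shape)  # (4, 128) (4, 128)
```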

In [17]:
def benchmark_rnn_models(input_size, hidden_size, X_train, Y_train, epochs=10, lr=0.01):
    X = tf.convert_to_tensor(X_train, dtype=tf.float32)
    Y = tf.convert_to_tensor(Y_train, dtype=tf.float32)

    # Standard LSTM benchmark. Swap in RNNTF here to time the vanilla RNN instead.
    model = StandardLSTM(input_size, hidden_size)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanSquaredError()

    start_time = time.time()
    
    for epoch in range(epochs):
        with tf.GradientTape() as tape:
            output = model(X)
            loss = loss_fn(Y, output)  # Keras losses take (y_true, y_pred)
        gradients = tape.gradient(loss, model.trainable_variables)
        print(f"Epoch {epoch} | Loss: {loss.numpy():.6f}")
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    return time.time() - start_time


############################
# Main Run
############################
def run_benchmark():
    seq_length = 20
    batch_size = 32  # not used below: each epoch is one full-batch update on all 1000 sequences
    input_size = 10
    hidden_size = 128
    num_epochs = 10

    np.random.seed(123)
    X_train = np.random.rand(1000, seq_length, input_size).astype(np.float32)
    Y_train = X_train.copy()

    # TensorFlow
    tensorflow_time = benchmark_rnn_models(input_size, hidden_size, X_train, Y_train, num_epochs)


    print(f"TensorFlow Time: {tensorflow_time:.4f} s")

run_benchmark()

Epoch 0 | Loss: 0.330919
Epoch 1 | Loss: 0.209169
Epoch 2 | Loss: 0.179643
Epoch 3 | Loss: 0.120039
Epoch 4 | Loss: 0.136909
Epoch 5 | Loss: 0.121190
Epoch 6 | Loss: 0.097766
Epoch 7 | Loss: 0.090170
Epoch 8 | Loss: 0.098913
Epoch 9 | Loss: 0.092541
TensorFlow Time: 5.7567 s
