# LSTM and SimpleRNN

Resources:
- Textbook, Geron 2019: Aurelien Geron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Sebastopol, CA, 3rd edition, 2023.
- Interactive Video: [StatQuest - LSTM](https://www.youtube.com/watch?v=YCzL96nL7j0&t=358s&ab_channel=StatQuestwithJoshStarmer)

Keras Documentation: 
- [Keras Code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/recurrent.py)
- [SimpleRNN](https://keras.io/api/layers/recurrent_layers/simple_rnn/), [Dense](https://keras.io/api/layers/core_layers/dense/), [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/)

Documentation on order of weights in LSTM: 
- [Stack Overflow Thread 1](https://stackoverflow.com/questions/68845790/gate-weights-order-for-lstm-layers-in-tensorflow)
- [Stack Overflow Thread 2](https://stackoverflow.com/questions/46817085/keras-interpreting-the-output-of-get-weights)

## Background

### Simple RNN Connections and Equations

The Simple RNN cell modifies the standard neural network cell by adding a recurrent connection. NOTE: some people refer to a single unit as just a "neuron", and use the concept of a "cell" to refer to an entire computational layer of neurons. The output of the cell at the previous time step is stored as a "hidden state", and this state is combined in a linear combination with the input to the cell to create the cell output, which is then stored as the next hidden state. Let,
- $h_t$: the hidden state, or the cell output, at time $t$, $h_t\in \mathbb R$
- $X_t$: vector of external input at time $t$, $X_t \in \mathbb R^p$
    - dimensions $p\times 1$, where $p$ is the number of predictors, or the dimensionality of the features
- $W_x$: weights associated with $X_t$, $W_x \in \mathbb R^p$
    - dimensions $1\times p$ (using this convention to simplify the multiplication notation, it's a dot product between $X_t$ and $W_x$)
- $W_h$: weight associated with the previous hidden state $h_{t-1}$, aka the recurrent weight, $W_h\in \mathbb R$
- $b$: bias, $b\in \mathbb R$

The equation for the output of the cell at time $t$ is a function of a linear combination with activation function $\phi: \mathbb R\to \mathbb R$ (e.g. sigmoid, tanh, ReLU):

$$
h_t = \phi(W_x\cdot X_t + W_h\cdot h_{t-1}+b)
$$
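As a rough illustration of the update above, the following sketch steps a single-unit cell through a short sequence in plain Python. The function name `simple_rnn_step`, the weight values, and the inputs are made up for this example. The activation defaults to the identity so the arithmetic is easy to check by hand.

```python
import math

def simple_rnn_step(x_t, h_prev, w_x, w_h, b, phi=lambda z: z):
    """One step of h_t = phi(W_x . X_t + W_h * h_{t-1} + b) for a single unit."""
    z = sum(w * x for w, x in zip(w_x, x_t)) + w_h * h_prev + b
    return phi(z)

# Hypothetical weights for p = 3 features (illustrative values only)
w_x = [0.5, -1.0, 2.0]
w_h = 0.1
b = 0.0

# Two time steps of input, starting from a zero hidden state
sequence = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
h = 0.0
hidden_states = []
for x_t in sequence:
    h = simple_rnn_step(x_t, h, w_x, w_h, b)
    hidden_states.append(h)

print(hidden_states)

# Swapping in a bounded activation only changes phi
h_tanh = simple_rnn_step(sequence[0], 0.0, w_x, w_h, b, phi=math.tanh)
```

The second hidden state depends on the first through the `w_h * h_prev` term, which is the only thing separating this cell from an ordinary dense neuron.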

### LSTM Connections and Equations

The LSTM cell also has a hidden state $h_t$ that is the output of the cell and is stored as the short-term memory. The LSTM cell augments the Simple RNN cell by adding an additional state, called the "cell state" or the long-term memory. Further, the LSTM cell has several "gates" that control how the cell state is modified by new information. The hidden state $h_t$ and input $X_t$ are defined as before. The LSTM machinery uses 4 separate linear combinations of the previous hidden state $h_{t-1}$ and the input $X_t$, each with a different interpretation. All biases are scalars, all weights associated with the hidden state are scalars, and all weights associated with the input are vectors of the same length as the input.

**Forget Gate:** produces a value that is multiplied with the previous cell state. By default, this value is in $(0,1)$, so it is interpretable as the percentage of the long-term memory to retain. Let,
- $W_x^{(f)}$: weight vector for input at forget gate
- $W_h^{(f)}$: weight for hidden state at forget gate
- $b^{(f)}$: bias at forget gate

The forget gate by default uses the sigmoid activation function $\sigma: \mathbb R \to (0,1)$, where $\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1+e^x}$. The sigmoid maps the real number line to the open interval $(0,1)$, so its output is interpretable as a probability or percentage. The value produced by the forget gate is multiplied by the cell state at the previous time step, so it is the percentage of the old memory to retain. Let the forget gate be denoted $f_t$, and

$$
f_t = \sigma(W_x^{(f)} \cdot X_t + W_h^{(f)}\cdot h_{t-1} + b^{(f)}) \in (0,1)
$$

**Candidate Memory:** produces a value interpretable as a proposed new contribution to the long-term cell state. This is often denoted as $g_t$, or sometimes $\tilde c_t$. Let,
- $W_x^{(g)}$: weight vector for input at candidate 
- $W_h^{(g)}$: weight for hidden state at candidate
- $b^{(g)}$: bias at candidate

By default, the candidate memory uses the hyperbolic tangent function $\tanh: \mathbb R\to (-1,1)$, where $\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$. The tanh function maps the real numbers to the open interval $(-1,1)$. ReLU or any other activation function can be used in this stage, but using the unbounded ReLU can lead to stability issues during training that the LSTM was designed to remedy. Let the candidate be denoted $g_t$, and

$$
g_t = \tanh(W_x^{(g)} \cdot X_t + W_h^{(g)}\cdot h_{t-1} + b^{(g)}) \in (-1,1)
$$

**Input Gate:** produces a value in $(0,1)$, interpretable as the percentage of the candidate memory to add to the long-term memory. Let,
- $W_x^{(i)}$: weight vector for input at input gate 
- $W_h^{(i)}$: weight for hidden state at input gate
- $b^{(i)}$: bias at input gate

For the same reasons as the forget gate, the input gate typically uses sigmoid activation. Let the input gate be denoted $i_t$, and

$$
i_t = \sigma(W_x^{(i)} \cdot X_t + W_h^{(i)}\cdot h_{t-1} + b^{(i)}) \in (0,1)
$$

**Updating the Cell State:** the long-term memory is updated by applying the forget gate to retain a percentage of the old cell state, and then adding the candidate memory scaled by the input gate:

$$
c_t = f_t\cdot c_{t-1} + i_t\cdot g_t, \quad \in \mathbb R
$$

**Output Gate:** produces a value that is used to construct the new hidden state. By default, this value is in $(0,1)$. This interpretation is more subtle: a value of the output gate near 0 means that the output $h_t$ retains almost nothing from the long-term memory, while a value near 1 means that $h_t$ is mostly a function of the cell state. Let,
- $W_x^{(o)}$: weight vector for input at output gate 
- $W_h^{(o)}$: weight for hidden state at output gate
- $b^{(o)}$: bias at output gate

Again, by default the output gate utilizes sigmoid activation. Let the output gate be denoted $o_t$, and

$$
o_t = \sigma(W_x^{(o)} \cdot X_t + W_h^{(o)}\cdot h_{t-1} + b^{(o)}) \in (0,1)
$$

**Updating the Hidden State:** the new hidden state of the LSTM, which is the output of the cell, is the product of the output gate and the activated cell state, where the activation is tanh by default:

$$
h_t = o_t \cdot \tanh(c_t)
$$
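The gate, candidate, cell-state, and hidden-state updates can be collected into one step function. This is a minimal sketch for a single-unit cell in plain Python, not the Keras implementation. The names `lstm_step` and `params` are hypothetical, and the gates read the hidden state from the previous time step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def lstm_step(x_t, h_prev, c_prev, params, activation=math.tanh):
    """One step of a single-unit LSTM.

    params maps each of "f", "g", "i", "o" to a tuple (w_x, w_h, b),
    where w_x is a list with one weight per input feature.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return dot(w_x, x_t) + w_h * h_prev + b

    f_t = sigmoid(pre("f"))          # forget gate
    g_t = activation(pre("g"))       # candidate memory
    i_t = sigmoid(pre("i"))          # input gate
    o_t = sigmoid(pre("o"))          # output gate
    c_t = f_t * c_prev + i_t * g_t   # new cell state
    h_t = o_t * activation(c_t)      # new hidden state
    return h_t, c_t

# With all weights and biases at zero, every gate equals sigmoid(0) = 0.5
# and the candidate equals tanh(0) = 0 (illustrative values only).
zero = ([0.0, 0.0, 0.0], 0.0, 0.0)
params = {name: zero for name in ("f", "g", "i", "o")}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], h_prev=0.0, c_prev=1.0, params=params)
print(h_t, c_t)
```

In the zero-weight example the cell state is simply halved by the forget gate, which makes the role of each gate easy to trace.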

## Connecting the LSTM and Simple RNN

The simple RNN can be thought of as a special case of the LSTM in which the forget gate is exactly 0, the input gate is exactly 1, and the output gate is exactly 1. The sigmoid function never produces exactly 0 or 1, so with sigmoid activation for the input/output/forget gates, an LSTM can produce outputs arbitrarily close to those of a simple RNN but cannot reproduce them exactly. There are activation functions that can produce exact 0's and 1's, like the "hard sigmoid", which is a piecewise linear approximation to the sigmoid.
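To make the contrast concrete, here is a sketch of one common form of the hard sigmoid next to the ordinary sigmoid. The constants `slope=0.2` and `offset=0.5` are assumptions for illustration, since libraries differ on the exact slope.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hard_sigmoid(z, slope=0.2, offset=0.5):
    """Piecewise-linear stand-in for the sigmoid, clipped to [0, 1]."""
    return max(0.0, min(1.0, slope * z + offset))

for z in [-10.0, 0.0, 10.0]:
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.6f}  hard_sigmoid={hard_sigmoid(z):.6f}")
```

The hard sigmoid saturates at exactly 0 and 1 for large negative and positive inputs, while the sigmoid only approaches those values.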

With the gates fixed this way, the candidate memory takes the place of the linear combination used in the simple RNN: since the input gate is 1 and the forget gate is 0, the candidate becomes the new long-term memory $c_t$. In the LSTM cell, another activation function is applied to $c_t$ to generate the new hidden state $h_t$. If this activation were linear, there would be no issue and the LSTM cell would reproduce the simple RNN with the setup described in this section. However, by default in both TensorFlow and PyTorch, you cannot set the activation used in the candidate to a different function than the one used to generate the new $h_t$. So if you used `activation="tanh"` in the LSTM cell, the tanh function would be applied twice and you would not reproduce the same values. Further, the doubly activated output can never approach 1, since $\lim_{x\to \infty}\tanh(\tanh(x)) = \tanh(1) \approx 0.76$. We will proceed with the example that the simple RNN and the LSTM candidate and final output use linear activation to make the math and programming easier.
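The effect of applying tanh twice can be checked numerically. This short sketch uses only the standard library, and the input values are arbitrary.

```python
import math

for z in [1.0, 5.0, 50.0]:
    once = math.tanh(z)
    twice = math.tanh(once)
    print(f"z={z:5.1f}  tanh(z)={once:.5f}  tanh(tanh(z))={twice:.5f}")

# The doubly activated value can never exceed tanh(1)
upper_bound = math.tanh(1.0)
print(f"tanh(1) = {upper_bound:.5f}")
```

However large the pre-activation gets, the doubly activated value stays below about 0.762, so an LSTM with tanh in both places cannot match a tanh simple RNN whose output approaches 1.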

To recreate the output of a simple RNN cell using an LSTM, we can set the weights to zeros in the forget, input, and output gates and use large biases that approximate 0 or 1 depending on the gate. Observe that,

$$
\sigma(-10)\approx 0.000045 \qquad\qquad \sigma(10)\approx 0.999955
$$

So, we can almost exactly reproduce the outputs of the simple RNN by copying the weights from the simple RNN cell into the candidate $g_t$ with linear activation, and setting the gate weights and biases to

| Weight | Description | Value to set |
|--------|-------------|-------|
|    $W_x^{(f)}$    |      Forget gate input weight       |    0   |
|    $W_h^{(f)}$    |      Forget gate recurrent weight       |    0   |
|    $b^{(f)}$    |      Forget gate bias       |    -10   |
|    $W_x^{(i)}$    |      Input gate input weight       |    0   |
|    $W_h^{(i)}$    |      Input gate recurrent weight       |    0   |
|    $b^{(i)}$    |      Input gate bias       |    10   |
|    $W_x^{(o)}$    |      Output gate input weight       |    0   |
|    $W_h^{(o)}$    |      Output gate recurrent weight       |    0   |
|    $b^{(o)}$    |      Output gate bias       |    10   |
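The settings in the table can be sanity-checked without Keras. The sketch below runs a single-feature simple RNN and the corresponding LSTM side by side in plain Python, with linear activation in both. The weights and inputs are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical simple RNN parameters for one feature (illustrative values)
w_x, w_h, b = 0.3, 0.8, 0.1
inputs = [1.0, -0.5, 2.0, 0.0, 1.5]

# Gate values when the gate weights are zero and only the biases remain
f = sigmoid(-10.0)  # forget gate, close to 0
i = sigmoid(10.0)   # input gate, close to 1
o = sigmoid(10.0)   # output gate, close to 1

h_rnn, h_lstm, c = 0.0, 0.0, 0.0
max_diff = 0.0
for x in inputs:
    # Simple RNN with linear activation
    h_rnn = w_x * x + w_h * h_rnn + b
    # LSTM with linear activation in the candidate and on the cell state
    g = w_x * x + w_h * h_lstm + b
    c = f * c + i * g
    h_lstm = o * c
    max_diff = max(max_diff, abs(h_rnn - h_lstm))

print(f"largest absolute difference: {max_diff:.2e}")
```

The two trajectories differ slightly because the gates are only approximately 0 and 1, but the gap stays small relative to the outputs themselves.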

However, at $x=10$, the derivative of the sigmoid function is $\sigma(10)(1-\sigma(10))\approx 0.00005$. So this would lead to incredibly slow training updates for the gate parameters if the model were trained with gradient descent. The training might not work at all, since the gradients of the weights in the candidate memory step would be much larger, and those weights would learn quickly compared to the parameters hit by the near-zero gradient. The example with biases equal to $\pm 10$ will be used to show how the simple RNN can be recreated with the LSTM. But setting the biases to smaller magnitudes like 3-5 would still be close to the simple RNN, and if the model were trained with gradient descent it would have a better chance of learning at a reasonable rate.
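The trade-off between how closely a gate saturates and how much gradient it passes can be tabulated directly. This is a small sketch using the identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$, and the bias values 3, 5, and 10 are just the ones discussed here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for bias in [3.0, 5.0, 10.0]:
    print(f"bias={bias:4.1f}  gate={sigmoid(bias):.5f}  gradient={sigmoid_grad(bias):.5f}")
```

A bias of 3 already puts the gate above 0.95 while keeping a gradient roughly a thousand times larger than at a bias of 10.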

## Coding Demonstration

### Environment

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [None]:
nfeatures=3
batch_size=None
timesteps=None
tf.random.set_seed(123)

### Simple RNN

Network architecture:
- Input: number of weights equal to features
- Simple RNN: 1 recurrent weight, 1 bias
- Dense Output Neuron: 1 weight, 1 bias

For 3 features, expect 7 trainable parameters in the summary below.

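Default initializers for Keras layers use zeros for the bias. The recurrent weight of the SimpleRNN uses an orthogonal initializer, which gives $\pm 1$ for a single cell, since the only $1\times 1$ orthogonal matrices are $[1]$ and $[-1]$, e.g. $[1]^T[1] = [1][1]^T=I_1$. NOTE: this refers to orthogonal matrices, not orthogonal vectors.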

In [None]:
inputs = tf.keras.Input(batch_shape=(batch_size, timesteps, nfeatures))
x = tf.keras.layers.SimpleRNN(1, return_sequences=True, activation="linear")(inputs)
outputs = tf.keras.layers.Dense(1)(x)
rnn = tf.keras.Model(inputs, outputs)
rnn.compile(loss = "mean_squared_error", optimizer="Adam")
rnn.summary()

In [None]:
rweights = rnn.get_weights()
print("~"*25)
print(f"Input Weights: ")
print(f"    Shape: {rweights[0].shape}")
print(f"    Values: {rweights[0].T}")
print("~"*25)
print(f"Recurrent Weight:")
print(f"    Shape: {rweights[1].shape}")
print(f"    Values: {rweights[1].T}")
print("~"*25)
print(f"Recurrent Bias:")
print(f"    Shape: {rweights[2].shape}")
print(f"    Values: {rweights[2].T}")
print("~"*25)
print("Dense Output Weight:")
print(f"    Shape: {rweights[3].shape}")
print(f"    Values: {rweights[3].T}")
print("~"*25)
print("Dense Output Bias:")
print(f"    Shape: {rweights[4].shape}")
print(f"    Values: {rweights[4].T}")

### LSTM

Network architecture:
- Input: number of weights equal to features, denote $p$ 
- Gates and candidate each have: 1 bias + 1 recurrent weight + $p$ input weights = (1+1+$p$) parameters
- Output neuron: 1 weight and 1 bias

So for $p=3$ features, it has a number of parameters equal to $4\cdot(1+1+3) = 20$, for a grand network total of $22$ with the output neuron parameters.
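The parameter counts for both networks can be reproduced with a few helper functions. The function names are hypothetical, and the `units` argument generalizes beyond the single-cell case used in this notebook.

```python
def simple_rnn_params(p, units=1):
    """Input weights + recurrent weights + biases for a simple RNN layer."""
    return units * (p + units + 1)

def lstm_params(p, units=1):
    """Three gates plus the candidate, each shaped like a simple RNN cell."""
    return 4 * simple_rnn_params(p, units)

def dense_params(n_in, n_out=1):
    """Weights plus biases for a dense layer."""
    return n_out * (n_in + 1)

p = 3
rnn_total = simple_rnn_params(p) + dense_params(1)
lstm_total = lstm_params(p) + dense_params(1)
print(rnn_total, lstm_total)  # 7 22
```

These totals should agree with the trainable parameter counts reported by `summary()` for the two models.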

Default initializers are zeros for the bias, but adding 1 to the bias in the forget gate. See the argument `unit_forget_bias` in the Keras documentation. 

In [None]:
inputs = tf.keras.Input(batch_shape=(batch_size, timesteps, nfeatures))
x = tf.keras.layers.LSTM(1, return_sequences=True, activation="linear")(inputs)
outputs = tf.keras.layers.Dense(1)(x)
lstm = tf.keras.Model(inputs, outputs)
lstm.compile(loss = "mean_squared_error", optimizer="Adam")
lstm.summary()

### Set LSTM Weights to Reproduce RNN

Simple RNN is a special case of the LSTM. Using the initialized weights in the simple RNN, we set the weights for the LSTM in such a way that they generate the same output.

The LSTM layer weights are a list of 3 arrays: input weights, recurrent weights, and biases. Within each array, according to the Stack Overflow threads linked at the top of this notebook, the order is `i, f, c, o`, which stands for input gate, forget gate, candidate (cell) memory, and output gate, respectively. The same ordering applies along the last axis of each of the three arrays.
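Assuming the `i, f, c, o` ordering reported in those threads, a small helper can label the entries of one row of weights. The helper name `split_gates` and the example bias values are hypothetical. It works on plain lists such as `lweights[2].tolist()`.

```python
GATE_ORDER = ("i", "f", "c", "o")  # order assumed from the linked threads

def split_gates(row, units=1):
    """Split one row of 4 * units values into a dict keyed by gate."""
    return {
        name: row[k * units:(k + 1) * units]
        for k, name in enumerate(GATE_ORDER)
    }

# Hypothetical bias vector for a single-unit LSTM (illustrative values)
bias = [10.0, -10.0, 0.25, 10.0]
gates = split_gates(bias)
print(gates)
```

Labeling the entries this way makes it harder to put the forget-gate bias in the input-gate slot when filling the arrays by hand.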

In [None]:
lweights = lstm.get_weights()
print("~"*25)
print(f"Input Weights: ")
print(f"    Shape: {lweights[0].shape}")
print(f"    Values: {lweights[0]}")
print("~"*25)
print(f"Recurrent Weight:")
print(f"    Shape: {lweights[1].shape}")
print(f"    Values: {lweights[1]}")
print("~"*25)
print(f"Recurrent Bias:")
print(f"    Shape: {lweights[2].shape}")
print(f"    Values: {lweights[2].T}")
print("~"*25)
print("Dense Output Weight:")
print(f"    Shape: {lweights[3].shape}")
print(f"    Values: {lweights[3].T}")
print("~"*25)
print("Dense Output Bias:")
print(f"    Shape: {lweights[4].shape}")
print(f"    Values: {lweights[4].T}")

We now set the weights of the LSTM to match the Simple RNN. See table above for detailed description.
- Set weights for input, output, and forget gates to zeros.
- Set biases for input, output, and forget gates to large values to get the sigmoid to approximate 0 or 1
- Set the weights and biases for the candidate to match the simple RNN weights
- Match weight and bias for output neuron

In [None]:
# Set Input Weights
lweights[0] = np.array([
    [0,0, rweights[0][0][0], 0], # Feature 1
    [0,0, rweights[0][1][0], 0], # Feature 2
    [0,0, rweights[0][2][0], 0]  # Feature 3
])

# Set Recurrent Weights
lweights[1] = np.array([[
    0,                 # Input gate
    0,                 # Forget gate
    rweights[1][0][0], # Candidate, set to recurrent weight from SimpleRNN
    0,                 # Output gate
]])

# Set Biases
lweights[2] = np.array([
    10,                # Input gate bias, set to approx 1
    -10,               # Forget gate bias, set to approx 0
    rweights[2][0],    # Candidate bias, set to bias from SimpleRNN
    10,                # Output gate bias, set to approx 1
])

# Set output neuron weights
lweights[3] = rweights[3]
lweights[4] = rweights[4]

In [None]:
lstm.set_weights(lweights)

### Compare Predictions

Check that predictions match for a few scenarios:
- randomly generated normally distributed data
- constant data for a few constant values for each feature

In [None]:
np.random.seed(123)
nbatch=10
nt = 10
X = np.random.randn(nbatch, nt, nfeatures)

In [None]:
p1 = rnn.predict(X)
p2 = lstm.predict(X)

In [None]:
print(f"Max Difference between SimpleRNN and LSTM for {len(p1.flatten())} Standard Normal Inputs")
print(np.max(np.abs(p1-p2)))

In [None]:
plt.plot(p1.flatten(), label="SimpleRNN Predictions")
plt.plot(p2.flatten(), linestyle=':', label="LSTM Predictions")
plt.legend()
plt.ylabel("Prediction")
plt.title(f"Predicted values for {len(p1.flatten())} Standard Normal Inputs")

In [None]:
np.random.seed(123)
X = np.zeros((nbatch, nt, nfeatures))
constants = [-2, -1, 0, 1, 2]

In [None]:
for i in constants:
    print("~"*50)
    print(f"Constant = {i}")
    p1 = rnn.predict(X+i)
    p2 = lstm.predict(X+i)
    print(f"Max Difference between SimpleRNN and LSTM for {len(p1.flatten())} Constant Inputs")
    print(np.max(np.abs(p1-p2)))

## Physics-Initiated RNN

Using the scheme above, we recreate the physics initiated RNN in an LSTM cell.

The discretized solution to the simplified version of the timelag ODE is:

$$
m_t = (1-e^{(-1/T)}) E + e^{(-1/T)} m_{t-1}
$$

Here $E$ is the equilibrium moisture content and $T$ is the timelag constant. Denote the coefficients above by $W_x = 1-e^{-1/T}$ and $W_h = e^{-1/T}$, and let $b=0$, so

$$
m_t = W_x E + W_h m_{t-1} + b
$$

These values can be used to set the weights and biases in a simple RNN to exactly reproduce the solution to the ODE. The dense output neuron is set to weight 1 and bias 0. This isn't needed in the 1 neuron case, but we keep it here so the setup extends to stacking multiple RNN cells to improve accuracy. Then, we will demonstrate applying it to the LSTM to approximate the solution.
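Before wiring these constants into Keras, the recurrence itself can be checked in plain Python. This sketch assumes a constant equilibrium $E$ and an initial moisture content of zero, which are illustrative choices. Under those assumptions the recurrence has the closed form $m_t = E(1 - e^{-t/T})$.

```python
import math

T = 10.0   # timelag in hours (illustrative)
E = 20.0   # constant equilibrium moisture content, percent (illustrative)
W_h = math.exp(-1.0 / T)   # recurrent weight
W_x = 1.0 - W_h            # input weight
b = 0.0

m = 0.0    # assumed initial moisture content
trajectory = []
for t in range(1, 25):
    m = W_x * E + W_h * m + b
    trajectory.append(m)

# Closed form for constant E and m_0 = 0
closed_form = [E * (1.0 - math.exp(-t / T)) for t in range(1, 25)]
max_err = max(abs(a - c) for a, c in zip(trajectory, closed_form))
print(f"m_24 = {trajectory[-1]:.3f}, max deviation from closed form = {max_err:.2e}")
```

The moisture content relaxes toward $E$ at a rate set by $T$, and the input and recurrent weights sum to 1.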

In [None]:
nfeatures = 1 # Only Equilibrium MC
T = 10 # 10hr FMC 
Wh = np.exp(-1/T)  # Recurrent weight, e^(-1/T)
Wx = 1 - Wh        # Input weight, 1 - e^(-1/T)
b = 0

In [None]:
# Create simple RNN and LSTM with only 1 input
batch_size=None
timesteps=None
nfeatures=1

## Simple RNN
inputs = tf.keras.Input(batch_shape=(batch_size, timesteps, nfeatures))
x = tf.keras.layers.SimpleRNN(1, return_sequences=True, activation="linear")(inputs)
outputs = tf.keras.layers.Dense(1)(x)
rnn = tf.keras.Model(inputs, outputs)
rnn.compile(loss = "mean_squared_error", optimizer="Adam")
## LSTM
inputs = tf.keras.Input(batch_shape=(batch_size, timesteps, nfeatures))
x = tf.keras.layers.LSTM(1, return_sequences=True, activation="linear")(inputs)
outputs = tf.keras.layers.Dense(1)(x)
lstm = tf.keras.Model(inputs, outputs)
lstm.compile(loss = "mean_squared_error", optimizer="Adam")

In [None]:
# Set simple RNN weights 
rweights = rnn.get_weights()

rweights[0] = np.array([[Wx]]) # Input
rweights[1] = np.array([[Wh]]) # Recurrent connection
rweights[2] = np.array([0])    # RNN Cell Bias
rweights[3] = np.array([[1]])  # Dense Output Weight
rweights[4] = np.array([0])  # Dense Output Bias

rnn.set_weights(rweights)

In [None]:
# Simulate Equilibrium moisture content under a few scenarios
    # Constant 5, 10, 20, 50 %
    # Sine waves with 24 hour period, varying intensities
nbatch = 1
nt = 24
X0 = np.zeros((nbatch, nt, 1))
constants = [5, 10, 20, 50]
Xc = np.vstack([X0 + c for c in constants])
rnn_preds = rnn.predict(Xc)

In [None]:
tsteps = np.arange(nt)

plt.plot(tsteps, rnn_preds[constants.index(50)], label="EMC=50")
plt.plot(tsteps, rnn_preds[constants.index(20)], label="EMC=20")
plt.plot(tsteps, rnn_preds[constants.index(10)], label="EMC=10")
plt.plot(tsteps, rnn_preds[constants.index(5)], label="EMC=5")

plt.ylabel("Predicted FMC (%)")
plt.xlabel("Time Step")
plt.legend()
plt.grid()
plt.title("Simple RNN with Physics-Initialized Weights")

In [None]:
# Simulate Sine Waves and Predict
nt = 24*3
mean_val = 10
amplitude = 5
period_hours = 24

timesteps = np.arange(nt)
wave = amplitude * np.sin(2 * np.pi/period_hours * timesteps) + mean_val
Xp = wave.reshape(1, -1, 1)

In [None]:
p1 = rnn.predict(Xp)

plt.plot(timesteps, Xp.squeeze(), label="EMC")
plt.plot(timesteps, p1.squeeze(), label="Predicted FMC")

plt.ylabel("Moisture Content (%)")
plt.xlabel("Time Step")
plt.legend()
plt.grid()
plt.title("Simple RNN with Physics-Initialized Weights")

## Physics Initiated LSTM

Set weights appropriately.

Use same data as simulated above, see if we can reproduce.

In [None]:
lweights = lstm.get_weights()

# Set Input Weights
lweights[0] = np.array([
    [0,0, rweights[0][0][0], 0], # EMC weights
])

# Set Recurrent Weights
lweights[1] = np.array([[
    0,                 # Input gate
    0,                 # Forget gate
    rweights[1][0][0], # Candidate, set to recurrent weight from SimpleRNN
    0,                 # Output gate
]])

# Set Biases
lweights[2] = np.array([
    10,                # Input gate bias, set to approx 1
    -10,               # Forget gate bias, set to approx 0
    rweights[2][0],    # Candidate bias, set to bias from SimpleRNN
    10,                # Output gate bias, set to approx 1
])

# Set output neuron weights
lweights[3] = rweights[3]
lweights[4] = rweights[4]

lstm.set_weights(lweights)

In [None]:
# Predict
lstm_preds = lstm.predict(Xc)
p2 = lstm.predict(Xp)

In [None]:
np.max(np.abs(lstm_preds - rnn_preds))

In [None]:
np.max(np.abs(p1 - p2))