# Mathematical Definitions

**Definition (LSTM):** An LSTM is defined by
* a number $m$ of units and a number $k$ of features
* a $4$-tuple of matrices $W_i, W_f, W_c, W_o \in \mathbb{R}^{k \times m}$ called *input, forget, cell and output kernels*,
* a $4$-tuple of matrices $U_i, U_f, U_c, U_o \in \mathbb{R}^{m \times m}$ called *input, forget, cell and output recurrent kernels*,
* a $4$-tuple of vectors $b_i, b_f, b_c, b_o \in \mathbb{R}^{m}$ called *input, forget, cell and output bias*,
* two functions $\sigma, \tau:\mathbb{R} \to \mathbb{R}$ called *activation* and *recurrent activation*, which are applied elementwise to vectors.




**Definition (feedforward):** Let $x_1, \ldots, x_T \in \mathbb{R}^{k} \cong \mathbb{R}^{1 \times k}$ be an input sequence. Then the outputs $h_1, \ldots, h_T$ of the LSTM are defined recursively by

\begin{align}
i_t &:= \tau(x_t W_i + h_{t-1} U_i + b_i) \in \mathbb{R}^m\\
f_t &:= \tau(x_t W_f + h_{t-1} U_f + b_f) \in \mathbb{R}^m\\
\tilde c_t &:= \sigma(x_t W_c + h_{t-1} U_c + b_c) \in \mathbb{R}^m \\
c_t &:= f_t \odot c_{t-1} + i_t \odot \tilde c_t \in \mathbb{R}^m \\
o_t &:= \tau(x_t W_o + h_{t-1} U_o + b_o) \in \mathbb{R}^m \\
h_t &:= o_t \odot \sigma(c_t) \in \mathbb{R}^m
\end{align}

where $\odot$ denotes the elementwise product and we employ the convention $h_0 := c_0 := 0$.
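
As a small worked case, consider the first time step. Since the initial hidden and cell states are zero by convention, all recurrent terms vanish, and the recursion reduces to the following (writing $\odot$ for the elementwise product):

\begin{align}
i_1 &= \tau(x_1 W_i + b_i) \\
\tilde c_1 &= \sigma(x_1 W_c + b_c) \\
c_1 &= i_1 \odot \tilde c_1 \\
o_1 &= \tau(x_1 W_o + b_o) \\
h_1 &= o_1 \odot \sigma(c_1)
\end{align}

In particular, the forget gate $f_1$ and the recurrent kernels $U_i, U_f, U_c, U_o$ have no influence on $h_1$.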

# Graphical Illustration

# Implementation in Keras

In [1]:
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

Using TensorFlow backend.


## Setting up the model
Setting up an LSTM in `keras` is straightforward, as `keras` provides a pre-defined `LSTM` layer.

In [2]:
num_features = 2
num_units = 3
num_time_steps = 1

model = Sequential()
model.add(LSTM(input_shape=(num_time_steps,num_features),
               units=num_units,
               activation='tanh',
               recurrent_activation='sigmoid',
               use_bias=True))
model.compile(optimizer='adam', loss='MAE')

## Inspecting the model

In [3]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 3)                 72        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
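
The parameter count of 72 reported above can be checked by hand from the definition: each of the four gates has one kernel ($k \times m$), one recurrent kernel ($m \times m$) and one bias ($m$). A minimal sketch of that arithmetic follows; the helper name `lstm_param_count` is hypothetical and not part of `keras`.

```python
def lstm_param_count(num_features, num_units, use_bias=True):
    """Number of trainable parameters of one LSTM layer (hypothetical helper)."""
    kernel = num_features * num_units         # entries of one W_* matrix
    recurrent_kernel = num_units * num_units  # entries of one U_* matrix
    bias = num_units if use_bias else 0       # entries of one b_* vector
    # four gates: input, forget, cell, output
    return 4 * (kernel + recurrent_kernel + bias)


print(lstm_param_count(num_features=2, num_units=3))  # 72
```

With $k = 2$ and $m = 3$ this gives $4 \cdot (6 + 9 + 3) = 72$, matching the summary.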


## Inspecting the weights

In [4]:
model.get_weights()

[array([[ 0.51042557,  0.5942211 ,  0.45578587,  0.43169022, -0.34372836,
         -0.11740583, -0.07195967, -0.02087122, -0.5128101 , -0.04139394,
          0.27040482, -0.42312205],
        [ 0.6315795 ,  0.5031016 ,  0.4387064 , -0.06581646,  0.3208986 ,
         -0.01386303, -0.38648862, -0.5013469 ,  0.06860209, -0.27259246,
          0.05693811, -0.00775212]], dtype=float32),
 array([[ 0.08577248,  0.31390995,  0.13072671,  0.12951043, -0.04111644,
          0.21332414,  0.34374285,  0.44077843,  0.02660712,  0.5432066 ,
         -0.08800218,  0.443929  ],
        [ 0.03425135,  0.15008892, -0.5896042 , -0.00473604, -0.2620307 ,
         -0.15319848, -0.4502381 , -0.05373572, -0.18822809,  0.48890838,
          0.23610525, -0.02657015],
        [ 0.02512891,  0.45125625, -0.01083384,  0.21070944,  0.16426632,
         -0.03871774,  0.32860628, -0.6362487 ,  0.16737   ,  0.02314081,
          0.41948235,  0.07368749]], dtype=float32),
 array([0., 0., 0., 1., 1., 1., 0., 0., 0., 0.

## Feedforward

In [5]:
x = np.arange(num_time_steps * num_features).reshape(num_time_steps, num_features)  # a single input sequence of shape (num_time_steps, num_features)
X = x[np.newaxis, :, :]  # keras expects a batch of sequences, hence the additional leading axis
Y = model.predict(X)  # computes the output for every sequence in the batch
y = Y[0]  # the batch contains only one sequence, hence only one output
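
To make the shapes in the cell above concrete, the following sketch rebuilds only the input arrays with plain NumPy (no model involved) and prints them. The values of `num_time_steps` and `num_features` are repeated here so that the snippet runs on its own.

```python
import numpy as np

num_time_steps, num_features = 1, 2  # same values as in the model setup

# one sequence: shape (num_time_steps, num_features)
x = np.arange(num_time_steps * num_features).reshape(num_time_steps, num_features)
# batch of one sequence: shape (batch, num_time_steps, num_features)
X = x[np.newaxis, :, :]

print(x.shape)  # (1, 2)
print(X.shape)  # (1, 1, 2)
print(X)        # [[[0 1]]]
```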

## Reproduce the computation

In [6]:
# extracting the weights
W, U, b = model.layers[0].get_weights()

W_i = W[:, :num_units]
W_f = W[:, num_units: num_units * 2]
W_c = W[:, num_units * 2: num_units * 3]
W_o = W[:, num_units * 3:]

U_i = U[:, :num_units]
U_f = U[:, num_units: num_units * 2]
U_c = U[:, num_units * 2: num_units * 3]
U_o = U[:, num_units * 3:]

b_i = b[:num_units]
b_f = b[num_units: num_units * 2]
b_c = b[num_units * 2: num_units * 3]
b_o = b[num_units * 3:]


# setting the activations
activation = np.tanh
recurrent_activation = lambda x : 1/(1+np.exp(-x))
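
The slicing above relies on the four gate blocks being stored side by side along the last axis, in the order input, forget, cell, output. Under that assumption, the same split can be written more compactly with `np.split`. The sketch below uses random stand-in arrays with the same shapes as the real weights; the names `kernel`, `recurrent_kernel` and `bias` are illustrative.

```python
import numpy as np

num_features, num_units = 2, 3
rng = np.random.default_rng(0)

# random stand-ins with the shapes of the arrays returned by get_weights()
kernel = rng.normal(size=(num_features, 4 * num_units))
recurrent_kernel = rng.normal(size=(num_units, 4 * num_units))
bias = rng.normal(size=4 * num_units)

# split each array into four equally sized gate blocks: i, f, c, o
kernel_i, kernel_f, kernel_c, kernel_o = np.split(kernel, 4, axis=1)
rec_i, rec_f, rec_c, rec_o = np.split(recurrent_kernel, 4, axis=1)
bias_i, bias_f, bias_c, bias_o = np.split(bias, 4)

print(kernel_i.shape, rec_i.shape, bias_i.shape)  # (2, 3) (3, 3) (3,)
```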

In [15]:
def lstm_feedforward_step(xt, c_tm1, h_tm1):
    """
    Computes one time step in the feedforward of the LSTM.
    
    param xt: The input at time t; of shape (1, num_features)
    param c_tm1: The cell state at time t-1; of shape (1, num_units)
    param h_tm1: The hidden state at time t-1; of shape (1, num_units)
    
    returns: (h, c), i.e. a tuple with the hidden state at time t and the cell state at time t; both of shape (1, num_units)
    """
    
    i = recurrent_activation(xt @ W_i + b_i + h_tm1 @ U_i)
    f = recurrent_activation(xt @ W_f + b_f + h_tm1 @ U_f)
    cc = activation(xt @ W_c + b_c + h_tm1 @ U_c)
    c = f * c_tm1 + i * cc
    o = recurrent_activation(xt @ W_o + b_o + h_tm1 @ U_o)
    h = o * activation(c)
    
    return h, c


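# initial hidden and cell state are zero
h_tm1 = np.zeros((1, num_units))
c_tm1 = np.zeros((1, num_units))

for t in range(num_time_steps):
    # feed the input of time step t (not always step 0)
    h_tm1, c_tm1 = lstm_feedforward_step(x[np.newaxis, t, :], c_tm1, h_tm1)
    print(t, h_tm1, c_tm1)

h = h_tm1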

0 [[-0.10198686 -0.14444107  0.02072801]] [[-0.24046278 -0.28864554  0.04164138]]
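
With `num_time_steps = 1` the loop runs only once, so the recurrent kernels never contribute. The following is a minimal, self-contained sketch of the same recursion over a longer sequence. It uses random stand-in weights rather than a `keras` model, assumes the gate order input, forget, cell, output, and fixes `tanh` and the logistic sigmoid as activations; the function name `lstm_forward` is illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(xs, kernel, recurrent_kernel, bias):
    """Run an LSTM over xs of shape (T, k); return all hidden states, shape (T, m)."""
    m = recurrent_kernel.shape[0]
    h = np.zeros(m)  # initial hidden state
    c = np.zeros(m)  # initial cell state
    hs = []
    for x_t in xs:
        # all four pre-activations at once, shape (4 * m,)
        z = x_t @ kernel + h @ recurrent_kernel + bias
        z_i, z_f, z_c, z_o = np.split(z, 4)
        i, f, o = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)
        c = f * c + i * np.tanh(z_c)  # new cell state
        h = o * np.tanh(c)            # new hidden state
        hs.append(h)
    return np.stack(hs)


rng = np.random.default_rng(0)
T, k, m = 5, 2, 3
xs = rng.normal(size=(T, k))
kernel = rng.normal(size=(k, 4 * m))
recurrent_kernel = rng.normal(size=(m, 4 * m))
bias = rng.normal(size=4 * m)

hs = lstm_forward(xs, kernel, recurrent_kernel, bias)
print(hs.shape)  # (5, 3)
```

Computing all four gates from one matrix product is only a convenience; it is equivalent to the gate-by-gate formulation used in `lstm_feedforward_step`.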


In [16]:
np.testing.assert_array_almost_equal(h,Y, decimal=4)

In [17]:
h, Y

(array([[-0.10198686, -0.14444107,  0.02072801]]),
 array([[-0.10198684, -0.14444107,  0.02072801]], dtype=float32))
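
The two arrays differ only in the last printed digit of the first entry. This is consistent with the `keras` output being `float32` (as the `dtype` above shows) while the NumPy result is printed without a `dtype`, i.e. in double precision. A small sketch of the size of such rounding differences, using the values printed above and a tolerance chosen here for illustration:

```python
import numpy as np

h64 = np.array([[-0.10198686, -0.14444107, 0.02072801]])  # float64 by default
y32 = h64.astype(np.float32)                              # rounded to float32

print(h64.dtype, y32.dtype)  # float64 float32
# float32 carries roughly 7 significant decimal digits,
# so a relative tolerance of 1e-6 is comfortably satisfied
np.testing.assert_allclose(h64, y32, rtol=1e-6)
```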

https://keras.io/layers/recurrent/

https://adventuresinmachinelearning.com/keras-lstm-tutorial/

https://stackoverflow.com/questions/42861460/how-to-interpret-weights-in-a-lstm-layer-in-keras

https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L1863

http://deeplearning.net/tutorial/lstm.html

https://stackoverflow.com/questions/51199753/extract-cell-state-lstm-keras

https://stats.stackexchange.com/questions/221513/why-are-the-weights-of-rnn-lstm-networks-shared-across-time