# LSTM (C) - Test Implementation and Application

by Patrick Faion and Alessa Grund

## Table of Contents
[LSTM Cell Implementation](#LSTM-Cell-Implementation)

## 1. Overview



With this IPython notebook we want to show how a simple LSTM cell, and a network of such cells, can be implemented in Python and used for some example tasks where the memory capacity of LSTM cells is useful.

## 2. LSTM Cell Implementation

In the following we show our implementation of a simple LSTM cell. It is important to note some differences from the very basic version we showed in our presentation. We chose these differences in accordance with some newer papers on optimized versions of LSTMs. A very good overview of the basic architecture we used is given in the paper [Gers (2002) - Learning Precise Timing with LSTM Recurrent Networks](http://www.jmlr.org/papers/volume3/gers02a/gers02a.pdf). This paper includes the additional forget gate and also introduces the peephole connections between the gates and the cell state, which seem to be state of the art in many fields nowadays. Notable changes to the model we presented are:

* Addition of the forget-gate.
* Addition of peephole connections.
* No cell blocks anymore. Currently, LSTMs are used with one set of gates for every cell.

The implementation we chose here is based on classes for the different parts of an LSTM cell, with everything modeled as separate objects. *This is very inefficient!* A much better way would be to collect everything into a few big matrices and just do linear algebra operations on them. But that hides a lot of the underlying processes, so for the sake of clarity we chose this representation. We also split the whole process into three parts:

1. Forward pass the input.
2. Backward pass the error.
3. Update the weights.

So every class will have three methods, one for each of these phases.
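
The three phases can be illustrated with a small toy unit. This is only a sketch and not part of the LSTM itself: the class `LinearUnit`, its input and the target value are hypothetical and were chosen just to show how `forward`, `backward` and `update` interact.

```python
import numpy as np


class LinearUnit:
    """Toy unit (not part of the LSTM) following the forward/backward/update pattern."""

    def __init__(self, inp_dim):
        self.W = np.zeros((1, inp_dim))
        self.deltaW = np.zeros(self.W.shape)

    def forward(self, inp):
        # phase 1: store the input and compute the output
        self.inp = inp
        self.y = self.W @ inp
        return self.y

    def backward(self, error, learning_rate):
        # phase 2: accumulate the weight change, but do not apply it yet
        self.deltaW += learning_rate * (error @ self.inp.T)

    def update(self):
        # phase 3: apply the accumulated change and reset the accumulator
        self.W += self.deltaW
        self.deltaW = np.zeros(self.W.shape)


unit = LinearUnit(2)
x = np.array([[1.0], [2.0]])   # hypothetical input as a column vector
target = np.array([[3.0]])     # hypothetical target value

for _ in range(50):
    y = unit.forward(x)
    unit.backward(target - y, learning_rate=0.1)
    unit.update()

final_output = unit.forward(x)
print(final_output)
```

Keeping `backward` and `update` separate means that several errors can be accumulated in `deltaW` before the weights are actually changed.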

### Activation Functions

As activation functions for this model we use the sigmoid function and its derivative.

In [4]:
import numpy as np

def sigmoid(x):
    """Calculate the logistic sigmoid function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    """Calculate the derivative of the logistic sigmoid function."""
    return sigmoid(x)*(1 - sigmoid(x))
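
As a quick sanity check (our addition for illustration), the derivative can be compared against a central finite difference. The step size `eps` and the test points are arbitrary choices.

```python
import numpy as np


def sigmoid(x):
    """Calculate the logistic sigmoid function."""
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    """Calculate the derivative of the logistic sigmoid function."""
    s = sigmoid(x)
    return s * (1 - s)


# arbitrary test points and step size for the numerical derivative
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# central difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# largest deviation between the analytic and the numerical derivative
max_diff = np.max(np.abs(numeric - sigmoid_deriv(x)))
print(max_diff)
```

The deviation should be tiny compared to the derivative itself, which has its maximum of 0.25 at `x = 0`.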

### Gates

The first thing we model is a generic class for the gates of a cell. Weights are initialized to small random values, in the spirit of the initialization used in the paper; the code below draws them uniformly from [-0.1, 0.1]. We also need to initialize $\Delta W$, which stores the accumulated weight changes until the update function is called.

For the bias there is one additional input that is constantly 1, so the bias weight is the first weight in the weight vector. As already stated in the original LSTM paper, different bias weights for the different gates are beneficial, so we initialize this weight via a parameter.

For the forward pass, we simply apply the gate formulas from the paper, which are identical for all gate types. See, e.g., the formulas for the input gate:
$$y_{in_j}(t) = f_{in_j}(z_{in_j}(t)) \hspace{1cm}\text{with}\hspace{1cm} z_{in_j}(t) = \sum_m w_{in_j m}y_m(t-1)$$
The update function will also be the same for all gate types, since it will just add $\Delta W$ to the weights.

In [5]:
class Gate:
    def __init__(self, inpDim, bias):
        """Create a gate object.
        
            inpDim: dimensionality of the input to the gate
            bias:   bias for this gate
        """
        # dimensionality
        self.inpDim = inpDim
        
        # weight and bias
        self.W = np.random.rand(1, self.inpDim) * 0.2 - 0.1
        self.W[0,0] = bias
        
        # activation functions
        self.f = sigmoid
        self.f_deriv = sigmoid_deriv
        
        # deltaW initialization
        self.deltaW = np.zeros(self.W.shape)
        
    def forward(self, inp):
        """Forward pass an input through the gate.
            
            inp: the input to the gate
        """
        # store input for backward pass
        self.inp = inp
        
        # calculate the sum over weight * input as matrix multiplication
        # this is the @ sign
        self.netInp = self.W @ inp
        
        # apply activation function to receive output
        self.y = self.f(self.netInp)
        return self.y
    
    def update(self):
        """Update weights according to stored deltaW."""
        self.W += self.deltaW
        # also make sure to reset deltaW afterwards
        self.deltaW = np.zeros(self.W.shape)
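
To make the bias convention and the array shapes concrete, here is a minimal sketch of a single gate activation. The feature values and weights are hypothetical; only the layout (column vector input with a leading 1, one row of weights with the bias weight first) follows the class above.

```python
import numpy as np


def sigmoid(x):
    """Calculate the logistic sigmoid function."""
    return 1 / (1 + np.exp(-x))


# hypothetical values: two real input features as a column vector
features = np.array([[0.5], [-1.0]])

# prepend the constant 1 so that the first weight acts as the bias
inp = np.vstack(([[1.0]], features))      # shape (3, 1)

# one row of weights; W[0, 0] is the bias weight (here -2.0)
W = np.array([[-2.0, 0.1, 0.05]])         # shape (1, 3)

# weighted sum and gate activation, as in Gate.forward
net_inp = W @ inp                         # shape (1, 1)
y = sigmoid(net_inp)
print(inp.shape, net_inp.shape, y)
```

With a strongly negative bias weight and small remaining weights, the gate starts out mostly closed. With a positive bias it would start out mostly open.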

For the backward pass we need three separate classes, since the update calculation works quite differently for the output gate than for the input and forget gates. The output gate can mostly calculate its gradient itself, while the input and forget gates rely on the gradient calculated by the memory cell from its stored state derivatives. We can still use the previous generic code by creating three subclasses for the gate types.

The backward pass formulas are:
$$\Delta w_{out_j m}(t) = \alpha \delta_{out_j}(t) y_m(t) \hspace{1cm}\text{with}\hspace{1cm} \delta_{out_j}(t) = f'_{out_j}(z_{out_j}(t)) e_{out}(t)$$
where $e_{out}(t)$ is the incoming error into the output gate. And for input and forget gate:
$$\Delta w_{in_j m}(t) = \alpha * \text{grad}_{in_j}$$
$$\Delta w_{\phi_j m}(t) = \alpha * \text{grad}_{\phi_j}$$
where the gradients have to be calculated and passed from the memory cell.



In [6]:
class OutGate(Gate):
    def backward(self, error, learningRate):
        """Backward pass the error and calculate weight update.
        
            error: the incoming error
            learningRate: the learning rate for the weight update
        """
        self.delta = self.f_deriv(self.netInp) * error
        self.deltaW += learningRate * (self.delta @ self.inp.T)
        
class InpGate(Gate):
    def backward(self, grad, learningRate):
        """Backward pass the error gradient and calculate weight update.
        
            grad: the incoming gradient
            learningRate: the learning rate for the weight update
        """
        self.deltaW += learningRate * grad
        
class ForgetGate(Gate):
    def backward(self, grad, learningRate):
        """Backward pass the error gradient and calculate weight update.
        
            grad: the incoming gradient
            learningRate: the learning rate for the weight update
        """
        self.deltaW += learningRate * grad
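
The output gate rule can be checked numerically. The following sketch repeats the computation of `OutGate.backward` outside of the class and compares it with a finite-difference estimate of $\alpha \, e_{out} \, \partial y_{out} / \partial w$. The weights, input, error and learning rate are hypothetical values.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)


# hypothetical weights, input, incoming error and learning rate
rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(1, 4))
inp = rng.uniform(-1.0, 1.0, size=(4, 1))
error = np.array([[0.3]])
learning_rate = 0.5

# delta rule as in OutGate.backward
net_inp = W @ inp
delta = sigmoid_deriv(net_inp) * error
deltaW = learning_rate * (delta @ inp.T)

# numerical estimate: perturb each weight and measure the change of the gate output
eps = 1e-6
numeric = np.zeros_like(W)
for m in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, m] += eps
    W_minus[0, m] -= eps
    dy = (sigmoid(W_plus @ inp) - sigmoid(W_minus @ inp)) / (2 * eps)
    numeric[0, m] = learning_rate * error[0, 0] * dy[0, 0]

max_diff = np.max(np.abs(deltaW - numeric))
print(max_diff)
```

The input and forget gates have no such local rule. They simply scale the gradient they receive from the memory cell by the learning rate.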

### LSTM Cell

We can now start to model an LSTM cell. Its weights are also initialized to random values in [-0.1, 0.1], but here we don't set the bias weight separately. The cell state is initialized to 0. The input, forget and output gates get biases of 0, -2 and +2 respectively, which are values taken from the paper. Finally, we also need to initialize the derivatives for the truncated RTRL learning algorithm.

**Forward Pass**

For the forward pass, the relevant formulas here are the calculation of the net cell input:
$$ z_{c_j^v}(t) = \sum_m w_{c_j^v m}y_m(t-1) $$
as well as the state update:
$$ s_{c_j^v}(t) = y_{\phi_j}(t) s_{c_j^v}(t-1) + y_{in_j}(t) g(z_{c_j^v}(t))$$

In addition, we need to update our stored derivatives:
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}} y_{\phi_j}(t) + g'(z_{c_j^v}(t)) y_{in_j}(t)y_m(t-1)$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}} y_{\phi_j}(t) + g(z_{c_j^v}(t)) f'_{in_j}(z_{in_j}(t)) y_m(t-1)$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{\phi_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\phi_j m}} y_{\phi_j}(t) + s_{c_j^v}(t-1) f'_{\phi_j}(z_{\phi_j}(t)) y_m(t-1)$$
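
The recursion for the cell weights can be checked with a small experiment. This is a sketch under a simplifying assumption: the gate activations are fixed, hypothetical numbers instead of being computed by real gates, so the recursive derivative should match a finite-difference estimate exactly up to numerical error. The function name `run` and all values are our own choices for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)


def run(W, inputs, y_in, y_phi):
    """Return the final state and the recursively accumulated d state / d W."""
    state = np.zeros((1, 1))
    d_state = np.zeros(W.shape)
    for x, gate_in, gate_forget in zip(inputs, y_in, y_phi):
        z = W @ x
        # derivative recursion: decay by the forget gate, add the new contribution
        d_state = d_state * gate_forget + sigmoid_deriv(z) * gate_in * x.T
        # state update: keep part of the old state, add the gated cell input
        state = gate_forget * state + gate_in * sigmoid(z)
    return state, d_state


rng = np.random.default_rng(1)
W = rng.uniform(-0.1, 0.1, size=(1, 3))
inputs = [rng.uniform(-1.0, 1.0, size=(3, 1)) for _ in range(5)]
y_in = [0.6, 0.7, 0.5, 0.9, 0.4]        # assumed input gate activations
y_phi = [0.9, 0.8, 0.95, 0.7, 0.85]     # assumed forget gate activations

state, d_state = run(W, inputs, y_in, y_phi)

# finite-difference estimate of d state / d W for comparison
eps = 1e-6
numeric = np.zeros_like(W)
for m in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, m] += eps
    W_minus[0, m] -= eps
    s_plus = run(W_plus, inputs, y_in, y_phi)[0]
    s_minus = run(W_minus, inputs, y_in, y_phi)[0]
    numeric[0, m] = (s_plus - s_minus)[0, 0] / (2 * eps)

max_diff = np.max(np.abs(d_state - numeric))
print(max_diff)
```

In the real cell the gates see the state through the peephole connections, so the stored derivatives are only a truncated approximation of the full gradient.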

**Backward Pass**

For the backward pass, we need to calculate the fraction of the error caused by the output gate and pass it on to the gate:
$$ e_{out}(t) = \sum_{v=1}^{S_j}s_{c_j^v}(t) e_{cell}(t) $$
where $e_{cell}$ is the error caused by this cell's output. Since we use one set of gates per cell, $S_j = 1$ and the sum has only a single term. Then we calculate the cell error and weight update:
$$e_{s_{c_j^v}}(t) = y_{out_j}(t) e_{cell}(t)$$
$$\Delta w_{c_j^v m}(t) = \alpha e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}$$
Finally, we calculate the gradients for the input and forget gate and pass them on:
$$\text{grad}_{in_j} = e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}}$$
$$\text{grad}_{\phi_j} = e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{\phi_j m}}$$
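
A small numerical example may help to see how the incoming error is split. All numbers below are hypothetical and only illustrate the formulas for a single cell.

```python
import numpy as np

# hypothetical values after some forward step
state = np.array([[0.8]])                          # cell state s_c(t)
y_out = np.array([[0.5]])                          # output gate activation
e_cell = np.array([[0.2]])                         # error arriving at the cell output
d_state_d_w_cell = np.array([[0.1, -0.3, 0.05]])   # stored state derivatives
learning_rate = 0.1

# error handed to the output gate: scaled by the cell state
e_out = state * e_cell

# internal state error: scaled by the output gate activation
e_state = y_out * e_cell

# weight change for the cell input weights
delta_w_cell = learning_rate * e_state * d_state_d_w_cell

print(e_out, e_state, delta_w_cell)
```

Because the cell output is the product of the output gate activation and the state, each factor receives the error scaled by the other one.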

**Update**

The update function will just update the weights and reset $\Delta W$ and subsequently call the update functions of the gate objects.

**Resetting**

An additional function for resetting the cell state is added in order to reset it after one training sequence.


In [8]:
class LSTMCell:
    def __init__(self, inpDim):
        """Create LSTM cell object.
        
            inpDim: dimensionality of the input
        """
        
        # --------- INIT BASIC STUFF ---------
        # input dimensionality
        self.inpDim = inpDim
        
        # state init
        self.state = np.array([[0.0]])
        
        # weight matrix and deltaW init
        self.W = np.random.rand(1, self.inpDim) * 0.2 - 0.1
        self.deltaW = np.zeros(self.W.shape)
        
        # activation functions
        self.g = sigmoid
        self.g_deriv = sigmoid_deriv
        
        
        # --------- INIT GATES ---------
        # The gate input is one larger than cell input, because of the peephole connections.
        gateDim = self.inpDim + 1
        
        # Biases for the gates were taken from the papers. They proved most successful.
        inpBias = 0.0
        forgetBias = -2.0
        outBias = 2.0
        
        # create the gate objects
        self.inpGate = InpGate(gateDim, inpBias)
        self.forgetGate = ForgetGate(gateDim, forgetBias)
        self.outGate = OutGate(gateDim, outBias)
        
        
        # --------- INIT DERIVATIVES ---------
        self.stateDerivWRTCellWeights = np.zeros(self.W.shape)
        self.stateDerivWRTInpGateWeights = np.zeros(self.inpGate.W.shape)
        self.stateDerivWRTForgetGateWeights = np.zeros(self.forgetGate.W.shape)
        
        
    def forward(self, inp):
        """Forward pass a given input.
        
            inp: the input at the current time
                IMPORTANT: remember that the first value of the input will
                be the constant bias
        """
        
        # store input for backward pass
        self.inp = inp
        
        
        # --------- CELL INPUT ---------
        # calculate net input to the cell by weight-multiplication
        self.netInp = self.W @ inp
        
        
        # --------- PASS TO INPUT AND FORGET GATE ---------
        # append current state to input vector (peephole connection)
        inpWPrevPeep = np.append(inp, self.state, axis = 0)
        
        # pass input with peephole connection to input- and forget-gate
        self.inpGate.forward(inpWPrevPeep)
        self.forgetGate.forward(inpWPrevPeep)
        
        
        # --------- UPDATE DERIVATIVES ---------
        self.stateDerivWRTCellWeights *= self.forgetGate.y
        self.stateDerivWRTCellWeights += self.g_deriv(self.netInp) * self.inpGate.y * inp.T
        
        self.stateDerivWRTInpGateWeights *= self.forgetGate.y
        self.stateDerivWRTInpGateWeights += self.g(self.netInp) * self.inpGate.f_deriv(self.inpGate.netInp) * inpWPrevPeep.T
        
        self.stateDerivWRTForgetGateWeights *= self.forgetGate.y
        self.stateDerivWRTForgetGateWeights += self.state * self.forgetGate.f_deriv(self.forgetGate.netInp) * inpWPrevPeep.T
        
        
        # --------- UPDATE CELL STATE ---------
        self.state = self.forgetGate.y * self.state + self.inpGate.y * self.g(self.netInp)
        
        
        # --------- PASS TO OUTPUT GATE ---------
        # again append the (updated) cell state as peephole connection to the input
        inpWPostPeep = np.append(inp, self.state, axis = 0)
        self.outGate.forward(inpWPostPeep)
        
        # --------- CALCULATE CELL OUTPUT ---------
        self.y = self.outGate.y * self.state
        
        return self.y
    
    
    def backward(self, error, learningRate):
        """Backward pass the given error to calculate the weight update.
        
            error: incoming error
            learningRate: the learning rate for the weight update
        """
        
        # --------- OUTPUT GATE ---------
        # calculate error for output gate and pass it backward
        outGateError = self.state * error
        self.outGate.backward(outGateError, learningRate)
        
        # --------- CELL ERROR ---------
        # calculate internal error
        internalError = self.outGate.y * error
        
        # incoming weight adjustment
        self.deltaW += learningRate * internalError * self.stateDerivWRTCellWeights
        
        # --------- INPUT AND FORGET GATE ---------
        # calculate gradient for input gate and pass it backward
        inpGateGradient = internalError * self.stateDerivWRTInpGateWeights
        self.inpGate.backward(inpGateGradient, learningRate)
        
        # calculate gradient for forget gate and pass it backward
        forgetGateGradient = internalError * self.stateDerivWRTForgetGateWeights
        self.forgetGate.backward(forgetGateGradient, learningRate)
        
        
    def update(self):
        """Update weights with respect to stored deltas."""
        # update and reset cell weights
        self.W += self.deltaW
        self.deltaW = np.zeros(self.W.shape)
        
        # update and reset gate weights
        self.outGate.update()
        self.inpGate.update()
        self.forgetGate.update()
    
    
    def reset(self):
        """Reset the cell state and the derivatives."""
        self.state = np.array([[0.0]])
        self.stateDerivWRTCellWeights = np.zeros(self.W.shape)
        self.stateDerivWRTInpGateWeights = np.zeros(self.inpGate.W.shape)
        self.stateDerivWRTForgetGateWeights = np.zeros(self.forgetGate.W.shape)
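
The peephole connection is the reason why the gates get one more input than the cell. A minimal sketch of the concatenation, with hypothetical input and state values:

```python
import numpy as np

inp_dim = 3

# hypothetical cell input; the first entry is the constant bias input
inp = np.array([[1.0], [0.2], [-0.4]])

# hypothetical current cell state
state = np.array([[0.7]])

# append the state as an additional row (peephole connection)
gate_inp = np.append(inp, state, axis=0)

# the gates therefore need one more weight than the cell itself
gate_dim = inp_dim + 1

print(gate_inp.shape, gate_dim)
```

The input and forget gates see the state from the previous step, while the output gate sees the state after it has been updated.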

## 3. LSTM Network Implementation

Now we want a whole network of LSTM cells. This network will consist of three layers:

1. Input layer
2. Hidden layer of LSTM cells
3. Fully connected output layer

### Output Layer

The input layer does not perform any computation, so we don't need to model it explicitly. The hidden layer can easily be modeled as a list of LSTM cells. But we need an additional class for the output layer. This will be a layer of neurons, all fully connected to the hidden layer. We will not model them as individual objects but treat the whole layer as a single weight matrix.

In [9]:
class OutputLayer:
    def __init__(self, inpDim, outDim):
        """Create an output layer.
            
            inpDim: dimensionality of the input
            outDim: dimensionality of the output
        """
        # init dimensions
        self.inpDim = inpDim
        self.outDim = outDim
        
        # init weights and deltaW
        self.W = np.random.rand(self.outDim, self.inpDim) * 0.2 - 0.1
        self.deltaW = np.zeros(self.W.shape)
        
        # init activation functions
        self.f = sigmoid
        self.f_deriv = sigmoid_deriv
        
    
    def forward(self, inp):
        """Forward pass a given input.
        
            inp: the given input
        """
        # store input for backward pass
        self.inp = inp
        
        # calculate net input from weights
        self.netInp = self.W @ inp
        
        # calculate output
        self.y = self.f(self.netInp)
        return self.y
    
    
    def backward(self, error, learningRate):
        """Backward pass the error and calculate weight update.
            
            error: the incoming error
            learningRate: learning rate for the weight update
        """
        self.error = error
        self.grad = self.f_deriv(self.netInp) * self.error
        self.deltaW += learningRate * (self.grad @ self.inp.T)
    
    
    def update(self):
        """Update weights with respect to stored deltaW."""
        self.W += self.deltaW
        self.deltaW = np.zeros(self.W.shape)
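
To see the delta rule of the output layer at work, the following sketch trains a single weight matrix on one fixed pattern. It is a simplification: the weight change is applied immediately instead of being accumulated in `deltaW`, and the hidden activations, targets and learning rate are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)


rng = np.random.default_rng(2)
W = rng.uniform(-0.1, 0.1, size=(2, 3))      # 2 output neurons, 3 inputs

# hypothetical output of the hidden layer, with the constant bias input first
hidden = np.array([[1.0], [0.4], [-0.6]])
target = np.array([[0.9], [0.1]])
learning_rate = 0.5

errors = []
for _ in range(200):
    # forward pass
    net_inp = W @ hidden
    y = sigmoid(net_inp)

    # error as target minus output, summed squared error for monitoring
    error = target - y
    errors.append(float(np.sum(error ** 2)))

    # backward pass and immediate weight update
    grad = sigmoid_deriv(net_inp) * error
    W += learning_rate * (grad @ hidden.T)

print(errors[0], errors[-1])
```

The summed squared error should shrink over the iterations, since each step moves the weights along the negative gradient of the squared error.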

## Sequence Prediction

## Our Example Tasks