In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np

# https://github.com/georgeyiasemis/Recurrent-Neural-Networks-from-scratch-using-PyTorch/blob/main/rnnmodels.py


# LSTM

LSTM stands for Long Short-Term Memory, a type of recurrent neural network (RNN) architecture. LSTMs were designed to mitigate the vanishing gradient problem, in which gradients propagated back through many time steps shrink toward zero, so the weights that capture long-range dependencies learn slowly or not at all.
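
As a toy illustration of that effect (not part of the original notebook), the sketch below backpropagates through 50 steps of a scalar tanh recurrence and compares the result with a simplified cell-state path that is only scaled by a constant forget gate. The scalar setup, the weight `0.5`, and the forget value `0.99` are assumptions chosen for illustration.

```python
import torch

T = 50  # number of time steps (illustrative choice)

# Plain tanh recurrence h_t = tanh(w * h_{t-1}); scalar toy model.
w = torch.tensor(0.5)
h0 = torch.tensor(0.5, requires_grad=True)
h = h0
for _ in range(T):
    h = torch.tanh(w * h)
h.backward()
rnn_grad = h0.grad.abs().item()  # d h_T / d h_0, a product of T factors <= 0.5

# Simplified LSTM cell-state path C_t = f * C_{t-1}, with a constant
# forget gate f and the input term ignored (an assumption for this sketch).
f = torch.tensor(0.99)
c0 = torch.tensor(0.5, requires_grad=True)
c = c0
for _ in range(T):
    c = f * c
c.backward()
cell_grad = c0.grad.item()  # d C_T / d C_0 = f ** T

print(f"tanh RNN gradient:   {rnn_grad:.3e}")
print(f"cell-state gradient: {cell_grad:.3e}")
```

The first gradient is vanishingly small, while the second stays of order one as long as the forget gate stays close to 1.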

LSTMs have a memory cell that can store information for long periods of time and three gates that control the flow of information: the input gate, the forget gate, and the output gate.

The input gate determines how much of the new input should be stored in the memory cell, the forget gate controls how much of the old memory should be forgotten, and the output gate determines how much of the memory should be used to compute the output.

The architecture of an LSTM allows it to selectively remember or forget information based on the current input and the past memory content. This makes LSTMs particularly well-suited for tasks that require modeling long-term dependencies, such as speech recognition, language translation, and stock price prediction.

Overall, LSTMs are a powerful tool for modeling sequential data, and have become a popular choice in many applications that require time-series analysis or natural language processing.

An LSTM cell is defined by a small set of equations that update the memory cell and control the flow of information through the gates. Here are the main formulas:

![lstm](https://dezyre.gumlet.io/images/blog/lstm-model/Long_Short_Term_Memory_(LSTM)_Models.png?w=400&dpr=2.6)

1. Input gate formula:   
$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)$  
The input gate is responsible for controlling how much of the current input x_t should be stored in the memory cell. It takes as input the concatenation of the current input x_t and the previous hidden state h_{t-1}, and produces a value between 0 and 1 using the sigmoid activation function.

2. Forget gate formula:   
$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)$  
The forget gate controls how much of the previous memory cell content should be retained for the current timestep. It takes as input the concatenation of the current input x_t and the previous hidden state h_{t-1}, and produces a value between 0 and 1 using the sigmoid activation function.

3. Candidate cell state formula:  
$\tilde{C}_t = \tanh(W_C [x_t, h_{t-1}] + b_C)$  
This formula computes a "candidate" memory value that may be added to the memory cell state. It takes as input the concatenation of the current input x_t and the previous hidden state h_{t-1}, and produces a value between -1 and 1 using the hyperbolic tangent activation function.

4. Update memory cell formula:   
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  
This formula updates the memory cell state for the current timestep. It combines the previous memory cell state C_{t-1} with the candidate value $\tilde{C}_t$, weighted element-wise by the forget gate f_t and the input gate i_t, respectively.

5. Output gate formula:   
$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)$  
The output gate controls how much of the current memory cell state should be used to compute the output. It takes as input the concatenation of the current input x_t and the previous hidden state h_{t-1}, and produces a value between 0 and 1 using the sigmoid activation function.

6. Hidden state formula:   
$h_t = o_t \odot \tanh(C_t)$  
Finally, the hidden state for the current timestep is the memory cell state C_t passed through the hyperbolic tangent activation function and then scaled element-wise by the output gate o_t.

These formulas work together to allow an LSTM to selectively retain or forget information over long time intervals, making it well-suited for tasks that require modeling long-term dependencies in sequential data.
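
The six formulas can be traced in a few lines of PyTorch. This is a minimal sketch of a single time step, assuming separate weight matrices per gate applied to the concatenation [x_t, h_{t-1}]; the sizes, the random initialization, and the helper `make_params` are hypothetical choices for illustration, not taken from the notebook.

```python
import torch

torch.manual_seed(0)
batch_size, input_size, hidden_size = 2, 3, 4  # illustrative sizes

x_t = torch.rand(batch_size, input_size)
h_prev = torch.zeros(batch_size, hidden_size)  # h_{t-1}
C_prev = torch.zeros(batch_size, hidden_size)  # C_{t-1}


def make_params():
    """Hypothetical helper: one weight matrix and bias for a single gate."""
    W = torch.randn(input_size + hidden_size, hidden_size) * 0.1
    b = torch.zeros(hidden_size)
    return W, b


(W_i, b_i), (W_f, b_f), (W_C, b_C), (W_o, b_o) = (make_params() for _ in range(4))

z = torch.cat([x_t, h_prev], dim=1)      # concatenation [x_t, h_{t-1}]

i_t = torch.sigmoid(z @ W_i + b_i)       # 1. input gate, values in (0, 1)
f_t = torch.sigmoid(z @ W_f + b_f)       # 2. forget gate, values in (0, 1)
C_tilde = torch.tanh(z @ W_C + b_C)      # 3. candidate value, in (-1, 1)
C_t = f_t * C_prev + i_t * C_tilde       # 4. new memory cell state
o_t = torch.sigmoid(z @ W_o + b_o)       # 5. output gate, values in (0, 1)
h_t = o_t * torch.tanh(C_t)              # 6. new hidden state

print(h_t.shape, C_t.shape)
```

Because the previous cell state is zero here, the forget gate has nothing to retain and C_t reduces to i_t * C_tilde.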

In [3]:
"""
The LSTM cell has an input size (input_size) and a hidden size (hidden_size). bias is an optional argument that is set to True by default and adds a bias term to the linear transformations.

The nn.Linear() function is used to define two linear transformations: self.xh, which is applied to the input, and self.hh, which is applied to the previous hidden state. Each has hidden_size * 4 output features because the cell computes four pre-activations at once: the input gate, forget gate, cell (candidate) gate, and output gate. Strictly, only three of these are gates; the cell gate is the candidate value written to the memory cell.

The reset_parameters() function sets the parameters of the linear transformations to a uniform distribution between -std and std. The std is calculated as 1.0 / np.sqrt(self.hidden_size), which is a common initialization method for neural network weights.
...
Forward function
...
The function takes as input the input tensor of shape (batch_size, input_size) and an optional state hx, a tuple (h, c) of two tensors, each of shape (batch_size, hidden_size).

If hx is not provided, both the hidden state and the cell state are initialized to zeros of shape (batch_size, hidden_size).
The function then unpacks hx into two parts: hx and cx, which represent the hidden state and the cell state of the LSTM, respectively.

The function computes the gates (i_t, f_t, g_t, o_t) using the input tensor input and the hidden state tensor hx. These gates represent the input gate, forget gate, cell gate, and output gate of the LSTM, respectively.

The function applies the sigmoid activation function to the input gate, forget gate, and output gate, and the hyperbolic tangent activation function to the cell gate.

The function computes the new cell state cy by combining the old cell state cx with the input gate i_t and the cell gate g_t, and forgetting some of the old cell state using the forget gate f_t.

The function computes the new hidden state hy by applying the hyperbolic tangent activation function to the cell state cy and multiplying the result element-wise by the output gate o_t.
Finally, the function returns a tuple containing the new hidden state hy and the new cell state cy.

"""

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias 

        self.xh = nn.Linear(input_size, hidden_size*4, bias=bias)
        self.hh = nn.Linear(hidden_size, hidden_size*4, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0/np.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, input, hx=None):
        # Inputs:
        #       input: of shape (batch_size, input_size)
        #       hx: optional tuple (h, c), each of shape (batch_size, hidden_size)
        # Outputs:
        #       hy: of shape (batch_size, hidden_size)
        #       cy: of shape (batch_size, hidden_size)

        if hx is None:
            # Start from zero hidden and cell states on the input's device and dtype.
            zeros = input.new_zeros(input.size(0), self.hidden_size)
            hx = (zeros, zeros)

        hx, cx = hx

        gates = self.xh(input) + self.hh(hx)

        # Get gates (i_t, f_t, g_t, o_t)
        input_gate, forget_gate, cell_gate, output_gate = gates.chunk(4, 1)

        i_t = torch.sigmoid(input_gate)
        f_t = torch.sigmoid(forget_gate)
        g_t = torch.tanh(cell_gate)
        o_t = torch.sigmoid(output_gate)

        cy = cx * f_t + i_t * g_t

        hy = o_t * torch.tanh(cy)

        return (hy, cy)
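

The fused projection used above (one linear map producing `4 * hidden_size` features, then `chunk(4, 1)`) follows the same pattern as PyTorch's built-in `nn.LSTMCell`, whose stacked weights are ordered input, forget, cell, output. As a hedged cross-check, not part of the original notebook, the sketch below recomputes one step by hand from the built-in cell's own parameters and compares the result; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
ref_cell = nn.LSTMCell(input_size=3, hidden_size=4)

x = torch.rand(2, 3)       # (batch_size, input_size)
h_prev = torch.rand(2, 4)  # (batch_size, hidden_size)
c_prev = torch.rand(2, 4)

# Fused projection: all four pre-activations in one tensor of width 4 * hidden_size.
gates = (F.linear(x, ref_cell.weight_ih, ref_cell.bias_ih)
         + F.linear(h_prev, ref_cell.weight_hh, ref_cell.bias_hh))

# Split into the four blocks, in the order i, f, g, o.
i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)
i_t = torch.sigmoid(i_t)
f_t = torch.sigmoid(f_t)
g_t = torch.tanh(g_t)
o_t = torch.sigmoid(o_t)

c_manual = f_t * c_prev + i_t * g_t
h_manual = o_t * torch.tanh(c_manual)

# Reference step from the built-in cell.
h_ref, c_ref = ref_cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```

Fusing the four projections into one matrix multiply is a performance choice; the arithmetic is the same as applying four separate linear layers.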


In [4]:
"""
- In the constructor, the module takes as input the input size, hidden size, number of layers, bias flag, and output size of the LSTM.

- The constructor initializes a list of LSTM cells (one for each layer) using the nn.ModuleList() class.
- The first LSTM cell in the list takes the input size and hidden size as input and uses the bias flag to decide whether to include bias terms in the cell.

- The remaining LSTM cells in the list take the hidden size as input and have the same bias flag as the first cell.

- The forward() method of the module takes as input the input tensor of shape (batch_size, sequence_length, input_size) and an optional initial state tensor hx of shape (num_layers, batch_size, hidden_size). The same tensor is used to initialize both the hidden state and the cell state of each layer.

- If hx is not provided, the method initializes it to a tensor of zeros with the appropriate shape.

- The method then iterates over the sequence length of the input tensor, and for each time step, it iterates over the layers of the LSTM.

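- For each layer, the method computes the new hidden state and cell state from two things: the layer's input (the input tensor at the current time step for the first layer, or the hidden state just computed by the layer below otherwise) and the layer's own hidden state and cell state from the previous time step (or the initial state at the first time step).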

- The method appends the top layer's hidden state at every time step to a list called outs, then passes the last entry (the final time step) through a linear layer with output size output_size to compute the final output of the module.

- Finally, the method returns the final output tensor of shape (batch_size, output_size).

Overall, this code defines a stacked LSTM module that takes a sequence of inputs and produces a single output vector per sequence using multiple layers of LSTM cells.
"""

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, bias, output_size):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size

        self.rnn_cell_list = nn.ModuleList()

        self.rnn_cell_list.append(
            LSTMCell(self.input_size, 
                     self.hidden_size, self.bias))

        for l in range(1, self.num_layers):
            self.rnn_cell_list.append(
                LSTMCell(self.hidden_size, self.hidden_size, self.bias)
            )
        
        self.fc = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hx=None):
        # Input of shape (batch_size, sequence_length, input_size)
        #
        # Output of shape (batch_size, output_size)

        if hx is None:
            # Allocate the initial state on the same device and dtype as the input.
            # Choosing the device from torch.cuda.is_available() fails when CUDA
            # is present but the model and input live on the CPU.
            h0 = input.new_zeros(self.num_layers, input.size(0), self.hidden_size)
        else:
            h0 = hx

        outs = []
        hidden = list()

        for layer in range(self.num_layers):
            hidden.append((h0[layer, :, :], h0[layer, :, :]))
        
        for t in range(input.size(1)):
            for layer in range(self.num_layers):
                if layer == 0:
                    hidden_l = self.rnn_cell_list[layer](
                        input[:, t, :],
                        (hidden[layer][0], hidden[layer][1])
                    )
                else:
                    hidden_l = self.rnn_cell_list[layer](
                        hidden[layer-1][0],
                        (hidden[layer][0], hidden[layer][1])
                    )
                
                hidden[layer] = hidden_l
            outs.append(hidden_l[0])
            
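        # Use the top layer's hidden state at the final time step. No squeeze() here:
        # it would drop the batch dimension when batch_size == 1.
        out = outs[-1]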
        out = self.fc(out)

        return out
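

The time-then-layer loop in `forward()` can be cross-checked against PyTorch's built-in `nn.LSTM`. The sketch below is an illustration under stated assumptions rather than the notebook's own code: it builds one built-in `nn.LSTMCell` per layer, copies the weights from a reference `nn.LSTM` (attribute names such as `weight_ih_l0` belong to that class), runs the same nested loop, and compares the top-layer outputs. Sizes are arbitrary, and the final linear layer is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, input_size, hidden_size, num_layers = 2, 5, 3, 4, 2

ref = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

# One cell per layer, with parameters copied from the reference module.
cells = nn.ModuleList()
for layer in range(num_layers):
    in_features = input_size if layer == 0 else hidden_size
    cell = nn.LSTMCell(in_features, hidden_size)
    with torch.no_grad():
        cell.weight_ih.copy_(getattr(ref, f"weight_ih_l{layer}"))
        cell.weight_hh.copy_(getattr(ref, f"weight_hh_l{layer}"))
        cell.bias_ih.copy_(getattr(ref, f"bias_ih_l{layer}"))
        cell.bias_hh.copy_(getattr(ref, f"bias_hh_l{layer}"))
    cells.append(cell)

x = torch.rand(batch_size, seq_len, input_size)

# Per-layer (hidden, cell) state, initialized to zeros.
state = [(x.new_zeros(batch_size, hidden_size),
          x.new_zeros(batch_size, hidden_size)) for _ in range(num_layers)]

top_outputs = []
for t in range(seq_len):
    layer_input = x[:, t, :]
    for layer, cell in enumerate(cells):
        # Each layer reads its own previous state and the output of the layer below.
        h, c = cell(layer_input, state[layer])
        state[layer] = (h, c)
        layer_input = h
    top_outputs.append(layer_input)  # top layer's hidden state at step t

manual = torch.stack(top_outputs, dim=1)  # (batch_size, seq_len, hidden_size)

with torch.no_grad():
    ref_out, _ = ref(x)

print(torch.allclose(manual, ref_out, atol=1e-5))
```

The last slice `manual[:, -1, :]` corresponds to the tensor that the module above feeds into its final linear layer.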

In [5]:
input_size = 100
hidden_size = 5
num_layers = 5
bias = True
output_size = 2

In [6]:
LSTM(input_size, hidden_size, num_layers, bias, output_size)

LSTM(
  (rnn_cell_list): ModuleList(
    (0): LSTMCell(
      (xh): Linear(in_features=100, out_features=20, bias=True)
      (hh): Linear(in_features=5, out_features=20, bias=True)
    )
    (1): LSTMCell(
      (xh): Linear(in_features=5, out_features=20, bias=True)
      (hh): Linear(in_features=5, out_features=20, bias=True)
    )
    (2): LSTMCell(
      (xh): Linear(in_features=5, out_features=20, bias=True)
      (hh): Linear(in_features=5, out_features=20, bias=True)
    )
    (3): LSTMCell(
      (xh): Linear(in_features=5, out_features=20, bias=True)
      (hh): Linear(in_features=5, out_features=20, bias=True)
    )
    (4): LSTMCell(
      (xh): Linear(in_features=5, out_features=20, bias=True)
      (hh): Linear(in_features=5, out_features=20, bias=True)
    )
  )
  (fc): Linear(in_features=5, out_features=2, bias=True)
)
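
The layer shapes printed above determine the number of trainable parameters. As a worked example, the sketch below adds them up in plain Python; `linear_params` and `lstm_param_count` are hypothetical helper names introduced here, and the count assumes the structure shown in the printout (two linear maps per cell plus one final linear layer with bias).

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of one linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def lstm_param_count(input_size, hidden_size, num_layers, output_size, bias=True):
    """Parameter count for the stacked LSTM structure printed above."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        total += linear_params(layer_input, 4 * hidden_size, bias)  # xh
        total += linear_params(hidden_size, 4 * hidden_size, bias)  # hh
    total += linear_params(hidden_size, output_size)  # fc, bias enabled by default
    return total


count = lstm_param_count(input_size=100, hidden_size=5, num_layers=5, output_size=2)
print(count)  # 3112
```

The first cell contributes 2020 + 120 = 2140 parameters, each of the four remaining cells contributes 120 + 120 = 240, and the final linear layer adds 12.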

In [7]:
# Define input tensor of shape (batch_size, sequence_length, input_size)
input_tensor = torch.rand((2, 5, 3))

# Define LSTM module with input size=3, hidden size=4, 2 layers, bias=True, output size=2
lstm = LSTM(input_size=3, hidden_size=4,
            num_layers=2, bias=True, output_size=2)

# Forward pass through the LSTM module
output_tensor = lstm(input_tensor)

# Print the shape of the output tensor
print(output_tensor.shape)
print(output_tensor)


torch.Size([2, 2])
tensor([[ 0.5334, -0.0565],
        [ 0.5318, -0.0566]], grad_fn=<AddmmBackward0>)
