# LSTM (Long Short-Term Memory)
- LSTMs are designed to address the vanishing/exploding gradient problem in RNNs.
- LSTMs use two separate paths to make their predictions, one for long-term memory and one for short-term memory, instead of the single feedback path that a basic RNN reuses at every time step.
- An LSTM is much more complicated than a basic RNN because of everything that happens inside a single unit.
- LSTMs tend not to use ReLU; they use sigmoid and tanh activation functions instead.
- The long-term memory (cell state) path has no weights or biases applied to it directly. It is only scaled and added to by the gates.
- The lack of weights allows the long-term memory to flow through a series of unrolled units without causing the gradient to explode or vanish.
- The short-term memory (hidden state) is directly connected to weights that can modify it.
- The first stage of an LSTM unit determines what percentage of the long-term memory is remembered. (Forget gate)
- The second stage has two blocks. One uses tanh to compute a potential long-term memory from the input and the short-term memory. The other uses a sigmoid to decide what percentage of that potential memory is added to the long-term memory. (Input gate)
- The last stage also has two blocks. One computes a percentage with a sigmoid, just like in the previous stage. The other passes the updated long-term memory through tanh. Multiplying the two results gives the new short-term memory. (Output gate)
- You can unroll LSTMs over more time steps than basic RNNs, so they can accommodate longer sequences of data.
- The useful property of the long-term memory is that information from much earlier time steps can still matter much later.
- As the lookback gap grows, LSTMs become more useful than standard RNNs.
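
The three stages above can be sketched as a single step written with plain tensor operations. This is a minimal illustration, not the implementation used in the cells below: the function name `lstm_step`, the weight names `W`, `U`, `b`, and the toy sizes are all assumptions made for the example, and it leaves out the peephole terms that the `Cell` class adds.

```python
import torch

torch.manual_seed(0)

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step (hypothetical helper, no peepholes).

    W: (input_size, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,)
    """
    # Weights only touch the input and the short-term memory.
    gates = x @ W + h_prev @ U + b
    f, i, g, o = gates.chunk(4, dim=1)

    f = torch.sigmoid(f)  # forget gate: % of long-term memory to keep
    i = torch.sigmoid(i)  # input gate: % of the potential memory to add
    g = torch.tanh(g)     # potential long-term memory
    o = torch.sigmoid(o)  # output gate: % of the memory to expose

    # The long-term memory is only scaled and added to, never multiplied by a weight.
    c_new = f * c_prev + i * g
    # New short-term memory.
    h_new = o * torch.tanh(c_new)
    return h_new, c_new

batch, input_size, hidden_size = 2, 3, 4
W = torch.randn(input_size, 4 * hidden_size) * 0.1
U = torch.randn(hidden_size, 4 * hidden_size) * 0.1
b = torch.zeros(4 * hidden_size)

x = torch.randn(batch, input_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

Both returned states have shape `(batch, hidden_size)`, so the step can be applied again at the next time step with the same weights.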

In [None]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch import Tensor
import torch.nn.functional as F
import math

In [None]:
class Cell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(Cell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.input = nn.Linear(input_size, 4*hidden_size, bias=bias)
        self.hidden = nn.Linear(hidden_size, 4*hidden_size, bias=bias)
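        # Peephole weights, one vector each for the input, forget, and output gates.
        # Registered as a Parameter so they are initialized, trained, and moved with the module.
        self.tensor = nn.Parameter(torch.empty(hidden_size*3))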
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std,std)
    
    def forward(self, x, hidden):
        hx, cx = hidden

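        x = x.view(-1, x.size(1))
        # All four gate pre-activations at once: shape (batch, 4*hidden_size).
        # No squeeze here: it would drop the batch dimension when the batch size is 1.
        gates = self.input(x) + self.hidden(hx)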

        c2c = self.tensor.unsqueeze(0)
        ci,cf,co = c2c.chunk(3,1)
        ingate, forgetgate,cellgate, outgate = gates.chunk(4,1)

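        # The gates use peephole terms (ci, cf, co) that look at the cell state.
        ingate = torch.sigmoid(ingate + ci*cx)
        forgetgate = torch.sigmoid(forgetgate + cf*cx)
        # New long-term memory: keep part of the old state, add part of the candidate.
        cy = forgetgate*cx + ingate*torch.tanh(cellgate)
        outgate = torch.sigmoid(outgate + co*cy)

        # New short-term memory (hidden state); torch.tanh replaces the deprecated F.tanh.
        hy = outgate * torch.tanh(cy)
        return (hy, cy)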

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, bias=True):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        # Pass bias by keyword; layer_dim is not an argument of Cell.
        self.lstm = Cell(input_dim, hidden_dim, bias=bias)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
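        # Start from zero states on the same device and dtype as the input
        # instead of hard-coding "mps".
        hidden = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device, dtype=x.dtype)
        cell = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device, dtype=x.dtype)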

        outs = []

        celln = cell[0,:,:]
        hiddenn = hidden[0,:,:]

        for seq in range(x.size(1)):
            # Cell returns (hidden state, cell state), in that order.
            hiddenn, celln = self.lstm(x[:, seq, :], (hiddenn, celln))
            outs.append(hiddenn)
        
        # Use the hidden state from the last time step: shape (batch, hidden_dim).
        out = outs[-1]
        out = self.fc(out)
        return out
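
For comparison, the same unroll-then-classify pattern can be sketched with PyTorch's built-in `nn.LSTMCell`. This is only an illustration under assumed toy sizes; the built-in cell has no peephole connections, so it is not numerically equivalent to the `Cell` class above. It does show the calling convention: the cell takes `(h, c)` and returns `(h, c)` in the same order.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for this example.
batch, seq_len, input_dim, hidden_dim, output_dim = 4, 5, 3, 8, 2

cell = nn.LSTMCell(input_dim, hidden_dim)  # built-in cell, no peepholes
fc = nn.Linear(hidden_dim, output_dim)

x = torch.randn(batch, seq_len, input_dim)
h = torch.zeros(batch, hidden_dim)  # short-term memory
c = torch.zeros(batch, hidden_dim)  # long-term memory

# Unroll over the sequence, reusing the same cell at every time step.
for t in range(seq_len):
    h, c = cell(x[:, t, :], (h, c))

# Predict from the hidden state of the last time step.
out = fc(h)
print(out.shape)
```

The output has one row per sequence in the batch and one column per output class.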
