# Long Short-Term Memory (LSTM) with PyTorch

This notebook demonstrates how to create and optimize a Long Short-Term Memory (LSTM) network and how to use it to make predictions.

Specifically, we will implement the Long Short-Term Memory unit shown below and use it on sequential data to predict the stock prices of two different companies.

<img src='images/lstm_image.001.png' style='width: 720px'>

The training data (below) consist of stock prices for two different companies, **Company A** and **Company B**. The goal is to use the data from the first **4** days to predict what the price will be on the **5th** day. If we look closely at the data, we'll see that the only differences in the prices occur on Day **1** and Day **5**. So the LSTM has to remember what happened on Day **1** in order to predict what will happen on Day **5**.


<img src='images/company_a_data.png' style='width: 360px'> <img src='images/company_b_data.png' style='width: 360px'>
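
To make the "only Day 1 differs" observation concrete, here is a small sketch that compares the two input sequences. The values are the ones used for training later in this notebook; the variable names are just for illustration.

```python
import torch

# Days 1-4 for each company (the same values used for training later on).
company_a = torch.tensor([0., 0.5, 0.25, 1.])
company_b = torch.tensor([1., 0.5, 0.25, 1.])

# 0-based indices of the days on which the two input sequences differ.
differing_days = torch.nonzero(company_a != company_b).flatten().tolist()
print(differing_days)
```

Index `0` corresponds to Day **1**, so the inputs for Days 2 through 4 carry no information that distinguishes the two companies.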


-----

## Import the modules that will do all the work

The very first thing we need to do is load a bunch of Python modules. These modules give us extra functionality to create a Long Short-Term Memory (LSTM) neural network and optimize the neural network's parameters.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

from torch.utils.data import TensorDataset, DataLoader

import matplotlib.pyplot as plt
import seaborn as sns

-----

## Build a Long Short-Term Memory unit by hand using PyTorch

A Long Short-Term Memory (LSTM) unit is a type of neural network, and that means we need to create a new class. 
We'll create the following methods:
- `__init__()` to initialize the Weights and Biases and keep track of a few other housekeeping things.
- `lstm_unit()` to do the LSTM math.
- `forward()` to make a forward pass through the unrolled LSTM.


In [2]:
class LSTM(nn.Module):
    def __init__(self):
        super().__init__()

        torch.manual_seed(seed=18)
        
        mean = torch.tensor(0.0)
        std = torch.tensor(1.0)

        self.wlr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wlr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.blr1 = nn.Parameter(torch.tensor(0.0), requires_grad=True)

        self.wpr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wpr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bpr1 = nn.Parameter(torch.tensor(0.0), requires_grad=True)

        self.wp1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wp2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bp1 = nn.Parameter(torch.tensor(0.0), requires_grad=True)

        self.wo1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wo2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bo1 = nn.Parameter(torch.tensor(0.0), requires_grad=True)

    def lstm_unit(self, input_value, long_memory, short_memory):
        long_remember_percent = torch.sigmoid((short_memory * self.wlr1) + 
                                             (input_value * self.wlr2) + 
                                             self.blr1)

        potential_remember_percent = torch.sigmoid((short_memory * self.wpr1) +
                                                   (input_value * self.wpr2) +
                                                   self.bpr1)
        
        potential_memory = torch.tanh((short_memory * self.wp1) + 
                                      (input_value * self.wp2) +
                                      self.bp1)

        updated_long_memory = ((long_memory * long_remember_percent)+
                               (potential_remember_percent * potential_memory))

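        # NOTE: self.bo1 is defined in __init__() but is not added here, so it
        # never receives a gradient and keeps its initial value of 0.0.
        output_percent = torch.sigmoid((short_memory * self.wo1) +
                                       (input_value * self.wo2))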

        updated_short_memory = torch.tanh(updated_long_memory) * output_percent

        return ([updated_long_memory, updated_short_memory])


    def forward(self, X):
        long_memory = 0
        short_memory = 0

        day1 = X[0]
        day2 = X[1]
        day3 = X[2]
        day4 = X[3]

        long_memory, short_memory = self.lstm_unit(day1, long_memory, short_memory)
        long_memory, short_memory = self.lstm_unit(day2, long_memory, short_memory)
        long_memory, short_memory = self.lstm_unit(day3, long_memory, short_memory)
        long_memory, short_memory = self.lstm_unit(day4, long_memory, short_memory)

        return short_memory
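
For reference, the arithmetic in `lstm_unit()` can be written out as equations. The symbols are shorthand introduced here, not names from the code: $x_t$ is `input_value`, $h_{t-1}$ and $c_{t-1}$ are the incoming `short_memory` and `long_memory`, and $\sigma$ is the sigmoid function. The equations follow the code as written, so the output gate has no bias term.

$$
\begin{aligned}
f_t &= \sigma\left(w_{lr1}\, h_{t-1} + w_{lr2}\, x_t + b_{lr1}\right) && \text{(long\_remember\_percent)} \\
i_t &= \sigma\left(w_{pr1}\, h_{t-1} + w_{pr2}\, x_t + b_{pr1}\right) && \text{(potential\_remember\_percent)} \\
g_t &= \tanh\left(w_{p1}\, h_{t-1} + w_{p2}\, x_t + b_{p1}\right) && \text{(potential\_memory)} \\
c_t &= c_{t-1}\, f_t + i_t\, g_t && \text{(updated\_long\_memory)} \\
o_t &= \sigma\left(w_{o1}\, h_{t-1} + w_{o2}\, x_t\right) && \text{(output\_percent)} \\
h_t &= \tanh\left(c_t\right) o_t && \text{(updated\_short\_memory)}
\end{aligned}
$$

`forward()` applies these six steps four times, once per day, starting from $c_0 = h_0 = 0$, and returns $h_4$ as the prediction.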

Once we have created the class that defines an LSTM, we can use it to create a model and print out the randomly initialized Weights and Biases. Then, we'll see what those random Weights and Biases predict for **Company A** and **Company B**. If they are good predictions, then we're done! However, the chances of getting good predictions from random values are very small.

In [3]:
model = LSTM()
model.eval()
print('Parameters before optimization')
for name, parameter in model.named_parameters():
    print(f'{name} : {parameter.data}')

with torch.no_grad():
    print('Company A: Observed = 0, Predicted =', model(torch.tensor([0., 0.5, 0.25, 1.])))
    print('Company B: Observed = 1, Predicted =', model(torch.tensor([1., 0.5, 0.25, 1.])))

Parameters before optimization
wlr1 : 0.5940740704536438
wlr2 : -0.12711703777313232
blr1 : 0.0
wpr1 : -0.7286937236785889
wpr2 : 0.7211949229240417
bpr1 : 0.0
wp1 : -0.566031277179718
wp2 : 0.5780901908874512
bp1 : 0.0
wo1 : 0.30693256855010986
wo2 : 0.6139482259750366
bo1 : 0.0
Company A: Observed = 0, Predicted = tensor(0.2416)
Company B: Observed = 1, Predicted = tensor(0.2482)


With the unoptimized parameters, the predicted value for **Company A**, **0.2416**, is not too far from the observed value, **0**. On the other hand, the predicted value for **Company B**, **0.2482**, is terrible, because it is far from the observed value, **1**. So, that means we need to train the LSTM.
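
To put numbers on "far", the sketch below computes the squared error for each prediction, which is the same per-sample quantity the training loop later in this notebook uses as its loss. The predicted values are copied from the output above; the variable names are illustrative.

```python
# Predictions copied from the output above; the observed values are 0 and 1.
observed = [0.0, 1.0]
predicted = [0.2416, 0.2482]

# Squared difference between each observed and predicted value.
squared_errors = [(obs - pred) ** 2 for obs, pred in zip(observed, predicted)]
total = sum(squared_errors)
print(squared_errors, total)
```

Almost all of the total error comes from **Company B**, which is why that prediction is the one that most needs to improve.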

-----

## Train the LSTM unit and use TensorBoard to evaluate

In [4]:
inputs = torch.tensor([[0., 0.5, 0.25, 1.], [1., 0.5, 0.25, 1.]])
labels = torch.tensor([0., 1.])

dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset)
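
Because `DataLoader` is created without a `batch_size` argument, it falls back to its default of one sample per batch. The sketch below, which rebuilds the same dataset so it can run on its own, shows the resulting shapes and why the training loop passes `X[0]` rather than `X` to the model.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.tensor([[0., 0.5, 0.25, 1.], [1., 0.5, 0.25, 1.]])
labels = torch.tensor([0., 1.])
dataloader = DataLoader(TensorDataset(inputs, labels))  # batch_size defaults to 1

# Record the shape of each batch and of the single sequence inside it.
shapes = []
for X, y in dataloader:
    shapes.append((tuple(X.shape), tuple(y.shape), tuple(X[0].shape)))
print(shapes)
```

Each batch `X` has a leading batch dimension of size 1, so `X[0]` is the 4-day sequence that `forward()` indexes as `X[0]` through `X[3]`.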

**NOTE:** We are starting with **20** epochs. This may be enough to successfully optimize all of the parameters, but it might not.

In [5]:
import os
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

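model_path = os.path.join(os.path.abspath(os.getcwd()), 'models/lstm_last.pt')
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # torch.save() does not create missing directories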
epochs = 20

optimizer = Adam(model.parameters(), lr=0.1)

model.train()

for epoch in range(epochs):
    total_loss = 0
    outputs = [0, 0]
    for batch, (X, y) in enumerate(dataloader):
        output = model(X[0])
        loss = (y - output) ** 2

        total_loss += loss.item()
        
        outputs[batch] = output

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    writer.add_scalar('Loss/Train', total_loss, epoch + 1)
    writer.add_scalar('Out0/Train', outputs[0], epoch + 1)
    writer.add_scalar('Out1/Train', outputs[1], epoch + 1)

    print(f'Epoch: {epoch + 1} / {epochs} | Loss: {total_loss}')
    torch.save({
            'epoch': epoch + 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': total_loss,
            'outputs': outputs
            }, model_path)
    # print(f'Model saved at {model_path}')

writer.flush()

model.eval()
with torch.no_grad():
    print('Company A: Observed = 0, Predicted =', model(torch.tensor([0., 0.5, 0.25, 1.])))
    print('Company B: Observed = 1, Predicted =', model(torch.tensor([1., 0.5, 0.25, 1.])))

Epoch: 1 / 20 | Loss: 0.7896659038960934
Epoch: 2 / 20 | Loss: 0.6475228294730186
Epoch: 3 / 20 | Loss: 0.5497229248285294
Epoch: 4 / 20 | Loss: 0.4833335056900978
Epoch: 5 / 20 | Loss: 0.4527108073234558
Epoch: 6 / 20 | Loss: 0.4500923603773117
Epoch: 7 / 20 | Loss: 0.4613257348537445
Epoch: 8 / 20 | Loss: 0.4745378643274307
Epoch: 9 / 20 | Loss: 0.4838353246450424
Epoch: 10 / 20 | Loss: 0.487785741686821
Epoch: 11 / 20 | Loss: 0.48706820607185364
Epoch: 12 / 20 | Loss: 0.4830686151981354
Epoch: 13 / 20 | Loss: 0.477227047085762
Epoch: 14 / 20 | Loss: 0.4707200676202774
Epoch: 15 / 20 | Loss: 0.4642821252346039
Epoch: 16 / 20 | Loss: 0.45812784135341644
Epoch: 17 / 20 | Loss: 0.4519989490509033
Epoch: 18 / 20 | Loss: 0.445335254073143
Epoch: 19 / 20 | Loss: 0.43752019107341766
Epoch: 20 / 20 | Loss: 0.4281167984008789
Company A: Observed = 0, Predicted = tensor(0.4432)
Company B: Observed = 1, Predicted = tensor(0.5343)


Unfortunately, these predictions are terrible. So, it seems like we'll have to do more training. However, it would be great if we could be confident that more training will actually improve the predictions. If not, we can spare ourselves a lot of time, and potentially money, and just give up. So, before we dive into more training, let's look at the loss values and predictions that we saved in logfiles with **TensorBoard**. **TensorBoard** will graph everything that we logged during training, making it easier to see if things are headed in the right direction or not.
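
As a minimal sketch of the logging side, the snippet below writes a few scalars with `SummaryWriter` and confirms that an event file was produced. The temporary log directory is an assumption made so the example is self-contained (the notebook itself relies on the default `runs/` directory), and the three values are rounded from the first three epochs above.

```python
import os
import tempfile
from torch.utils.tensorboard import SummaryWriter

log_dir = tempfile.mkdtemp()  # hypothetical location, chosen only for this sketch
writer = SummaryWriter(log_dir=log_dir)

# Log one scalar per epoch under the same tag the training loop uses.
for step, value in enumerate([0.79, 0.65, 0.55], start=1):
    writer.add_scalar('Loss/Train', value, step)

writer.flush()
writer.close()

# TensorBoard reads the event files that the writer leaves in log_dir.
event_files = [f for f in os.listdir(log_dir) if f.startswith('events.out.tfevents')]
print(event_files)
```

The graphs can then be viewed by running `tensorboard --logdir` with the log directory as its argument and opening the address it prints.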

Below are the graphs of **loss** (`Loss/Train`), the predictions for **Company A** (`Out0/Train`), and the predictions for **Company B** (`Out1/Train`). Remember for **Company A**, we want to predict **0** and for **Company B**, we want to predict **1**.

<img src='images/trainloss_20_epochs.png' style='width: 360px'> <img src='images/out_0_20_epochs.png' style='width: 360px'> <img src='images/out_1_20_epochs.png' style='width: 360px'>

If we look at the **loss** (`Loss/Train`), we see that it went down for the first **6** epochs, which is good, crept back up through epoch **10**, and has been slowly going down again since then. When we look at the predictions for **Company A** (`Out0/Train`), we see that they started out pretty good, close to **0**, then got really bad early on in training, shooting all the way up to **0.57**, and are now starting to get smaller. In contrast, when we look at the predictions for **Company B** (`Out1/Train`), we see that they started out really bad, close to **0**, but have been getting better ever since and look like they could continue to get better if we kept training.

In summary, the graphs seem to suggest that if we continued training our model, the predictions would improve.

## Adding More Epochs without Starting Over

Since we are saving a general checkpoint of the model after every epoch, we can pick up training where we left off instead of starting from scratch. This works because `torch.save()` writes a file that holds the Weights and Biases, the optimizer state, and any additional information we choose to store. As a result, all we have to do to pick up where we left off is tell `torch.load()` where the checkpoint file is located and load the saved state into a new model and optimizer. Since we want to add **20** more epochs to the training, we set `epochs` to **40** and start the loop at the epoch stored in the checkpoint.
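
The save-and-restore round trip can be tried in isolation first. The sketch below uses a tiny `nn.Linear` as a stand-in for the LSTM and a temporary file as the checkpoint path; both are assumptions made to keep the example self-contained, not part of this notebook's model.

```python
import os
import tempfile
import torch
import torch.nn as nn
from torch.optim import Adam

torch.manual_seed(0)
model = nn.Linear(1, 1)  # stand-in model, not the LSTM defined above
optimizer = Adam(model.parameters(), lr=0.1)

# Take one optimization step so the optimizer has state worth saving.
loss = (model(torch.tensor([1.0])) - 2.0).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')  # hypothetical path
torch.save({'epoch': 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()}, checkpoint_path)

# Later: rebuild fresh objects, then restore their saved state.
restored_model = nn.Linear(1, 1)
restored_optimizer = Adam(restored_model.parameters(), lr=0.1)
checkpoint = torch.load(checkpoint_path)
restored_model.load_state_dict(checkpoint['model_state_dict'])
restored_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

print(start_epoch, torch.equal(model.weight, restored_model.weight))
```

Restoring the optimizer state matters for Adam in particular, because its running averages of the gradients would otherwise restart from zero.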

In [6]:
writer = SummaryWriter()

path_to_checkpoint = model_path

epochs = 40

model = LSTM()
optimizer = Adam(model.parameters(), lr=0.1)

checkpoint = torch.load(path_to_checkpoint)

model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
outputs = checkpoint['outputs']

writer.add_scalar('Loss/Train', loss, epoch)
writer.add_scalar('Out0/Train', outputs[0], epoch)
writer.add_scalar('Out1/Train', outputs[1], epoch)

model.train()

for epoch in range(epoch, epochs):
    total_loss = 0
    outputs = [0, 0]
    for batch, (X, y) in enumerate(dataloader):
        output = model(X[0])
        loss = (y - output) ** 2

        total_loss += loss.item()
        
        outputs[batch] = output

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    writer.add_scalar('Loss/Train', total_loss, epoch + 1)
    writer.add_scalar('Out0/Train', outputs[0], epoch + 1)
    writer.add_scalar('Out1/Train', outputs[1], epoch + 1)

    print(f'Epoch: {epoch + 1} / {epochs} | Loss: {total_loss}')

    torch.save({
        'epoch': epoch + 1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        'outputs': outputs
    }, model_path)

writer.flush()
        
model.eval()
with torch.no_grad():
    print('Company A: Observed = 0, Predicted =', model(torch.tensor([0., 0.5, 0.25, 1.])))
    print('Company B: Observed = 1, Predicted =', model(torch.tensor([1., 0.5, 0.25, 1.])))

Epoch: 21 / 40 | Loss: 0.41700156033039093
Epoch: 22 / 40 | Loss: 0.40432092547416687
Epoch: 23 / 40 | Loss: 0.39026327431201935
Epoch: 24 / 40 | Loss: 0.37473373115062714
Epoch: 25 / 40 | Loss: 0.35710014402866364
Epoch: 26 / 40 | Loss: 0.33615151047706604
Epoch: 27 / 40 | Loss: 0.3103083521127701
Epoch: 28 / 40 | Loss: 0.27800555527210236
Epoch: 29 / 40 | Loss: 0.23809392750263214
Epoch: 30 / 40 | Loss: 0.19031646102666855
Epoch: 31 / 40 | Loss: 0.13708627596497536
Epoch: 32 / 40 | Loss: 0.08659545425325632
Epoch: 33 / 40 | Loss: 0.04928428865969181
Epoch: 34 / 40 | Loss: 0.028094633697037352
Epoch: 35 / 40 | Loss: 0.017632327799219638
Epoch: 36 / 40 | Loss: 0.011430630402173847
Epoch: 37 / 40 | Loss: 0.007879333381424658
Epoch: 38 / 40 | Loss: 0.006646450608968735
Epoch: 39 / 40 | Loss: 0.004935314325848594
Epoch: 40 / 40 | Loss: 0.004065959437866695
Company A: Observed = 0, Predicted = tensor(-0.0188)
Company B: Observed = 1, Predicted = tensor(0.9403)


<img src='images/trainloss_40_epochs.png' style='width: 360px'> <img src='images/out_0_40_epochs.png' style='width: 360px'> <img src='images/out_1_40_epochs.png' style='width: 360px'>

The graphs are much better than before. The blue lines in each graph represent the values we logged during the extra **20** epochs. The **loss** is getting smaller and the predictions for both companies are improving. However, because it looks like there is even more room for improvement, let's add **60** more epochs to the training.

In [7]:
writer = SummaryWriter()

path_to_checkpoint = model_path

epochs = 100

model = LSTM()
optimizer = Adam(model.parameters(), lr=0.1)

checkpoint = torch.load(path_to_checkpoint)

model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
outputs = checkpoint['outputs']

writer.add_scalar('Loss/Train', loss, epoch)
writer.add_scalar('Out0/Train', outputs[0], epoch)
writer.add_scalar('Out1/Train', outputs[1], epoch)

model.train()

for epoch in range(epoch, epochs):
    total_loss = 0
    outputs = [0, 0]
    for batch, (X, y) in enumerate(dataloader):
        output = model(X[0])
        loss = (y - output) ** 2

        total_loss += loss.item()
        
        outputs[batch] = output

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    writer.add_scalar('Loss/Train', total_loss, epoch + 1)
    writer.add_scalar('Out0/Train', outputs[0], epoch + 1)
    writer.add_scalar('Out1/Train', outputs[1], epoch + 1)

    print(f'Epoch: {epoch + 1} / {epochs} | Loss: {total_loss}')

    torch.save({
        'epoch': epoch + 1,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        'outputs': outputs
    }, model_path)

writer.flush()
        
model.eval()

with torch.no_grad():
    print('Company A: Observed = 0, Predicted =', model(torch.tensor([0., 0.5, 0.25, 1.])))
    print('Company B: Observed = 1, Predicted =', model(torch.tensor([1., 0.5, 0.25, 1.])))

Epoch: 41 / 100 | Loss: 0.0036251981218811125
Epoch: 42 / 100 | Loss: 0.0028253277869225712
Epoch: 43 / 100 | Loss: 0.002830698009347543
Epoch: 44 / 100 | Loss: 0.0022479911965547217
Epoch: 45 / 100 | Loss: 0.002265945033286698
Epoch: 46 / 100 | Loss: 0.0018887358705796942
Epoch: 47 / 100 | Loss: 0.0019249996839789674
Epoch: 48 / 100 | Loss: 0.0016605532772473452
Epoch: 49 / 100 | Loss: 0.001677868094702717
Epoch: 50 / 100 | Loss: 0.001494286176111359
Epoch: 51 / 100 | Loss: 0.001512537884991616
Epoch: 52 / 100 | Loss: 0.001385268818012264
Epoch: 53 / 100 | Loss: 0.0013735894935962278
Epoch: 54 / 100 | Loss: 0.0013034240091656102
Epoch: 55 / 100 | Loss: 0.0012723032541543944
Epoch: 56 / 100 | Loss: 0.0012392868302413262
Epoch: 57 / 100 | Loss: 0.0011907595173852314
Epoch: 58 / 100 | Loss: 0.0011813984219770646
Epoch: 59 / 100 | Loss: 0.0011354731411188368
Epoch: 60 / 100 | Loss: 0.0011213664838578552
Epoch: 61 / 100 | Loss: 0.0010925589126600244
Epoch: 62 / 100 | Loss: 0.00107057728314

<img src='images/trainloss_100_epochs.png' style='width:360px'> <img src='images/out_0_100_epochs.png' style='width:360px'> <img src='images/out_1_100_epochs.png' style='width:360px'>

After **100** epochs, the graphs have converged. The prediction for **Company A** is super close to **0**, which is exactly what we want, and the prediction for **Company B** is close to **1**, which is also what we want.


The red lines show how things changed when we added an additional **60** epochs to the training, for a total of **100** epochs. Now we see that the **loss** (`Loss/Train`) and the predictions for each company appear to be tapering off, suggesting that adding more epochs may not improve the predictions much, so we're done with the training.
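
One rough way to quantify "tapering off" is to look at the fractional drop in loss from one epoch to the next. The sketch below does this for a few loss values copied from the training logs above; the helper name `relative_improvements` is hypothetical and only serves this illustration.

```python
# Loss values copied from the training logs above.
early_losses = [0.19031646102666855, 0.13708627596497536]  # epochs 30-31
late_losses = [0.0011213664838578552, 0.0010925589126600244, 0.00107057728314]  # epochs 60-62


def relative_improvements(losses):
    """Fractional drop in loss from each epoch to the next."""
    return [(prev - curr) / prev for prev, curr in zip(losses, losses[1:])]


print(relative_improvements(early_losses))
print(relative_improvements(late_losses))
```

Around epoch 30 the loss falls by more than a quarter in a single epoch, while by epochs 60 to 62 each epoch improves it by only a few percent.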


Finally, let's print out the final estimates for the Weights and Biases.

In [8]:
print('Parameters after optimization')
for name, parameter in model.named_parameters():
    print(f'{name} : {parameter.data}')

Parameters after optimization
wlr1 : 2.893209457397461
wlr2 : 1.0485767126083374
blr1 : 1.544776439666748
wpr1 : 1.0238780975341797
wpr2 : 1.5359022617340088
bpr1 : 0.6787620186805725
wp1 : 2.1890065670013428
wp2 : 1.3072530031204224
bp1 : -0.39732617139816284
wo1 : 2.5714805126190186
wo2 : 1.557863712310791
bo1 : 0.0
