# Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN). It was introduced by Hochreiter and Schmidhuber in 1997 and has set accuracy records in multiple application domains. LSTMs were developed to deal with the vanishing gradient problem that can arise when training traditional RNNs.

The vanishing gradient problem arises when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the network's weights receives, in every training iteration, an update proportional to the partial derivative of the error function with respect to that weight. In some cases this gradient becomes vanishingly small, which effectively prevents the weight from changing its value. In the worst case, this may completely stop the neural network from training further.

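
As a rough illustration of how a gradient can shrink over many time steps, the sketch below uses a toy scalar recurrence `h_t = tanh(w * h_{t-1})`. This recurrence is an assumption made purely for illustration and is not the model built in this notebook. The code multiplies the per-step derivatives together, which is what the chain rule does when backpropagating through time. The name `gradient_through_time` and the values chosen for `w` and `h0` are hypothetical.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return grad


# With |w| < 1 every factor is below 1, so the product decays with depth.
for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Each factor in the product is at most `w`, so after 50 steps the gradient is bounded by `0.9 ** 50`, which is already below 0.01.
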
The advantage of an LSTM cell over a plain recurrent unit is its cell state, which acts as a memory. At each time step the cell can forget part of what it previously stored and add part of the new information.
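
The forgetting and adding described above can be sketched as a single time step of a scalar LSTM cell. This is a minimal sketch of the commonly used formulation with forget, input, and output gates, written with plain Python floats for readability; it is not the code that `nn.LSTM` runs internally. The function `lstm_cell_step` and the parameter dictionary `p` are hypothetical names introduced for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps a gate name ("f", "i", "g", "o") to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of the old memory to keep
    i = sigmoid(pre("i"))    # input gate: how much new information to write
    g = math.tanh(pre("g"))  # candidate values to write
    o = sigmoid(pre("o"))    # output gate: how much memory to expose

    c = f * c_prev + i * g   # forget part of the old memory, add part of the new
    h = o * math.tanh(c)     # new hidden state
    return h, c


# Illustrative parameters: forget gate saturated open, input gate saturated shut.
keep = {"f": (0, 0, 20), "i": (0, 0, -20), "g": (1, 1, 0), "o": (0, 0, 0)}
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=0.7, p=keep)
print(c)  # stays very close to the previous cell state 0.7
```

Because the cell state is updated additively (`f * c_prev + i * g`), a forget gate close to 1 lets information pass through many steps almost unchanged, which is what helps against vanishing gradients.
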

In [1]:
import torch
import torch.nn as nn 
import torch.optim as optim 
import torch.nn.functional as F  

from torch.utils.data import DataLoader

import torchvision.datasets as datasets
import torchvision.transforms as transforms

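Let's define the hyperparameters. Each 28×28 MNIST image is treated as a sequence of 28 rows (`sequence_length`), with the 28 pixel values of a row as the input features at each step (`input_size`):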

In [2]:
input_size = 28
hidden_size = 256
num_layers = 2
num_classes = 10
sequence_length = 28
learning_rate = 0.001
batch_size = 64
num_epochs = 3

Load data:

In [3]:
train_dataset = datasets.MNIST('', train=True,
                               transform=transforms.ToTensor(), 
                               download=True)


test_dataset = datasets.MNIST('', train=False,
                               transform=transforms.ToTensor(), 
                               download=True)


train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
# Shuffling is not needed for evaluation
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)


In [4]:
class RNN_LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN_LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * sequence_length, num_classes)

    def forward(self, x):
        # Set initial hidden and cell states
        # Created on the same device as the input, so no global `device` is needed here
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)

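        # Forward propagate LSTM
        out, _ = self.lstm(
            x, (h0, c0)
        )  # out: tensor of shape (batch_size, seq_length, hidden_size)

        # Flatten the hidden states of all time steps into one vector per sample
        out = out.reshape(out.shape[0], -1)  # (batch_size, seq_length * hidden_size)

        # Classify from the concatenated hidden states of all time steps
        # (not only the last one), which is why fc takes hidden_size * sequence_length inputs
        out = self.fc(out)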
        return out
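
To get a feel for the size of this model, the following back-of-the-envelope sketch counts its trainable parameters by hand. It assumes the default `nn.LSTM` layout used above (unidirectional, with bias, no projection), in which each layer holds four gates, each with input weights, recurrent weights, and two bias vectors. The helper names `lstm_param_count` and `linear_param_count` are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameters of a stacked, unidirectional LSTM with two bias vectors per layer."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads the input; deeper layers read the previous layer's output
        in_features = input_size if layer == 0 else hidden_size
        total += 4 * (in_features * hidden_size    # input weights
                      + hidden_size * hidden_size  # recurrent weights
                      + 2 * hidden_size)           # two bias vectors
    return total


def linear_param_count(in_features, out_features):
    return in_features * out_features + out_features


# Hyperparameters from this notebook
input_size, hidden_size, num_layers = 28, 256, 2
sequence_length, num_classes = 28, 10

fc_in = hidden_size * sequence_length  # all time steps are flattened
lstm_params = lstm_param_count(input_size, hidden_size, num_layers)
fc_params = linear_param_count(fc_in, num_classes)

print(fc_in)                    # 7168
print(lstm_params)              # 819200
print(fc_params)                # 71690
print(lstm_params + fc_params)  # 890890
```

The same total should be obtainable from an instantiated model with `sum(p.numel() for p in model.parameters())`.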



Set the device:

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Initialize network:

In [6]:
model = RNN_LSTM(input_size, hidden_size, num_layers, num_classes).to(device)

Let's define the loss function and the optimizer:

In [7]:
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

## Training

In [8]:
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
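        # Move the batch to the device and drop the channel dimension:
        # (N, 1, 28, 28) -> (N, 28, 28), i.e. (batch, sequence_length, input_size)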
        data = data.to(device=device).squeeze(1)
        targets = targets.to(device=device)

        # forward
        scores = model(data)
        loss = loss_function(scores, targets)

        # backward
        optimizer.zero_grad()
        loss.backward()

        # Adam update step
        optimizer.step()
    # Loss of the last mini-batch in this epoch
    print(loss)

tensor(0.1488, grad_fn=<NllLossBackward>)
tensor(0.0020, grad_fn=<NllLossBackward>)
tensor(0.0006, grad_fn=<NllLossBackward>)
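
Note that the values printed above are the loss of only the final mini-batch in each epoch, which can be noisy. A minimal sketch of one way to report a per-epoch average instead is shown below. `RunningMean` is a hypothetical helper, not part of PyTorch, and the batch losses in the example are made-up numbers. In the training loop one would call something like `meter.update(loss.item(), data.size(0))` after each batch.

```python
class RunningMean:
    """Hypothetical helper: sample-weighted running mean of a scalar."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is assumed to be the mean over `n` samples,
        # which is what nn.CrossEntropyLoss returns by default
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


meter = RunningMean()
# Made-up (loss, batch size) pairs standing in for real training batches
for batch_loss, n_samples in [(0.9, 64), (0.5, 64), (0.1, 32)]:
    meter.update(batch_loss, n_samples)

print(f"{meter.mean:.3f}")  # 0.580
```

Weighting by batch size matters because the last batch of an epoch is usually smaller than the others.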


Let's check accuracy on the training and test sets to see how well the model performs:

In [9]:
def check_accuracy(loader, model):
    if loader.dataset.train:
        print("Accuracy on training data:")
    else:
        print("Accuracy on test data:")

    num_correct = 0
    num_samples = 0

    # Set model to eval
    model.eval()

    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device).squeeze(1)
            y = y.to(device=device)

            scores = model(x)
            _, predictions = scores.max(1)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

        print(
            f"Got {num_correct} / {num_samples} with accuracy {float(num_correct)/float(num_samples)*100:.2f}"
        )
    # Set model back to train
    model.train()


check_accuracy(train_loader, model)
check_accuracy(test_loader, model)


Accuracy on training data:
Got 59455 / 60000 with accuracy 99.09
Accuracy on test data:
Got 9865 / 10000 with accuracy 98.65
