# Lec_21_LSTM and GRU with MNIST


<div align='right'> Hoe Sung Ryu ( 류 회 성 ) </div>
<div align='right'> Minsuk Sung ( 성 민 석) </div>
    
    
    
> Author: Hoe Sung Ryu, Minsuk Sung  <p>
> Tel: 010-6636-7275 / skainf23@gamil.com // 010-5134-3621 / mssung94@gmail.com  <p>
> This material is a deep learning tutoring resource built with PyTorch. Reproducing it without the author's consent is prohibited.
    

---

Syllabus
    
|Session|Date|Topic|
|--:|:---:|:---|
|1 |July 27| Environment setting and Python basic|
|2 |July 28| PyTorch basic and Custom Data load |
|3 |July 29| Traditional Machine Learning(1) |
|4 |July 30| Traditional Machine Learning(2) |
|5 |July 31| CNN(Convolutional Neural Network)(1)  |
|6 |Aug 03| CNN(Convolutional Neural Network)(2) |
|7 |Aug 04|  RNN(Recurrent Neural Networks)(1) |
|8 |Aug 05|  RNN(Recurrent Neural Networks)(2) |
|9 |Aug 06|  Transfer learning(VGG pretrained on ImageNet for CIFAR-10)| 
|10|Aug 07|**Mini_Kaggle**: Facial Expression Recognition on `AffectNet` | 
|11|Aug 08|`Awards` and `Closing`| 


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What-is-Long-Short-Term-Memory-(LSTM)?" data-toc-modified-id="What-is-Long-Short-Term-Memory-(LSTM)?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What is Long Short Term Memory (LSTM)?</a></span></li></ul></div>

## What is Long Short Term Memory (LSTM)?
Long Short-Term Memory (LSTM) networks are a variant of recurrent neural networks (RNNs) designed to retain information from earlier in a sequence. Their gated cell state mitigates the vanishing gradient problem that affects plain RNNs. LSTMs are well suited to classifying, processing, and predicting time series in which the lag between relevant events is unknown. Like other RNNs, they are trained with back-propagation. An LSTM cell has three gates:


<img src=https://miro.medium.com/max/1400/1*MwU5yk8f9d6IcLybvGgNxA.jpeg width=60%>



- 1st, Input gate: decides which values from the input should be used to update the memory. A sigmoid function decides which values to let through, with outputs between 0 and 1, and a tanh function weights the candidate values by importance, in the range -1 to 1.


- 2nd, Forget gate: decides which details to discard from the cell. A sigmoid function looks at the previous hidden state ($h_{t-1}$) and the current input ($x_t$) and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state $C_{t-1}$.


- 3rd, Output gate: uses the input and the memory of the cell to decide the output. A sigmoid function decides which values to let through, with outputs between 0 and 1. The cell state is passed through tanh, which scales it to the range -1 to 1, and the result is multiplied by the sigmoid output.
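
The three gates can be written out directly. The following is a minimal sketch, assuming the parameter layout of PyTorch's `nn.LSTMCell` (four gate blocks stacked in the order input, forget, candidate, output). The helper name `lstm_step` and the small sizes are illustrative choices, not part of the lecture code. It computes one time step by hand and compares the result with `nn.LSTMCell`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def lstm_step(x, h_prev, c_prev, cell):
    """One LSTM time step computed by hand from an nn.LSTMCell's weights."""
    # All four gate pre-activations in one product: shape (N, 4 * hidden)
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)   # input gate: how much new information to write
    f = torch.sigmoid(f)   # forget gate: how much of c_prev to keep
    g = torch.tanh(g)      # candidate values in [-1, 1]
    o = torch.sigmoid(o)   # output gate: how much of the cell to expose
    c = f * c_prev + i * g     # updated cell state
    h = o * torch.tanh(c)      # updated hidden state
    return h, c

cell = nn.LSTMCell(input_size=28, hidden_size=8)
x = torch.randn(4, 28)     # batch of 4 inputs, e.g. one image row each
h0 = torch.zeros(4, 8)
c0 = torch.zeros(4, 8)

with torch.no_grad():
    h_manual, c_manual = lstm_step(x, h0, c0, cell)
    h_ref, c_ref = cell(x, (h0, c0))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

Because the cell state is updated additively (`f * c_prev + i * g`), gradients can flow through it across many time steps, which is what eases the vanishing gradient problem.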


In [18]:
import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
sequence_length = 28
input_size = 28
hidden_size = 128

num_layers = 2
num_classes = 10
batch_size = 64

num_epochs = 2
learning_rate = 0.01
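
The values `sequence_length = 28` and `input_size = 28` come from reading each 28x28 MNIST image row by row: 28 time steps with 28 pixel features each. A minimal sketch of that reshape, using a random tensor as a stand-in for a real batch (an assumption, to avoid downloading the dataset):

```python
import torch

sequence_length = 28
input_size = 28

# Stand-in for one DataLoader batch: (N, channels, height, width)
images = torch.rand(64, 1, 28, 28)

# Drop the channel axis so each image becomes (time steps, features)
sequences = images.reshape(-1, sequence_length, input_size)

print(sequences.shape)  # torch.Size([64, 28, 28])

# Time step t of a sequence is exactly row t of the original image
print(torch.equal(sequences[0, 5], images[0, 0, 5]))  # True
```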

In [19]:
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False, 
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

In [24]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)  # input: (batch, time steps, features)
        # every time step's hidden state is fed to the classifier (uses the global sequence_length)
        self.fc = nn.Linear(hidden_size*sequence_length, num_classes)
        
        
    def forward(self,x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        
        # forward prop. 
        out, _ = self.rnn(x, h0)
        out = out.reshape(out.shape[0],-1)
        out = self.fc(out)
        return out
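
The `RNN` class above feeds the hidden state of every time step into the classifier, which is why `fc` takes `hidden_size*sequence_length` inputs. A small shape check, as a sketch using the lecture's hyper-parameters and a random batch standing in for MNIST:

```python
import torch
import torch.nn as nn

sequence_length, input_size, hidden_size, num_layers = 28, 28, 128, 2

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
x = torch.rand(64, sequence_length, input_size)       # (N, time, features)
h0 = torch.zeros(num_layers, x.size(0), hidden_size)  # (layers, N, hidden)

out, h_n = rnn(x, h0)
print(out.shape)   # torch.Size([64, 28, 128]): top-layer state at every step
print(h_n.shape)   # torch.Size([2, 64, 128]): final state of each layer

# Flatten all time steps into one feature vector per image
flat = out.reshape(out.shape[0], -1)
print(flat.shape)  # torch.Size([64, 3584]), i.e. 28 * 128 features
```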
        

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)  # input: (batch, time steps, features)
        self.fc = nn.Linear(hidden_size, num_classes)  # classify from the last time step only
        
        
    def forward(self,x):
        # initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        # initial cell state, kept separately from the hidden state
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        # forward prop.
        out, _ = self.lstm(x, (h0, c0))
        out = self.fc(out[:, -1, :])  # last time step only, so no flattening is needed
        return out
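
The `LSTM` class above classifies from `out[:, -1, :]`. For a unidirectional `nn.LSTM` with `batch_first=True`, that slice is the final hidden state of the top layer, which `nn.LSTM` also returns as `h_n[-1]`. A short check, as a sketch with a random batch in place of MNIST:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=28, hidden_size=128, num_layers=2, batch_first=True)
x = torch.rand(64, 28, 28)   # (N, time steps, features)

# Omitting (h0, c0) makes nn.LSTM start from zero states
out, (h_n, c_n) = lstm(x)

last_step = out[:, -1, :]    # top layer, final time step
print(last_step.shape)       # torch.Size([64, 128])
print(torch.allclose(last_step, h_n[-1]))
```

Using only the last step keeps the classifier input at `hidden_size` features regardless of sequence length.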
        

In [None]:
class GRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(GRU, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)  # input: (batch, time steps, features)
        self.fc = nn.Linear(hidden_size, num_classes)  # classify from the last time step only
        
        
    def forward(self,x):
        # initial hidden state: (num_layers, batch, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        # a GRU has no separate cell state, so only h0 is passed

        # forward prop.
        out, _ = self.gru(x, h0)
        out = self.fc(out[:, -1, :])  # last time step only, so no flattening is needed
        return out
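
One practical difference between the three recurrent layers is size: a GRU uses three gate blocks where an LSTM uses four and a plain RNN one. The sketch below counts parameters with a hypothetical helper, `recurrent_param_count`, assuming PyTorch's per-layer layout (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors) and the lecture's hyper-parameters.

```python
def recurrent_param_count(gates, input_size, hidden_size, num_layers):
    """Parameter count of a stacked, unidirectional recurrent layer.

    Assumes per layer: input-to-hidden weights, hidden-to-hidden weights,
    and two bias vectors, each made of `gates` blocks of size hidden_size.
    """
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input
        layer_input = input_size if layer == 0 else hidden_size
        total += gates * hidden_size * (layer_input + hidden_size)  # weights
        total += 2 * gates * hidden_size                            # biases
    return total

# Gate blocks: plain RNN has 1, GRU has 3, LSTM has 4
counts = {name: recurrent_param_count(gates, 28, 128, 2)
          for name, gates in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]}

print(counts)  # {'RNN': 53248, 'GRU': 159744, 'LSTM': 212992}
```

With these settings the GRU has three quarters of the LSTM's recurrent parameters.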
        

In [25]:
model = LSTM(input_size, hidden_size, num_layers, num_classes).to(device)


# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [30]:
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # (N, 1, 28, 28) -> (N, 28, 28): each image row becomes one time step
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')

# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # index of the highest score per image
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total)) 

Epoch [1/2], Step [100/938], Loss: 9.3489
Epoch [1/2], Step [200/938], Loss: 12.2893
Epoch [1/2], Step [300/938], Loss: 8.0958
Epoch [1/2], Step [400/938], Loss: 19.3385
Epoch [1/2], Step [500/938], Loss: 13.4461
Epoch [1/2], Step [600/938], Loss: 10.1456


KeyboardInterrupt: 
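
A useful reference point for the log above: with 10 classes, a model that gives every digit the same score has a cross-entropy loss of ln(10), about 2.30. The logged values (roughly 8 to 19) are well above that, which suggests the interrupted run was not converging. A learning rate of 0.01 may be too high for this model, though that is a guess rather than something the log proves. The sketch below computes the baseline in plain Python; `cross_entropy` is an illustrative helper, not the PyTorch function.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw scores (softmax + negative log)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

num_classes = 10
uniform_logits = [0.0] * num_classes   # equal score for every digit

baseline = cross_entropy(uniform_logits, label=3)
print(round(baseline, 4))  # 2.3026, i.e. ln(10)
```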

In [None]:
# Save the model checkpoint
# torch.save(model.state_dict(), 'model.ckpt')
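
To reload a checkpoint saved this way, build the same architecture and call `load_state_dict`. A minimal sketch, using a small `nn.LSTM` as a stand-in for the trained model and a temporary directory for the file (both assumptions for illustration):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in model; the lecture's model object would be used the same way
model = nn.LSTM(input_size=28, hidden_size=16, num_layers=1, batch_first=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.ckpt')
    torch.save(model.state_dict(), path)         # save only the parameters

    # Rebuild the same architecture, then load the saved weights into it
    restored = nn.LSTM(input_size=28, hidden_size=16, num_layers=1, batch_first=True)
    restored.load_state_dict(torch.load(path))

# Every parameter tensor should match after the round trip
same = all(torch.equal(a, b)
           for a, b in zip(model.state_dict().values(),
                           restored.state_dict().values()))
print(same)
```

Saving the `state_dict` rather than the whole model object keeps the checkpoint independent of the class definition's file location.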