# Introduction to RNNs

The goals of this lab are to:
    * Implement a simple RNN
    * Understand vanishing gradients in RNNs
    * Revisit code conventions: PyTorch data loader, functional model construction, a more standard training loop
    
References:
    * http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
    * Exercise base code is from https://github.com/MorvanZhou/PyTorch-Tutorial/blob/master/tutorial-contents-notebooks/402_RNN.ipynb

# Whiteboard exercises

(Plus any exercises left over from the previous labs)

<img width=500 src="http://www.wildml.com/wp-content/uploads/2015/10/rnn-bptt-with-gradients.png">

* (0.5) Derive an expression for $\frac{\partial E}{\partial W}$. See http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/ for reference.

* (0.5) Consider the expression for $\frac{\partial E}{\partial W}$ and its dependence on $k$. Based on this, discuss the vanishing gradient problem in RNNs.

* (0.5) Find in the literature at least two ways to combat vanishing gradients in RNNs *without* changing the architecture. Describe them and argue why they should work.

* (0.5) Describe how Backpropagation Through Time (BPTT) would work for the network in the figure. Argue why BPTT with a correctly chosen cutoff can be an effective training method (e.g. one that neither yields inaccurate gradients nor slows convergence).
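
The dependence on the distance between time steps can also be observed numerically. The following is a minimal sketch, not part of the original exercise code: it assumes a randomly initialized `nn.RNN` (tanh nonlinearity) and illustrative sizes (`T = 50`, hidden size 16, input size 4) chosen only for this demonstration. It measures how strongly a loss defined on the last time step depends on the input at each earlier step.

```python
import torch
from torch import nn

torch.manual_seed(0)

T, hidden = 50, 16                      # illustrative sizes (assumption, not from the lab)
rnn = nn.RNN(input_size=4, hidden_size=hidden, batch_first=True)  # tanh by default

x = torch.randn(1, T, 4, requires_grad=True)   # (batch, time_step, input_size)
out, h_n = rnn(x)

# Scalar "loss" that depends only on the hidden state at the last time step
loss = out[:, -1, :].sum()
loss.backward()

# Norm of d(loss)/d(x_t) for every time step t, shape (T,)
grad_norms = x.grad.norm(dim=2).squeeze(0)
print('first step: %.3e | last step: %.3e' % (grad_norms[0].item(), grad_norms[-1].item()))
```

With the default initialization, the gradient reaching the first time step is typically many orders of magnitude smaller than the one at the last step, which is the effect the exercises above ask you to explain.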


In [7]:
import torch
from torch import nn
from torch.autograd import Variable
import torchvision.datasets as dsets
import torchvision.transforms as transforms

# WIP

In [8]:
# Hyper Parameters
EPOCH = 1               # number of passes over the training data; a single epoch is used to save time
BATCH_SIZE = 64
TIME_STEP = 28          # rnn time step / image height
INPUT_SIZE = 28         # rnn input size / image width
LR = 0.01               # learning rate
DOWNLOAD_MNIST = True   # set to True if you haven't downloaded the data yet

In [9]:
# MNIST handwritten digits dataset
train_data = dsets.MNIST(
    root='./mnist/',
    train=True,                         # this is training data
    transform=transforms.ToTensor(),    # converts a PIL.Image or numpy.ndarray to a
                                        # torch.FloatTensor of shape (C x H x W), scaled to the range [0.0, 1.0]
    download=DOWNLOAD_MNIST,            # download it if you don't have it
)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


In [11]:
train_loader = torch.utils.data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
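
To see what a `DataLoader` yields without downloading anything, here is a small sketch that uses a synthetic stand-in for MNIST. The names `fake_images`, `fake_labels` and `fake_loader` are hypothetical and exist only for this illustration; the shapes mirror what `transforms.ToTensor()` produces for MNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST (assumption): 256 images of shape (1, 28, 28), labels in 0..9
fake_images = torch.rand(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True)

x, y = next(iter(fake_loader))          # one mini-batch
print(x.shape, y.shape)                 # x: (64, 1, 28, 28), y: (64,)

# Treat each image row as one time step: (batch, time_step, input_size)
seq = x.view(-1, 28, 28)
print(seq.shape)                        # (64, 28, 28)
print(len(fake_loader))                 # number of batches per epoch
```

Iterating over the loader in a `for` loop visits every batch once per epoch, reshuffled each time because `shuffle=True`.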

In [12]:
# take the first 2000 test samples to speed up testing
# (.data / .targets replace the deprecated .test_data / .test_labels; Variable and volatile are not needed in PyTorch >= 0.4)
test_data = dsets.MNIST(root='./mnist/', train=False, transform=transforms.ToTensor())
test_x = test_data.data[:2000].type(torch.FloatTensor) / 255.   # shape (2000, 28, 28), values in [0, 1]
test_y = test_data.targets[:2000].numpy()    # labels as a numpy array, shape (2000,)

In [13]:
class RNN(nn.Module):
    def __init__(self):
        super(RNN, self).__init__()

        self.rnn = nn.LSTM(         # if use nn.RNN(), it hardly learns
            input_size=INPUT_SIZE,
            hidden_size=64,         # rnn hidden unit
            num_layers=1,           # number of rnn layer
            batch_first=True,       # input & output have batch size as the first dimension, e.g. (batch, time_step, input_size)
        )

        self.out = nn.Linear(64, 10)

    def forward(self, x):
        # x shape (batch, time_step, input_size)
        # r_out shape (batch, time_step, hidden_size)
        # h_n shape (n_layers, batch, hidden_size)
        # h_c shape (n_layers, batch, hidden_size)
        r_out, (h_n, h_c) = self.rnn(x, None)   # None represents zero initial hidden state

        # choose r_out at the last time step
        out = self.out(r_out[:, -1, :])
        return out
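
The shapes noted in the comments of `forward` can be checked on random input. This is a minimal sketch with an illustrative batch size of 5 (an assumption); it also shows that, for a single-layer unidirectional LSTM, the output at the last time step equals the final hidden state `h_n[-1]`.

```python
import torch
from torch import nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=28, hidden_size=64, num_layers=1, batch_first=True)
x = torch.randn(5, 28, 28)              # (batch, time_step, input_size)
r_out, (h_n, c_n) = lstm(x)             # omitting the initial state means zeros

print(r_out.shape)                      # (5, 28, 64): hidden state at every time step
print(h_n.shape)                        # (1, 5, 64): final hidden state per layer
print(c_n.shape)                        # (1, 5, 64): final cell state per layer

# The last time step of r_out is the final hidden state of the last layer
same = torch.allclose(r_out[:, -1, :], h_n[-1])
print(same)
```

Either `r_out[:, -1, :]` or `h_n[-1]` could therefore feed the final linear layer in this configuration.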

In [14]:
rnn = RNN()
print(rnn)

RNN (
  (rnn): LSTM(28, 64, batch_first=True)
  (out): Linear (64 -> 10)
)
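
The model printed above wraps `nn.LSTM`. For comparison, the recurrence of a simple (Elman) RNN can be written out by hand. The sketch below is an illustration rather than the lab's reference solution: the helper name `simple_rnn_forward` is hypothetical, and it reuses the weights of an `nn.RNN` instance so the hand-written loop can be checked against PyTorch's own result.

```python
import torch
from torch import nn

torch.manual_seed(0)

ref = nn.RNN(input_size=28, hidden_size=64, batch_first=True)   # tanh nonlinearity by default

def simple_rnn_forward(x, rnn):
    """Unroll h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh) over time."""
    batch, steps, _ = x.shape
    h = torch.zeros(batch, rnn.hidden_size)      # zero initial hidden state
    outputs = []
    for t in range(steps):
        h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        outputs.append(h)
    return torch.stack(outputs, dim=1), h        # (batch, time_step, hidden), (batch, hidden)

x = torch.randn(3, 28, 28)                       # (batch, time_step, input_size)
manual_out, manual_h = simple_rnn_forward(x, ref)
ref_out, ref_h = ref(x)

matches = torch.allclose(manual_out, ref_out, atol=1e-4)
print(matches)
```

The same matrix `weight_hh_l0` is applied at every step, which is why repeated multiplication by it governs how gradients shrink or grow over long sequences.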


In [15]:
optimizer = torch.optim.Adam(rnn.parameters(), lr=LR)   # optimize all rnn parameters
loss_func = nn.CrossEntropyLoss()                       # expects class indices as targets, not one-hot vectors

In [17]:
# training and testing
for epoch in range(EPOCH):
    for step, (x, y) in enumerate(train_loader):        # gives batch data
        b_x = x.view(-1, 28, 28)                        # reshape x to (batch, time_step, input_size)
        b_y = y                                         # batch y

        output = rnn(b_x)                               # rnn output
        loss = loss_func(output, b_y)                   # cross entropy loss
        optimizer.zero_grad()                           # clear gradients for this training step
        loss.backward()                                 # backpropagation, compute gradients
        optimizer.step()                                # apply gradients

        if step % 50 == 0:
            with torch.no_grad():                       # no gradients needed for evaluation
                test_output = rnn(test_x)               # logits, shape (samples, 10)
            pred_y = torch.max(test_output, 1)[1].numpy().squeeze()
            accuracy = sum(pred_y == test_y) / float(test_y.size)
            print('Epoch: ', epoch, '| train loss: %.4f' % loss.item(), '| test accuracy: %.2f' % accuracy)

('Epoch: ', 0, '| train loss: 0.1055', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1112', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1039', '| test accuracy: 0.95')
('Epoch: ', 0, '| train loss: 0.0329', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1611', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.0620', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1159', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.0384', '| test accuracy: 0.95')
('Epoch: ', 0, '| train loss: 0.1854', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.2838', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1550', '| test accuracy: 0.97')
('Epoch: ', 0, '| train loss: 0.0121', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1202', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.0447', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1529', '| test accuracy: 0.96')
('Epoch: ', 0, '| train loss: 0.1298', '

KeyboardInterrupt: 
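
In a more standard training loop, the periodic accuracy check is usually factored into a helper that switches the model to evaluation mode and disables gradient tracking. The sketch below is one possible shape for such a helper. The name `evaluate` is hypothetical, and a small linear classifier on random data stands in for the RNN and MNIST so the example runs on its own.

```python
import torch
from torch import nn

def evaluate(model, inputs, targets):
    """Return the classification accuracy of `model` on (inputs, targets)."""
    model.eval()                         # disable dropout / batch-norm updates, if any
    with torch.no_grad():                # no autograd bookkeeping during evaluation
        logits = model(inputs)           # (samples, n_classes)
        pred = logits.argmax(dim=1)      # predicted class index per sample
    model.train()                        # restore training mode
    return (pred == targets).float().mean().item()

# Stand-in model and data (assumptions for illustration only)
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
inputs = torch.rand(32, 28, 28)
targets = torch.randint(0, 10, (32,))

acc = evaluate(model, inputs, targets)
print('accuracy: %.2f' % acc)
```

Inside the training loop above, a call such as `evaluate(rnn, test_x, torch.from_numpy(test_y))` would play the role of the inline accuracy computation, assuming `test_y` is the numpy label array defined earlier.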