In this lab we will review some basics of PyTorch. Save your answers for this lab, as they will be used in part of Lab 2.



(1) Create a dataloader for the MNIST training data using the torchvision package. Have your dataloader iterate over the training set, outputting mini-batches of 256 image samples. Note that you do not need to use the image labels in this lab. You may follow the example in the official PyTorch examples:

https://github.com/pytorch/examples/blob/master/mnist/main.py#L112-L120


In [None]:
from torchvision import datasets,transforms
import torch
dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(dataset1, 
                                           batch_size=256, 
                                           shuffle=True,
                                           drop_last=True)

device='cpu'

# Iterate once over the training set; after the loop, data and target hold the last mini-batch
for (data, target) in train_loader:
  data = data.to(device)
  target = target.to(device)
print(data.shape)
print(target.shape)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Processing...
Done!

torch.Size([256, 1, 28, 28])
torch.Size([256])
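
The effect of `drop_last=True` in the loader above can be seen on a small synthetic dataset. The following is only a minimal sketch: `toy_dataset` and `batch_sizes` are hypothetical names introduced for illustration, and the 1000 blank images are a stand-in for MNIST so that nothing needs to be downloaded.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 1000 blank 1x28x28 "images" with dummy labels.
images = torch.zeros(1000, 1, 28, 28)
labels = torch.zeros(1000, dtype=torch.long)
toy_dataset = TensorDataset(images, labels)

def batch_sizes(drop_last):
    """Return the size of every mini-batch produced in one epoch."""
    loader = DataLoader(toy_dataset, batch_size=256, shuffle=True, drop_last=drop_last)
    return [data.shape[0] for data, _ in loader]

sizes_dropped = batch_sizes(drop_last=True)   # incomplete final batch is skipped
sizes_kept = batch_sizes(drop_last=False)     # incomplete final batch is kept
print(sizes_dropped)  # [256, 256, 256]
print(sizes_kept)     # [256, 256, 256, 232]
```

With the 60,000 MNIST training images, the same arithmetic gives 234 full batches per epoch (59,904 images), so 96 images are skipped each epoch when `drop_last=True`.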




(2) Using only torch primitives (e.g. torch.matmul, torch.relu, etc.), implement a simple feedforward neural network with 2 hidden layers that takes MNIST digits as input and outputs a single scalar value. You may select the hidden layer width (greater than 20) and activations (tanh, relu, sigmoid, others) as desired. Initialize the weights and biases with uniform random values in the range -1 to 1. Avoid using any functions from the torch.nn module. Using the loop from (1), run a forward pass through the dataset in mini-batches of 256 and record the time this takes.

In [None]:
import torch

## Initialize and track the parameters using a list or dictionary
h1_size = 50
h2_size = 50
param_dict = {
    "W0": torch.rand(784, h1_size)*2-1,
    "b0": torch.rand(h1_size)*2-1, 
    "W1": torch.rand(h1_size, h2_size)*2-1,
    "b1": torch.rand(h2_size)*2-1,
    "W2": torch.rand(h2_size,1)*2-1,
    "b2": torch.rand(1)*2-1,
    }

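# Move each parameter to the device first, then enable gradient tracking on the
# tensor that is actually stored (on a GPU, .to() returns a new tensor)
for name, param in param_dict.items():
  param_dict[name] = param.to(device).requires_grad_()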

## Define the network
def my_nn(input, param_dict):
    r"""Performs a single forward pass of a Neural Network with the given 
    parameters in param_dict.

    Args:
        input (torch.tensor): Batch of images of shape (B, 1, H, W), where B 
            is the number of input samples, and H and W are the image height 
            and width respectively.
        param_dict (dict of torch.tensor): Dictionary containing the parameters
            of the neural network. Expects dictionary keys to be of format 
            "W#" and "b#" where # is the layer number.

    Returns:
        torch.tensor: Neural network output of shape (B, )
    """


    x = input.view(-1 , 28*28) 

    # layer 1
    x = torch.relu_(x @ param_dict['W0'] + param_dict['b0'])

    # layer 2
    x = torch.relu_(x @ param_dict['W1'] + param_dict['b1'])

    # output 
    x = x @ param_dict['W2']  + param_dict['b2']

    return x.view(-1)


## Perform forward pass
my_nn(data, param_dict).shape

torch.Size([256])
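
The exercise also asks for the time taken by a forward pass over the whole dataset. One possible way to record it is sketched below. To keep the example self-contained it uses a hypothetical list `fake_batches` of random tensors in place of `train_loader`, and re-declares a small network (`params`, `forward`) equivalent in structure to the one above; wrapping the loop in `torch.no_grad()` is an assumption made here, since no gradients are needed for timing the forward pass alone.

```python
import time
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the MNIST loader: 8 random batches of 256 images.
fake_batches = [torch.rand(256, 1, 28, 28) for _ in range(8)]

# Weights and biases drawn uniformly from [-1, 1), as in the exercise.
hidden = 50
params = {
    "W0": torch.rand(784, hidden) * 2 - 1, "b0": torch.rand(hidden) * 2 - 1,
    "W1": torch.rand(hidden, hidden) * 2 - 1, "b1": torch.rand(hidden) * 2 - 1,
    "W2": torch.rand(hidden, 1) * 2 - 1, "b2": torch.rand(1) * 2 - 1,
}

def forward(x, p):
    """Two hidden ReLU layers followed by a scalar output per sample."""
    x = x.view(-1, 28 * 28)
    x = torch.relu(x @ p["W0"] + p["b0"])
    x = torch.relu(x @ p["W1"] + p["b1"])
    return (x @ p["W2"] + p["b2"]).view(-1)

start = time.perf_counter()
with torch.no_grad():  # gradients are not needed when only timing the forward pass
    outputs = [forward(batch, params) for batch in fake_batches]
elapsed = time.perf_counter() - start
print(f"Forward pass over {len(fake_batches)} batches took {elapsed:.4f} s")
```

In the notebook itself, the list of random batches would be replaced by the loop over `train_loader` from (1).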

(3) Implement a new torch.nn.Module that performs the equivalent of the network in (2). Initialize it with the same weights and validate that the outputs of this network are the same as those of the network in (2) on the MNIST training set.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, h1_siz, h2_siz):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(28*28, h1_siz)
        self.linear2 = nn.Linear(h1_siz, h2_siz)
        self.linear3 = nn.Linear(h2_siz, 1)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        return self.linear3(x).view(-1)

model = Model(h1_size,h2_size)

In [None]:
import copy

# We can access all the parameters in the model and manipulate them directly.
# nn.Linear stores its weight as (out_features, in_features), hence the transpose.
# The deepcopy ensures the model's parameters do not share memory with param_dict.
model.linear1.weight.data = copy.deepcopy(param_dict['W0'].data.T)
model.linear1.bias.data = copy.deepcopy(param_dict['b0'].data)
model.linear2.weight.data = copy.deepcopy(param_dict['W1'].data.T)
model.linear2.bias.data = copy.deepcopy(param_dict['b1'].data)
model.linear3.weight.data = copy.deepcopy(param_dict['W2'].data.T)
model.linear3.bias.data = copy.deepcopy(param_dict['b2'].data)



In [None]:
for i,(data, _) in enumerate(train_loader):
  assert(((model(data)-my_nn(data, param_dict))**2).mean()<1e-4) # check that all the outputs are roughly equal
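
The transposes used when copying the weights come from the layout of `nn.Linear`, which stores its weight as `(out_features, in_features)` and computes `x @ weight.T + bias`. A minimal sketch of this equivalence on a single layer, with hypothetical names `W`, `b`, `x`, and `layer` and random inputs in place of MNIST:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

W = torch.rand(784, 50) * 2 - 1   # (in_features, out_features), as used with x @ W
b = torch.rand(50) * 2 - 1
x = torch.rand(4, 784)

layer = nn.Linear(784, 50)
with torch.no_grad():
    layer.weight.copy_(W.T)       # nn.Linear expects (out_features, in_features)
    layer.bias.copy_(b)

manual = x @ W + b                # primitive version
module = layer(x)                 # nn.Module version
max_diff = (manual - module).abs().max().item()
print(tuple(layer.weight.shape), max_diff)
```

Copying inside `torch.no_grad()` with `copy_` is an alternative to assigning through `.data`; both leave the parameters as leaf tensors that autograd can track.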

(4) For a batch of 256 random samples, compute the gradient of the average of the neural network outputs (over the batch) w.r.t. the weights using torch autograd. Compute the gradients for the torch.nn based model in (3) and validate that they match those computed with the network in (2).

**Note**: The network here is $f: \mathcal{R}^{HW}\rightarrow\mathcal{R}$, with $256$ samples you should obtain $o=\frac{1}{256}\sum_{i=0}^{255}f(x_i)$. You are asked to find $\nabla_w o$ for all the parameters $w$.

In [None]:


# For nn module defined version
output = model(data).mean()
output.backward()

#For model using only torch autograd/tensors
output2 = my_nn(data, param_dict).mean()
output2.backward()

epsilon=1e-3
assert(torch.norm(param_dict['W0'].grad.T - model.linear1.weight.grad)<epsilon)
assert(torch.norm(param_dict['b0'].grad - model.linear1.bias.grad)<epsilon)
assert(torch.norm(param_dict['W1'].grad.T - model.linear2.weight.grad)<epsilon)
assert(torch.norm(param_dict['b1'].grad - model.linear2.bias.grad)<epsilon)
assert(torch.norm(param_dict['W2'].grad.T - model.linear3.weight.grad)<epsilon)
assert(torch.norm(param_dict['b2'].grad - model.linear3.bias.grad)<epsilon)
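
For intuition about what $\nabla_w o$ looks like, consider a purely linear model $f(x) = x^\top w + b$, for which the gradient of the batch average has a closed form: $\nabla_w o$ is the mean of the inputs and $\nabla_b o = 1$. The sketch below checks this against autograd. It is a simplified stand-in (no hidden layers) rather than the network from (2), and all names in it are illustrative.

```python
import torch

torch.manual_seed(0)

B, D = 256, 784
x = torch.rand(B, D)
w = (torch.rand(D) * 2 - 1).requires_grad_()
b = (torch.rand(1) * 2 - 1).requires_grad_()

# o = (1/B) * sum_i f(x_i) with f(x) = x . w + b
o = (x @ w + b).mean()
o.backward()

expected_w_grad = x.mean(dim=0)   # d o / d w = average input
expected_b_grad = torch.ones(1)   # d o / d b = (1/B) * B = 1

print(torch.allclose(w.grad, expected_w_grad, atol=1e-5))
print(torch.allclose(b.grad, expected_b_grad))
```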

In [None]:
# For completeness: this cell is helpful if you are rerunning the answers and thus need to clear the previous gradients
# The gradient buffers must be cleared before computing a new backward pass, since gradients accumulate
model.zero_grad() # for nn module

for (_,param) in param_dict.items():
  if param.grad is not None: # grad buffer doesn't exist until the first backward pass
    param.grad.detach_() # make sure the gradient is detached from any computation graph
    param.grad.zero_()
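
The reason the buffers need clearing is that `backward()` adds to `.grad` rather than overwriting it. A small illustration on a toy tensor (the names `w` and `x` are hypothetical and unrelated to the network parameters):

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

(w * x).sum().backward()
first = w.grad.clone()            # gradient of sum(w * x) w.r.t. w is x

(w * x).sum().backward()
accumulated = w.grad.clone()      # second backward adds to the existing buffer

w.grad.zero_()                    # clear the buffer in place
(w * x).sum().backward()
after_reset = w.grad.clone()

print(first)        # tensor([1., 2., 3.])
print(accumulated)  # tensor([2., 4., 6.])
print(after_reset)  # tensor([1., 2., 3.])
```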

(5) Perform the forward and backward passes from (3) 10 times on CPU and 10 times on GPU, and report the average time for both. Repeat this for just the forward pass. In the end you should obtain 4 average run times: (forward and backward, forward only) x (CPU, GPU).

In [None]:
#We may need to use a bigger model here to see the difference
model = Model(1000 , 1000)

In [None]:
import time as timer
data = data.to('cpu')
model.cpu()

print('Running on CPU')

start = timer.time()
for _ in range(10):
  model(data)
print("Time taken forward", timer.time() - start) 

start = timer.time()
for _ in range(10):
  out = model(data).mean()
  out.backward()
print("Time taken forward/backward", timer.time() - start) 

Running on CPU
Time taken forward 0.12662315368652344
Time taken forward/backward 0.3139340877532959


In [None]:
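# Initialize CUDA and run one warm-up pass so one-time setup costs are not timed
data = data.to('cuda')
model.cuda()
model(data)
torch.cuda.synchronize() # wait for the warm-up to finish before starting the timer
print('Running on GPU')


start = timer.time()
for _ in range(10):
  model(data)
torch.cuda.synchronize() # CUDA kernels run asynchronously; wait for them before reading the clock

print("Time taken", timer.time() - start) 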

start = timer.time()
for _ in range(10):
  out = model(data).mean()
  out.backward()
torch.cuda.synchronize()
print("Time taken forward/backward", timer.time() - start) 

Running on GPU
Time taken 0.005411863327026367
Time taken forward/backward 0.014293193817138672


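Note that the times printed above are totals over 10 iterations; divide by 10 to obtain the per-iteration average. The combined forward/backward pass takes roughly 2.5x as long as the forward pass alone, so the backward pass itself costs about 1.5x the forward pass. The GPU is more than an order of magnitude (about 20x) faster in this setting.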
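
The four measurements can also be gathered with a single helper that reports per-iteration averages. The following is a sketch under a few assumptions: `average_time` is a hypothetical helper, the `nn.Sequential` network is a stand-in with the same layer sizes as `Model(1000, 1000)`, the input is a random batch rather than MNIST, the forward-only case is run under `torch.no_grad()`, and the GPU is only timed when `torch.cuda.is_available()` returns true.

```python
import time
import torch
import torch.nn as nn

def average_time(fn, device, n_runs=10):
    """Average wall-clock seconds per call of fn over n_runs calls."""
    fn()  # warm-up call, not timed
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish pending kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # CUDA calls are asynchronous
    return (time.perf_counter() - start) / n_runs

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda"))

results = {}
for device in devices:
    net = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 1000), nn.ReLU(),
        nn.Linear(1000, 1000), nn.ReLU(),
        nn.Linear(1000, 1),
    ).to(device)
    batch = torch.rand(256, 1, 28, 28, device=device)

    def forward_only():
        with torch.no_grad():
            net(batch)

    def forward_backward():
        net.zero_grad()  # avoid accumulating gradients across runs
        net(batch).mean().backward()

    results[(device.type, "forward")] = average_time(forward_only, device)
    results[(device.type, "forward+backward")] = average_time(forward_backward, device)

for key, seconds in results.items():
    print(key, f"{seconds:.6f} s")
```

Absolute numbers will differ from those reported above depending on hardware and on whether gradient tracking is enabled during the forward-only runs.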