# Section 1: PyTorch Tutorial (Tensors, Automatic Differentiation, Linear Regression, MNIST Classification)



## Goal

1. Learn to work with/manipulate tensors in PyTorch
2. Learn about automatic differentiation
3. Fit a linear regression model
4. Build a simple classifier on MNIST


In [None]:
import torch
import numpy as np

## Working with Tensors

Tensors, or multidimensional arrays, are fundamental objects that you will be working with in Torch (and deep learning in general). Let's create some.

In [None]:
x = torch.Tensor(5,5) #create a 5 x 5 float tensor; it is uninitialized, so the printed values are arbitrary
print(x)
x = torch.LongTensor(5,2) #often our input data consists of integers (word indices), so it's helpful to use LongTensors (also uninitialized)
print(x)
#we can also do other initializations
x = torch.randn(3,4) #initialize from standard normal
x = torch.ones(3,4) #all ones
x = torch.zeros(3,4) #all zeros
x = torch.eye(3) # identity matrix

You can go back and forth between numpy arrays and torch tensors with ease.

In [None]:
x_numpy = np.random.randn(5)
print(x_numpy)
x_torch = torch.from_numpy(x_numpy)
print(x_torch)
x_numpy2 = x_torch.numpy()
print(x_numpy2)
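
One detail of this conversion is easy to miss: on the CPU, `torch.from_numpy` does not copy the data, so the tensor and the array share the same memory. The sketch below illustrates this with made-up variable names (`a`, `t`, `t_copy` are not part of the tutorial); `clone()` is used when an independent copy is wanted.

```python
import numpy as np
import torch

a = np.zeros(3)                        # float64 numpy array
t = torch.from_numpy(a)                # tensor backed by the same memory as `a`
a[0] = 1.0                             # in-place change to the numpy array...
print(t)                               # ...is visible through the tensor

t_copy = torch.from_numpy(a).clone()   # clone() makes an independent copy
a[1] = 5.0
print(t[1].item(), t_copy[1].item())   # 5.0 0.0

print(t.dtype)                         # torch.float64, inherited from numpy
```

Note that the dtype carries over as well, so a default numpy array becomes a double-precision tensor rather than the float32 that `torch.randn` produces.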

Accessing torch tensors is essentially identical to the case in numpy.

In [None]:
x = torch.randn(4,3)
print(x[0, 0]) 
print(x[:, 0]) 

There are various ways to manipulate tensors. Of particular utility is the *view* function, which returns a tensor with a new shape that shares the same underlying memory as the original (no data is copied). See [here](http://pytorch.org/docs/master/tensors.html) for more operations.

In [None]:
print(x.view(2,6)) #reshape the 4x3 tensor into a 2x6 tensor
print(x.view(-1)) # a single -1 flattens to a 1d tensor; in general, -1 means "infer this dimension from the others"
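
A short sketch of what sharing memory means for `view`. It uses a small tensor built with `torch.arange` (an illustrative choice, not the `x` above) so that the values are easy to follow, and it also shows that a transposed tensor has to be made contiguous before it can be viewed.

```python
import torch

x = torch.arange(12.).view(4, 3)   # values 0..11 laid out as a 4x3 tensor
v = x.view(2, 6)                   # same 12 values, seen as 2x6
v[0, 0] = 100.                     # writing through the view...
print(x[0, 0].item())              # ...changes x as well: 100.0

print(x.view(-1, 6).size())        # -1 is inferred as 2: torch.Size([2, 6])

xt = x.t()                         # transpose: 3x4, not contiguous in memory
print(xt.is_contiguous())          # False
flat = xt.contiguous().view(-1)    # make a contiguous copy first, then view
print(flat.size())                 # torch.Size([12])
```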

We can use operations on tensors as in numpy. Operations whose names end in an underscore `_` are *in-place* and modify the original tensor. Other operations create a new tensor in memory.

In [None]:
x = torch.randn(5, 5)
y = torch.randn(5, 5)
z = torch.mm(x, y) #matrix multiply
z = 2*x # scalar multiplication
z = x + y #addition, creates a new tensor
print(x)
z = x.add(y) #same
print(x)
x.add_(y) #modifies x by adding y to it
print(x)
#there are other operations such as x.mul_(), x.div_() etc.
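
One way to confirm the difference between the two kinds of operation is to compare memory addresses. The sketch below uses `data_ptr()`, which returns the address of a tensor's first element; the small all-ones tensors are only for illustration.

```python
import torch

x = torch.ones(3)
y = torch.ones(3)

ptr_before = x.data_ptr()          # address of x's underlying storage
z = x.add(y)                       # out-of-place: result lives in new memory
print(z.data_ptr() == ptr_before)  # False
print(x)                           # x is unchanged: all ones

x.add_(y)                          # in-place: x itself is overwritten
print(x.data_ptr() == ptr_before)  # True
print(x)                           # all twos
```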


## Automatic Differentiation

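Cool, so the tensor object is basically the same as a numpy array. What's neat is that you can define **computation graphs** with tensors and backpropagate gradients through this computational graph automatically. This is known as automatic differentiation. To do this, however, we need to use a `Variable` object, which is basically a wrapper around a tensor object. (From PyTorch 0.4 onward, `Variable` has been merged into the tensor class, so the wrapper is kept only for backward compatibility and tensors can carry `requires_grad=True` directly.)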

In [None]:
from torch.autograd import Variable
x = Variable(torch.randn(5), requires_grad=True) #requires grad means that we want to calculate gradients
print(x.data) #a tensor object
print(x.grad) #the gradient. we haven't calculated any gradients yet, so this is None

Let's define a simple computation graph

In [None]:
x = Variable(torch.randn(5), requires_grad=True) 
y = x.mul(2) #y is now a variable. almost any operation you can do on tensors you can do on Variables
print(y)
y.backward(Variable(torch.ones(y.size()))) #this is a tricky concept: y is not a scalar, so we backprop a vector of ones,
#which computes the gradient of sum_i y[i] with respect to x
print(x.grad) #this should be a vector of 2s
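
To make the role of the vector passed to `backward` more concrete, here is a minimal sketch. It assumes PyTorch 0.4 or later, where `requires_grad=True` can be set on a tensor directly instead of wrapping it in a `Variable`, and it uses $y_i = x_i^2$ rather than the example above so that the gradient depends on `x`. Calling `backward(v)` accumulates $\sum_i v_i \, \partial y_i / \partial x$ into `x.grad`.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x ** 2                               # y[i] = x[i]^2, so dy[i]/dx[i] = 2*x[i]

# backward(v) accumulates sum_i v[i] * dy[i]/dx into x.grad
y.backward(torch.ones_like(y))
g_ones = x.grad.clone()
print(g_ones)                            # tensor([2., 4., 6.])

x.grad.zero_()
y = x ** 2                               # rebuild the graph before a second backward
y.backward(torch.tensor([1., 0., 0.]))   # picks out the gradient of y[0] only
g_first = x.grad.clone()
print(g_first)                           # tensor([2., 0., 0.])
```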

Now in practice, we almost always calculate the derivative with respect to a **scalar**. In this case we can simply call `.backward()` without having to pass in a one-element tensor containing 1.

In [None]:
x.grad.data.zero_() #gradients are always accumulated, so we need to zero them out manually
y = x.mul(2).sum() # now y is a scalar. In most cases this would be your average loss
y.backward() #this is equivalent to y.backward(Variable(torch.ones(y.size())))
print(x.grad)
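
The effect of gradient accumulation is easiest to see by calling `backward` twice without zeroing in between. This is a small sketch using the tensor-level API (PyTorch 0.4 or later is assumed); the names `first`, `second`, and `third` are illustrative.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)

x.mul(2).sum().backward()
first = x.grad.clone()        # tensor([2., 2.])

x.mul(2).sum().backward()     # no zeroing in between
second = x.grad.clone()       # gradients were added: tensor([4., 4.])

x.grad.zero_()                # reset before the next backward pass
x.mul(2).sum().backward()
third = x.grad.clone()        # back to tensor([2., 2.])
print(first, second, third)
```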

## Linear Regression

Let's fit a simple least squares model on a synthetic dataset with gradient descent. First let's generate synthetic data.

In [None]:
num_points = 1000
num_features = 5
w_star = torch.randn(num_features) #true weight vector
x = torch.randn(num_points, num_features) #input data
y = torch.mm(x, w_star.view(num_features, 1)) #torch.mm expects both inputs to be matrices so we reshape w_star into a column vector

This problem has an analytic solution which we can calculate directly.

In [None]:
w_ols = torch.mm(torch.mm(torch.inverse(torch.mm(x.t(), x)), x.t()), y) #(X^T X)^{-1} X^T Y
print(w_ols)
print(w_star)
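
Forming the inverse explicitly is fine for a problem this small, but it is generally preferable numerically to solve the normal equations $X^\top X w = X^\top y$ directly. The sketch below assumes a recent PyTorch release that provides `torch.linalg.solve`, and it regenerates the synthetic data with a fixed seed so that it runs on its own.

```python
import torch

torch.manual_seed(0)
num_points, num_features = 1000, 5
w_star = torch.randn(num_features)
x = torch.randn(num_points, num_features)
y = torch.mm(x, w_star.view(num_features, 1))   # noiseless targets

# Solve (X^T X) w = X^T y without forming the inverse explicitly
w_solve = torch.linalg.solve(torch.mm(x.t(), x), torch.mm(x.t(), y))

# Largest absolute difference from the true weights; should be near zero
print((w_solve.view(-1) - w_star).abs().max().item())
```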

But let's see if we can obtain (approximately) the same solution with gradient descent.

In [None]:
w_sgd = Variable(torch.randn(num_features), requires_grad=True) #randomly initialize
x_sgd = Variable(x) #remember, we need to convert everything to Variables to work with automatic differentiation
y_sgd = Variable(y) #we don't need to calculate gradients with respect to these guys

num_iters = 50
learning_rate = 0.1

for i in range(num_iters):
    if w_sgd.grad is not None:
        w_sgd.grad.zero_() #gradients get accumulated so we have to manually zero them out
    y_pred = torch.mm(x_sgd, w_sgd.unsqueeze(1)) #unsqueeze adds an extra dimension
    error = (y_sgd - y_pred)**2
    error_avg = error.mean()
    if i % 10 == 0:
        print(i, error_avg.item()) #.item() extracts the Python number from a one-element tensor
    error_avg.backward()    
    w_sgd.data = w_sgd.data - learning_rate*w_sgd.grad.data
        
print(w_sgd)
print(w_star)
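
The gradient that `backward()` produces in this loop can be checked by hand. For the mean squared error $L(w) = \frac{1}{n}\lVert y - Xw \rVert^2$, the gradient is $\nabla_w L = -\frac{2}{n} X^\top (y - Xw)$. The sketch below compares the two on small random data (the sizes and names are illustrative, and PyTorch 0.4 or later is assumed).

```python
import torch

torch.manual_seed(0)
n, d = 100, 5
x = torch.randn(n, d)
y = torch.randn(n, 1)
w = torch.randn(d, requires_grad=True)

loss = ((y - torch.mm(x, w.unsqueeze(1))) ** 2).mean()
loss.backward()                                   # autograd gradient is now in w.grad

# Analytic gradient: -(2/n) * X^T (y - X w)
with torch.no_grad():
    residual = y - torch.mm(x, w.unsqueeze(1))
    grad_manual = (-2.0 / n) * torch.mm(x.t(), residual).squeeze(1)

print(torch.allclose(w.grad, grad_manual, atol=1e-4))
```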

Ok, that worked, but it's a bit annoying to separately define the weights as `Variables` and manually apply matrix-multiplies. Fortunately, torch provides an `nn` package which provides abstractions for almost all of the layers that we will use in the course. Let's see a few examples.

In [None]:
import torch.nn as nn

lin = nn.Linear(num_features, 1, bias=False) #this defines a linear layer that maps num_features dimensions to 1
print(lin.weight) #weight parameter of the linear layer. if bias = True, then we can access bias with lin.bias
y_pred = lin(x_sgd) #lin(x_sgd) automatically calls forward on the input x_sgd
print(y_pred)
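
Under the hood, a bias-free linear layer is just a matrix multiply with the transposed weight. A quick sketch to confirm this, using a fresh layer and random input so that it is self-contained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(5, 1, bias=False)           # weight has shape (out_features, in_features)
x = torch.randn(4, 5)

out_layer = lin(x)                          # shape (4, 1)
out_manual = torch.mm(x, lin.weight.t())    # x W^T, the same computation by hand
print(lin.weight.size())                    # torch.Size([1, 5])
print(torch.allclose(out_layer, out_manual, atol=1e-6))
```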

The list of `nn` layers can be found [here](http://pytorch.org/docs/master/nn.html). You should try to familiarize yourself with pretty much all the `nn` layers. Torch also provides an `optim` package that makes parameter updates easier. We will be working with SGD in this tutorial, but other optimization algorithms (Adam, Adagrad, RMSProp, etc) can be found [here](http://pytorch.org/docs/master/optim.html). Let's try to fit the linear regression model with these abstractions now.

In [None]:
optimizer = torch.optim.SGD(lin.parameters(), lr = learning_rate) #initialize with parameters
for i in range(num_iters):
    optimizer.zero_grad() #this will zero out the gradients of all your parameters
    y_pred = lin(x_sgd)
    error = (y_sgd - y_pred)**2
    error_avg = error.mean()    
    error_avg.backward()    
    optimizer.step()    
print(lin.weight)
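
For plain SGD (no momentum or weight decay), `optimizer.step()` performs the same update as the manual loop earlier: each parameter becomes `w - lr * w.grad`. The sketch below checks this for a single step on random data; the layer and data are illustrative stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(5, 1, bias=False)
optimizer = torch.optim.SGD(lin.parameters(), lr=0.1)
x, y = torch.randn(20, 5), torch.randn(20, 1)

optimizer.zero_grad()
loss = ((y - lin(x)) ** 2).mean()
loss.backward()

# What a plain SGD step should produce: w - lr * grad
expected = lin.weight.detach() - 0.1 * lin.weight.grad
optimizer.step()
print(torch.allclose(lin.weight.detach(), expected))
```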

## MNIST Classification

First download the data

In [None]:
import torchvision.datasets as datasets
import torchvision.transforms as transforms
train_dataset = datasets.MNIST(root='./data/',
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)
val_dataset = datasets.MNIST(root='./data/',
                           train=False, 
                           transform=transforms.ToTensor())

In [None]:
image, label = train_dataset[0]
print(image.size())
print(label)

Torch provides a loader around tensor datasets so you can create/access mini-batches easily

In [None]:
batch_size = 100
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)
for (image, label) in train_loader: #iterates through the dataset in mini-batches
    print(image.size())
    print(label.size())
    break
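
The same loader works with any dataset, not just MNIST. As a sketch that runs without downloading anything, the example below wraps random tensors in a `TensorDataset` (a synthetic stand-in for the real data) and shows how the last batch is handled when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 250 "images" of shape 1x28x28 with labels 0..9
images = torch.randn(250, 1, 28, 28)
labels = torch.randint(0, 10, (250,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=100, shuffle=False)
batch_sizes = [image.size(0) for image, label in loader]
print(batch_sizes)        # [100, 100, 50]: the last batch is smaller

# drop_last=True discards the incomplete final batch
loader = DataLoader(dataset, batch_size=100, shuffle=False, drop_last=True)
print(len(loader))        # 2
```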

Great, now we are ready to create our first model, which will be a simple multilayer perceptron. To do this, it helps to define a `class` with all the layers inside it.

In [None]:
class MLP(nn.Module):
    def __init__(self, num_layers = 1, input_dim = 28*28, output_dim = 10, hidden_dim = 10):
        super(MLP, self).__init__()                
        self.hidden_to_output = nn.Linear(hidden_dim, output_dim)
        hidden_layers = []
        for l in range(num_layers):
            dim = input_dim if l == 0 else hidden_dim #first layer is input to hidden layer
            hidden_layers.append(nn.Linear(dim, hidden_dim))
            hidden_layers.append(nn.ReLU()) #let's work with relu nonlinearities for now
        self.hidden_layers = nn.Sequential(*hidden_layers) #Sequential module will apply the layers in sequence
        self.logsoftmax = nn.LogSoftmax(dim=1) #softmax over the class dimension turns the output into probabilities, but log is more convenient
        
    def forward(self, x): #calling mlp(x) on an instance runs mlp.forward(x)
        x_flatten = x.view(x.size(0), -1) #need to flatten batch_size x 1 x 28 x 28 to batch_size x 28*28
        out = self.hidden_layers(x_flatten)
        out = self.hidden_to_output(out) #you can redefine variables
        return self.logsoftmax(out)
        
mlp = MLP(num_layers = 1)
print(mlp)
for p in mlp.parameters(): #all the parameters that were defined inside the module can be accessed like this
    print(p.size())
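
The printed sizes can be turned into a total parameter count. The sketch below rebuilds the same layer shapes as `MLP(num_layers = 1)` with the default dimensions using `nn.Sequential`, purely so that it runs on its own; each `nn.Linear` contributes a weight matrix and a bias vector.

```python
import torch.nn as nn

# Same layer shapes as MLP(num_layers=1) with defaults: 784 -> 10 -> 10
model = nn.Sequential(
    nn.Linear(28 * 28, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
)
num_params = sum(p.numel() for p in model.parameters())
print(num_params)   # 7960 = (784*10 + 10) + (10*10 + 10)
```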

Using the network is easy.

In [None]:
y_pred = mlp(Variable(image))
print(y_pred) #these will be log probabilities over each of the 10 classes
print(y_pred[0].exp()) #exponentiating gives probabilities for the first example, which should sum to 1
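
A minimal sketch of what log-softmax computes, on hand-picked logits rather than the network output. It uses the functional form `F.log_softmax` with an explicit `dim`, and compares it against the definition $\log \mathrm{softmax}(z) = z - \log \sum_j e^{z_j}$.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])
log_probs = F.log_softmax(logits, dim=1)   # normalize across classes (dim 1)
probs = log_probs.exp()

print(probs.sum(dim=1))                    # each row sums to 1
# log_softmax(z) = z - logsumexp(z)
manual = logits - torch.logsumexp(logits, dim=1, keepdim=True)
print(torch.allclose(log_probs, manual))
```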

It's also convenient to define a test function that we can call periodically to check performance.

In [None]:
criterion = nn.NLLLoss() #this is the negative log-likelihood for multi-class classification

def test(model, data):
    correct = 0.
    num_examples = 0.
    nll = 0.
    for (image, label) in data:
        image, label = Variable(image), Variable(label) #annoying, but necessary
        y_pred = model(image) #evaluate the model that was passed in
        nll_batch = criterion(y_pred, label)
        nll += nll_batch.item() * image.size(0) #by default NLL is averaged over each batch
        y_pred_max, y_pred_argmax = torch.max(y_pred, 1) #prediction is the argmax
        correct += (y_pred_argmax.data == label.data).sum().item() #number of correct predictions as a Python number
        num_examples += image.size(0)
    return nll/num_examples, correct/num_examples
nll, accuracy = test(mlp, val_loader)
print('Validation performance. NLL: %.4f, Accuracy: %.4f'% (nll, accuracy))
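
To see what the criterion computes, here is a small worked sketch on two hand-picked examples. `nn.NLLLoss` averages the negative log-probability of the correct class over the batch, and `nn.CrossEntropyLoss` applied to raw logits gives the same value because it combines log-softmax and NLL. The logits and labels are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0]])
labels = torch.tensor([0, 2])
log_probs = F.log_softmax(logits, dim=1)

nll = nn.NLLLoss()(log_probs, labels)               # averaged over the batch
manual = -(log_probs[0, 0] + log_probs[1, 2]) / 2   # the same quantity by hand
ce = nn.CrossEntropyLoss()(logits, labels)          # log-softmax + NLL in one step

print(nll.item(), manual.item(), ce.item())
accuracy = (log_probs.argmax(dim=1) == labels).float().mean().item()
print(accuracy)                                     # 1.0: both argmaxes match the labels
```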

Now we are ready to train!

In [None]:
mlp = MLP(num_layers = 1)
optim = torch.optim.SGD(mlp.parameters(), lr =0.5)
num_epochs = 20
for e in range(num_epochs):
    for (image, label) in train_loader:
        optim.zero_grad()
        image, label = Variable(image), Variable(label) 
        y_pred = mlp(image)
        nll_batch = criterion(y_pred, label)    
        nll_batch.backward()
        optim.step()
    nll_train, accuracy_train = test(mlp, train_loader) #in practice you would rarely do this, since a full pass over the training set every epoch is slow
    nll_val, accuracy_val = test(mlp, val_loader)
    print('Training performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_train, accuracy_train))
    print('Validation performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_val, accuracy_val))    

Let's try with more hidden units

In [None]:
mlp = MLP(num_layers = 1, hidden_dim = 100)
print(mlp)
optim = torch.optim.SGD(mlp.parameters(), lr = 0.5)
num_epochs = 20
for e in range(num_epochs):
    for (image, label) in train_loader:
        optim.zero_grad()
        image, label = Variable(image), Variable(label) 
        y_pred = mlp(image)
        nll_batch = criterion(y_pred, label)    
        nll_batch.backward()
        optim.step()
    nll_train, accuracy_train = test(mlp, train_loader) 
    nll_val, accuracy_val = test(mlp, val_loader)
    print('Training performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_train, accuracy_train))
    print('Validation performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_val, accuracy_val))  

Great. For the rest of the section, try implementing a ConvNet. You may need to use more sophisticated optimization algorithms (`torch.optim.Adam(model.parameters(), lr = 0.001)` should work well enough for most models).
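
As one possible starting point, here is a hedged sketch of a small ConvNet. The architecture (two 5x5 convolutions with 16 and 32 channels, each followed by ReLU and 2x2 max pooling) is an arbitrary illustrative choice, not a prescribed solution, and the class name `ConvNet` is hypothetical. It is checked on a random batch so that it runs without the dataset.

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, output_dim=10):
        super(ConvNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5),   # 1x28x28 -> 16x24x24
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 16x12x12
            nn.Conv2d(16, 32, kernel_size=5),  # -> 32x8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 32x4x4
        )
        self.classifier = nn.Linear(32 * 4 * 4, output_dim)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)        # flatten to batch_size x 512
        return self.logsoftmax(self.classifier(out))

convnet = ConvNet()
dummy = torch.randn(8, 1, 28, 28)              # stand-in for a batch of MNIST images
out = convnet(dummy)
print(out.size())                              # torch.Size([8, 10])
optimizer = torch.optim.Adam(convnet.parameters(), lr=0.001)
```

Because the model returns log-probabilities, it can be dropped into the same training loop and `test` function as the MLP, with `nn.NLLLoss` as the criterion.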