# Section 1: PyTorch Tutorial (Tensors, Automatic Differentiation, Linear Regression, MNIST Classification)



## Goal

1. Learn to work with/manipulate tensors in PyTorch
2. Learn about automatic differentiation
3. Fit a linear regression model
4. Build a simple classifier on MNIST


In [1]:
import torch
import numpy as np

## Working with Tensors

Tensors, or multidimensional arrays, are fundamental objects that you will be working with in Torch (and deep learning in general). Let's create some.

In [3]:
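x = torch.Tensor(5,5) #create a 5 x 5 tensor; the memory is uninitialized, so the printed values are arbitrary
print(x)
x = torch.LongTensor(5,2) #often our input data consists of integers (word indices), so it's helpful to use LongTensors
print(x)
x = torch.ByteTensor(1,1) #unsigned 8-bit integers, also uninitialized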
print(x)
#we can also do other initializations
x = torch.randn(3,4) #initialize from standard normal
x = torch.ones(3,4) #all ones
x = torch.zeros(3,4) #all zeros
x = torch.eye(3) # identity matrix


-5.7846  0.0000 -5.7846  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
[torch.FloatTensor of size 5x5]


 1.3962e+14  1.3962e+14
 1.0000e+00  3.7332e+11
 0.0000e+00  4.2950e+09
 0.0000e+00  3.3000e+01
 3.7331e+11  1.3962e+14
[torch.LongTensor of size 5x2]


 136
[torch.ByteTensor of size 1x1]
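
The three tensors printed above hold arbitrary values because the typed constructors only allocate memory. As a minimal sketch of explicit initialization, assuming a more recent PyTorch release (0.4 or later) in which factory functions accept a `dtype` argument, one could write:

```python
import torch

# Explicitly initialized tensors; dtype selects the element type.
a = torch.zeros(5, 5)                      # float32 zeros (default dtype)
b = torch.zeros(5, 2, dtype=torch.long)    # 64-bit integers, e.g. for word indices
c = torch.tensor([[1, 2], [3, 4]])         # built from a Python list, dtype inferred as int64
d = torch.empty(5, 5)                      # uninitialized, like torch.Tensor(5, 5)

print(a.dtype, b.dtype, c.dtype, d.shape)
```

Only `d` may contain garbage here; the other three have well-defined contents.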



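You can go back and forth between numpy arrays and torch tensors with ease. Note that `torch.from_numpy` shares memory with the original array, so modifying the tensor also modifies the array.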

In [5]:
x_numpy = np.random.randn(5)
print(x_numpy)
x_torch = torch.from_numpy(x_numpy)
print(x_torch)
x_torch[0] = 0
print(x_numpy)


[ 0.42960673  1.50988904  0.69349265  1.23092394 -2.54778714]

 0.4296
 1.5099
 0.6935
 1.2309
-2.5478
[torch.DoubleTensor of size 5]

[ 0.          1.50988904  0.69349265  1.23092394 -2.54778714]
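
The memory sharing shown above works in both directions. The following sketch, which assumes a recent PyTorch release and CPU tensors, contrasts a shared tensor with an independent copy made via `clone()`:

```python
import numpy as np
import torch

x_numpy = np.arange(5, dtype=np.float64)
shared = torch.from_numpy(x_numpy)          # shares memory with x_numpy
copied = torch.from_numpy(x_numpy).clone()  # independent copy

shared[0] = 100.0
print(x_numpy[0])   # changed through the shared tensor
print(copied[0])    # unaffected

back = shared.numpy()  # also shares memory (CPU tensors only)
back[1] = -1.0
print(shared[1])
```

If you need to modify one object without touching the other, copying first is the safer choice.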


Indexing torch tensors is essentially identical to indexing numpy arrays.

In [8]:
x = torch.randn(4,3)
print(x[0, 0]) 
print(x[:, 0]) 
print(x[[0,2], :])
x[[0,2], :] = 100
print(x)

-1.2424112558364868

-1.2424
 1.0344
-0.6358
 0.4173
[torch.FloatTensor of size 4]


-1.2424 -1.0825 -2.4514
-0.6358  0.7898  0.4812
[torch.FloatTensor of size 2x3]


 100.0000  100.0000  100.0000
   1.0344   -1.5117   -0.6687
 100.0000  100.0000  100.0000
   0.4173   -0.2641    1.2155
[torch.FloatTensor of size 4x3]



There are various ways to manipulate tensors. Of particular utility is the *view* function, which returns a tensor with a different shape that shares the same underlying data. See [here](http://pytorch.org/docs/master/tensors.html) for more operations.

In [11]:
x = torch.randn(12)
print(x.view(2,6)) #reshape the 12-element tensor into a 2x6 tensor
print(x.view(-1)) #a single -1 flattens to a 1d tensor

x = torch.randn(3,4,5)
x.view(12, -1) #-1 is inferred from the other dimensions: here 3*4*5/12 = 5


 1.0687  0.6584 -0.8776 -0.0458 -1.8898  0.9141
 2.8086 -1.5170 -0.7729  0.4793 -1.0196  0.5272
[torch.FloatTensor of size 2x6]


 1.0687
 0.6584
-0.8776
-0.0458
-1.8898
 0.9141
 2.8086
-1.5170
-0.7729
 0.4793
-1.0196
 0.5272
[torch.FloatTensor of size 12]




-1.6459 -0.1662 -0.1612  0.8961 -0.3470
 0.9678  0.1589 -1.7035  0.4087  0.5826
-1.5407  0.2940 -0.7082 -1.1668 -0.0078
 0.6252 -1.2950  0.6264  0.7838 -0.3938
 2.4792 -1.7871 -0.9470  1.6338  0.0845
-2.3959 -0.0946 -0.9082 -0.1912  1.1389
-0.8046 -0.7894 -0.9371  0.1127  1.5833
-1.7731 -1.4199  0.1563  0.6043  0.3891
 0.1100  1.8553  1.6423 -0.3854 -0.9171
 1.1319 -1.1795 -1.5360  0.2060 -1.0844
-0.6217 -0.9680 -1.8815 -0.8383 -0.2641
 0.7409 -0.8969 -0.6261 -1.2428 -0.8121
[torch.FloatTensor of size 12x5]
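
Two properties of `view` are easy to miss: the result shares storage with the original tensor, and it requires contiguous memory. A small sketch, assuming a recent PyTorch release:

```python
import torch

x = torch.arange(12)            # tensor([0, 1, ..., 11])
v = x.view(3, -1)               # -1 is inferred as 12 / 3 = 4
print(v.shape)

v[0, 0] = 99                    # a view shares storage with the original
print(x[0])

t = v.t()                       # transposing makes the tensor non-contiguous
print(t.is_contiguous())
flat = t.contiguous().view(-1)  # copy into contiguous memory before calling view
print(flat.shape)
```

Calling `t.view(-1)` directly on the transposed tensor would raise an error, which is why the `contiguous()` call is there.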

We can use operations on tensors as in numpy. Operations whose names end in an underscore (e.g. `add_`) are *in-place* and modify the original tensor. Other operations will create a new tensor in memory.

In [12]:
x = torch.randn(5, 5)
y = torch.randn(5, 5)
z = torch.mm(x, y) #matrix multiply
z = 2*x # scalar multiplication
z = x + y #addition, creates a new tensor
print(x)
z = x.add(y) #same
print(x)
x.add_(y) #modifies x by adding y to it
print(x)
#there are other operations such as x.mul_(), x.div_() etc.



-0.5994 -0.2923  1.2099  0.0675 -0.5873
 0.0973  1.3780  1.6060  1.1143  1.9891
 1.9164  0.1174  0.0359 -0.6382  1.0067
 0.7691  0.4132 -0.2730  0.3485 -0.0722
-1.3032  1.4591  0.9668  0.8160 -0.3060
[torch.FloatTensor of size 5x5]


-0.5994 -0.2923  1.2099  0.0675 -0.5873
 0.0973  1.3780  1.6060  1.1143  1.9891
 1.9164  0.1174  0.0359 -0.6382  1.0067
 0.7691  0.4132 -0.2730  0.3485 -0.0722
-1.3032  1.4591  0.9668  0.8160 -0.3060
[torch.FloatTensor of size 5x5]


-0.1464 -0.7828  1.5252  0.6232 -0.8830
-0.5210 -0.3740  2.3736  1.2032  1.3623
 3.2040 -0.9582 -1.0819 -0.2666  0.8687
 0.6533 -0.5783 -1.6463  0.2718 -0.4932
-1.5652  2.4988  0.1165  0.0668 -1.9983
[torch.FloatTensor of size 5x5]
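
One way to see the difference between the two kinds of operations is to check whether the underlying memory address changes. This is a small sketch using `data_ptr()`, assuming a recent PyTorch release:

```python
import torch

x = torch.ones(2, 2)
y = torch.full((2, 2), 3.0)

z = x.add(y)        # out-of-place: x is untouched, z is a new tensor
print(x[0, 0].item(), z[0, 0].item())

ptr_before = x.data_ptr()
x.add_(y)           # in-place: x now holds x + y
print(x[0, 0].item(), x.data_ptr() == ptr_before)
```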



## Automatic Differentiation

Cool, so the tensor object is basically the same as a numpy array. What's neat is that you can define **computation graphs** with tensors and backpropagate gradients through this computational graph automatically. This is known as automatic differentiation. To do this, however, we need to use a `Variable` object, which is basically a wrapper around a tensor object.

In [13]:
from torch.autograd import Variable
x = Variable(torch.randn(5), requires_grad=True) #requires grad means that we want to calculate gradients
print(x.data) #a tensor object
print(x.grad) #the gradient. right now we haven't calculated any gradients so there is None


 0.2476
 1.9059
-0.7489
 0.9712
 1.0852
[torch.FloatTensor of size 5]

None


Let's define a simple computation graph

In [20]:
x = Variable(torch.randn(5), requires_grad=True) 
y = x.mul(2) #y is now a variable. almost any operation you can do on tensors you can do on Variables
print(y)
y.backward(Variable(torch.ones(y.size()))) #y is not a scalar, so backward needs a vector to backpropagate; we pass a vector of ones,
#which computes the gradient of sum(y) with respect to x
print(x.grad) #this should be a vector of 2s

Variable containing:
 1.8822
 2.7576
 2.7958
 1.4661
-2.6445
[torch.FloatTensor of size 5]

Variable containing:
 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]
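
The vector passed to `backward` acts as a weighting of the outputs: the call accumulates the vector-Jacobian product into `x.grad`. The sketch below assumes PyTorch 0.4 or later, where `Variable` has been merged into `Tensor` and a tensor can carry `requires_grad` directly:

```python
import torch

x = torch.randn(5, requires_grad=True)
y = x * 2

# backward(v) accumulates v^T (dy/dx) into x.grad; with v = ones this is
# the gradient of y.sum().
y.backward(torch.ones_like(y))
grad_ones = x.grad.clone()
print(grad_ones)     # a vector of 2s

# A different v weights each output differently.
x.grad.zero_()
v = torch.tensor([1.0, 0.0, 0.0, 0.0, 3.0])
y = x * 2            # rebuild the graph before a second backward pass
y.backward(v)
print(x.grad)        # 2 * v
```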



Now in practice, we almost always calculate the derivative with respect to a **scalar**. In this case we can simply call `.backward()` without passing in a vector of ones.

In [19]:
#x = Variable(torch.randn(5), requires_grad=True)
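#gradients are accumulated across backward calls, so zero them out manually (x.grad.data.zero_()) to start fresh;
#otherwise each run of this cell adds another 2 to every entry of x.grad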
y = x.mul(2).sum() # now y is a scalar. In most cases this would be your average loss
y.backward() #this is equivalent to y.backward(Variable(torch.ones(y.size())))
print(x.grad)
print(x)

Variable containing:
 6
 6
 6
 6
 6
[torch.FloatTensor of size 5]

Variable containing:
 0.1798
 0.0978
 1.1039
-0.6699
 1.0253
[torch.FloatTensor of size 5]
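
The 6s printed above are what accumulation looks like: the same gradient of 2 has been added to `x.grad` three times. A short sketch that reproduces this, assuming PyTorch 0.4 or later (tensors with `requires_grad` instead of `Variable`):

```python
import torch

x = torch.randn(5, requires_grad=True)

for step in range(3):
    y = (x * 2).sum()
    y.backward()
    print(step, x.grad)   # 2s, then 4s, then 6s

accumulated = x.grad.clone()
x.grad.zero_()            # reset before the next backward pass
(x * 2).sum().backward()
print(x.grad)             # back to 2s
```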



## Linear Regression

Let's fit a simple least squares model on a synthetic dataset with gradient descent. First let's generate synthetic data.

In [21]:
num_points = 1000
num_features = 5
w_star = torch.randn(num_features) #true weight vector
x = torch.randn(num_points, num_features) #input data
y = torch.mm(x, w_star.view(num_features, 1)) #torch.mm expects both inputs to be matrices, so we reshape w_star into a column vector

This problem has an analytic solution which we can calculate directly.

In [22]:
w_ols = torch.mm(torch.mm(torch.inverse(torch.mm(x.t(), x)), x.t()), y) #(X^T X)^{-1} X^T Y
print(w_ols)
print(w_star)


 1.5207
-0.0331
 0.2725
-0.9453
 1.2381
[torch.FloatTensor of size 5x1]


 1.5207
-0.0331
 0.2725
-0.9453
 1.2381
[torch.FloatTensor of size 5]
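
Forming the inverse explicitly is fine for a 5 x 5 problem, but solving the normal equations as a linear system is usually the more numerically careful route. A sketch of that alternative, assuming PyTorch 1.8 or later where `torch.linalg.solve` is available (the seed and variable setup are repeated so the block runs on its own):

```python
import torch

torch.manual_seed(0)
num_points, num_features = 1000, 5
w_star = torch.randn(num_features)
x = torch.randn(num_points, num_features)
y = x @ w_star.view(num_features, 1)

# Solve (X^T X) w = X^T y directly rather than forming the inverse.
w_ols = torch.linalg.solve(x.t() @ x, x.t() @ y)
print(w_ols.view(-1))
print(w_star)
```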



But let's see if we can obtain (approximately) the same solution with gradient descent.

In [23]:
w_sgd = Variable(torch.randn(num_features), requires_grad=True) #randomly initialize
x_sgd = Variable(x) #remember, we need to convert everything to Variables to work with automatic differentiation
y_sgd = Variable(y) #we don't need to calculate gradients with respect to these guys

num_iters = 50
learning_rate = 0.1

for i in range(num_iters):
    if w_sgd.grad is not None:
        w_sgd.grad.zero_() #gradients get accumulated so we have to manually zero them out
    y_pred = torch.mm(x_sgd, w_sgd.unsqueeze(1)) #unsqueeze adds an extra dimension
    error = (y_sgd - y_pred)**2
    error_avg = error.mean()
    if i % 10 == 0:
        print(i, error_avg.data[0])
    error_avg.backward()    
    w_sgd.data = w_sgd.data - learning_rate*w_sgd.grad.data
        
print(w_sgd)
print(w_star)

0 15.81375503540039
10 0.18027447164058685
20 0.002335350727662444
30 3.43216561304871e-05
40 5.65595200896496e-07
Variable containing:
 1.5206
-0.0331
 0.2725
-0.9453
 1.2381
[torch.FloatTensor of size 5]


 1.5207
-0.0331
 0.2725
-0.9453
 1.2381
[torch.FloatTensor of size 5]
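
For readers on PyTorch 0.4 or later, the same loop can be written without `Variable` or `.data`. This is a sketch under that version assumption, not a replacement for the cell above; the parameter update is wrapped in `torch.no_grad()` so that it is not recorded in the computation graph:

```python
import torch

torch.manual_seed(0)
num_points, num_features = 1000, 5
w_star = torch.randn(num_features)
x = torch.randn(num_points, num_features)
y = x @ w_star

w = torch.randn(num_features, requires_grad=True)
learning_rate = 0.1

for i in range(100):
    loss = ((y - x @ w) ** 2).mean()
    loss.backward()
    with torch.no_grad():          # the update itself should not be tracked
        w -= learning_rate * w.grad
    w.grad.zero_()                 # gradients accumulate, so reset them

print(loss.item())
print(w.detach())
print(w_star)
```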



Ok, that worked, but it's a bit annoying to separately define the weights as `Variables` and manually apply matrix multiplies. Fortunately, torch provides an `nn` package with abstractions for almost all of the layers that we will use in the course. Let's see a few examples.

In [27]:
import torch.nn as nn

lin = nn.Linear(num_features, 1, bias=True) #this defines a linear layer that maps num_features dimensions to 1
print(lin.weight) #weight parameter of the linear layer. if bias = True, then we can access bias with lin.bias
print(lin.bias)
y_pred = lin(x_sgd) #lin(x_sgd) automatically calls forward on the input x_sgd
print(y_pred)
print(lin.forward(x_sgd))

Parameter containing:
 0.1456  0.3603  0.2571 -0.2601 -0.1434
[torch.FloatTensor of size 1x5]

Parameter containing:
 0.1307
[torch.FloatTensor of size 1]

Variable containing:
 1.0946
-0.0716
-1.1583
   ⋮    
-0.8441
 0.6244
-0.0349
[torch.FloatTensor of size 1000x1]

Variable containing:
 1.0946
-0.0716
-1.1583
   ⋮    
-0.8441
 0.6244
-0.0349
[torch.FloatTensor of size 1000x1]
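
Note that the weight is stored with shape 1x5, that is, output dimension first. A quick sketch (assuming a recent PyTorch release, with hypothetical small sizes) that checks the layer against the explicit formula `x W^T + b`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(5, 1, bias=True)
x = torch.randn(4, 5)

out = lin(x)                              # shape (4, 1)
manual = x @ lin.weight.t() + lin.bias    # weight has shape (1, 5)
print(out.shape, torch.allclose(out, manual, atol=1e-6))
```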



The list of `nn` layers can be found [here](http://pytorch.org/docs/master/nn.html). You should try to familiarize yourself with pretty much all the `nn` layers. Torch also provides an `optim` package that makes parameter updates easier. We will be working with SGD in this tutorial, but other optimization algorithms (Adam, Adagrad, RMSProp, etc.) can be found [here](http://pytorch.org/docs/master/optim.html). Let's try to fit the linear regression model with these abstractions now.

In [28]:
optimizer = torch.optim.SGD(lin.parameters(), lr = learning_rate) #initialize with parameters
for i in range(num_iters):
    optimizer.zero_grad() #this will zero out the gradients of all your parameters
    y_pred = lin(x_sgd)
    error = (y_sgd - y_pred)**2
    error_avg = error.mean()    
    error_avg.backward()    
    optimizer.step()    
print(lin.weight)

Parameter containing:
 1.5206 -0.0331  0.2725 -0.9453  1.2381
[torch.FloatTensor of size 1x5]



## MNIST Classification

First, download the data.

In [29]:
#conda install torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms
train_dataset = datasets.MNIST(root='./data/',
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)
val_dataset = datasets.MNIST(root='./data/',
                           train=False, 
                           transform=transforms.ToTensor())

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


In [30]:
image, label = train_dataset[0]
print(image.size())
print(label)

torch.Size([1, 28, 28])
5


Torch provides a loader around datasets so you can create and access mini-batches easily.

In [31]:
batch_size = 100
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)
for (image, label) in train_loader: #iterates through the dataset in mini-batches
    print(image.size())
    print(label.size())
    break

torch.Size([100, 1, 28, 28])
torch.Size([100])
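
The same loader works with any dataset object, not just MNIST. As an illustration that needs no download, here is a sketch with a synthetic `TensorDataset` (the tensors are random stand-ins, not real images), which also shows what happens when the dataset size is not a multiple of the batch size:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for MNIST: 250 fake "images" with integer labels.
images = torch.randn(250, 1, 28, 28)
labels = torch.randint(0, 10, (250,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=100, shuffle=True)
batch_sizes = [image.size(0) for image, label in loader]
print(batch_sizes)   # the last batch is smaller: [100, 100, 50]
```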


Great, now we are ready to create our first model, which will be a simple multilayer perceptron. To do this, it helps to define a `class` with all the layers inside it.

In [35]:
class MLP(nn.Module):
    def __init__(self, num_layers = 1, input_dim = 28*28, 
                 output_dim = 10, hidden_dim = 10):
        super(MLP, self).__init__()                
        self.hidden_to_output = nn.Linear(hidden_dim, output_dim)
        hidden_layers = []
        for l in range(num_layers):
            dim = input_dim if l == 0 else hidden_dim #first layer is input to hidden layer
            hidden_layers.append(nn.Linear(dim, hidden_dim))
            hidden_layers.append(nn.ReLU()) #let's work with relu nonlinearities for now
        self.hidden_layers = nn.Sequential(*hidden_layers) #Sequential module will apply the layers in sequence
        self.logsoftmax = nn.LogSoftmax() #softmax will turn the output into probabilities, but log is more convenient
        
    def forward(self, x): #calling mlp(x) on an instance runs mlp.forward(x)
        x_flatten = x.view(x.size(0), 28*28) #need to flatten batch_size x 1 x 28 x 28 to batch_size x 28*28
        out = self.hidden_layers(x_flatten)
        out = self.hidden_to_output(out) #you can redefine variables
        return self.logsoftmax(out)
        
mlp = MLP(num_layers = 1)
print(mlp)
for p in mlp.parameters(): #all the parameters that were defined inside the module can be accessed like this
    print(p.size())

MLP(
  (hidden_to_output): Linear(in_features=10, out_features=10)
  (hidden_layers): Sequential(
    (0): Linear(in_features=784, out_features=10)
    (1): ReLU()
  )
  (logsoftmax): LogSoftmax()
)
torch.Size([10, 10])
torch.Size([10])
torch.Size([10, 784])
torch.Size([10])


Using the network is easy.

In [36]:
y_pred = mlp(Variable(image))
print(y_pred) #these will be log probabilities over each of the 10 classes
print(y_pred[0].exp()) #let's make sure

Variable containing:
-2.2134 -2.5517 -2.4334  ...  -2.0741 -2.5961 -2.3051
-2.1313 -2.5291 -2.4315  ...  -2.1081 -2.5845 -2.2901
-2.2552 -2.5871 -2.3916  ...  -2.1364 -2.4609 -2.4157
          ...             ⋱             ...          
-2.1082 -2.5726 -2.4199  ...  -2.1017 -2.6651 -2.3162
-2.2398 -2.4848 -2.4240  ...  -2.1305 -2.5736 -2.3076
-2.1989 -2.5101 -2.3996  ...  -2.0990 -2.5569 -2.3216
[torch.FloatTensor of size 100x10]

Variable containing:
 0.1093
 0.0780
 0.0877
 0.0780
 0.1283
 0.0873
 0.1315
 0.1257
 0.0746
 0.0997
[torch.FloatTensor of size 10]
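
To confirm that the exponentiated outputs form a probability distribution, one can check that each row sums to 1. This sketch uses random logits rather than the network above and assumes a recent PyTorch release, where passing `dim` to `nn.LogSoftmax` explicitly is the recommended usage:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 10)               # a batch of 3 examples, 10 classes
log_probs = nn.LogSoftmax(dim=1)(logits)  # normalize across the class dimension

probs = log_probs.exp()
print(probs.sum(dim=1))                   # each row sums to 1 (up to rounding)
```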





It's also convenient to define a test function that we can call periodically to check performance.

In [39]:
criterion = nn.NLLLoss() #this is the negative log-likelihood for multi-class classification

def test(model, data):
    correct = 0.
    num_examples = 0.
    nll = 0.
    for (image, label) in data:
        image, label = Variable(image), Variable(label) #annoying, but necessary        
        y_pred = model(image) #use the model passed in, not the global mlp
        nll_batch = criterion(y_pred, label)
        nll += nll_batch.data[0] * image.size(0) #by default NLL is averaged over each batch
        y_pred_max, y_pred_argmax = torch.max(y_pred, 1) #prediction is the argmax
        correct += (y_pred_argmax.data == label.data).sum() 
        num_examples += image.size(0) 
    return nll/num_examples, correct/num_examples
nll, accuracy = test(mlp, val_loader)
print('Validation performance. NLL: %.4f, Accuracy: %.4f'% (nll, accuracy))



Validation performance. NLL: 0.3026, Accuracy: 0.9113
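
On PyTorch 0.4 or later, an evaluation loop of this kind is typically wrapped in `torch.no_grad()` so that no computation graph is built. The sketch below is one possible version under that assumption; `evaluate`, `toy_model`, and `toy_loader` are hypothetical names, and the data is random, so the printed accuracy is meaningless beyond showing that the function runs:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

def evaluate(model, loader, criterion):
    """Return (average NLL, accuracy) of `model` over every example in `loader`."""
    model.eval()
    total_nll, correct, num_examples = 0.0, 0, 0
    with torch.no_grad():                      # no graph needed for evaluation
        for image, label in loader:
            log_probs = model(image)
            # criterion averages over the batch, so weight by batch size
            total_nll += criterion(log_probs, label).item() * image.size(0)
            correct += (log_probs.argmax(dim=1) == label).sum().item()
            num_examples += image.size(0)
    return total_nll / num_examples, correct / num_examples

# Toy check on random data with a hypothetical one-layer model.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
toy_loader = DataLoader(TensorDataset(torch.randn(64, 1, 28, 28),
                                      torch.randint(0, 10, (64,))), batch_size=16)
nll, accuracy = evaluate(toy_model, toy_loader, nn.NLLLoss())
print('NLL: %.4f, Accuracy: %.4f' % (nll, accuracy))
```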


Now we are ready to train!

In [41]:
criterion = nn.NLLLoss()
mlp = MLP(num_layers = 1)
optim = torch.optim.SGD(mlp.parameters(), lr =0.5)
num_epochs = 20
for e in range(num_epochs):
    for (image, label) in train_loader:
        optim.zero_grad()
        image, label = Variable(image), Variable(label) 
        y_pred = mlp(image)
        nll_batch = criterion(y_pred, label)    
        nll_batch.backward()
        optim.step()
    nll_train, accuracy_train = test(mlp, train_loader) #evaluating on the full training set after every epoch is slow, so you would rarely do this in practice
    nll_val, accuracy_val = test(mlp, val_loader)
    print('Training performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_train, accuracy_train))
    print('Validation performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_val, accuracy_val))    



Training performance after epoch 1: NLL: 0.2917, Accuracy: 0.9130
Validation performance after epoch 1: NLL: 0.2904, Accuracy: 0.9154
Training performance after epoch 2: NLL: 0.2638, Accuracy: 0.9235
Validation performance after epoch 2: NLL: 0.2710, Accuracy: 0.9230
Training performance after epoch 3: NLL: 0.2402, Accuracy: 0.9292
Validation performance after epoch 3: NLL: 0.2436, Accuracy: 0.9272
Training performance after epoch 4: NLL: 0.2436, Accuracy: 0.9276
Validation performance after epoch 4: NLL: 0.2537, Accuracy: 0.9255


KeyboardInterrupt: 

Let's try with more layers and more hidden units.

In [42]:
mlp = MLP(num_layers = 3, hidden_dim = 100)
print(mlp)
optim = torch.optim.SGD(mlp.parameters(), lr = 0.5)
num_epochs = 20
for e in range(num_epochs):
    for (image, label) in train_loader:
        optim.zero_grad()
        image, label = Variable(image), Variable(label) 
        y_pred = mlp(image)
        nll_batch = criterion(y_pred, label)    
        nll_batch.backward()
        optim.step()
    nll_train, accuracy_train = test(mlp, train_loader) 
    nll_val, accuracy_val = test(mlp, val_loader)
    print('Training performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_train, accuracy_train))
    print('Validation performance after epoch %d: NLL: %.4f, Accuracy: %.4f'% (e+1, nll_val, accuracy_val))  

MLP(
  (hidden_to_output): Linear(in_features=100, out_features=10)
  (hidden_layers): Sequential(
    (0): Linear(in_features=784, out_features=100)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=100)
    (5): ReLU()
  )
  (logsoftmax): LogSoftmax()
)




Training performance after epoch 1: NLL: 0.1355, Accuracy: 0.9587
Validation performance after epoch 1: NLL: 0.1380, Accuracy: 0.9557
Training performance after epoch 2: NLL: 0.1054, Accuracy: 0.9677
Validation performance after epoch 2: NLL: 0.1329, Accuracy: 0.9601
Training performance after epoch 3: NLL: 0.0678, Accuracy: 0.9786
Validation performance after epoch 3: NLL: 0.0977, Accuracy: 0.9694
Training performance after epoch 4: NLL: 0.0552, Accuracy: 0.9826
Validation performance after epoch 4: NLL: 0.0988, Accuracy: 0.9696
Training performance after epoch 5: NLL: 0.0357, Accuracy: 0.9887
Validation performance after epoch 5: NLL: 0.0810, Accuracy: 0.9765
Training performance after epoch 6: NLL: 0.0445, Accuracy: 0.9857
Validation performance after epoch 6: NLL: 0.0929, Accuracy: 0.9736
Training performance after epoch 7: NLL: 0.0362, Accuracy: 0.9884
Validation performance after epoch 7: NLL: 0.0896, Accuracy: 0.9747
Training performance after epoch 8: NLL: 0.0380, Accuracy: 0.9

Great. For the rest of the section, try implementing a ConvNet. You may need to use more sophisticated optimization algorithms (`optim.Adam(model.parameters(), lr = 0.001)` should work well enough for most models).
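
As a possible starting point for the exercise, here is a rough sketch of a small ConvNet. The architecture (two 5x5 convolutions with 16 and 32 channels, each followed by ReLU and 2x2 max pooling) is an arbitrary assumption chosen for illustration, not a prescribed solution, and the code assumes a recent PyTorch release. The input is a random stand-in for an MNIST batch, used only to check shapes:

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # 1x28x28 -> 16x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x14x14
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # -> 32x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)   # flatten to batch_size x (32*7*7)
        return self.logsoftmax(self.classifier(out))

model = ConvNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
y_pred = model(torch.randn(100, 1, 28, 28))   # random stand-in for an MNIST batch
print(y_pred.size())
```

Because the model returns log probabilities, it can be trained with the same `nn.NLLLoss` criterion and training loop used for the MLP.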