(Optional) PyTorch Introduction
================
 
<div class="alert alert-info">
    <strong>Note:</strong> This exercise is optional and only serves as an introduction and cheatsheet to the general concepts of PyTorch.
</div>

PyTorch is a scientific computing package for Python:

-  Tensor and Neural Network computations (in particular deep learning)
-  Research oriented (in comparison to e.g. TensorFlow)
-  Dynamic computational graph (in comparison to e.g. TensorFlow)
-  “NumPy on the GPU”
-  Backend and API heavily inspired by the original Torch written in Lua

An in-depth tutorial of the concepts described in this notebook can be found [here](https://github.com/jcjohnson/pytorch-examples).

In [1]:
%matplotlib inline
import numpy as np
import torch

print(torch.__version__)  # This should print 0.4.0

0.4.0


Tensors
=====

The PyTorch `Tensor` class is very similar to the NumPy `ndarray` class. Their main distinction is the ability of PyTorch Tensors to be used on a GPU which lets them benefit from vastly accelerated and parallelized computations. In order to work with PyTorch it is crucial to understand the basic behavior of its `Tensor` class.

Let's start by allocating a regular `5x3` matrix `Tensor`. Its memory is left uninitialized, so the printed values are arbitrary:

In [40]:
x = torch.Tensor(5, 3)
print(x)

tensor(1.00000e-31 *
       [[-1.1037,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,     nan, -0.7269],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000]])
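
The values printed above are arbitrary because `torch.Tensor(5, 3)` only allocates memory. The following is a minimal sketch of constructors that make the initial contents explicit; the variable names are purely illustrative:

```python
import torch

# Allocated but uninitialized memory; the contents are arbitrary.
empty = torch.empty(5, 3)

# Explicitly initialized tensors.
zeros = torch.zeros(5, 3)
ones = torch.ones(5, 3)

# A tensor built from existing Python data; the dtype is inferred.
from_list = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

print(empty.size(), zeros.sum().item(), ones.sum().item())
print(from_list.dtype)
```

Preferring `torch.zeros` or `torch.ones` over uninitialized memory avoids accidentally reading garbage values such as the `nan` above.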


A matrix of the same shape can be initialized with random entries drawn uniformly from $[0, 1)$:



In [43]:
x = torch.rand(5, 3)
print(x)

tensor([[ 0.1671,  0.1817,  0.9722],
        [ 0.7967,  0.1151,  0.2003],
        [ 0.3129,  0.4862,  0.7979],
        [ 0.4785,  0.2378,  0.1379],
        [ 0.9201,  0.6908,  0.1795]])


The size of a `Tensor` can be retrieved with:



In [44]:
print(x.size())

torch.Size([5, 3])


<div class="alert alert-info">
    <h3>Note</h3>
    <p>In contrast to the static computational graph of, for example, TensorFlow, the dynamic graph of PyTorch allows information such as the size of a tensor to be retrieved at any time during runtime.</p>
</div>

Tensor Operations
--------

There are multiple syntaxes for `Tensor` operations. We illustrate the different options using `Tensor` addition as an example.

Regular (NumPy) syntax:



In [45]:
y = torch.rand(5, 3)
print(x + y)

tensor([[ 0.6971,  0.4786,  1.5201],
        [ 0.9817,  0.8346,  1.1267],
        [ 0.8129,  1.1197,  1.7121],
        [ 0.7151,  1.1437,  0.6138],
        [ 1.5725,  0.8295,  1.0015]])


PyTorch syntax:



In [46]:
print(torch.add(x, y))

tensor([[ 0.6971,  0.4786,  1.5201],
        [ 0.9817,  0.8346,  1.1267],
        [ 0.8129,  1.1197,  1.7121],
        [ 0.7151,  1.1437,  0.6138],
        [ 1.5725,  0.8295,  1.0015]])


PyTorch syntax with specific output variable:



In [47]:
result = torch.Tensor(5, 3)
torch.add(x, y, out=result)
print(result)

tensor([[ 0.6971,  0.4786,  1.5201],
        [ 0.9817,  0.8346,  1.1267],
        [ 0.8129,  1.1197,  1.7121],
        [ 0.7151,  1.1437,  0.6138],
        [ 1.5725,  0.8295,  1.0015]])


PyTorch syntax for in-place operations:

In [48]:
# adds x to y
y.add_(x)
print(y)

tensor([[ 0.6971,  0.4786,  1.5201],
        [ 0.9817,  0.8346,  1.1267],
        [ 0.8129,  1.1197,  1.7121],
        [ 0.7151,  1.1437,  0.6138],
        [ 1.5725,  0.8295,  1.0015]])


<div class="alert alert-info">
    <h3>Note</h3>
    <p>Any operation that mutates a `Tensor` in-place is post-fixed with an ``_``.</p>
    <p>For example, ``x.copy_(y)`` copies ``y`` into ``x``, and ``x.t_()`` transposes ``x`` in place.</p>
</div>
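
To make the difference between out-of-place and in-place calls concrete, here is a small sketch. The tensors `x`, `y`, `z` and `m` are illustrative and unrelated to the cells above:

```python
import torch

x = torch.ones(2, 3)
y = torch.zeros(2, 3)

# Out-of-place: returns a new tensor, x is unchanged.
z = x.add(1)

# In-place: the trailing underscore mutates the tensor it is called on.
y.copy_(x)      # y now holds the values of x
y.add_(1)       # y is modified directly

m = torch.ones(2, 3)
m.t_()          # in-place transpose of a 2D tensor

print(x)        # still all ones
print(y)        # all twos
print(m.size())
```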

`Tensor` indexing works just like standard NumPy indexing. PyTorch also supports NumPy-style `Tensor` [broadcasting](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html).



In [49]:
print(x[:, 1])

tensor([ 0.1817,  0.1151,  0.4862,  0.2378,  0.6908])
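
Broadcasting and a few more indexing patterns can be sketched as follows. The shapes are chosen only for illustration:

```python
import torch

x = torch.arange(6, dtype=torch.float32).view(2, 3)  # shape (2, 3)
row = torch.tensor([10.0, 20.0, 30.0])               # shape (3,)
col = torch.tensor([[1.0], [2.0]])                   # shape (2, 1)

# row is stretched along dim 0, col along dim 1.
added = x + row
scaled = x * col
print(added)
print(scaled)

# NumPy-style slicing and boolean masks.
print(x[0, 1:])
print(x[x > 2])
```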


NumPy: There and back again
---------------------------

Converting a PyTorch `Tensor` to a NumPy `ndarray` and vice versa is very simple. The `Tensor` and the `ndarray` will share the location of the underlying memory, and changing one will also change the other.

Converting a `Tensor` to a `ndarray` works by simply calling the `Tensor.numpy()` method:

In [50]:
a = torch.ones(5)
b = a.numpy()
print(a)
print(b)

tensor([ 1.,  1.,  1.,  1.,  1.])
[1. 1. 1. 1. 1.]


Changing the `Tensor` affects the `ndarray` as well:

In [51]:
a.add_(1)
print(a)
print(b)

tensor([ 2.,  2.,  2.,  2.,  2.])
[2. 2. 2. 2. 2.]


The conversion from a `ndarray` to a `Tensor` is just as simple and holds the same properties:

In [52]:
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)

[2. 2. 2. 2. 2.]
tensor([ 2.,  2.,  2.,  2.,  2.], dtype=torch.float64)


Every `Tensor` allocated on the CPU (except the `torch.CharTensor`) supports converting to
NumPy and back.
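
If the shared memory is not wanted, an explicit copy breaks the link. A minimal sketch, where the names `shared` and `copied` are illustrative:

```python
import numpy as np
import torch

a = np.zeros(3)
shared = torch.from_numpy(a)          # shares memory with a
copied = torch.from_numpy(a).clone()  # independent copy

a += 1  # in-place update of the NumPy array

print(shared)  # reflects the update
print(copied)  # still zeros
```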

Tensors on the GPU
------------------

PyTorch Tensors can be moved onto a GPU using the ``Tensor.to()`` method. Before converting a GPU `Tensor` to NumPy it has to be moved back to the CPU by calling the ``Tensor.to()`` method again.



In [54]:
# first check if cuda is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device == torch.device("cuda:0"):
    x = x.to(device)
    y = y.to(device)
    z = x + y
    
    print(z)
    print(z.to("cpu").detach().numpy())
else:
    print("CUDA not available.")

CUDA not available.
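
A common way to write code that runs with or without a GPU is to select the device once and reuse it. This is only a sketch; the tensor names are illustrative:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(5, 3, device=device)   # created directly on the device
b = torch.rand(5, 3).to(device)       # created on the CPU, then moved
c = a + b                             # computed on the chosen device

# .cpu() returns the tensor itself if it already lives on the CPU,
# so this line works in both cases.
c_np = c.cpu().numpy()
print(c.device, c_np.shape)
```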


More on PyTorch Tensors
-----------------------

The documentation of many more `Tensor` operations, including transposing, indexing, slicing, mathematical operations, linear algebra, and random numbers, can be found [here](http://pytorch.org/docs/torch).


Autograd - automatic differentiation
===================================

Central to all neural networks in PyTorch is the ``autograd`` package. The package provides automatic differentiation for all operations on Tensors. PyTorch is a define-by-run framework, which means that the calculation of gradients (e.g. during backpropagation) is defined at runtime and can be different at every single iteration.

Since PyTorch 0.4.0, the `Tensor` class subsumes the former `Variable` class and supports nearly all of its operations. Once a computation graph for a Tensor that requires gradients has been executed, the ``Tensor.backward()`` method can be used to automatically compute all the gradients.

If the ``Tensor`` is not a scalar, the ``backward()`` method requires an additional ``gradient`` argument which matches the shape of the ``Tensor``. ``gradient`` is supposed to be the gradient w.r.t. the given output. For a scalar ``Tensor``, ``gradient`` is assumed to be `torch.tensor(1.0)`.

The `autograd` package additionally provides a `Function` class which encodes a complete history of computation. Each `Tensor` with the `requires_grad` attribute set has a ``Tensor.grad_fn`` attribute which references the ``Function`` (e.g. an operation such as addition) that created the respective ``Tensor`` and thereby determines its gradient. For Tensors that were created by the user and not as a result of an operation, the ``grad_fn`` attribute is ``None``.

The following simple examples will illustrate the basic concepts of the ``autograd`` package.

In [71]:
# In general, tensors don't track gradients
x = torch.ones(1)
x.requires_grad

False

In [62]:
# Thus we can't call the backward function on these tensors
try:
    x.backward()
except RuntimeError:
    print("Doesn't work...")

Doesn't work...


In [73]:
# Enable gradient tracking
x = torch.ones((2, 2), requires_grad=True)
print(x)

tensor([[ 1.,  1.],
        [ 1.,  1.]])


Apply an operation to the `Tensor`:



In [74]:
y = x + 2
print(y)

tensor([[ 3.,  3.],
        [ 3.,  3.]])


Since ``y`` was created as a result of an operation, it has a ``grad_fn`` attribute (`Function`) that is not `None`:



In [75]:
print(y.grad_fn)

<AddBackward0 object at 0x7fe3baf20358>


Applying more operations to `y` extends the computational graph:



In [76]:
z = y * y * 3
out = z.mean()

print(z)
print(out)

tensor([[ 27.,  27.],
        [ 27.,  27.]])
tensor(27.)


Gradients
---------
The gradient w.r.t. the input `x` can now be computed (backpropagated) with ``out.backward()``. Remember that for a scalar this is equivalent to calling ``out.backward(torch.tensor(1.0))``.



In [77]:
out.backward()

The input `x` was a `2x2` `Tensor` and therefore $\frac{d(out)}{dx}$ yields a matrix with the same shape:



In [78]:
print(x.grad)

tensor([[ 4.5000,  4.5000],
        [ 4.5000,  4.5000]])


For such a small computation graph the solution can easily be verified:

The output as a function of the input is given by
$$
\begin{align}
    out &= \frac{1}{4}\sum_i z_i \\
        &= \frac{1}{4}\sum_i 3y_i y_i \\
        &= \frac{1}{4}\sum_i 3(x_i+2)^2.
\end{align}
$$

Therefore the gradient is $\frac{\partial out}{\partial x_i} = \frac{3}{2}(x_i+2)$, which yields
$\frac{\partial out}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$ for a particular input $x_i=1$.
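
The closed-form gradient can also be checked numerically for inputs other than $x_i = 1$. A small sketch with an arbitrarily chosen input; `analytic` is an illustrative name:

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Closed-form gradient derived above: 3/2 * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```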



The `autograd` package in combination with the dynamic graph structure allows graphs whose shape depends on the data, such as a loop whose number of iterations is only known at runtime:



In [85]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x * 2
while y.norm() < 1000:
    y = y * 2
    

print(y)

tensor([-0.5074,  0.8810,  1.2325])
tensor([ -519.6064,   902.0929,  1262.0619])
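
Because `y` is not a scalar here, backpropagating through this loop requires passing a gradient of the same shape to ``backward()``. The following sketch uses a fixed input so that the result is reproducible; the `weights` vector is an arbitrary illustrative choice:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
steps = 1
while y.norm() < 1000:
    y = y * 2
    steps += 1

# y is a vector, so backward() needs a gradient of the same shape.
weights = torch.tensor([0.1, 1.0, 0.0001])
y.backward(weights)

# y = 2**steps * x, hence x.grad = 2**steps * weights
print(steps, x.grad)
```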



Neural Networks
===============

The `Tensor` class in combination with the `autograd` package forms the foundation for constructing Neural Networks (NNs) with PyTorch. To further facilitate the construction and training of a NN, the ``torch.nn`` package, which depends on `autograd` to define NN models and differentiate them, includes additional NN-specific classes and helper functions.

One example is the ``nn.Module`` class, which serves as the base class for NN models. It contains all the individual layers and the ``Module.forward(x)`` method, which passes the input ``x`` through the network and returns its output.

The following is a graphical illustration of the famous *LeNet* NN from Yann LeCun. This NN was trained to classify the MNIST dataset of handwritten digit images:

It is a simple feed-forward network which takes the input, feeds it through several layers one after the other, and then finally produces the classification output.


Define *LeNet* with PyTorch
--------------------------

The following is an example implementation of the classification network above: 



In [29]:
import numpy as np
a = np.arange(81).reshape(3,3,3,3)
print(a)

[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]

  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]

  [[18 19 20]
   [21 22 23]
   [24 25 26]]]


 [[[27 28 29]
   [30 31 32]
   [33 34 35]]

  [[36 37 38]
   [39 40 41]
   [42 43 44]]

  [[45 46 47]
   [48 49 50]
   [51 52 53]]]


 [[[54 55 56]
   [57 58 59]
   [60 61 62]]

  [[63 64 65]
   [66 67 68]
   [69 70 71]]

  [[72 73 74]
   [75 76 77]
   [78 79 80]]]]


In [44]:
a[0][0][2][0]

6

In [46]:
b = a.transpose(0,2,3,1).reshape(3*3*3,3)


In [47]:
b

array([[ 0,  9, 18],
       [ 1, 10, 19],
       [ 2, 11, 20],
       [ 3, 12, 21],
       [ 4, 13, 22],
       [ 5, 14, 23],
       [ 6, 15, 24],
       [ 7, 16, 25],
       [ 8, 17, 26],
       [27, 36, 45],
       [28, 37, 46],
       [29, 38, 47],
       [30, 39, 48],
       [31, 40, 49],
       [32, 41, 50],
       [33, 42, 51],
       [34, 43, 52],
       [35, 44, 53],
       [54, 63, 72],
       [55, 64, 73],
       [56, 65, 74],
       [57, 66, 75],
       [58, 67, 76],
       [59, 68, 77],
       [60, 69, 78],
       [61, 70, 79],
       [62, 71, 80]])

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(7)

class LeNet(nn.Module):

    def __init__(self):
        """
        Class constructor which preinitializes NN layers with trainable
        parameters.
        """
        super(LeNet, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # conv kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        #self.conv1.weight.data.mul_(0.001)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        """
        Forwards the input x through each of the NN layers and outputs the result.
        """
        print("begin", x.size())
        # Convolution followed by a ReLU non-linearity
        x = F.relu(self.conv1(x))
        print("conv1 - relu", x.size())
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(x, (2, 2))
        print("conv1 - relu - max_pool2", x.size())
        x = F.relu(self.conv2(x))
        print("conv2 - relu", x.size())

        # For a square window a single number is sufficient
        x = F.max_pool2d(x, 2)
        print("conv2 - relu - max_pool2", x.size())
        
        #print(x)
        # An efficient transition from spatial conv layers to flat 1D fully 
        # connected layers is achieved by only changing the "view" on the
        # underlying data and memory structure.
        print(x.size(), self.num_flat_features(x))
        x = x.view(-1, self.num_flat_features(x))
        #print(x)
        print(x.size())
        x = F.relu(self.fc1(x))
        #print(x)
        x = F.relu(self.fc2(x))
        #print(x)
        x = self.fc3(x)
        #print(x)
        return x

    def num_flat_features(self, x):
        """
        Computes the number of features if the spatial input x is transformed
        to a 1D flat input.
        """
        print(x.size())
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = LeNet()
print(net)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


In [48]:
print(net.conv1.weight.size())

torch.Size([6, 1, 5, 5])


Thanks to the `autograd` package, a NN merely requires the definition of the ``Module.forward()`` method. The ``.backward()`` function (which backpropagates the gradients) is automatically defined. Any `Tensor` operation is allowed in the ``forward`` function.

The learnable parameters of a model are returned by ``Module.parameters()``:



In [4]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight
print(params[1].size()) 
print(params[2].size()) 
print(params[3].size()) 
print(params[4].size()) 
print(params[5].size()) 
print(params[6].size()) 
print(params[7].size()) 
print(params[8].size()) 


#print(params)

10
torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])


Check input and output

In [28]:
x = torch.randn((1, 1, 32, 32))
output = net(x)
print(output)

begin torch.Size([1, 1, 32, 32])
conv1 - relu torch.Size([1, 6, 28, 28])
conv1 - relu - max_pool2 torch.Size([1, 6, 14, 14])
conv2 - relu torch.Size([1, 16, 10, 10])
conv2 - relu - max_pool2 torch.Size([1, 16, 5, 5])
torch.Size([1, 16, 5, 5])
torch.Size([1, 16, 5, 5]) 400
torch.Size([1, 16, 5, 5])
torch.Size([1, 400])
tensor([[ 0.1583,  0.0690,  0.0635, -0.0333,  0.0208,  0.0295, -0.1378,
         -0.0389,  0.0911, -0.0405]])



Before backpropagating, for example, a random gradient, the gradient buffers of all parameters should be set to zero:



In [136]:
net.zero_grad()
output.backward(torch.randn(1, 10))

<div class="alert alert-info">
    <h3>Note</h3>
    <p>Calling the ``Tensor.backward()`` method a second time before new inputs are forwarded will throw an error. This is because PyTorch deletes all the intermediate results in order to reduce memory consumption. Calling the ``.backward()`` method with the `retain_graph=True` argument keeps those results.
    </p>
</div>
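
The effect of `retain_graph=True` can be reproduced on a tiny graph. This is a minimal sketch; the variable `failed` only exists to record whether the error occurred:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (x * x).sum()

out.backward(retain_graph=True)  # keep the intermediate results
out.backward()                   # second pass works, results are freed afterwards
print(x.grad)                    # gradients of both passes are accumulated

try:
    out.backward()               # the graph has been freed by now
    failed = False
except RuntimeError:
    failed = True
print(failed)
```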

<div class="alert alert-info">
    <h3>Note</h3>
    <p>The entire ``torch.nn`` package only supports inputs that are a mini-batch of samples, and not a single sample.

    For example, ``nn.Conv2d`` will take in a 4D Tensor of ``nSamples x nChannels x Height x Width``.

    If you have a single sample, just use ``x.unsqueeze(0)`` to add a fake batch dimension.
    </p>
</div>
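
Adding the batch dimension to a single sample can be sketched as follows; the layer and tensor names are illustrative:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)
sample = torch.randn(1, 32, 32)  # nChannels x Height x Width
batch = sample.unsqueeze(0)      # 1 x nChannels x Height x Width
out = conv(batch)

print(batch.size(), out.size())
```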

Loss Function
-------------
A loss function takes the (output, target) pair as inputs, and computes a value that estimates "how far away" the output is from the target.

There are several different loss functions predefined under the `torch.nn` package. An example of a simple loss is the ``nn.MSELoss`` which computes the mean-squared error between the input and the target value.

More examples of predefined losses are documented [here](http://pytorch.org/docs/nn.html#loss-functions).

An MSE loss example:

In [137]:
output = net(x)
target = torch.arange(1, 11, dtype=torch.float32).unsqueeze(0)  # a dummy target with 10 values
print(target)
criterion = nn.MSELoss()
print(criterion)

loss = criterion(output, target)
print(loss)

tensor([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.]])
MSELoss()
tensor(38.4517)
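
With its default settings, ``nn.MSELoss`` averages the squared differences over all elements. A sketch on hand-picked numbers, which are not related to the network output above, that compares it with a manual computation:

```python
import torch
import torch.nn as nn

output = torch.tensor([[0.5, 1.5, 2.0]])
target = torch.tensor([[1.0, 1.0, 4.0]])

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()  # (0.25 + 0.25 + 4.0) / 3

print(loss.item(), manual.item())
```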


When ``loss.backward()`` is called, the loss is differentiated w.r.t. the whole graph, and every leaf Tensor in the graph with `requires_grad=True` has the gradient accumulated into its ``Tensor.grad`` attribute.

Backpropagate the Loss
--------------------

A crucial step for optimizing the network weights is the backpropagation of the loss. The nature of a computational graph makes this as easy as calling ``loss.backward()``. But since the gradients are accumulated onto already existing gradients, one has to clear them first.

In [138]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([ 0.,  0.,  0.,  0.,  0.,  0.])
conv1.bias.grad after backward
tensor(1.00000e-02 *
       [ 4.0760,  5.8347,  1.8398,  6.2326, -0.8733, -5.4431])
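
The accumulation behaviour is easiest to see on a single tensor. A minimal sketch with an illustrative weight vector `w`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()     # gradient of the first pass

(w * 3).sum().backward()   # no reset in between
second = w.grad.clone()    # first and second pass summed up

w.grad.zero_()             # manual reset of the gradient buffer
print(first, second, w.grad)
```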


Weights Optimization
------------------
The simplest update rule used in practice for optimizing the weights of a NN is Stochastic Gradient Descent (SGD):

``weight = weight - learning_rate * gradient``

Like any other NN component the optimization step can be implemented with the basic PyTorch classes.

For example:

In [139]:
def sgd_step(net):
    learning_rate = 0.01
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)

However, the PyTorch framework contains a small optimization package called ``torch.optim``. It includes various predefined update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.

<div class="alert alert-info">
    <h3>Note</h3>
    <p>Common optimization options such as the L2-regularization (see ``weight_decay`` argument) are already included in the predefined optimization schemes.</p>
</div>

Using it is very simple:

In [141]:
import torch.optim as optim

# create an optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-3)
print(optimizer)
# a single step of an example training loop
optimizer.zero_grad()   # zero the gradient buffers
output = net(x)
print(output)
loss = criterion(output, target)
print(loss)
loss.backward()
print(loss)
print("first step")
optimizer.step()  # does the update based on the accumulated gradients
output = net(x)
print(output)
loss = criterion(output, target)
print(loss)

print("second step")
optimizer.step()  # reuses the gradients of the single backward pass above
output = net(x)
print(output)
loss = criterion(output, target)
print(loss)

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0
    nesterov: False
    weight_decay: 0.001
)
tensor([[-0.0232,  0.0689,  0.0891,  0.1467,  0.1607,  0.1010,  0.0694,
          0.0934,  0.1056,  0.2350]])
tensor(37.1328)
tensor(37.1328)
first step
tensor([[-0.0420,  0.0645,  0.1461,  0.2349,  0.2300,  0.1549,  0.1568,
          0.2146,  0.1840,  0.3427]])
tensor(36.2499)
second step
tensor([[-0.0671,  0.0553,  0.2144,  0.3445,  0.3169,  0.2301,  0.2654,
          0.3671,  0.2877,  0.4848]])
tensor(35.1325)
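
In an actual training loop the gradients are usually recomputed before every ``optimizer.step()``. The following is a minimal sketch of such a loop. It assumes a small hypothetical linear model and synthetic data instead of the LeNet above, so that it runs quickly on its own:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 1)                     # hypothetical stand-in for the network
inputs = torch.randn(8, 4)
targets = inputs.sum(dim=1, keepdim=True)   # synthetic regression target

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.02)

losses = []
for step in range(100):
    optimizer.zero_grad()                   # clear gradients of the previous step
    loss = criterion(model(inputs), targets)
    loss.backward()                         # fresh gradients for this step
    optimizer.step()                        # apply the update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The order `zero_grad()`, forward pass, `backward()`, `step()` is the part that carries over to larger models.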


Recap
==============

  -  ``torch.Tensor`` - A multi-dimensional array with a `requires_grad` option to record the history of operations applied to it.
  -  ``nn.Module`` - Neural network module. Convenient way of
     encapsulating parameters, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of `Tensor`, that is automatically
     registered as a parameter when assigned as an attribute to a
     ``Module``.
  -  ``autograd.Function`` - Implements forward and backward definitions
     of an autograd operation. Every ``Tensor`` operation that requires gradients creates at
     least one ``Function`` node that connects to the functions that
     created a ``Tensor`` and encodes its history.
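
As an illustration of the last item, a custom ``autograd.Function`` can be sketched as follows. The `Square` class is a hypothetical example and not part of PyTorch:

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example: y = x**2 with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)    # keep the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # chain rule: dL/dx = dL/dy * 2x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()

print(y.grad_fn)  # the Function node that created y
print(x.grad)
```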

<div class="alert alert-info">
    <h3>Note</h3>
    <p>The `torchvision` package includes many predefined helper functions specifically designed for solving computer vision problems.</p>
</div>