# Introduction to Deep Learning with PyTorch

In this notebook, you'll be introduced to [PyTorch](http://pytorch.org/), a framework for building and training neural networks. In many ways, PyTorch tensors behave like the arrays you love from Numpy; Numpy arrays, after all, are just tensors. PyTorch makes it simple to move these tensors to GPUs for the faster processing needed when training neural networks. It also provides a module that automatically calculates gradients (for backpropagation!) and another module specifically for building neural networks. Altogether, PyTorch fits more naturally with Python and the Numpy/Scipy stack than TensorFlow and other frameworks do.



<img src="assets/andrej.png" width=700px>

# Keras vs PyTorch

* Keras is without a doubt the easier option if you want a plug & play framework: to quickly build, train, and evaluate a model, without spending much time on mathematical implementation details.

* PyTorch offers a lower-level approach and more flexibility for the more mathematically-inclined users.

# Head to Head

Consider this head-to-head comparison of how a simple convolutional network is defined in Keras and PyTorch:

### Keras

In [None]:
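from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))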
model.add(MaxPool2D())
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

### PyTorch

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
    
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 16, 3)
        self.fc1 = nn.Linear(16 * 6 * 6, 10) 
        self.pool = nn.MaxPool2d(2, 2)
        
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 6 * 6)
        x = F.log_softmax(self.fc1(x), dim=-1)

        return x

model = Net()
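
The `16 * 6 * 6` in the PyTorch version is the flattened size of the feature map after the second pooling step. A minimal sketch of where that number comes from, assuming a single dummy 32x32 RGB image in PyTorch's channels-first layout (the Keras version uses channels-last); the standalone layers here mirror those in `Net` but are created only for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy batch: 1 image, 3 channels, 32x32 pixels (channels first)
x = torch.randn(1, 3, 32, 32)

conv1 = nn.Conv2d(3, 32, 3)    # 3x3 kernel, no padding: 32 -> 30
conv2 = nn.Conv2d(32, 16, 3)   # 3x3 kernel, no padding: 15 -> 13
pool = nn.MaxPool2d(2, 2)      # halves height and width (rounding down)

x = pool(F.relu(conv1(x)))
print(x.shape)                 # torch.Size([1, 32, 15, 15])

x = pool(F.relu(conv2(x)))
print(x.shape)                 # torch.Size([1, 16, 6, 6])

flat = x.view(-1, 16 * 6 * 6)
print(flat.shape)              # torch.Size([1, 576])
```

Keras infers these sizes automatically, which is part of why its definition is shorter; in PyTorch the input size of the first linear layer has to be worked out by hand, as above.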

## Neural Networks

Deep Learning is based on artificial neural networks which have been around in some form since the late 1950s. The networks are built from individual parts approximating neurons, typically called units or simply "neurons." Each unit has some number of weighted inputs. These weighted inputs are summed together (a linear combination) then passed through an activation function to get the unit's output.

<img src="assets/simple_neuron.png" width=400px>

Mathematically this looks like: 

$$
\begin{align}
y &= f(w_1 x_1 + w_2 x_2 + b) \\
y &= f\left(\sum_i w_i x_i +b \right)
\end{align}
$$

With vectors, the weighted sum $h$ (before the bias is added and the activation $f$ is applied) is the dot/inner product of two vectors:

$$
h = \begin{bmatrix}
x_1 \, x_2 \cdots  x_n
\end{bmatrix}
\cdot 
\begin{bmatrix}
           w_1 \\
           w_2 \\
           \vdots \\
           w_n
\end{bmatrix}
$$
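
The formulas above can be sketched for a single unit as follows. The input, weight, and bias values are made up for illustration, and the sigmoid is only an assumed choice for the activation $f$:

```python
import torch

# Hypothetical values, chosen only for illustration
x = torch.tensor([1.0, 2.0, 3.0])     # inputs x_1 .. x_n
w = torch.tensor([0.5, -0.25, 0.1])   # weights w_1 .. w_n
b = torch.tensor(0.2)                 # bias

h = torch.dot(x, w)                   # weighted sum as a dot product
h_explicit = (w * x).sum()            # the same sum written out term by term

y = torch.sigmoid(h + b)              # unit output y = f(h + b)

print(h.item(), h_explicit.item(), y.item())
```

Both ways of computing the weighted sum agree; the dot product is simply the more compact (and faster) form.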

## Tensors

It turns out neural network computations are just a bunch of linear algebra operations on *tensors*, a generalization of matrices. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is a 3-dimensional tensor (an RGB color image, for example). The tensor is the fundamental data structure for neural networks, and PyTorch (as well as pretty much every other deep learning framework) is built around it.
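
As a quick illustration of these dimensions, here is a small sketch that builds one tensor of each kind and inspects it. The sizes are arbitrary, and the channels x height x width layout for the image is an assumption based on PyTorch's usual convention:

```python
import torch

vector = torch.randn(5)           # 1-dimensional tensor
matrix = torch.randn(3, 4)        # 2-dimensional tensor
image = torch.randn(3, 32, 32)    # 3-dimensional tensor: channels x height x width

for name, t in [("vector", vector), ("matrix", matrix), ("image", image)]:
    # dim() is the number of indices; shape gives the size along each one
    print(name, t.dim(), tuple(t.shape))
```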

<img src="assets/tensor_examples.svg" width=600px>

With the basics covered, it's time to explore how we can use PyTorch to build a simple neural network.

In [1]:
import torch

def activation(x):
    """ Sigmoid activation function 
    
        Arguments
        ---------
        x: torch.Tensor
    """
    return 1/(1+torch.exp(-x))
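
A short usage sketch for this function, with made-up input values. It repeats the definition so that it runs on its own, and compares the result against PyTorch's built-in `torch.sigmoid`:

```python
import torch

def activation(x):
    """Sigmoid activation function (same definition as above)."""
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-2.0, 0.0, 2.0])   # hypothetical inputs
out = activation(x)
print(out)

# The hand-written version should agree with the built-in one
print(torch.allclose(out, torch.sigmoid(x)))
```

The sigmoid squashes any real input into the range (0, 1), with an input of 0 mapping to exactly 0.5.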

# Warm Up: NN using Numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [3]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500): #change as per convenience
  # Forward pass: compute predicted y
  h = x.dot(w1)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  
  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  
  # Backprop to compute gradients of loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
 
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

0 742.7541596357506
1 741.8979733544786
2 741.043367345029
3 740.1903377214019
4 739.3388806095695
5 738.4889921474314
6 737.6406684847705
7 736.7939057832097
8 735.9487002161668
9 735.1050479688124
10 734.2629452380262
11 733.4223882323525
12 732.583373171959
13 731.7458962885926
14 730.9099538255379
15 730.0755420375737
16 729.2426571909311
17 728.4112955632522
18 727.5814534435467
19 726.7531271321513
20 725.9263129406884
21 725.1010071920225
22 724.2772062202225
23 723.4549063705174
24 722.6341039992581
25 721.814795473874
26 720.9969771728356
27 720.1806454856112
28 719.3657968126291
29 718.552427565236
30 717.7405341656581
31 716.9301130469614
32 716.1211606530112
33 715.3136734384352
34 714.5076478685809
35 713.7030804194794
36 712.8999675778059
37 712.0983058408403
38 711.2980917164298
39 710.4993217229496
40 709.7019923892658
41 708.9061002546958
42 708.1116418689725
43 707.3186137922048
44 706.5270125948417
45 705.7368348576335
46 704.9480771715953
47 704.1607361379706
...
482 460.779985458144
483 460.3877285664482
484 459.9960313977991
485 459.60489289330496
486 459.2143119965814
487 458.82428765374397
488 458.4348188134005
489 458.0459044266448
490 457.65754344704897
491 457.2697348306568
492 456.8824775359766
493 456.495770523974
494 456.1096127580653
495 455.7240032041103
496 455.3389408304057
497 454.95442460767754
498 454.5704535090749
499 454.18702651016315
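
One way to gain confidence in a hand-written backward pass like the one above is to compare it against finite differences. The sketch below is not part of the original example: it reuses the same gradient formulas on a much smaller network (the sizes, the seed, and the helper names `loss_fn` and `numerical_grad` are chosen only for illustration) and checks them numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2   # tiny sizes so the check runs quickly

x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    """Forward pass and squared-error loss for the current w1, w2."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, using the same formulas as the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numerical_grad(w, eps=1e-6):
    """Central finite differences, perturbing one entry of w at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = loss_fn()
        w[idx] = old - eps
        minus = loss_fn()
        w[idx] = old                      # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_w1 = numerical_grad(w1)
num_w2 = numerical_grad(w2)

# Both maximum differences should be tiny if the backward pass is right
print(np.abs(num_w1 - grad_w1).max(), np.abs(num_w2 - grad_w2).max())
```

A check like this is far too slow for real networks, but it is a handy debugging tool whenever gradients are derived by hand.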


# PyTorch: NN using Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won't be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Any computation you might want to perform with numpy can also be accomplished with PyTorch Tensors; you should think of them as a generic tool for scientific computing.
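
To make the similarity concrete, here is a small sketch of moving data between the two libraries. The array contents are arbitrary; note that for CPU tensors, `torch.from_numpy` and `Tensor.numpy` share memory with the array rather than copying it:

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)    # Tensor view of the same memory as `a`
back = t.numpy()           # numpy view of the same memory again

t.mul_(2)                  # in-place change is visible through all three
print(a)
print(np.array_equal(a, back))

# Many operations have close counterparts in both libraries
print(np.allclose(a.dot(a.T), t.mm(t.t()).numpy()))
```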

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To place a Tensor on a GPU, pass the device argument when constructing it.
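
A minimal sketch of device placement, assuming you want to fall back to the CPU when no CUDA GPU is available (the tensor sizes are arbitrary):

```python
import torch

# Use the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1000, device=device)   # created directly on the device
w = torch.randn(1000, 100).to(device)      # or moved there after creation

h = x.mm(w)                                # runs on whichever device holds x and w
print(h.device, tuple(h.shape))
```

Both operands of an operation must live on the same device; mixing a CPU Tensor with a GPU Tensor raises an error.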

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we manually implement the forward and backward passes through the network, using operations on PyTorch Tensors:

In [7]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Backprop to compute gradients of loss with respect to w1 and w2
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

0 34768548.0
1 31551300.0
2 33164876.0
3 33016912.0
4 27827224.0
5 18581480.0
6 10324608.0
7 5259785.5
8 2820194.75
9 1712625.25
10 1187073.5
11 904270.0625
12 728773.1875
13 605760.0
14 512347.4375
15 438081.125
16 377513.9375
17 327269.65625
18 285114.78125
19 249548.5625
20 219243.265625
21 193288.0625
22 170939.234375
23 151608.21875
24 134826.0
25 120195.9921875
26 107396.453125
27 96160.203125
28 86275.640625
29 77553.5703125
30 69835.9375
31 62995.7265625
32 56912.8984375
33 51499.83984375
34 46666.41796875
35 42344.2578125
36 38472.7734375
37 34996.81640625
38 31873.78515625
39 29063.787109375
40 26530.9140625
41 24243.96875
42 22175.810546875
43 20300.99609375
44 18602.189453125
45 17061.767578125
46 15662.4169921875
47 14389.748046875
48 13231.091796875
49 12175.1455078125
50 11212.1171875
51 10332.736328125
52 9529.0966796875
53 8794.2255859375
54 8121.57421875
55 7502.99609375
56 6936.109375
57 6416.255859375
58 5939.1474609375
59 5500.90869140625
60 5097.92578125
61 4727.5
...
395 0.0007036046590656042
396 0.000682456127833575
397 0.0006622145883738995
398 0.0006434269016608596
399 0.0006244323449209332
400 0.0006048176437616348
401 0.0005881496472284198
402 0.0005728085525333881
403 0.0005575393442995846
404 0.0005416148924268782
405 0.0005273196729831398
406 0.000512141443323344
407 0.0004989765584468842
408 0.0004839517641812563
409 0.0004723244928754866
410 0.0004591959295794368
411 0.00044632560457102954
412 0.0004354284901637584
413 0.00042318148189224303
414 0.00041317573050037026
415 0.00040257518412545323
416 0.00039167635259218514
417 0.00038164720172062516
418 0.0003712526522576809
419 0.0003628657723311335
420 0.0003529731184244156
421 0.00034392467932775617
422 0.0003365102456882596
423 0.0003277621290180832
424 0.0003196651232428849
425 0.0003129307006020099
426 0.0003050506638828665
427 0.0002984178136102855
428 0.0002910658367909491
429 0.00028413295513018966
430 0.0002787251432891935
431 0.0002720911870710552
432 0.0002658977173268795
...

# Computational Graph and Autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. If we want to compute gradients with respect to some Tensor, then we set requires_grad=True when constructing that Tensor. Any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of some scalar value with respect to x.
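
A minimal sketch of this mechanism, using a made-up Tensor and the sum of its squares as the scalar value:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

loss = (x ** 2).sum()   # scalar built from x; the graph is recorded as we go
loss.backward()         # backpropagate through the recorded graph

print(x.grad)           # d(loss)/dx = 2 * x, i.e. tensor([2., 4., 6.])
```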

Sometimes you may wish to prevent PyTorch from building computational graphs when performing certain operations on Tensors with requires_grad=True; for example we usually don't want to backpropagate through the weight update steps when training a neural network. In such scenarios we can use the torch.no_grad() context manager to prevent the construction of a computational graph.
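
A small sketch of a single gradient descent step under `torch.no_grad()`; the weight values and learning rate are arbitrary:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                       # w.grad is now 2 * w

learning_rate = 0.1
with torch.no_grad():
    w -= learning_rate * w.grad       # in-place update, not recorded in a graph
    w.grad.zero_()                    # clear the gradient before the next step
    doubled = w * 2                   # results computed here are not tracked

print(w)
print(w.requires_grad)                # True: w itself is still a tracked leaf
print(doubled.requires_grad)          # False: created under no_grad
```

Zeroing the gradient matters because `backward()` accumulates into `.grad` rather than overwriting it.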

Here we use PyTorch Tensors and autograd on a small computational graph, shown in the figures below; autograd computes every gradient for us, so we no longer need to manually implement the backward pass:

<img src="assets/autograd1.png" width=700px>

<img src="assets/variable1.png" width=700px>

In [2]:
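import torch

# Define the leaf nodes; a is a constant input that needs no gradient
a = torch.tensor([4.0])

# Each weight is a leaf Tensor tracked by autograd
weights = [torch.tensor([float(i)], requires_grad=True) for i in (2, 5, 9, 7)]

# unpack the weights for nicer assignment
w1, w2, w3, w4 = weights

# Forward pass: every operation is recorded in the computational graph
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)

# Backward pass: fills in .grad for every leaf with requires_grad=True
L.backward()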

for index, weight in enumerate(weights, start=1):
    gradient = weight.grad.item()
    print(f"Gradient of L w.r.t. w{index}: {gradient}")

Gradient of L w.r.t. w1: -36.0
Gradient of L w.r.t. w2: -28.0
Gradient of L w.r.t. w3: -8.0
Gradient of L w.r.t. w4: -20.0
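
The two-layer network from the earlier sections can be trained the same way. The sketch below is one possible version, assuming the same sizes and learning rate as before: the weights are created with `requires_grad=True`, `loss.backward()` replaces the hand-written backward pass, and the update runs under `torch.no_grad()`. The fixed seed and the `losses` list are additions for illustration only.

```python
import torch

torch.manual_seed(0)                  # only to make the run repeatable
device = torch.device("cpu")

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Weights are the Tensors we want gradients for
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
losses = []
for t in range(500):
    # Forward pass: no need to keep intermediate values around by hand
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    # Backward pass: autograd fills in w1.grad and w2.grad
    loss.backward()

    # Update weights without recording the update in the graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

print(losses[0], losses[-1])
```

Compared with the manual version, the forward pass is unchanged and the gradient code shrinks to a single `backward()` call.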
