## Introduction to PyTorch

### Use NumPy to implement a neural network

NumPy provides an n-dimensional array object and many functions for manipulating these arrays. NumPy is a generic framework for scientific computing; it does not know anything about computation graphs, deep learning, or gradients. However, we can easily use NumPy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using NumPy operations:

In [23]:
import numpy as np
import torch
import random

In [2]:
#suppress warning
import warnings
warnings.filterwarnings('ignore')

Consider a 2-layer neural network. The input dimension is $D_{in}$, the hidden dimension is $H$, and the output dimension is $D_{out}$. Assume that the data is $(X,y)$, where $X$ is the $N \times D_{in}$ input matrix and $y$ is the $N \times D_{out}$ output matrix.

The weights for the two layers are $w_{1}$ and $w_{2}$, where $w_{1}$ is a $D_{in} \times H$ matrix and $w_{2}$ is an $H \times D_{out}$ matrix.

The neural network can be expressed in terms of the following equations:

- $h=Xw_{1}$ is an $N \times H$ matrix

- $U = \max(h,0)$ is an $N \times H$ matrix (the maximum is taken elementwise)

- $\hat{y} = Uw_{2}$ is an $N \times D_{out}$ matrix


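The loss function is the sum of squared errors over all entries, $\sum_{i,j} (y_{ij} - \hat{y}_{ij})^{2} = \operatorname{tr}\left((y - \hat{y})^{'}(y - \hat{y})\right)$, where $^{'}$ denotes the transpose.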
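The forward equations and their shapes can be checked with a few lines of NumPy. This is a minimal sketch with small illustrative sizes (not the sizes used later in this notebook); it also confirms numerically that the sum of squared errors equals the trace of $(y - \hat{y})^{'}(y - \hat{y})$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # small illustrative sizes (assumption)

X = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

h = X @ w1            # N x H
U = np.maximum(h, 0)  # N x H, elementwise ReLU
y_hat = U @ w2        # N x D_out

residual = y - y_hat
loss_sum = np.square(residual).sum()           # sum over all entries
loss_trace = np.trace(residual.T @ residual)   # trace of a D_out x D_out matrix

print(h.shape, U.shape, y_hat.shape)
print(np.isclose(loss_sum, loss_trace))
```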

To find the weights $w_{1}$ and $w_{2}$, we use gradient descent, computing the gradients by backpropagation:


$$\dfrac{\partial \text{ loss}}{\partial \hat{y}} = 2(\hat{y} - y) \text{ is an } N \times D_{out} \text{ matrix}$$

$$\dfrac{\partial \text{ loss}}{\partial w_{2}} = \dfrac{\partial \hat{y}}{\partial w_{2}} \dfrac{\partial \text{ loss}}{\partial \hat{y}} = U^{'}\dfrac{\partial \text{ loss}}{\partial \hat{y}} \text{ is an } H \times D_{out} \text{ matrix}$$

$$\dfrac{\partial \text{ loss}}{\partial U} = \dfrac{\partial \text{ loss}}{\partial \hat{y}}\dfrac{\partial \hat{y}}{\partial U}  = \dfrac{\partial \text{ loss}}{\partial \hat{y}} w_{2}^{'} \text{ is an } N \times H \text{ matrix}$$

$$\dfrac{\partial \text{ loss}}{\partial h} = \dfrac{\partial \text{ loss}}{\partial U} \odot I(h \ge 0) \text{ is an } N \times H \text{ matrix}$$

Here $\odot$ is the elementwise product and $I(h \ge 0)$ is the $N \times H$ matrix whose entries are 1 where $h \ge 0$ and 0 elsewhere.

$$\dfrac{\partial \text{ loss}}{\partial w_{1}} = \dfrac{\partial h}{\partial w_{1}} \dfrac{\partial \text{ loss}}{\partial h} = X^{'} \dfrac{\partial \text{ loss}}{\partial h} \text{ is a } D_{in} \times H \text{ matrix}$$

The gradient descent updates for $w_{1}$ and $w_{2}$ are

$$w_{1} = w_{1} - \rho \dfrac{\partial \text{ loss}}{\partial w_{1}}$$

$$w_{2} = w_{2} - \rho \dfrac{\partial \text{ loss}}{\partial w_{2}}$$

where $\rho$ is the learning rate.
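One way to gain confidence in hand-derived gradients like these is to compare them against finite differences. The sketch below is an illustration only: the helper names `loss_and_grads` and `numeric_grad`, the small sizes, and the step `eps` are assumptions chosen for the example, not part of the original notebook.

```python
import numpy as np

def loss_and_grads(X, y, w1, w2):
    """Forward pass followed by the backpropagation formulas given above."""
    h = X @ w1
    U = np.maximum(h, 0)
    y_hat = U @ w2
    loss = np.square(y_hat - y).sum()

    grad_y_hat = 2.0 * (y_hat - y)   # N x D_out
    grad_w2 = U.T @ grad_y_hat       # H x D_out
    grad_U = grad_y_hat @ w2.T       # N x H
    grad_h = grad_U * (h >= 0)       # N x H, elementwise mask
    grad_w1 = X.T @ grad_h           # D_in x H
    return loss, grad_w1, grad_w2

def numeric_grad(f, w, index, eps=1e-6):
    """Central finite difference of the scalar f with respect to w[index]."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[index] += eps
    w_minus[index] -= eps
    return (f(w_plus) - f(w_minus)) / (2 * eps)

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # small illustrative sizes (assumption)
X = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

_, grad_w1, grad_w2 = loss_and_grads(X, y, w1, w2)

# Compare one entry of each analytic gradient with its numerical estimate.
num_w1 = numeric_grad(lambda w: loss_and_grads(X, y, w, w2)[0], w1, (1, 2))
num_w2 = numeric_grad(lambda w: loss_and_grads(X, y, w1, w)[0], w2, (0, 1))
print(num_w1, grad_w1[1, 2])
print(num_w2, grad_w2[0, 1])
```

The two numbers on each printed line should agree closely; a large mismatch usually points to a transposed factor or a missing ReLU mask.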

In [3]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [4]:
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

In [6]:
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [7]:
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 28908068.676232096
1 27401964.90130323
2 29060529.966337197
3 29008988.44365628
4 24918574.41545503
5 17255514.003793757
6 10073991.23405539
7 5318258.092370378
8 2871420.148797653
9 1703272.7972052472
10 1142285.6322121155
11 846460.7947730802
12 670183.7787463254
13 551377.7231601854
14 463688.7240273949
15 395140.05664888455
16 339702.69440430816
17 293907.4514248923
18 255616.72015301566
19 223356.30699773735
20 195955.08582367323
21 172529.2478398161
22 152381.36103706632
23 134998.09782399941
24 119918.59738612093
25 106810.49579031486
26 95348.15281985156
27 85304.71719637387
28 76470.59795137786
29 68682.79079477963
30 61796.99720591422
31 55700.02528138531
32 50291.15682675305
33 45477.86202216404
34 41181.68247377885
35 37341.80541449768
36 33901.424963738005
37 30817.75086886153
38 28047.31178068516
39 25553.42289624719
40 23306.967299917098
41 21278.496167453828
42 19445.447401456116
43 17786.910193460364
44 16284.975609795049
45 14922.665527348581
46 13685.46580580746
...

458 3.954438932903153e-07
459 3.748082743515283e-07
460 3.552607205906018e-07
461 3.3673823303229735e-07
462 3.19181714078132e-07
463 3.0254115539689626e-07
464 2.867535314074212e-07
465 2.7179688447749006e-07
466 2.576169122444179e-07
467 2.4419483816309677e-07
468 2.314721012143004e-07
469 2.194137055064996e-07
470 2.0797729760603707e-07
471 1.9713592792421398e-07
472 1.8685854537573734e-07
473 1.771229931069082e-07
474 1.6789695316901582e-07
475 1.5915148110397434e-07
476 1.5086705339175134e-07
477 1.430098190375031e-07
478 1.355606558948906e-07
479 1.285025720106064e-07
480 1.2180937023385036e-07
481 1.1546580575756068e-07
482 1.0945946328889373e-07
483 1.0376096750286422e-07
484 9.83651498129012e-08
485 9.324730251242051e-08
486 8.839325735215783e-08
487 8.379154806854282e-08
488 7.943243477819605e-08
489 7.530093631579551e-08
490 7.138481211401022e-08
491 6.767356738345244e-08
492 6.415689983675569e-08
493 6.081860864573555e-08
494 5.76558453637801e-08
495 5.465789866895397e-08
...

### PyTorch Tensors

A PyTorch Tensor is conceptually identical to a NumPy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
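To make the correspondence concrete, here is a small sketch of moving data between NumPy and PyTorch on the CPU. The array contents are arbitrary illustration values.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)

t = torch.from_numpy(a)  # Tensor that shares memory with the NumPy array
t.mul_(2)                # in-place change on the Tensor is visible in `a`
print(a)

b = t.numpy()            # back to NumPy (CPU Tensors only), again without copying
print(t.shape, t.dtype)

# The same matrix product written with each library gives the same values.
same = torch.allclose(t @ t.t(), torch.from_numpy(a @ a.T))
print(same)
```

Because `torch.from_numpy` and `Tensor.numpy` share storage rather than copy, an in-place change on one side shows up on the other.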

Here we use PyTorch Tensors to fit a two-layer network to random data. As in the NumPy example above, we need to manually implement the forward and backward passes through the network:

In [9]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 1217.822021484375
199 24.060001373291016
299 0.7113990783691406
399 0.023574290797114372
499 0.0011385310208424926
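Instead of commenting the device line in and out by hand, the device can be chosen at run time. The following is a minimal sketch, assuming you simply want a GPU when one is available and the CPU otherwise; the tensor sizes are illustrative.

```python
import torch

# Pick the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 4, device=device)
w = torch.randn(4, 2, device=device)

out = x.mm(w)                        # computed on the chosen device
loss_value = out.pow(2).sum().item() # .item() returns a plain Python float

print(out.device, out.shape)
print(type(loss_value))
```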


### PyTorch: Tensors and autograd

The autograd package in PyTorch provides automatic differentiation to automate the computation of backward passes in neural networks. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad = True, then after the backward pass x.grad is another Tensor holding the gradient of some scalar value (typically the loss) with respect to x.
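As a tiny illustration of this behaviour, consider the scalar $z = \sum_i x_i^2$, whose gradient is $2x$. The values below are arbitrary and chosen only so that the result is easy to verify by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

z = (x ** 2).sum()  # scalar built from x; autograd records the operations
z.backward()        # fills x.grad with dz/dx

print(z.dim())      # 0: z is a zero-dimensional Tensor
print(x.grad)       # tensor([2., 4., 6.]), i.e. 2 * x
```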

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [10]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ());
    # loss.item() gets the Python scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 1020.8902587890625
199 11.472747802734375
299 0.17377793788909912
399 0.0031679735984653234
499 0.00019329070346429944


### PyTorch: nn module

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
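To see what "internal state" means for a single layer, the sketch below inspects one `torch.nn.Linear` Module and reproduces its output by hand. The layer sizes are illustrative assumptions, not the ones used in the example that follows.

```python
import torch

layer = torch.nn.Linear(5, 3)  # 5 inputs, 3 outputs (illustrative sizes)

# A Linear Module holds a weight of shape (out, in) and a bias of shape (out,).
for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

x = torch.randn(2, 5)
out = layer(x)

# The Module computes x @ weight^T + bias.
manual = x @ layer.weight.t() + layer.bias
matches = torch.allclose(out, manual, atol=1e-6)
print(matches)

# With reduction='sum', MSELoss returns the sum of squared differences.
loss_fn = torch.nn.MSELoss(reduction='sum')
target = torch.zeros(2, 3)
loss_matches = torch.allclose(loss_fn(out, target), out.pow(2).sum())
print(loss_matches)
```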

In [15]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use MSELoss. With reduction='sum' it returns the sum of squared
# errors rather than their mean, matching the loss used in the earlier examples.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before. The update must be in
    # place (-=) so that the model's own Tensors change.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
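One detail worth checking in a hand-written update loop like the one above is whether the update is applied in place. A loss that prints the same value at every step is a typical symptom of an update that never reaches the model. The sketch below uses a hypothetical three-element tensor `w` in place of the model's parameters and contrasts rebinding the loop variable with an in-place update.

```python
import torch

w = torch.ones(3, requires_grad=True)
w.sum().backward()  # gives w.grad == [1, 1, 1]
lr = 0.1

with torch.no_grad():
    for p in [w]:
        p = p - lr * p.grad  # builds a new Tensor and rebinds the name p only
after_rebinding = w.detach().clone()

with torch.no_grad():
    for p in [w]:
        p -= lr * p.grad     # modifies the storage that w itself refers to
after_in_place = w.detach().clone()

print(after_rebinding)  # unchanged: all ones
print(after_in_place)   # each entry reduced by lr * grad
```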


### PyTorch: optim

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package:

In [16]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

99 53.15220260620117
199 0.7771395444869995
299 0.005432159174233675
399 2.9338394597289152e-05
499 7.045740346711682e-08
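The reason for calling `optimizer.zero_grad()` on every iteration can be seen in a small experiment. This is a sketch with a hypothetical two-element parameter `w` and plain SGD, chosen only to keep the numbers easy to follow.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w ** 2).sum().backward()
first = w.grad.clone()        # gradient of sum(w^2) is 2w

(w ** 2).sum().backward()     # second backward without zeroing
accumulated = w.grad.clone()  # the new gradient was added to the old one

optimizer.zero_grad()         # resets w.grad (to None or zeros, depending on
                              # the PyTorch version)
(w ** 2).sum().backward()
fresh = w.grad.clone()        # back to a single, non-accumulated gradient

print(first, accumulated, fresh)
```

No `optimizer.step()` is called here, so `w` stays fixed and the three gradients are directly comparable.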


### PyTorch: Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.

In this example we implement our two-layer network as a custom Module subclass:

In [21]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

In [22]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 2.480579376220703
199 0.04392130300402641
299 0.001858148374594748
399 0.00010960386862279847
499 7.437600288540125e-06
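`model.parameters()` finds the weights of `linear1` and `linear2` because assigning a Module as an attribute registers it with its parent. The sketch below illustrates a related pitfall; the class names `Registered` and `NotRegistered` are hypothetical and exist only for this comparison. Modules stored in a plain Python list are not registered, whereas `torch.nn.ModuleList` registers them.

```python
import torch

class Registered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList registers each contained Module with the parent.
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(4, 4) for _ in range(2)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x).clamp(min=0)
        return x

class NotRegistered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A plain list is invisible to parameters(), so an optimizer built
        # from model.parameters() would never update these layers.
        self.layers = [torch.nn.Linear(4, 4) for _ in range(2)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x).clamp(min=0)
        return x

n_registered = len(list(Registered().parameters()))        # 2 weights + 2 biases
n_not_registered = len(list(NotRegistered().parameters()))

print(n_registered, n_not_registered)
```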


### PyTorch: Control Flow + Weight Sharing

As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
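Weight sharing can be illustrated in isolation before looking at the full model. In this sketch a single hypothetical `shared` Linear layer is applied a variable number of times; the number of learnable parameters stays the same regardless of depth, and autograd still produces one gradient per parameter that combines the contributions from every reuse.

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(3, 3)  # one set of weights, reused below
x = torch.randn(2, 3)

def run(depth):
    """Apply the same Module `depth` times and return a scalar loss."""
    h = x
    for _ in range(depth):  # ordinary Python control flow
        h = shared(h).clamp(min=0)
    return h.pow(2).sum()

# 3*3 weights + 3 biases, independent of how often the layer is applied.
n_params = sum(p.numel() for p in shared.parameters())

grad_shapes = []
for depth in (1, 3):
    shared.zero_grad()
    run(depth).backward()
    grad_shapes.append(tuple(shared.weight.grad.shape))
    print(depth, n_params, grad_shapes[-1])
```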

We can easily implement this model as a Module subclass:

In [24]:
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred

In [25]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 7.496952533721924
199 5.202394485473633
299 0.999541163444519
399 0.43073761463165283
499 0.3226214349269867
