We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will be trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network output and the true output.

### Table of Contents

- Warm-up: numpy
- PyTorch: Tensors
- PyTorch: Variables and autograd
- PyTorch: Defining new autograd functions
- TensorFlow: Static Graphs
- PyTorch: nn
- PyTorch: optim
- PyTorch: Custom nn Modules
- PyTorch: Control Flow and Weight Sharing

## Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.
Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [1]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 36294917.5557
1 33699684.455
2 31807891.1345
3 26298033.019
4 18260184.2436
5 10742985.3586
6 5915192.79333
7 3342683.77593
8 2084109.99123
9 1446845.17603
10 1094216.41467
11 874209.93957
12 721744.599221
13 607863.552077
14 518199.881396
15 445574.966818
16 385767.254377
17 335711.601487
18 293487.31396
19 257633.545637
20 227001.477578
21 200673.059475
22 177940.693687
23 158236.161608
24 141082.248777
25 126098.926801
26 112969.741609
27 101430.300857
28 91245.6650215
29 82236.1905469
30 74254.8020085
31 67168.8466515
32 60857.5574358
33 55230.4422646
34 50197.906334
35 45687.7836579
36 41635.6221961
37 37988.8002882
38 34702.7976095
39 31737.8655908
40 29058.2204536
41 26643.882248
42 24461.5472184
43 22479.8846206
44 20679.6229739
45 19042.0041861
46 17548.6609127
47 16185.978359
48 14942.3460043
49 13805.2791895
50 12764.3298789
51 11810.9015558
52 10936.2457326
53 10133.3437261
54 9395.78428317
55 8717.34402183
56 8092.96979775
57 7518.16406271
58 6988.12829116
59 6499.116709
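
Hand-written backward passes are easy to get subtly wrong, so it can be worth checking them numerically. The sketch below is an addition to the example above rather than part of it: the helper names `loss_and_grads` and `numeric_grad`, the tiny layer sizes, and the step size `eps` are all illustrative assumptions. It compares the analytic gradients with centered finite differences:

```python
import numpy as np


def loss_and_grads(x, y, w1, w2):
    """Forward and backward pass of the two-layer ReLU network from the example above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Centered finite-difference estimate of the gradient of f() with respect to w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


rng = np.random.RandomState(0)  # fixed seed so the check is repeatable
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes keep the finite-difference loop cheap
x, y = rng.randn(N, D_in), rng.randn(N, D_out)
w1, w2 = rng.randn(D_in, H), rng.randn(H, D_out)

loss, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)


def current_loss():
    """Loss at the current (possibly perturbed) values of w1 and w2."""
    return loss_and_grads(x, y, w1, w2)[0]


# Largest absolute mismatch, scaled by the largest analytic gradient entry.
rel_err_w1 = np.abs(numeric_grad(current_loss, w1) - grad_w1).max() / np.abs(grad_w1).max()
rel_err_w2 = np.abs(numeric_grad(current_loss, w2) - grad_w2).max() / np.abs(grad_w2).max()
print(rel_err_w1, rel_err_w2)
```

If both relative errors are tiny, the backward pass is very likely correct; a large mismatch usually points to a transposed matrix or a missing ReLU mask.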

## PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of [50x or greater](https://github.com/jcjohnson/cnn-benchmarks), so unfortunately numpy won't be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors do not know anything about deep learning or computational graphs or gradients; they are a generic tool for scientific computing.

However, unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on GPU, you simply need to cast it to a new datatype.
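
The examples in this tutorial select the device by casting to `torch.cuda.FloatTensor`. Newer PyTorch releases also accept a `device` argument and provide a `.to()` method; the sketch below shows that alternative. The sizes and variable names are illustrative assumptions, and the snippet falls back to the CPU when no GPU is visible:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1000, device=device)  # created directly on the chosen device
w = torch.randn(1000, 100).to(device)     # created on the CPU, then moved
h = x.mm(w)                               # runs on whichever device holds x and w

print(h.device.type, h.dtype, tuple(h.shape))
```

Both operands of an operation such as `mm` need to live on the same device, which is why `x` and `w` are placed explicitly before multiplying.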

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we need to manually implement the forward and backward passes through the network:

In [2]:
import torch

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 29126067.13111648
1 26349763.890004296
2 27470426.752154827
3 27996757.734803893
4 25059644.27072954
5 18595697.35826277
6 11563421.694134854
7 6408758.770751238
8 3495005.0491381437
9 2037336.4893597476
10 1326381.703181915
11 959993.963221347
12 750722.6961148025
13 616244.0283457797
14 520283.92721932335
15 446643.213206273
16 387452.3405096517
17 338601.46642259625
18 297499.50133071846
19 262541.9784091521
20 232583.63726777676
21 206705.3618355831
22 184232.78335948265
23 164643.65243452336
24 147500.891054754
25 132449.3263606677
26 119188.43887156644
27 107477.81532664876
28 97092.07796235426
29 87863.24299409322
30 79644.86112038331
31 72308.63069390476
32 65753.90563654387
33 59879.17466427782
34 54605.92082267729
35 49861.130956617126
36 45589.96370653974
37 41737.47556608585
38 38253.912664197574
39 35100.05163908473
40 32238.731641755876
41 29639.077780490217
42 27273.41135514937
43 25120.736049674895
44 23160.28624062547
45 21369.690476023214
46 19733.144253220158

## PyTorch: Variables and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation)
to automate the computation of backward passes in neural networks. The **autograd** package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a **computational graph**; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.
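
To make the idea of a computational graph concrete, here is a toy reverse-mode differentiator for scalars in plain Python. It is only a sketch of the concept and not how PyTorch implements autograd; the `Node` class and its methods are hypothetical names chosen for illustration. Each `Node` remembers which nodes it was computed from, together with the local derivative along each edge, and `backward` applies the chain rule from the output back to the inputs:

```python
class Node:
    """A scalar value in a computational graph (a toy stand-in for a Tensor)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, d(self) / d(parent)).
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def relu(self):
        local_grad = 1.0 if self.value > 0 else 0.0
        return Node(max(self.value, 0.0), ((self, local_grad),))

    def backward(self):
        # Order the graph so that every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        # Chain rule, walking from the output back towards the inputs.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad


x, w, b = Node(3.0), Node(-2.0), Node(7.0)
out = (x * w + b).relu() * x  # the forward pass builds the graph as it runs
out.backward()
print(out.value, x.grad, w.grad, b.grad)  # 3.0 -5.0 9.0 3.0
```

The forward pass records the graph as a side effect of ordinary arithmetic, which is the same define-by-run behaviour described above.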

This sounds complicated, but it's pretty simple to use in practice. We wrap our PyTorch Tensors in **Variable** objects; a Variable represents a node in a computational graph. If `x` is a Variable then `x.data` is a Tensor, and `x.grad` is another Variable holding the gradient of `x` with respect to some scalar value.

PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.
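
Note that in PyTorch 0.4 and later, Variable and Tensor were merged: a Tensor created with `requires_grad=True` takes part in autograd directly, with no wrapper. The following sketch shows that newer style on a tiny version of our network and checks autograd's result against the manual gradient formulas from the earlier sections. The sizes, the `float64` dtype, and the names `manual_grad_w1`, `manual_grad_w2`, and `grads_match` are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 4, 5, 3, 2

x = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
w1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Forward pass; autograd records the graph because w1 and w2 require gradients.
h = x.mm(w1)
y_pred = h.clamp(min=0).mm(w2)
loss = (y_pred - y).pow(2).sum()
loss.backward()

with torch.no_grad():
    # Manual gradients, using the same formulas as the hand-written backward pass.
    grad_y_pred = 2.0 * (y_pred - y)
    manual_grad_w2 = h.clamp(min=0).t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h < 0] = 0
    manual_grad_w1 = x.t().mm(grad_h)
    grads_match = (torch.allclose(w1.grad, manual_grad_w1)
                   and torch.allclose(w2.grad, manual_grad_w2))

    # One gradient descent step, then reset the accumulated gradients.
    w1 -= 1e-6 * w1.grad
    w2 -= 1e-6 * w2.grad
    w1.grad.zero_()
    w2.grad.zero_()

print(loss.item(), grads_match)
```

Here `loss.item()` plays the role of `loss.data[0]`, and the `torch.no_grad()` block plays the role of updating `w1.data` and `w2.data`.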

Here we use PyTorch Variables and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [3]:
import torch
from torch.autograd import Variable

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data[0] is a scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent; w1.data and w2.data are Tensors,
    # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
    # Tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after running the backward pass
    w1.grad.data.zero_()
    w2.grad.data.zero_()

0 28151838.0
1 23855840.0
2 22798790.0
3 21746852.0
4 19143714.0
5 14923601.0
6 10347043.0
7 6597578.0
8 4072913.0
9 2555915.25
10 1689332.5
11 1193177.75
12 898067.4375
13 711312.9375
14 584619.1875
15 492756.09375
16 422344.53125
17 366147.15625
18 319967.78125
19 281289.65625
20 248485.421875
21 220389.0625
22 196125.359375
23 175041.9375
24 156634.328125
25 140525.609375
26 126352.3828125
27 113854.9921875
28 102794.2421875
29 92976.03125
30 84237.796875
31 76446.578125
32 69480.140625
33 63238.72265625
34 57641.59765625
35 52606.9296875
36 48084.3203125
37 44008.3359375
38 40323.6484375
39 36987.8671875
40 33963.69921875
41 31216.7109375
42 28719.669921875
43 26445.7734375
44 24372.70703125
45 22481.83203125
46 20754.330078125
47 19174.357421875
48 17729.115234375
49 16404.478515625
50 15189.0810546875
51 14073.3447265625
52 13048.5771484375
53 12105.685546875
54 11237.39453125
55 10437.640625
56 9700.52734375
57 9020.4404296875
58 8392.2724609375
59 7811.83935546875
60 7275.20605

386 0.0009792167693376541
387 0.0009477005223743618
388 0.0009204429807141423
389 0.0008941081468947232
390 0.000866335816681385
391 0.0008403147221542895
392 0.0008148514316417277
393 0.0007927618571557105
394 0.0007705118041485548
395 0.0007485965616069734
396 0.0007282397709786892
397 0.0007061169599182904
398 0.0006872809608466923
399 0.0006675703916698694
400 0.0006500589661300182
401 0.0006320748943835497
402 0.0006142815691418946
403 0.0005973376100882888
404 0.0005818820791319013
405 0.0005658938898704946
406 0.0005506458692252636
407 0.0005366129917092621
408 0.0005223175976425409
409 0.0005092177307233214
410 0.00049572967691347
411 0.0004835558938793838
412 0.00047139517846517265
413 0.0004595947393681854
414 0.0004478179616853595
415 0.0004372907569631934
416 0.00042554727406241
417 0.000415329122915864
418 0.00040455549606122077
419 0.00039509855560027063
420 0.0003853017115034163
421 0.00037658834480680525
422 0.00036822122638113797
423 0.00035928821307606995

## PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that
operate on Tensors. The **forward** function computes output Tensors from input
Tensors. The **backward** function receives the gradient of some scalar value
with respect to the output Tensors, and computes the gradient
of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass
of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
We can then use our new autograd operator by constructing an instance and calling it
like a function, passing Variables containing input data.
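
The instance-based style described here matches the PyTorch release this tutorial was written for. Newer releases instead expect `forward` and `backward` to be static methods that receive a context object, and the operator is invoked through `apply` rather than by constructing an instance. A minimal sketch of that newer style, with an illustrative input size and a check that is not part of the original example:

```python
import torch


class MyReLU(torch.autograd.Function):
    """ReLU written as a custom autograd Function using static methods."""

    @staticmethod
    def forward(ctx, input):
        # Cache the input on the context object for use in the backward pass.
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the incoming gradient through where input >= 0, zero it elsewhere.
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


torch.manual_seed(0)
x = torch.randn(4, 5, dtype=torch.float64, requires_grad=True)

out = MyReLU.apply(x)  # call through apply; no instance is constructed
out.sum().backward()

# For out.sum(), the gradient with respect to x is 1 where x > 0 and 0 elsewhere.
expected = (x.detach() > 0).to(torch.float64)
print(torch.equal(x.grad, expected))
```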

In this example we define our own custom autograd function for performing the ReLU
nonlinearity, and use it to implement our two-layer network:

In [5]:
import torch
from torch.autograd import Variable


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    def forward(self, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. You can cache arbitrary Tensors for use in the
        backward pass using the save_for_backward method.
        """
        self.save_for_backward(input)
        return input.clamp(min=0)

    def backward(self, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = self.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Construct an instance of our MyReLU class to use in our network
    relu = MyReLU()

    # Forward pass: compute predicted y using operations on Variables; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after running the backward pass
    w1.grad.data.zero_()
    w2.grad.data.zero_()

0 31147192.0
1 31969982.0
2 41728408.0
3 52850288.0
4 52601052.0
5 34836024.0
6 14943354.0
7 5112910.5
8 2114901.75
9 1263391.5
10 949747.875
11 779061.5625
12 658667.3125
13 564229.25
14 487367.65625
15 423746.125
16 370458.21875
17 325587.25
18 287462.6875
19 254852.703125
20 226719.25
21 202340.6875
22 181133.484375
23 162592.640625
24 146308.859375
25 131972.890625
26 119305.4140625
27 108074.078125
28 98100.921875
29 89208.4765625
30 81256.1875
31 74131.859375
32 67731.984375
33 61973.61328125
34 56786.73046875
35 52102.96875
36 47865.61328125
37 44024.359375
38 40535.9375
39 37365.2890625
40 34478.44921875
41 31845.4921875
42 29441.39453125
43 27244.583984375
44 25233.3046875
45 23390.478515625
46 21698.79296875
47 20146.435546875
48 18719.291015625
49 17405.7265625
50 16195.4052734375
51 15079.7294921875
52 14052.001953125
53 13101.6728515625
54 12223.3857421875
55 11410.0888671875
56 10656.7392578125
57 9958.43359375
58 9310.779296875
59 8709.8642578125
60 8151.55810546875

494 0.00018602116324473172
495 0.0001826491643441841
496 0.00017937015218194574
497 0.00017638799909036607
498 0.00017329351976513863
499 0.00017029284208547324


## TensorFlow: Static Graphs
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define
a computational graph, and use automatic differentiation to compute gradients.
The biggest difference between the two is that TensorFlow's computational graphs
are **static** and PyTorch uses **dynamic** computational graphs.

In TensorFlow, we define the computational graph once and then execute the same
graph over and over again, possibly feeding different input data to the graph.
In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example
a framework might decide to fuse some graph operations for efficiency, or to
come up with a strategy for distributing the graph across many GPUs or many
machines. If you are reusing the same graph over and over, then this potentially
costly up-front optimization can be amortized as the same graph is rerun over
and over.

One aspect where static and dynamic graphs differ is control flow. For some models
we may wish to perform different computation for each data point; for example a
recurrent network might be unrolled for different numbers of time steps for each
data point; this unrolling can be implemented as a loop. With a static graph the
loop construct needs to be a part of the graph; for this reason TensorFlow
provides operators such as `tf.scan` for embedding loops into the graph. With
dynamic graphs the situation is simpler: since we build graphs on-the-fly for
each example, we can use normal imperative flow control to perform computation
that differs for each input.
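
The distinction can be illustrated without either framework. The toy sketch below is purely illustrative: the `StaticGraph` class with its `add_op` and `run` methods and the `double_until_large` function are hypothetical names, not TensorFlow or PyTorch APIs. The static version records a fixed list of operations once and replays it for every input, while the dynamic version is ordinary Python whose loop count depends on the data:

```python
class StaticGraph:
    """Toy 'define once, run many times' graph: a fixed list of named operations."""

    def __init__(self):
        self.ops = []  # entries are (output_name, function, input_names)

    def add_op(self, out, fn, *inputs):
        self.ops.append((out, fn, inputs))
        return out

    def run(self, fetch, feed):
        # Replay the recorded operations in order, starting from the fed values.
        values = dict(feed)
        for out, fn, inputs in self.ops:
            values[out] = fn(*(values[name] for name in inputs))
        return values[fetch]


# Static style: the structure (exactly two doublings) is fixed when the graph is built.
graph = StaticGraph()
graph.add_op("a", lambda x: 2 * x, "x")
graph.add_op("b", lambda a: 2 * a, "a")
static_results = [graph.run("b", {"x": x}) for x in (1, 3, 50)]


# Dynamic style: plain Python control flow decides, per input, how many steps to take.
def double_until_large(x, threshold=100):
    steps = 0
    while x < threshold:
        x = 2 * x
        steps += 1
    return x, steps


dynamic_results = [double_until_large(x) for x in (1, 3, 50)]

print(static_results)   # [4, 12, 200]
print(dynamic_results)  # [(128, 7), (192, 6), (100, 1)]
```

To get data-dependent loop counts out of the static version, the loop itself would have to become an operation inside the graph, which is the role the text above assigns to operators such as `tf.scan`.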

To contrast with the PyTorch autograd example above, here we use TensorFlow to
fit a simple two-layer net:


In [6]:
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

3.11165e+07
2.65015e+07
2.45489e+07
2.18674e+07
1.77339e+07
1.27473e+07
8.38358e+06
5.22151e+06
3.26027e+06
2.10956e+06
1.44911e+06
1.05884e+06
816835.0
656479.0
543287.0
458713.0
392745.0
339644.0
295892.0
259308.0
228372.0
201940.0
179219.0
159547.0
142441.0
127494.0
114381.0
102843.0
92660.8
83644.0
75656.5
68546.6
62204.8
56536.8
51459.5
46904.2
42807.0
39114.1
35788.6
32793.9
30084.4
27625.2
25391.8
23361.7
21513.3
19827.8
18291.4
16887.4
15608.0
14436.1
13360.8
12374.1
11467.3
10633.5
9866.03
9159.07
8507.63
7906.89
7351.88
6839.17
6366.03
5928.3
5523.27
5148.02
4800.26
4477.67
4178.37
3900.53
3642.48
3402.69
3179.76
2972.45
2779.52
2599.97
2432.81
2276.98
2131.7
1996.26
1870.01
1752.25
1642.29
1539.65
1443.74
1354.16
1270.45
1192.14
1118.89
1050.37
986.268
926.277
870.09
817.463
768.159
721.966
678.677
638.093
600.024
564.34
530.852
499.449
469.951
442.261
416.281
391.896
368.99
347.476
327.26
308.256
290.401
273.612
257.823
242.978
229.01
215.903
203.606
192.026
181.123
170.87


## PyTorch: nn
Computational graphs and autograd are a very powerful paradigm for defining
complex operators and automatically taking derivatives; however, for large
neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation
into **layers**, some of which have **learnable parameters** which will be
optimized during learning.

In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
[TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
**Modules**, which are roughly equivalent to neural network layers. A Module receives
input Variables and computes output Variables, but may also hold internal state such as
Variables containing learnable parameters. The `nn` package also defines a set of useful
loss functions that are commonly used when training neural networks.
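
To show what "a Module holds its own parameters" means, here is a toy numpy analogue. It is a sketch of the idea only and not how `torch.nn` is implemented: the classes below are hypothetical stand-ins, the weight layout and the 0.01 initialization scale are arbitrary choices, and there is no gradient support at all:

```python
import numpy as np


class Linear:
    """Toy layer that owns its weight and bias and computes x @ weight + bias."""

    def __init__(self, d_in, d_out, rng):
        self.weight = rng.randn(d_in, d_out) * 0.01
        self.bias = np.zeros(d_out)

    def __call__(self, x):
        return x.dot(self.weight) + self.bias

    def parameters(self):
        return [self.weight, self.bias]


class ReLU:
    """Toy layer with no parameters."""

    def __call__(self, x):
        return np.maximum(x, 0)

    def parameters(self):
        return []


class Sequential:
    """Toy container that applies its layers in order and collects their parameters."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


rng = np.random.RandomState(0)
model = Sequential(Linear(1000, 100, rng), ReLU(), Linear(100, 10, rng))
y_pred = model(rng.randn(64, 1000))

print(y_pred.shape)                           # (64, 10)
print([p.shape for p in model.parameters()])  # two weights and two biases
```

The caller never touches `weight` or `bias` directly; asking the container for its `parameters()` is enough to reach every learnable array.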

In this example we use the `nn` package to implement our two-layer network:

In [7]:
import torch
from torch.autograd import Variable

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Variables for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Passing
# size_average=False sums the squared errors instead of averaging them.
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Variable of input data to the Module and it produces
    # a Variable of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Variables containing the predicted and true
    # values of y, and the loss function returns a Variable containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Variables with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Variable, so
    # we can access its data and gradients like we did before.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data

0 702.580078125
1 651.6387939453125
2 607.560302734375
3 568.309326171875
4 533.0369873046875
5 501.1276550292969
6 472.29827880859375
7 445.7300109863281
8 421.1839294433594
9 398.20654296875
10 376.595703125
11 356.2938232421875
12 337.18096923828125
13 319.1058349609375
14 301.9778137207031
15 285.7295227050781
16 270.2890319824219
17 255.64804077148438
18 241.7335662841797
19 228.49624633789062
20 215.84336853027344
21 203.7827606201172
22 192.30438232421875
23 181.40830993652344
24 171.0581817626953
25 161.2638702392578
26 151.9705352783203
27 143.14361572265625
28 134.77232360839844
29 126.84070587158203
30 119.35781860351562
31 112.28185272216797
32 105.5811996459961
33 99.2617416381836
34 93.30645751953125
35 87.69295501708984
36 82.4083480834961
37 77.43061065673828
38 72.74105834960938
39 68.34416961669922
40 64.21847534179688
41 60.34305191040039
42 56.70635986328125
43 53.2939567565918
44 50.092041015625
45 47.089046478271484
46 44.26726150512695
47 41.626007080078125

452 1.654035077081062e-05
453 1.6083244190667756e-05
454 1.563990008435212e-05
455 1.5207281649054494e-05
456 1.4788025509915315e-05
457 1.4380444554262795e-05
458 1.3981664778839331e-05
459 1.3597156794276088e-05
460 1.3223243513493799e-05
461 1.2857514775532763e-05
462 1.2502604477049317e-05
463 1.2159844118286856e-05
464 1.1823163731605746e-05
465 1.1498834282974713e-05
466 1.1182150046806782e-05
467 1.087487271433929e-05
468 1.05765739135677e-05
469 1.0284378731739707e-05
470 1.00021279649809e-05
471 9.726709322421812e-06
472 9.460111868975218e-06
473 9.200055501423776e-06
474 8.947410606197082e-06
475 8.702300874574576e-06
476 8.46303919388447e-06
477 8.231932952185161e-06
478 8.004983101272956e-06
479 7.786981768731494e-06
480 7.573593393317424e-06
481 7.366648333118064e-06
482 7.16417389412527e-06
483 6.968447451072279e-06
484 6.777252565370873e-06
485 6.592854788323166e-06
486 6.411809408746194e-06
487 6.236769422685029e-06
488 6.066030891815899e-06
489 5.899820735066896e-06

## PyTorch: optim
Up to this point we have updated the weights of our models by manually mutating the
`.data` member for Variables holding learnable parameters. This is not a huge burden
for simple optimization algorithms like stochastic gradient descent, but in practice
we often train neural networks using more sophisticated optimizers like AdaGrad,
RMSProp, Adam, etc.

The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
provides implementations of commonly used optimization algorithms.
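
To give a sense of the bookkeeping an optimizer hides, here is a numpy sketch of a single Adam update as the rule is commonly stated. It is an illustration under stated assumptions, not the `torch.optim.Adam` source: the function name `adam_step`, the `state` dictionary, and the quadratic test problem are all hypothetical:

```python
import numpy as np


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to `param` in place; `state` holds m, v and the step count."""
    state["t"] += 1
    # Exponential moving averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    param -= lr * m_hat / (np.sqrt(v_hat) + eps)


# Minimize f(w) = sum((w - target)^2), whose gradient is 2 * (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}

first_step = None
for t in range(2000):
    grad = 2.0 * (w - target)
    before = w.copy()
    adam_step(w, grad, state, lr=1e-2)
    if t == 0:
        first_step = w - before

print(first_step)  # each entry has magnitude close to lr, whatever the gradient's size
print(w)           # close to target
```

Unlike plain gradient descent, each parameter carries extra per-parameter state (`m` and `v`), which is exactly the kind of detail that is tedious to manage by hand.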

In this example we will use the `nn` package to define our model as before, but we
will optimize the model using the Adam algorithm provided by the `optim` package:

In [8]:
import torch
from torch.autograd import Variable

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(size_average=False)

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Variables it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()

0 691.1389770507812
1 673.6463012695312
2 656.6524658203125
3 640.1995239257812
4 624.2630004882812
5 608.7924194335938
6 593.7603759765625
7 579.1121826171875
8 564.8551635742188
9 550.9869995117188
10 537.5398559570312
11 524.4890747070312
12 511.7931213378906
13 499.49383544921875
14 487.5394592285156
15 475.9281311035156
16 464.6668395996094
17 453.71185302734375
18 443.0300598144531
19 432.6459655761719
20 422.541259765625
21 412.7095031738281
22 403.16778564453125
23 393.8221435546875
24 384.69378662109375
25 375.7666015625
26 367.12664794921875
27 358.7503662109375
28 350.55828857421875
29 342.54888916015625
30 334.7229919433594
31 327.0707092285156
32 319.6160888671875
33 312.3048095703125
34 305.1597900390625
35 298.1591796875
36 291.3157958984375
37 284.651611328125
38 278.12713623046875
39 271.72705078125
40 265.4615478515625
41 259.334228515625
42 253.3516387939453
43 247.49485778808594
44 241.7464599609375
45 236.10508728027344
46 230.5500946044922
47 225.1023406982422

412 1.162743956228951e-05
413 1.0835625289473683e-05
414 1.00934912552475e-05
415 9.402783689438365e-06
416 8.757166142459027e-06
417 8.15442490420537e-06
418 7.591708708787337e-06
419 7.066854323056759e-06
420 6.5762769736466e-06
421 6.11859741184162e-06
422 5.692126251233276e-06
423 5.295875780575443e-06
424 4.9272166506852955e-06
425 4.585966507875128e-06
426 4.267594249540707e-06
427 3.97053872802644e-06
428 3.6933440696884645e-06
429 3.434506879784749e-06
430 3.194583086951752e-06
431 2.9702227948291693e-06
432 2.761559699138161e-06
433 2.567298679423402e-06
434 2.3867596610216424e-06
435 2.2184569843375357e-06
436 2.0617815152945695e-06
437 1.915999291668413e-06
438 1.7802731235860847e-06
439 1.6538073168703704e-06
440 1.535865294499672e-06
441 1.4268407539930195e-06
442 1.3250569281808566e-06
443 1.2300869229875389e-06
444 1.1418879921620828e-06
445 1.0606025853121537e-06
446 9.839861832006136e-07
447 9.133615321843536e-07
448 8.474823971482692e-07
449 7.863673658903281e-07

## PyTorch: Custom nn Modules
Sometimes you will want to specify models that are more complex than a sequence of
existing Modules; for these cases you can define your own Modules by subclassing
`nn.Module` and defining a `forward` method which receives input Variables and produces
output Variables using other modules or other autograd operations on Variables.

In this example we implement our two-layer network as a custom Module subclass:


In [9]:
import torch
from torch.autograd import Variable


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor returns the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss (.item() returns the scalar loss as a Python number)
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 684.0302734375
1 632.9011840820312
2 589.2053833007812
3 550.6514892578125
4 516.4931030273438
5 485.6251525878906
6 457.6509704589844
7 432.03619384765625
8 408.41546630859375
9 386.3865051269531
10 365.82257080078125
11 346.482666015625
12 328.1611022949219
13 310.845458984375
14 294.4609680175781
15 278.8294677734375
16 263.9429016113281
17 249.68043518066406
18 236.066650390625
19 223.0780487060547
20 210.73768615722656
21 198.9150390625
22 187.64071655273438
23 176.8770751953125
24 166.57550048828125
25 156.75015258789062
26 147.42864990234375
27 138.59051513671875
28 130.21261596679688
29 122.2490005493164
30 114.70915222167969
31 107.59922790527344
32 100.88062286376953
33 94.53604888916016
34 88.56356048583984
35 82.92398071289062
36 77.62141418457031
37 72.6456069946289
38 67.97760772705078
39 63.60685348510742
40 59.523406982421875
41 55.693912506103516
42 52.11708450317383
43 48.784725189208984
44 45.670387268066406
45 42.766719818115234
46 40.054832458496094
47 37.5260086

396 5.239243546384387e-05
397 5.0784328777808696e-05
398 4.922419975628145e-05
399 4.771211388288066e-05
400 4.62480602436699e-05
401 4.4828026148024946e-05
402 4.345230991020799e-05
403 4.212060593999922e-05
404 4.082996747456491e-05
405 3.9577196730533615e-05
406 3.8362846680684015e-05
407 3.718750667758286e-05
408 3.60496123903431e-05
409 3.494817792670801e-05
410 3.38763820764143e-05
411 3.2839794585015625e-05
412 3.183340959367342e-05
413 3.0858143873047084e-05
414 2.9916895073256455e-05
415 2.9001406801398844e-05
416 2.8116601242800243e-05
417 2.7257796318735927e-05
418 2.642454819579143e-05
419 2.561689689173363e-05
420 2.4835158910718746e-05
421 2.4077322450466454e-05
422 2.3342854547081515e-05
423 2.262845191580709e-05
424 2.1938190911896527e-05
425 2.1270814613671973e-05
426 2.0621324438252486e-05
427 1.9993234673165716e-05
428 1.938419882208109e-05
429 1.8794013158185408e-05
430 1.822041485866066e-05
431 1.7668204236542806e-05
432 1.7129152183770202e-05
433 1.660824273130856
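
To make concrete what `model.parameters()` passes to the optimizer in the example above, the sketch below rebuilds the same `TwoLayerNet` and lists its registered parameters. It assumes PyTorch 0.4 or later, and `param_shapes` is a name introduced here purely for illustration.

```python
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Assigning Modules as attributes registers their parameters
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)


D_in, H, D_out = 1000, 100, 10
model = TwoLayerNet(D_in, H, D_out)

# named_parameters() yields (name, parameter) pairs for every submodule
# (param_shapes is an illustrative name, not part of the tutorial)
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)
```

Each `nn.Linear` stores its weight as (out_features, in_features), so `linear1.weight` has shape (100, 1000) rather than (1000, 100). Every tensor listed here is updated by `optimizer.step()`.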

## PyTorch: Control Flow + Weight Sharing
As an example of dynamic graphs and weight sharing, we implement a very strange
model: a fully-connected ReLU network that on each forward pass chooses a random
number between 1 and 4 and uses that many hidden layers, reusing the same weights
multiple times to compute the innermost hidden layers.

For this model we can use normal Python control flow to implement the loop, and we
can implement weight sharing among the innermost layers by simply reusing the
same Module multiple times when defining the forward pass.
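
The weight-sharing idea can be checked by hand on a tiny case. The sketch below assumes PyTorch 0.4 or later (Tensors used directly, without `Variable`) and omits the ReLU to keep the arithmetic simple. It applies one hypothetical `shared` layer, which has a single scalar weight, a chosen number of times in a Python loop. The names `shared` and `grad_for_depth` are illustrative and are not part of the model in this section.

```python
import torch

# One Module with a single scalar weight w, so gradients are easy to check by hand
shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(2.0)  # w = 2

x = torch.tensor([[3.0]])


def grad_for_depth(depth):
    """Apply `shared` `depth` times in a Python loop and return d(output)/dw."""
    shared.zero_grad()
    h = x
    for _ in range(depth):
        h = shared(h)  # same Module, hence same weight, on every iteration
    h.sum().backward()
    return shared.weight.grad.item()


# depth 1: y = w * x,     dy/dw = x         = 3
# depth 2: y = w**2 * x,  dy/dw = 2 * w * x = 12
g1 = grad_for_depth(1)
g2 = grad_for_depth(2)
print(g1, g2)  # 3.0 12.0
```

The Module owns one weight however many times it is applied. At depth 2 the gradient is 12 rather than 3 because autograd sums the contributions from both uses of that weight.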

We can easily implement this model as a Module subclass:

In [10]:
import random
import torch
from torch.autograd import Variable


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement over Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss (.item() returns the scalar loss as a Python number)
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 616.4950561523438
1 624.13232421875
2 612.376953125
3 612.3579711914062
4 605.9171142578125
5 610.0383911132812
6 608.7595825195312
7 607.425048828125
8 606.0736694335938
9 604.7122192382812
10 575.9383544921875
11 638.8399047851562
12 601.6058349609375
13 600.787841796875
14 589.0020751953125
15 587.5262451171875
16 488.610595703125
17 543.8460083007812
18 596.8078002929688
19 381.94781494140625
20 522.242919921875
21 306.5039367675781
22 571.390625
23 232.33773803710938
24 195.53819274902344
25 158.1525421142578
26 123.24190521240234
27 584.6608276367188
28 74.57864379882812
29 431.55987548828125
30 410.3700256347656
31 62.86027908325195
32 565.00390625
33 498.255126953125
34 88.38285064697266
35 277.1981506347656
36 90.76847839355469
37 79.22393798828125
38 58.87898635864258
39 208.7838897705078
40 483.92718505859375
41 466.6461486816406
42 165.0334014892578
43 418.14154052734375
44 130.67935180664062
45 111.69049072265625
46 249.76788330078125
47 298.7792053222656
48 259.45672607

385 0.8836936950683594
386 0.23063074052333832
387 1.4905411005020142
388 0.31758615374565125
389 0.31696560978889465
390 0.26454928517341614
391 0.1838505119085312
392 1.5824717283248901
393 1.5122206211090088
394 0.5749168992042542
395 0.9176751375198364
396 1.9435492753982544
397 0.8304876685142517
398 2.260817766189575
399 0.8937113881111145
400 0.58250492811203
401 0.16422437131404877
402 2.8824357986450195
403 4.575019836425781
404 1.1119542121887207
405 0.5746039748191833
406 1.4114466905593872
407 9.314682006835938
408 2.8874685764312744
409 0.3766322433948517
410 0.8344153761863708
411 0.5082651376724243
412 7.473647117614746
413 2.590726375579834
414 0.8216308951377869
415 0.5173829793930054
416 2.9165520668029785
417 0.7470154166221619
418 7.997786998748779
419 2.9960172176361084
420 0.2547634243965149
421 3.80987548828125
422 3.431053400039673
423 0.6235431432723999
424 0.3510955572128296
425 5.40863037109375
426 6.703663349151611
427 1.4078397750854492
428 3.01272964477539