In [7]:
import random
import torch
%matplotlib inline


## 1. nn

- Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks, raw autograd can be a bit too low-level.

- When building neural networks we frequently think of arranging the computation into __layers__, some of which have __learnable parameters__ which will be optimized during learning.

- In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

- ``nn`` package 
    - defines a set of __Modules, which are roughly equivalent to neural network layers.__ A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. 
    - defines a set of __useful loss functions__ that are commonly used when training neural networks.

- In this example we use the nn package to implement our two-layer network:

In [8]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 664.2083740234375
1 613.20263671875
2 569.9618530273438
3 532.9740600585938
4 500.2718200683594
5 471.0628662109375
6 444.5059509277344
7 420.1546936035156
8 397.9674072265625
9 377.5189514160156
10 358.4212341308594
11 340.35308837890625
12 323.27618408203125
13 307.2858581542969
14 292.22259521484375
15 277.8974304199219
16 264.3196716308594
17 251.36045837402344
18 239.02615356445312
19 227.1866455078125
20 215.84506225585938
21 204.9761199951172
22 194.56529235839844
23 184.64404296875
24 175.15016174316406
25 166.0818328857422
26 157.4344024658203
27 149.1746063232422
28 141.2951202392578
29 133.7926788330078
30 126.61939239501953
31 119.76414489746094
32 113.26631164550781
33 107.07380676269531
34 101.20528411865234
35 95.65251922607422
36 90.3940200805664
37 85.40951538085938
38 80.7071304321289
39 76.25898742675781
40 72.0533447265625
41 68.08563232421875
42 64.34671783447266
43 60.824951171875
44 57.511436462402344
45 54.39257049560547
46 51.45261764526367
47 48.6845703125
4
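
To make the "internal state" of a Module concrete, the following minimal sketch lists the learnable Tensors held by a small ``nn.Sequential`` model. The layer sizes are hypothetical and chosen only to keep the printout short; they are not the sizes used in the example above.

```python
import torch

# Hypothetical small sizes, chosen only to keep the printout short.
D_in, H, D_out = 4, 3, 2

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# named_parameters() yields (name, Parameter) pairs for every learnable Tensor.
# In a Sequential, each name is prefixed with the position of its submodule.
# ReLU (position 1) has no parameters, so it does not appear.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# Parameters are created with requires_grad=True, so autograd tracks them.
all_require_grad = all(p.requires_grad for p in model.parameters())

# Total number of learnable scalars: (4*3 + 3) + (3*2 + 2) = 23
num_params = sum(p.numel() for p in model.parameters())
print(num_params)
```

Note that ``nn.Linear`` stores its weight with shape ``(out_features, in_features)``, which is why the first weight is 3 x 4 rather than 4 x 3.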

## 2. optim

- Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with ``torch.no_grad()`` or ``.data`` to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

- The ``optim`` package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

- In this example we will use the ``nn`` package to define our model as before, but we will optimize the model using the Adam algorithm provided by the ``optim`` package:

In [4]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 757.7068481445312
1 738.934814453125
2 720.7792358398438
3 703.296875
4 686.3145141601562
5 669.7926635742188
6 653.6849975585938
7 637.9871826171875
8 622.8261108398438
9 608.034912109375
10 593.6647338867188
11 579.776123046875
12 566.3594360351562
13 553.32177734375
14 540.6869506835938
15 528.3517456054688
16 516.3473510742188
17 504.65093994140625
18 493.3089599609375
19 482.2464599609375
20 471.4505615234375
21 460.91064453125
22 450.7049255371094
23 440.7840881347656
24 431.0689392089844
25 421.6276550292969
26 412.4748229980469
27 403.5263977050781
28 394.77154541015625
29 386.2185974121094
30 377.8403015136719
31 369.61151123046875
32 361.56732177734375
33 353.6968688964844
34 345.9980163574219
35 338.5141296386719
36 331.1979064941406
37 324.01470947265625
38 316.9681701660156
39 310.0837097167969
40 303.3055419921875
41 296.65509033203125
42 290.1338806152344
43 283.73553466796875
44 277.4610290527344
45 271.3208923339844
46 265.29095458984375
47 259.3559875488281
48 253.5

375 0.00025778700364753604
376 0.0002470437902957201
377 0.00023678482102695853
378 0.00022697688837070018
379 0.00021760260278824717
380 0.00020863892859779298
381 0.00020006504200864583
382 0.00019186495046596974
383 0.0001840223849285394
384 0.00017651332018431276
385 0.00016932982543949038
386 0.00016245577717199922
387 0.00015587166126351804
388 0.00014957028906792402
389 0.00014353357255458832
390 0.00013775443949270993
391 0.00013221637345850468
392 0.0001269185886485502
393 0.00012183503713458776
394 0.00011696080036927015
395 0.0001122969770221971
396 0.0001078205241356045
397 0.00010353610559832305
398 9.942062752088532e-05
399 9.547619265504181e-05
400 9.169567056233063e-05
401 8.806530240690336e-05
402 8.458482625428587e-05
403 8.124515443341807e-05
404 7.804319466231391e-05
405 7.496486068703234e-05
406 7.201651169452816e-05
407 6.918263534316793e-05
408 6.646468682447448e-05
409 6.385561573551968e-05
410 6.13503361819312e-05
411 5.894368950976059e-05
412 5.663415868184529
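
The two optimizer calls in the loop can be checked in isolation. The sketch below is illustrative only and uses a single hypothetical ``nn.Linear`` layer with made-up sizes: it shows that gradients accumulate when they are not zeroed, and that ``step()`` for plain SGD (no momentum) matches the manual update from section 1. Adam applies a different update rule, so the second check does not carry over to it.

```python
import copy
import torch

torch.manual_seed(0)
x = torch.randn(8, 5)
y = torch.randn(8, 2)

model = torch.nn.Linear(5, 2)  # hypothetical tiny model for illustration
loss_fn = torch.nn.MSELoss(reduction='sum')

# 1) Gradients accumulate: a second backward pass without zeroing in between
#    adds to .grad instead of overwriting it.
loss_fn(model(x), y).backward()
grad_once = model.weight.grad.clone()
loss_fn(model(x), y).backward()
grad_twice = model.weight.grad.clone()
accumulated = torch.allclose(grad_twice, 2 * grad_once)

# 2) For plain SGD, optimizer.step() matches the manual rule p -= lr * p.grad.
lr = 1e-2
model.zero_grad()
manual = copy.deepcopy(model)  # independent copy with identical weights
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

loss_fn(model(x), y).backward()
optimizer.step()

loss_fn(manual(x), y).backward()
with torch.no_grad():
    for p in manual.parameters():
        p -= lr * p.grad

same_update = all(
    torch.allclose(p, q)
    for p, q in zip(model.parameters(), manual.parameters())
)
print(accumulated, same_update)
```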

## 3. Custom nn Modules
- Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing ``nn.Module`` and defining a ``forward`` method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.

- In this example we implement our two-layer network as a custom Module subclass:

In [5]:
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 753.0281372070312
1 695.0186157226562
2 644.6880493164062
3 600.869140625
4 562.0409545898438
5 526.9856567382812
6 495.1033020019531
7 465.87457275390625
8 438.85223388671875
9 414.0561218261719
10 390.8634338378906
11 369.11578369140625
12 348.5247802734375
13 329.1452941894531
14 310.7804260253906
15 293.3653564453125
16 276.7524108886719
17 260.88262939453125
18 245.7884063720703
19 231.39659118652344
20 217.69639587402344
21 204.62905883789062
22 192.15101623535156
23 180.30838012695312
24 169.07070922851562
25 158.4312744140625
26 148.36309814453125
27 138.88475036621094
28 129.93531799316406
29 121.48394012451172
30 113.52424621582031
31 106.04120635986328
32 99.02017211914062
33 92.4327163696289
34 86.25572967529297
35 80.47956085205078
36 75.06902313232422
37 70.01944732666016
38 65.30133056640625
39 60.908119201660156
40 56.80889892578125
41 52.99382400512695
42 49.444557189941406
43 46.14909362792969
44 43.0823860168457
45 40.233028411865234
46 37.59233093261719
47 35.1342
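
The reason ``model.parameters()`` can find the weights of a custom Module is that sub-Modules assigned as attributes in the constructor are registered automatically. The sketch below uses a hypothetical ``TinyNet`` class (a copy of ``TwoLayerNet`` with made-up sizes) to show the registered names, and also checks that ``clamp(min=0)`` gives the same result as ReLU.

```python
import torch


class TinyNet(torch.nn.Module):  # hypothetical name; mirrors TwoLayerNet
    def __init__(self, D_in, H, D_out):
        super().__init__()
        # Assigning a Module to an attribute registers it as a sub-Module.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)


torch.manual_seed(0)
model = TinyNet(6, 4, 2)

# Parameter names are built from the attribute names used in __init__.
names = [name for name, _ in model.named_parameters()]
print(names)

x = torch.randn(3, 6)
h = model.linear1(x)

# clamp(min=0) and ReLU both replace negative entries with zero.
relu_matches_clamp = torch.equal(h.clamp(min=0), torch.relu(h))

# A batch of 3 inputs produces a batch of 3 outputs with D_out = 2 features.
out_shape = tuple(model(x).shape)
print(relu_matches_clamp, out_shape)
```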

## 4. Control Flow + Weight Sharing
As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers (the first hidden layer plus zero to three applications of a shared middle layer), reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.

We can easily implement this model as a Module subclass:

In [None]:
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
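
To see what weight sharing means for the parameters and their gradients, here is a minimal sketch using a hypothetical ``SharedNet`` class. It is a deterministic variant of ``DynamicNet`` in which the number of middle-layer applications is passed in as an argument (``n_middle``, an assumption made for illustration) rather than drawn at random, and the sizes are made up.

```python
import torch


class SharedNet(torch.nn.Module):  # hypothetical deterministic DynamicNet
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x, n_middle):
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(n_middle):
            # The same Module (same weights) is applied on every iteration.
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        return self.output_linear(h_relu)


torch.manual_seed(0)
x = torch.randn(5, 8)
model = SharedNet(8, 4, 2)

# The parameter count is fixed by the constructor, however often a layer is
# reused in forward: 3 layers * (weight, bias) = 6 Tensors.
n_tensors = len(list(model.parameters()))

# With zero reuses, middle_linear never enters the graph, so on a fresh model
# its .grad is still None after backward().
model(x, n_middle=0).sum().backward()
unused_grad_is_none = model.middle_linear.weight.grad is None

# With three reuses, a single .grad Tensor of the weight's shape collects the
# contributions from all three applications.
model.zero_grad()
model(x, n_middle=3).sum().backward()
shared_grad_shape = tuple(model.middle_linear.weight.grad.shape)
print(n_tensors, unused_grad_is_none, shared_grad_shape)
```

One consequence for ``DynamicNet``: on forward passes where the random count is 0, the shared middle layer takes no part in the computation and contributes nothing to that pass's gradient.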