In [7]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden layer dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

#Create random input and output data 
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

#Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    #Forward pass: compute the predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    #Compute loss and print it
    loss = np.square(y_pred - y).sum()
    print(t,loss)
    
    #Compute the gradients
    y_grad = 2.0 * (y_pred - y)
    w2_grad = h_relu.T.dot(y_grad)
    h_relu_grad = y_grad.dot(w2.T)
    h_grad = h_relu_grad.copy()
    h_grad[h<0] = 0
    w1_grad = x.T.dot(h_grad)
    
    #Update the weights using gradient descent
    w1 -= learning_rate * w1_grad
    w2 -= learning_rate * w2_grad

0 33416688.731573217
1 29774551.625532843
2 29136253.50555642
3 26900217.472406194
4 21769544.80425965
5 14973572.327006321
6 9145007.780169202
7 5268953.264142139
8 3111015.546390496
9 1973010.9041388873
10 1369696.723663126
11 1027960.3916607009
12 816452.1477389182
13 672444.0665158469
14 566643.4442578931
15 484369.33157412196
16 418046.6214056208
17 363230.5803970932
18 317287.5086784802
19 278356.6036456486
20 245066.16900404275
21 216430.03899445065
22 191679.67692930577
23 170202.13401744084
24 151500.0246984833
25 135178.82359028666
26 120819.90121314477
27 108217.00342918732
28 97121.43410311024
29 87330.92730560421
30 78667.57135275748
31 70981.27457946676
32 64151.07089488104
33 58070.99984924669
34 52651.958107883205
35 47810.63937171379
36 43477.395220625694
37 39594.60873337394
38 36103.30375547569
39 32958.358185596604
40 30122.00930667408
41 27561.282209922207
42 25243.540380302584
43 23146.80307678914
44 21245.606533706723
45 19519.008535450324
46 17949.100577036355
4
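One way to gain confidence in a hand-written backward pass like the one above is to compare it against a finite-difference estimate. The sketch below is only an illustration: the tiny layer sizes and the helper names `loss_fn`, `analytic_grads`, and `numeric_grad` are assumptions introduced here, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny sizes (assumed for illustration) so the element-wise check stays fast
N, D_in, H, D_out = 4, 6, 5, 3
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    # Same forward pass and sum-of-squares loss as the example above
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    # Same manual backward pass as the example above
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_grad = 2.0 * (h_relu.dot(w2) - y)
    w2_grad = h_relu.T.dot(y_grad)
    h_grad = y_grad.dot(w2.T)
    h_grad[h < 0] = 0
    w1_grad = x.T.dot(h_grad)
    return w1_grad, w2_grad


def numeric_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


w1_grad, w2_grad = analytic_grads(w1, w2)
w1_num = numeric_grad(lambda: loss_fn(w1, w2), w1)
w2_num = numeric_grad(lambda: loss_fn(w1, w2), w2)

# Both differences should be tiny if the manual gradients are right
print(np.abs(w1_grad - w1_num).max(), np.abs(w2_grad - w2_num).max())
```

The check loops over every weight, so it is only practical for small networks; it is meant as a debugging aid rather than a training-time tool.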

# PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of <b>50x or greater</b>, so unfortunately numpy won't be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they're also useful as a generic tool for scientific computing.
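To make the similarity to numpy concrete, here is a small sketch that runs the same matrix product in both libraries and converts between them. The variable names are made up for illustration.

```python
import numpy as np
import torch

# A small numpy array and a Tensor view of the same memory
a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a_t = torch.from_numpy(a_np)

# The same matrix product, once per library
b_np = a_np.dot(a_np.T)
b_t = a_t.mm(a_t.t())
print(np.allclose(b_t.numpy(), b_np))

# torch.from_numpy shares storage, so a write through numpy is visible in the Tensor
a_np[0, 0] = 100.0
print(a_t[0, 0].item())
```

Because `torch.from_numpy` shares memory with the source array, use `torch.tensor(a_np)` instead when an independent copy is wanted.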

Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on GPU, you simply need to specify the appropriate device when creating it, or move an existing Tensor to that device.
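As a minimal sketch of device placement, the snippet below picks a GPU when one is available and falls back to the CPU otherwise. The fallback logic and variable names are assumptions for illustration; the notebook itself hard-codes the device.

```python
import torch

# Use the first GPU if PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device
a = torch.randn(3, 4, device=device)

# Or create it on the CPU first and move it afterwards
b = torch.randn(3, 4).to(device)

# Operands must live on the same device; the result stays there too
c = a + b
print(c.device)
```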

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network.




In [35]:
import torch

dtype = torch.float
device = torch.device('cpu')
#device = torch.device('cuda:0')
#In the case of using GPU

# N is batch size; D_in is input dimension
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

#Create random input and output data
x = torch.randn(N, D_in, dtype=dtype, device=device)
y = torch.randn(N, D_out, dtype=dtype, device=device)

#Randomly initialize weights
w1 = torch.randn(D_in, H, dtype=dtype, device=device)
w2 = torch.randn(H, D_out, dtype=dtype, device=device)

learning_rate = 1e-6
for t in range(500):
    #Forward pass: compute the predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min = 0)
    y_pred = h_relu.mm(w2)
    
    #Calculation of Loss Function
    loss = (y_pred - y).pow(2).sum().item()
    print(t,loss)
    
    #Backprop to compute gradients of w1 and w2 
    y_grad = 2.0 * (y_pred - y)
    w2_grad = h_relu.t().mm(y_grad)
    h_relu_grad = y_grad.mm(w2.t())
    h_grad = h_relu_grad.clone()
    h_grad[h < 0] = 0
    w1_grad = x.t().mm(h_grad)
    
    #Updating weights
    w1 -= learning_rate * w1_grad
    w2 -= learning_rate * w2_grad

0 30386300.0
1 25800668.0
2 23062528.0
3 19440752.0
4 14967786.0
5 10374851.0
6 6764020.5
7 4303229.0
8 2803734.25
9 1912248.875
10 1382378.875
11 1053831.875
12 839214.625
13 690149.5625
14 580817.5625
15 496696.9375
16 429588.25
17 374686.34375
18 328944.9375
19 290329.21875
20 257361.96875
21 228932.1875
22 204297.984375
23 182826.796875
24 164011.5625
25 147475.78125
26 132923.734375
27 120044.5234375
28 108611.703125
29 98435.5390625
30 89353.078125
31 81227.203125
32 73941.109375
33 67392.296875
34 61495.9375
35 56177.6328125
36 51376.25
37 47034.78515625
38 43101.66015625
39 39534.0703125
40 36293.75
41 33347.2890625
42 30663.134765625
43 28216.57421875
44 25983.443359375
45 23947.388671875
46 22085.990234375
47 20381.62109375
48 18819.875
49 17388.22265625
50 16073.6337890625
51 14864.6337890625
52 13753.5068359375
53 12731.2236328125
54 11790.3193359375
55 10923.3603515625
56 10124.1376953125
57 9387.15625
58 8707.4677734375
59 8079.5205078125
60 7499.9091796875
61 6963.960937

# AUTOGRAD
## PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use <font color='red'>automatic differentiation</font> to automate the computation of backward passes in neural networks. The <b>autograd</b> package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a <b>computational graph</b>; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. Each Tensor represents a node in a computational graph. If <b>x</b> is a Tensor that has <b>x.requires_grad=True</b>, then <b>x.grad</b> is another Tensor holding the gradient of some scalar value with respect to <b>x</b>.
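A tiny sketch of `requires_grad` and `.grad` in isolation may help before the full network. The tensor values and names here are made up for illustration.

```python
import torch

# x is a leaf Tensor that autograd should track
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# s = x1^2 + x2^2 + x3^2 is the scalar we differentiate
s = (x ** 2).sum()
s.backward()

# ds/dx = 2 * x
print(x.grad)  # tensor([2., 4., 6.])
first_grad = x.grad.clone()

# Gradients accumulate across backward calls unless they are reset
s2 = (x ** 2).sum()
s2.backward()
print(x.grad)  # tensor([ 4.,  8., 12.])
accumulated_grad = x.grad.clone()

# Reset in place before the next backward pass
x.grad.zero_()
```

The accumulation shown in the second call is the reason training loops zero the gradients on every iteration.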

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network.

In [7]:
import torch

dtype = torch.float
device = torch.device('cpu')

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.

N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.

w1 = torch.randn(D_in, H, dtype=dtype, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. These are
    # exactly the same operations we used to compute the forward pass in the
    # previous example, but we do not need to keep references to intermediate
    # values since we are not implementing the backward pass by hand.
    
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ());
    # loss.item() gets the scalar value held in the loss.
    
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd. An alternative way is to operate on weight.data and
    # weight.grad.data. Recall that tensor.data gives a tensor that shares the
    # storage with tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        #Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()



0 35135760.0
1 36815172.0
2 43598112.0
3 46070864.0
4 37546428.0
5 21762346.0
6 9693062.0
7 4060842.25
8 2020838.0
9 1273380.375
10 946319.4375
11 762355.8125
12 636379.0625
13 540421.875
14 463463.53125
15 400233.875
16 347605.6875
17 303414.375
18 266048.34375
19 234280.5
20 207095.109375
21 183715.1875
22 163502.609375
23 145954.5625
24 130652.2109375
25 117243.796875
26 105467.1328125
27 95086.3984375
28 85917.5390625
29 77814.25
30 70615.421875
31 64198.94921875
32 58464.484375
33 53328.49609375
34 48721.36328125
35 44581.50390625
36 40850.78515625
37 37486.921875
38 34445.0859375
39 31687.849609375
40 29183.880859375
41 26907.39453125
42 24835.537109375
43 22946.375
44 21222.69921875
45 19646.515625
46 18204.865234375
47 16885.916015625
48 15675.9443359375
49 14564.248046875
50 13541.853515625
51 12601.9541015625
52 11735.8857421875
53 10936.837890625
54 10199.08984375
55 9517.29296875
56 8886.5400390625
57 8302.515625
58 7761.33251953125
59 7259.4365234375
60 6793.5341796875
61 
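The comments in the autograd example above mention `torch.optim.SGD` as an alternative to the manual update. A minimal sketch of that variant follows; the fixed seed, the shorter 50-step loop, and the `losses` list are assumptions added here so the run is short and easy to inspect.

```python
import torch

torch.manual_seed(0)  # assumed here only to make the run repeatable

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# The optimizer takes over the update and gradient-reset steps
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for t in range(50):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # populate w1.grad and w2.grad
    optimizer.step()       # w -= lr * w.grad for each tracked weight

print(losses[0], losses[-1])
```

With plain SGD and no momentum, `optimizer.step()` performs the same update as the `torch.no_grad()` block, so the two versions should behave alike.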

# PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The <b>forward</b> function computes output Tensors from input Tensors. The <b>backward</b> function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.
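The forward/backward pairing can be observed on a built-in operator. In the sketch below, the `upstream` values are made up and stand in for the gradient that would arrive from later operations in the graph.

```python
import torch

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)

# Forward: clamp computes the output and records a backward node on it
y = x.clamp(min=0)
print(y.grad_fn)

# Pretend this is the gradient of some scalar with respect to y
upstream = torch.tensor([10.0, 20.0, 30.0])

# Backward: turn the gradient w.r.t. the output into the gradient w.r.t. the input
(grad_x,) = torch.autograd.grad(y, x, grad_outputs=upstream)
print(grad_x)  # tensor([ 0., 20., 30.])
```

The entry for the negative input is zeroed because the clamp output does not depend on it; the other entries pass the upstream gradient through unchanged.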

In PyTorch we can easily define our own autograd operator by defining a subclass of <b>torch.autograd.Function</b> and implementing the <b>forward</b> and <b>backward</b> functions. We can then use our new autograd operator by calling its <b>apply</b> method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for the ReLU nonlinearity, and use it to implement our two-layer network.

In [14]:
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors for
        use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 29198410.0
1 23361508.0
2 21252806.0
3 19660410.0
4 17021246.0
5 13402487.0
6 9525141.0
7 6314527.5
8 4032687.75
9 2594911.25
10 1725700.0
11 1207565.125
12 889986.9375
13 687185.5625
14 550486.5625
15 453159.21875
16 380384.96875
17 323699.25
18 278126.75
19 240711.25
20 209547.703125
21 183240.484375
22 160828.84375
23 141601.125
24 125019.2734375
25 110659.71875
26 98189.8984375
27 87318.2734375
28 77808.3203125
29 69463.6328125
30 62123.92578125
31 55655.32421875
32 49942.859375
33 44884.203125
34 40403.5234375
35 36421.09765625
36 32875.375
37 29712.10546875
38 26884.1171875
39 24353.224609375
40 22083.44140625
41 20048.193359375
42 18217.46484375
43 16571.005859375
44 15087.5830078125
45 13750.416015625
46 12540.521484375
47 11451.0615234375
48 10476.37890625
49 9593.5654296875
50 8793.185546875
51 8066.9482421875
52 7406.24951171875
53 6805.5625
54 6259.0322265625
55 5760.8251953125
56 5305.71533203125
57 4890.51416015625
58 4510.83447265625
59 4163.845703125
60 3845.874267578

### Using the nn package to implement a two-layer network

In [22]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Passing
# reduction='sum' sums the squared errors instead of averaging them.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 737.6576538085938
1 682.9744873046875
2 636.355712890625
3 595.7308959960938
4 559.5975341796875
5 526.9066162109375
6 497.07208251953125
7 469.4734802246094
8 443.84698486328125
9 419.92279052734375
10 397.3864440917969
11 375.9966125488281
12 355.5753173828125
13 336.17425537109375
14 317.5948791503906
15 299.9095764160156
16 283.0068054199219
17 266.9005432128906
18 251.49017333984375
19 236.72706604003906
20 222.6747283935547
21 209.28921508789062
22 196.59278869628906
23 184.50180053710938
24 173.05487060546875
25 162.22669982910156
26 151.95916748046875
27 142.26490783691406
28 133.07485961914062
29 124.42667388916016
30 116.29397583007812
31 108.66077423095703
32 101.50205993652344
33 94.78702545166016
34 88.50553131103516
35 82.64152526855469
36 77.15548706054688
37 72.03953552246094
38 67.27945709228516
39 62.822391510009766
40 58.647972106933594
41 54.76618957519531
42 51.155189514160156
43 47.80793380737305
44 44.69593048095703
45 41.80229568481445
46 39.11026382446289
47 
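The `loss_fn` used in the nn example above is constructed with `reduction='sum'`, which should match the hand-written sum-of-squares loss from the earlier sections. A small sketch of that comparison, using random tensors and names made up for illustration:

```python
import torch

torch.manual_seed(0)  # assumed only for repeatability
y_pred = torch.randn(64, 10)
y = torch.randn(64, 10)

# Hand-written loss from the earlier examples
manual = (y_pred - y).pow(2).sum()

# reduction='sum' adds up the squared errors
loss_sum = torch.nn.MSELoss(reduction='sum')(y_pred, y)

# reduction='mean' (the default) divides by the number of elements
loss_mean = torch.nn.MSELoss(reduction='mean')(y_pred, y)

print(torch.allclose(loss_sum, manual))
print(torch.allclose(loss_mean, manual / y.numel()))
```

Since the summed loss is `y.numel()` times larger than the mean, switching between the two reductions usually calls for rescaling the learning rate as well.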

0
1
