## Tensors 

In [2]:
# using numpy
import numpy as np
# N: batch size; D_in: input dimension
# H: hidden dimension; D_out: output dimension

N, D_in, H, D_out = 64, 1000, 100, 10

#create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

#randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h<0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

(0, 34528932.135845914)
(1, 32233559.504211232)
(2, 33637379.32163124)
(3, 32729881.233941883)
(4, 26543979.304093506)
(5, 17349589.611975364)
(6, 9544432.137678977)
(7, 4954848.776044527)
(8, 2728856.241914607)
(9, 1709070.0824030486)
(10, 1213497.9399404726)
(11, 940729.4771918561)
(12, 767401.122286196)
(13, 643744.5150915214)
(14, 548679.3419562577)
(15, 472433.7791253474)
(16, 409773.0596568176)
(17, 357342.43863683194)
(18, 313073.36784072954)
(19, 275472.0360758031)
(20, 243267.23757001435)
(21, 215526.6762221221)
(22, 191538.43857844072)
(23, 170716.87442380248)
(24, 152554.29188618512)
(25, 136658.0918551303)
(26, 122692.71781732657)
(27, 110388.20657360664)
(28, 99530.46966605597)
(29, 89910.03271076197)
(30, 81379.6314915779)
(31, 73790.68902420388)
(32, 67015.2552729126)
(33, 60952.709488922104)
(34, 55525.6496128752)
(35, 50651.25484846148)
(36, 46262.99761551886)
(37, 42306.0591430617)
(38, 38732.16030688332)
(39, 35500.81493079493)
(40, 32572.262539095434)
(41, 29912.732

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
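
To make the numpy/Tensor correspondence concrete, here is a minimal sketch that builds a Tensor from a numpy array and runs the same matrix product both ways. The array values and variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
import torch

# A small NumPy array and a Tensor created from it.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)  # CPU Tensor that shares memory with `a`

# The same matrix product written in both libraries.
np_result = a.dot(a.T)
torch_result = t.mm(t.t())

# Convert the Tensor result back to NumPy to compare.
same = np.allclose(np_result, torch_result.numpy())
print(same)

# Because memory is shared, a write to the array is visible in the Tensor.
a[0, 0] = 100.0
shares_memory = t[0, 0].item() == 100.0
print(shares_memory)
```

The shared memory applies to CPU Tensors created with `torch.from_numpy`; copying with `torch.tensor(a)` would give an independent Tensor instead.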

Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you create it on (or move it to) a CUDA device, for example by passing device=torch.device("cuda:0") as in the code below.
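
As a rough sketch of device placement, the snippet below picks a CUDA device when one is available and falls back to the CPU otherwise. The fallback logic and the tensor shapes are assumptions added for illustration; the cells in this notebook hard-code the device instead.

```python
import torch

# Fall back to the CPU when no CUDA device is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x_cpu = torch.randn(4, 3)              # created on the CPU by default
x_dev = x_cpu.to(device)               # moved (or left in place) on the chosen device
w = torch.randn(3, 2, device=device)   # created directly on the chosen device

# Both operands must live on the same device for the product to work.
out = x_dev.mm(w)
print(out.device, out.shape)
```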

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we need to manually implement the forward and backward passes through the network:

In [11]:
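# pytorch tensor version
# NOTE: a bare %time line times an empty statement, which is why the timing
# printed below is only a few microseconds. Putting %%time on the first line
# of the cell would time the whole cell instead.
%time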
import torch
dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0")

# N: batch size; D_in: input dimension
# H: hidden dimension; D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# create random input and output data
x = torch.randn(N, D_in, dtype=dtype, device=device)
y = torch.randn(N, D_out, dtype=dtype, device=device)

#randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

for i in range(500):
    #forward 
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    #compute loss
    loss = (y_pred - y).pow(2).sum().item()
    print(i, loss)
    
    #backprop
    grad_y_pred = 2.0 * (y_pred -y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0]=0
    grad_w1 = x.t().mm(grad_h)
    
    #update
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    

CPU times: user 1 µs, sys: 1 µs, total: 2 µs
Wall time: 4.05 µs
(0, 31173308.0)
(1, 30147892.0)
(2, 33895816.0)
(3, 36641920.0)
(4, 33344694.0)
(5, 23637898.0)
(6, 13210258.0)
(7, 6450733.0)
(8, 3230327.75)
(9, 1859098.25)
(10, 1257263.75)
(11, 953752.3125)
(12, 771352.5625)
(13, 644351.75)
(14, 547311.125)
(15, 469421.375)
(16, 405337.84375)
(17, 351865.5625)
(18, 306817.46875)
(19, 268627.5)
(20, 236045.046875)
(21, 208101.25)
(22, 184039.0)
(23, 163241.125)
(24, 145184.921875)
(25, 129455.8125)
(26, 115717.9921875)
(27, 103673.90625)
(28, 93088.703125)
(29, 83751.328125)
(30, 75489.9921875)
(31, 68163.765625)
(32, 61651.75)
(33, 55847.328125)
(34, 50664.68359375)
(35, 46032.5625)
(36, 41883.4921875)
(37, 38157.43359375)
(38, 34803.94140625)
(39, 31783.60546875)
(40, 29064.1171875)
(41, 26612.8359375)
(42, 24392.48828125)
(43, 22378.384765625)
(44, 20549.484375)
(45, 18885.083984375)
(46, 17369.515625)
(47, 15987.884765625)
(48, 14726.8359375)
(49, 13575.1591796875)
(50, 12522.078125

## Autograd
### PyTorch: Tensors and autograd
In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.
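
One way to gain confidence in a hand-written backward pass like the ones above is to compare it against finite differences. The sketch below does this in numpy for the same two-layer network, using much smaller dimensions so the check is fast. The helper names (`loss_fn`, `manual_grads`, `numeric_grad`), the sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes, chosen only for the check
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def manual_grads(w1, w2):
    """Hand-written backward pass, mirroring the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    return x.T.dot(grad_h), grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central-difference gradient of f() with respect to the array w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = manual_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
max_err = max(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
print(max_err)
```

The check costs two forward passes per weight entry, so it is only practical on small networks. That cost is one reason automatic differentiation is preferable for anything larger.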

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad=True, then after backpropagation x.grad is another Tensor holding the gradient of some scalar value (typically the loss) with respect to x.
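
A minimal sketch of this behaviour, using a made-up three-element Tensor and the scalar s = sum(x**2), whose gradient with respect to x is 2x:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = (x ** 2).sum()  # scalar built from x; autograd records the graph

# x is a leaf created by the user, so it has no grad_fn;
# s was produced by an operation, so it does.
print(x.grad_fn is None, s.grad_fn is not None)

s.backward()        # backpropagate from the scalar s
print(x.grad)       # gradient of s with respect to x, i.e. 2 * x
```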

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [13]:
%time
import torch

dtype = torch.float
device = torch.device("cpu")
#device = torch.device("cuda:0")
# N: batch size; D_in: input dimension
# H: hidden dimension; D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional (scalar) Tensor of shape ().
    # loss.item() gets the Python number held in the loss.
    loss = (y_pred -y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()
    

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs
(0, 30193284.0)
(1, 27335972.0)
(2, 28052356.0)
(3, 28322944.0)
(4, 25304272.0)
(5, 19202206.0)
(6, 12352478.0)
(7, 7166758.0)
(8, 4048828.0)
(9, 2408390.75)
(10, 1569851.0)
(11, 1126226.75)
(12, 870342.8125)
(13, 706824.125)
(14, 591754.0)
(15, 504692.875)
(16, 435502.5)
(17, 378687.34375)
(18, 331161.5)
(19, 290890.96875)
(20, 256522.5)
(21, 227022.296875)
(22, 201538.96875)
(23, 179403.625)
(24, 160113.953125)
(25, 143255.34375)
(26, 128471.296875)
(27, 115463.15625)
(28, 103986.8984375)
(29, 93824.8359375)
(30, 84810.6640625)
(31, 76792.203125)
(32, 69640.140625)
(33, 63251.71875)
(34, 57534.18359375)
(35, 52406.546875)
(36, 47802.8671875)
(37, 43671.5859375)
(38, 39946.09375)
(39, 36581.796875)
(40, 33537.5859375)
(41, 30778.640625)
(42, 28275.775390625)
(43, 26001.205078125)
(44, 23932.19140625)
(45, 22048.978515625)
(46, 20331.65234375)
(47, 18762.380859375)
(48, 17328.60546875)
(49, 16017.8291015625)
(50, 14818.277
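
The comment in the cell above mentions torch.optim.SGD as an alternative to updating the weights by hand. A minimal sketch of that variant follows. It assumes the CPU, a fixed seed, and only 50 iterations, all chosen here for illustration rather than taken from the original run.

```python
import torch

torch.manual_seed(0)  # illustrative seed, not from the original notebook
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# The optimizer holds references to the weights it should update.
optimizer = torch.optim.SGD([w1, w2], lr=1e-6)

losses = []
for t in range(50):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients accumulated by the previous step
    loss.backward()        # populate w1.grad and w2.grad
    optimizer.step()       # apply w -= lr * w.grad to each weight

print(losses[0], losses[-1])
```

Here `optimizer.step()` takes the place of the `torch.no_grad()` block, and `optimizer.zero_grad()` replaces the manual `grad.zero_()` calls.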

## PyTorch: Defining new autograd functions
Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions as static methods. We can then use our new autograd operator by calling its apply method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

In [14]:
%time
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

dtype = torch.float
device = torch.device("cuda:0")
# device = torch.device("cpu")  # Uncomment this to run on CPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()      

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs
(0, 36068840.0)
(1, 29475204.0)
(2, 24179312.0)
(3, 18218820.0)
(4, 12399582.0)
(5, 7888791.0)
(6, 4953786.0)
(7, 3225837.0)
(8, 2235222.0)
(9, 1651326.375)
(10, 1285989.125)
(11, 1040177.0625)
(12, 863155.5625)
(13, 728682.125)
(14, 622616.75)
(15, 536783.875)
(16, 465761.15625)
(17, 406337.28125)
(18, 356146.0)
(19, 313385.375)
(20, 276768.78125)
(21, 245231.28125)
(22, 217905.1875)
(23, 194143.03125)
(24, 173402.59375)
(25, 155237.453125)
(26, 139298.6875)
(27, 125304.0078125)
(28, 112942.765625)
(29, 101984.7578125)
(30, 92254.5859375)
(31, 83593.265625)
(32, 75880.328125)
(33, 68984.8515625)
(34, 62803.94921875)
(35, 57254.578125)
(36, 52261.5859375)
(37, 47762.7734375)
(38, 43705.13671875)
(39, 40037.20703125)
(40, 36717.59375)
(41, 33710.26171875)
(42, 30982.220703125)
(43, 28501.251953125)
(44, 26246.091796875)
(45, 24192.4375)
(46, 22322.453125)
(47, 20615.04296875)
(48, 19054.7734375)
(49, 17626.208984375)
(50, 1
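
A hand-written backward method in a custom Function such as MyReLU can be compared against numerical gradients with torch.autograd.gradcheck. The sketch below uses a hypothetical `Square` Function instead of MyReLU, because ReLU is not differentiable at zero and a smooth function makes the check more predictable. The input size is likewise an assumption.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example Function computing input**2 elementwise."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input * input

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # d(input**2)/d(input) = 2 * input, chained with the incoming gradient
        return 2.0 * input * grad_output

# gradcheck expects double-precision inputs that require gradients.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Square.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

gradcheck raises an error when the analytical and numerical gradients disagree, so a wrong backward method fails loudly rather than silently training on bad gradients.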

## TensorFlow: Static Graphs
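PyTorch autograd looks a lot like TensorFlow 1.x, the graph-and-session API used in the example below: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow 1.x computational graphs are static and PyTorch uses dynamic computational graphs. (TensorFlow 2 executes eagerly by default, so this contrast applies to the older API.)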

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
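
To illustrate the dynamic-graph side of this comparison, here is a small PyTorch sketch in which an ordinary Python loop decides how many times a weight matrix is applied, and autograd still backpropagates through whatever graph was built. The function name, shapes, and step counts are illustrative assumptions.

```python
import torch

torch.manual_seed(0)  # illustrative seed
w = torch.randn(3, 3, requires_grad=True)

def forward(x, num_steps):
    # Ordinary Python control flow: the graph recorded by autograd
    # contains exactly num_steps matrix products for this call.
    h = x
    for _ in range(num_steps):
        h = torch.tanh(h.mm(w))
    return h.sum()

grad_norms = []
for num_steps in (1, 4):       # a different unroll length per input
    x = torch.randn(2, 3)
    out = forward(x, num_steps)
    w.grad = None              # discard the gradient from the previous input
    out.backward()
    grad_norms.append(w.grad.norm().item())

print(grad_norms)
```

No special loop operator is needed: each call to `forward` builds a fresh graph whose depth matches `num_steps`.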

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

In [16]:
import tensorflow as tf
import numpy as np
%time
# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs
22274388.0
15002860.0
11281076.0
9100265.0
7649930.5
6556337.0
5627670.0
4799289.0
4037848.8
3353943.8
2748241.5
2230587.5
1796526.0
1442284.9
1156499.2
929592.4
750114.5
609045.1
497972.16
410527.0
341335.06
286314.12
242269.66
206747.33
177808.73
154018.16
134281.25
117756.92
103804.7
91932.45
81762.75
72995.73
65386.855
58740.285
52905.68
47766.715
43213.14
39163.043
35549.836
32317.54
29418.734
26813.191
24466.607
22352.621
20441.47
18711.184
17142.775
15719.33
14426.096
13249.576
12178.383
11201.654
10311.056
9497.358
8753.814
8073.377
7450.2637
6879.132
6355.37
5874.731
5433.3213
5027.7363
4654.9336
4311.8564
3995.8513
3704.9524
3436.8489
3189.5007
2961.295
2750.6936
2556.1414
2376.356
2210.142
2056.4368
1914.1514
1782.4141
1660.4688
1547.3923
1442.5629
1345.338
1255.0876
1171.3384
1093.6149
1021.376
954.2449
891.87787
833.8183
779.7829
729.51685
682.7002
639.0997
598.4525
560.56287
525.24805
492.30038
461.53522
432.