# Learning PyTorch with Examples

## 1. Tensors

### 1.1 Warm-up: numpy

Omitted.

### 1.2 PyTorch: Tensors

In [3]:
import torch

dtype = torch.float
device = torch.device('cpu')
#device = torch.device('cuda:0') # To run on GPU

In [4]:
N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimension
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype) # requires_grad defaults to False, so we compute the gradients manually in the backward pass
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

In [6]:
learning_rate = 1e-6
for t in range(20):
    # Forward pass: compute predicted y
    h = x.mm(w1)               # Tensor.mm is matrix multiplication, similar to numpy.dot
    h_relu = h.clamp(min=0)    # Tensor.clamp limits values to a range, clipping anything too large or too small
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)  # Tensor.t transposes, similar to numpy's .T
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()          # Tensor.clone copies, similar to numpy.copy
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 1174780.375
1 941842.6875
2 778356.375
3 655291.9375
4 558364.0
5 479846.90625
6 415080.90625
7 360968.1875
8 315305.8125
9 276538.6875
10 243410.5
11 214961.34375
12 190448.0625
13 169241.234375
14 150804.640625
15 134697.859375
16 120583.421875
17 108177.4921875
18 97238.8515625
19 87577.5546875
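
One way to gain confidence in the hand-written backward pass above is to compare it against autograd on a small problem. The following is a minimal sketch, not part of the original notebook; the reduced sizes, the fixed seed, and the use of double precision are assumptions made only to keep the check fast and numerically tight.

```python
import torch

torch.manual_seed(0)
# Smaller sizes than in the text, chosen only to keep the check fast (assumption)
N, D_in, H, D_out = 8, 20, 10, 5
x = torch.randn(N, D_in, dtype=torch.double)
y = torch.randn(N, D_out, dtype=torch.double)
w1 = torch.randn(D_in, H, dtype=torch.double, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.double, requires_grad=True)

# Forward pass, same operations as in the training loop
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Hand-derived backward pass, computed without autograd
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

# Reference gradients from autograd
loss.backward()

print(torch.allclose(grad_w1, w1.grad))
print(torch.allclose(grad_w2, w2.grad))
```

If both comparisons hold, the manual formulas agree with what autograd computes for the same forward pass.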


## 2. Autograd

### 2.1 PyTorch: Tensors and Autograd

We can use ***automatic differentiation*** to automate the computation of backward passes in neural networks. The ***autograd*** package in PyTorch provides exactly this functionality.

When using ***autograd***, the forward pass of your network will define a **computational graph**: nodes in the graph will be **Tensors**, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

Each Tensor represents a node in a computational graph. If ***x*** is a Tensor that has ***x.requires_grad=True***, then ***x.grad*** is another Tensor holding the gradient of ***x*** with respect to some scalar value.
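
To make the role of `requires_grad` and `.grad` concrete, here is a small illustrative sketch. The tensors `x` and `c` and the scalar `s` are made-up names for this example only.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
c = torch.tensor([10.0, 20.0, 30.0])  # requires_grad defaults to False

s = (c * x.pow(2)).sum()  # scalar: s = sum_i c_i * x_i^2
s.backward()

print(x.grad)  # holds ds/dx_i = 2 * c_i * x_i
print(c.grad)  # None, because no gradient is tracked for c
```

Only `x` receives a gradient; `c` takes part in the computation but is treated as a constant.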

In [1]:
import torch
dtype = torch.float
device = torch.device('cpu')

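> Liu Yao: Note the **requires_grad** flag of the Tensors below. It defaults to False, meaning no gradient is computed for that Tensor. Setting it to True means its gradient is **computed automatically** when loss.backward() is called.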

In [8]:
N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimension

# Create random Tensors to hold input and outputs
# Setting requires_grad=False indicates that we don't need to compute gradients with respect to these Tensors during the backward pass
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights
# Key point: Setting requires_grad=True indicates that we want to compute gradients with respect to these Tensors during the backward pass
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

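> Liu Yao: **loss.backward()** is the key step: it **automatically computes the gradients of the relevant Tensors (w1 and w2)** and stores them in **w1.grad and w2.grad**, ready for the manual weight update that follows.

> Liu Yao: Note **with torch.no_grad()**: because we update the weights by hand, autograd does not need to track history for that step.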

In [13]:
learning_rate = 1e-6
for t in range(10):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)  # The intermediate variables from 1.2 are no longer needed, since they were only used to compute gradients by hand
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(loss.item())
    
    # Key point: Use autograd to compute the backward pass. This call will compute the gradient of loss with respect to all Tensors with requires_grad=True
    # After this call, w1.grad and w2.grad will be Tensors holding the gradient of the loss with respect to w1 and w2 respectively
    loss.backward()
    
    # Key point: Manually update weights using gradient descent.
    # Wrap in torch.no_grad() because weights have requires_grad=True, but we don't need to track this in autograd
    # Alternative way is to operate on weight.data and weight.grad.data, since tensor.data is a tensor that shares the storage but doesn't track history
    # We can also use torch.optim.SGD to achieve this
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

134.58912658691406
128.32373046875
122.36589050292969
116.69249725341797
111.29835510253906
106.16795349121094
101.28012084960938
96.62560272216797
92.19682312011719
87.97853088378906
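
The explicit `w1.grad.zero_()` and `w2.grad.zero_()` calls matter because `backward()` adds into `.grad` rather than overwriting it. A minimal sketch of that behaviour, using a made-up two-element tensor `w`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3.0).sum().backward()
first = w.grad.clone()        # gradient of one backward pass

(w * 3.0).sum().backward()    # no zeroing in between
accumulated = w.grad.clone()  # the second gradient was added to the first

w.grad.zero_()
(w * 3.0).sum().backward()
after_zero = w.grad.clone()   # back to a single-pass gradient

print(first, accumulated, after_zero)
```

Without the zeroing step, each iteration of the training loop would update the weights with the sum of all gradients seen so far.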


### 2.2 PyTorch: Defining New Autograd Function

The ***forward function*** computes output Tensors from input Tensors. The ***backward*** function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In [2]:
class MyReLU(torch.autograd.Function):
    """Define custom autograd Function by subclassing torch.autograd.Function and implementing the forward and backward passes which operate on Tensors"""
    
    @staticmethod
    def forward(ctx, input):
        """ctx: Context object that can be used to stash information for backward. Cache arbitrary objects for backward using ctx.save_for_backward"""
        ctx.save_for_backward(input)
        return input.clamp(min=0)  # Implements ReLU
    
    @staticmethod
    def backward(ctx, grad_output):
        """grad_output: Tensor containing the gradient of the loss with respect to output"""
        input, = ctx.saved_tensors  # Retrieve the Tensors stashed by ctx.save_for_backward in forward
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [4]:
dtype = torch.float
device = torch.device('cpu')
N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dimension, hidden dimension, output dimension
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(10):
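    relu = MyReLU.apply               # To apply a custom Function, we use the Function.apply method and alias it as 'relu'. Question: could this be moved outside the for loop? (Yes: the alias does not depend on anything computed inside the loop.)
    y_pred = relu(x.mm(w1)).mm(w2)    # Forward pass; replaces the earlier x.mm(w1).clamp(min=0).mm(w2)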
    loss = (y_pred - y).pow(2).sum()  # Compute loss
    print(loss.item())
    loss.backward()                   # Backward pass
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

29840608.0
25225794.0
23254584.0
20728966.0
17023182.0
12410040.0
8325976.0
5259737.5
3318946.75
2152646.75
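
A custom `backward` is easy to get wrong, so it can help to compare it against finite differences with `torch.autograd.gradcheck`. The sketch below repeats the `MyReLU` definition so that it runs on its own. The test input, `eps`, and `atol` are illustrative choices: `gradcheck` needs double precision, and the inputs are kept away from 0, where ReLU is not differentiable.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# Double-precision input with no element near the kink at 0 (illustrative values)
x = torch.tensor([-2.0, -0.5, 0.5, 3.0], dtype=torch.double, requires_grad=True)

# Compares the analytical backward against numerical finite differences
ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```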


### 2.3 TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are **static**, while PyTorch uses **dynamic** computational graphs.

In TensorFlow (the 1.x graph-mode API used in this section), we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph.

In PyTorch, each forward pass defines a new computational graph.

> Liu Yao: In TensorFlow, the weight update is part of the computational graph, which is static. In PyTorch, it happens **outside the computational graph, which is dynamic**.

In [6]:
import tensorflow as tf
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x = tf.placeholder(tf.float32, shape=(None, D_in))  # For input; filled with real data when the graph is executed
y = tf.placeholder(tf.float32, shape=(None, D_out)) # For target
w1 = tf.Variable(tf.random_normal((D_in, H)))       # For weights, initialized with random data
w2 = tf.Variable(tf.random_normal((H, D_out)))      # Same as above

# Forward pass. Note that this code doesn't actually perform any numeric operations; it merely sets up the computational graph
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss and its gradient with respect to w1 and w2
loss = tf.reduce_sum((y - y_pred) ** 2.0)
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Key point: Update the weights. Note that in TensorFlow this is part of the computational graph; in PyTorch it happens outside the computational graph!
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now that we have built the computational graph, we enter a TensorFlow session to actually execute it
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # Run the graph once to initialize the Variables w1 and w2
    
    # Create numpy arrays holding the actual data for inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(10):
        # Execute the graph many times. Each time it executes, we bind x_value to x and y_value to y, and compute the values for loss, new_w1 and new_w2
        loss_value, _, _ = sess.run([loss, new_w1, new_w2], feed_dict={x: x_value, y: y_value})
        print(loss_value)

31212168.0
24741888.0
20985892.0
17999220.0
16083461.0
13378464.0
10198802.0
7229977.5
4866626.5
3210210.0
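
For contrast with the static graph above, the following sketch illustrates what "each forward pass defines a new computational graph" means in PyTorch. The helper `power_by_loop` is a hypothetical function written for this example: an ordinary Python loop decides how many multiply nodes the graph contains on each call.

```python
import torch

def power_by_loop(x, n):
    """Compute x**n with a plain Python loop; the recorded graph has n multiply nodes."""
    out = torch.ones_like(x)
    for _ in range(n):
        out = out * x
    return out

x = torch.tensor(2.0, requires_grad=True)
grads = []
for n in (1, 2, 3):
    y = power_by_loop(x, n)      # a fresh graph is recorded on every call
    y.backward()
    grads.append(x.grad.item())  # d(x^n)/dx = n * x^(n-1)
    x.grad.zero_()

print(grads)
```

The same Python function yields a differently shaped graph for each value of `n`, and autograd differentiates whichever graph was actually recorded.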


## 3. nn Module

### 3.1 PyTorch: nn

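The ***nn*** package defines a set of **Modules**, which are roughly equivalent to neural network layers.

> Liu Yao: A Module is composed of layers, yet it can itself be treated and used like a layer.

> Liu Yao: PyTorch can also **use high-level APIs such as Sequential to build a model quickly**, much like Keras. Who said it couldn't? For custom models, see 3.3.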

In [7]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [9]:
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)
loss_fn = torch.nn.MSELoss(reduction='sum')

In [11]:
learning_rate = 1e-4
for t in range(10):
    # Forward pass: Module objects override the __call__ operator, so we can call them like functions
    y_pred = model(x)
    
    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero the gradients before running the backward pass
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable parameters of model
    loss.backward()
    
    # Update the weights: Each parameter is a Tensor, so we can access its gradient using Tensor.grad
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 2.821704626083374
1 2.7032277584075928
2 2.5902411937713623
3 2.4824397563934326
4 2.379688024520874
5 2.281726121902466
6 2.188166856765747
7 2.098848581314087
8 2.013753652572632
9 1.9325002431869507
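
The loop over `model.parameters()` above touches every learnable Tensor in the `Sequential` container. As a small sketch, `named_parameters()` shows which Tensors those are for the same architecture. The names `shapes` and `total` are introduced here only for illustration.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Sequential names its children by position; ReLU (index 1) has no parameters
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

total = sum(p.numel() for p in model.parameters())
print(total)
```

Each `Linear` layer contributes a weight of shape `(out_features, in_features)` and a bias of shape `(out_features,)`.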


### 3.2 PyTorch: optim

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with *torch.no_grad()* or *.data* to avoid tracking history in autograd).

This is not a huge burden for simple optimization algorithms like **stochastic gradient descent**, but in practice we often train neural networks **using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc**.

> Liu Yao: Use an optimizer to replace the block of code previously wrapped in with torch.no_grad().

In [14]:
# N, D_in, H, D_out, x, y, model, loss_fn are the same as in 3.1

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # The first argument tells the optimizer which Tensors to update
for t in range(10):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    model.zero_grad()
    loss.backward()
    optimizer.step()  # Calling the step function on an Optimizer makes an update to its parameters

0 1.416172381141223e-05
1 0.05516074225306511
2 0.03499681130051613
3 0.025090083479881287
4 0.018877189606428146
5 0.018316902220249176
6 0.018190057948231697
7 0.015330985188484192
8 0.01237315870821476
9 0.01067157182842493
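
To see that an optimizer really does stand in for the manual update, the sketch below takes one plain gradient-descent step by hand and one with `torch.optim.SGD` (used here instead of Adam because it applies the same update rule as the manual code) and compares the resulting parameters. The small layer sizes, seed, and learning rate are assumptions for the example.

```python
import copy
import torch

torch.manual_seed(0)
x = torch.randn(16, 8)
y = torch.randn(16, 3)
model_manual = torch.nn.Linear(8, 3)
model_optim = copy.deepcopy(model_manual)  # identical starting weights
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-3

# Manual update, in the style of 3.1
loss_fn(model_manual(x), y).backward()
with torch.no_grad():
    for param in model_manual.parameters():
        param -= learning_rate * param.grad

# The same step through torch.optim.SGD
optimizer = torch.optim.SGD(model_optim.parameters(), lr=learning_rate)
optimizer.zero_grad()
loss_fn(model_optim(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(p, q)
    for p, q in zip(model_manual.parameters(), model_optim.parameters())
)
print(same)
```

Swapping `SGD` for `Adam` or `RMSprop` changes only the constructor line; the training loop itself stays the same.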


### 3.3 PyTorch: Custom Modules

In [15]:
class TwoLayerNet(torch.nn.Module):
    
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """We can use Modules defined in __init__() as well as arbitrary operators on Tensors"""
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

In [16]:
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
model = TwoLayerNet(D_in, H, D_out)
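
The notes stop after constructing the model. A custom Module trains exactly like the `Sequential` model in 3.1 and 3.2, so a training loop might look like the sketch below. The choice of `torch.optim.SGD`, the learning rate, the seed, and the number of iterations are assumptions, not taken from the original notebook. The class is repeated so the example runs on its own.

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

torch.manual_seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
model = TwoLayerNet(D_in, H, D_out)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # optimizer and lr are assumptions

losses = []
for t in range(20):
    y_pred = model(x)          # calls TwoLayerNet.forward
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(losses[0], losses[-1])
```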

### 3.4 PyTorch: Control Flow + Weight Sharing

We implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, **reusing the same weights multiple times to compute the innermost hidden layers**.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by **simply reusing the same Module multiple times when defining the forward pass**.

In [17]:
import random
import torch

In [18]:
class DynamicNet(torch.nn.Module):
    
    def __init__(self, D_in, H, D_out):
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        For the forward pass, we reuse the middle_linear Module many times to compute hidden layer representations. 
        It's perfectly safe to reuse the same Module many times when defining a computational graph. This is a big improvement over Lua Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):  # 0 to 3 middle layers, so 1 to 4 hidden layers in total
            h_relu = self.middle_linear(h_relu).clamp(min=0)  # Reuse the same Module instance (self.middle_linear) several times to implement weight sharing
        y_pred = self.output_linear(h_relu)
        return y_pred
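
What weight sharing means for the gradients can be checked on a toy case. The sketch below is illustrative only, with made-up sizes and names: it applies one `Linear` Module three times, then applies three independent copies with the same initial values once each. The gradient of the shared weight should equal the sum of the gradients of the three copies.

```python
import copy
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)
# Three independent layers with the same initial values (copied before any backward pass)
copies = [copy.deepcopy(shared) for _ in range(3)]
x = torch.randn(2, 4)

# One Module applied three times: weight sharing
h = x
for _ in range(3):
    h = shared(h).clamp(min=0)
h.sum().backward()

# Three separate Modules applied once each
h = x
for layer in copies:
    h = layer(h).clamp(min=0)
h.sum().backward()

summed = sum(layer.weight.grad for layer in copies)
num_shared_params = sum(p.numel() for p in shared.parameters())
grads_match = torch.allclose(shared.weight.grad, summed, atol=1e-6)

print(num_shared_params, grads_match)
```

The shared Module keeps a single set of parameters no matter how many times it is applied, and autograd accumulates the contribution of every use into the same `.grad`.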