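### Assignment goal: differentiation and backpropagation with PyTorch
In this assignment we implement differentiation and backpropagation by hand, and then do the same with PyTorch's Autograd.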

### Implementing differentiation and backpropagation with PyTorch

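Here we implement a simple two-layer neural network for a regression problem. The loss function is the L2 loss (squared error, which the code sums over every element of the batch):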

$$
L2\_loss = \sum (y_{pred}-y)^2
$$
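
As a quick illustration of this loss, the sketch below evaluates the summed squared error on a tiny hand-made example and compares it with `torch.nn.functional.mse_loss` using `reduction='sum'`. The tensor values and the names `manual` and `builtin` are made up for this example and are not part of the assignment.

```python
import torch
import torch.nn.functional as F

# Tiny made-up predictions and targets (2 samples, 2 outputs each)
y_pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[0.0, 2.0], [1.0, 1.0]])

# Summed squared error written out by hand: 1 + 0 + 4 + 9
manual = ((y_pred - y) ** 2).sum()

# The same quantity through the built-in loss with sum reduction
builtin = F.mse_loss(y_pred, y, reduction='sum')

print(manual.item(), builtin.item())
```

Both values should agree, since `reduction='sum'` adds up the element-wise squared differences instead of averaging them.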

The two-layer neural network is defined as
$$
y_{pred} = \mathrm{ReLU}(XW_1)W_2
$$
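
For reference, the gradients needed for backpropagation through this network follow from the chain rule. The intermediate symbols $Z$, $A$, and $R$ are notation introduced here for illustration only. The loss $L$ is assumed to be the squared error summed over all elements, and $\odot$ denotes the element-wise product.

$$
\begin{aligned}
Z &= XW_1, \quad A = \mathrm{ReLU}(Z), \quad y_{pred} = AW_2, \quad R = y_{pred} - y \\
L &= \sum_{i,j} R_{ij}^2 \\
\frac{\partial L}{\partial W_2} &= 2A^{\top}R \\
\frac{\partial L}{\partial W_1} &= 2X^{\top}\left[(RW_2^{\top}) \odot \mathbb{1}(Z > 0)\right]
\end{aligned}
$$

The indicator $\mathbb{1}(Z > 0)$ is the derivative of ReLU: it passes the gradient through where the pre-activation is positive and blocks it elsewhere.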

In [1]:
import torch
device = torch.device('cpu')

In [5]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate x and y
x = torch.randn((N, D_in))
y = torch.randn((N, D_out))

# Initialize the weights W1 and W2
W1 = torch.randn((D_in, H), requires_grad=True)
W2 = torch.randn((H, D_out), requires_grad=True)

# Set the learning rate
learning_rate = 1e-6

# Train for 500 iterations
for t in range(500):
  # Forward pass: compute y_pred
  y_hat = torch.relu(x.mm(W1)).mm(W2)

  # Compute the loss
  residual = y_hat - y
  loss = (residual ** 2).sum()
  print(t, loss.item())

  # Backward pass: compute the gradients of the loss with respect to W1 and W2
  # (both are stored transposed; the factor of 2 from differentiating the
  # square is applied in the update step)
  W1_grad = (residual.mm(W2.T) * torch.where(x.mm(W1)>0, torch.tensor(1.), torch.tensor(0.))).T.mm(x)
  W2_grad = residual.T.mm(torch.relu(x.mm(W1)))

  # Parameter update
  W1.data -= 2 * learning_rate * W1_grad.T
  W2.data -= 2 * learning_rate * W2_grad.T

0 30224682.0
1 24212614.0
2 23248890.0
3 23289546.0
4 22149226.0
5 18655318.0
6 13759242.0
7 8965520.0
8 5448845.5
9 3248760.5
10 2005960.625
11 1320261.0
12 936942.5625
13 710701.625
14 567359.5
15 469186.90625
16 397059.0625
17 341055.65625
18 295863.25
19 258465.125
20 226983.40625
21 200209.828125
22 177196.296875
23 157339.375
24 140079.640625
25 124991.7890625
26 111766.953125
27 100137.125
28 89885.703125
29 80821.09375
30 72781.9609375
31 65644.9453125
32 59291.875
33 53624.2109375
34 48557.7890625
35 44020.9375
36 39958.5625
37 36312.6484375
38 33032.8828125
39 30078.859375
40 27415.087890625
41 25011.57421875
42 22839.70703125
43 20877.279296875
44 19099.154296875
45 17485.068359375
46 16020.537109375
47 14688.923828125
48 13477.5673828125
49 12374.8017578125
50 11369.9345703125
51 10454.09375
52 9617.720703125
53 8853.37109375
54 8155.1513671875
55 7516.9599609375
56 6931.3779296875
57 6394.7060546875
58 5902.61865234375
59 5451.2470703125
60 5036.81787109375
61 4655.9970703
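
One way to gain confidence in hand-derived gradients is to compare them with Autograd on a small problem. The sketch below assumes the same two-layer network and summed squared error. The tiny sizes, the fixed seed, the `float64` dtype, and the names `z`, `a`, `W1_manual`, and `W2_manual` are choices made for this illustration only.

```python
import torch

torch.manual_seed(0)

# Small, arbitrary sizes chosen only for this check
N, D_in, H, D_out = 4, 6, 5, 3
x = torch.randn(N, D_in, dtype=torch.float64)
y = torch.randn(N, D_out, dtype=torch.float64)
W1 = torch.randn(D_in, H, dtype=torch.float64, requires_grad=True)
W2 = torch.randn(H, D_out, dtype=torch.float64, requires_grad=True)

# Forward pass, keeping the intermediate values
z = x.mm(W1)            # pre-activation
a = torch.relu(z)       # hidden activation
y_hat = a.mm(W2)
residual = y_hat - y
loss = (residual ** 2).sum()

# Gradients computed by Autograd
loss.backward()

# Gradients computed by hand with the chain rule
with torch.no_grad():
    W2_manual = 2 * a.T.mm(residual)
    W1_manual = 2 * x.T.mm(residual.mm(W2.T) * (z > 0).to(z.dtype))

# Compare the two results up to floating-point tolerance
w1_match = torch.allclose(W1_manual, W1.grad)
w2_match = torch.allclose(W2_manual, W2.grad)
print(w1_match, w2_match)
```

If both comparisons hold, the manual formulas agree with Autograd up to floating-point tolerance.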

### Using PyTorch's Autograd

In [6]:
import torch
device = torch.device('cpu')

In [7]:
# N: batch size
# D_in: input dimension
# H: hidden dimension
# D_out: output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate x and y
x = torch.randn((N, D_in))
y = torch.randn((N, D_out))

# Initialize the weights W1 and W2
W1 = torch.randn((D_in, H), requires_grad=True)
W2 = torch.randn((H, D_out), requires_grad=True)

# Set the learning rate
learning_rate = 1e-6

# Train for 500 iterations
for t in range(500):
  # Forward pass: compute y_pred
  y_hat = torch.relu(x.mm(W1)).mm(W2)

  # Compute the loss
  loss = ((y_hat - y) ** 2).sum()
  print(t, loss.item())

  # Backward pass: compute the gradients of the loss with respect to W1 and W2
  loss.backward()

  # Parameter update: the update itself should not be tracked by Autograd,
  # so it is wrapped in torch.no_grad()
  with torch.no_grad():
    W1 -= learning_rate * W1.grad
    W2 -= learning_rate * W2.grad

  # Clear the stored gradients (the parameters have already been updated)
  W1.grad.zero_()
  W2.grad.zero_()

0 28344930.0
1 20894568.0
2 17332850.0
3 14810688.0
4 12397750.0
5 9835302.0
6 7420828.5
7 5340980.0
8 3762893.5
9 2626426.0
10 1854723.75
11 1334945.5
12 987213.0
13 750407.125
14 586115.3125
15 468853.9375
16 382713.59375
17 317597.3125
18 267193.28125
19 227140.8125
20 194671.625
21 167936.984375
22 145655.15625
23 126926.21875
24 111046.3984375
25 97524.140625
26 85946.4140625
27 75964.078125
28 67317.1953125
29 59796.02734375
30 53231.3984375
31 47487.9609375
32 42453.6640625
33 38029.203125
34 34129.4609375
35 30681.94921875
36 27628.17578125
37 24919.439453125
38 22514.955078125
39 20373.333984375
40 18461.2265625
41 16752.513671875
42 15221.5517578125
43 13847.46875
44 12616.1826171875
45 11509.017578125
46 10510.6669921875
47 9609.2705078125
48 8794.609375
49 8056.419921875
50 7387.53515625
51 6780.26123046875
52 6228.923828125
53 5727.125
54 5270.169921875
55 4853.4345703125
56 4474.0849609375
57 4127.28173828125
58 3810.055419921875
59 3519.78271484375
60 3253.825439453125
6
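
The reason the loop clears the gradients after every update is that `backward()` adds to the existing `.grad` instead of overwriting it. The minimal sketch below shows this on a two-element tensor. The tensor `w` and the names `first`, `accumulated`, and `fresh` are hypothetical and used only for this demonstration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Zero the gradient, then run backward again: only the new gradient remains
w.grad.zero_()
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Without the `zero_()` calls, each training step would apply the sum of all gradients computed so far, not just the gradient of the current step.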