Implementing a Two-Layer Fully Connected Neural Network
--------------

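A fully connected ReLU network with one hidden layer and no bias, trained to predict y from x using an L2 loss. With x stored row-wise (one sample per row), as in the code throughout:
- $h = x W_1$
- $h_{relu} = \max(0, h)$
- $y_{pred} = h_{relu} W_2$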

### Approach 1:

## Implementing the two-layer network with NumPy

This implementation uses only NumPy to compute the forward pass, the loss, and the backward pass.
- forward pass
- loss
- backward pass

A NumPy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is simply a data structure for numerical computation.
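
Because NumPy does not track gradients, the backward pass in the code below has to be derived by hand. As a sketch, writing the loss as $L = \sum (y_{pred} - y)^2$ and using the row-wise shapes from the code (x is N × D_in, so $h = x W_1$), the chain rule gives:

$$
\begin{aligned}
\frac{\partial L}{\partial y_{pred}} &= 2\,(y_{pred} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{\top}\,\frac{\partial L}{\partial y_{pred}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial y_{pred}}\,W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbf{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top}\,\frac{\partial L}{\partial h}
\end{aligned}
$$

Here $\odot$ is the element-wise product and $\mathbf{1}[h \ge 0]$ is the 0/1 mask that the code applies with `grad_h[h<0] = 0`. ReLU is not differentiable at exactly $h = 0$, so letting the gradient pass through there is a convention rather than a derived result.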


In [1]:
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    h = x.dot(w1) # N * H
    h_relu = np.maximum(h, 0) # N * H
    y_pred = h_relu.dot(w2) # N * D_out
    
    # compute loss
    loss = np.square(y_pred - y).sum()
    if it % 50 == 0:
        print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h<0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32760332.78537062
50 18453.703819797192
100 912.7155486709695
150 73.30407075891168
200 7.2165139702279895
250 0.8062327538551376
300 0.0981733647629658
350 0.012704385438356888
400 0.0017174817983649658
450 0.00023961750375020837
500 3.419821523577262e-05
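
One way to sanity-check a hand-written backward pass like the one above is to compare it against finite differences. The following is a minimal sketch, not part of the original notebook: the small layer sizes, the fixed seed, and the helper names `loss_fn`, `analytic_grads`, and `numeric_grad` are all assumptions chosen to keep the check fast and reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # deliberately tiny (assumed sizes)

x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    # Same forward pass and L2 loss as the training loop above
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def analytic_grads(w1, w2):
    # Same hand-derived backward pass as the training loop above
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    # Central differences: perturb one entry of w at a time, in place
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

err_w1 = np.abs(grad_w1 - num_w1).max()
err_w2 = np.abs(grad_w2 - num_w2).max()
print(err_w1, err_w2)  # both should be close to zero
```

The check can fail spuriously if some entry of `h` lies within roughly `eps` of zero, because the perturbation then crosses the ReLU kink; with random data this is unlikely but not impossible.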


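### Approach 2:

## PyTorch: implementing the two-layer network with Tensors

Here PyTorch Tensors are used to compute the forward pass, the loss, and the backward pass.

A PyTorch Tensor is very similar to a NumPy ndarray. The biggest difference is that a PyTorch Tensor can run on either the CPU or a GPU. To compute on a GPU, the Tensor has to be moved to a CUDA device.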
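
As a small illustration of device placement (a sketch, not part of the original notebook), the snippet below falls back to the CPU when no GPU is visible, so it should run on either kind of machine. The shapes simply mirror the ones used in this notebook.

```python
import torch

# Use a GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(64, 1000, device=device)  # created directly on the device
w1 = torch.randn(1000, 100).to(device)    # created on the CPU, then moved

h = x.mm(w1)  # both operands must live on the same device
print(h.shape, h.device.type)
```

Mixing a CPU tensor and a CUDA tensor in one operation raises an error, which is why every tensor in the training loop would need to be created on, or moved to, the same device.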

In [2]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    h = x.mm(w1) # N * H
    h_relu = h.clamp(min=0) # N * H
    y_pred = h_relu.mm(w2) # N * D_out
    
    # compute loss
    loss = (y_pred - y).pow(2).sum().item()
    if it % 50 == 0:
        print(it, loss)
    
    # Backward pass
    # compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 38362032.0
50 19347.6875
100 1365.721435546875
150 180.57022094726562
200 29.791967391967773
250 5.396778583526611
300 1.025632619857788
350 0.20053671300411224
400 0.04009748995304108
450 0.008369175717234612
500 0.0020418709609657526


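### Approach 3:

## PyTorch: implementing the two-layer network with Tensors and autograd

One of PyTorch's key features is autograd: once the forward pass is defined and the loss is computed, PyTorch can differentiate automatically to obtain the gradients of all model parameters.

A PyTorch Tensor represents a node in the computation graph. If ``x`` is a Tensor with ``x.requires_grad=True``, then ``x.grad`` is another Tensor that holds the current gradient of some scalar (usually the loss) with respect to ``x``.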
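
A tiny example may make the role of ``x.grad`` concrete. This is a sketch with made-up values, not part of the original notebook; it also shows that gradients accumulate across `backward()` calls, which is why training loops zero them after each update.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

loss = (x ** 2).sum()    # scalar: 1 + 4 + 9
loss.backward()          # fills x.grad with d(loss)/dx = 2 * x
first = x.grad.clone()
print(first)             # tensor([2., 4., 6.])

# A second backward pass adds to x.grad instead of overwriting it.
loss2 = (x ** 2).sum()
loss2.backward()
accumulated = x.grad.clone()
print(accumulated)       # tensor([ 4.,  8., 12.])

x.grad.zero_()           # reset before the next iteration
```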

In [3]:
import torch
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    loss = (y_pred - y).pow(2).sum() # computation graph
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

0 33255704.0
50 10729.2197265625
100 298.8142395019531
150 14.81286334991455
200 0.9232968688011169
250 0.06418387591838837
300 0.004952050279825926
350 0.0005932210478931665
400 0.0001456607860745862
450 5.7708843087311834e-05
500 3.1205090635921806e-05


### Approach 4:

## PyTorch: implementing the two-layer network with Tensors and optim

In [4]:
import torch
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)

for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    loss = (y_pred - y).pow(2).sum() # computation graph
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
#     with torch.no_grad():
#         w1 -= learning_rate * w1.grad
#         w2 -= learning_rate * w2.grad
#         w1.grad.zero_()
#         w2.grad.zero_()
    optimizer.step()
    optimizer.zero_grad()

0 31978560.0
50 13783.6044921875
100 616.9820556640625
150 49.179222106933594
200 4.8297014236450195
250 0.522691011428833
300 0.05973578989505768
350 0.007287117652595043
400 0.0011213204124942422
450 0.00027184493956156075
500 9.802633576327935e-05
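
The commented-out lines in the loop above suggest that `optimizer.step()` followed by `optimizer.zero_grad()` replaces the manual update. The sketch below checks this for plain SGD (no momentum or weight decay) on a single step; the toy shapes, seed, learning rate, and variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 5)
lr = 1e-2

# Two copies of the same starting weights
w_manual = torch.randn(5, 3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)

# Manual gradient-descent step
loss = x.mm(w_manual).pow(2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad
    w_manual.grad.zero_()

# The same step through torch.optim.SGD
optimizer = torch.optim.SGD([w_optim], lr=lr)
loss = x.mm(w_optim).pow(2).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

same = torch.allclose(w_manual, w_optim)
print(same)  # True
```

The equivalence holds only for vanilla SGD; optimizers such as Adam keep extra state and produce different updates.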


### Approach 5:

## PyTorch: implementing the two-layer network with Tensors and nn.MSELoss

In [5]:
import torch
import torch.nn as nn
N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)
loss_fn = nn.MSELoss(reduction='sum')

for it in range(501):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # compute loss
    # loss = (y_pred - y).pow(2).sum() 
    loss = loss_fn(y_pred, y)
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update weights of w1 and w2
    optimizer.step()
    optimizer.zero_grad()

0 32740072.0
50 17535.12109375
100 841.505126953125
150 64.40705108642578
200 5.932732105255127
250 0.6023210883140564
300 0.06497772783041
350 0.007521670311689377
400 0.0011529145995154977
450 0.0002922675048466772
500 0.0001107300486182794
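
The loop above swaps the hand-written loss for `nn.MSELoss(reduction='sum')`. The following sketch (random inputs and a fixed seed are assumptions) checks that the two agree, and shows how the default `reduction='mean'` differs: it divides the summed squared error by the number of elements.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(64, 10)
y = torch.randn(64, 10)

manual_sum = (y_pred - y).pow(2).sum()
loss_sum = nn.MSELoss(reduction='sum')(y_pred, y)
loss_mean = nn.MSELoss()(y_pred, y)  # default reduction is 'mean'

print(torch.allclose(loss_sum, manual_sum))                # True
print(torch.allclose(loss_mean, manual_sum / y.numel()))   # True
```

Because the mean-reduced loss is smaller by a factor of `y.numel()` (640 for these shapes), its gradients are scaled down by the same factor, so a learning rate tuned for one reduction generally does not transfer to the other.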


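### Approach 6:

## PyTorch: implementing the two-layer network with nn

Use PyTorch's nn package to build the network.
Autograd still builds the computation graph,
so PyTorch computes the gradients for us automatically.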
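
To see what nn manages on our behalf, it can help to inspect the parameters a model registers. This is a sketch rather than the notebook's own code: the layer sizes copy the notebook's `D_in`, `H`, and `D_out`, and the biases are switched off here to match the network described at the top.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1000, 100, bias=False),
    nn.ReLU(),
    nn.Linear(100, 10, bias=False),
)

# nn.Linear stores its weight as (out_features, in_features).
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)  # {'0.weight': (100, 1000), '2.weight': (10, 100)}

y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)  # torch.Size([64, 10])
```

Note that the stored weight is the transpose of the `w1` and `w2` matrices used in the earlier approaches, since `nn.Linear` computes `x @ weight.T`.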


In [6]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True), # w_1 * x + b_1
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=True),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6

for it in range(501):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    
    if it % 50 == 0:
        print(it, loss.item())
    
    # Backward pass
    loss.backward()
    
    # update every model parameter (weights and biases)
    with torch.no_grad():
        for param in model.parameters(): # param (tensor, grad)
            param -= learning_rate * param.grad
#             param.grad.zero_()
            
    model.zero_grad()

0 30020408.0
50 11235.6328125
100 292.935791015625
150 12.274823188781738
200 0.6340152025222778
250 0.036800604313611984
300 0.002526442054659128
350 0.0003348247555550188
400 9.150088590104133e-05
450 3.98082411265932e-05
500 2.1767235011793673e-05


### Approach 7:

## PyTorch: implementing the two-layer network with nn and optim

In [7]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=False), # x @ w_1^T, no bias term
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out, bias=False),
)

torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

# model = model.cuda()

loss_fn = nn.MSELoss(reduction='sum')
# learning_rate = 1e-4
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

learning_rate = 1e-6
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for it in range(501):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    optimizer.zero_grad()


0 28515726.0
50 13081.2568359375
100 541.2086791992188
150 39.39814758300781
200 3.4956207275390625
250 0.33633261919021606
300 0.03371109813451767
350 0.0036836632061749697
400 0.0005850918241776526
450 0.00015891357907094061
500 6.251141167012975e-05


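### Approach 8:

## PyTorch: implementing the two-layer network with a custom nn Module (explicit parameters)

A model can be defined as a class that inherits from nn.Module. Whenever a model is more complex than what Sequential can express, it needs to be written as a custom nn.Module.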
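
As an illustration of a model that a plain chain of layers in `nn.Sequential` cannot express directly, the sketch below adds a skip connection around a second hidden layer. The class name `ResidualNet` and its extra layer are hypothetical and not part of the original notebook.

```python
import torch
import torch.nn as nn

class ResidualNet(nn.Module):  # hypothetical example model
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H, bias=False)
        self.linear2 = nn.Linear(H, H, bias=False)
        self.out = nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        h = self.linear1(x).clamp(min=0)
        # Skip connection: the input of linear2 is added to its output,
        # so the data flow is no longer a simple chain.
        h = h + self.linear2(h).clamp(min=0)
        return self.out(h)

model = ResidualNet(1000, 100, 10)
y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)  # torch.Size([64, 10])
```

Any `nn.Linear` assigned as an attribute in `__init__` is registered automatically, so `model.parameters()` can be passed to an optimizer without extra bookkeeping.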

In [8]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
        self.W1 = nn.Parameter(nn.init.xavier_normal_(torch.empty(D_in, H)))
        self.W2 = nn.Parameter(nn.init.xavier_normal_(torch.empty(H, D_out)))
    
    def forward(self, x):
        y_pred = x.mm(self.W1).clamp(min=0).mm(self.W2)
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
# loss_fn = nn.MSELoss(reduction='sum')
loss_fn = nn.MSELoss()
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    
    optimizer.zero_grad()

0 3.1352224349975586
50 0.38237589597702026
100 0.05351356416940689
150 0.015370173379778862
200 0.008596157655119896
250 0.005803755018860102
300 0.003925961442291737
350 0.0024989633820950985
400 0.0014968174509704113
450 0.0008101157727651298


### Approach 9:

## PyTorch: implementing the two-layer network with a custom nn Module (implicit parameters)

In [9]:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate some training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # define the model architecture
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)
    
    def forward(self, x):
        y_pred = self.linear2(self.linear1(x).clamp(min=0))
        return y_pred

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for it in range(500):
    # Forward pass
    y_pred = model(x) # model.forward() 
    
    # compute loss
    loss = loss_fn(y_pred, y) # computation graph
    if it % 50 == 0:
        print(it, loss.item())

    # Backward pass
    loss.backward()
    
    # update model parameters
    optimizer.step()
    
    optimizer.zero_grad()

0 604.010009765625
50 160.0804901123047
100 33.51659393310547
150 3.980309247970581
200 0.2226010411977768
250 0.006924773100763559
300 0.0001677029358688742
350 4.555347913992591e-06
400 1.1379442099723747e-07
450 2.3033137619421495e-09


In [10]:
# Kan Horst