In [17]:
import numpy as np

Given a network $\vec{X} \rightarrow \vec{H} \rightarrow \vec{Y}$ with 3 -> 4 -> 2 nodes,

the hidden layer is computed as:

$$h=\sigma(W_{1}x + b_{1})$$

where $\sigma$ is the activation function (ReLU in this example). This maps the input from $3 \times 1$ to $4 \times 1$, so the bias $\vec{b}_1$ is also a $4 \times 1$ vector.


1. Hidden Layer matrix $\vec{H}$ (4 $\times$ 1)
$$
\vec{H} = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix}
$$

2. Input Layer $\vec{X}$ (3 $\times$ 1)
$$
\vec{X} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
$$

3. The weight matrix $W_{1}$ is therefore $4 \times 3$
$$
W_1 = \begin{bmatrix} 
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33} \\
w_{41} & w_{42} & w_{43}
\end{bmatrix}
$$

In [18]:
# true label y
y_true = np.array([0.5, 0.8]).reshape(2, 1)

# 3 nodes input layer
X = np.array([1, 2, 3]).reshape(3, 1)

# Hidden Layer parameter
W_1 = np.array([
    [0.1, 0.2, 0.3],   # weights for h1
    [0.4, 0.5, 0.6],   # weights for h2
    [0.7, 0.8, 0.9],   # weights for h3
    [1.0, 1.1, 1.2]    # weights for h4
])
b_1 = np.array([0.1, 0.2, 0.3, 0.4]).reshape(4, 1)

# Output Layer parameter
W_2 = np.array([
    [0.2, 0.4, 0.7, 0.3],   # weights for o1
    [0.3, 0.5, 0.2, 0.9],   # weights for o2
])
b_2 = np.array([0.4, 0.6]).reshape(2, 1)

# ----------------- Forward Propagation -----------------
# Hidden Layer Calculation
z_1 = W_1.dot(X) + b_1      # Linear Transformation
H_1 = np.maximum(0, z_1)      # ReLU activation

# Output layer Calculation
z_2 = W_2.dot(H_1) + b_2
y_pred = np.maximum(0, z_2)

print("predicted result:")
print(y_pred)


predicted result:
[[ 7.93]
 [10.29]]


Step 1: Calculation of $\frac{\partial L}{\partial W_2}$ and $\frac{\partial L}{\partial b_2}$ for the Output Layer

After forward propagation produces the output, backward propagation computes the gradients that are used to update the parameters.

The gradient is the vector of a function's partial derivatives with respect to each parameter; at any point it points in the direction in which the function value increases fastest.

In a neural network, the gradients of the loss function $L$, $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$, tell us how to adjust the weights $W$ and biases $b$ to minimize the loss: each parameter is moved in the direction opposite to its gradient.
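
To make the idea of stepping against the gradient concrete, here is a minimal sketch on a toy one-dimensional function. The function `f`, the starting point, and the learning rate `lr` are assumptions chosen for illustration and are not part of the network in this notebook.

```python
# Toy 1-D "loss" (hypothetical, not the network's loss): f(w) = (w - 3)^2
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    # Analytic derivative of f with respect to w
    return 2.0 * (w - 3.0)

w = 0.0      # arbitrary starting point (assumption)
lr = 0.1     # assumed learning rate
history = [f(w)]
for _ in range(20):
    w = w - lr * grad_f(w)   # move opposite to the gradient
    history.append(f(w))

print("final w:", w)
print("final loss:", history[-1])
```

Because the gradient points toward the fastest increase of `f`, subtracting a small multiple of it lowers `f` at every step here, and `w` moves toward the minimizer at 3.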

In [19]:
# Squared-error (MSE-style) loss: half the sum of squared errors
loss = 0.5 * np.sum((y_pred - y_true) ** 2)
print("forward propagation loss:" + str(loss))

forward propagation loss:72.6325


In this example, the loss function is an MSE-style squared error (half the sum of squared errors)

$$
L = \frac{1}{2} \sum_{i} (y_{pred,i} - y_{true,i})^2
$$

which has

$$
\frac{\partial{L}}{\partial{y_{pred}}} = y_{pred} - y_{true}
$$
so the gradient grows in magnitude as $y_{pred}$ deviates further from $y_{true}$, and the parameters need a larger adjustment.

Since the output layer uses ReLU, we also have:
$$
\frac{\partial y_{\text{pred}}}{\partial z_2} =
\begin{cases} 
1 & \text{if } z_2 > 0, \\
0 & \text{else}.
\end{cases}
$$

By the chain rule, we want:
$$\frac{\partial{L}}{\partial{z_2}} = \frac{\partial{L}}{\partial{y_{pred}}} \times \frac{\partial y_{pred}}{\partial {z_2}}$$



In [20]:
dL_dypred = y_pred - y_true
print("Gradient of Loss with respect to y_pred:")
print(dL_dypred)

relu_deriv = (z_2 > 0).astype(float)
print("ReLU derivative matrix:")
print(relu_deriv)

dL_dz2 = dL_dypred * relu_deriv
print("Gradient of Loss with respect to z_2:")
print(dL_dz2)

Gradient of Loss with respect to y_pred:
[[7.43]
 [9.49]]
ReLU derivative matrix:
[[1.]
 [1.]]
Gradient of Loss with respect to z_2:
[[7.43]
 [9.49]]


Next, the gradient of $L$ with respect to $W_{2}$:
$$
\frac{\partial L}{\partial W_{2}} = \frac{\partial L}{\partial z_{2}} \times H_{1}^T
$$

where $H_{1}^T$ is the transpose of the hidden layer output $H_{1}$. Since $\frac{\partial L}{\partial z_{2}}$ is $2 \times 1$ and $H_{1}^T$ is $1 \times 4$, the gradient $\frac{\partial L}{\partial W_{2}}$ is $2 \times 4$, the same shape as $W_{2}$.
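
To see where this product with $H_{1}^T$ comes from, it may help to write the gradient entry by entry. The index notation $w^{(2)}_{ij}$, $z_{2,i}$, $b_{2,i}$ and $h_j$ is introduced only for this sketch:

$$
z_{2,i} = \sum_{j=1}^{4} w^{(2)}_{ij} h_j + b_{2,i}
\quad\Rightarrow\quad
\frac{\partial L}{\partial w^{(2)}_{ij}} = \frac{\partial L}{\partial z_{2,i}} \, h_j
$$

Collecting all $2 \times 4$ entries into a matrix gives exactly the product of the $2 \times 1$ column $\frac{\partial L}{\partial z_2}$ with the $1 \times 4$ row $H_1^T$. With the values in this example, $h_1 = 1.5$ and $\frac{\partial L}{\partial z_{2,1}} = 7.43$, so the top-left entry is $7.43 \times 1.5 = 11.145$.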

In [21]:
dL_dw2 = dL_dz2.dot(H_1.T)
print("Gradient of Loss with respect to W2:")
print(dL_dw2)

Gradient of Loss with respect to W2:
[[11.145 25.262 39.379 53.496]
 [14.235 32.266 50.297 68.328]]


We also want 
$$
\frac{\partial L}{\partial b_{2}} = \frac{\partial L}{\partial z_{2}} \times \frac{\partial z_2}{\partial b_{2}}
$$

Since

$$
z_{2} = W_2 H_1 + b_2
$$

we have

$$
\frac{\partial z_2}{\partial b_{2}} = 1
$$

which gives

$$
\frac{\partial L}{\partial b_{2}} = \frac{\partial L}{\partial z_{2}}
$$

In [22]:
dl_db2 = dL_dz2
print("Gradient of Loss with respect to b2:")
print(dl_db2)

Gradient of Loss with respect to b2:
[[7.43]
 [9.49]]


Step 2: Calculation of $\frac{\partial L}{\partial W_{1}}$ and $\frac{\partial L}{\partial b_{1}}$

$$
\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial H_{1}} \times \frac{\partial H_{1}}{\partial z_1} \times \frac{\partial z_1}{\partial W_1}
$$

where

$$
\frac{\partial L}{\partial H_{1}} = W_{2}^T \times \frac{\partial L}{\partial z_{2}}
$$

$$
\frac{\partial H_1}{\partial z_{1}} = 
\begin{cases} 
1 & \text{if } z_1 > 0, \\
0 & \text{else}.
\end{cases}
$$

$$
\frac{\partial z_1}{\partial W_{1}} = X^T
$$
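
Written out with explicit shapes, the chain above can be sketched as follows. The symbol $\odot$ (element-wise product) and the indicator $\mathbb{1}[\cdot]$ are notation assumed here; the $\times$ in the chain rule is shorthand, because the ReLU factor acts element-wise rather than as a matrix product:

$$
\frac{\partial L}{\partial z_1}
= \underbrace{\left(W_2^T \frac{\partial L}{\partial z_2}\right)}_{(4 \times 2)(2 \times 1) \,=\, 4 \times 1}
\odot \underbrace{\mathbb{1}[z_1 > 0]}_{4 \times 1},
\qquad
\frac{\partial L}{\partial W_1}
= \underbrace{\frac{\partial L}{\partial z_1}}_{4 \times 1} \, \underbrace{X^T}_{1 \times 3},
\qquad
\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}
$$

The result $\frac{\partial L}{\partial W_1}$ is $4 \times 3$, matching the shape of $W_1$, and the bias gradient follows for the same reason as in the output layer, since $\frac{\partial z_1}{\partial b_1} = 1$.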

In [23]:
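dL_dH1 = W_2.T.dot(dL_dz2)                # (4x2)(2x1) -> (4x1)
relu_deriv_z1 = (z_1 > 0).astype(float)   # ReLU derivative at the hidden layer
dL_dz1 = dL_dH1 * relu_deriv_z1           # element-wise product, (4x1)
dL_dW1 = dL_dz1.dot(X.T)                  # (4x1)(1x3) -> (4x3)
dL_db1 = dL_dz1                           # (4x1)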
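
One common way to sanity-check hand-derived gradients like these is to compare them with central finite differences. The block below is a minimal sketch of that check for $W_1$ only. It repeats the notebook's toy values so it runs on its own; the helpers `forward_loss` and `numeric_grad_W1` and the step size `eps` are assumptions introduced for this illustration.

```python
import numpy as np

# Same toy values as the cells above, repeated so this block is self-contained
y_true = np.array([0.5, 0.8]).reshape(2, 1)
X = np.array([1.0, 2.0, 3.0]).reshape(3, 1)
W_1 = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6],
                [0.7, 0.8, 0.9],
                [1.0, 1.1, 1.2]])
b_1 = np.array([0.1, 0.2, 0.3, 0.4]).reshape(4, 1)
W_2 = np.array([[0.2, 0.4, 0.7, 0.3],
                [0.3, 0.5, 0.2, 0.9]])
b_2 = np.array([0.4, 0.6]).reshape(2, 1)

def forward_loss(W_1, b_1, W_2, b_2):
    """Hypothetical helper: forward pass, then 0.5 * sum of squared errors."""
    H_1 = np.maximum(0, W_1.dot(X) + b_1)
    y_pred = np.maximum(0, W_2.dot(H_1) + b_2)
    return 0.5 * np.sum((y_pred - y_true) ** 2)

# Analytic gradient of the loss with respect to W_1 (backpropagation)
z_1 = W_1.dot(X) + b_1
H_1 = np.maximum(0, z_1)
z_2 = W_2.dot(H_1) + b_2
y_pred = np.maximum(0, z_2)
dL_dz2 = (y_pred - y_true) * (z_2 > 0)
dL_dz1 = W_2.T.dot(dL_dz2) * (z_1 > 0)
dL_dW1 = dL_dz1.dot(X.T)

def numeric_grad_W1(eps=1e-6):
    """Central finite differences, perturbing one entry of W_1 at a time."""
    grad = np.zeros_like(W_1)
    for i in range(W_1.shape[0]):
        for j in range(W_1.shape[1]):
            W_plus, W_minus = W_1.copy(), W_1.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (forward_loss(W_plus, b_1, W_2, b_2)
                          - forward_loss(W_minus, b_1, W_2, b_2)) / (2 * eps)
    return grad

num_dW1 = numeric_grad_W1()
print("analytic dL/dW1 shape:", dL_dW1.shape)
print("max abs difference:", np.max(np.abs(num_dW1 - dL_dW1)))
```

If the derivation is right, the maximum difference should be tiny compared with the gradient values themselves. The same pattern could be applied to `W_2`, `b_1` and `b_2` by perturbing those arrays instead.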