# Manual differentiation - Worksheet 5

Given system:

$$
x_{n+1} = 
\begin{bmatrix}
x_{n+1,1} \\
x_{n+1,2} \\
x_{n+1,3} \\
\end{bmatrix}
=
\begin{bmatrix}
x^2_{n,1} + x_{n,2} \\ 
- x_{n,1} + \frac{x_{n,2}}{2} \\
- x^2_{n,2} + x_{n,3} \\
\end{bmatrix}
+
\begin{bmatrix}
x_{n,1}\theta_1 \\
x_{n,2}\theta_2 \\
x_{n,3}\theta_3 \\
\end{bmatrix}
$$
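
As a minimal sketch, the update rule above can be written as a small Python function. The name `step` and the test values are illustrative assumptions and are not part of the worksheet.

```python
def step(x, theta):
    """One update x_{n+1} = f(x_n) + x_n * theta (elementwise product)."""
    x1, x2, x3 = x
    t1, t2, t3 = theta
    return [
        x1**2 + x2 + x1 * t1,       # first component
        -x1 + x2 / 2 + x2 * t2,     # second component
        -x2**2 + x3 + x3 * t3,      # third component
    ]


# Arbitrary illustrative values, not taken from the worksheet.
print(step([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [4.0, 2.0, 2.0]
```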


### (1) One-step control Backprop

Current system state and weights:

$$
x_{0} = (2, -1, 3)^T
$$
$$
\theta = (1, 4, 2)^T
$$

Next system state $x_{1}$:
$$
\mathbf{x}_1 = \begin{bmatrix} x_{0,1}^2 + x_{0,2} \\ -x_{0,1} + \frac{x_{0,2}}{2} \\ -x_{0,2}^2 + x_{0,3} \end{bmatrix} + \begin{bmatrix} x_{0,1}\theta_1 \\ x_{0,2}\theta_2 \\ x_{0,3}\theta_3 \end{bmatrix} = \begin{bmatrix} 2^2 + (-1) + 2\cdot1 \\ -2 - 0.5 + (-1)\cdot4 \\ -(-1)^2 + 3 + 3 \cdot 2 \end{bmatrix} = \begin{bmatrix} 5 \\ -6.5 \\ 8\end{bmatrix}
$$

Loss function L:
$$
L = \|\mathbf{x}_1\|_1 = |x_{1,1}| + |x_{1,2}| + |x_{1,3}| = |5| + |-6.5| + |8| = 5 + 6.5 + 8 = 19.5
$$

Back-propagation gradient (using $\frac{\partial |x|}{\partial x} = \operatorname{sign}(x)$ for $x \neq 0$):
$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \mathbf{x}_{1}} \cdot \frac{\partial \mathbf{x}_{1}}{\partial \mathbf{\theta}}
$$

$$
\frac{\partial L}{\partial \mathbf{x}_1} = \begin{bmatrix} \frac{\partial L}{\partial x_{1,1}} \\ \frac{\partial L}{\partial x_{1,2}} \\ \frac{\partial L}{\partial x_{1,3}} \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}
$$

$$
\frac{\partial \mathbf{x}_1}{\partial \boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial x_{1,1}}{\partial \theta_1} & \frac{\partial x_{1,2}}{\partial \theta_1} & \frac{\partial x_{1,3}}{\partial \theta_1} \\ \frac{\partial x_{1,1}}{\partial \theta_2} & \frac{\partial x_{1,2}}{\partial \theta_2} & \frac{\partial x_{1,3}}{\partial \theta_2} \\ \frac{\partial x_{1,1}}{\partial \theta_3} & \frac{\partial x_{1,2}}{\partial \theta_3} & \frac{\partial x_{1,3}}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} x_{0,1} & 0 & 0 \\ 0 & x_{0,2} & 0 \\ 0 & 0 & x_{0,3} \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 3 \end{bmatrix}
$$

$$
\frac{\partial L}{\partial \theta} = \begin{bmatrix} 1 & -1 & 1 \end{bmatrix} \cdot \begin{bmatrix} 2 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}
$$
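
The one-step gradient can be checked numerically. The sketch below assumes the system and the $L_1$ loss defined above; the helper names (`step`, `l1_loss`, `one_step_grad`, `numeric_grad`) are hypothetical and chosen only for this example. It computes the analytic gradient and compares it with a central finite difference.

```python
def step(x, theta):
    """One update x_{n+1} = f(x_n) + x_n * theta (elementwise product)."""
    x1, x2, x3 = x
    t1, t2, t3 = theta
    return [x1**2 + x2 + x1 * t1, -x1 + x2 / 2 + x2 * t2, -x2**2 + x3 + x3 * t3]


def l1_loss(x):
    return sum(abs(v) for v in x)


def sign(v):
    return (v > 0) - (v < 0)


def one_step_grad(x0, theta):
    """dL/dtheta for L = ||x_1||_1 via the chain rule."""
    x1 = step(x0, theta)
    dL_dx1 = [sign(v) for v in x1]
    # dx1/dtheta is diag(x0), so the product is an elementwise multiply.
    return [g * x for g, x in zip(dL_dx1, x0)]


def numeric_grad(x0, theta, eps=1e-6):
    """Central finite-difference approximation, used only as a sanity check."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((l1_loss(step(x0, up)) - l1_loss(step(x0, down))) / (2 * eps))
    return grads


x0 = [2.0, -1.0, 3.0]
theta = [1.0, 4.0, 2.0]
grad = one_step_grad(x0, theta)
print(grad)  # [2.0, 1.0, 3.0]
```

The finite-difference values from `numeric_grad(x0, theta)` should agree with `grad` up to rounding error, since the loss is piecewise linear in $\theta$ for a single step.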

### (2) Multi-step control Backprop


Current system state and weights:

$$
x_{0} = (-1, 1, 2)^T
$$
$$
\theta = (3, -1, 1)^T
$$

Next system state $x_{1}$:
$$
\mathbf{x}_1 = \begin{bmatrix} x_{0,1}^2 + x_{0,2} \\ -x_{0,1} + \frac{x_{0,2}}{2} \\ -x_{0,2}^2 + x_{0,3} \end{bmatrix} + \begin{bmatrix} x_{0,1}\theta_1 \\ x_{0,2}\theta_2 \\ x_{0,3}\theta_3 \end{bmatrix} = \begin{bmatrix} (-1)^2 + 1 + (-1)\cdot3 \\ -(-1) + 0.5 + 1\cdot(-1) \\ -(1)^2 + 2 + 2 \cdot 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 0.5 \\ 3\end{bmatrix}
$$

Next system state $x_{2}$:
$$
\mathbf{x}_2 = \begin{bmatrix} x_{1,1}^2 + x_{1,2} \\ -x_{1,1} + \frac{x_{1,2}}{2} \\ -x_{1,2}^2 + x_{1,3} \end{bmatrix} + \begin{bmatrix} x_{1,1}\theta_1 \\ x_{1,2}\theta_2 \\ x_{1,3}\theta_3 \end{bmatrix} = \begin{bmatrix} (-1)^2 + 0.5 + (-1)\cdot3 \\ -(-1) + 0.25 + 0.5\cdot(-1) \\ -(0.5)^2 + 3 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} -1.5 \\ 0.75 \\ 5.75\end{bmatrix}
$$

Loss function L:
$$
L = \|\mathbf{x}_2\|_1 = |x_{2,1}| + |x_{2,2}| + |x_{2,3}| = |-1.5| + |0.75| + |5.75| = 1.5 + 0.75 + 5.75 = 8
$$

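Back-propagation gradient. $\theta$ influences $x_2$ along two paths, indirectly through $x_1$ (term 1) and directly in the second step (term 2), so the total derivative is the sum of both: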
$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \theta}^{(1)} + \frac{\partial L}{\partial \theta}^{(2)} = (\frac{\partial L}{\partial \mathbf{x}_2} \cdot \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \cdot \frac{\partial \mathbf{x}_1}{\partial \mathbf{\theta}}) + (\frac{\partial L}{\partial \mathbf{x}_2} \cdot \frac{\partial \mathbf{x}_2}{\partial \mathbf{\theta}})
$$

$$
\frac{\partial L}{\partial \mathbf{x}_2} = \begin{bmatrix} \frac{\partial L}{\partial x_{2,1}} \\ \frac{\partial L}{\partial x_{2,2}} \\ \frac{\partial L}{\partial x_{2,3}} \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}
$$

$$
\frac{\partial x_2}{\partial x_1} = \begin{bmatrix} \frac{\partial x_{2,1}}{\partial x_{1,1}} & \frac{\partial x_{2,1}}{\partial x_{1,2}} & \frac{\partial x_{2,1}}{\partial x_{1,3}} \\ \frac{\partial x_{2,2}}{\partial x_{1,1}} & \frac{\partial x_{2,2}}{\partial x_{1,2}} & \frac{\partial x_{2,2}}{\partial x_{1,3}} \\ \frac{\partial x_{2,3}}{\partial x_{1,1}} & \frac{\partial x_{2,3}}{\partial x_{1,2}} & \frac{\partial x_{2,3}}{\partial x_{1,3}} \end{bmatrix} = \begin{bmatrix} 2x_{1,1} + \theta_{1} & 1 & 0 \\ -1 & 0.5 + \theta_{2} & 0 \\ 0 & -2x_{1,2} & 1 + \theta_{3} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ -1 & -0.5 & 0 \\ 0 & -1 & 2 \end{bmatrix}
$$

$$
\frac{\partial \mathbf{x}_1}{\partial \boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial x_{1,1}}{\partial \theta_1} & \frac{\partial x_{1,2}}{\partial \theta_1} & \frac{\partial x_{1,3}}{\partial \theta_1} \\ \frac{\partial x_{1,1}}{\partial \theta_2} & \frac{\partial x_{1,2}}{\partial \theta_2} & \frac{\partial x_{1,3}}{\partial \theta_2} \\ \frac{\partial x_{1,1}}{\partial \theta_3} & \frac{\partial x_{1,2}}{\partial \theta_3} & \frac{\partial x_{1,3}}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} x_{0,1} & 0 & 0 \\ 0 & x_{0,2} & 0 \\ 0 & 0 & x_{0,3} \end{bmatrix} = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix}
$$

$$
\frac{\partial \mathbf{x}_2}{\partial \boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial x_{2,1}}{\partial \theta_1} & \frac{\partial x_{2,2}}{\partial \theta_1} & \frac{\partial x_{2,3}}{\partial \theta_1} \\ \frac{\partial x_{2,1}}{\partial \theta_2} & \frac{\partial x_{2,2}}{\partial \theta_2} & \frac{\partial x_{2,3}}{\partial \theta_2} \\ \frac{\partial x_{2,1}}{\partial \theta_3} & \frac{\partial x_{2,2}}{\partial \theta_3} & \frac{\partial x_{2,3}}{\partial \theta_3} \end{bmatrix} = \begin{bmatrix} x_{1,1} & 0 & 0 \\ 0 & x_{1,2} & 0 \\ 0 & 0 & x_{1,3} \end{bmatrix} = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 3 \end{bmatrix}
$$

$$
\frac{\partial L}{\partial \theta}^{(1)} = \frac{\partial L}{\partial \mathbf{x}_2} \cdot \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \cdot \frac{\partial \mathbf{x}_1}{\partial \mathbf{\theta}} = \begin{bmatrix} -1 & 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 & 0 \\ -1 & -0.5 & 0 \\ 0 & -1 & 2 \end{bmatrix} \cdot \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix} = \begin{bmatrix} 2 \\ -2.5 \\ 4 \end{bmatrix}
$$

$$
\frac{\partial L}{\partial \theta}^{(2)} = \frac{\partial L}{\partial \mathbf{x}_2} \cdot \frac{\partial \mathbf{x}_2}{\partial \mathbf{\theta}} = \begin{bmatrix} -1 & 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} -1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0.5 \\ 3 \end{bmatrix}
$$

$$
\frac{\partial L}{\partial \theta} = \begin{bmatrix} 2 \\ -2.5 \\ 4 \end{bmatrix} + \begin{bmatrix} 1 \\ 0.5 \\ 3 \end{bmatrix} = \begin{bmatrix} 3 \\ -2 \\ 7 \end{bmatrix}
$$
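
The two-step computation can be reproduced with a short script. This is a sketch under the definitions above; the helper names (`step`, `jac_x`, `vec_mat`, `two_step_grad`) are hypothetical. It sums the indirect path through $x_1$ and the direct path into $x_2$.

```python
def step(x, theta):
    """One update x_{n+1} = f(x_n) + x_n * theta (elementwise product)."""
    x1, x2, x3 = x
    t1, t2, t3 = theta
    return [x1**2 + x2 + x1 * t1, -x1 + x2 / 2 + x2 * t2, -x2**2 + x3 + x3 * t3]


def sign(v):
    return (v > 0) - (v < 0)


def jac_x(x, theta):
    """Jacobian of step with respect to x; rows index the output component."""
    x1, x2, _ = x
    t1, t2, t3 = theta
    return [
        [2 * x1 + t1, 1.0, 0.0],
        [-1.0, 0.5 + t2, 0.0],
        [0.0, -2 * x2, 1.0 + t3],
    ]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def two_step_grad(x0, theta):
    x1 = step(x0, theta)
    x2 = step(x1, theta)
    dL_dx2 = [float(sign(v)) for v in x2]
    # Direct path theta -> x2: dx2/dtheta = diag(x1).
    direct = [g * x for g, x in zip(dL_dx2, x1)]
    # Indirect path theta -> x1 -> x2: dx1/dtheta = diag(x0).
    dL_dx1 = vec_mat(dL_dx2, jac_x(x1, theta))
    indirect = [g * x for g, x in zip(dL_dx1, x0)]
    return [a + b for a, b in zip(indirect, direct)]


x0 = [-1.0, 1.0, 2.0]
theta = [3.0, -1.0, 1.0]
print(two_step_grad(x0, theta))  # [3.0, -2.0, 7.0]
```

Note that the backward pass reuses the single vector $\frac{\partial L}{\partial \mathbf{x}_2}$ for both paths, which is why one backward sweep is enough to obtain all three gradient components.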


### (3) Forward Propagation

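Forward propagation (forward-mode differentiation) traverses the chain rule from inside to outside (input to output), while backward propagation (reverse-mode differentiation) traverses it from outside to inside (output to input). For a function with $n$ inputs and $m$ outputs, forward propagation is more efficient when $n \ll m$ (input dimension much smaller than output dimension), while backward propagation is more efficient when $n \gg m$. The latter is the case of interest in the context of Deep Learning, where a large number of parameters feeds into a single scalar loss.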
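
To make the cost difference concrete, here is a minimal forward-mode sketch for the two-step example from part (2). The function names (`step_with_tangent`, `forward_mode_grad`) are hypothetical. Each pass pushes one tangent direction in $\theta$ from input to output, so recovering the full gradient takes one pass per parameter, whereas the backward computation needs a single sweep.

```python
def sign(v):
    return (v > 0) - (v < 0)


def step_with_tangent(x, dx, theta, dtheta):
    """Return step(x, theta) and its directional derivative along (dx, dtheta)."""
    x1, x2, x3 = x
    d1, d2, d3 = dx
    t1, t2, t3 = theta
    s1, s2, s3 = dtheta
    value = [x1**2 + x2 + x1 * t1, -x1 + x2 / 2 + x2 * t2, -x2**2 + x3 + x3 * t3]
    tangent = [
        2 * x1 * d1 + d2 + d1 * t1 + x1 * s1,
        -d1 + d2 / 2 + d2 * t2 + x2 * s2,
        -2 * x2 * d2 + d3 + d3 * t3 + x3 * s3,
    ]
    return value, tangent


def forward_mode_grad(x0, theta, n_steps=2):
    grad = []
    for i in range(len(theta)):  # one forward pass per parameter
        dtheta = [1.0 if j == i else 0.0 for j in range(len(theta))]
        x, dx = list(x0), [0.0] * len(x0)
        for _ in range(n_steps):
            x, dx = step_with_tangent(x, dx, theta, dtheta)
        # L = ||x||_1, so dL = sum_k sign(x_k) * dx_k
        grad.append(sum(sign(v) * d for v, d in zip(x, dx)))
    return grad


print(forward_mode_grad([-1.0, 1.0, 2.0], [3.0, -1.0, 1.0]))  # [3.0, -2.0, 7.0]
```

With three parameters the extra passes are negligible, but the loop over `i` grows linearly with the number of parameters, which is the reason reverse mode is preferred when the output is a scalar loss.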