# Solution

## Task a) Compute $\hat{y}$

First, we need to perform the convolution. We call the output of the convolutional layer $CN$:

$$\begin{align}CN  & = X \circledast F^C - B^C\\
& = \begin{bmatrix}
0 & 3 & -1 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 3 & 0 \\
3 & 0 & 3 & 1
\end{bmatrix} \circledast
\begin{bmatrix}
1 & 0 \\
-1 & 0 \\
\end{bmatrix} - (-1)\\
& = \begin{bmatrix}
0 & -1 \\
-3 & 0
\end{bmatrix} + 1\\
& = \begin{bmatrix}
1 & 0 \\
-2 & 1
\end{bmatrix}
\end{align}$$
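
The convolution step can be checked numerically. The following is a minimal sketch, assuming NumPy, a stride of 2 with no padding (implied by the $2 \times 2$ output), cross-correlation without flipping the kernel, and a bias value of $B^C = -1$ (the value consistent with the result above). The helper name `conv2d_valid` is hypothetical.

```python
import numpy as np


def conv2d_valid(x, kernel, stride):
    """Cross-correlate x with kernel (no padding, kernel is not flipped)."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x[r:r + kh, c:c + kw]
            out[i, j] = np.sum(patch * kernel)
    return out


X = np.array([[0, 3, -1, 0],
              [0, 0, 0, 0],
              [0, 0, 3, 0],
              [3, 0, 3, 1]], dtype=float)
F_C = np.array([[1, 0],
                [-1, 0]], dtype=float)
B_C = -1.0  # assumption: the bias value consistent with CN = X (*) F^C - B^C

# Stride 2 is an assumption inferred from the 2x2 output shape.
CN = conv2d_valid(X, F_C, stride=2) - B_C
print(CN)
```

Subtracting the scalar bias relies on NumPy broadcasting, which mirrors the matrix-minus-scalar notation used in the equation.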

Now we put this matrix through the self-attention layer. 

First we calculate the query, key and value matrices:
$$
Q = CN * W^Q
= \begin{bmatrix}
1 & 0 \\
-2 & 1 \\
\end{bmatrix} *
\begin{bmatrix}
1 & -2 \\
0 & 2 \\
\end{bmatrix}
= \begin{bmatrix}
1 & -2 \\
-2 & 6 \\
\end{bmatrix}
$$

$$
K = CN * W^K
= \begin{bmatrix}
1 & 0 \\
-2 & 1 \\
\end{bmatrix} *
\begin{bmatrix}
1 & -1 \\
3 & 0 \\
\end{bmatrix}
= \begin{bmatrix}
1 & -1 \\
1 & 2 \\
\end{bmatrix}
$$

$$
V = CN * W^V
= \begin{bmatrix}
1 & 0 \\
-2 & 1 \\
\end{bmatrix} *
\begin{bmatrix}
3 & 1 \\
-1 & 1 \\
\end{bmatrix}
= \begin{bmatrix}
3 & 1 \\
-7 & -1 \\
\end{bmatrix}
$$
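
The three projections above are plain matrix products, so they are easy to verify. A minimal sketch, assuming NumPy and using the matrices exactly as given above:

```python
import numpy as np

# Output of the convolutional layer and the projection weights from the text.
CN = np.array([[1, 0], [-2, 1]], dtype=float)
W_Q = np.array([[1, -2], [0, 2]], dtype=float)
W_K = np.array([[1, -1], [3, 0]], dtype=float)
W_V = np.array([[3, 1], [-1, 1]], dtype=float)

# The `*` in the equations denotes the matrix product, written `@` in NumPy.
Q = CN @ W_Q
K = CN @ W_K
V = CN @ W_V

print(Q)
print(K)
print(V)
```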

The output of the self-attention layer is:
$$SA = \text{softmax}\left(\frac{Q * K^T}{\sqrt{d}}\right) * V$$

Here $d$ is the dimension of each query vector, which is 2 in this network. The softmax is taken row-wise, the usual convention for attention, so the weights of each query sum to 1.

$$
\begin{align}
SA & = \text{softmax}\left(\frac{
\begin{bmatrix}
1 & -2 \\
-2 & 6 \\
\end{bmatrix}
*
\begin{bmatrix}
1 & -1 \\
1 & 2 \\
\end{bmatrix}^T
}{\sqrt{2}}\right)*
\begin{bmatrix}
3 & 1 \\
-7 & -1 \\
\end{bmatrix}\\
& = \text{softmax}\left(\frac{
\begin{bmatrix}
1 & -2 \\
-2 & 6 \\
\end{bmatrix}
*
\begin{bmatrix}
1 & 1 \\
-1 & 2 \\
\end{bmatrix}
}{\sqrt{2}}\right)*
\begin{bmatrix}
3 & 1 \\
-7 & -1 \\
\end{bmatrix}\\
& = \text{softmax}\left(\begin{bmatrix}
\frac{3}{\sqrt{2}} & -\frac{3}{\sqrt{2}} \\
-4\cdot\sqrt{2} & 5\cdot\sqrt{2} \\
\end{bmatrix}\right)*
\begin{bmatrix}
3 & 1 \\
-7 & -1 \\
\end{bmatrix}\\
& \approx \begin{bmatrix}
0.986 & 0.014 \\
0.000 & 1.000 \\
\end{bmatrix}*
\begin{bmatrix}
3 & 1 \\
-7 & -1 \\
\end{bmatrix}\\
& \approx \begin{bmatrix}
2.86 & 0.97 \\
-7.00 & -1.00 \\
\end{bmatrix}
\end{align}$$

Now we flatten this row by row, so that $SA = \begin{bmatrix}
2.86\\
0.97\\
-7.00\\
-1.00\\
\end{bmatrix}$

Finally, we calculate $\hat{y}$:
$$
\begin{align}
\hat{y} & = \text{ReLU}(W^{DT}* SA -B^D)\\
& = \text{ReLU}\left(\begin{bmatrix}
1\\
1\\
0\\
2\\
\end{bmatrix}^T * \begin{bmatrix}
2.86\\
0.97\\
-7.00\\
-1.00\\
\end{bmatrix} - 1\right)\\
& = \text{ReLU}\left(\begin{bmatrix}
1 & 1 & 0 & 2\\
\end{bmatrix}*\begin{bmatrix}
2.86\\
0.97\\
-7.00\\
-1.00\\
\end{bmatrix} - 1\right)\\
& = \text{ReLU}(2.86 + 0.97 + 0 - 2.00 - 1)\\
& = \text{ReLU}(0.83)\\
& = \underline{\underline{0.83}}
\end{align}
$$

## Task b) Update the weights and biases

To begin, we calculate the derivative of the error with respect to $\hat{y}$. For the squared error $C = (y-\hat{y})^2$ with target $y = 1$:
$$\frac{\partial C}{\partial \hat{y}} = -2(y-\hat{y}) = -2(1 - 0.83) = -0.34$$

To update the weights and biases of the FCNN, we propagate it back. The argument of the ReLU is positive (0.83), so its derivative is 1:
$$
\begin{align}
\frac{\partial C}{\partial W^D} & = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W^D}\\
& = -2(y-\hat{y})\cdot SA\\
& = -0.34 \cdot \begin{bmatrix}
2.86\\
0.97\\
-7.00\\
-1.00\\
\end{bmatrix}\\
& \approx \begin{bmatrix}
-0.97\\
-0.33\\
2.38\\
0.34\\
\end{bmatrix}
\end{align}$$

The bias enters $\hat{y}$ with a minus sign, so $\frac{\partial \hat{y}}{\partial B^D} = -1$:

$$
\frac{\partial C}{\partial B^D} = -\frac{\partial C}{\partial \hat{y}} = 0.34
$$

We can now update $W^D$ and $B^D$ with the learning rate $\alpha = 0.1$:

$$
W_1^D = W_0^D -\alpha\frac{\partial C}{\partial W^D} = \begin{bmatrix}
1\\
1\\
0\\
2\\
\end{bmatrix} - 0.1
\begin{bmatrix}
-0.97\\
-0.33\\
2.38\\
0.34\\
\end{bmatrix}
\approx \begin{bmatrix}
1.10\\
1.03\\
-0.24\\
1.97\\
\end{bmatrix}
$$

$$
B_1^D = B_0^D - \alpha\frac{\partial C}{\partial B^D} = 1 - 0.1\cdot 0.34 \approx 0.97
$$

In order to calculate the derivatives for the other layers, we also need:
$$\begin{align}
\frac{\partial C}{\partial SA} & = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial SA}\\
& = -2(y-\hat{y}) \cdot W^D\\
& = -0.34 \cdot \begin{bmatrix}
1\\
1\\
0\\
2\\
\end{bmatrix} \\
& = \begin{bmatrix}
-0.34\\
-0.34\\
0\\
-0.68\\
\end{bmatrix}
\end{align}
$$

We reshape this to: $\frac{\partial C}{\partial SA} = \begin{bmatrix}
-0.34 & -0.34\\
0 & -0.68\\
\end{bmatrix}$

Backpropagation through the self-attention layer is a bit more complicated. To make it more readable, we write $S = \frac{Q* K^T}{\sqrt{d}}$ and $A = \text{softmax}(S)$, so that $SA = A* V = \text{softmax}(S)* V$. Then we have:
$$
\begin{align}
\frac{\partial C}{\partial V} & = A^T * \frac{\partial C}{\partial SA}\\
& = \begin{bmatrix}
0.986 & 0.000 \\
0.014 & 1.000 \\
\end{bmatrix} * \begin{bmatrix}
-0.34 & -0.34\\
0 & -0.68\\
\end{bmatrix}\\
& \approx \begin{bmatrix}
-0.335 & -0.335\\
-0.005 & -0.685\\
\end{bmatrix}
\end{align}
$$

This can be used to calculate the derivatives of the cost with respect to the value weights:

$$\begin{align}
\frac{\partial C}{\partial W^V} & = CN^T * \frac{\partial C}{\partial V}\\
& = \begin{bmatrix}
1 & -2\\
0 & 1\\
\end{bmatrix} * \begin{bmatrix}
-0.335 & -0.335\\
-0.005 & -0.685\\
\end{bmatrix}\\
& = \begin{bmatrix}
-0.325 & 1.035\\
-0.005 & -0.685\\
\end{bmatrix}
\end{align}
$$
Now we update the value weights:
$$
W_1^V = W_0^V - \alpha\cdot\frac{\partial C}{\partial W^V} =
\begin{bmatrix}
3 & 1\\
-1 & 1\\
\end{bmatrix} - 0.1
\begin{bmatrix}
-0.325 & 1.035\\
-0.005 & -0.685\\
\end{bmatrix}
\approx \begin{bmatrix}
3.03 & 0.90\\
-1.00 & 1.07\\
\end{bmatrix}
$$
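
The attention forward pass and the gradients derived in Task b can be sketched end to end to cross-check the hand calculation. This is an illustrative sketch rather than a reference implementation. It assumes NumPy, a row-wise softmax, a dense-layer bias of $B^D = 1$ that is subtracted before the ReLU, a squared-error cost $C = (y-\hat{y})^2$ with target $y = 1$, and a learning rate of $0.1$. The helper name `softmax_rows` is hypothetical.

```python
import numpy as np


def softmax_rows(s):
    """Numerically stable softmax applied to each row of s."""
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


# Matrices taken from the worked solution.
CN = np.array([[1.0, 0.0], [-2.0, 1.0]])
Q = np.array([[1.0, -2.0], [-2.0, 6.0]])
K = np.array([[1.0, -1.0], [1.0, 2.0]])
V = np.array([[3.0, 1.0], [-7.0, -1.0]])
W_V = np.array([[3.0, 1.0], [-1.0, 1.0]])
W_D = np.array([1.0, 1.0, 0.0, 2.0])

# Assumptions, labelled as such in the lead-in.
B_D = 1.0    # assumed bias, subtracted in the dense layer
y = 1.0      # assumed target
alpha = 0.1  # learning rate

# Forward pass: scaled dot-product attention, then the dense layer.
d = Q.shape[1]
S = Q @ K.T / np.sqrt(d)
A = softmax_rows(S)
SA = A @ V
z = W_D @ SA.flatten() - B_D
y_hat = max(z, 0.0)  # ReLU

# Backward pass for the quantities derived in Task b.
dC_dyhat = -2.0 * (y - y_hat)
dC_dz = dC_dyhat * (1.0 if z > 0 else 0.0)  # ReLU derivative
dC_dW_D = dC_dz * SA.flatten()
dC_dB_D = -dC_dz  # the bias enters with a minus sign
dC_dSA = (dC_dz * W_D).reshape(SA.shape)
dC_dV = A.T @ dC_dSA
dC_dW_V = CN.T @ dC_dV

# Gradient-descent updates.
W_D_new = W_D - alpha * dC_dW_D
B_D_new = B_D - alpha * dC_dB_D
W_V_new = W_V - alpha * dC_dW_V

print(np.round(SA, 2), round(float(y_hat), 2))
print(np.round(W_D_new, 2), round(float(B_D_new), 2))
print(np.round(W_V_new, 2))
```

Because the code keeps full floating-point precision, its results can differ from a hand calculation with rounded intermediates in the last displayed digit.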