# Chapter 6: Deep learning

## Introducing convolutional networks

### Problem 1 ([link](http://neuralnetworksanddeeplearning.com/chap6.html#problem_393174)): equations of backpropagation in a convolutional network

We consider a network like the following, except that **there's only one convolutional layer and only one pooling layer:**

![img/conv_network.png](img/conv_network.png)

The leftmost arrows actually look like this:

![img/conv.png](img/conv.png)

And the second arrows like this:

![img/max_pooling.png](img/max_pooling.png)

The third arrow actually represents a full connection between the max-pooling layer and the output layer.

We'll call:

* $a_{j,k}^0$, $0 \leq j, k \leq 27$, the input activations;
* $w_{l,m}^1$, $0 \leq l, m \leq 4$, the shared weights for the convolutional layer;
* $b^1$, the shared bias for the convolutional layer;
* $z_{j,k}^1$, $0 \leq j, k \leq 23$, the weighted input to neuron $(j, k)$ (row $j$, column $k$) in the convolutional layer:

$$z_{j,k}^1 = b^1 + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m}^1 a_{j+l, k+m}^0$$

* $a_{j,k}^1$, $0 \leq j, k \leq 23$, the activation of neuron $(j, k)$ in the convolutional layer:

$$a_{j,k}^1 = \sigma \left(z_{j,k}^1 \right)$$

* $a_{j,k}^2$, $0 \leq j, k \leq 11$, the activation of neuron $(j, k)$ in the max-pooling layer:

$$a_{j,k}^2 = \max \left( a_{2j, 2k}^1, a_{2j, 2k+1}^1, a_{2j+1, 2k}^1, a_{2j+1, 2k+1}^1 \right)$$

So neuron $(j,k)$ in the convolutional layer contributes to the max computed by neuron $\left( \left \lfloor{\frac j 2}\right \rfloor, \left \lfloor{\frac k 2}\right \rfloor \right)$ in the max-pooling layer.

* **Note that the max-pooling layer doesn't have any weights, biases, or weighted inputs!**
* $w_{l;j,k}^3$, $0 \leq j, k \leq 11, 0 \leq l \leq 9$, the weight of the connection between neuron $(j,k)$ in the max-pooling layer and neuron $l$ in the output layer;
* $b_l^3$, $0 \leq l \leq 9$, the bias of neuron $l$ in the output layer;
* $z_l^3$, $0 \leq l \leq 9$, the weighted input of neuron $l$ in the output layer:

$$z_l^3 = b_l^3 + \sum\limits_{0 \leq j, k \leq 11} w_{l;j,k}^3 a_{j,k}^2$$

* $a_l^3$, $0 \leq l \leq 9$, the output activation of neuron $l$ in the output layer:

$$a_l^3 = \sigma \left( z_l^3 \right)$$
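The definitions above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the function name `forward`, the array layout (`w3` stored with shape `(10, 12, 12)`) and the random test values are choices made here, not part of the exercise or of Nielsen's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, w1, b1, w3, b3):
    """Forward pass for the single-feature-map network defined above.

    a0: (28, 28) input activations
    w1: (5, 5) shared weights, b1: scalar shared bias
    w3: (10, 12, 12) output weights, b3: (10,) output biases
    (this array layout is an assumption made for the sketch)
    """
    # z1[j, k] = b1 + sum_{l, m} w1[l, m] * a0[j + l, k + m]
    z1 = np.full((24, 24), b1, dtype=float)
    for l in range(5):
        for m in range(5):
            z1 += w1[l, m] * a0[l:l + 24, m:m + 24]
    a1 = sigmoid(z1)
    # 2x2 max-pooling: a2[j, k] = max of a1[2j:2j+2, 2k:2k+2]
    a2 = a1.reshape(12, 2, 12, 2).max(axis=(1, 3))
    # z3[l] = b3[l] + sum_{j, k} w3[l, j, k] * a2[j, k]
    z3 = b3 + np.tensordot(w3, a2, axes=([1, 2], [0, 1]))
    a3 = sigmoid(z3)
    return z1, a1, a2, z3, a3

# Random values, only to exercise the shapes
rng = np.random.default_rng(0)
a0 = rng.random((28, 28))
w1, b1 = rng.normal(size=(5, 5)), 0.1
w3, b3 = rng.normal(size=(10, 12, 12)), rng.normal(size=10)
z1, a1, a2, z3, a3 = forward(a0, w1, b1, w3, b3)
print(z1.shape, a2.shape, a3.shape)  # (24, 24) (12, 12) (10,)
```

The reshape to `(12, 2, 12, 2)` groups each 2x2 pooling field along axes 1 and 3, so taking the max over those axes reproduces the formula for $a_{j,k}^2$.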

Now for comparison, here are equations BP1 - BP4 for regular fully connected networks:

* **BP1**: $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$
* **BP2**: $\delta^l_j = \sum_k w^{l+1}_{kj}  \delta^{l+1}_k \sigma'(z^l_j)$
* **BP3**: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
* **BP4**: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$

And their shortened derivations (only writing $\frac{\partial x}{\partial y}$ when $y$ has an influence on $x$):

* **BP1**:

$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L} \frac{\partial a_j^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L)$$

* **BP2**:

$$\delta_j^l = \frac{\partial C}{\partial z_j^l} = \sum\limits_k \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial a_j^l} \frac{\partial a_j^l}{\partial z_j^l} = \sum\limits_k \delta_k^{l+1} w_{kj}^{l+1} \sigma'(z_j^l)$$

* **BP3**:

$$\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l \times 1$$

* **BP4**:

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l a_k^{l-1}$$

Let's look at each equation in turn, with our new network architecture.

* **BP1**: the output layer has the same form as in a regular fully connected network ($a_l^3 = \sigma \left( z_l^3 \right)$), so the derivation of BP1 still holds. Therefore, BP1 doesn't change.
* **BP2**: since the max-pooling layer doesn't have any weighted inputs, the only error left to compute is $\delta_{j,k}^1$, the error in the convolutional layer.

\begin{equation*}
    \begin{aligned}
        \delta_{j,k}^1 &= \frac{\partial C}{\partial z_{j,k}^1} \\
        &= \sum\limits_{l=0}^9 \frac{\partial C}{\partial z_l^3} \frac{\partial z_l^3}{\partial z_{j,k}^1} \\
        &= \sum\limits_{l=0}^9 \delta_l^3 \frac{\partial z_l^3}{\partial a_{j', k'}^2} \frac{\partial a_{j', k'}^2}{\partial z_{j,k}^1} \qquad \text{with } j' = \left \lfloor{\frac j 2}\right \rfloor \text{ and } k' = \left \lfloor{\frac k 2}\right \rfloor  \\
        &\text{(} a_{j',k'}^2 \text{ being the only activation in the max-pooling layer affected by } z_{j,k}^1 \text{)} \\
        &= \sum\limits_{l=0}^9 \delta_l^3 w_{l;j',k'}^3 \frac{\partial a_{j', k'}^2}{\partial z_{j,k}^1} \\
        &= \sum\limits_{l=0}^9 \delta_l^3 w_{l;j',k'}^3 \frac{\partial a_{j', k'}^2}{\partial a_{j,k}^1} \frac{\partial a_{j,k}^1}{\partial z_{j,k}^1} \\
        &= \sum\limits_{l=0}^9 \delta_l^3 w_{l;j',k'}^3 \frac{\partial a_{j', k'}^2}{\partial a_{j,k}^1} \sigma' \left( z_{j,k}^1 \right)
    \end{aligned}
\end{equation*}

Now since $a_{j',k'}^2 = \max \left( a_{2j', 2k'}^1, a_{2j', 2k'+1}^1, a_{2j'+1, 2k'}^1, a_{2j'+1, 2k'+1}^1 \right)$ and we're talking about infinitesimal changes, we have:

\begin{eqnarray}
   \frac{\partial a_{j', k'}^2}{\partial a_{j,k}^1} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } a_{j,k}^1 \neq \max \left( a_{2j', 2k'}^1, a_{2j', 2k'+1}^1, a_{2j'+1, 2k'}^1, a_{2j'+1, 2k'+1}^1 \right) \\
      1 & \mbox{if } a_{j,k}^1 = \max \left( a_{2j', 2k'}^1, a_{2j', 2k'+1}^1, a_{2j'+1, 2k'}^1, a_{2j'+1, 2k'+1}^1 \right)
    \end{array}
  \right.
\end{eqnarray}

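This is because an infinitesimal change in $a_{j,k}^1$ only affects $a_{j',k'}^2$ if $a_{j,k}^1$ is the maximum activation in its local pooling field (assuming this maximum is reached by a single neuron; the derivative isn't defined when there's a tie). In this case, we have $a_{j', k'}^2 = a_{j,k}^1$, so $\frac{\partial a_{j', k'}^2}{\partial a_{j,k}^1} = 1$.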

And so to conclude the derivation of our new BP2:

\begin{eqnarray}
   \delta_{j,k}^1 = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } a_{j,k}^1 \neq \max \left( a_{2j', 2k'}^1, a_{2j', 2k'+1}^1, a_{2j'+1, 2k'}^1, a_{2j'+1, 2k'+1}^1 \right) \\
      \sum\limits_{l=0}^9 \delta_l^3 w_{l;j',k'}^3 \sigma' \left( z_{j,k}^1 \right) & \mbox{if } a_{j,k}^1 = \max \left( a_{2j', 2k'}^1, a_{2j', 2k'+1}^1, a_{2j'+1, 2k'}^1, a_{2j'+1, 2k'+1}^1 \right)
    \end{array}
  \right.
\end{eqnarray}
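As a minimal sketch of this new BP2, assuming the output errors $\delta_l^3$ have already been computed with BP1, the error of the convolutional layer can be obtained by routing the error of each max-pooling neuron back to the neuron that reached the max. The names `upsample` and `conv_layer_delta`, and the `(10, 12, 12)` layout of `w3`, are assumptions made for this illustration.

```python
import numpy as np

def upsample(x):
    # (12, 12) -> (24, 24): entry (j, k) of the result is x[j // 2, k // 2]
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def conv_layer_delta(a1, w3, delta3):
    """delta1[j, k] following the new BP2.

    a1: (24, 24) activations of the convolutional layer
    w3: (10, 12, 12) output weights, delta3: (10,) output errors
    """
    # Error reaching max-pooling neuron (j', k'): sum_l delta3[l] * w3[l, j', k']
    pooled_error = np.tensordot(delta3, w3, axes=(0, 0))  # shape (12, 12)
    # 1 where a1[j, k] is the max of its 2x2 pooling field, 0 elsewhere.
    # In case of a tie, every tied neuron gets a 1 (the derivation ignores ties).
    a2 = a1.reshape(12, 2, 12, 2).max(axis=(1, 3))
    is_max = a1 == upsample(a2)
    # sigma'(z) = sigma(z) * (1 - sigma(z)) = a1 * (1 - a1)
    sigma_prime = a1 * (1.0 - a1)
    return upsample(pooled_error) * is_max * sigma_prime

# Random values, only to exercise the function
rng = np.random.default_rng(0)
a1 = 1.0 / (1.0 + np.exp(-rng.normal(size=(24, 24))))
w3 = rng.normal(size=(10, 12, 12))
delta3 = rng.normal(size=10)
delta1 = conv_layer_delta(a1, w3, delta3)
print(delta1.shape, np.count_nonzero(delta1))
```

Barring ties, only one neuron per pooling field can have a nonzero error, so at most 144 of the 576 entries of `delta1` are nonzero.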

* **BP3**: we consider two cases:
    * $\frac{\partial C}{\partial b_l^3} = \delta_l^3$ as the third layer respects the previous architecture (the derivation still works);
    * $\frac{\partial C}{\partial b^1}$. This one is different, since the bias $b^1$ is shared for all neurons in the convolutional layer. We have:
 
\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b^1} &= \sum\limits_{0 \leq j,k \leq 23} \frac{\partial C}{\partial z_{j,k}^1} \frac{\partial z_{j,k}^1}{\partial b^1} \\
        &= \sum\limits_{0 \leq j,k \leq 23} \delta_{j,k}^1 \frac{\partial z_{j,k}^1}{\partial b^1} \\
        &= \sum\limits_{0 \leq j,k \leq 23} \delta_{j,k}^1 \qquad \text{as } z_{j,k}^1 = b^1 + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m}^1 a_{j+l, k+m}^0
    \end{aligned}
\end{equation*}

* **BP4**:
    * $\frac{\partial C}{\partial w_{l;j,k}^3} = a_{j,k}^2 \delta_l^3$ since, again, the derivation still works for the third layer;
    * $\frac{\partial C}{\partial w_{l,m}^1}, 0 \leq l, m \leq 4$. These 25 weights are shared, and each of them is used in the computation of the weighted input of each neuron in the convolutional layer:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{l,m}^1} &= \sum\limits_{0 \leq j, k \leq 23} \frac{\partial C}{\partial z_{j,k}^1} \frac{\partial z_{j, k}^1}{\partial w_{l,m}^1} \\
        &= \sum\limits_{0 \leq j, k \leq 23} \delta_{j,k}^1 \frac{\partial z_{j, k}^1}{\partial w_{l,m}^1} \\
        &= \sum\limits_{0 \leq j, k \leq 23} \delta_{j,k}^1 a_{j+l,k+m}^0 \qquad \text{as } z_{j,k}^1 = b^1 + \sum_{l'=0}^4 \sum_{m'=0}^4  w_{l',m'}^1 a_{j+l', k+m'}^0
    \end{aligned}
\end{equation*}
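The new BP3 and BP4 for the shared bias and weights can be checked numerically. The sketch below compares them with centered finite differences on random data. It assumes a quadratic cost $C = \frac 1 2 \sum_l (a_l^3 - y_l)^2$, which the derivation above doesn't require, and all function names, array layouts and test values are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def up(x):
    # Repeat every entry over a 2x2 block: (12, 12) -> (24, 24)
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def forward(a0, w1, b1, w3, b3):
    z1 = np.full((24, 24), b1, dtype=float)
    for l in range(5):
        for m in range(5):
            z1 += w1[l, m] * a0[l:l + 24, m:m + 24]
    a1 = sigmoid(z1)
    a2 = a1.reshape(12, 2, 12, 2).max(axis=(1, 3))
    a3 = sigmoid(b3 + np.tensordot(w3, a2, axes=([1, 2], [0, 1])))
    return a1, a2, a3

def cost(a0, w1, b1, w3, b3, y):
    # Quadratic cost: an assumption made only for this illustration
    return 0.5 * np.sum((forward(a0, w1, b1, w3, b3)[2] - y) ** 2)

def shared_gradients(a0, w1, b1, w3, b3, y):
    a1, a2, a3 = forward(a0, w1, b1, w3, b3)
    delta3 = (a3 - y) * a3 * (1 - a3)                     # BP1 with the quadratic cost
    pooled_error = np.tensordot(delta3, w3, axes=(0, 0))  # sum_l delta3[l] * w3[l, j', k']
    delta1 = up(pooled_error) * (a1 == up(a2)) * a1 * (1 - a1)  # new BP2
    grad_b1 = delta1.sum()                                # new BP3: sum of all delta1
    grad_w1 = np.empty((5, 5))
    for l in range(5):
        for m in range(5):
            # new BP4: sum_{j, k} delta1[j, k] * a0[j + l, k + m]
            grad_w1[l, m] = np.sum(delta1 * a0[l:l + 24, m:m + 24])
    return grad_b1, grad_w1

# Random values, scaled down to keep the sigmoids away from saturation
rng = np.random.default_rng(0)
a0 = rng.random((28, 28))
w1, b1 = 0.2 * rng.normal(size=(5, 5)), 0.1
w3, b3 = 0.1 * rng.normal(size=(10, 12, 12)), rng.normal(size=10)
y = np.zeros(10)
y[3] = 1.0
grad_b1, grad_w1 = shared_gradients(a0, w1, b1, w3, b3, y)

# Centered finite differences on one shared weight and on the shared bias
eps = 1e-6
w_plus, w_minus = w1.copy(), w1.copy()
w_plus[2, 3] += eps
w_minus[2, 3] -= eps
numeric_w = (cost(a0, w_plus, b1, w3, b3, y) - cost(a0, w_minus, b1, w3, b3, y)) / (2 * eps)
numeric_b = (cost(a0, w1, b1 + eps, w3, b3, y) - cost(a0, w1, b1 - eps, w3, b3, y)) / (2 * eps)
print(grad_w1[2, 3], numeric_w)
print(grad_b1, numeric_b)
```

Each printed pair should agree to several decimal places. The check can fail if the perturbation changes which neuron reaches the max in some pooling field, since the cost isn't differentiable there, but this is very unlikely with such a small `eps`.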

## Convolutional neural networks in practice

### Exercise 1 ([link](http://neuralnetworksanddeeplearning.com/chap6.html#exercise_683491)): importance of the fully-connected layer

Because I'm running the code on a CPU, training is slow (it would take around 30 minutes to train the network with the fully-connected layer for 60 epochs). So:

* I won't train the network with the fully-connected layer and will instead trust Nielsen's result of a 98.78 percent accuracy;
* I'll only train the network without the fully-connected layer once, instead of keeping the best of 3 runs.

For these reasons, the comparison isn't rigorous. To get a more satisfying comparison, simply train 3 networks and keep the best result.

The code is in the `chap6ex1` directory.

`exec_with.py` trains a network with the fully-connected layer (which I haven't done), and `exec_without.py` trains a network without this layer.

Without the fully-connected layer, the best classification accuracy obtained during these 60 epochs is 98.52 percent (at epoch 26).

That's 0.26 percentage points below Nielsen's 98.78 percent with a fully-connected layer, but I trained only one network, not 3 (as announced, the comparison isn't rigorous).

Still, the difference seems significant, so the fully-connected layer was probably helpful.