# Chapter 3: Improving the way neural networks learn

## The cross-entropy cost function

## Introducing the cross-entropy cost function

### Exercise 1 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercise_35813)): verify that $\sigma'(z) = \sigma(z) (1-\sigma(z))$

The definition of the sigmoid function is: $\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$.

The derivative of $1 + e^{-z}$ is $-e^{-z}$, so by the chain rule we have $\sigma'(z) = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$.

And $\sigma(z)(1 - \sigma(z)) = \frac{1}{1 + e^{-z}} \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2}$

And so we have $\sigma'(z) = \sigma(z) (1-\sigma(z))$.
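As a sanity check on the identity above, here is a minimal sketch that compares the closed form $\sigma(z)(1-\sigma(z))$ against a central finite difference of $\sigma$. The function names, the sample points and the step size `h` are choices made for this illustration only.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed form derived above: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-5):
    """Numerical derivative (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Largest gap between the closed form and the numerical estimate
# over a few arbitrary sample points.
points = [-4.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(abs(sigmoid_prime(z) - central_difference(sigmoid, z))
              for z in points)
print(max_gap)
```

The printed gap should be tiny compared with the derivative itself, which is consistent with the algebraic proof.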

### Exercise 2 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercises_824189)): the roles of $y$s and $a$s in the cross-entropy cost function

The correct cross-entropy cost function is $-[y \ln a + (1-y) \ln (1-a)]$.

The (incorrect) similar expression $-[a \ln y + (1-a) \ln (1-y)]$ isn't defined when $y = 0$ or $y = 1$: the first case requires $\ln y = \ln 0$ and the second requires $\ln(1 - y) = \ln 0$, and $\ln(x)$ isn't defined when $x = 0$.

This is an issue because $y = 0$ or $y = 1$ can clearly happen, as $y$ is the correct output (if the expected answer is "yes", we would ideally like the network to output exactly 1; in this case we would have $y = 1$).

With the correct definition, we might think that the same problem would arise when $a = 0$ or $a = 1$. However, this never happens with the sigmoid activation function: $a = \sigma(z)$, and whatever the weighted input $z$ of a neuron, we always have $0 < \sigma(z) < 1$ by definition of $\sigma$.
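The difference between the two expressions can be seen on a single example. The sketch below is only illustrative: the helper names `cross_entropy` and `swapped` and the values $a = 0.9$, $y = 1$ are assumptions made for this example. It relies on the fact that Python's `math.log` raises `ValueError` when called on 0.

```python
import math

def cross_entropy(a, y):
    """Correct form: -[y ln a + (1 - y) ln(1 - a)], for 0 < a < 1."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def swapped(a, y):
    """Incorrect form with a and y exchanged: -[a ln y + (1 - a) ln(1 - y)]."""
    return -(a * math.log(y) + (1.0 - a) * math.log(1.0 - y))

# A sigmoid output strictly inside (0, 1) and a binary target.
a, y = 0.9, 1.0

correct_cost = cross_entropy(a, y)  # finite, equal to -ln(0.9)

try:
    swapped(a, y)                   # needs ln(1 - y) = ln(0)
    swapped_is_defined = True
except ValueError:                  # math.log is undefined at 0
    swapped_is_defined = False

print(correct_cost, swapped_is_defined)
```

The correct cost is finite for this binary target, while the swapped expression cannot be evaluated at all.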

### Exercise 3: show that the cross-entropy function is still a good cost function when $0 < y < 1$.

Namely, we need to show that the cross-entropy cost function $C(a) = - (y \ln a + (1 - y) \ln (1 - a))$ is minimized when $a = y$.

Let's differentiate $C$:

$C'(a) = - \frac y a + \frac{1-y}{1-a}$.

We look for a local extremum by solving $C'(a) = 0$:

\begin{equation*}
    \begin{aligned}
        - \frac y a + \frac{1-y}{1-a} = 0 &\iff \frac y a = \frac{1-y}{1-a} \\
        &\iff y - ay = a - ay \\
        &\iff a = y
    \end{aligned}
\end{equation*}

So $C$ has a unique critical point, at $a = y$. To determine whether it is a minimum or a maximum, we compute the second derivative:

$C''(a) = \frac{y}{a^2} + \frac{1 - y}{(1 - a)^2}$.

Since $0 < y < 1$ by assumption and $0 < a < 1$ as always, both terms are strictly positive, so $C''(a) > 0$ for all $0 < a < 1$ (the function is strictly convex). Therefore, $C$ is minimized when $a = y$.
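The result can also be checked numerically. The following sketch scans a grid of activations for one hypothetical soft target, $y = 0.3$, and keeps the activation with the lowest cost; the grid resolution and the target value are arbitrary choices for illustration.

```python
import math

def cross_entropy(a, y):
    """C(a) = -[y ln a + (1 - y) ln(1 - a)] for 0 < a < 1."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

y = 0.3  # hypothetical target strictly between 0 and 1

# Activations 0.001, 0.002, ..., 0.999 (the endpoints 0 and 1 are excluded).
grid = [i / 1000 for i in range(1, 1000)]
best_a = min(grid, key=lambda a: cross_entropy(a, y))

print(best_a, cross_entropy(best_a, y))
```

Note that the minimum value of the cost is not zero when $0 < y < 1$: at $a = y$ it equals $-[y \ln y + (1-y) \ln(1-y)]$, which is strictly positive. What matters for learning is that the minimum is reached at $a = y$.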

### Problem 1 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#problems_382219)): Many-layer multi-neuron networks

For a single training example $x$, we have for the quadratic cost function:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \qquad \text{(BP1)}
    \end{aligned}
\end{equation*}

The cost function for a single training example is $C = \frac 1 2 \sum_i (a_i^L - y_i)^2$, so we have $\frac{\partial C}{\partial a_j^L} = a_j^L - y_j$. This gives us:

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j)$$

And taking all training examples into account,

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum\limits_x a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j)$$

Now with the cross-entropy cost function, let's first compute $\delta^L$ for a single training example $x$: for all neurons $j$ in the $L$th layer,

\begin{equation*}
    \begin{aligned}
        \delta_j^L &= \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \qquad \text{(BP1)} \\
        &= - \left( \frac{y_j}{a_j^L} - \frac{1 - y_j}{1 - a_j^L} \right) \sigma'(z_j^L) \\
        &= - \left( \frac{y_j}{\sigma(z_j^L)} - \frac{1 - y_j}{1 - \sigma(z_j^L)} \right) \sigma(z_j^L) \left( 1 - \sigma(z_j^L) \right) \\
        &= - \left( y_j (1 - \sigma(z_j^L)) - (1 - y_j) \sigma(z_j^L) \right) \\
        &= \sigma(z_j^L) - y_j \\
        &= a_j^L - y_j
    \end{aligned}
\end{equation*}

And so $\delta^L = a^L - y$.

Let's substitute this into our previous calculation.

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} (a_j^L - y_j)
    \end{aligned}
\end{equation*}

And taking all training examples into account,

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum\limits_x a^{L-1}_k  (a^L_j-y_j)$$

For the biases, everything is the same except that instead of using BP4 ($\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1} \delta_j^L$), we use BP3 ($\frac{\partial C}{\partial b_j^L} = \delta_j^L$) and so we don't have the $a_k^{L-1}$ part.
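To make the two weight gradients concrete, here is a small sketch for a single sigmoid output neuron and a single training example. All numerical values are hypothetical and chosen so that the neuron is saturated and badly wrong. The sketch checks the cross-entropy formula $a_k^{L-1}(a_j^L - y_j)$ against a finite-difference estimate, and computes the quadratic-cost gradient, which carries the extra factor $\sigma'(z_j^L)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def activation(w, b, a_prev):
    """Output a_j^L of one sigmoid neuron fed by the activations a^{L-1}."""
    z = sum(w_k * a_k for w_k, a_k in zip(w, a_prev)) + b
    return sigmoid(z)

def cross_entropy(w, b, a_prev, y):
    a = activation(w, b, a_prev)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def numerical_grad_w(w, b, a_prev, y, h=1e-6):
    """Central-difference estimate of dC/dw_k for the cross-entropy cost."""
    grad = []
    for k in range(len(w)):
        w_plus = w[:k] + [w[k] + h] + w[k + 1:]
        w_minus = w[:k] + [w[k] - h] + w[k + 1:]
        grad.append((cross_entropy(w_plus, b, a_prev, y)
                     - cross_entropy(w_minus, b, a_prev, y)) / (2.0 * h))
    return grad

# Hypothetical values: z = 3*1 + 2*0.5 + 1 = 5, so the neuron is saturated
# near 1 while the target is 0.
a_prev, w, b, y = [1.0, 0.5], [3.0, 2.0], 1.0, 0.0
a = activation(w, b, a_prev)

# Cross-entropy: a_k^{L-1} (a_j^L - y_j)
grad_cross_entropy = [a_k * (a - y) for a_k in a_prev]
# Quadratic cost: same expression times sigma'(z) = a (1 - a)
grad_quadratic = [a_k * (a - y) * a * (1.0 - a) for a_k in a_prev]
grad_numerical = numerical_grad_w(w, b, a_prev, y)

print(grad_cross_entropy, grad_numerical, grad_quadratic)
```

With these values the quadratic-cost gradient is smaller than the cross-entropy gradient by the factor $\sigma'(z) = a(1-a)$, which is below $0.01$ here. This is the learning slowdown that the cross-entropy cost removes.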

### Problem 2: using the quadratic cost when we have linear neurons in the output layer

We use the quadratic cost function and the identity activation function $f: x \mapsto x$ in the last layer. We have for a single training example $x$ and for all neurons $j$ in the $L$th layer:

$$\delta_j^L = \frac{\partial C}{\partial a_j^L} f'(z_j^L) \qquad \text{(BP1)}$$

The cost function for a single training example is $C = \frac 1 2 \sum_i (a_i^L - y_i)^2$, so we have $\frac{\partial C}{\partial a_j^L} = a_j^L - y_j$. And $\forall x \in \mathbb{R}, f'(x) = 1$.

So $\delta_j^L = a_j^L - y_j$. In vector form:

$$\delta^L = a^L - y$$

Applying BP4 gives us, for a single training example:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} (a_j^L - y_j)
    \end{aligned}
\end{equation*}

For the biases we apply BP3:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b_j^L} &= \delta_j^L \qquad \text{(BP3)} \\
        &= (a_j^L - y_j)
    \end{aligned}
\end{equation*}

And taking all training examples into account:

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k  (a^L_j-y_j)$$

And

$$\frac{\partial C}{\partial b^L_{j}} = \frac{1}{n} \sum_x (a^L_j-y_j)$$
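The averaged formulas can be checked on a toy batch. The sketch below assumes a single linear output neuron, the quadratic cost $C = \frac{1}{2n} \sum_x (a^L - y)^2$, and three made-up training pairs; every name and value is hypothetical. It compares the analytic gradients with central finite differences.

```python
def linear_output(w, b, a_prev):
    """Linear neuron: a = z = w . a_prev + b."""
    return sum(w_k * a_k for w_k, a_k in zip(w, a_prev)) + b

def quadratic_cost(w, b, batch):
    """C = 1/(2n) * sum over the batch of (a - y)^2."""
    n = len(batch)
    return sum((linear_output(w, b, a_prev) - y) ** 2
               for a_prev, y in batch) / (2.0 * n)

def gradients(w, b, batch):
    """Analytic gradients: averages of a_k^{L-1} (a - y) and of (a - y)."""
    n = len(batch)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for a_prev, y in batch:
        delta = linear_output(w, b, a_prev) - y  # delta^L = a^L - y
        for k, a_k in enumerate(a_prev):
            grad_w[k] += a_k * delta / n
        grad_b += delta / n
    return grad_w, grad_b

# Hypothetical batch of (previous-layer activations, target) pairs.
batch = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 1.0)]
w, b = [0.2, -0.4], 0.1
grad_w, grad_b = gradients(w, b, batch)

# Central finite differences for the bias and for the first weight.
h = 1e-6
fd_b = (quadratic_cost(w, b + h, batch)
        - quadratic_cost(w, b - h, batch)) / (2.0 * h)
fd_w0 = (quadratic_cost([w[0] + h, w[1]], b, batch)
         - quadratic_cost([w[0] - h, w[1]], b, batch)) / (2.0 * h)

print(grad_w, grad_b, fd_w0, fd_b)
```

Because the output is linear, no $\sigma'$ factor appears anywhere, so the gradient is proportional to the error $a^L - y$, exactly as with the cross-entropy cost and sigmoid neurons.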