# Chapter 3: Improving the way neural networks learn

## The cross-entropy cost function

## Introducing the cross-entropy cost function

### Exercise 1 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercise_35813)): verify that $\sigma'(z) = \sigma(z) (1-\sigma(z))$

The definition of the sigmoid function is: $\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$.

The derivative of the denominator $1 + e^{-z}$ is $-e^{-z}$, so by the chain rule we have $\sigma'(z) = -(-e^{-z}) (1 + e^{-z})^{-2} = \frac{e^{-z}}{(1 + e^{-z})^2}$.

And $\sigma(z)(1 - \sigma(z)) = \frac{1}{1 + e^{-z}} \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2}$

And so we have $\sigma'(z) = \sigma(z) (1-\sigma(z))$.
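As a quick sanity check of the identity, the following Python sketch compares the closed form with a central finite difference at a few points. The helper names (`sigmoid_prime`, `central_difference`), the step size and the sample points are illustrative choices, not part of the exercise.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed form derived above: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-6):
    """Numerical derivative of f at z (the step size h is an arbitrary choice)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Largest gap between the numerical and the closed-form derivative.
max_error = max(
    abs(central_difference(sigmoid, z) - sigmoid_prime(z))
    for z in [-5.0, -1.0, 0.0, 0.5, 3.0]
)
print(max_error)
```

The printed gap should be tiny compared with the derivative itself, which is what the algebra above predicts.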

### Exercise 2 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercises_824189)): the roles of $y$s and $a$s in the cross-entropy cost function

The correct cross-entropy cost function is $-[y \ln a + (1-y) \ln (1-a)]$.

The (incorrect) similar expression $-[a \ln y + (1-a) \ln (1-y)]$ isn't defined when $y = 0$ or $y = 1$, because $\ln(x)$ isn't defined when $x = 0$.

This is an issue because $y = 0$ or $y = 1$ can clearly happen, as $y$ is the correct output (if the expected answer is "yes", we would ideally like the network to output exactly 1; in this case we would have $y = 1$).

With the correct definition, we might think that the same problem arises when $a = 0$ or $a = 1$. However, this never happens with the sigmoid activation function: $a = \sigma(z)$, and whatever the weighted input $z$ of a neuron, we always have $0 < \sigma(z) < 1$ by definition of $\sigma$.
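The difference can be illustrated with a short Python sketch. The function names and the value $a = 0.8$ are assumptions made for the example; the point is only that the swapped expression requires $\ln 0$ as soon as $y$ is exactly $0$ or $1$.

```python
import math

def cross_entropy(a, y):
    """Correct form: -[y ln a + (1 - y) ln(1 - a)], with 0 < a < 1."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def swapped_cross_entropy(a, y):
    """Incorrect form, with the roles of a and y exchanged."""
    return -(a * math.log(y) + (1.0 - a) * math.log(1.0 - y))

a = 0.8  # a sigmoid output, strictly between 0 and 1 (illustrative value)
correct_cost = cross_entropy(a, 1.0)  # well defined: equals -ln(0.8)

try:
    swapped_cross_entropy(a, 1.0)  # needs ln(1 - y) = ln(0)
    swapped_failed = False
except ValueError:
    # math.log raises ValueError outside its domain
    swapped_failed = True

print(correct_cost, swapped_failed)
```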

### Exercise 3: show that the cross-entropy function is still a good cost function when $0 < y < 1$.

Namely, we need to show that the cross-entropy cost function $C(a) = - (y \ln a + (1 - y) \ln (1 - a))$ is minimized when $a = y$.

Let's differentiate $C$:

$C'(a) = - \frac y a + \frac{1-y}{1-a}$.

We look for a local extremum by solving $C'(a) = 0$:

\begin{equation*}
    \begin{aligned}
        - \frac y a + \frac{1-y}{1-a} = 0 &\iff \frac y a = \frac{1-y}{1-a} \\
        &\iff y - ay = a - ay \\
        &\iff a = y
    \end{aligned}
\end{equation*}

We have a unique extremum at $a = y$. To determine whether it is a minimum or a maximum, we compute the second derivative:

$C''(a) = \frac{y}{a^2} + \frac{1 - y}{(1 - a)^2}$.

Since we have supposed $0 < y < 1$, and since $0 < a < 1$ always holds, both terms are strictly positive, so $C''(a) > 0$ for all $0 < a < 1$ (the function is strictly convex). Therefore, the unique extremum at $a = y$ is a global minimum: the cost is minimized when $a = y$.
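A brute-force check of this result, as a minimal sketch: we evaluate the cost on a regular grid of values of $a$ and look for the smallest one. The grid resolution and the tested values of $y$ are arbitrary assumptions.

```python
import math

def cross_entropy(a, y):
    """Cost for a single output: -[y ln a + (1 - y) ln(1 - a)]."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def argmin_on_grid(y, steps=999):
    """Return the grid point a in (0, 1) with the smallest cost for target y."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.001, ..., 0.999
    return min(grid, key=lambda a: cross_entropy(a, y))

for y in (0.1, 0.3, 0.75):
    print(y, argmin_on_grid(y))
```

For each tested $y$, the grid point with the smallest cost should coincide with $y$ itself, up to floating-point rounding.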

### Problem 1 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#problems_382219)): Many-layer multi-neuron networks

For a single training example $x$, we have for the quadratic cost function:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \qquad \text{(BP1)}
    \end{aligned}
\end{equation*}

The cost function for a single training example is $C = \frac 1 2 \sum_i (a_i^L - y_i)^2$, so we have $\frac{\partial C}{\partial a_j^L} = a_j^L - y_j$. This gives us:

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j)$$

And taking all training examples into account,

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum\limits_x a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j)$$

Now with the cross-entropy cost function, let's first compute $\delta^L$ for a single training example $x$: for all neurons $j$ in the $L$th layer,

\begin{equation*}
    \begin{aligned}
        \delta_j^L &= \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \qquad \text{(BP1)} \\
        &= - \left( \frac{y_j}{a_j^L} - \frac{1 - y_j}{1 - a_j^L} \right) \sigma'(z_j^L) \\
        &= - \left( \frac{y_j}{\sigma(z_j^L)} - \frac{1 - y_j}{1 - \sigma(z_j^L)} \right) \sigma(z_j^L) \left( 1 - \sigma(z_j^L) \right) \\
        &= - \left( y_j (1 - \sigma(z_j^L)) - (1 - y_j) \sigma(z_j^L) \right) \\
        &= \sigma(z_j^L) - y_j \\
        &= a_j^L - y_j
    \end{aligned}
\end{equation*}

And so $\delta^L = a^L - y$.

Let's incorporate it into our previous calculation.

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} (a_j^L - y_j)
    \end{aligned}
\end{equation*}

And taking all training examples into account,

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum\limits_x a^{L-1}_k  (a^L_j-y_j)$$

For the biases, everything is the same except that instead of using BP4 ($\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1} \delta_j^L$), we use BP3 ($\frac{\partial C}{\partial b_j^L} = \delta_j^L$) and so we don't have the $a_k^{L-1}$ part.
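To make the effect of the missing $\sigma'(z_j^L)$ factor concrete, here is a small Python sketch for the simplest possible case: one sigmoid output neuron fed by a single activation. The numeric values are hypothetical, chosen so that the neuron is saturated and badly wrong; both formulas are checked against a finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, b, a_prev):
    """Activation of the single output neuron."""
    return sigmoid(w * a_prev + b)

def quadratic_cost(w, b, a_prev, y):
    return 0.5 * (output(w, b, a_prev) - y) ** 2

def cross_entropy_cost(w, b, a_prev, y):
    a = output(w, b, a_prev)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def numerical_dw(cost, w, b, a_prev, y, h=1e-6):
    """Central finite difference of the cost with respect to w."""
    return (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2.0 * h)

# Hypothetical values: z = 4, so the output is close to 1 while y = 0.
w, b, a_prev, y = 2.0, 2.0, 1.0, 0.0
a = output(w, b, a_prev)

quadratic_formula = a_prev * (a - y) * a * (1.0 - a)  # includes sigma'(z)
cross_entropy_formula = a_prev * (a - y)              # sigma'(z) has cancelled

print(quadratic_formula, numerical_dw(quadratic_cost, w, b, a_prev, y))
print(cross_entropy_formula, numerical_dw(cross_entropy_cost, w, b, a_prev, y))
```

In this saturated configuration the quadratic gradient is much smaller than the cross-entropy one, which is the learning slowdown the cross-entropy is meant to avoid.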

### Problem 2: using the quadratic cost when we have linear neurons in the output layer

We use the quadratic cost function and the activation function $f: x \rightarrow x$ in the last layer. We have for a single training example $x$ and for all neurons $j$ in the $L$th layer:

$$\delta_j^L = \frac{\partial C}{\partial a_j^L} f'(z_j^L) \qquad \text{(BP1)}$$

The cost function for a single training example is $C = \frac 1 2 \sum_i (a_i^L - y_i)^2$, so we have $\frac{\partial C}{\partial a_j^L} = a_j^L - y_j$. And $\forall x \in \mathbb{R}, f'(x) = 1$.

So $\delta_j^L = a_j^L - y_j$. In vector form:

$$\delta^L = a^L - y$$

Applying BP4 gives us, for a single training example:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial w_{jk}^L} &= a_k^{L-1} \delta_j^L \qquad \text{(BP4)} \\
        &= a_k^{L-1} (a_j^L - y_j)
    \end{aligned}
\end{equation*}

For the biases we apply BP3:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b_j^L} &= \delta_j^L \qquad \text{(BP3)} \\
        &= (a_j^L - y_j)
    \end{aligned}
\end{equation*}

And taking all training examples into account:

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k  (a^L_j-y_j)$$

And

$$\frac{\partial C}{\partial b^L_{j}} = \frac{1}{n} \sum_x (a^L_j-y_j)$$
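These two formulas can be checked numerically. The sketch below assumes the simplest setting, a single linear output neuron whose incoming activation is the input $x$ itself, and uses a made-up data set; it compares the formulas with central finite differences of the averaged quadratic cost.

```python
def cost(w, b, data):
    """Quadratic cost averaged over the data, with a linear output a = w*x + b."""
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in data) / len(data)

def analytic_gradient(w, b, data):
    """(dC/dw, dC/db) from the formulas above; here a^{L-1} is the input x."""
    n = len(data)
    dw = sum(x * (w * x + b - y) for x, y in data) / n
    db = sum((w * x + b - y) for x, y in data) / n
    return dw, db

def numerical_gradient(w, b, data, h=1e-6):
    """Central finite differences of the cost with respect to w and b."""
    dw = (cost(w + h, b, data) - cost(w - h, b, data)) / (2.0 * h)
    db = (cost(w, b + h, data) - cost(w, b - h, data)) / (2.0 * h)
    return dw, db

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 4.0)]  # made-up (x, y) pairs
w, b = 0.5, -0.2                             # arbitrary parameters

print(analytic_gradient(w, b, data))
print(numerical_gradient(w, b, data))
```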

## Using the cross-entropy to classify MNIST digits

## What does the cross-entropy mean? Where does it come from?

### Problem 3 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#problem_507295)): why it's not possible to eliminate the $x_j$ term through a clever choice of cost function

The derivation of equation (61) began like this, using the chain rule:

$$\frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial a} \frac{\partial a}{\partial w_j}$$

Since $a = \sigma(\sum_i w_i x_i + b)$, we have $\frac{\partial a}{\partial w_j} = x_j \sigma'(\sum_i w_i x_i + b) = x_j \sigma'(z)$.

Using the cross-entropy cost function, we managed to make $\frac{\partial C}{\partial a}$ take the form $\frac{\text{something}}{\sigma'(z)}$, which cancels the $\sigma'(z)$ term in $\frac{\partial C}{\partial w_j}$.

Now we would like to make it take the form $\frac{\text{something}}{x_j}$. The problem is that whatever cost function $C$ we choose, it can only depend on the network output $a$ (and the expected output $y$), so $\frac{\partial C}{\partial a}$ is a function of $a$ and $y$ alone. The individual inputs $x_j$ cannot be recovered from $a$: infinitely many choices of the $\{x_j\}$ lead to the same weighted input $z$ and hence the same output $a$. Therefore no choice of $C$ can produce a $\frac{1}{x_j}$ factor that cancels $x_j$ for every $j$.
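The argument can be illustrated with a tiny Python sketch. The weights and the two input vectors are hypothetical; they are chosen so that both inputs give the same weighted input $z$, and therefore the same output $a$ and the same cost, while the gradients with respect to the weights differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Weighted input z and activation a of a single sigmoid neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)

w, b, y = [1.0, 1.0], 0.0, 1.0
x_first, x_second = [1.0, 2.0], [3.0, 0.0]  # different inputs, same z = 3

z1, a1 = forward(w, b, x_first)
z2, a2 = forward(w, b, x_second)

# Cross-entropy gradient dC/dw_j = x_j * (a - y): same (a - y), different x_j.
grad_first = [x_j * (a1 - y) for x_j in x_first]
grad_second = [x_j * (a2 - y) for x_j in x_second]

print(a1 == a2, grad_first, grad_second)
```

Since the cost sees only $a$ and $y$, it cannot tell the two inputs apart, yet the $x_j$ factor in the gradient is different in each case.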

## Softmax

### Exercise 4 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercise_332838)): construct an example showing that with a sigmoid output layer, the output activations won't always sum to 1

Consider a 2-layer network made of a single input neuron and a single output neuron, with a weight $w$ (a scalar) and bias $b$. Its input (given by the input neuron) is $x$.

Now whatever the input $x$, the weighted input will be $z = wx + b$ and the output $a = \sigma(z) < 1$ since $\forall x \in \mathbb{R}, \sigma(x) < 1$. So the sum of the output activations, being equal to our unique output activation, won't be 1 (we can also construct examples where the sum of the output activations is more than 1).

By contrast, had we used the softmax output layer in this case, we would have had:

\begin{equation*}
    \begin{aligned}
        \sum_j a_j &= \frac{\sum_j e^{z_j}}{\sum_k e^{z_k}} \\
        &= \frac{e^z}{e^z} \\
        &= 1
    \end{aligned}
\end{equation*}
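Both claims are easy to observe numerically. In the following sketch the weighted inputs are arbitrary illustrative values; it shows a sigmoid layer whose activations sum to more than 1, the single-neuron case where the sum stays below 1, and a softmax layer whose activations sum to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax of a list of weighted inputs."""
    m = max(zs)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

zs = [2.0, -1.0, 0.5]  # arbitrary weighted inputs for a 3-neuron output layer
sigmoid_sum = sum(sigmoid(z) for z in zs)
softmax_sum = sum(softmax(zs))
single_neuron_sum = sigmoid(10.0)  # the one-output example: always below 1

print(sigmoid_sum, softmax_sum, single_neuron_sum)
```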

### Exercise 5 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#exercises_193619)): monotonicity of softmax

For easier differentiation, let's reformulate $a_j^L$:

\begin{equation*}
    \begin{aligned}
        a_j^L &= \frac{e^{z_j^L}}{e^{z_j^L} + \sum\limits_{k \neq j} e^{z_k^L}} \\
        &= \frac{1}{1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L}} \\
        &= \left( 1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) ^{-1}
    \end{aligned}
\end{equation*}

Now

\begin{equation*}
    \begin{aligned}
        \frac{\partial a_j^L}{\partial z_j^L} &= - \frac{\partial \left( 1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right)}{\partial z_j^L} \left( 1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) ^{-2} \\
        &= \left( e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) \left( 1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) ^{-2} \\
        &= \left( e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) \left( a_j^L \right) ^2 \\
        &> 0
    \end{aligned}
\end{equation*}

And for $k \neq j$,

\begin{equation*}
    \begin{aligned}
        \frac{\partial a_j^L}{\partial z_k^L} &= - \frac{\partial \left( 1 + e^{-z_j^L} \sum\limits_{i \neq j} e^{z_i^L} \right)}{\partial z_k^L} \left( 1 + e^{-z_j^L} \sum\limits_{i \neq j} e^{z_i^L} \right) ^{-2} \\
        &= - \left( e^{-z_j^L} e^{z_k^L} \right) \left( 1 + e^{-z_j^L} \sum\limits_{i \neq j} e^{z_i^L} \right) ^{-2} \\
        &= - \left( e^{-z_j^L} e^{z_k^L} \right) \left( a_j^L \right) ^2 \\
        &< 0
    \end{aligned}
\end{equation*}
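The signs of these partial derivatives can be checked with a finite-difference Jacobian. The sketch below uses arbitrary weighted inputs and also compares each entry with the compact forms $a_j^L (1 - a_j^L)$ for the diagonal and $- a_j^L a_k^L$ off the diagonal, which are algebraically equivalent to the derivatives of the softmax function.

```python
import math

def softmax(zs):
    m = max(zs)  # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def numerical_jacobian(zs, h=1e-6):
    """jac[j][k] approximates d a_j / d z_k by central differences."""
    n = len(zs)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up = softmax([z + h if i == k else z for i, z in enumerate(zs)])
        down = softmax([z - h if i == k else z for i, z in enumerate(zs)])
        for j in range(n):
            jac[j][k] = (up[j] - down[j]) / (2.0 * h)
    return jac

zs = [1.0, -0.5, 2.0]  # arbitrary weighted inputs
a = softmax(zs)
jac = numerical_jacobian(zs)

for row in jac:
    print(row)
```

Diagonal entries should come out positive and all other entries negative.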

### Exercise 6: non-locality of softmax

We just showed that for $k \neq j$, $\frac{\partial a_j^L}{\partial z_k^L} \neq 0$. This shows that $a_j^L$ depends on all weighted inputs $z_k^L$, not just $z_j^L$.

But the previous derivation wasn't actually necessary to see that. Recall that with a softmax output layer, $a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$. Because of the sum in the denominator, we see directly that changing the value of any $z_k^L$ will change the value of $a_j^L$.

### Problem 4 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#problem_905066)): Inverting the softmax layer

By definition:

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$

So:

$$z_j^L = \ln \left( a_j^L \right) + \ln \left( \sum_k e^{z_k^L} \right)$$

Setting $C = \ln \left( \sum_k e^{z_k^L} \right)$, which does not depend on $j$, we obtain $z_j^L = \ln a_j^L + C$.
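A minimal numerical illustration, with arbitrary weighted inputs: computing $z_j^L - \ln a_j^L$ for each $j$ should give the same constant every time, equal to the logarithm of the sum of exponentials.

```python
import math

def softmax(zs):
    m = max(zs)  # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

zs = [0.3, -1.2, 2.5]  # arbitrary weighted inputs
a = softmax(zs)

# One candidate value of the constant C per output neuron j.
constants = [z - math.log(a_j) for z, a_j in zip(zs, a)]
log_sum_exp = math.log(sum(math.exp(z) for z in zs))

print(constants, log_sum_exp)
```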

### Problem 5 ([link](http://neuralnetworksanddeeplearning.com/chap3.html#problems_919607)): derive equations (81) and (82)

Let's first derive equation (81): $\frac{\partial C}{\partial b_j^L} = a_j^L - y_j$.

For clarity, let's call $y$ the expected output vector, made only of $0$s and a single $1$, and $\tilde{y}$ the index such that $y_{\tilde{y}} = 1$ (in the MNIST example, the correct digit).

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b_j^L} &= \delta_j^L \qquad \text{(BP3)} \\
        &= \frac{\partial C}{\partial z_j^L} \qquad \text{by definition of } \delta_j^L \\
        &= \sum\limits_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L} \qquad \text{(chain rule)} \\
        &= \frac{\partial C}{\partial a_{\tilde{y}}^L} \frac{\partial a_{\tilde{y}}^L}{\partial z_j^L} \qquad \text{as } C \text{ only depends on } a_{\tilde{y}}^L \\
        &= - \frac{1}{a_{\tilde{y}}^L} \frac{\partial a_{\tilde{y}}^L}{\partial z_j^L} \qquad \text{as } C = - \ln a_{\tilde{y}}^L
    \end{aligned}
\end{equation*}

At this point, we must treat 2 cases separately, using the expressions derived in Exercise 5.

* If $j = \tilde{y}$:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b_j^L} &= - \frac{1}{a_j^L} \left( e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) \left( a_j^L \right) ^2 \qquad \text{using the first expression from Exercise 5} \\
        &= - a_j^L \left( 1 +  e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} - 1 \right) \\
        &= - a_j^L \left( \frac{1}{a_j^L} - 1 \right) \qquad \text{recalling from Exercise 5 that } a_j^L = \left( 1 + e^{-z_j^L} \sum\limits_{k \neq j} e^{z_k^L} \right) ^{-1} \\
        &= a_j^L - 1 \\
        &= a_j^L - y_j \qquad \text{since } y_j = y_{\tilde{y}} = 1
    \end{aligned}
\end{equation*}

* If $j \neq \tilde{y}$:

\begin{equation*}
    \begin{aligned}
        \frac{\partial C}{\partial b_j^L} &= - \frac{1}{a_{\tilde{y}}^L} \left(- e^{-z_{\tilde{y}}^L} e^{z_j^L} \right) \left( a_{\tilde{y}}^L \right) ^2 \qquad \text{using the second expression from Exercise 5} \\
        &= a_{\tilde{y}}^L e^{-z_{\tilde{y}}^L} e^{z_j^L} \\
        &= \frac{e^{z_{\tilde{y}}^L}}{\sum_k e^{z_k^L}} e^{-z_{\tilde{y}}^L} e^{z_j^L} \qquad \text{by definition of } a_{\tilde{y}}^L \\
        &= \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} \\
        &= a_j^L \\
        &= a_j^L - y_j \qquad \text{since } j \neq \tilde{y} \text{ and so } y_j = 0
    \end{aligned}
\end{equation*}

We have proven Equation (81): $\frac{\partial C}{\partial b_j^L} = a_j^L - y_j$.

The proof for equation (82) is exactly the same, except that instead of starting with $\frac{\partial C}{\partial b_j^L} = \delta_j^L$ using BP3, it starts with $\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1} \delta_j^L$ using BP4.
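The result $\delta_j^L = a_j^L - y_j$ can be verified numerically. The sketch below assumes the log-likelihood cost $C = - \ln a_{\tilde{y}}^L$ used in the derivation, with arbitrary weighted inputs and an arbitrary target index; it differentiates the cost with respect to each $z_j^L$ by central differences.

```python
import math

def softmax(zs):
    m = max(zs)  # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(zs, target):
    """C = -ln a_target for a softmax output layer."""
    return -math.log(softmax(zs)[target])

def numerical_delta(zs, target, h=1e-6):
    """Central-difference estimate of dC/dz_j for every output neuron j."""
    delta = []
    for j in range(len(zs)):
        up = [z + h if i == j else z for i, z in enumerate(zs)]
        down = [z - h if i == j else z for i, z in enumerate(zs)]
        diff = log_likelihood_cost(up, target) - log_likelihood_cost(down, target)
        delta.append(diff / (2.0 * h))
    return delta

zs = [0.2, 1.5, -0.7]  # arbitrary weighted inputs
target = 1             # index of the correct class (illustrative)
a = softmax(zs)
y = [1.0 if j == target else 0.0 for j in range(len(zs))]
analytic_delta = [a_j - y_j for a_j, y_j in zip(a, y)]

print(analytic_delta)
print(numerical_delta(zs, target))
```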

### Problem 6: explanation of the "softmax" name

Let's consider $a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}$ with $c > 0$.

* We still have $a_j^L > 0$ for all $j$, since the exponential function is always positive
* The sum of the outputs of all neurons is still 1: $\sum\limits_j a_j^L = \frac{\sum\limits_j e^{c z^L_j}}{\sum\limits_k e^{c z^L_k}} = 1$.

Therefore, the output activations still form a probability distribution.

What about the limit as $c \rightarrow + \infty$?

Let's write the expression slightly differently:

$$a_j^L = \frac{1}{1 + e^{-c z_j^L} \sum\limits_{k \neq j} e^{c z_k^L}} = \frac{1}{1 + \sum\limits_{k \neq j} e^{c( z_k^L - z_j^L)}}$$

* If $z_j^L$ is not the maximum weighted input (there exists $m$ such that $z_m^L > z_j^L$), then we will have:

$$\lim_{c \to + \infty} e^{c(z_m^L - z_j^L)} = + \infty $$

And since all other terms in the sum are positive:

$$\lim_{c \to + \infty} \sum\limits_{k \neq j} e^{c(z_k^L - z_j^L)} = + \infty $$

And therefore:

$$\lim_{c \to + \infty} a_j^L = 0$$

* If $z_j^L$ is one of the $n \geq 1$ maximal weighted inputs, we will have:
  * for $k$ such that $z_k^L$ is another maximal weighted input, $\lim_{c \to + \infty} e^{c(z_k^L - z_j^L)} = 1$ since $z_k^L - z_j^L = 0$;
  * for $k$ such that $z_k^L$ is not one of the maximal weighted inputs, $\lim_{c \to + \infty} e^{c(z_k^L - z_j^L)} = 0$ since $z_k^L - z_j^L < 0$;

Therefore,

$$\lim_{c \to + \infty} \sum\limits_{k \neq j} e^{c(z_k^L - z_j^L)} = n - 1$$

And:

$$\lim_{c \to + \infty} a_j^L = \frac 1 n$$

(in particular, $\lim_{c \to + \infty} a_j^L = 1$ if $z_j^L$ is the unique maximum).

More succinctly,

\begin{eqnarray}
   \lim_{c \to + \infty} a_j^L = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } z_j^L \mbox{ is not a maximum weighted input} \\
      \frac 1 n & \mbox{if } z_j^L \mbox{ is one of } n \mbox{ maximal weighted inputs}
    \end{array}
  \right.
\end{eqnarray}

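In the limit $c \rightarrow + \infty$, the function therefore behaves like a "hard" maximum: only the maximal weighted inputs get a non-zero output. By contrast, when $c = 1$ we still put more weight on the bigger values because of the exponential function, but we take all of them into account, not just the maximal ones. The function is a softened version of the maximum, hence the "softmax" name.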
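The limiting behaviour can be observed with a short Python sketch. The weighted inputs are illustrative and contain two tied maxima, so the limit derived above predicts outputs of $\frac 1 2$ for those two neurons and $0$ for the others; the function name `scaled_softmax` is an assumption of this example.

```python
import math

def scaled_softmax(zs, c):
    """Softmax of c * z for c > 0, shifted by the max for numerical stability."""
    m = max(zs)
    exps = [math.exp(c * (z - m)) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

zs = [1.0, 3.0, 3.0, 2.0]  # two tied maxima, so n = 2
for c in (1.0, 10.0, 100.0):
    print(c, scaled_softmax(zs, c))
```

With $c = 1$ every neuron keeps a visible share of the total, while for large $c$ the output concentrates on the two maximal inputs.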