### Proof of BP3

Equation BP3 gives the partial derivative of the cost function with respect to any weight in the network:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$

#### Derivation:

1. The change in the cost $C$ due to a change in weight $w^l_{jk}$ can be expressed through the chain rule as:
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}}$$

2. Breaking down each component:
   - $\frac{\partial C}{\partial a^l_j}$ represents the sensitivity of the cost to the activation of the $j^{th}$ neuron in the $l^{th}$ layer.
   - $\frac{\partial a^l_j}{\partial z^l_j}$ is the derivative of the activation function evaluated at $z^l_j$, indicating how the activation changes with the neuron's weighted input.
   - $\frac{\partial z^l_j}{\partial w^l_{jk}}$ equals $a^{l-1}_k$ because $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, and $w^l_{jk}$ appears only in the term that multiplies $a^{l-1}_k$.

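3. The first two factors together are the error term $\delta^l_j \equiv \frac{\partial C}{\partial z^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}$. Substituting it, along with $\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k$, into the chain rule gives the BP3 formula: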
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$

This shows that the gradient of the cost with respect to a weight is the product of two quantities: the activation of the neuron in the previous layer that feeds into the weight, and the error term of the neuron in the current layer that the weight feeds.
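
As a quick numerical sanity check of BP3, the sketch below compares the outer product of $\delta^l$ and $a^{l-1}$ against central finite differences on a single layer. The quadratic cost, the sigmoid activation, the layer sizes, and the helper name `layer_cost` are assumptions chosen for illustration; the derivation itself does not depend on them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_cost(w, b, a_prev, y):
    """Assumed quadratic cost C = 0.5 * ||sigmoid(w a_prev + b) - y||^2 for one layer."""
    a = sigmoid(w @ a_prev + b)
    return 0.5 * float(np.sum((a - y) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 2))       # w[j, k]: weight from neuron k (layer l-1) to neuron j (layer l)
b = rng.standard_normal((3, 1))
a_prev = rng.standard_normal((2, 1))  # a^{l-1}
y = rng.standard_normal((3, 1))

# Error term: delta_j = dC/da_j * sigma'(z_j), with sigma'(z) = a * (1 - a) for the sigmoid
a = sigmoid(w @ a_prev + b)
delta = (a - y) * a * (1.0 - a)

# BP3 for every (j, k) at once: the outer product of delta and a^{l-1}
grad_w_bp3 = delta @ a_prev.T

# Central finite differences, one weight at a time
eps = 1e-6
grad_w_fd = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j, k] += eps
        w_minus[j, k] -= eps
        c_plus = layer_cost(w_plus, b, a_prev, y)
        c_minus = layer_cost(w_minus, b, a_prev, y)
        grad_w_fd[j, k] = (c_plus - c_minus) / (2 * eps)

max_abs_diff_w = float(np.max(np.abs(grad_w_bp3 - grad_w_fd)))
print("largest BP3 vs finite-difference gap:", max_abs_diff_w)
```

The printed gap should be tiny compared with the gradient entries themselves, which is what BP3 predicts.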

---

### Proof of BP4

Equation BP4 details the derivative of the cost function with respect to the biases:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$

#### Derivation:

1. Applying the chain rule in the same way to the bias $b^l_j$ gives:
$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j}$$

2. Here:
   - $\frac{\partial C}{\partial a^l_j}$ and $\frac{\partial a^l_j}{\partial z^l_j}$ are the same as in BP3.
   - $\frac{\partial z^l_j}{\partial b^l_j}$ equals 1 because $b^l_j$ enters $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ additively with coefficient 1.

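3. The first two factors multiply to the error term $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ and the third factor is 1, so BP4 follows: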
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$

This equation indicates that the gradient of the cost with respect to a bias is exactly the error term of the neuron that the bias belongs to.
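
A one-neuron example makes BP4 concrete. The sketch below assumes a sigmoid activation and a quadratic cost, with arbitrary illustrative values for `w`, `b`, `a_prev`, and `y`. It checks that the finite-difference bias gradient matches $\delta$, and that the weight gradient is the same $\delta$ scaled by the incoming activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    """Assumed quadratic cost for a single sigmoid neuron."""
    a = sigmoid(w * a_prev + b)
    return 0.5 * (a - y) ** 2

# Arbitrary illustrative values, not taken from the text
w, b = 0.6, -0.4
a_prev, y = 1.5, 1.0

# Error term: delta = dC/da * sigma'(z)
a = sigmoid(w * a_prev + b)
delta = (a - y) * a * (1.0 - a)

# Central finite differences with respect to the bias and the weight
eps = 1e-6
grad_b_fd = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)
grad_w_fd = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)

print("BP4 gap:", abs(grad_b_fd - delta))
print("BP3 gap:", abs(grad_w_fd - a_prev * delta))
```

Both gaps should be close to zero. The only difference between the two gradients is the factor `a_prev`, which is the extra $\frac{\partial z^l_j}{\partial w^l_{jk}}$ term from the BP3 derivation.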


import numpy as np

class SimpleNeuralNetwork:
    def __init__(self, sizes):
        """Initialize the neural network with random weights and biases.

        Args:
            sizes (list): The sizes of the layers. For example, [2, 3, 1] represents
                          a network with 2 inputs, a hidden layer with 3 neurons,
                          and an output layer with 1 neuron.
        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network given input a."""
        for b, w in zip(self.biases, self.weights):
            a = self.sigmoid(np.dot(w, a) + b)
        return a

    def sigmoid(self, z):
        """The sigmoid function."""
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(self, z):
        """Derivative of the sigmoid function."""
        s = self.sigmoid(z)  # evaluate the sigmoid once and reuse it
        return s * (1 - s)

    def backprop(self, x, y):
        """Return a tuple `(nabla_b, nabla_w)` representing the gradient for the cost function.
        `x` is the input data, `y` is the desired output.
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Forward pass
        activation = x
        activations = [x]  # list to store all the activations, layer by layer
        zs = []  # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = self.sigmoid(z)
            activations.append(activation)

        # Backward pass
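        # Error term at the output layer: dC/da times the activation derivative
        delta = self.cost_derivative(activations[-1], y) * \
                self.sigmoid_prime(zs[-1])
        # BP4: the bias gradient is the error term itself
        nabla_b[-1] = delta
        # BP3: the weight gradient is the outer product of delta and the previous activations
        nabla_w[-1] = np.dot(delta, activations[-2].T)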

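        # Propagate the error backward. Here l = 2 is the second-to-last layer,
        # l = 3 the third-to-last, and so on (negative indices count from the end).
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = self.sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].T, delta) * sp
            nabla_b[-l] = delta  # BP4
            nabla_w[-l] = np.dot(delta, activations[-l-1].T)  # BP3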

        return (nabla_b, nabla_w)

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x / \partial a for the output activations.

        The expression below corresponds to the quadratic cost C_x = 0.5 * ||a - y||^2.
        """
        return (output_activations - y)
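
One way to gain confidence in a backpropagation routine like the one above is a gradient check against central finite differences. The sketch below is self-contained, so it restates the backward pass in function form rather than importing the class. The function names `backprop`, `cost`, and `max_gradient_error`, the quadratic cost, and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Assumed quadratic cost 0.5 * ||a_L - y||^2 for a single example."""
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(w @ a + b)
    return 0.5 * float(np.sum((a - y) ** 2))

def backprop(weights, biases, x, y):
    """Function-style restatement of the backward pass (quadratic cost assumed)."""
    activations, zs = [x], []
    for b, w in zip(biases, weights):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta                        # BP4
    nabla_w[-1] = delta @ activations[-2].T    # BP3
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta                          # BP4
        nabla_w[-l] = delta @ activations[-l - 1].T  # BP3
    return nabla_b, nabla_w

def max_gradient_error(weights, biases, x, y, eps=1e-6):
    """Largest absolute gap between backprop and central finite differences."""
    nabla_b, nabla_w = backprop(weights, biases, x, y)
    worst = 0.0
    for params, grads in ((weights, nabla_w), (biases, nabla_b)):
        for p, g in zip(params, grads):
            for idx in np.ndindex(p.shape):
                original = p[idx]
                p[idx] = original + eps
                c_plus = cost(weights, biases, x, y)
                p[idx] = original - eps
                c_minus = cost(weights, biases, x, y)
                p[idx] = original  # restore the parameter before moving on
                fd = (c_plus - c_minus) / (2 * eps)
                worst = max(worst, abs(fd - float(g[idx])))
    return worst

rng = np.random.default_rng(0)
sizes = [2, 3, 1]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((2, 1))
y = np.array([[1.0]])

gradient_error = max_gradient_error(weights, biases, x, y)
print("largest backprop vs finite-difference gap:", gradient_error)
```

A gap many orders of magnitude smaller than the gradient entries suggests that the BP3 and BP4 lines are wired correctly. A large gap usually points to a transposed matrix or an off-by-one layer index.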
