In [None]:
import numpy as np

## Incipit: Back Propagation Algorithm

Let's say we have a neural network with multiple layers, and we want to compute the gradients of the loss function with respect to the weights and biases in each layer. Here's how we can do this:

1. **Forward pass**: We start by feeding the input data through the network to compute the output. During this step, we also store the intermediate values (i.e., the inputs and outputs of each layer) in memory, as we'll need them later during the backward pass.

2. **Compute the loss**: Once we have the output of the network, we can compute the loss function. The loss function is typically some measure of the difference between the predicted output and the actual output.

3. **Backward pass**: This is where we compute the gradients of the loss with respect to the weights and biases in each layer. We start by computing the gradient of the loss with respect to the output of the final layer (i.e., the output layer). This is done using the chain rule of calculus:

```dL/dy = dL/do * do/dy```

Here, `y` is the weighted input (pre-activation) of the final layer and `o` is the predicted output obtained by applying the activation function to `y`. `dL/dy` is the gradient of the loss with respect to `y`, `dL/do` is the gradient of the loss with respect to the predicted output `o`, and `do/dy` is the derivative of the activation function used in the final layer.

4. **Backpropagation through the layers**: Once we have the gradient of the loss with respect to the output of a layer, we can use it to compute the gradient of the loss with respect to that layer's input, which is the output of the previous layer. This is done using the chain rule again:

```dL/dx = dL/dy * dy/dx```

Here, `dL/dx` is the gradient of the loss with respect to the input `x` of the current layer (the output of the previous layer), `dL/dy` is the gradient of the loss with respect to the output `y` of the current layer, and `dy/dx` is the derivative of the current layer's output with respect to its input, which combines the layer's weights with the derivative of its activation function.

5. **Compute the gradients with respect to the weights and biases**: Finally, we can use the gradients computed in step 4 to compute the gradients of the loss with respect to the weights and biases in each layer. This is also done using the chain rule:

```dL/dw = dL/dx * dx/dw```

```dL/db = dL/dx * dx/db```


Here, `dL/dw` is the gradient of the loss with respect to the weights `w`, `dL/db` is the gradient of the loss with respect to the bias `b`, `dL/dx` is the gradient of the loss with respect to `x` (the output of the layer that owns `w` and `b`, which is the input to the next layer), and `dx/dw` and `dx/db` are the derivatives of `x` with respect to that layer's weights and biases, respectively.

These steps are repeated for each layer in the network, starting from the final layer and working backwards to the input layer. Once we have computed the gradients of the loss with respect to the weights and biases in each layer, we can use them to update the parameters using a gradient descent optimization algorithm.
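The five steps above can be written compactly in matrix form. The notation below is an assumption introduced for illustration and is not used elsewhere in this notebook: layer $l$ computes $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and $a^{(l)} = f(z^{(l)})$, where $a^{(0)}$ is the input, $N$ is the index of the final layer, $\delta^{(l)}$ is the gradient of the loss with respect to $z^{(l)}$, and $\odot$ is the element-wise product.

$$
\begin{aligned}
\delta^{(N)} &= \frac{\partial L}{\partial a^{(N)}} \odot f'\!\left(z^{(N)}\right) \\
\delta^{(l)} &= \left( \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \right) \odot f'\!\left(z^{(l)}\right), \quad l = N-1, \dots, 1 \\
\frac{\partial L}{\partial W^{(l)}} &= \delta^{(l)} \left(a^{(l-1)}\right)^{\top} \\
\frac{\partial L}{\partial b^{(l)}} &= \delta^{(l)}
\end{aligned}
$$

The first line corresponds to step 3, the second to step 4, and the last two to step 5.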

Let's consider a simple neural network with one hidden layer, one input feature, and one output. The network has the following architecture:

* Input layer: 1 node
* Hidden layer: 2 nodes, with weights `[w1, w2]` and biases `[b1, b2]`
* Output layer: 1 node, with a single weight `w3` (shared by the connections from both hidden nodes) and bias `b3`

The activation function is the **sigmoid function**.

We will use **mean squared error** (MSE) as the loss function.

Suppose we have one training example with input `x=0.5` and target output `y=0.8`. Let's initialize the weights and biases with some arbitrary values:

In [None]:
x = 0.5
y = 0.8

w1 = 0.2
w2 = -0.3
b1 = 0.4
b2 = -0.5
w3 = 0.1
b3 = 0.2

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

To compute the forward pass, we first compute the hidden layer activations:

In [None]:
z1 = w1 * x + b1
a1 = sigmoid(z1)

z2 = w2 * x + b2
a2 = sigmoid(z2)

Then we compute the output layer activation:

In [None]:
z3 = w3 * a1 + w3 * a2 + b3
y_pred = sigmoid(z3)

The loss for this single example is the squared error, scaled by 0.5 so that its derivative carries no factor of 2:

In [None]:
L = 0.5 * (y - y_pred)**2

To compute the gradient of the loss function with respect to the network parameters, we first compute the derivative of the loss function with respect to the predicted output:

In [None]:
dL_dy_pred = y_pred - y

Then we can compute the gradients of the output layer weights and bias:

In [None]:
dz3_dw3 = a1 + a2
dL_dw3 = dL_dy_pred * sigmoid_derivative(z3) * dz3_dw3

dz3_db3 = 1
dL_db3 = dL_dy_pred * sigmoid_derivative(z3) * dz3_db3

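Next, we propagate the gradient back through the output layer to the hidden layer. The starting point is the gradient with respect to `z3`, not `dL_dw3`, which already contains the factor `a1 + a2`:

In [None]:
# Gradient w.r.t. the output pre-activation z3
dL_dz3 = dL_dy_pred * sigmoid_derivative(z3)

# These include sigmoid'(z1) and sigmoid'(z2), so they are gradients w.r.t. z1 and z2
dz3_da1 = w3
dL_da1 = dL_dz3 * dz3_da1 * sigmoid_derivative(z1)

dz3_da2 = w3
dL_da2 = dL_dz3 * dz3_da2 * sigmoid_derivative(z2)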

Then, we can compute the gradients of the hidden layer weights and biases:

In [None]:
dz1_dw1 = x
dL_dw1 = dL_da1 * dz1_dw1

dz1_db1 = 1
dL_db1 = dL_da1 * dz1_db1

dz2_dw2 = x
dL_dw2 = dL_da2 * dz2_dw2

dz2_db2 = 1
dL_db2 = dL_da2 * dz2_db2

Finally, we update the network parameters using the computed gradients and a learning rate:

In [None]:
learning_rate = 0.1

w1 -= learning_rate * dL_dw1
b1 -= learning_rate * dL_db1

w2 -= learning_rate * dL_dw2
b2 -= learning_rate * dL_db2

w3 -= learning_rate * dL_dw3
b3 -= learning_rate * dL_db3

We can repeat this process for multiple training examples to train the neural network.
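A common way to check hand-derived gradients like these is to compare them against central finite differences. The sketch below is self-contained and rebuilds the same 1-2-1 network; the helper names (`loss`, `analytic_grad`, `numeric_grad`) and the step size `eps` are choices made for this illustration, not part of the notebook above. The analytic formulas propagate the gradient with respect to `z3` back to the hidden layer.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(params, x, y):
    """Forward pass of the 1-2-1 network; returns 0.5 * squared error."""
    w1, b1, w2, b2, w3, b3 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * x + b2)
    y_pred = sigmoid(w3 * a1 + w3 * a2 + b3)
    return 0.5 * (y - y_pred) ** 2

def analytic_grad(params, x, y):
    """Gradients from the chain rule, ordered like params."""
    w1, b1, w2, b2, w3, b3 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * x + b2)
    y_pred = sigmoid(w3 * (a1 + a2) + b3)
    dL_dz3 = (y_pred - y) * y_pred * (1 - y_pred)  # sigmoid'(z) = a * (1 - a)
    dL_dz1 = dL_dz3 * w3 * a1 * (1 - a1)
    dL_dz2 = dL_dz3 * w3 * a2 * (1 - a2)
    return np.array([dL_dz1 * x, dL_dz1,
                     dL_dz2 * x, dL_dz2,
                     dL_dz3 * (a1 + a2), dL_dz3])

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (loss(params + step, x, y) - loss(params - step, x, y)) / (2 * eps)
    return grad

# Same values as in the worked example, ordered (w1, b1, w2, b2, w3, b3)
params = np.array([0.2, 0.4, -0.3, -0.5, 0.1, 0.2])
g_analytic = analytic_grad(params, 0.5, 0.8)
g_numeric = numeric_grad(params, 0.5, 0.8)
max_abs_diff = np.max(np.abs(g_analytic - g_numeric))
print("max |analytic - numeric| =", max_abs_diff)
```

If the two gradients disagree by much more than the finite-difference error, the analytic derivation most likely has a mistake in one of its chain-rule factors.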

## The effect of not having an activation function

Here's an example of a neural network without an activation function, trained to separate two classes of 2D points. Because every layer is linear, the decision boundary it learns can only be a straight line:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the input data and labels
np.random.seed(0)
X_inner = np.random.randn(100, 2) * 0.5
X_outer = np.random.randn(100, 2) * 1.5 + np.array([0, 3])
X = np.concatenate((X_inner, X_outer))
y = np.concatenate((np.zeros(100), np.ones(100)))

# Define the neural network architecture
input_dim = 2
hidden_dim = 16
output_dim = 1
lr = 0.0001
epochs = 1000

# Initialize the weights and biases
W1 = np.random.randn(input_dim, hidden_dim)
b1 = np.zeros((1, hidden_dim))
W2 = np.random.randn(hidden_dim, output_dim)
b2 = np.zeros((1, output_dim))

# Train the neural network
for i in range(epochs):
    # Forward pass
    z1 = np.dot(X, W1) + b1
    z2 = np.dot(z1, W2) + b2
    y_pred = z2

    # Compute the error
    error = y.reshape(-1, 1) - y_pred

    # Backward pass
    dW2 = np.dot(z1.T, error)
    db2 = np.sum(error, axis=0, keepdims=True)
    dW1 = np.dot(X.T, np.dot(error, W2.T))
    db1 = np.sum(np.dot(error, W2.T), axis=0, keepdims=True)

    # Update the weights and biases (added because `error = y - y_pred` already carries the minus sign of the gradient)
    W1 += lr * dW1
    b1 += lr * db1
    W2 += lr * dW2
    b2 += lr * db2

# Evaluate the neural network on a grid of points
xx, yy = np.meshgrid(np.linspace(-4, 4, 1000), np.linspace(-4, 7, 1000))
X_grid = np.c_[xx.ravel(), yy.ravel()]
z1_grid = np.dot(X_grid, W1) + b1
z2_grid = np.dot(z1_grid, W2) + b2
y_pred = np.zeros(z2_grid.shape)
y_pred[z2_grid<0.5]=0
y_pred[z2_grid>=0.5]=1
y_pred_grid = y_pred.reshape(xx.shape)

# Plot the input data and the decision boundary
plt.figure(figsize=(8, 6),dpi=300)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr')
plt.contourf(xx, yy, y_pred_grid, levels=[0, 0.5, 1], alpha=0.2, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Neural network coordinate transformation without activation function')
plt.show()


In this example, we define the input data `X` and the labels `y`. We randomly generate 100 points from a Gaussian centered at the origin with standard deviation 0.5 (class 0), and 100 points from a Gaussian with standard deviation 1.5 shifted by 3 along the y axis (class 1).

We then define a neural network without an activation function, with a single hidden layer with 16 neurons. We initialize the weights and biases randomly and train the network using backpropagation.

In the forward pass, we compute the output of the network as a linear combination of the input features, without applying an activation function. In the backward pass, we compute the gradients of the loss function with respect to the weights and biases using the chain rule.

We evaluate the neural network on a grid of points and plot the result.
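To see why the boundary can only be a straight line, note that two linear layers without an activation collapse into a single linear layer. The following sketch checks this numerically; the variable names (`Wa`, `ba`, `Wb`, `bb`, `W_eff`, `b_eff`) and the random values are made up for this illustration and are independent of the trained network above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with the same shapes as the network above (2 -> 16 -> 1)
Wa = rng.standard_normal((2, 16))
ba = rng.standard_normal((1, 16))
Wb = rng.standard_normal((16, 1))
bb = rng.standard_normal((1, 1))

X_demo = rng.standard_normal((5, 2))

# Output of the two-layer network without activation
two_layer = (X_demo @ Wa + ba) @ Wb + bb

# Equivalent single linear layer
W_eff = Wa @ Wb          # shape (2, 1)
b_eff = ba @ Wb + bb     # shape (1, 1)
one_layer = X_demo @ W_eff + b_eff

collapses = np.allclose(two_layer, one_layer)
print("Two linear layers equal one linear layer:", collapses)
```

However many hidden units are used, the model stays equivalent to `X @ W_eff + b_eff`, so thresholding its output always yields a linear boundary.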

## Applying an activation function

Activation functions play a crucial role in neural networks as they introduce nonlinearity into the model, allowing it to learn complex and nonlinear relationships between the input and output data. Without activation functions, a neural network would simply be a series of linear transformations, which can only model linear relationships between the input and output data.

Activation functions allow neural networks to learn complex decision boundaries and patterns in the input data. They are responsible for determining whether a neuron in the network should be activated or not, based on its input. This activation or non-activation of a neuron is then propagated through the network, allowing it to learn more complex representations of the input data.

Different activation functions have different properties, and choosing the right one for a given task can have a significant impact on the performance of a neural network. For example, the sigmoid function is commonly used in the output layer for binary classification tasks, while the ReLU function is widely used in the hidden layers of deep learning models because it often speeds up training and mitigates the vanishing gradient problem.
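As a small illustration of how these two functions differ, the sketch below evaluates the derivatives of sigmoid and ReLU at a few points. The `relu` and `relu_derivative` helpers are not defined elsewhere in this notebook, and setting the ReLU derivative to 0 at exactly 0 is a convention assumed here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def relu(t):
    return np.maximum(0.0, t)

def relu_derivative(t):
    # 1 where the input is positive, 0 elsewhere (0 at t == 0 by convention)
    return (np.asarray(t) > 0).astype(float)

t = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("sigmoid'(t):", sigmoid_derivative(t))
print("relu'(t):   ", relu_derivative(t))
```

The sigmoid derivative peaks at 0.25 (at `t = 0`) and decays toward 0 on both sides, while the ReLU derivative stays at 1 for every positive input.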

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the input data and labels
np.random.seed(0)
X_inner = np.random.randn(100, 2) * 0.5
X_outer = np.random.randn(100, 2) * 1.5 + np.array([0, 3])
X = np.concatenate((X_inner, X_outer))
y = np.concatenate((np.zeros(100), np.ones(100)))

# Define the neural network architecture
input_dim = 2
hidden_dim = 16
output_dim = 1
lr = 0.01
epochs = 10000

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Initialize the weights and biases
W1 = np.random.randn(input_dim, hidden_dim)
b1 = np.zeros((1, hidden_dim))
W2 = np.random.randn(hidden_dim, output_dim)
b2 = np.zeros((1, output_dim))

# Train the neural network
for i in range(epochs):
    # Forward pass
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    y_pred = sigmoid(z2)

    # Compute the error
    error = y.reshape(-1, 1) - y_pred

    # Backward pass
    delta2 = error * sigmoid_derivative(z2)
    dW2 = np.dot(a1.T, delta2)
    db2 = np.sum(delta2, axis=0, keepdims=True)
    delta1 = np.dot(delta2, W2.T) * sigmoid_derivative(z1)
    dW1 = np.dot(X.T, delta1)
    db1 = np.sum(delta1, axis=0, keepdims=True)

    # Update the weights and biases (added because `error = y - y_pred` already carries the minus sign of the gradient)
    W1 += lr * dW1
    b1 += lr * db1
    W2 += lr * dW2
    b2 += lr * db2

# Evaluate the neural network on a grid of points
xx, yy = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 7, 100))
X_grid = np.c_[xx.ravel(), yy.ravel()]
z1_grid = np.dot(X_grid, W1) + b1
a1_grid = sigmoid(z1_grid)
z2_grid = np.dot(a1_grid, W2) + b2
y_pred_grid = sigmoid(z2_grid)
y_pred_grid = y_pred_grid.reshape(xx.shape)

# Plot the input data and the decision boundary
plt.figure(figsize=(8, 6),dpi=300)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr')
plt.contourf(xx, yy, y_pred_grid, levels=[0, 0.5, 1], alpha=0.2, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Neural network coordinate transformation')
plt.show()


## Exercise

1. What happens when you change the number of epochs? (Try decreasing it to 10.)
2. What happens when you change the learning rate?
3. What happens if you change the hidden layer dimension?

### Advanced exercise

1. Divide the training process into batches.
2. Record the training history (i.e., compute and store the loss at the end of each epoch).
3. How does the loss change when you change the batch size?

## Appendix: The vanishing gradient problem

The vanishing gradient problem is a common issue in deep neural networks that can occur during the training process. It refers to the phenomenon where the gradients (i.e., the derivatives of the loss function with respect to the model parameters) become very small as they propagate backward through the network, towards the earlier layers.

This happens because of the way the gradients are computed and propagated through the layers of the network. By the chain rule, the gradient that reaches an early layer is a product with one factor per later layer, each made of that layer's weights and the derivative of its activation function. The sigmoid derivative is never larger than 0.25, so in a network with many sigmoid layers this product tends to shrink exponentially with depth, which leads to very small or negligible updates to the earlier layers.

The vanishing gradient problem can make it difficult for the network to learn meaningful representations of the input data, especially for deeper networks. It can also make the training process slower and less stable, as the network is not able to make meaningful updates to its parameters.

To address the vanishing gradient problem, several activation functions have been proposed that are more robust to this issue, such as the rectified linear unit (ReLU) and its variants. Additionally, other techniques such as weight initialization, batch normalization, and skip connections have also been proposed to alleviate the vanishing gradient problem and improve the training of deep neural networks.
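The shrinking effect can be seen in a deliberately simplified setting: a chain of scalar sigmoid layers, one unit per layer. The weight value `w = 1.0`, the input `0.5`, and the depth of 10 layers are assumptions chosen only to make the illustration deterministic; real networks have different weights in every layer.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

n_layers = 10
w = 1.0            # assumed weight, identical in every layer (illustrative only)
activation = 0.5   # arbitrary input value

grad = 1.0         # d(current activation) / d(input), accumulated by the chain rule
grad_history = []
for _ in range(n_layers):
    activation = sigmoid(w * activation)
    # Local derivative of this layer: w * sigmoid'(z) = w * a * (1 - a)
    grad *= w * activation * (1.0 - activation)
    grad_history.append(grad)

for depth, g in enumerate(grad_history, start=1):
    print(f"after {depth:2d} layers: gradient factor = {g:.3e}")
```

Each layer multiplies the gradient by a factor below 0.25 in this setup, so after 10 layers the factor is smaller than `0.25 ** 10`, which is roughly one millionth.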