# Neural Networks: Forward and Backward Propagation

This short tutorial focuses on forward and backward propagation, key processes in training neural networks. We will also compare how these processes occur in different network architectures, specifically feedforward and recurrent neural networks, and demonstrate the forward and backward passes with a simple feedforward network.

## Core Concepts: Forward and Backward Propagation

### Forward Propagation

Forward propagation is the process of calculating the output of a neural network given its inputs and current weights and biases. It involves passing the input data through each layer of the network, from the input layer to the output layer.

Here's a simplified breakdown:

1.  **Input Layer:** The input data is fed into the input layer.
2.  **Hidden Layers (if any):** For each neuron in a hidden layer, a weighted sum of the inputs from the previous layer is calculated, and an activation function is applied to this sum. The output of this activation function becomes the input for the next layer.
3.  **Output Layer:** The same process of weighted sum and activation is applied to the final layer to produce the network's output.
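
In symbols, the per-layer computation in steps 2 and 3 can be written as follows. The notation is introduced here only for illustration: $a^{(l-1)}$ is the output of the previous layer (with $a^{(0)}$ the input data), $W^{(l)}$ and $b^{(l)}$ are the weights and biases of layer $l$, and $f$ is the activation function.

$$ z^{(l)} = a^{(l-1)} W^{(l)} + b^{(l)}, \qquad a^{(l)} = f\left(z^{(l)}\right) $$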

The purpose of forward propagation is to generate a prediction from the neural network based on the current state of its parameters. This output is then compared to the target value to calculate the error. Target values come from the ground truth data, that is, the actual label for each data point; by convention, training data pairs each data point with its label. See the tutorial on preparing data for neural networks.

### Backward Propagation

Backpropagation (short for "backward propagation of errors") is the key algorithm used to train neural networks. Its primary purpose is to calculate the gradients of the network's loss function with respect to each weight and bias. These gradients indicate how much the loss function will change with a small change in each parameter.

Here's a simplified breakdown of what happens during training:

1.  **Error Calculation:** After forward propagation, the output of the network is compared to the expected output (the ground truth), and an error (or loss) is calculated.
2.  **Gradient Calculation (Backward Pass):** Backpropagation starts from the output layer and moves backward through the network towards the input layer. At each layer, it calculates the contribution of each weight and bias to the total error using the chain rule of calculus. This process determines how much each parameter needs to be adjusted to reduce the error.
3.  **Parameter Update:** Once the gradients are calculated, an optimization algorithm (like gradient descent) uses these gradients to update the weights and biases. The parameters are adjusted in a direction that minimizes the loss function.
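
For a network with one hidden layer, the chain rule in step 2 can be written out explicitly. The notation below is an assumption made for illustration: $X$ is the input batch, $h = f(z_1)$ with $z_1 = X W_1 + b_1$ are the hidden activations, $\hat{y} = f(z_2)$ with $z_2 = h W_2 + b_2$ is the prediction, $L$ is the loss, and $\odot$ denotes element-wise multiplication.

$$ \delta_2 = \frac{\partial L}{\partial \hat{y}} \odot f'(z_2), \qquad \frac{\partial L}{\partial W_2} = h^{\top} \delta_2 $$

$$ \delta_1 = \left(\delta_2 W_2^{\top}\right) \odot f'(z_1), \qquad \frac{\partial L}{\partial W_1} = X^{\top} \delta_1 $$

The bias gradients are the column sums of $\delta_2$ and $\delta_1$ over the batch. Note that $\delta_1$ reuses $\delta_2$; this reuse is why the computation runs backward, one layer at a time.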

Backpropagation essentially tells the network how to change its internal parameters to make better predictions in the future.

### Relationship Between Forward and Backward Propagation

Forward propagation and backpropagation are two essential, sequential processes in the training of a neural network:

1.  **Forward Propagation:** This is the **initial step**. It takes the input data and pushes it through the network to generate an output (prediction). This process uses the current weights and biases to compute the output and then calculates the error between the predicted output and the actual target value.
2.  **Backward Propagation:** This is the **subsequent step**. It uses the error calculated during forward propagation to determine how much each weight and bias in the network contributed to that error. By calculating gradients, backpropagation provides the necessary information for the optimization algorithm to adjust the network's parameters.

In essence, forward propagation provides the outcome and the measure of its inaccuracy (the error), while backpropagation provides the necessary information (the gradients) to *correct* that inaccuracy by updating the network's parameters. Forward propagation feeds information forward to get an output, and backpropagation flows backward to adjust the parameters based on the output's error. This cycle of forward and backward passes is repeated many times during training until the network's parameters are optimized and the error is minimized.

## Forward and Backward Passes in Different Architectures

The core principles of forward and backward propagation apply to various neural network architectures, but their implementation and the flow of information can differ.

### Feedforward Neural Networks (FNNs)

In a feedforward network, the connections between neurons only flow in one direction—from the input layer through the hidden layers (if any) to the output layer. There are no cycles or loops.

*   **Forward Pass:** The input data travels directly through the network, layer by layer, with each neuron's output calculated based on the weighted sum of its inputs from the *previous* layer. This is a straightforward, one-way computation.
*   **Backward Pass:** The error is calculated at the output layer and then propagated backward through the network. Gradients are calculated for each layer based on the gradients of the *subsequent* layer and the activation values from the forward pass. The chain rule is applied layer by layer from the output back to the input.

### Recurrent Neural Networks (RNNs)

Recurrent neural networks are designed to handle sequential data. They have connections that loop back on themselves, allowing information to persist and influence the processing of subsequent inputs. This introduces a time dimension to the network's behavior.

*   **Forward Pass:** The forward pass in an RNN involves processing the input sequence one step at a time. At each time step, the network receives the current input and the hidden state from the *previous* time step. The hidden state is updated based on the current input and the previous hidden state, and an output is produced. This means the forward pass is not just a single pass but a sequence of passes over time.
*   **Backward Pass:** Training RNNs involves a technique called **Backpropagation Through Time (BPTT)**. BPTT essentially unfolds the recurrent network over the entire sequence length, treating it as a deep feedforward network in which each time step is a layer and all layers share the same weights. The error, calculated at the end of the sequence or accumulated across time steps depending on the task, is then backpropagated through each time step, going backward in time. This process can be computationally expensive and, for long sequences, can suffer from vanishing or exploding gradients.
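
The RNN forward pass described above can be sketched as a loop over time steps. This is a minimal illustration, assuming a vanilla RNN with a `tanh` activation; the function name `rnn_forward` and the weight names `W_xh`, `W_hh`, and `b_h` are hypothetical and do not appear elsewhere in this tutorial.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla (tanh) RNN over a sequence and return every hidden state.

    inputs has shape (T, input_size): one row per time step.
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)              # initial hidden state
    hidden_states = []
    for x_t in inputs:                     # one step per sequence element
        # The new state depends on the current input AND the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 3, 4
inputs = rng.normal(size=(T, input_size))
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because the same `W_xh` and `W_hh` are reused at every step, BPTT has to sum each weight's gradient contributions across all time steps.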

### Key Differences in Forward/Backward Pass due to Architecture

| Feature          | Feedforward Neural Network                      | Recurrent Neural Network (RNN)                    |
| :--------------- | :---------------------------------------------- | :------------------------------------------------ |
| **Forward Pass** | Straightforward, one-way computation layer by layer | Sequential processing over time, uses previous hidden state |
| **Backward Pass**| Backpropagation layer by layer                  | Backpropagation Through Time (BPTT), unfolds network over time |
| **Information Flow** | Unidirectional                                  | Includes loops and cycles, allows information persistence |
| **Gradient Calculation** | Standard backpropagation                        | BPTT, can face vanishing/exploding gradient issues |
| **Suitable for** | Independent data points                         | Sequential data (time series, text)               |

Understanding these differences is crucial for choosing the appropriate network architecture for a given task and for implementing the training process correctly.

## Demonstrating Forward and Backward Propagation (Feedforward Network)

Here, we will use a simple feedforward neural network to illustrate the forward and backward passes computationally.

In [41]:
import numpy as np

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, given x = sigmoid(z) (the activation, not the pre-activation z)."""
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """Initializes the neural network with random weights and biases."""
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize weights and biases
        self.weights_input_hidden = np.random.rand(self.input_size, self.hidden_size) * 0.01
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.weights_hidden_output = np.random.rand(self.hidden_size, self.output_size) * 0.01
        self.bias_output = np.zeros((1, self.output_size))

    def forward(self, X):
        """Performs the forward pass through the network."""
        # Input to hidden layer
        self.hidden_layer_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_layer_output = sigmoid(self.hidden_layer_input)

        # Hidden to output layer
        self.output_layer_input = np.dot(self.hidden_layer_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = sigmoid(self.output_layer_input)

        return self.predicted_output

    def backward(self, X, y, predicted_output):
        """Performs the backward pass through the network."""
        # Gradient of the MSE loss with respect to the predicted output
        output_error = 2 * (predicted_output - y) / y.shape[0]

        # Calculate the delta (error signal) at the output layer
        output_delta = output_error * sigmoid_derivative(predicted_output)

        # Calculate the gradients for the weights and biases between the hidden and output layers
        gradients_weights_hidden_output = np.dot(self.hidden_layer_output.T, output_delta)
        gradients_bias_output = np.sum(output_delta, axis=0, keepdims=True)

        # Calculate the error signal at the hidden layer
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)

        # Calculate the delta (error signal) at the hidden layer
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_layer_output)

        # Calculate the gradients for the weights and biases between the input and hidden layers
        gradients_weights_input_hidden = np.dot(X.T, hidden_delta)
        gradients_bias_hidden = np.sum(hidden_delta, axis=0, keepdims=True)

        # Return the calculated gradients
        return {
            'weights_input_hidden': gradients_weights_input_hidden,
            'bias_hidden': gradients_bias_hidden,
            'weights_hidden_output': gradients_weights_hidden_output,
            'bias_output': gradients_bias_output
        }
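
One detail of the cell above is easy to miss: `sigmoid_derivative` is written in terms of the sigmoid's *output*, which is why `backward()` passes it activations such as `predicted_output` rather than pre-activation values. The following sketch checks this against a finite-difference estimate; the test values in `z` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # Expects a = sigmoid(z), i.e. the activation, not z itself
    return a * (1 - a)

z = np.array([-2.0, 0.0, 1.5])
a = sigmoid(z)

# Central finite-difference estimate of d(sigmoid)/dz
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(sigmoid_derivative(a), numeric))  # True
print(np.allclose(sigmoid_derivative(z), numeric))  # False
```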

### Implementing Forward Propagation

In [42]:
# 1. Define input data X and target output y
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[0],
              [1],
              [1],
              [0]])

# 2. Instantiate the NeuralNetwork class
# Input size is the number of features in X
input_size = X.shape[1]
# Output size is the number of target columns in y
output_size = y.shape[1]
# Choose a hidden layer size
hidden_size = 4

nn = NeuralNetwork(input_size, hidden_size, output_size)

# 3. Call the forward() method
predicted_output = nn.forward(X)

# 4. Print the input data and the predicted output
print("Input data (X):")
print(X)
print("\nPredicted output after forward pass:")
print(predicted_output)

Input data (X):
[[0 0]
 [0 1]
 [1 0]
 [1 1]]

Predicted output after forward pass:
[[0.50173088]
 [0.50173437]
 [0.50173677]
 [0.50174026]]


### Calculating the Error

In [43]:
# 1. Define the Mean Squared Error (MSE) loss function
def mean_squared_error(y_true, y_pred):
    """Calculates the Mean Squared Error between true and predicted values."""
    return np.mean((y_true - y_pred)**2)

# 2. Calculate the error between the predicted_output and the true target values y
error = mean_squared_error(y, predicted_output)

# 3. Print the calculated error
print("\nTrue target values (y):")
print(y)
print("\nCalculated error (MSE):")
print(error)


True target values (y):
[[0]
 [1]
 [1]
 [0]]

Calculated error (MSE):
0.2500030121868324
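
The loss is close to 0.25 because, with initial weights this small, every prediction sits near `sigmoid(0) = 0.5`, so each squared error is roughly `0.5 ** 2`. A quick check, using the prediction values copied from the printed output above:

```python
import numpy as np

# Values copied from the printed forward-pass output
y = np.array([[0], [1], [1], [0]])
predicted_output = np.array([[0.50173088],
                             [0.50173437],
                             [0.50173677],
                             [0.50174026]])

squared_errors = (y - predicted_output) ** 2
print(squared_errors.ravel())   # each entry is close to 0.25
print(squared_errors.mean())    # approximately 0.250003
```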


### Implementing Backward Propagation

In [44]:
# Call the backward() method to calculate gradients
gradients = nn.backward(X, y, predicted_output)

# Print the calculated gradients
print("\nGradients (Output of Backward Pass):")
print("Gradients weights hidden to output:")
print(gradients['weights_hidden_output'])
print("\nGradients bias output:")
print(gradients['bias_output'])
print("\nGradients weights input to hidden:")
print(gradients['weights_input_hidden'])
print("\nGradients bias hidden:")
print(gradients['bias_hidden'])


Gradients (Output of Backward Pass):
Gradients weights hidden to output:
[[0.00043509]
 [0.00043578]
 [0.00043503]
 [0.000434  ]]

Gradients bias output:
[[0.00086777]]

Gradients weights input to hidden:
[[6.06738546e-07 5.10538267e-07 1.89077514e-08 3.63114503e-07]
 [6.04552463e-07 5.10296569e-07 1.89399583e-08 3.62859972e-07]]

Gradients bias hidden:
[[1.21235786e-06 1.02512665e-06 3.78728578e-08 7.25015605e-07]]
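
A common way to sanity-check gradients like these is to compare them with finite-difference estimates of the loss. The sketch below is self-contained and uses a compact, functional re-implementation of the same one-hidden-layer sigmoid network; the `params` dictionary, the helper names, and the normally distributed initial weights are assumptions of this example, not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(params, X):
    """Forward pass of a one-hidden-layer sigmoid network."""
    h = sigmoid(X @ params["W1"] + params["b1"])
    y_hat = sigmoid(h @ params["W2"] + params["b2"])
    return h, y_hat

def mse(params, X, y):
    return np.mean((forward(params, X)[1] - y) ** 2)

def backward(params, X, y):
    """Analytic MSE gradients, using the same formulas as backward() above."""
    h, y_hat = forward(params, X)
    output_delta = 2 * (y_hat - y) / y.shape[0] * y_hat * (1 - y_hat)
    hidden_delta = (output_delta @ params["W2"].T) * h * (1 - h)
    return {
        "W1": X.T @ hidden_delta,
        "b1": hidden_delta.sum(axis=0, keepdims=True),
        "W2": h.T @ output_delta,
        "b2": output_delta.sum(axis=0, keepdims=True),
    }

def numerical_gradient(params, X, y, name, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(params[name].shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        loss_plus = mse(params, X, y)
        params[name][idx] = original - eps
        loss_minus = mse(params, X, y)
        params[name][idx] = original       # restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params = {
    "W1": rng.normal(size=(2, 4)), "b1": np.zeros((1, 4)),
    "W2": rng.normal(size=(4, 1)), "b2": np.zeros((1, 1)),
}

analytic = backward(params, X, y)
for name in params:
    numeric = numerical_gradient(params, X, y, name)
    # The largest absolute difference should be tiny if backward() is right
    print(name, np.max(np.abs(analytic[name] - numeric)))
```

If the analytic and numerical gradients disagree by more than a small tolerance, the backward pass most likely contains a bug.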


## Showing the Difference: Forward vs. Backward Pass Outputs

Let's compare the outputs of the forward and backward passes to highlight their distinct roles.

### Forward Pass Output

The output of the forward pass is the network's prediction for the given input data. In our example, the predicted output for the input `X` was:

In [45]:
print(predicted_output)

[[0.50173088]
 [0.50173437]
 [0.50173677]
 [0.50174026]]


This output is a set of values, one for each input sample, representing the network's attempt to match the target output.

### Backward Pass Output (Gradients)

The output of the backward pass is a set of gradients. These gradients quantify how much the loss function changes with respect to each weight and bias in the network.

In our example, the calculated gradients were:

In [46]:
print("Gradients weights hidden to output:")
print(gradients['weights_hidden_output'])
print("\nGradients bias output:")
print(gradients['bias_output'])
print("\nGradients weights input to hidden:")
print(gradients['weights_input_hidden'])
print("\nGradients bias hidden:")
print(gradients['bias_hidden'])

Gradients weights hidden to output:
[[0.00043509]
 [0.00043578]
 [0.00043503]
 [0.000434  ]]

Gradients bias output:
[[0.00086777]]

Gradients weights input to hidden:
[[6.06738546e-07 5.10538267e-07 1.89077514e-08 3.63114503e-07]
 [6.04552463e-07 5.10296569e-07 1.89399583e-08 3.62859972e-07]]

Gradients bias hidden:
[[1.21235786e-06 1.02512665e-06 3.78728578e-08 7.25015605e-07]]


These gradients are not predictions but rather indicators of how the network's parameters should be adjusted to reduce the error.

### Key Differences in Outputs

| Feature          | Forward Pass Output                       | Backward Pass Output (Gradients)                  |
| :--------------- | :---------------------------------------- | :------------------------------------------------ |
| **Nature**       | Network's prediction                      | Sensitivity of the loss to each parameter         |
| **Purpose**      | Generate an output based on current parameters | Inform parameter updates to reduce error          |
| **Represents**   | The network's current performance        | How parameters need to change for improvement     |

## Summary

Forward propagation is the process of generating a prediction by passing data through the network, while backward propagation is the process of calculating gradients to understand how to adjust the network's parameters to improve those predictions. Both are essential steps in the training of neural networks, with backpropagation being crucial for enabling the network to learn from its errors. The specific implementation of these passes can vary depending on the neural network architecture, as seen when comparing feedforward networks to recurrent neural networks.

## Updating Weights and Biases with Gradients

After calculating the gradients during the backward pass, the next crucial step in training a neural network is to use these gradients to update the network's weights and biases. This process is guided by an optimization algorithm, the most common of which is Gradient Descent.

### Gradient Descent

Gradient Descent is an iterative optimization algorithm used to find a minimum of a function. In the context of neural networks, the function we want to minimize is the loss function. The gradients calculated during backpropagation tell us the direction of steepest increase in the loss function. To minimize the loss, we move the parameters in the opposite direction, that of the negative gradient.

The update rule for a parameter (weight or bias) using Gradient Descent is:

$$ \text{parameter} = \text{parameter} - \text{learning\_rate} \times \text{gradient} $$

Where:
- $\text{parameter}$ is the current value of the weight or bias.
- $\text{learning\_rate}$ is a hyperparameter that controls the step size of the update. A smaller learning rate leads to slower but potentially more stable convergence, while a larger learning rate can speed up convergence but may overshoot the minimum and fail to converge.
- $\text{gradient}$ is the gradient of the loss function with respect to the parameter, calculated during backpropagation.
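
To make the effect of the learning rate concrete, here is a small sketch that applies the update rule to a one-dimensional toy function, $f(w) = (w - 3)^2$, whose minimum is at $w = 3$. The function, the starting point, and the three learning rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - learning_rate * gradient   # the update rule above
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
# 0.01: still well short of 3 after 50 steps (too slow)
# 0.1:  very close to 3
# 1.1:  far from 3, because every step overshoots and the error grows
```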

### Implementing Parameter Updates

We can now add a method to our `NeuralNetwork` class to update the weights and biases using the calculated gradients and a specified learning rate.

In [51]:
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """Initializes the neural network with random weights and biases."""
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize weights and biases
        self.weights_input_hidden = np.random.rand(self.input_size, self.hidden_size) * 0.01
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.weights_hidden_output = np.random.rand(self.hidden_size, self.output_size) * 0.01
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        """Derivative of the sigmoid, given x = sigmoid(z) (the activation, not the pre-activation z)."""
        return x * (1 - x)

    def forward(self, X):
        """Performs the forward pass through the network."""
        # Input to hidden layer
        self.hidden_layer_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_layer_output = self.sigmoid(self.hidden_layer_input)

        # Hidden to output layer
        self.output_layer_input = np.dot(self.hidden_layer_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_layer_input)

        return self.predicted_output

    def backward(self, X, y, predicted_output):
        """Performs the backward pass through the network."""
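        # Calculate the error at the output layer: the difference between the
        # predicted output and the true output (ground truth). This is the gradient of
        # 0.5 * sum((predicted - y)**2); it omits the 2 / n factor of the MSE gradient
        # used in the first version of this class, which only rescales the gradients.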
        output_error = predicted_output - y
        # We then multiply by the derivative of the activation function at the output layer
        output_delta = output_error * self.sigmoid_derivative(predicted_output)

        # Calculate the gradients for the weights and biases between the hidden and output layers
        # Gradients for weights_hidden_output: dot product of hidden layer output transpose and output_delta
        gradients_weights_hidden_output = np.dot(self.hidden_layer_output.T, output_delta)
        # Gradients for bias_output: sum of output_delta along the rows
        gradients_bias_output = np.sum(output_delta, axis=0, keepdims=True)

        # Calculate the error signal at the hidden layer
        # This error is propagated backward from the output layer
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)

        # Calculate the delta (error signal) at the hidden layer
        # Multiply the hidden error by the derivative of the activation function at the hidden layer
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_layer_output)

        # Calculate the gradients for the weights and biases between the input and hidden layers
        # Gradients for weights_input_hidden: dot product of input transpose and hidden_delta
        gradients_weights_input_hidden = np.dot(X.T, hidden_delta)
        # Gradients for bias_hidden: sum of hidden_delta along the rows
        gradients_bias_hidden = np.sum(hidden_delta, axis=0, keepdims=True)

        # Return the calculated gradients
        return {
            'weights_input_hidden': gradients_weights_input_hidden,
            'bias_hidden': gradients_bias_hidden,
            'weights_hidden_output': gradients_weights_hidden_output,
            'bias_output': gradients_bias_output
        }

    def update_parameters(self, gradients, learning_rate):
        """Updates the weights and biases using the calculated gradients and learning rate."""
        self.weights_input_hidden -= learning_rate * gradients['weights_input_hidden']
        self.bias_hidden -= learning_rate * gradients['bias_hidden']
        self.weights_hidden_output -= learning_rate * gradients['weights_hidden_output']
        self.bias_output -= learning_rate * gradients['bias_output']

### Demonstrating Parameter Update

Now, let's demonstrate how to use the calculated gradients to update the network's parameters. We'll use a learning rate of 0.1.

In [50]:
# Re-instantiate the NeuralNetwork class after updating its definition
nn = NeuralNetwork(input_size, hidden_size, output_size)

# Perform a forward pass to get the predicted output and calculate gradients with the new instance
predicted_output = nn.forward(X)
gradients = nn.backward(X, y, predicted_output)

# Now, call the update_parameters() method
learning_rate = 0.1
nn.update_parameters(gradients, learning_rate)

print("\nWeights and biases after update:")
print("Weights input to hidden:\n", nn.weights_input_hidden)
print("\nBias hidden:\n", nn.bias_hidden)
print("\nWeights hidden to output:\n", nn.weights_hidden_output)
print("\nBias output:\n", nn.bias_output)


Weights and biases after update:
Weights input to hidden:
 [[0.00425942 0.0019169  0.00885266 0.00940727]
 [0.00893047 0.0027246  0.00314774 0.00466143]]

Bias hidden:
 [[-1.28307422e-07 -2.29543836e-07 -2.50043725e-07 -1.40688294e-07]]

Weights hidden to output:
 [[0.00368034]
 [0.00663345]
 [0.00723572]
 [0.00404265]]

Bias output:
 [[-0.00013702]]


In [49]:
# Define a learning rate
learning_rate = 0.1

# Snapshot the weights and biases as they are before this update
initial_weights_input_hidden = nn.weights_input_hidden.copy()
initial_bias_hidden = nn.bias_hidden.copy()
initial_weights_hidden_output = nn.weights_hidden_output.copy()
initial_bias_output = nn.bias_output.copy()

print("Initial weights and biases:")
print("Weights input to hidden:\n", initial_weights_input_hidden)
print("\nBias hidden:\n", initial_bias_hidden)
print("\nWeights hidden to output:\n", initial_weights_hidden_output)
print("\nBias output:\n", initial_bias_output)

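# Update the parameters using the gradients computed in the previous cell.
# Note: those gradients are not recomputed for the current parameters, so this
# applies the same step a second time (the bias values below exactly double).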
nn.update_parameters(gradients, learning_rate)

print("\nWeights and biases after update:")
print("Weights input to hidden:\n", nn.weights_input_hidden)
print("\nBias hidden:\n", nn.bias_hidden)
print("\nWeights hidden to output:\n", nn.weights_hidden_output)
print("\nBias output:\n", nn.bias_output)

Initial weights and biases:
Weights input to hidden:
 [[0.00518564 0.00100597 0.00760695 0.0052026 ]
 [0.00158254 0.0084511  0.00807895 0.00784457]]

Bias hidden:
 [[-2.04296483e-07 -2.31571760e-07 -1.02931199e-07 -2.82286884e-08]]

Weights hidden to output:
 [[0.00679525]
 [0.0077105 ]
 [0.00339879]
 [0.00088819]]

Bias output:
 [[-0.00011924]]

Weights and biases after update:
Weights input to hidden:
 [[0.00518554 0.00100586 0.0076069  0.00520259]
 [0.00158244 0.00845098 0.0080789  0.00784455]]

Bias hidden:
 [[-4.08592966e-07 -4.63143521e-07 -2.05862398e-07 -5.64573769e-08]]

Weights hidden to output:
 [[0.00673553]
 [0.00765074]
 [0.00333894]
 [0.00082838]]

Bias output:
 [[-0.00023847]]
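
In practice, gradients are recomputed before every update, and the forward pass, backward pass, and update are repeated many times. The following self-contained sketch shows such a training loop on the same XOR data. It uses plain arrays instead of the `NeuralNetwork` class, and the seed, the normally distributed initial weights, the learning rate of 0.5, and the 2000 epochs are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Larger initial weights than in the cells above (an assumption of this sketch)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

learning_rate = 0.5
losses = []
for epoch in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    losses.append(np.mean((y_hat - y) ** 2))

    # Backward pass (gradients of the MSE loss)
    output_delta = 2 * (y_hat - y) / y.shape[0] * y_hat * (1 - y_hat)
    hidden_delta = (output_delta @ W2.T) * h * (1 - h)

    # Parameter update (gradient descent)
    W2 -= learning_rate * (h.T @ output_delta)
    b2 -= learning_rate * output_delta.sum(axis=0, keepdims=True)
    W1 -= learning_rate * (X.T @ hidden_delta)
    b1 -= learning_rate * hidden_delta.sum(axis=0, keepdims=True)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

The loss should typically fall over the course of training, although how far it falls depends on the initialization, the learning rate, and the number of epochs.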


## Frequently Asked Questions (FAQ)

### 1. What is ground truth data and why is it primarily used during the training of a neural network?

**Ground truth data** refers to the actual, correct output or labels for the input data used to train a neural network. It represents the desired outcome that the neural network is trying to predict or approximate.

Ground truth data is primarily used **during the training phase** of a neural network because the training process is based on learning from examples. The network makes predictions on the input data, and these predictions are compared against the corresponding ground truth labels. This comparison allows the network to calculate the error or loss. The error signal is then used during backpropagation to adjust the network's internal parameters (weights and biases) in a way that minimizes this error, making the network's future predictions closer to the ground truth.

During validation and testing, after the network has been trained, ground truth data is still used to evaluate the network's performance on unseen data, but it is not used to update the network's parameters. The network produces predictions for the held-out inputs, and these are compared to the corresponding correct labels to measure metrics such as accuracy and precision. During inference on genuinely new data, no ground truth is available; the network simply makes predictions based on what it learned from the training data.

### 2. What is a loss function, what error does it calculate, and how does it calculate that error?

A **loss function** (also known as a cost function or error function) is a mathematical function that quantifies the difference between the output predicted by a neural network and the actual ground truth value. It measures how well the neural network is performing for a given set of parameters and input data.

The loss function calculates the **error** between the predicted output and the true output. The specific type of error calculated depends on the task the neural network is designed for. For example:

*   **For regression tasks** (predicting continuous values), the error is typically the difference between the predicted continuous value and the true continuous value.
*   **For classification tasks** (categorizing data into classes), the error relates to how confidently and correctly the network predicts the class label compared to the true class label.

The loss function calculates this error through a specific mathematical formula. Different loss functions have different formulas that are suited for different types of tasks and error measurements. Here are a few common loss functions:

*   **Mean Squared Error (MSE):** Commonly used in regression tasks. It calculates the average of the squared differences between the predicted and true values.
    $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
    where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points.

*   **Mean Absolute Error (MAE):** Also used in regression tasks. It calculates the average of the absolute differences between the predicted and true values.
    $$ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$

*   **Binary Cross-Entropy:** Used in binary classification tasks (two classes). It measures the performance of a classification model whose output is a probability value between 0 and 1.

*   **Categorical Cross-Entropy:** Used in multi-class classification tasks (more than two classes). It measures the performance of a classification model where the output is a probability distribution over the classes.
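
The two cross-entropy losses can be sketched in NumPy as follows. This is a minimal illustration that assumes predictions are already probabilities and that multi-class labels are one-hot encoded; the function names and the `eps` clipping constant are choices made for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; y_pred holds probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy; rows of y_true are one-hot and
    rows of y_pred are probability distributions over the classes."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([1, 0, 1])
confident = np.array([0.9, 0.1, 0.8])   # confident and correct
unsure = np.array([0.6, 0.4, 0.5])      # correct side, but less certain
print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, unsure))   # larger than the line above

labels = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(labels, probs))
```

Both losses penalize confident wrong predictions heavily, because the logarithm grows without bound as the probability assigned to the true class approaches zero.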

The goal during neural network training is to minimize the value of the loss function by adjusting the network's parameters. A lower loss value indicates that the network's predictions are closer to the ground truth.