***
# <center>***Backpropagation***</center>
***

In deep learning, backpropagation is a gradient computation method commonly used when training a neural network: it supplies the gradients from which the parameter updates are computed.

It is an efficient application of the **chain rule** to neural networks. **Backpropagation** computes the **gradient** of a **loss function** with respect to the weights of the network for a single input–output example, and does so efficiently, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.

Strictly speaking, the term **backpropagation** refers only to an algorithm for **efficiently computing the gradient**, not to how the gradient is used. The term is often used loosely, however, to refer to the entire learning algorithm, including how the gradient is used, for example by stochastic gradient descent or as an intermediate step in a more complicated optimizer such as Adaptive Moment Estimation (Adam). The main disadvantages of these optimization algorithms are convergence to local minima, exploding gradients, vanishing gradients, and weak control of the learning rate. Hessian and quasi-Hessian optimizers address only the local-minimum convergence problem, and they make backpropagation take longer. These problems have led researchers to develop hybrid and fractional optimization algorithms.

***
## **How It Works**
***

**Initialization:** 
  - When a neural network is initialized, weights are often set to small random values.
  - The network processes inputs through layers to generate predictions.
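
As a minimal sketch of the initialization step, the snippet below draws small Gaussian weights for one dense layer. The helper name `init_layer`, the `scale` of 0.01, the zero biases, and the fixed seed are illustrative assumptions rather than anything prescribed above.

```python
import random

def init_layer(n_inputs, n_neurons, scale=0.01, seed=0):
    """Return (weights, biases) for one dense layer.

    weights[j][i] is the weight from input i to neuron j.
    The Gaussian draw, the scale, and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = [[scale * rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    biases = [0.0] * n_neurons  # starting biases at zero is a common choice
    return weights, biases

weights, biases = init_layer(n_inputs=3, n_neurons=2)
print(len(weights), len(weights[0]), biases)
```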

**Forward Pass:**
  - In this step, the input data is fed through the network layer by layer.
  - Each neuron performs a weighted sum of its inputs and passes the result through an activation function to produce an output.
  - This process continues until the final output is produced.
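
The forward pass described above can be sketched for one layer as follows. The function name `dense_forward`, the second neuron's weights, and the choice of ReLU as the activation are assumptions made for illustration.

```python
def relu(z):
    # ReLU activation: passes positive values, clips negatives to zero
    return max(z, 0.0)

def dense_forward(inputs, weights, biases, activation=relu):
    """Compute the outputs of one layer: activation(weighted sum + bias) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(activation(z))
    return outputs

# Two hypothetical neurons reading the same three inputs
hidden = dense_forward([1.0, -2.0, 3.0],
                       weights=[[-3.0, -1.0, 2.0], [0.5, 1.0, -1.0]],
                       biases=[1.0, 0.0])
print(hidden)  # [6.0, 0.0]
```

In a multi-layer network, the list returned for one layer would become the `inputs` argument of the next.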

**Loss Calculation:**
  - The loss (or cost) function measures the difference between the predicted output and the actual output (e.g., mean squared error for regression tasks or cross-entropy loss for classification tasks).
  - This loss value quantifies how well or poorly the network is performing.
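
The two losses mentioned above can be sketched in plain Python. The function names, the sample values, and the `eps` clamp that guards against `log(0)` are assumptions for illustration.

```python
import math

def mean_squared_error(y_pred, y_true):
    # Average of squared differences between predictions and targets
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def categorical_cross_entropy(probs, true_index, eps=1e-12):
    # Negative log of the probability assigned to the correct class
    return -math.log(max(probs[true_index], eps))

mse = mean_squared_error([2.5, 0.0], [3.0, -0.5])
ce = categorical_cross_entropy([0.7, 0.2, 0.1], true_index=0)
print(mse, ce)
```

In both cases a lower value means the prediction is closer to the target.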

**Backward Pass (Backpropagation):**
  - The goal of backpropagation is to reduce the loss.
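  - The algorithm calculates the gradient (the partial derivatives) of the loss function with respect to each weight using the chain rule of calculus. Backpropagation itself only computes these gradients; a separate optimization step, such as gradient descent, uses them to update the weights.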

**Step-by-Step of Backward Pass:**
  - **Compute Gradients for Output Layer:**
     - For the output layer neurons, compute the gradient of the loss with respect to the neuron’s output. This tells you how much the loss would change if you changed the output of that neuron.

  - **Propagate Gradients Backwards:**
    - For each layer, starting from the output layer and moving backwards, compute the gradient of the loss with respect to each weight.
    - This involves computing the product of the gradient with respect to the neuron’s output and the gradient of the neuron’s output with respect to the weight.

  - **Update Weights:**
      - Use the computed gradients to update the weights. This is typically done using an optimization algorithm like Stochastic Gradient Descent (SGD):
        
$$w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$$


Here, 
- **w** is the weight, 
- **η** is the learning rate, and 
- **L** is the loss function.

By adjusting the weights slightly in the direction that reduces the loss, the network learns to produce more accurate outputs.
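
The backward-pass steps and the weight update can be sketched end to end on a deliberately tiny network: one ReLU hidden neuron feeding one linear output neuron, trained on a single example with a squared-error loss. The network shape, the starting parameters, the learning rate, and the name `train_step` are all assumptions chosen to keep the example short.

```python
def relu(z):
    return max(z, 0.0)

def train_step(params, x, target, lr):
    """One forward pass, backward pass, and SGD update on a single example."""
    w1, b1, w2, b2 = params

    # Forward pass
    z = w1 * x + b1            # hidden pre-activation
    h = relu(z)                # hidden activation
    y = w2 * h + b2            # network output
    loss = (y - target) ** 2

    # Backward pass: start at the output and chain derivatives backwards
    dL_dy = 2.0 * (y - target)
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2                         # gradient passed to the hidden neuron
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)    # ReLU derivative
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz

    # SGD update: w_new = w_old - lr * dL/dw
    new_params = (w1 - lr * dL_dw1, b1 - lr * dL_db1,
                  w2 - lr * dL_dw2, b2 - lr * dL_db2)
    return new_params, loss

params = (0.5, 0.1, -0.3, 0.2)   # arbitrary starting values
losses = []
for _ in range(20):
    params, loss = train_step(params, x=1.5, target=1.0, lr=0.05)
    losses.append(loss)
print(losses[0], losses[-1])     # the loss shrinks over the 20 steps
```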
***

Now that we have an idea of how to measure the impact of variables on a function’s output, we can begin to write the code that calculates these **partial derivatives** and see their role in minimizing the model’s loss. Before applying this to a complete neural network, let’s start with a simplified **forward pass** with just **one neuron**. Rather than **backpropagating** from the **loss function** of a full neural network, let’s **backpropagate** the **ReLU function** for a **single neuron** and act as if we intend to **minimize the output of this single neuron**.

This is purely a demonstration to simplify the explanation: minimizing the output of a ReLU-activated neuron serves no purpose other than as an exercise. **Minimizing the loss value is our end goal**, but starting with this more basic output lets us show how the **chain rule**, together with **derivatives and partial derivatives**, gives the impact of each variable on the ReLU-activated output before we move on to the full network and the overall loss.

Let’s start with the forward pass that we need to perform for this **single neuron and ReLU activation**. We will use an example neuron with 3 inputs, which means that it also has 3 weights and a bias:

```python
# Forward pass
x = [1.0, -2.0, 3.0]  # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0  # bias
```


We then start with the first input, x[0], and the related weight, w[0]:
$$x[0] = 1.0$$
$$w[0] = -3.0$$
We have to multiply the input by the weight, and do the same for the other two input-weight pairs:

```python
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
print(xw0, xw1, xw2)
```

```
-3.0 2.0 6.0
```


The next operation to perform is a **sum of all weighted inputs with a bias**: 

```python
# Adding weighted inputs and a bias
z = xw0 + xw1 + xw2 + b
print(z)
```

```
6.0
```


This forms the **neuron’s output**. The last step is to apply the **ReLU activation function** on this output:

```python
# ReLU activation function
y = max(z, 0)
print(y)
```

```
6.0
```


This is the full forward pass through a single neuron and a ReLU activation function. Let’s treat all of these chained functions as one big function that takes **input values (x)**, **weights (w)**, and a **bias (b)** as inputs and **outputs y**. This big function consists of multiple simpler functions: a multiplication of input values and weights, a sum of these values and the bias, and a max function as the ReLU activation, for 3 chained functions in total.

The first step is to **backpropagate** our **gradients** by calculating derivatives and partial derivatives with respect to each of our parameters and inputs. To do this, we are going to use the **chain rule**. Recall that the chain rule stipulates that the derivative of nested functions like **f(g(x))** solves to:

$$\frac{d}{dx}f(g(x)) = \frac{d}{dg(x)}f(g(x)) \cdot \frac{d}{dx}g(x) = f'(g(x)) \cdot g'(x)$$
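
As a quick sanity check of the chain rule, the sketch below compares the analytic derivative of a nested function with a central finite-difference estimate. The particular functions `f` and `g`, the evaluation point, and the step size `h` are arbitrary choices made for illustration.

```python
def g(x):
    return 3.0 * x + 1.0      # inner function, g'(x) = 3

def f(u):
    return u ** 2             # outer function, f'(u) = 2u

def analytic_derivative(x):
    # Chain rule: f'(g(x)) * g'(x) = 2 * g(x) * 3
    return 2.0 * g(x) * 3.0

def numerical_derivative(x, h=1e-6):
    # Central finite difference of the composed function f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)

point = 2.0
print(analytic_derivative(point), numerical_derivative(point))
```

The two printed values should agree to several decimal places, which is the behavior the chain rule predicts.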

This big function that we just mentioned can be, in the context of our neural network, loosely interpreted as: 

$$ReLU(\sum_{i}[inputs \cdot weights] + bias)$$

Or in the form that matches code more precisely as:

$$ReLU(x_0w_0 + x_1w_1 + x_2w_2 + b)$$

Our current task is to calculate how much each of the inputs, the weights, and the bias impacts the output. We will start by considering what we need to calculate for the partial derivative with respect to w₀, for example. But first, let’s rewrite our equation in a form that lets us determine more easily how to calculate the derivatives:

$$y = ReLU(sum(mul(x_0, w_0), mul(x_1, w_1), mul(x_2, w_2), b))$$
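
The nested form maps almost directly onto code. In the sketch below, `mul` and `relu` are small helper functions introduced only for illustration, and Python's built-in `sum` takes its terms as a single list rather than as separate arguments.

```python
def mul(a, b):
    return a * b

def relu(z):
    return max(z, 0.0)

x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

# y = ReLU(sum(mul(x0, w0), mul(x1, w1), mul(x2, w2), b))
y = relu(sum([mul(x[0], w[0]), mul(x[1], w[1]), mul(x[2], w[2]), b]))
print(y)  # 6.0, matching the forward pass
```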

The above equation contains 3 nested functions: ReLU, a sum of weighted inputs and a bias, and multiplications of the inputs and weights. To calculate the impact of the example weight, w₀, on the output, the chain rule tells us to calculate the derivative of ReLU with respect to its parameter, which is the sum, then multiply it by the partial derivative of the sum operation with respect to its mul(x₀, w₀) input, as this input contains the parameter in question. Then, multiply this by the partial derivative of the multiplication operation with respect to w₀. Let’s see this in a simplified equation:

$$\frac{\partial}{\partial w_0}\left[ReLU(sum(mul(x_0, w_0), mul(x_1, w_1), mul(x_2, w_2), b))\right] = \frac{dReLU()}{dsum()} \cdot \frac{\partial sum()}{\partial mul(x_0, w_0)} \cdot \frac{\partial mul(x_0, w_0)}{\partial w_0}$$
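
The three-factor product can be evaluated by hand for the example neuron. The sketch below does this for both w₀ and x₀, which differ only in the last factor, and checks the w₀ result against a finite-difference estimate. The variable names and the step size `h` are assumptions for illustration.

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

def neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)

z = sum(xi * wi for xi, wi in zip(x, w)) + b   # 6.0

drelu_dz = 1.0 if z > 0 else 0.0   # derivative of max(z, 0) with respect to z
dsum_dmul0 = 1.0                   # derivative of a sum with respect to any one term
dmul0_dw0 = x[0]                   # d(x0 * w0)/dw0 = x0
dmul0_dx0 = w[0]                   # d(x0 * w0)/dx0 = w0

dy_dw0 = drelu_dz * dsum_dmul0 * dmul0_dw0
dy_dx0 = drelu_dz * dsum_dmul0 * dmul0_dx0

# Finite-difference check on w0
h = 1e-6
w_plus = [w[0] + h] + w[1:]
w_minus = [w[0] - h] + w[1:]
numeric_dw0 = (neuron(x, w_plus, b) - neuron(x, w_minus, b)) / (2 * h)
print(dy_dw0, dy_dx0, numeric_dw0)
```

Here the impact of w₀ on the output is x₀ = 1.0 and the impact of x₀ is w₀ = -3.0, because the ReLU is in its linear region (z > 0).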

During the backward pass, we will calculate the derivative of the loss function and multiply it by the derivative of the activation function of the output layer, then multiply this result by the derivative of the output layer, and so on, through all of the hidden layers and activation functions. Inside these layers, the derivatives with respect to the weights and biases form the gradients that we will use to update the weights and biases. The derivatives with respect to the inputs form the gradient to chain with the previous layer. Each layer can then calculate the impact of its own weights and biases on the loss and pass the gradient with respect to its inputs further back.
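
For the single example neuron, this whole procedure can be sketched as below. The upstream gradient `dvalue = 1.0` stands in for whatever a following layer would pass back, and the learning rate of 0.001 is an arbitrary choice; both are assumptions for illustration, as is the goal of lowering the neuron's own output rather than a real loss.

```python
x = [1.0, -2.0, 3.0]   # input values
w = [-3.0, -1.0, 2.0]  # weights
b = 1.0                # bias

def forward(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return z, max(z, 0.0)

z, y = forward(x, w, b)

dvalue = 1.0                              # gradient received from the next layer (assumed)
dz = dvalue * (1.0 if z > 0 else 0.0)     # chained through the ReLU derivative

dw = [dz * xi for xi in x]   # gradients with respect to the weights
db = dz                      # gradient with respect to the bias
dx = [dz * wi for wi in w]   # gradients with respect to the inputs, for the previous layer

# One small step against the gradient to reduce the output
lr = 0.001
w_new = [wi - lr * dwi for wi, dwi in zip(w, dw)]
b_new = b - lr * db
_, y_new = forward(x, w_new, b_new)

print(dw, db, dx)
print(y, y_new)   # the new output is slightly lower than 6.0
```

The `dw` and `db` values are what an optimizer would consume, while `dx` is what this neuron would hand to the layer before it.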