# Neural Networks - Backpropagation and Computational Graph

## Simple Graph with One Parameter
Let's start with a simple computational graph. Consider the composite function $f(x)$:

$$f(x) = ((x + 5)^3+1)^2$$

Let's evaluate $f(x)$ at the point $x = -3$. Decomposing $f(x)$ into simple functions gives:

$$h= x + 5 = 2$$
$$g = h^3 + 1 = 9$$
$$f = g^2 = 81$$

See [Computational Graph]() for more details.

We'll use PyTorch to compute the output (forward propagation) of our function $f(x)$ at the point $x = -3$. Then we will compute the gradients (backward propagation) of our function $f(x)$ with respect to $x$.

In [2]:
import torch

In [3]:
x = torch.tensor(-3.0, requires_grad=True)

Since we want the gradient (derivative) of $f(x)$ with respect to $x$, we set `requires_grad=True` on `x`. Any tensor for which we want a gradient needs this flag so that PyTorch tracks the operations applied to it in the computational graph.

In [4]:
f_x = ((x + 5) ** 3 + 1) ** 2
print(f_x.data)

tensor(81.)


In [5]:
f_x.backward()

# Gradients
print(f"df/dx: {x.grad}")

df/dx: 216.0


During forward propagation, that is, when we calculate `f_x`, PyTorch automatically builds a directed acyclic graph (DAG) of operations. PyTorch creates this graph whenever a tensor with `requires_grad=True` is involved in the computation.
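The recorded graph can be inspected through each tensor's `grad_fn` attribute. The following is a minimal sketch rather than part of the original notebook; the helper name `walk_graph` is hypothetical, and the node names it prints may vary between PyTorch versions.

```python
import torch

x = torch.tensor(-3.0, requires_grad=True)
f_x = ((x + 5) ** 3 + 1) ** 2


def walk_graph(fn, depth=0):
    """Recursively print the backward nodes reachable from `fn`."""
    if fn is None:
        # Constants such as 5 and 1 have no backward node
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk_graph(next_fn, depth + 1)


# x is a leaf: it was created by the user, not by an operation
print(x.is_leaf, x.grad_fn)

# f_x is the result of operations, so it carries a grad_fn
walk_graph(f_x.grad_fn)
```

Each printed line corresponds to one operation of the forward pass, listed from the output back toward the leaf `x`.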

When we call `f_x.backward()`, PyTorch walks backward through the computational graph to each parameter (leaf node) and applies the chain rule of calculus to compute the gradient of the output (the final node on which we called `backward()`) with respect to each parameter.
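For the function $f(x)$ above, the chain rule that `backward()` applies can be written out by hand, using the intermediate values $h = 2$ and $g = 9$ from the forward pass:

$$\frac{df}{dg} = 2g = 18$$
$$\frac{dg}{dh} = 3h^2 = 12$$
$$\frac{dh}{dx} = 1$$
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx} = 18 \cdot 12 \cdot 1 = 216$$

This matches the value of `x.grad` printed above.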

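Gradient Accumulation: The calculated gradients are stored in the `grad` attribute of each leaf tensor that has `requires_grad=True`. They are accumulated (added) into `grad` rather than overwritten, so repeated calls to `backward()` sum their results until the gradient is reset.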
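Because new gradients are added into `grad` instead of replacing it, calling `backward()` twice without resetting doubles the stored value. A minimal sketch of this behavior, reusing the function from above (the variable names `first`, `second` and `after_reset` are only for illustration):

```python
import torch

x = torch.tensor(-3.0, requires_grad=True)

# First backward pass: x.grad goes from None to 216
f_x = ((x + 5) ** 3 + 1) ** 2
f_x.backward()
first = x.grad.item()

# Second backward pass on a freshly built graph: the new gradient is
# added to the existing one instead of replacing it
f_x = ((x + 5) ** 3 + 1) ** 2
f_x.backward()
second = x.grad.item()

# Reset the stored gradient in place before the next pass
x.grad.zero_()
after_reset = x.grad.item()

print(first, second, after_reset)
```

This is why training loops typically clear the gradients before each backward pass.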

This [PyTorch Computation Graph](https://www.youtube.com/watch?v=MswxJw-8PvE) is a good video on this topic.

## Graph with Multiple Parameters

Now let's consider a simple linear regression model with one input feature $x$ and one target $y$. The model is defined as follows:

$$f_{w,b}(x) = wx + b$$

Where $w$ is the weight and $b$ is the bias. 

The [cost function]() is defined as the mean squared error (MSE) for a dataset with only one sample:

$$J(w,b) = \frac{1}{2}(f_{w,b}(x) - y)^2$$


[Gradient Descent]() is the most common optimization algorithm used to minimize the cost function. The algorithm works by iteratively updating the parameters in the opposite direction of the gradients of the cost function with respect to the parameters. So, we need to compute the gradients of the cost function with respect to the parameters $w$ and $b$.
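A single gradient descent update for this model can be sketched as follows. The learning rate `lr = 0.01`, the helper `cost`, and the starting values are assumptions chosen for illustration; they are not prescribed by the algorithm.

```python
import torch

# Illustrative values (assumptions for this sketch)
x = torch.tensor(-3.0)
y = torch.tensor(5.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
lr = 0.01  # learning rate, hypothetical choice


def cost(w, b):
    """Half squared error of the linear model for one sample."""
    return ((w * x + b) - y) ** 2 / 2


J_before = cost(w, b)
J_before.backward()

# Move each parameter against its gradient; no_grad keeps the update
# itself out of the computational graph
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

# Clear the gradients so the next iteration starts fresh
w.grad.zero_()
b.grad.zero_()

J_after = cost(w, b)
print(J_before.item(), J_after.item())
```

With a small enough learning rate, the cost after the update is lower than before it; repeating the step moves the parameters further toward a minimum.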
 

See [Computational Graph]() for more details.


Let's define our input $x$, target value $y$, weight $w$ and bias $b$ as follows:

In [6]:
x = torch.tensor(-3.0)
y = torch.tensor(5.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

We didn't set `requires_grad=True` for $x$ and $y$ because they are the input and target values, not model parameters that we optimize using gradients.

### Step 1 - Forward Propagation (Compute Model Output)
First, we compute the model's output.

In [7]:
# Forward Propagation (compute the output)
c = w * x
f = c + b

### Step 2 - Forward Propagation (Compute the Cost Function)

In [8]:
# Compute the Cost
d = f - y
J = (d**2) / 2

By default, PyTorch does not store the gradients of intermediate (non-leaf) nodes, which saves memory when the graph is large. To demonstrate the backward steps through the computational graph in this example, we change this behavior by calling `retain_grad()` on the intermediate tensors.

In [9]:
c.retain_grad()
f.retain_grad()
d.retain_grad()
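To see what `retain_grad()` changes, the following standalone sketch compares two identical intermediate tensors, only one of which retains its gradient. The names `c_plain` and `c_kept` are hypothetical, and the warning filter is there because reading `.grad` on a non-leaf tensor emits a warning.

```python
import warnings

import torch

w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(-3.0)

c_plain = w * x          # non-leaf, gradient is discarded after backward
c_kept = w * x           # non-leaf, gradient is kept
c_kept.retain_grad()

(c_plain + c_kept).backward()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # reading .grad on a non-leaf warns
    plain_grad = c_plain.grad
kept_grad = c_kept.grad

print(w.is_leaf, c_plain.is_leaf)
print(plain_grad, kept_grad)
```

Only `c_kept` exposes a gradient afterwards; the leaf `w` receives its gradient in both cases.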

In [10]:
print(f"Input feature x and target y:\nx: {x.data}\ny: {y.data}\n")
print(f"Model parameters:\nw: {w.data}\nb: {b.data}\n")
print(f"Model output:\nc: {c.data}\nf: {f.data}\n")
print(f"Cost:\nd: {d.data}\nJ: {J.data}\n")

Input feature x and target y:
x: -3.0
y: 5.0

Model parameters:
w: 3.0
b: 1.0

Model output:
c: -9.0
f: -8.0

Cost:
d: -13.0
J: 84.5



### Step 3 - Backpropagation (Compute the Gradients)

In [11]:
# Backpropagation (compute the gradients)
J.backward()

In [12]:
print(f"dJ/dd: {d.grad}")
print(f"dJ/df: {f.grad}")
print(f"dJ/dc: {c.grad}")
print(f"dJ/db: {b.grad}")
print(f"dJ/dw: {w.grad}")

dJ/dd: -13.0
dJ/df: -13.0
dJ/dc: -13.0
dJ/db: -13.0
dJ/dw: 39.0
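These values can be reproduced by hand with the chain rule: each gradient is the local derivative of a node multiplied by the gradient flowing in from the node after it. A minimal sketch with plain Python floats (the `dJ_*` variable names are chosen here for illustration):

```python
# Forward pass with plain Python floats (same values as above)
x, y = -3.0, 5.0
w, b = 3.0, 1.0

c = w * x        # -9.0
f = c + b        # -8.0
d = f - y        # -13.0
J = d**2 / 2     # 84.5

# Backward pass: local derivative times upstream gradient
dJ_dd = d                # J = d^2 / 2  ->  dJ/dd = d
dJ_df = dJ_dd * 1.0      # d = f - y    ->  dd/df = 1
dJ_dc = dJ_df * 1.0      # f = c + b    ->  df/dc = 1
dJ_db = dJ_df * 1.0      # f = c + b    ->  df/db = 1
dJ_dw = dJ_dc * x        # c = w * x    ->  dc/dw = x

print(dJ_dd, dJ_df, dJ_dc, dJ_db, dJ_dw)
```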


## Simple Neural Network
Let's now see the computational graph for a simple neural network with the following architecture:
- Layer 0 (input layer): 2 features
- Layer 1: Fully connected layer with 3 neurons and ReLU activation function.
- Layer 2: Fully connected layer with 2 neurons and ReLU activation function.
- Layer 3 (output): Fully connected layer with 1 neuron (output) and Sigmoid activation function.

In this example we use a batch dataset with 2 samples. The input $X$ and target $Y$ are defined as follows:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
$$Y = \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix}$$

This means that for sample 1 the features are $x_1 = 1$ and $x_2 = 2$, and the target is $y = 0.5$.

Recall that we keep samples in **rows** and features in **columns**. So, each row of $X$ and $Y$ is associated with one sample.

In [None]:
X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = torch.tensor([[0.5], [0.8]])

Let's create our neural network

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()

        # Define the model architecture (Layers and nodes)
        self.linear1 = nn.Linear(in_features=2, out_features=3)
        self.linear2 = nn.Linear(in_features=3, out_features=2)
        self.linear3 = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        # Forward Propagation happens here.
        # It takes the input tensor x and returns the output tensor for each
        # layer by applying the linear transformation first and then the
        # activation function.
        # It starts from layer 1 and goes forward layer by layer to the output
        # layer.

        # Layer 1 linear transformation
        Z1 = self.linear1(x)
        # Layer 1 activation
        A1 = F.relu(Z1)

        # Layer 2 linear transformation
        Z2 = self.linear2(A1)
        # Layer 2 activation
        A2 = F.relu(Z2)

        # Layer 3 (output layer) linear transformation
        Z3 = self.linear3(A2)
        # Layer 3 activation
        A3 = F.sigmoid(Z3)

        # Print the intermediate results
        print(f"Z1: {Z1}\nA1: {A1}\nZ2: {Z2}\nA2: {A2}\nZ3: {Z3}\nA3: {A3}")

        # Output of the model (prediction)
        return A3

In [22]:
model = NeuralNet()
print(model)

NeuralNet(
  (linear1): Linear(in_features=2, out_features=3, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
  (linear3): Linear(in_features=2, out_features=1, bias=True)
)
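The printed summary also lets us count the trainable parameters: each `Linear` layer holds `in_features * out_features` weights plus `out_features` biases. A minimal sketch of that count, using an `nn.Sequential` stand-in with the same three layers (an assumption made to keep the example self-contained):

```python
import torch.nn as nn

# Stand-in with the same layer sizes as the model above
layers = nn.Sequential(
    nn.Linear(2, 3), nn.ReLU(),
    nn.Linear(3, 2), nn.ReLU(),
    nn.Linear(2, 1), nn.Sigmoid(),
)

# Number of elements in every weight matrix and bias vector
counts = {name: p.numel() for name, p in layers.named_parameters()}
total = sum(counts.values())

print(counts)
print(total)
```

The activation functions contribute no parameters, so the total is (2·3 + 3) + (3·2 + 2) + (2·1 + 1) = 20.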


Let's see the initial weights and biases of our neural network.

In [33]:
print(model.linear1.weight)

Parameter containing:
tensor([[ 0.4328, -0.2531],
        [ 0.5298, -0.4185],
        [-0.4404,  0.3311]], requires_grad=True)


In [42]:
def print_model_parameters(model):
    for i, child in enumerate(model.children()):
        print(f"Layer {i+1}: {type(child).__name__}")
        child_parameters = dict(child.named_parameters())

        print(f"weights: {child_parameters['weight']}")
        print(f"bias: {child_parameters['bias']}")
        print("-" * 80)


print_model_parameters(model)

Layer 1: Linear
weights: Parameter containing:
tensor([[ 0.4328, -0.2531],
        [ 0.5298, -0.4185],
        [-0.4404,  0.3311]], requires_grad=True)
bias: Parameter containing:
tensor([0.3261, 0.6766, 0.3776], requires_grad=True)
--------------------------------------------------------------------------------
Layer 2: Linear
weights: Parameter containing:
tensor([[ 0.5389,  0.5690, -0.1213],
        [-0.2766,  0.0664,  0.1120]], requires_grad=True)
bias: Parameter containing:
tensor([-0.0097, -0.3818], requires_grad=True)
--------------------------------------------------------------------------------
Layer 3: Linear
weights: Parameter containing:
tensor([[-0.3824,  0.1682]], requires_grad=True)
bias: Parameter containing:
tensor([0.2470], requires_grad=True)
--------------------------------------------------------------------------------


For simplicity and reproducible results, we'll set the weights and biases manually in this example. Let's use the following weights and biases:

**Layer 1:**
$$W_1 = \begin{bmatrix} -1 & 2 \\ 4 & 5 \\ 6 & -3\end{bmatrix} \quad b_1 = \begin{bmatrix} -1 & -2 & 3 \end{bmatrix}$$

Note: each row of $W_1$ is associated with one neuron in layer 1. We have 3 neurons in layer 1, so we have 3 rows in $W_1$. The number of columns in $W_1$ is equal to the number of features in the input layer. We have 2 features in the input layer $X$, so we have 2 columns in $W_1$.
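The shape convention can be checked directly on an `nn.Linear` layer, which stores its weight with shape `(out_features, in_features)` and computes $XW^T + b$. The following sketch shows one possible way to load $W_1$ and $b_1$ into a standalone layer; copying under `torch.no_grad()` is an assumption about how to set the values, not necessarily the approach used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
W1 = torch.tensor([[-1.0, 2.0], [4.0, 5.0], [6.0, -3.0]])
b1 = torch.tensor([-1.0, -2.0, 3.0])

linear1 = nn.Linear(in_features=2, out_features=3)

# Overwrite the random initial values without recording the copy
# in the computational graph
with torch.no_grad():
    linear1.weight.copy_(W1)
    linear1.bias.copy_(b1)

Z1 = linear1(X)          # layer output
manual = X @ W1.T + b1   # same computation written out

print(linear1.weight.shape)
print(Z1)
```

Each row of `Z1` belongs to one sample and each column to one neuron of layer 1.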