# Forward Pass Implementation
We'll implement a simple forward pass of a neural network with 2 fully connected layers and a Sigmoid activation function. First, we'll use PyTorch to implement the forward pass. Then we'll implement the matrix multiplication and activation function from scratch to understand how the network works.

For more details see [here](https://github.com/pooyavahidi/content/blob/main/ai/neural_networks_inference.md).

## Forward Pass using PyTorch
Let's implement a simple forward pass of a neural network with 2 fully connected layers using PyTorch.



In [1]:
import numpy as np
import torch
import torch.nn.functional as F


class SimpleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(in_features=2, out_features=3)
        self.linear2 = torch.nn.Linear(in_features=3, out_features=1)

    def forward(self, x):
        Z1 = self.linear1(x)
        print(f"Z1: {Z1.data}")

        A1 = F.sigmoid(Z1)
        print(f"A1: {A1.data}")

        Z2 = self.linear2(A1)
        print(f"Z2: {Z2.data}")

        A2 = F.sigmoid(Z2)
        print(f"A2: {A2.data}")

        return A2

In [2]:
model = SimpleNet()
print(model)

SimpleNet(
  (linear1): Linear(in_features=2, out_features=3, bias=True)
  (linear2): Linear(in_features=3, out_features=1, bias=True)
)


Define the input dataset with 3 examples (batch size = 3) and 2 features each. So $X$ has shape $(3, 2)$.


The matrix of input features follows the usual convention: each row represents an **example** and each column represents a **feature**. The input shape is `(number of examples/batch size, number of features)`.

$$X =  \begin{bmatrix} \vec{\mathbf{x}}^{(1)} \\
\vec{\mathbf{x}}^{(2)} \\
\vdots \\
\end{bmatrix}$$

Here we have 3 examples, each with 2 features. Example 1 has features $x_1^{(1)} = 1.25$ and $x_2^{(1)} = 0.38$, and so on:

$$X =  \begin{bmatrix} 1.25 & 0.38 \\
-0.45 & 3.01 \\
0.72 & -0.56 \\
\end{bmatrix}$$


In [3]:
# 3 examples (Batch size = 3):
# Example 1: (Feature 1 = 1.25, Feature 2 = 0.38)
# Example 2: (Feature 1 = -0.45, Feature 2 = 3.01)
# Example 3: (Feature 1 = 0.72, Feature 2 = -0.56)
X = np.array([[1.25, 0.38], [-0.45, 3.01], [0.72, -0.56]])
print(f"X Shape: {X.shape}")

X Shape: (3, 2)


The weight matrix $W^{[l]}$ of layer $l$ has shape $(n^{[l]}, n^{[l-1]})$, where each row holds the weights of a single neuron in that layer. In other words, $W$ has dimensions $(output \times input)$, where output is the number of neurons in the layer and input is the number of neurons in the previous layer.

The bias $b^{[l]}$ is a vector with one entry per neuron in the layer, so its shape is $(n^{[l]},)$.

- Layer 0 (input layer)  with 2 features. 
- Layer 1 with 3 neurons (in = 2, out = 3): $W_1$ is $(3, 2)$ and $b_1$ is $(3,)$
- Layer 2 with 1 neuron  (in = 3, out = 1): $W_2$ is $(1, 3)$ and $b_2$ is $(1,)$
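
The shape rule above can be sketched in a few lines of plain Python. The helper name `parameter_shapes` and the `layer_sizes` list are illustrative assumptions, not part of the original notebook.

```python
def parameter_shapes(layer_sizes):
    """Return [(W_shape, b_shape), ...] for each layer after the input layer.

    layer_sizes[0] is the number of input features (layer 0);
    layer_sizes[l] is the number of neurons in layer l.
    """
    shapes = []
    for l in range(1, len(layer_sizes)):
        n_in, n_out = layer_sizes[l - 1], layer_sizes[l]
        # W is (output, input): one row of weights per neuron.
        # b holds one bias per neuron.
        shapes.append(((n_out, n_in), (n_out,)))
    return shapes


# The network in this notebook: 2 features -> 3 neurons -> 1 neuron
for l, (w_shape, b_shape) in enumerate(parameter_shapes([2, 3, 1]), start=1):
    print(f"Layer {l}: W{l} is {w_shape}, b{l} is {b_shape}")
```

For `[2, 3, 1]` this prints the same shapes as the list above: `(3, 2)` and `(3,)` for layer 1, `(1, 3)` and `(1,)` for layer 2.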




In [4]:
# Layer 1
W1 = np.array([[-0.6053, 0.2325], [-0.5255, -0.6182], [0.0117, -0.1774]])
b1 = np.array([0.3849, -0.6344, -0.2022])

# Layer 2 (Output Layer)
W2 = np.array([[0.3884, -0.4516, -0.0486]])
b2 = np.array([0.4796])

We set the parameters of the model manually for repeatability. Then, we'll use the same values in the manual implementation to compare the results.

In [5]:
model.linear1.weight.data.copy_(torch.tensor(W1))
model.linear1.bias.data.copy_(torch.tensor(b1))

model.linear2.weight.data.copy_(torch.tensor(W2))
model.linear2.bias.data.copy_(torch.tensor(b2))

print(
    f"Layer 1: Weights: {model.linear1.weight.data.shape}, "
    f"Bias: {model.linear1.bias.data.shape}"
)
print(
    f"Layer 2: Weights: {model.linear2.weight.data.shape}, "
    f"Bias: {model.linear2.bias.data.shape}"
)

Layer 1: Weights: torch.Size([3, 2]), Bias: torch.Size([3])
Layer 2: Weights: torch.Size([1, 3]), Bias: torch.Size([1])


Let's inspect the values of the weights and biases of the network.

In [6]:
print("Layer 1: ")
print("-" * 20)
print(f"Weights:\n{model.linear1.weight.data}")
print(f"Bias:\n{model.linear1.bias.data}")

print("\nLayer 2: ")
print("-" * 20)
print(f"Weights:\n{model.linear2.weight.data}")
print(f"Bias:\n{model.linear2.bias.data}")

Layer 1: 
--------------------
Weights:
tensor([[-0.6053,  0.2325],
        [-0.5255, -0.6182],
        [ 0.0117, -0.1774]])
Bias:
tensor([ 0.3849, -0.6344, -0.2022])

Layer 2: 
--------------------
Weights:
tensor([[ 0.3884, -0.4516, -0.0486]])
Bias:
tensor([0.4796])


Deep learning frameworks (such as TensorFlow and PyTorch) keep the bias as a vector (1D array) for efficiency and rely on broadcasting when it is added to the result of the matrix multiplication.

During the addition, the vector $b$ is broadcast to the shape of the matrix $Z$. In this case, it behaves like a row vector of shape $(1, n^{[l]})$ that is added to each row of $Z$.
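
As a minimal NumPy sketch of this broadcasting behaviour (the numbers and the names `product`, `Z_broadcast` and `Z_explicit` are made up for illustration and are not the network's actual values):

```python
import numpy as np

# Stand-in for the matrix product of a layer with 3 neurons
# and a batch of 2 examples (illustrative values only).
product = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])
b = np.array([10.0, 20.0, 30.0])  # 1D bias, shape (3,)

# Implicit broadcasting: b is treated as a (1, 3) row and added to every row.
Z_broadcast = product + b

# The same addition with the bias explicitly repeated for each example.
Z_explicit = product + np.tile(b.reshape(1, -1), (product.shape[0], 1))

print(Z_broadcast)
print(np.array_equal(Z_broadcast, Z_explicit))
```

Both forms give the same matrix. The broadcast version simply avoids materializing the repeated bias rows.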


**Forward Pass**

In [7]:
y_pred_torch = model(torch.tensor(X, dtype=torch.float32))
print(y_pred_torch.data)

Z1: tensor([[-0.2834, -1.5262, -0.2550],
        [ 1.3571, -2.2587, -0.7414],
        [-0.1811, -0.6666, -0.0944]])
A1: tensor([[0.4296, 0.1786, 0.4366],
        [0.7953, 0.0946, 0.3227],
        [0.4548, 0.3393, 0.4764]])
Z2: tensor([[0.5446],
        [0.7301],
        [0.4799]])
A2: tensor([[0.6329],
        [0.6748],
        [0.6177]])
tensor([[0.6329],
        [0.6748],
        [0.6177]])


## Forward Pass using Matrix Multiplication (from scratch)
Let's now go through layer by layer calculations and compare the result with the PyTorch implementation.

We'll calculate the output of each layer using the following steps:

**1. Linear Transformation for layer $l$**:<br>

$$Z^{[l]} = A^{[l-1]}{W^{[l]}}^\top + \vec{\mathbf{b}}^{[l]}$$

**2. Activation Function for layer $l$**:<br>

$$A^{[l]} = g(Z^{[l]})$$

where $g$ is the activation function (Sigmoid, $\sigma$, in this case):

$$g(Z) = \sigma(Z) = \frac{1}{1 + e^{-Z}}$$

More on this [here](https://github.com/pooyavahidi/content/blob/main/ai/neural_networks_inference.md)
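
The two steps can be combined into a generic loop over layers. The following is a minimal sketch, assuming the parameters are passed as a list of `(W, b)` pairs. The names `forward`, `params`, `X_demo` and `params_demo` are illustrative and not part of the notebook.

```python
import numpy as np


def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))


def forward(X, params, g=sigmoid):
    """Apply Z = A_prev @ W.T + b followed by A = g(Z) for every layer.

    params is a list of (W, b) pairs, with W of shape (n_out, n_in).
    """
    A = X  # A^[0] = X
    for W, b in params:
        Z = A @ W.T + b  # step 1: linear transformation
        A = g(Z)         # step 2: activation
    return A


# Tiny made-up example: 2 features -> 2 neurons -> 1 neuron, all-zero parameters.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
params_demo = [
    (np.zeros((2, 2)), np.zeros(2)),
    (np.zeros((1, 2)), np.zeros(1)),
]

# Every Z is 0 here, so every output is sigmoid(0) = 0.5.
print(forward(X_demo, params_demo))
```

The rest of this section performs the same two steps by hand, one layer at a time.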

**Activation Function Definition**:<br>
We'll use the Sigmoid activation function for this example.

In [8]:
def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))
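
The expression above is fine for the small values in this example. For large negative inputs, `np.exp(-Z)` can overflow and emit a runtime warning, even though the result still rounds to 0. One possible alternative is sketched below. The name `stable_sigmoid` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import numpy as np


def stable_sigmoid(Z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)

    pos = Z >= 0
    # For Z >= 0, exp(-Z) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-Z[pos]))

    # For Z < 0, use the equivalent form exp(Z) / (1 + exp(Z)).
    exp_z = np.exp(Z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```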

### Layer 1

The input matrix $X$ contains 3 examples:
$$X =  \begin{bmatrix} 1.25 & 0.38 \\
-0.45 & 3.01 \\
0.72 & -0.56 \\
\end{bmatrix}$$

As discussed, the input layer can also be referred to as layer 0:

$$X = A^{[0]}$$


#### Linear Transformation for Layer 1

$$Z^{[1]} = A^{[0]}{W^{[1]}}^\top + \vec{\mathbf{b}}^{[1]}$$

$$Z^{[1]} = \begin{bmatrix} 1.25 & 0.38 \\
-0.45 & 3.01 \\
0.72 & -0.56 \\
\end{bmatrix} \begin{bmatrix} -0.6053 & -0.5255 & 0.0117 \\
0.2325 & -0.6182 & -0.1774 \\
\end{bmatrix} + \begin{bmatrix} 0.3849 & -0.6344 & -0.2022 \\
\end{bmatrix}$$

Expanding the matrix product gives:

$$= \begin{bmatrix} 1.25 \times -0.6053 + 0.38 \times 0.2325 & 1.25 \times -0.5255 + 0.38 \times -0.6182 & 1.25 \times 0.0117 + 0.38 \times -0.1774 \\ 
-0.45 \times -0.6053 + 3.01 \times 0.2325 & -0.45 \times -0.5255 + 3.01 \times -0.6182 & -0.45 \times 0.0117 + 3.01 \times -0.1774 \\
0.72 \times -0.6053 + -0.56 \times 0.2325 & 0.72 \times -0.5255 + -0.56 \times -0.6182 & 0.72 \times 0.0117 + -0.56 \times -0.1774 \\
\end{bmatrix} + \begin{bmatrix} 0.3849 & -0.6344 & -0.2022 \\
\end{bmatrix}$$

Then we evaluate each sum and broadcast the bias vector to each row of the matrix:
$$= \begin{bmatrix} -0.668275 + 0.3849 & -0.891791 - 0.6344 & -0.052787 - 0.2022 \\
0.97221 + 0.3849 & -1.624307 - 0.6344 & -0.539239 - 0.2022 \\
-0.566016 + 0.3849 & -0.032168 - 0.6344 & 0.107768 - 0.2022 \\
\end{bmatrix}$$

The final result is:

$$Z^{[1]} = \begin{bmatrix} -0.283375 & -1.526191 & -0.254987 \\
1.35711 & -2.258707 & -0.741439 \\
-0.181116 & -0.666568 & -0.094432 \\
\end{bmatrix}$$

Each row of the matrix $Z^{[1]}$ corresponds to a single example in the batch. We can interpret each row as the linear transformation of a single example by the first layer of the neural network.
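
To make the per-example, per-neuron view concrete, a single entry of $Z^{[1]}$ can be reproduced in plain Python as a dot product plus a bias. This is a minimal sketch. The variable names `x1`, `w1`, `b1_1` and `z_11` are chosen for illustration.

```python
# Values copied from the matrices above: first example, first neuron of layer 1.
x1 = [1.25, 0.38]        # features of example 1 (first row of X)
w1 = [-0.6053, 0.2325]   # weights of neuron 1 (first row of W1)
b1_1 = 0.3849            # bias of neuron 1

# Dot product plus bias gives the entry in row 1, column 1 of Z1.
z_11 = sum(x * w for x, w in zip(x1, w1)) + b1_1
print(round(z_11, 6))  # -0.283375
```

The matrix multiplication computes this same quantity for every combination of example (row) and neuron (column) at once.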


Let's calculate the above using matrix multiplication in NumPy.

In [9]:
Z1 = np.matmul(X, W1.T) + b1
print(f"Z1: {Z1}")

Z1: [[-0.283375 -1.526191 -0.254987]
 [ 1.35711  -2.258707 -0.741439]
 [-0.181116 -0.666568 -0.094432]]


#### Activation Function for Layer 1

$$A^{[1]} = g(Z^{[1]})$$

$$A^{[1]} = \begin{bmatrix} \sigma(-0.283375) & \sigma(-1.526191) & \sigma(-0.254987) \\
\sigma(1.35711) & \sigma(-2.258707) & \sigma(-0.741439) \\
\sigma(-0.181116) & \sigma(-0.666568) & \sigma(-0.094432) \\
\end{bmatrix}$$

The result for the first column of the first row is:

$$\sigma(-0.283375) = \frac{1}{1 + e^{-(-0.283375)}} = 0.42962654$$

If we calculate the rest of the values, we get:

$$A^{[1]} = \begin{bmatrix} 0.42962654 & 0.17855167 & 0.43659641 \\
0.7952896 & 0.09460106 & 0.32268955 \\
0.45484437 & 0.33926575 & 0.47640953 \\
\end{bmatrix}$$
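
The first row can be checked with nothing more than the standard library. This is a small sketch. The helper name `sigmoid_scalar` is an assumption for illustration.

```python
import math


def sigmoid_scalar(z):
    return 1 / (1 + math.exp(-z))


# First row of Z1, copied from the linear transformation result.
z_row = [-0.283375, -1.526191, -0.254987]
a_row = [sigmoid_scalar(z) for z in z_row]

print([round(a, 4) for a in a_row])  # [0.4296, 0.1786, 0.4366]
```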

Matrix $A^{[1]}$ is the output of the first layer.
- Each row of the matrix $A^{[1]}$ corresponds to a single example in the batch. We can interpret each row of this matrix as the output vector $\vec{\mathbf{a}}^{(i)}$ for a single example. 
- Each column holds the activation values of one neuron in this layer across all examples. We have 3 neurons in the first layer, so we have 3 columns in the output matrix.

For example, the first row is the activation vector of the first example:

$${\vec{\mathbf{a}}^{[1]}}^{(1)} = \begin{bmatrix} 0.42962654 & 0.17855167 & 0.43659641 \\
\end{bmatrix}$$

Where:
- $[l]$ is the layer index.
- $(i)$ is the example index.

So:
- ${a^{[1]}_{1}}^{(1)} = 0.42962654$ is the output of the first neuron in the first layer for the first example.
- ${a^{[1]}_{2}}^{(1)} = 0.17855167$ is the output of the second neuron in the first layer for the first example.
- ${a^{[1]}_{3}}^{(1)} = 0.43659641$ is the output of the third neuron in the first layer for the first example.
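
The notation maps directly onto array indexing: the example index $(i)$ selects the row and the neuron index $j$ selects the column, both shifted by one because NumPy is 0-based. A minimal sketch follows. The names `A1_demo` and `activation` are assumptions for illustration.

```python
import numpy as np

# A1 copied from the result above: rows are examples, columns are neurons.
A1_demo = np.array([
    [0.42962654, 0.17855167, 0.43659641],
    [0.7952896, 0.09460106, 0.32268955],
    [0.45484437, 0.33926575, 0.47640953],
])


def activation(A, neuron, example):
    """Look up the activation of a neuron for an example using 1-based indices."""
    return A[example - 1, neuron - 1]


print(activation(A1_demo, neuron=2, example=1))  # 0.17855167
print(A1_demo[0])     # activation vector of example 1 (one value per neuron)
print(A1_demo[:, 0])  # neuron 1 across all examples
```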


Let's calculate the above using numpy and compare the results with the PyTorch implementation.

In [10]:
A1 = sigmoid(Z1)
print(f"A1: {A1}")

A1: [[0.42962654 0.17855167 0.43659641]
 [0.7952896  0.09460106 0.32268955]
 [0.45484437 0.33926575 0.47640953]]


### Layer 2

Now to calculate the output of the second layer, we need to use the output of the first layer as the input to this layer.

Input:

$$A^{[1]} =  \begin{bmatrix} 0.42962654 & 0.17855167 & 0.43659641 \\
0.7952896 & 0.09460106 & 0.32268955 \\
0.45484437 & 0.33926575 & 0.47640953 \\
\end{bmatrix}$$


#### Linear Transformation for Layer 2

$$Z^{[2]} = A^{[1]}{W^{[2]}}^\top + \vec{\mathbf{b}}^{[2]}$$

$$Z^{[2]} = \begin{bmatrix} 0.42962654 & 0.17855167 & 0.43659641 \\
0.7952896 & 0.09460106 & 0.32268955 \\
0.45484437 & 0.33926575 & 0.47640953 \\
\end{bmatrix} \begin{bmatrix} 0.3884 \\
-0.4516 \\
-0.0486 \\
\end{bmatrix} + \begin{bmatrix} 0.4796 \\ 
\end{bmatrix}$$ 

If we calculate the above matrix multiplication and then broadcast the bias vector, we get:

$$Z^{[2]} = \begin{bmatrix} 0.54461443 \\
0.73008593 \\
0.47989564 \\
\end{bmatrix}$$

Each row of the matrix $Z^{[2]}$ corresponds to a single example in the batch. This is the linear transformation of the first layer's output by the second layer of the neural network.


In [11]:
Z2 = np.matmul(A1, W2.T) + b2
print(f"Z2: {Z2}")

Z2: [[0.54461443]
 [0.73008593]
 [0.47989564]]


#### Activation Function for Layer 2

$$A^{[2]} = g(Z^{[2]})$$

$$A^{[2]} = \begin{bmatrix} \sigma(0.54461443) \\
\sigma(0.73008593) \\
\sigma(0.47989564) \\
\end{bmatrix}$$

If we calculate the sigmoid function for each value, we get:


$$A^{[2]} = \begin{bmatrix} 0.6328852 \\
0.67482413 \\
0.61772323 \\
\end{bmatrix}$$

Matrix $A^{[2]}$ is the output of the second layer (output layer).
- Each row of the matrix $A^{[2]}$ corresponds to a single example in the batch. 
- Each column holds the activation values of one neuron in the layer across all examples. We have 1 neuron in the second layer, so we have 1 column in the output matrix.


In [12]:
A2 = sigmoid(Z2)
print(f"A2: {A2}")

A2: [[0.6328852 ]
 [0.67482413]
 [0.61772323]]


Let's compare the results with the PyTorch implementation.

In [13]:
# Compare the results of PyTorch and Manual Calculation

print(f"PyTorch:\n{y_pred_torch.data}")
print(f"Manual:\n{A2}")

PyTorch:
tensor([[0.6329],
        [0.6748],
        [0.6177]])
Manual:
[[0.6328852 ]
 [0.67482413]
 [0.61772323]]
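
Rather than comparing the printed numbers by eye, the check can be automated with `np.allclose`. The sketch below uses values copied from the printed outputs above, so the tolerance is chosen to absorb the 4-decimal rounding of the PyTorch printout. The names `torch_out` and `manual_out` are illustrative.

```python
import numpy as np

# Values copied from the printed outputs above.
torch_out = np.array([[0.6329], [0.6748], [0.6177]])  # PyTorch, printed to 4 decimals
manual_out = np.array([[0.6328852], [0.67482413], [0.61772323]])

# atol=1e-4 allows for the rounding of the printed PyTorch values.
print(np.allclose(torch_out, manual_out, atol=1e-4))  # True
print(np.abs(torch_out - manual_out).max())           # largest absolute difference
```

Inside the notebook, the same idea could be applied to the live values, for example by comparing `y_pred_torch.detach().numpy()` with `A2`. A small tolerance is still needed there, because PyTorch computes in float32 while the NumPy calculation uses float64.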


### Output of the Neural Network
In this example, our neural network has 2 layers, so the output of the second layer is the output of the neural network.

Output of the neural network:

$$A^{[2]} = \begin{bmatrix} 0.6328852 \\
0.67482413 \\
0.61772323 \\
\end{bmatrix}$$

Since we have 3 examples in the batch, each of the rows in the output matrix corresponds to the output of the neural network for a single example.

- ${a^{[2]}_{1}}^{(1)} = 0.6328852$ is the output of the neural network for the first example.
- ${a^{[2]}_{1}}^{(2)} = 0.67482413$ is the output of the neural network for the second example.
- ${a^{[2]}_{1}}^{(3)} = 0.61772323$ is the output of the neural network for the third example.

#### Derive $\hat{y}$ from the output of the neural network
Deriving the final output $\hat{y}$ depends on the task and type of the activation function in the output layer. For example, in a binary classification task, we can use the Sigmoid activation function in the output layer (like in this example). Then the output of the neural network is the probability of the positive class. 

$$P(y = 1 | X) = A^{[2]}$$

If we set the threshold to 0.5:

$$\hat{y} = \begin{cases} 1 & \text{if } A^{[2]} \geq 0.5 \\ 0 & \text{if } A^{[2]} < 0.5 \end{cases}$$

So:
- For the first example, $0.6328852 \geq 0.5$, so $\hat{y}^{(1)} = 1$.
- For the second example, $0.67482413 \geq 0.5$, so $\hat{y}^{(2)} = 1$.
- For the third example, $0.61772323 \geq 0.5$, so $\hat{y}^{(3)} = 1$.

In other words, the neural network predicts class 1 for all three examples.
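
The decision depends on the chosen threshold. As a small vectorized sketch (the function name `predict` and the alternative threshold of 0.65 are assumptions for illustration only):

```python
import numpy as np


def predict(probabilities, threshold=0.5):
    """Return 1 where the probability is at or above the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)


# Network output copied from the result above.
A2_demo = np.array([[0.6328852], [0.67482413], [0.61772323]])

print(predict(A2_demo).ravel())                  # [1 1 1]
print(predict(A2_demo, threshold=0.65).ravel())  # [0 1 0]
```

With the default threshold of 0.5 all three examples are classified as positive, while a stricter threshold of 0.65 keeps only the second one.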


In [14]:
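def calculate_yhat(y_pred, threshold=0.5):
    # Convert to a NumPy array so PyTorch tensors and NumPy arrays are handled alike
    y_pred = np.asarray(y_pred)
    # 1 where the probability is at or above the threshold, 0 otherwise
    return (y_pred >= threshold).astype(y_pred.dtype)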


# Network decisions for PyTorch
print(f"PyTorch Decisions:\n{calculate_yhat(y_pred_torch.data)}")

# Network decisions for Manual Calculation
print(f"Manual Calculation Decisions:\n{calculate_yhat(A2)}")

PyTorch Decisions:
[[1.]
 [1.]
 [1.]]
Manual Calculation Decisions:
[[1.]
 [1.]
 [1.]]
