<a href="https://colab.research.google.com/github/nhareesha/MLAI/blob/main/NN/NN_forward_backward_prop_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import numpy as np


In [6]:
def relu(z):
    return np.maximum(0, z)

In [7]:
def softmax_direct(Z):
  # Normalize each column (one training example) separately
  softmax = np.exp(Z) / np.sum(np.exp(Z), axis=0, keepdims=True)
  return softmax

### Softmax with numerical stability

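When the values in Z are very large, $exp$ can overflow to infinity, and when they are very negative it can underflow to zero. To avoid this, we subtract the maximum value of each column from every z value in that column, so that the largest exponent becomes $e^0 = 1$. The result is unchanged, because the common factor cancels between the numerator and the denominator.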

In [19]:
def softmax(Z):
  z_exp = np.exp(Z - np.max(Z, axis = 0, keepdims = True))
  softmax = z_exp / np.sum(z_exp, axis = 0, keepdims = True)
  return softmax
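
The effect of the shift can be seen on a small input with large values. The sketch below is only an illustration: the names `softmax_naive`, `softmax_stable` and `Z_big` are made up for this example, and both functions normalize per column.

```python
import numpy as np

def softmax_naive(Z):
    # Direct formula, normalized per column; overflows for large entries.
    e = np.exp(Z)
    return e / np.sum(e, axis=0, keepdims=True)

def softmax_stable(Z):
    # Subtract the column-wise max so the largest exponent is exp(0) = 1.
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

Z_big = np.array([[1000.0], [1001.0], [1002.0]])  # one example, three classes

# exp(1000) overflows to inf in float64, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_out = softmax_naive(Z_big)

stable_out = softmax_stable(Z_big)

print("naive :", naive_out.ravel())
print("stable:", stable_out.ravel())
```

The stable version returns the same probabilities as the direct formula applied to the shifted values `[0, 1, 2]`.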


- The number of parameters does not depend on the number of training examples.
- It does depend on the number of features.
- For one training example, W is a ROW vector $[w_1, w_2, w_3 ... w_n]$.
- For one training example, X is a COLUMN vector $[x_1, x_2, x_3 ... x_n]^T$, with one entry per feature.
- $Y = WX + B$ is a linear combination of the inputs.
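
A quick way to check the first two points is to keep W fixed and vary the number of examples. This is a minimal sketch with made-up values: `n = 3` features and the numbers in `W` and `B` are chosen only for illustration.

```python
import numpy as np

n = 3                                  # number of features (illustrative)
W = np.array([[0.5, -1.0, 2.0]])       # row vector, shape (1, n)
B = 0.1                                # single bias term

for m in (1, 4, 100):                  # different numbers of training examples
    X = np.ones((n, m))                # each column is one example
    Y = W @ X + B                      # shape (1, m)
    print(f"m={m:3d}  W.shape={W.shape}  Y.shape={Y.shape}")
```

Only the shape of `Y` changes with `m`, while `W` stays at `(1, n)` throughout.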

### Problem Statement

A simple implementation of a neural network using NumPy in Python. It has an input layer with 5 inputs (5 features), two hidden layers each with 10 neurons using the ReLU activation function, and an output layer with 4 neurons, which gives three weight layers in total.

#### Notation
$L^{[layer number]}$

For $m$ training examples

Input Layer - $X_{5 \times m}$, which is the same as $A_{5 \times m}^{[0]}$

$\\\\$

Hidden Layer $L^{[1]}$ has $A_{10 \times m}^{[1]} = f(W_{10 \times 5}^{[1]}A_{5 \times m}^{[0]} + B_{10 \times 1}^{[1]})$

$\\\\$

Hidden Layer $L^{[2]}$ has $A_{10 \times m}^{[2]} = f(W_{10 \times 10}^{[2]}A_{10 \times m}^{[1]} + B_{10 \times 1}^{[2]})$

$\\\\$

Here Z can be generalized to $Z = WA + B$ and $A = f(Z)$, where $f$ is the activation function (ReLU for the hidden layers)


$\\\\$

Output layer is a Softmax layer

$A_{4 \times m}^{[3]} = softmax(W_{4 \times 10}^{[3]}A_{10 \times m}^{[2]}+B_{4 \times 1}^{[3]})$
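
The shapes in this notation can be traced with zero-filled arrays, since only the dimensions matter. The sketch below assumes `m = 7` examples purely for illustration, and the names `layer_sizes` and `total_params` are not part of the notebook.

```python
import numpy as np

layer_sizes = [5, 10, 10, 4]   # input, hidden 1, hidden 2, output
m = 7                          # illustrative number of training examples

A = np.zeros((layer_sizes[0], m))          # A^[0] = X, shape (5, m)
total_params = 0
for l in range(1, len(layer_sizes)):
    n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
    W = np.zeros((n_curr, n_prev))         # W^[l], shape (n_curr, n_prev)
    B = np.zeros((n_curr, 1))              # B^[l], shape (n_curr, 1)
    A = W @ A + B                          # only the shapes matter here
    total_params += W.size + B.size
    print(f"layer {l}: W{W.shape} B{B.shape} -> A{A.shape}")

print("total parameters:", total_params)
```

The count works out to (10·5 + 10) + (10·10 + 10) + (4·10 + 4) = 214, and it does not change with `m`.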


### Matrices at a given Layer

$Z^{[2]} = W^{[2]} \times A^{[1]} + B^{[2]}$

$A^{[2]} = f(Z^{[2]})$

Here

$A^{[1]}$ : activations(inputs) from preceding layer, which here is Layer-1

$Z^{[2]}$ : The linear output (pre-activation) of the second layer.

$W^{[2]}$ : The weight matrix for the second layer.

$B^{[2]}$ : The bias vector for the second layer.

### Dimensions
If $n_1$ is the number of neurons in the first layer, $n_2$ is the number of neurons in the second layer, and $m$ is the number of training examples, then the dimensions are:

- $W^{[2]}$ has dimensions $n_2 \times n_1$, because each of the $n_2$ neurons of the second layer connects to all $n_1$ neurons of the first layer.
- $B^{[2]}$ has dimensions $n_2 \times 1$, as each neuron in the second layer has a single bias term.
- $A^{[1]}$ has dimensions $n_1 \times m$, where each column represents the activations from the first layer for one training example.
- $A^{[2]}$ has dimensions $n_2 \times m$, where each column represents the activations from the second layer for one training example.


$
W^{[2]} = \begin{bmatrix}
w_{11}^{[2]} & w_{12}^{[2]} & \cdots & w_{1n_1}^{[2]} \\
w_{21}^{[2]} & w_{22}^{[2]} & \cdots & w_{2n_1}^{[2]} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n_2 1}^{[2]} & w_{n_2 2}^{[2]} & \cdots & w_{n_2 n_1}^{[2]}
\end{bmatrix}
$

$w_{ij}^{[2]}$ : Weight connecting the $j_{th}$ neuron of the first layer to the $i_{th}$ neuron of the second layer.


$
A^{[1]} = \begin{bmatrix}
a_{11}^{[1]} & a_{12}^{[1]} & \cdots & a_{1m}^{[1]} \\
a_{21}^{[1]} & a_{22}^{[1]} & \cdots & a_{2m}^{[1]} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n_1 1}^{[1]} & a_{n_1 2}^{[1]} & \cdots & a_{n_1 m}^{[1]}
\end{bmatrix}
$

Each column corresponds to the activations from the first layer for a specific training example.

$
Z^{[2]} = \begin{bmatrix}
z_{11}^{[2]} & z_{12}^{[2]} & \cdots & z_{1m}^{[2]} \\
z_{21}^{[2]} & z_{22}^{[2]} & \cdots & z_{2m}^{[2]} \\
\vdots & \vdots & \ddots & \vdots \\
z_{n_2 1}^{[2]} & z_{n_2 2}^{[2]} & \cdots & z_{n_2 m}^{[2]}
\end{bmatrix}
$

Each column corresponds to the linear output for a specific training example


$
B^{[2]} = \begin{bmatrix}
b_{1}^{[2]} \\
b_{2}^{[2]} \\
\vdots \\
b_{n_2}^{[2]}
\end{bmatrix}
$

Each element corresponds to the bias term for a neuron in the second layer.
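
Since the bias has shape $n_2 \times 1$ while $W^{[2]}A^{[1]}$ has shape $n_2 \times m$, NumPy broadcasting adds the same bias column to every training example. The sketch below illustrates this with small, made-up sizes (`n1 = 3`, `n2 = 2`, `m = 4`) and random values.

```python
import numpy as np

n1, n2, m = 3, 2, 4
rng = np.random.default_rng(0)
W2 = rng.standard_normal((n2, n1))     # shape (n2, n1)
A1 = rng.standard_normal((n1, m))      # shape (n1, m)
B2 = np.array([[10.0], [20.0]])        # shape (n2, 1)

Z2 = W2 @ A1 + B2                      # B2 is broadcast across the m columns

# Same result with the bias explicitly tiled to shape (n2, m).
Z2_explicit = W2 @ A1 + np.tile(B2, (1, m))
print(Z2.shape, np.allclose(Z2, Z2_explicit))
```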

### Initializing parameters for 2 hidden layers and one output layer

In [9]:
# W1,b1,W2,b2,W3,b3

def init_parameters():
  np.random.seed(1)
  # Layer - 1
  W1 = np.random.randn(10, 5) * 0.01
  b1 = np.random.randn(10, 1) * 0.01

  # Layer - 2
  W2 = np.random.randn(10, 10) * 0.01
  b2 = np.random.randn(10, 1) * 0.01

  # Layer - 3 (output layer)
  W3 = np.random.randn(4, 10) * 0.01
  b3 = np.random.randn(4, 1) * 0.01

  parameters = (W1, b1, W2, b2, W3, b3)
  return parameters




## Forward Propagation

At a high level

$ Z = WA + B$

$ A = f(Z) $



In [15]:
def forward_prop(X, parameters):
  W1, b1, W2, b2, W3, b3 = parameters

  Z1 = W1 @ X + b1 # Here Z1 will have dimension 10 by m , where m - number of training examples
  A1 = relu(Z1) # relu is activation over linear values, A1 will be of size 10 by m

  Z2 = W2 @ A1 + b2
  A2 = relu(Z2)

  Z3 = W3 @ A2 + b3
  A3 = softmax(Z3)

  pass_activations = (Z1, A1, Z2, A2, Z3, A3)
  return A3, pass_activations




                (W1, b1) --> Z1--> A1
                                    \
                                     \
                                      \                   
                          (W2, b2) --> Z2 --> A2  
                                               \
                                                \
                                   (W3, b3) -->  Z3 --> A3 --> L


## Dependency path

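$L = -\sum_{k} Y_k \log A_k^{[3]}$ (cross-entropy loss for one example; combined with softmax it gives $\frac{\partial L}{\partial Z^{[3]}} = A^{[3]} - Y$)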

$Z^{[3]} = W^{[3]} \cdot A^{[2]} + b^{[3]}$

$A^{[3]} = Softmax(Z^{[3]}) $


$Z^{[2]} = W^{[2]} \cdot A^{[1]} + b^{[2]}$

$A^{[2]} = ReLU(Z^{[2]}) $


$Z^{[1]} = W^{[1]} \cdot A^{[0]} + b^{[1]}$

$A^{[1]} = ReLU(Z^{[1]}) $
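
Reading the dependency path from $L$ back to each weight matrix gives the chain rule products below. They are written only as a guide to the order of the factors; the actual matrix shapes and transposes are handled in the code.

$\frac{\partial L}{\partial W^{[3]}} = \frac{\partial L}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial W^{[3]}}$

$\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}}$

$\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial Z^{[3]}} \cdot \frac{\partial Z^{[3]}}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}}$

The bias gradients follow the same paths, with the last factor replaced by $\frac{\partial Z^{[l]}}{\partial b^{[l]}} = 1$.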
                              
                              
                              
                                

In [11]:
def relu_derivative(Z):
  return Z > 0

### Weights and Biases adjustment

$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$

$b_{\text{new}} = b_{\text{old}} - \eta \cdot \frac{\partial L}{\partial b}$

$\eta $ is learning rate.

$\frac{\partial L}{\partial w}$ is the partial derivative of the loss function with respect to the weight ($w$), meaning everything except the weight is treated as constant.

$\frac{\partial L}{\partial b}$ is the partial derivative of the loss function with respect to the bias ($b$), meaning everything except the bias is treated as constant.
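
The update rule can be tried on a one-parameter toy problem before applying it to the network. The loss $L(w) = (w - 3)^2$, the starting point and the learning rate below are assumptions chosen only for illustration.

```python
# Toy loss L(w) = (w - 3)^2, with dL/dw = 2 * (w - 3).
w = 0.0
eta = 0.1                      # learning rate

for step in range(50):
    grad = 2.0 * (w - 3.0)     # dL/dw at the current w
    w = w - eta * grad         # w_new = w_old - eta * dL/dw

print(w)
```

Each step moves `w` against the gradient, so it approaches the minimizer at 3.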





In [32]:
def backward_prop(X, Y, parameters, forwardpass_activations):
  m = X.shape[1]
  W1, b1, W2, b2, W3, b3 = parameters
  Z1, A1, Z2, A2, Z3, A3 = forwardpass_activations

  # Need to nudge parameters in such a way that L function reduces

  # layer - 3
  dL_dZ3 = A3 - Y

  dZ3_dW3 = A2.T
  dL_dW3 = (1/m) * np.dot(dL_dZ3, dZ3_dW3)

  dZ3_db3 = 1
  # dL_db3 = (1/m) * np.sum(np.dot(dL_dZ3 ,dZ3_db3), axis=1, keepdims=True)
  dL_db3 = (1/m) * np.sum(dL_dZ3 , axis=1, keepdims=True)


  # LAYER - 2
  dZ3_dA2 = W3.T
  dL_dA2 = np.dot(dZ3_dA2, dL_dZ3) # W3.T goes first so the shapes align: (10, 4) @ (4, m)

  dA2_dZ2 = relu_derivative(Z2) # Here A2 is f(z2)
  dL_dZ2 = dL_dA2 * dA2_dZ2

  dZ2_dW2 = A1.T
  dL_dW2 = (1/m) * np.dot(dL_dZ2, dZ2_dW2)

  dZ2_db2 = 1
  # dL_db2 = (1/m) * np.sum(np.dot(dL_dZ2, dZ2_db2), axis=1, keepdims=True)
  dL_db2 = (1/m) * np.sum(dL_dZ2, axis=1, keepdims=True)


  # LAYER - 1
  dZ2_dA1 = W2.T
  dL_dA1 = np.dot(dZ2_dA1, dL_dZ2) # W2.T goes first so the shapes align: (10, 10) @ (10, m)

  dA1_dZ1 = relu_derivative(Z1) # Here A1 is f(Z1)
  dL_dZ1 = dL_dA1 * dA1_dZ1

  dZ1_dW1 = X.T
  dL_dW1 = (1/m) * np.dot(dL_dZ1 , dZ1_dW1)

  dZ1_db1 = 1
  # dL_db1 = (1/m) * np.sum(np.dot(dL_dZ1 * dZ1_db1), axis=1, keepdims=True)
  dL_db1 = (1/m) * np.sum(dL_dZ1, axis=1, keepdims=True)


  pass_gradients = (dL_dW1, dL_db1, dL_dW2, dL_db2, dL_dW3, dL_db3)
  return pass_gradients
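
The starting point `dL_dZ3 = A3 - Y` can be checked numerically. The sketch below assumes the loss is the cross-entropy summed over classes and examples, which is the loss this gradient corresponds to. The helper `cross_entropy` and the test data are made up for this check; the `1/m` averaging is left out here because `backward_prop` applies it later, when forming the weight and bias gradients.

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def cross_entropy(Z, Y):
    # Sum over classes and examples of -Y * log(softmax(Z)).
    return -np.sum(Y * np.log(softmax(Z)))

rng = np.random.default_rng(0)
Z3 = rng.standard_normal((4, 3))            # 4 classes, 3 examples
Y = np.eye(4)[[0, 2, 3]].T                  # one-hot labels, shape (4, 3)

analytic = softmax(Z3) - Y                  # the A3 - Y used in backward_prop

# Central finite differences, one entry of Z3 at a time.
numeric = np.zeros_like(Z3)
eps = 1e-6
for i in range(Z3.shape[0]):
    for j in range(Z3.shape[1]):
        Zp, Zm = Z3.copy(), Z3.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        numeric[i, j] = (cross_entropy(Zp, Y) - cross_entropy(Zm, Y)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

A very small difference indicates that the analytic expression and the finite-difference estimate agree.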


In [22]:
def update_parameters(parameters, pass_gradients, learning_rate):
  W1, b1, W2, b2, W3, b3 = parameters
  dL_dW1, dL_db1, dL_dW2, dL_db2, dL_dW3, dL_db3 = pass_gradients

  W1 = W1 - learning_rate * dL_dW1
  b1 = b1 - learning_rate * dL_db1

  W2 = W2 - learning_rate * dL_dW2
  b2 = b2 - learning_rate * dL_db2

  W3 = W3 - learning_rate * dL_dW3
  b3 = b3 - learning_rate * dL_db3

  updated_params = (W1, b1, W2, b2, W3, b3)
  return updated_params
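
In practice the forward pass, backward pass and update are repeated many times. The following is a compact, self-contained sketch of such a loop, not the notebook's own code: it inlines the same steps, and the learning rate of 0.1, the 200 steps, the mean cross-entropy loss and the `default_rng` data are all assumptions made for this illustration.

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

rng = np.random.default_rng(1)
X = rng.random((5, 10))                          # 5 features, 10 examples
Y = np.eye(4)[rng.integers(0, 4, size=10)].T     # one-hot labels, shape (4, 10)
m = X.shape[1]

# Same shapes and 0.01 scaling as init_parameters.
W1, b1 = rng.standard_normal((10, 5)) * 0.01, rng.standard_normal((10, 1)) * 0.01
W2, b2 = rng.standard_normal((10, 10)) * 0.01, rng.standard_normal((10, 1)) * 0.01
W3, b3 = rng.standard_normal((4, 10)) * 0.01, rng.standard_normal((4, 1)) * 0.01

learning_rate = 0.1      # assumed value for this sketch
num_steps = 200          # assumed value for this sketch
losses = []

for step in range(num_steps):
    # Forward pass
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2
    A2 = np.maximum(0, Z2)
    A3 = softmax(W3 @ A2 + b3)
    losses.append(-np.sum(Y * np.log(A3)) / m)   # mean cross-entropy (assumed loss)

    # Backward pass (all gradients are computed before any update)
    dZ3 = A3 - Y
    dW3 = dZ3 @ A2.T / m
    db3 = np.sum(dZ3, axis=1, keepdims=True) / m
    dZ2 = (W3.T @ dZ3) * (Z2 > 0)
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1, b1 = W1 - learning_rate * dW1, b1 - learning_rate * db1
    W2, b2 = W2 - learning_rate * dW2, b2 - learning_rate * db2
    W3, b3 = W3 - learning_rate * dW3, b3 - learning_rate * db3

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

With random labels the loss can only drop a little, but tracking it across steps is a simple check that the gradients point in a useful direction.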




### Execution

In [33]:
np.random.seed(1)

X = np.random.rand(5, 10) # 5 features and 10 training examples

Y = np.eye(4)[np.random.choice(4, 10)].T  # One-hot encoded labels for 4 classes

parameters = init_parameters()

print("Initial parameters - ", parameters)

learning_rate = 0.01

A3, forwardpass_predictions = forward_prop(X, parameters)

gradients = backward_prop(X, Y, parameters, forwardpass_predictions)

updated_parameters = update_parameters(parameters, gradients, learning_rate)


print("Output after one forward pass:", A3)

print("Updated parameters after 1st pass - ", updated_parameters)


Initial parameters -  (array([[ 0.01624345, -0.00611756, -0.00528172, -0.01072969,  0.00865408],
       [-0.02301539,  0.01744812, -0.00761207,  0.00319039, -0.0024937 ],
       [ 0.01462108, -0.02060141, -0.00322417, -0.00384054,  0.01133769],
       [-0.01099891, -0.00172428, -0.00877858,  0.00042214,  0.00582815],
       [-0.01100619,  0.01144724,  0.00901591,  0.00502494,  0.00900856],
       [-0.00683728, -0.0012289 , -0.00935769, -0.00267888,  0.00530355],
       [-0.00691661, -0.00396754, -0.00687173, -0.00845206, -0.00671246],
       [-0.00012665, -0.0111731 ,  0.00234416,  0.01659802,  0.00742044],
       [-0.00191836, -0.00887629, -0.00747158,  0.01692455,  0.00050808],
       [-0.00636996,  0.00190915,  0.02100255,  0.00120159,  0.00617203]]), array([[ 0.0030017 ],
       [-0.0035225 ],
       [-0.01142518],
       [-0.00349343],
       [-0.00208894],
       [ 0.00586623],
       [ 0.00838983],
       [ 0.00931102],
       [ 0.00285587],
       [ 0.00885141]]), array([[-0.007