# **SIN 393 – Introduction to Computer Vision (2023)**

# Lecture 05 - Part 2a - Deep Learning

Prof. João Fernando Mari ([*joaofmari.github.io*](https://joaofmari.github.io/))

---

## Importing the required libraries

In [1]:
import numpy as np
import math

## Activation functions
---

In [2]:
def sigmoid(v):
    # Logistic sigmoid, applied element-wise.
    return 1 / (1 + np.exp(-v))

def sigmoid_grad(v):
    # Derivative of the sigmoid: s(v) * (1 - s(v)).
    s = sigmoid(v)
    return s * (1 - s)

In [3]:
def relu(v):
    return np.maximum(0, v)

def relu_grad(v):
    return np.greater(v, 0).astype(float)    
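
The analytic derivatives above can be sanity-checked against central finite differences. This is a minimal sketch, not part of the original lecture: `central_diff` is a hypothetical helper, and the test points are arbitrary values chosen away from 0, where ReLU is not differentiable.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1 - s)

def relu(v):
    return np.maximum(0, v)

def relu_grad(v):
    return np.greater(v, 0).astype(float)

def central_diff(f, v, eps=1e-6):
    # Hypothetical helper: central finite-difference estimate of df/dv.
    return (f(v + eps) - f(v - eps)) / (2 * eps)

# Arbitrary test points, none of them at 0.
v = np.array([-2.0, -0.5, 0.28, 0.31, 1.5])

sigmoid_err = np.max(np.abs(sigmoid_grad(v) - central_diff(sigmoid, v)))
relu_err = np.max(np.abs(relu_grad(v) - central_diff(relu, v)))

print(f'max sigmoid gradient error = {sigmoid_err:.2e}')
print(f'max ReLU gradient error    = {relu_err:.2e}')
```

Both errors should be close to floating-point noise if the gradient functions are correct.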

## Loss functions
---

In [4]:
def loss_mse(y, y_hat):
    # Squared error of each sample scaled by 1/N; summing the terms gives the MSE.
    return (1 / len(y)) * (y - y_hat)**2

def loss_mse_grad(y, y_hat):
    # Derivative of the MSE with respect to y_hat.
    N = len(y)
    return -(2 / N) * (y - y_hat)
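
The gradient of the loss can be checked the same way. The sketch below assumes the total loss is the sum of the per-sample terms returned by `loss_mse`; the vectors `y` and `y_hat` hold illustrative values that do not come from the lecture.

```python
import numpy as np

def loss_mse(y, y_hat):
    return (1 / len(y)) * (y - y_hat)**2

def loss_mse_grad(y, y_hat):
    N = len(y)
    return -(2 / N) * (y - y_hat)

# Illustrative values (assumptions, not taken from the lecture).
y = np.array([1.0, 0.0, 0.5])
y_hat = np.array([0.7, 0.2, 0.9])

eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    step = np.zeros_like(y_hat)
    step[i] = eps
    # Total loss = sum of the per-sample terms returned by loss_mse.
    L_plus = loss_mse(y, y_hat + step).sum()
    L_minus = loss_mse(y, y_hat - step).sum()
    numeric[i] = (L_plus - L_minus) / (2 * eps)

analytic = loss_mse_grad(y, y_hat)
print(f'analytic = {np.around(analytic, 4)}')
print(f'numeric  = {np.around(numeric, 4)}')
```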

## Architecture and hyperparameters
---

### Inputs and outputs

In [5]:
x = np.array([[0.3]])
y = np.array([1.0])

### Weights and bias

In [6]:
W0 = np.array([[0.1, 0.2]])

b0 = np.array([0.25, 0.25])

W1 = np.array([[0.5],
               [0.6]])

b1 = np.array([0.35])

![title](figures/nn02_ok.png)

### Hyperparameters

In [7]:
LEARNING_RATE = 0.01

# Number of decimal places used when printing
DEC = 4  # PyTorch and MATLAB print 4 decimal places by default

## Forward pass
---

$\mathbf{v}^0 = \mathbf{x}\mathbf{W}^{0} + \mathbf{b}^0$

$\mathbf{y}^0 = \sigma(\mathbf{v}^0)$

$\mathbf{v}^{1} = \mathbf{y}^{0}\mathbf{W}^{1} + \mathbf{b}^{1}$

$\mathbf{\hat{y}} = \sigma(\mathbf{v}^1)$


In [8]:
v0 = np.dot(x, W0) + b0
print(f'v0 = {np.around(v0, DEC)}')

y0 = sigmoid(v0)
print(f'y0 = {np.around(y0, DEC)}')

v1 = np.dot(y0, W1) + b1
print(f'v1 = {np.around(v1, DEC)}')

y_hat = sigmoid(v1)
print(f'y_hat = {np.around(y_hat, DEC)}')

v0 = [[0.28 0.31]]
y0 = [[0.5695 0.5769]]
v1 = [[0.9809]]
y_hat = [[0.7273]]
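
The four forward-pass equations can also be wrapped in a single function. This is a sketch only: `forward` is a hypothetical helper that is not used elsewhere in the lecture, and it returns every intermediate value because backpropagation reuses them.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def forward(x, W0, b0, W1, b1):
    # Hypothetical helper: returns v0, y0 and v1 as well as y_hat,
    # because backpropagation reuses the intermediate values.
    v0 = x @ W0 + b0
    y0 = sigmoid(v0)
    v1 = y0 @ W1 + b1
    y_hat = sigmoid(v1)
    return v0, y0, v1, y_hat

x = np.array([[0.3]])
W0 = np.array([[0.1, 0.2]])
b0 = np.array([0.25, 0.25])
W1 = np.array([[0.5], [0.6]])
b1 = np.array([0.35])

v0, y0, v1, y_hat = forward(x, W0, b0, W1, b1)
print(f'y_hat = {np.around(y_hat, 4)}')
```

With the weights defined earlier, the printed value should agree with the `y_hat` obtained step by step.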


In [9]:
### L = loss_mse(y, y_hat).mean()
L = loss_mse(y, y_hat)
print(f'L = {np.around(L, DEC)}')

L = [[0.0744]]


## Backpropagation
---

### Layer 1

* We need to find $\frac{\partial{L}}{\partial{\mathbf{W}^1}}$ to update the weights, $\mathbf{W}^1$, through Gradient Descent.
* We can compute $\frac{\partial{L}}{\partial{\mathbf{W}^1}}$ using the chain rule:

$$\frac{\partial{L}}{\partial{\mathbf{W^1}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{W^1}}}$$

* We also need to find $\frac{\partial{L}}{\partial{\mathbf{b^1}}}$ to update the bias, $\mathbf{b}^1$, through Gradient Descent.
* We can compute $\frac{\partial{L}}{\partial{\mathbf{b^1}}}$ using the chain rule:
$$\frac{\partial{L}}{\partial{\mathbf{b^1}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{b^1}}}$$
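
Written out for this network, and assuming the MSE loss and the sigmoid output defined earlier, the individual factors are:

$$\frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} = -\frac{2}{N}(\mathbf{y} - \mathbf{\hat{y}}), \qquad \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} = \sigma(\mathbf{v}^1)\,(1 - \sigma(\mathbf{v}^1)) = \mathbf{\hat{y}}\,(1 - \mathbf{\hat{y}}), \qquad \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{W^1}}} = (\mathbf{y}^0)^T, \qquad \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{b^1}}} = 1$$

With the forward-pass values ($N = 1$, $y = 1$, $\hat{y} \approx 0.7273$):

$$\frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \approx -2 \times (1 - 0.7273) = -0.5454, \qquad \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \approx 0.7273 \times 0.2727 \approx 0.1983$$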

#### Weights

$$\frac{\partial{L}}{\partial{\mathbf{W^1}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{W^1}}}$$

In [10]:
# ∂L/∂y^ 
dL_dyhat = loss_mse_grad(y, y_hat)
print(f'∂L/∂y^ = {np.around(dL_dyhat, DEC)}')

# ∂y^/∂v1
dyhat_dv1 = sigmoid_grad(v1)
print(f'\n∂y^/∂v1 = {np.around(dyhat_dv1, DEC)}')

# ∂v1/∂W1
dv1_dW1 = np.hstack([y0[np.newaxis].T] * len(dyhat_dv1))
print(f'\n∂v1/∂W1 = \n{np.around(dv1_dW1, DEC)}')

# ∂L/∂W1 = ∂L/∂y^ * ∂y^/∂v1 * ∂v1/∂W1
# -----------------------------------
dL_dW1 = dL_dyhat * dyhat_dv1 * dv1_dW1
print(f'\n∂L/∂W1 = \n{np.around(dL_dW1, DEC)}')

∂L/∂y^ = [[-0.5454]]

∂y^/∂v1 = [[0.1983]]

∂v1/∂W1 = 
[[[0.5695]]

 [[0.5769]]]

∂L/∂W1 = 
[[[-0.0616]]

 [[-0.0624]]]


#### Bias

$$\frac{\partial{L}}{\partial{\mathbf{b^1}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{b^1}}}$$

In [11]:
# ∂v1/∂b1 = 1
# The bias input is fixed at 1, so this derivative is 1.

# ∂L/∂b1 = ∂L/∂y^ * ∂y^/∂v1 * ∂v1/∂b1
# -----------------------------------
dL_db1 = dL_dyhat * dyhat_dv1 
print(f'∂L/∂b1 = {np.around(dL_db1, DEC)}')

∂L/∂b1 = [[-0.1082]]


### Layer 0

* We need to find $\frac{\partial{L}}{\partial{\mathbf{W}^0}}$ to update the weights, $\mathbf{W}^0$, through Gradient Descent.
* We can compute $\frac{\partial{L}}{\partial{\mathbf{W}^0}}$ using the chain rule:

$$\frac{\partial{L}}{\partial{\mathbf{W}^0}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{y^0}}} \times \frac{\partial{\mathbf{y^0}}}{\partial{\mathbf{v^0}}} \times \frac{\partial{\mathbf{v^0}}}{\partial{\mathbf{W}^0}}$$

* Simplifying to use the already calculated values:

$$\frac{\partial{L}}{\partial{\mathbf{W}^0}} = \frac{\partial{L}}{\partial{\mathbf{y^0}}} \times \frac{\partial{\mathbf{y^0}}}{\partial\mathbf{{v^0}}} \times \frac{\partial{\mathbf{v^0}}}{\partial{\mathbf{W}^0}}$$

* where:
$$\frac{\partial{L}}{\partial{\mathbf{y^0}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{y^0}}}$$

* in which $\frac{\partial{L}}{\partial{\mathbf{\hat{y}}}}$ and $\frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}}$ have already been calculated, and:

$$\frac{\partial{\mathbf{v}^1}}{\partial{\mathbf{y}^0}} = \mathbf{W}^1$$

* We need to find $\frac{\partial{L}}{\partial{\mathbf{b}^0}}$ to update the bias, $\mathbf{b}^0$, through Gradient Descent.
* We can compute $\frac{\partial{L}}{\partial{\mathbf{b}^0}}$ using the chain rule:

$$\frac{\partial{L}}{\partial{\mathbf{b}^0}} = \frac{\partial{L}}{\partial{\mathbf{y^0}}} \times \frac{\partial{\mathbf{y^0}}}{\partial\mathbf{{v^0}}} \times \frac{\partial{\mathbf{v^0}}}{\partial{\mathbf{b}^0}}$$
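
The chain-rule products for both layers can also be written with matrix products, so that every gradient has the same shape as the parameter it updates. The sketch below is one possible formulation, not the lecture's step-by-step code: the names `delta1` and `delta0` are introduced here, and the sigmoid derivative is written as $y(1 - y)$ using the activations already computed in the forward pass.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

x = np.array([[0.3]])
y = np.array([1.0])
W0 = np.array([[0.1, 0.2]])
b0 = np.array([0.25, 0.25])
W1 = np.array([[0.5], [0.6]])
b1 = np.array([0.35])

# Forward pass
y0 = sigmoid(x @ W0 + b0)        # shape (1, 2)
y_hat = sigmoid(y0 @ W1 + b1)    # shape (1, 1)

# Layer 1: delta1 = ∂L/∂y^ * ∂y^/∂v1
delta1 = -(2 / len(y)) * (y - y_hat) * y_hat * (1 - y_hat)   # shape (1, 1)
dL_dW1 = y0.T @ delta1           # shape (2, 1), same as W1
dL_db1 = delta1.sum(axis=0)      # shape (1,), same as b1

# Layer 0: delta0 = ∂L/∂y0 * ∂y0/∂v0, with ∂L/∂y0 = delta1 @ W1.T
delta0 = (delta1 @ W1.T) * y0 * (1 - y0)   # shape (1, 2)
dL_dW0 = x.T @ delta0            # shape (1, 2), same as W0
dL_db0 = delta0.sum(axis=0)      # shape (2,), same as b0

for name, grad in [('W1', dL_dW1), ('b1', dL_db1), ('W0', dL_dW0), ('b0', dL_db0)]:
    print(f'∂L/∂{name} = \n{np.around(grad, 4)}\n')
```

Summing `delta1` and `delta0` over axis 0 accumulates the bias gradient over the samples in a batch; with the single sample used here it only removes the batch dimension.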

#### Weights

$$\frac{\partial{\mathbf{v}^1}}{\partial{\mathbf{y}^0}} = \mathbf{W}^1$$

In [12]:
# ∂v1/∂y0
dv1_dy0 = W1
print(f'∂v1/∂y0 = \n{np.around(dv1_dy0, DEC)}')

∂v1/∂y0 = 
[[0.5]
 [0.6]]


$$\frac{\partial{L}}{\partial{\mathbf{y^0}}} = \frac{\partial{L}}{\partial{\mathbf{\hat{y}}}} \times \frac{\partial{\mathbf{\hat{y}}}}{\partial{\mathbf{v^1}}} \times \frac{\partial{\mathbf{v^1}}}{\partial{\mathbf{y^0}}}$$

In [13]:
# ∂L/∂y0 = ∂L/∂y^ * ∂y^/∂v1 * ∂v1/∂y0
# -----------------------------------
# Contribution of each layer-1 output (one column per output)
dL_dy0_ = dL_dyhat * dyhat_dv1 * dv1_dy0
print(f'∂L/∂y0 (per layer-1 output) = \n{np.around(dL_dy0_, DEC)}')

# Summing the contributions over the layer-1 outputs
dL_dy0 = dL_dy0_.sum(axis=1)
print(f'\n∂L/∂y0 = {np.around(dL_dy0, DEC)}')

∂L/∂y0 (per layer-1 output) = 
[[-0.0541]
 [-0.0649]]

∂L/∂y0 = [-0.0541 -0.0649]


$$\frac{\partial{L}}{\partial{\mathbf{W}^0}} = \frac{\partial{L}}{\partial{\mathbf{y^0}}} \times \frac{\partial{\mathbf{y^0}}}{\partial\mathbf{{v^0}}} \times \frac{\partial{\mathbf{v^0}}}{\partial{\mathbf{W}^0}}$$

In [14]:
# ∂y0/∂v0
dy0_dv0 = sigmoid_grad(v0)
print(f'∂y0/∂v0 = {np.around(dy0_dv0, DEC)}')

# ∂v0/∂W0
dv0_dW0 = np.hstack([x[np.newaxis].T] * len(dy0_dv0))
print(f'\n∂v0/∂W0 = \n{np.around(dv0_dW0, DEC)}')

# ∂L/∂W0 = ∂L/∂y0 * ∂y0/∂v0 * ∂v0/∂W0
# ------------------------------------
dL_dW0 = dL_dy0 * dy0_dv0 * dv0_dW0
print(f'\n∂L/∂W0 = \n{np.around(dL_dW0, DEC)}')

∂y0/∂v0 = [[0.2452 0.2441]]

∂v0/∂W0 = 
[[[0.3]]]

∂L/∂W0 = 
[[[-0.004  -0.0048]]]


#### Bias

$$\frac{\partial{L}}{\partial{\mathbf{b}^0}} = \frac{\partial{L}}{\partial{\mathbf{y^0}}} \times \frac{\partial{\mathbf{y^0}}}{\partial\mathbf{{v^0}}} \times \frac{\partial{\mathbf{v^0}}}{\partial{\mathbf{b}^0}}$$

In [15]:
# ∂v0/∂b0 = 1
# The bias input is fixed at 1, so this derivative is 1.

# ∂L/∂b0 = ∂L/∂y0 * ∂y0/∂v0 
# -----------------------------------
dL_db0 = dL_dy0 * dy0_dv0 
print(f'\n∂L/∂b0 = {np.around(dL_db0, DEC)}')


∂L/∂b0 = [[-0.0133 -0.0158]]
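
A common sanity check for hand-derived gradients is to compare them with central finite differences of the loss. The sketch below rebuilds the network from the values defined earlier and perturbs one parameter entry at a time; `total_loss` and `numeric_grad` are hypothetical helper names that do not appear in the lecture.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def total_loss(x, y, W0, b0, W1, b1):
    # Forward pass followed by the MSE loss, reduced to a scalar.
    y0 = sigmoid(x @ W0 + b0)
    y_hat = sigmoid(y0 @ W1 + b1)
    return np.sum((y - y_hat)**2) / len(y)

x = np.array([[0.3]])
y = np.array([1.0])
W0 = np.array([[0.1, 0.2]])
b0 = np.array([0.25, 0.25])
W1 = np.array([[0.5], [0.6]])
b1 = np.array([0.35])

def numeric_grad(param, eps=1e-6):
    # Hypothetical helper: perturbs one entry of `param` at a time (in place,
    # restoring it afterwards) and estimates ∂L/∂param with central differences.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        L_plus = total_loss(x, y, W0, b0, W1, b1)
        param[idx] = original - eps
        L_minus = total_loss(x, y, W0, b0, W1, b1)
        param[idx] = original
        grad[idx] = (L_plus - L_minus) / (2 * eps)
    return grad

for name, param in [('W1', W1), ('b1', b1), ('W0', W0), ('b0', b0)]:
    print(f'numeric ∂L/∂{name} = \n{np.around(numeric_grad(param), 4)}\n')
```

The numeric estimates should agree, to four decimal places, with the analytic gradients printed in the backpropagation cells.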


## Gradient descent
---

In [16]:
# Each gradient is reshaped to the shape of its parameter, so that
# broadcasting cannot change the shape of the updated arrays.
W1 = W1 - LEARNING_RATE * dL_dW1.reshape(W1.shape)
b1 = b1 - LEARNING_RATE * dL_db1.reshape(b1.shape)

W0 = W0 - LEARNING_RATE * dL_dW0.reshape(W0.shape)
b0 = b0 - LEARNING_RATE * dL_db0.reshape(b0.shape)

print(f'W0 = \n{np.around(W0, DEC)}')
print(f'\nb0 = \n{np.around(b0, DEC)}')

print(f'\nW1 = \n{np.around(W1, DEC)}')
print(f'\nb1 = \n{np.around(b1, DEC)}')

W0 = 
[[0.1 0.2]]

b0 = 
[0.2501 0.2502]

W1 = 
[[0.5006]
 [0.6006]]

b1 = 
[0.3511]
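
The lecture performs a single update. Repeating the forward pass, backpropagation and the update in a loop gives a minimal training procedure. The sketch below is an illustration under stated assumptions: the number of steps (`N_STEPS`) is an arbitrary choice, and the gradients are computed in matrix form rather than with the step-by-step variables used above.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

x = np.array([[0.3]])
y = np.array([1.0])
W0 = np.array([[0.1, 0.2]])
b0 = np.array([0.25, 0.25])
W1 = np.array([[0.5], [0.6]])
b1 = np.array([0.35])

LEARNING_RATE = 0.01
N_STEPS = 1000   # assumption: arbitrary number of updates

losses = []
for step in range(N_STEPS):
    # Forward pass
    y0 = sigmoid(x @ W0 + b0)
    y_hat = sigmoid(y0 @ W1 + b1)
    losses.append(np.sum((y - y_hat)**2) / len(y))

    # Backpropagation (both deltas use the weights from before the update)
    delta1 = -(2 / len(y)) * (y - y_hat) * y_hat * (1 - y_hat)
    delta0 = (delta1 @ W1.T) * y0 * (1 - y0)

    # Gradient descent update
    W1 = W1 - LEARNING_RATE * (y0.T @ delta1)
    b1 = b1 - LEARNING_RATE * delta1.sum(axis=0)
    W0 = W0 - LEARNING_RATE * (x.T @ delta0)
    b0 = b0 - LEARNING_RATE * delta0.sum(axis=0)

print(f'loss at step 0 = {losses[0]:.4f}')
print(f'loss at step {N_STEPS - 1} = {losses[-1]:.4f}')
```

The first recorded loss corresponds to the value of `L` computed after the forward pass, and the loss should decrease as the updates are repeated.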

