<a href="https://colab.research.google.com/github/peter-lang/ml-tutorial/blob/master/02_Backpropagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Network Basics

### Number of neurons in human brain $ \approx 10^{11} $ (100 billion)
### 1 neuron connects to $ \approx 10^4 $ (10 thousand) other neurons
### Number of neural connections: $ \approx 10^{15} $ (1 quadrillion)

<img src="https://upload.wikimedia.org/wikipedia/commons/4/44/Neuron3.png">


<img src="https://drive.google.com/uc?export=view&id=1KDtsK2TVlA3DYRgCoynm0FUE1C8ivK-C">


| Species    | Number of synapses  | Memory (float32) |
|------------|---------------------| ---------------- |
| Roundworm |  $ 10^{5} $ | 400 KB |
| Fruit fly |  $ 10^{7} $ | 40 MB |
| NeuralNets |  $ 10^{5}-10^{9} $ | 400 KB - 4 GB |
| Bee |  $ 10^{9} $ | 4 GB |
| Mouse |  $ 10^{12} $ | 4 TB |
| Cat |  $ 10^{13} $ | 40 TB|
| Human |  $ 10^{15} $ | 4 PB |
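
The memory column follows from multiplying the synapse count by 4 bytes, the size of one float32 value. Below is a minimal sketch of that arithmetic, assuming one stored weight per synapse and decimal units (1 KB = 1000 bytes). The function names are illustrative and not part of the original notebook.

```python
# Assumption: each synapse is stored as a single float32 weight (4 bytes).
BYTES_PER_FLOAT32 = 4

def memory_bytes(num_synapses):
    """Bytes needed to store one float32 weight per synapse."""
    return num_synapses * BYTES_PER_FLOAT32

def human_readable(num_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1000
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000

for name, synapses in [("Roundworm", 10**5), ("Bee", 10**9), ("Human", 10**15)]:
    print(name, human_readable(memory_bytes(synapses)))
```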


# Gradient Descent

### Training sample: $(\mathbf{X}, \mathbf{D})$

### Net parameters: $\theta$

$\mathbf{Y} = \mathit{Net}(\mathbf{X}, \theta)$

$\mathbf{\hat{Y}} = \mathit{Loss}(\mathbf{Y}, \mathbf{D})$

### Optimization task: $$ \underset{\theta}{\arg\min}(\mathbf{\hat{Y}}) $$

### Gradient Descent: $$ \theta_{n+1} = \theta_{n} - \alpha \nabla{\mathbf{\hat{Y}}} $$

### $$ w_{n+1}^{i} = w_n^{i} - \alpha \frac{\partial \mathit{Loss}(\mathit{Net}(\mathbf{X}, \theta), \mathbf{D})}{\partial w_n^{i}} $$

<img src="https://drive.google.com/uc?export=view&id=1n1C6lTtA0vl1dryLGauTWfMGPuVta6Rt">
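
To make the update rule concrete, here is a minimal sketch of gradient descent on a hypothetical one-parameter loss $ (\theta - 3)^2 $. The loss, the starting point, and the step sizes are assumptions chosen only for illustration.

```python
def loss(theta):
    # Hypothetical 1-D loss with its minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # Analytic derivative of the loss above
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, alpha, steps):
    theta = theta0
    for _ in range(steps):
        # theta_{n+1} = theta_n - alpha * gradient
        theta = theta - alpha * grad(theta)
    return theta

# A moderate step size approaches the minimum at theta = 3
theta_small = gradient_descent(theta0=0.0, alpha=0.1, steps=100)
# A step size that is too large overshoots further on every step
theta_large = gradient_descent(theta0=0.0, alpha=1.1, steps=50)
print(theta_small, theta_large)
```

For this particular loss, each step multiplies the distance to the minimum by $ 1 - 2\alpha $, so the iteration shrinks that distance only when $ 0 < \alpha < 1 $.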

### Rule of thumb

__Big steps__: can jump over local minima, but might diverge

__Small steps__: converge, but can get stuck in local minima

__Learning rate schedule__:
  - Start with small steps, so that learning converges
  - __Warmup phase__: gradually increase the step size, to explore the landscape and jump over local minima
  - After finding a "nice" region, gradually decrease the step size, to settle into a local minimum in the "neighbourhood"
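
One possible way to express such a schedule is a linear warmup followed by a linear decay. This is only a sketch: the function name, the default values, and the choice of linear ramps are assumptions, and other shapes (for example cosine or exponential decay) are equally common.

```python
def learning_rate(step, base_lr=0.1, warmup_steps=100, total_steps=1000, min_lr=0.001):
    """Linear warmup from min_lr to base_lr, then linear decay back to min_lr."""
    if step < warmup_steps:
        # Warmup phase: the step size grows gradually
        frac = step / warmup_steps
        return min_lr + frac * (base_lr - min_lr)
    # Decay phase: the step size shrinks gradually
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    frac = min(frac, 1.0)  # stay at min_lr after total_steps
    return base_lr - frac * (base_lr - min_lr)

for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step))
```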

# Gradient Descent At Final Layer

### How to compute in general: $$ w_{n+1}^{i} = w_n^{i} - \alpha \frac{\partial \mathit{Loss}(\mathit{Net}(\mathbf{X}, \theta), \mathbf{D})}{\partial w_n^{i}} $$

### Chain rule for partial derivatives $$ \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial x} $$

### Final layer

<img src="https://drive.google.com/uc?export=view&id=13-rotnmfvdl3kq8DxbtSSzyQbYN1GTOj">


#### Linear layer: $$ \mathit{Linear}(\mathbf{x}) = \sum_j{w_j x_j} $$
#### $$ \frac{\partial \mathit{Linear}(\mathbf{x})}{\partial w_i} = \frac{\partial \sum_j{w_j x_j}}{\partial w_i} = x_i $$
#### Bias: the last weight $ w_n $ has a constant input $ x_n = 1 $, so $$ \frac{\partial \mathit{Linear}(\mathbf{x})}{\partial w_n } = \frac{\partial \sum_j{w_j x_j}}{\partial w_n} = x_n = 1 $$

#### Activation function: $$\varphi(x) = \frac{1}{1+e^{-x}}$$
#### $$ \frac{\partial \varphi}{\partial x} = \varphi(x)(1-\varphi(x)) $$
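
The closed-form derivative can be sanity-checked against a central finite difference. The sketch below assumes NumPy; the sample points and the step `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # phi'(x) = phi(x) * (1 - phi(x))
    y = sigmoid(x)
    return y * (1.0 - y)

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
max_error = np.max(np.abs(numeric - sigmoid_derivative(x)))
print(max_error)
```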

#### Loss function: $$ \mathit{Loss}(y, d) = \frac{1}{2} (y-d)^2 $$
#### $$ \frac{\partial \mathit{Loss}(y, d)}{\partial y} = \frac{1}{2} 2 (y-d) = y - d $$

#### Compound function: $$ \frac{\partial \mathit{Loss}(\varphi(\mathit{Linear}(\mathbf{x})), d)}{\partial w_i} = \underbrace{\frac{\partial \mathit{Loss}(\varphi(\mathit{Linear}(\mathbf{x})), d)}{\partial \varphi(\mathit{Linear}(\mathbf{x}))}}_{\varphi(\mathit{Linear}(\mathbf{x})) - d} 
\underbrace{\frac{\partial \varphi(\mathit{Linear}(\mathbf{x}))}{\partial \mathit{Linear}(\mathbf{x})}}_{\varphi(\mathit{Linear}(\mathbf{x}))(1-\varphi(\mathit{Linear}(\mathbf{x})))} 
\underbrace{\frac{\partial \mathit{Linear}(\mathbf{x})}{\partial w_i}}_{x_i}$$

#### BUT! $ \varphi(\mathit{Linear}(\mathbf{x})) = \mathit{Net}(\mathbf{x}, \theta) = y $ was already computed during the __FORWARD PASS__

#### $$ \frac{\partial \mathit{Loss}(\mathit{Net}(\mathbf{x}, \theta), d)}{\partial w_i} = (y - d)y(1-y)x_i $$

#### Let's also save the following term for later use in the __BACKWARD PASS__ :)

#### $$ \delta =  \frac{\partial \mathit{Loss}(y, d)}{\partial \mathit{Linear}(\mathbf{x})} = (y - d)y(1-y)$$
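
The final-layer formulas above can be checked numerically. The following sketch assumes a single sigmoid neuron with squared-error loss, as in the derivation; the weights, inputs, and target are hypothetical values, and the finite-difference check is added only for verification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w, x):
    # Net(x, theta) = phi(Linear(x))
    return sigmoid(np.dot(w, x))

def loss(y, d):
    return 0.5 * (y - d) ** 2

# Hypothetical values; the last input is fixed to 1, so the last weight acts as a bias.
w = np.array([0.5, -0.3, 0.1])
x = np.array([1.0, 2.0, 1.0])
d = 1.0

y = forward(w, x)              # computed once, then reused below
delta = (y - d) * y * (1 - y)  # dLoss / dLinear
grad = delta * x               # dLoss / dw_i = delta * x_i

# Finite-difference check of every partial derivative
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (loss(forward(w + step, x), d) - loss(forward(w - step, x), d)) / (2 * eps)

print(grad, numeric)
```

Note that `delta` is computed from `y` alone, without evaluating the exponential again, which is why the saved output makes the gradient cheap.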


# Gradient Descent At Any Layer

<img src="https://drive.google.com/uc?export=view&id=1yNyUDqz4pqNDSWurzgRuUp56SaFv8V2l">


## Lessons learnt
- __FORWARD PASS__:
  - Needed to compute the $ Loss(y, d) $ function
  - Save the partial result at each neuron output: $ y_i $ 
- __BACKWARD PASS__: 
  - Needed to compute the gradient with respect to every weight
  - Save the partial derivative at each neuron input: $ \delta_i $
- __Neural Nets & Backpropagation algorithm__: 
  - Given $ N $ weights (a.k.a. parameters)
  - Steps: $ 2N $
    - Forward
    - Backward
  - Memory: $ 3N $
    - Original weight
    - Forward partial result
    - Backward partial result
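
The points above can be sketched on a tiny two-layer sigmoid network. The layer sizes, random weights, and inputs are assumptions for illustration, and the hidden-layer rule used here is the standard backpropagation step (push the output delta back through the weights, then multiply by the local sigmoid derivative), which may use different notation than the figure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # hidden layer: 2 inputs -> 3 neurons
W2 = rng.normal(size=(1, 3))   # output layer: 3 hidden neurons -> 1 output
x = np.array([0.5, -1.0])      # hypothetical input
d = np.array([1.0])            # hypothetical target

# Forward: store the output y_i of every layer
y1 = sigmoid(W1 @ x)
y2 = sigmoid(W2 @ y1)

# Backward: store delta_i at the input of every layer
delta2 = (y2 - d) * y2 * (1 - y2)
delta1 = (W2.T @ delta2) * y1 * (1 - y1)

# Gradient of the loss for each weight = delta at the neuron * input on that weight
grad_W2 = np.outer(delta2, y1)
grad_W1 = np.outer(delta1, x)

print(grad_W1.shape, grad_W2.shape)
```

Each weight matrix is visited once going forward and once going backward, and only the stored `y` and `delta` values are needed to form the gradients.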
  





In [0]:
import numpy as np
import math

In [0]:
class Layer:
  # Base class; next_layer optionally links to the layer that consumes this layer's output
  def __init__(self, next_layer = None):
    self.next_layer = next_layer


class Linear(Layer):
  def __init__(self, weights, next_layer = None):
    super().__init__(next_layer)
    # One weight per input; a bias is a weight whose input is fixed to 1
    self.weights = np.asarray(weights, dtype=float)

  def forward(self, x):
    # Weighted sum of the inputs: sum_j w_j * x_j
    return np.dot(self.weights, x)

  def df(self, x):
    # Derivative of the output with respect to the input x
    return self.weights


class Sigmoid(Layer):
  def forward(self, x):
    return 1/(1+np.exp(-x))

  def df(self, x):
    # Derivative of the sigmoid: y * (1 - y), where y = forward(x)
    y = self.forward(x)
    return y*(1-y)


In [0]:
# Weighted sum 2*1 + 3*1 + 4*2
Linear(np.array([2, 3, 4])).forward(np.array([1, 1, 2]))