# Deep Learning

**Goals**:

* Better understand the link between gradients and backpropagation
* Review backpropagation
* Implement a dense layer and a neural network
* Test the implementation
* Learn batch normalization
* Implement a batch normalization layer


In [1]:
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf

from typing import List, Set, Dict, Tuple, Optional, Union

%matplotlib inline
np.random.seed(1)

import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print(tf.__version__)

2.4.1


## Backpropagation

* https://kevinzakka.github.io/2016/09/14/batch_normalization/
* https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795
* https://medium.com/@a.mirzaei69/implement-a-neural-network-from-scratch-with-python-numpy-backpropagation-e82b70caa9bb





### Implement simple regression first


1. One layer Dense
1. Regression

Steps:
    
* Implement parts 
    1. activations
    1. code one layer 


### Forward propagation

<img src="../images/simple_nn.jpg" width="600"/>

$
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \\
a^{[l]} = g^{[l]}(z^{[l]})
$

<img src="../images/forward_step.png" width="800"/>
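
A minimal NumPy sketch of the two forward equations above. It assumes examples are stored as columns, so activations have shape `(n_features, batch_size)` and the bias broadcasts across the batch. The helper name `dense_forward` and the sample values are illustrative and are not part of the notebook's classes.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU activation
    return np.maximum(0.0, z)

def dense_forward(a_prev, W, b, g):
    """One forward step: z = W a_prev + b, then a = g(z)."""
    z = W @ a_prev + b   # b has shape (n_out, 1) and broadcasts over the batch
    a = g(z)
    return a, z

# Two input features, one output unit, a batch of three examples (one per column)
a_prev = np.array([[1.0, 1.0, 0.0],
                   [2.0, 4.0, 0.0]])
W = np.array([[-1.0, 0.5]])
b = np.array([[0.1]])

a, z = dense_forward(a_prev, W, b, relu)
print(z.shape, a.shape)  # both (1, 3): one output unit, three examples
```

Storing one example per column keeps the code aligned with the $W^{[l]} a^{[l-1]}$ ordering used in the equations.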

### Backward propagation

refs:
* https://sudeepraja.github.io/Neural/ Matrix notation
* https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd


People often confuse backpropagation with gradient descent. Backpropagation computes the gradient of the cost function with respect to every parameter in the network by applying the chain rule layer by layer, from the output back to the input. Gradient descent is the optimization algorithm that uses those gradients to update the parameters and minimize the cost function.


* The cost or objective function:  

$
J(\theta = \{W,b\}) = \frac{1}{m} \sum^m_{i=1} \ell(\hat{y^{(i)}} = h_{\theta}(x^{(i)}), y^{(i)})
$

* The loss function:

$
\ell(\hat{y^{(i)}}, y^{(i)})
$

* The network output (hypothesis), i.e. the activation of the last layer:

$
\hat{y^{(i)}} = h_{\theta}(x^{(i)}) = a^{[L]}
$
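
As a small illustration of how the cost averages the per-example loss, here is a sketch that assumes a squared-error loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$. The notebook does not fix a particular loss, so this choice and the function names are assumptions.

```python
import numpy as np

def squared_error_loss(y_hat, y):
    # Per-example loss l(y_hat, y) = 0.5 * (y_hat - y)^2 (assumed loss)
    return 0.5 * (y_hat - y) ** 2

def cost(y_hat, y):
    # Cost J = mean of the per-example losses over the m examples
    return float(np.mean(squared_error_loss(y_hat, y)))

y_hat = np.array([0.5, 1.0, 0.0])   # predictions for m = 3 examples
y = np.array([1.0, 1.0, 1.0])       # targets

J = cost(y_hat, y)
print(J)
```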

Reviewing the chain rule:

* For function of 1 variable:
$
y =  f(u(x)) \\
\frac{d}{dx}y = \frac{d}{du}y \frac{d}{dx}u
$

* For a function of 2 variables:

$
y =  f(u(x), v(x)) \\
\frac{d}{dx}y = \frac{\partial}{\partial u}y \frac{d}{dx}u + \frac{\partial}{\partial v}y \frac{d}{dx}v
$
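
The chain rule can be checked numerically. The sketch below compares hand-derived derivatives with a central finite difference, first for one intermediate variable ($y = \sin(u)$, $u = x^2$) and then for a function that depends on $x$ through two intermediate variables ($y = u v$, $u = x^2$, $v = \sin(x)$), where the two contributions add. The example functions are arbitrary choices made for illustration.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central finite difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3

# One intermediate variable: y = sin(u), u = x^2
# dy/dx = dy/du * du/dx = cos(x^2) * 2x
analytic_1 = math.cos(x ** 2) * 2 * x
numeric_1 = numerical_derivative(lambda t: math.sin(t ** 2), x)

# Two intermediate variables: y = u * v, u = x^2, v = sin(x)
# dy/dx = dy/du * du/dx + dy/dv * dv/dx = v * 2x + u * cos(x)
u, v = x ** 2, math.sin(x)
analytic_2 = v * 2 * x + u * math.cos(x)
numeric_2 = numerical_derivative(lambda t: t ** 2 * math.sin(t), x)

print(abs(analytic_1 - numeric_1), abs(analytic_2 - numeric_2))
```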

To optimize the cost function $J(W^{[1]},b^{[1]},..., W^{[l]}, b^{[l]}, ...,W^{[L]}, b^{[L]} )$, where $L$ is the number of layers, we need to compute:


$
\frac{\partial}{\partial W^{[l]}}J = \frac{\partial}{\partial z^{[l]}}J \frac{\partial}{\partial W^{[l]}}z^{[l]} \\ 
\frac{\partial}{\partial W^{[l]}}J = \frac{\partial}{\partial z^{[l]}}J a^{[l-1]} \\
\frac{\partial}{\partial b^{[l]}}J = \frac{\partial}{\partial z^{[l]}}J \frac{\partial}{\partial b^{[l]}}z^{[l]} \\ 
\frac{\partial}{\partial b^{[l]}}J = \frac{\partial}{\partial z^{[l]}}J
$


The common term is called the local gradient:

$
\delta^{[l]} = \frac{\partial}{\partial z^{[l]}}J
$
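
To make the role of $\delta^{[l]}$ concrete, the sketch below computes it for a single sigmoid layer and one example, then derives the weight and bias gradients from it and compares the weight gradient with finite differences. It assumes a squared-error loss and column vectors; all names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, a_prev, y):
    # Squared-error loss for one example (assumed loss)
    a = sigmoid(W @ a_prev + b)
    return 0.5 * float(np.sum((a - y) ** 2))

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(3, 1))   # activation of the previous layer
W = rng.normal(size=(2, 3))
b = rng.normal(size=(2, 1))
y = np.array([[1.0], [0.0]])

# Local gradient: dJ/dz = (a - y) * sigmoid'(z), with sigmoid'(z) = a * (1 - a)
a = sigmoid(W @ a_prev + b)
delta = (a - y) * a * (1.0 - a)

dW = delta @ a_prev.T   # dJ/dW = delta * a_prev^T
db = delta              # dJ/db = delta

# Finite-difference check of dW, one entry at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_num[i, j] = (loss(W_plus, b, a_prev, y) - loss(W_minus, b, a_prev, y)) / (2 * eps)

max_err = np.max(np.abs(dW - dW_num))
print(max_err)  # should be tiny if the analytic gradient is right
```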


-----------------


* Forward pass in matrix notation:

https://sudeepraja.github.io/Neural/ Matrix notation

$
J(\theta = \{W,b\}) = \frac{1}{m} \sum^m_{i=1} \ell(\hat{y^{(i)}} = h_{\theta}(x^{(i)}), y^{(i)}) \\
a^{[l]} = g^{[l]}(z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]})
$

* Backwards pass

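Last layer (assuming the squared-error loss $\ell = \frac{1}{2}(a^{[L]} - y)^2$; $\odot$ is the element-wise product)
$
\delta^{[L]} = (a^{[L]} - y) \odot g'^{[L]}(W^{[L]}a^{[L-1]} + b^{[L]})
$

Hidden Layers

$
\delta^{[l]} = (W^{[l+1]})^t \delta^{[l+1]} \odot g'^{[l]}(W^{[l]}a^{[l-1]} + b^{[l]})  
$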

Weight updates:


$
\frac{\partial}{\partial W^{[l]}}J  = \delta^{[l]} (a^{[l-1]})^t \\
W^{[l]} = W^{[l]} - \alpha \frac{\partial}{\partial W^{[l]}}J \\
\frac{\partial}{\partial b^{[l]}}J  = \delta^{[l]} \\
b^{[l]} = b^{[l]} - \alpha \frac{\partial}{\partial b^{[l]}}J
$
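
The backward pass and the parameter updates can be sketched for a small two-layer network. The sketch assumes sigmoid activations in both layers, a squared-error loss, examples stored as columns, and gradients averaged over the batch. The hidden-layer delta uses the transpose of the next layer's weights. The function names, the `params` dictionary, and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    z1 = params["W1"] @ X + params["b1"]
    a1 = sigmoid(z1)
    z2 = params["W2"] @ a1 + params["b2"]
    a2 = sigmoid(z2)
    return a2, (X, a1)

def cost(a2, Y):
    # Mean over the batch of the squared-error loss (assumed loss)
    return 0.5 * float(np.mean(np.sum((a2 - Y) ** 2, axis=0)))

def backward(a2, Y, cache, params):
    X, a1 = cache
    m = X.shape[1]
    delta2 = (a2 - Y) * a2 * (1.0 - a2)                      # last layer
    delta1 = (params["W2"].T @ delta2) * a1 * (1.0 - a1)     # hidden layer
    return {
        "W2": delta2 @ a1.T / m, "b2": delta2.mean(axis=1, keepdims=True),
        "W1": delta1 @ X.T / m,  "b1": delta1.mean(axis=1, keepdims=True),
    }

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))                 # 2 features, 8 examples
Y = (X[:1] + X[1:] > 0).astype(float)       # toy binary targets, shape (1, 8)
params = {"W1": rng.normal(size=(3, 2)) * 0.5, "b1": np.zeros((3, 1)),
          "W2": rng.normal(size=(1, 3)) * 0.5, "b2": np.zeros((1, 1))}

alpha = 0.5   # learning rate
costs = []
for _ in range(200):
    a2, cache = forward(X, params)          # forward pass
    costs.append(cost(a2, Y))
    grads = backward(a2, Y, cache, params)  # backpropagation: compute gradients
    for name in params:                     # gradient descent: update parameters
        params[name] = params[name] - alpha * grads[name]

print(costs[0], costs[-1])
```

Keeping `backward` (gradient computation) separate from the update loop (gradient descent) mirrors the distinction between the two algorithms.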





In [8]:
class Layer():
    
    def _get_activation_func(self,activation_type):
            
        if activation_type == 'relu':
            
            def activation_func(z: np.array):
    
                a = np.maximum(0.0,z)
        
                return a
        else:
            
            def activation_func(z: np.array):
    
                a = 1.0/(1.0 + np.exp(-z))
        
                return a
        
        return activation_func

    def __init__(self, input_size, output_size, activation_type):

        self.activation_type = activation_type
        
        self.input_size = input_size
        self.output_size = output_size
        
        self.activation = self._get_activation_func(activation_type)
            
class DenseLayer(Layer):
    
    def __init__(self, input_size, output_size, activation_type):
        
        super().__init__( input_size, output_size, activation_type)

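    def forward_propagation(self, X_in: np.array, W: np.array, b: np.array) -> Tuple[np.array, np.array]:
        
        # NOTE: the bias b is accepted but not added yet; z = W.dot(X_in) + b would
        # match the forward equation. The outputs recorded below were produced without it.
        z = W.dot(X_in)
        a = self.activation(z)
        
        return a, z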
    
    def  backward_propagation(self, x):
        pass

In [3]:
in_size = 1
out_size = 1
batch_size = 3

l1 = DenseLayer(in_size,out_size,'relu')
l1.activation_type

l2 = DenseLayer(in_size,out_size, 'sigmoid')
l2.activation_type


X = np.array([[1.0, 2.0], 
              [1.0,4.0],
             [0.0, 0.0]])

print(f"batch_size: {X.shape[0]}; in_size: {X.shape[1]}")

W = np.array([-1.0, 0.5]).reshape((1,2))
W.shape
print(f"out_size: {W.shape[0]}; in_size: {W.shape[1]}")


b = np.zeros((1,))

print("Expected Z")
W.dot(X[0,:])
W.dot(X[1,:])
W.dot(X[2,:])

print("Layers output")

for k in range(batch_size):
    
    Xb = X[k,:]

    l1.forward_propagation(Xb,W,b)
    l2.forward_propagation(Xb,W,b)

'relu'

'sigmoid'

batch_size: 3; in_size: 2


(1, 2)

out_size: 1; in_size: 2
Expected Z


array([0.])

array([1.])

array([0.])

Layers output


(array([0.]), array([0.]))

(array([0.5]), array([0.]))

(array([1.]), array([1.]))

(array([0.73105858]), array([1.]))

(array([0.]), array([0.]))

(array([0.5]), array([0.]))

In [4]:
class NN():

    def __init__(self, layers, seed= 2021):
        
        self._seed = seed 
        np.random.seed(seed)

        self.layers = layers
        self.n_layers = len(layers)
        self.params_values = {}

        self._init_layer_parameter_random()
                        
    # Weights and biases are drawn from a normal distribution scaled by 0.01
    def _init_layer_parameter_random(self):
        
        for idx, layer in enumerate(self.layers):
            
            layer_idx = idx + 1
            
            # We need to initialize with small values, because large values fall in regions
            # where sigmoid and tanh functions have small (almost vanishing) gradients
            W = np.random.randn(layer.output_size, layer.input_size) * 0.01
            self.params_values['W' + str(layer_idx)] =  W
            
            b = np.random.randn(layer.output_size, 1) * 0.01
            self.params_values['b' + str(layer_idx)] = b

    def _init_layer_parameter_xavier(self):
        
        pass
        
        
    def predict(self, X_in: np.array) -> np.array:
        
        y, _ = self.forward_propagation(X_in)
        
        return y
     
    def forward_propagation(self, X_in: np.array) -> Tuple[np.array, dict] : 
        # forward propagation
        
        memory = {}
        A_curr = X_in
        
        for idx, layer in enumerate(self.layers):
            
            layer_idx = idx + 1
            A_prev = A_curr
            
            W_curr = self.params_values["W" + str(layer_idx)]
            b_curr = self.params_values["b" + str(layer_idx)]
            
            A_curr, Z_curr = layer.forward_propagation(A_curr, W_curr, b_curr) 
            
            # Need this for back propagation 
            memory["A" + str(idx)] = A_prev  # A0 = X_in
            memory["Z" + str(layer_idx)] = Z_curr
    
        return A_curr, memory
    
    def train(self, X, Y, epochs=3):
        
        cost_history = []
        accuracy_history = []
        
        for e in range(epochs):
            
            y_hat, memory = self.forward_propagation(X)
            
            # NOTE: self.loss, self.accuracy and self.backward_propagation are not implemented yet
            cost = self.loss(y_hat, Y)
            acc = self.accuracy(y_hat, Y)
            
            cost_history.append(cost)
            accuracy_history.append(acc)
            

            grads = self.backward_propagation(y_hat, Y, memory)
        
        

Testing parameter initialization with the previous layers



In [5]:
l1 = DenseLayer(input_size=2,output_size=1,activation_type = 'relu')
l2 = DenseLayer(input_size=1,output_size=1,activation_type = 'sigmoid')

layers = [l1,l2]

nn = NN(layers)
nn.params_values

y, memory = nn.forward_propagation(X[1,:])
y 
memory

{'W1': array([[0.01488609, 0.00676011]]),
 'b1': array([[-0.00418451]]),
 'W2': array([[-0.00806521]]),
 'b2': array([[0.00555876]])}

array([0.49991546])

{'A0': array([1., 4.]),
 'Z1': array([0.04192653]),
 'A1': array([0.04192653]),
 'Z2': array([-0.00033815])}

## Batch normalization  

(Breakthrough in the area)
refs:
* https://towardsdatascience.com/understanding-batch-normalization-with-examples-in-numpy-and-tensorflow-with-interactive-code-7f59bb126642 <= very good and simple.
* https://kevinzakka.github.io/2016/09/14/batch_normalization/
* Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)
    * Authors: Sergey Ioffe (same author of PLDA and works at Google) and Christian Szegedy (Google)
    * https://arxiv.org/pdf/1502.03167.pdf Paper **TODO** Read the paper. It is simple and easy to understand. It is good practice for gaining experience in reading papers.
* https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization TensorFlow doc


refs:
* https://www.deeplearningbook.org/contents/optimization.html
* https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
* https://arxiv.org/pdf/1702.03275.pdf
* https://www.youtube.com/watch?v=nUUqwaxLnWs
* https://arxiv.org/pdf/1805.11604.pdf
* https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795    
* https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

**It is being asked more and more often in job interviews**

Batch normalization is a technique that improves the performance and stability of neural networks by normalizing the inputs of each layer over the mini-batch, so that they have zero mean and unit standard deviation.


Don’t Use With Dropout:

Batch normalization offers some regularization effect, reducing generalization error, perhaps no longer requiring the use of dropout for regularization.


* Inputs are the values of $x$ over a batch: $B = \{x_1, x_2,..., x_i,..., x_m\}$
    * where $m$ is the batch size
* Output: $y_i = BN_{\gamma,\beta}(x_i)$
* Learnable parameters: $\gamma$ and $\beta$
* Normalization:

$
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i \\
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 \\
z_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}} \\
y_i = BN_{\gamma,\beta}(x_i) \equiv \gamma z_i + \beta
$


In [6]:
# Batch
X = np.random.uniform(0,5.0,size=(10))
X

print(f"shape: {X.shape}; mean: {X.mean():.2f}; std: {X.std():.2f}")


array([3.31080257, 3.92155066, 0.48447198, 0.29285643, 4.81197995,
       3.08278722, 0.43314981, 2.80636181, 3.08262354, 4.81921511])

shape: (10,); mean: 2.70; std: 1.64


In [7]:
gamma = 1.0
beta = 0.0
epsilon = 0.0

# because we did not train the layer, we are passing the mean and the variance of the batch
Y = tf.nn.batch_normalization(X,
                    mean = X.mean(axis=0),        # batch mean
                    variance = X.var(axis=0),     # batch var
                    offset = beta,scale = gamma,  # batch beta and gamma See equations  
                    variance_epsilon = epsilon)   # batch epsilon See equations

Y.numpy()

# comparing with numpy

Z = (X - X.mean(axis=0))/np.sqrt(X.var(axis=0) + epsilon)
Y = gamma * Z + beta
Y

# Expected zero mean and unit variance
print(f"shape: {Y.shape}; mean: {Y.mean():.2f}; std: {Y.std():.2f}")

array([ 0.36919292,  0.74114184, -1.35205787, -1.46875279,  1.28341815,
        0.23033032, -1.38331335,  0.06198574,  0.23023064,  1.2878244 ])

array([ 0.36919292,  0.74114184, -1.35205787, -1.46875279,  1.28341815,
        0.23033032, -1.38331335,  0.06198574,  0.23023064,  1.2878244 ])

shape: (10,); mean: -0.00; std: 1.00
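
Building on the batch statistics used above, here is a minimal sketch of a batch normalization layer in NumPy for inputs of shape `(batch, features)`. It normalizes each feature with the batch mean and variance during training and keeps running averages to use at inference time. The class name `BatchNorm1D`, the `momentum` and `epsilon` defaults, and the `training` flag are illustrative assumptions, and the backward pass is not included.

```python
import numpy as np

class BatchNorm1D:
    """Sketch of a batch normalization layer (forward pass only)."""

    def __init__(self, n_features, momentum=0.9, epsilon=1e-5):
        self.gamma = np.ones(n_features)    # learnable scale
        self.beta = np.zeros(n_features)    # learnable shift
        self.momentum = momentum            # assumed smoothing factor for running stats
        self.epsilon = epsilon
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)

    def forward(self, X, training=True):
        if training:
            # Per-feature statistics of the current batch
            mean = X.mean(axis=0)
            var = X.var(axis=0)
            # Exponential moving averages, used later at inference time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        Z = (X - mean) / np.sqrt(var + self.epsilon)
        return self.gamma * Z + self.beta

rng = np.random.default_rng(0)
X_batch = rng.uniform(0.0, 5.0, size=(10, 3))   # 10 examples, 3 features

bn = BatchNorm1D(n_features=3)
Y_train = bn.forward(X_batch, training=True)    # uses batch statistics
Y_infer = bn.forward(X_batch, training=False)   # uses running statistics

print(Y_train.mean(axis=0), Y_train.std(axis=0))  # close to 0 and 1 per feature
```

After a single training step the running statistics are still far from the batch statistics, so `Y_infer` is not expected to be normalized yet.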
