# Deep Neural Network Step by Step

**Notation**:
- Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer. 
- Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example. 
- Subscript $i$ denotes the $i^{th}$ entry of a vector.

<a name='0'></a>
## Packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

<a name='1'></a>
## 1 - Initialization

In [None]:
def initialize_parameters(layer_dims, initialization):
    """
    Arguments:
    layer_dims -- list containing the dimensions of each layer in our network
    initialization -- the initialization method: "random" or "he"
    
    Returns:
    parameters -- python dictionary containing the parameters
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    
    parameters = {}
    L = len(layer_dims)
        
    if initialization == "random":
        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])     
            parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
    elif initialization == "he":
        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2/layer_dims[l-1])
            parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
    return parameters
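As a quick sanity check of the He scaling, the sketch below draws one weight matrix with each scheme and compares the empirical standard deviation of the entries. The layer sizes, the seed, and the use of `np.random.default_rng` are assumptions made for this illustration; the function above relies on the global `np.random` state instead.

```python
import numpy as np

# Hypothetical layer sizes, chosen only for this illustration.
n_prev, n_curr = 500, 400
rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

# "random": entries drawn from a standard normal, so the std is close to 1.
W_random = rng.standard_normal((n_curr, n_prev))

# "he": the same kind of draw scaled by sqrt(2 / fan_in).
W_he = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2 / n_prev)

print(W_random.std(), W_he.std(), np.sqrt(2 / n_prev))
```

The He-scaled weights shrink as the previous layer grows, which helps keep the pre-activations of ReLU layers at a similar scale from one layer to the next.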

<a name='2'></a>
## 2 - Forward Propagation

<a name='2-1'></a>
### 2.1 - Linear Forward 
The linear forward module (vectorized over all the examples) computes the following equations:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{1}$$

where $A^{[0]} = X$. 

In [None]:
def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer; numpy array of shape (size of previous layer, number of examples)
    W -- weights matrix; numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector; numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """

    Z = np.dot(W,A) + b
    cache = (A, W, b)
    
    return Z, cache
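To make equation (1) concrete, here is a small worked example with made-up numbers. It repeats the single line of `linear_forward` inline so that it runs on its own, and shows how the bias of shape `(2, 1)` is broadcast across the example columns.

```python
import numpy as np

# Toy layer: 2 units in the previous layer, 2 units in the current one, m = 3 examples.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_prev = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
b = np.array([[10.0],
              [20.0]])

# b has shape (2, 1); NumPy broadcasts it across the 3 example columns.
Z = np.dot(W, A_prev) + b
print(Z)
# [[11. 12. 13.]
#  [23. 24. 27.]]
```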

<a name='2-2'></a>
### 2.2 - Linear-Activation Forward

We'll use only the Sigmoid and ReLU activation functions.

In [None]:
def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy
    
    Arguments:
    Z -- numpy array; any shape
    
    Returns:
    A -- output of sigmoid(Z); same shape as Z
    cache -- Z, stored for computing the backward pass efficiently
    """
    
    A = 1/(1 + np.exp(-Z))
    cache = Z
    
    return A, cache
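For very negative entries of `Z`, `np.exp(-Z)` overflows to `inf` and NumPy emits a runtime warning, even though the result still rounds to 0. The sketch below is one possible way to avoid the overflow; `stable_sigmoid` is a hypothetical helper, not part of the notebook, and it assumes an array input with at least one dimension.

```python
import numpy as np

def stable_sigmoid(Z):
    """Sigmoid that only ever exponentiates non-positive numbers (hypothetical helper)."""
    Z = np.asarray(Z, dtype=float)
    out = np.empty_like(Z)
    pos = Z >= 0
    # For Z >= 0, exp(-Z) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-Z[pos]))
    # For Z < 0, use the equivalent form exp(Z) / (1 + exp(Z)).
    ez = np.exp(Z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

A = stable_sigmoid(np.array([[-1000.0, 0.0, 1000.0]]))
print(A)
```

Both branches compute the same function, so this variant can be swapped in without changing the backward pass.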

In [None]:
def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer; any shape

    Returns:
    A -- post-activation value; same shape as Z
    cache -- Z, stored for computing the backward pass efficiently
    """
    
    A = np.maximum(0,Z)
    cache = Z 
    
    return A, cache

For convenience, we group the linear and activation functions into a single function.

In [None]:
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer; numpy array of shape (size of previous layer, number of examples)
    W -- weights matrix; numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector; numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer; stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function
    cache -- a python tuple containing "linear_cache" and "activation_cache"
    """
    
    # The linear step is the same for both activations.
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError(f"Unsupported activation: {activation!r}")

    cache = (linear_cache, activation_cache)

    return A, cache

<a name='2-3'></a>
### 2.3 - Forward Pass

In [None]:
def forward_pass(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters()
    
    Returns:
    AL -- activation value from the output (last) layer
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2                 

    for l in range(1, L):
        A_prev = A 
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], 'relu')
        caches.append(cache)

    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], 'sigmoid')
    caches.append(cache)
          
    return AL, caches
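The following sketch traces the array shapes through a forward pass of the same [LINEAR->RELU]*(L-1) -> LINEAR -> SIGMOID structure. It is a compact standalone version written for illustration: the layer sizes, the number of examples, and the seed are assumptions, and the caches are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
layer_dims = [4, 3, 2, 1]   # hypothetical sizes: 4 inputs, two hidden layers, 1 output
m = 5                       # number of examples

A = rng.standard_normal((layer_dims[0], m))   # plays the role of X
L = len(layer_dims) - 1                       # number of layers with parameters

for l in range(1, L + 1):
    W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * np.sqrt(2 / layer_dims[l - 1])
    b = np.zeros((layer_dims[l], 1))
    Z = W @ A + b
    # ReLU on the hidden layers, sigmoid on the output layer.
    A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))
    print(f"layer {l}: W {W.shape}, Z {Z.shape}, A {A.shape}")

AL = A
```

The number of columns stays equal to the number of examples throughout, while the number of rows follows `layer_dims`.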

<a name='3'></a>
## 3 - Cost Function

We'll implement the cross-entropy cost $J$, using the following formula: 
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))\tag{2}$$


In [None]:
def compute_cost(AL, Y):
    """
    Implement the cross-entropy cost function.

    Arguments:
    AL -- activation value from the last layer, shape of (1, number of examples)
    Y -- label vector, shape of (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    
    m = Y.shape[1]
    cost = -np.sum(Y*np.log(AL) + (1 - Y)*(np.log(1 - AL)))/m   
    cost = np.squeeze(cost)

    return cost
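If the sigmoid saturates and an entry of `AL` is exactly 0 or 1, the logarithms in equation (2) produce `inf` or `nan`. One common workaround, sketched below, is to clip `AL` slightly away from the boundaries before taking the logarithm. The function name `compute_cost_clipped` and the value of `eps` are assumptions made for this example.

```python
import numpy as np

def compute_cost_clipped(AL, Y, eps=1e-12):
    """Cross-entropy cost with AL clipped to [eps, 1 - eps] (hypothetical variant)."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)   # keeps both logarithms finite
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

Y = np.array([[1, 0]])
cost_half = compute_cost_clipped(np.array([[0.5, 0.5]]), Y)   # equals log(2)
cost_edge = compute_cost_clipped(np.array([[0.0, 1.0]]), Y)   # large but finite
print(cost_half, cost_edge)
```

With both predictions at 0.5 the cost is $\log 2 \approx 0.693$, the value expected from a model that has not learned anything yet.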

<a name='4'></a>
## 4 - Backward Propagation
<a name='4-1'></a>

### 4.1 - Linear Backward
Through backpropagation and the chain rule, we can compute $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$ using $dZ^{[l]}$.
<br>

$$ dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{3}$$
<br>

$$ db^{[l]} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{4}$$
<br>

$$ dA^{[l-1]} = W^{[l] T} dZ^{[l]} \tag{5}$$

In [None]:
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = np.dot(dZ,A_prev.T)/m
    db = np.sum(dZ, axis=1, keepdims=True)/m
    dA_prev = np.dot(W.T,dZ)
    
    return dA_prev, dW, db
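Equations (3) and (4) can be checked numerically with finite differences. The sketch below does this for a single linear layer under an assumed toy cost, $J = \frac{1}{2m}\sum Z^2$, for which $dZ = Z$. The cost, the shapes, and the seed are all made up for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
A_prev = rng.standard_normal((3, 4))   # 3 input units, m = 4 examples
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
m = A_prev.shape[1]

def toy_cost(W, b):
    # Assumed toy cost: J = (1 / 2m) * sum(Z**2), so that dZ = Z.
    Z = W @ A_prev + b
    return 0.5 * np.sum(Z ** 2) / m

# Analytic gradients from equations (3) and (4).
dZ = W @ A_prev + b
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

def numeric_grad(f, P, h=1e-6):
    # Central finite differences, one entry of P at a time.
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (f(P + E) - f(P - E)) / (2 * h)
    return G

dW_num = numeric_grad(lambda P: toy_cost(P, b), W)
db_num = numeric_grad(lambda P: toy_cost(W, P), b)
max_diff = max(np.max(np.abs(dW - dW_num)), np.max(np.abs(db - db_num)))
print(max_diff)
```

A difference many orders of magnitude below the size of the gradients indicates that the analytic and numerical values agree.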

<a name='4-2'></a>
### 4.2 - Linear-Activation Backward

Next, we implement the backward step for the two activation functions we are using.

In [None]:
def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- where we store 'Z' for computing the backward propagation

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    
    Z = cache
    dZ = np.array(dA, copy=True) 
    dZ[Z <= 0] = 0
    
    return dZ
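A small example with made-up values shows the effect of the ReLU backward step: the incoming gradient passes through where $Z > 0$ and is set to zero elsewhere, including at $Z = 0$. The one-line mask below is an equivalent formulation written for this illustration.

```python
import numpy as np

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.array([[5.0, 6.0, 7.0]])

# Same rule as relu_backward: keep dA where Z > 0, zero it elsewhere.
dZ = dA * (Z > 0)
print(dZ)
# [[0. 0. 7.]]
```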

In [None]:
def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- where we store 'Z' for computing the backward propagation

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    
    Z = cache
    s = 1/(1 + np.exp(-Z))
    dZ = dA*s*(1 - s)
    
    return dZ

For convenience, we group the linear and activation backward steps into a single function.

In [None]:
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR -> ACTIVATION layer.
    
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we stored for computing backward propagation
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:
        raise ValueError(f"Unsupported activation: {activation!r}")

    # The linear step is the same for both activations.
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db

<a name='4-3'></a>
### 4.3 - Backward Pass

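To initialize the backward pass we need $dA^{[L]}$, the derivative of the per-example loss $-\big(y\log a + (1-y)\log(1-a)\big)$ from equation (2) with respect to $A^{[L]}$. The $\frac{1}{m}$ factor of the cost is applied later, in `linear_backward`.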

$$dA^{[L]} = -\bigg(\frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}}\bigg) \tag{6}$$
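Combining equation (6) with the sigmoid derivative $A^{[L]}(1 - A^{[L]})$ gives a simpler expression for the output layer: $dZ^{[L]} = dA^{[L]} * A^{[L]}(1 - A^{[L]}) = A^{[L]} - Y$. The sketch below checks this identity numerically; the pre-activations, labels, and seed are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((1, 6))            # made-up pre-activations of the output layer
Y = np.array([[1, 0, 1, 1, 0, 0]])         # made-up binary labels
AL = 1 / (1 + np.exp(-Z))

dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))   # equation (6)
dZ_chain = dAL * AL * (1 - AL)                         # chain rule through the sigmoid
dZ_short = AL - Y                                      # simplified form

print(np.allclose(dZ_chain, dZ_short))
```

The simplified form avoids the divisions by $A^{[L]}$ and $1 - A^{[L]}$, which become unstable when the output saturates near 0 or 1.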

In [None]:
def backward_pass(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (forward_pass())
    Y -- true label vector (0 or 1 for each example), same number of entries as AL
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    
    grads = {}
    L = len(caches)
    Y = Y.reshape(AL.shape)  # make sure Y has the same shape as AL
    
    # Initializing the backpropagation
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    current_cache = caches[L-1]
    dA_prev, dW, db = linear_activation_backward(dAL, current_cache, 'sigmoid')
    
    grads["dA" + str(L-1)] = dA_prev
    grads["dW" + str(L)] = dW
    grads["db" + str(L)] = db
    
    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev, dW, db = linear_activation_backward(grads["dA" + str(l+1)], current_cache, 'relu')
        
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l+1)] = dW
        grads["db" + str(l+1)] = db
        
    return grads

<a name='4-4'></a>
### 4.4 - Update Parameters

Finally, we update the parameters of the model using gradient descent: 

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{7}$$

$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{8}$$

where $\alpha$ is the learning rate.

In [None]:
def update_parameters(params, grads, learning_rate):
    """
    Update the parameters using gradient descent
    
    Arguments:
    params -- python dictionary containing the parameters 
    grads -- python dictionary containing the gradients, output of backward_pass()
    learning_rate -- step size alpha used in the gradient descent update
    
    Returns:
    parameters -- python dictionary containing the updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    parameters = params.copy()
    L = len(parameters) // 2 

    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db" + str(l+1)]

    return parameters
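As an illustration of equations (7) and (8), the sketch below applies one update step to a single-layer dictionary with made-up values. It also shows why the shallow `copy()` is enough here: each key is rebound to a new array, so the arrays held by the input dictionary are not modified.

```python
import numpy as np

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.0]])}
grads = {"dW1": np.array([[0.5, -1.0]]), "db1": np.array([[1.0]])}
learning_rate = 0.1

# Same update as equations (7) and (8), written for a single layer.
updated = params.copy()   # shallow copy: new dict, same arrays
updated["W1"] = updated["W1"] - learning_rate * grads["dW1"]   # rebinds the key to a new array
updated["b1"] = updated["b1"] - learning_rate * grads["db1"]

print(updated["W1"], updated["b1"])   # W1 becomes [[0.95, 2.1]], b1 becomes [[-0.1]]
print(params["W1"], params["b1"])     # the input dictionary still holds the old arrays
```

An in-place update such as `updated["W1"] -= ...` would behave differently, because it would modify the array shared with `params`.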