## 1 - Packages

In [141]:
import numpy as np
from testCases_v4a import *

## 2-initialization
### 3.1 - 2-layer neural network.
**Exercise**: Create and initialize the parameters of the 2-layer neural network.

**Solution**: I created a parameter initializer which by giving a list of numbers of units in each layer it randomly initializes the parameters W and constructs zeros of b parameters, by considering its size of output and input. which length of this list demonstrate the number of layers - 2 (minus input and outpus size)
+ **N**: List of number of Units in each layer
+ **len(N)**: Number of layers
+ **parameters**: Dictionary of parameters

In [142]:
def initialize_params(N:list):
    np.random.seed(1)
    parameters = {}
    for i in range(1, len(N)):
        parameters[f'W{i}'] = np.random.randn(N[i], N[i-1]) * 0.01
        parameters[f'b{i}'] = np.zeros((N[i], 1))
    return parameters

In [143]:
N = [3, 2, 1]
initialize_params(N)

{'W1': array([[ 0.01624345, -0.00611756, -0.00528172],
        [-0.01072969,  0.00865408, -0.02301539]]),
 'b1': array([[0.],
        [0.]]),
 'W2': array([[ 0.01744812, -0.00761207]]),
 'b2': array([[0.]])}

### 3.2 - L-layer Nueral Network
**Exercise**: Implement initialization for an L-layer Neural Network. 

**Solution**: Using the previous implemented solution we can give any **value of L**, then randomly initialize its parameters, given the number of **Units** of each **Layer**.

In [144]:
N = [4, 5, 2]
initialize_params(N)

{'W1': array([[ 0.01624345, -0.00611756, -0.00528172, -0.01072969],
        [ 0.00865408, -0.02301539,  0.01744812, -0.00761207],
        [ 0.00319039, -0.0024937 ,  0.01462108, -0.02060141],
        [-0.00322417, -0.00384054,  0.01133769, -0.01099891],
        [-0.00172428, -0.00877858,  0.00042214,  0.00582815]]),
 'b1': array([[0.],
        [0.],
        [0.],
        [0.],
        [0.]]),
 'W2': array([[-0.01100619,  0.01144724,  0.00901591,  0.00502494,  0.00900856],
        [-0.00683728, -0.0012289 , -0.00935769, -0.00267888,  0.00530355]]),
 'b2': array([[0.],
        [0.]])}

## 4 - Forward Propagation module
### 4.1 - Linear Forward

**Exercise**: Build the linear part of forward propagation.

**Solution**: Since, $$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$
+ Use the output of activated previous layer to **forward propagate** to next layer.
+ Also cache the values of **A**, **W**, **b** of the current layer to use for **backward propagation**.

In [145]:
def linear_forward(A, W, b):
    return W.dot(A) + b, (A, W, b)

In [146]:
# test the implementation
A, W, b = linear_forward_test_case()

print('Z is:', linear_forward(A, W, b)[0])

Z is: [[ 3.26295337 -1.23429987]]


### 4.2 - Linear-Activation 
**Exercise**: Implement the forward propagation of the *LINEAR->ACTIVATION* layer. 

**Solution**: Give mathematical relation,
$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]}A^{[l-1]} +b^{[l]})$$
+ **activation g**: it can be range of functions:
    - sigmoid function: (between -1, +1) $$\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$$
    - tanh funtion: $$\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
    - softmax function: (between 0, +1) $$\sigma(Z_j) = \frac{e^{Z_j}}{ \sum_{k=1}^{K}e^{-(Z_k)}}$$
    - relu function: $$Relu(Z) = max(0, Z)$$
    - leaky relu function $$LRelu(Z) = max(\epsilon, Z)$$
+ **linear_forward**: It is using linear activation function, so we call it **Z** and then feed it to activation function.
+ **cache**: at last cache any current parameter to use it in **backward propagation** step.
    - linear_cache: it is parameters **A, W, b** of current layer.
    - activation_cache: it is output **Z** of current layer.
+ **A_prev**: it is the **activated** output of previous layer.

In [147]:
def linear_activation_forward(A_prev, W, b, activation):
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == 'sigmoid':
        return 1/(1+np.exp(-Z)), (linear_cache, Z)
    
    elif activation == 'relu':
        return np.maximum(0, Z), (linear_cache, Z)
    
    elif activation == 'lrelu':
        return np.maximum(0.01, Z), (linear_cache, Z)
    
    elif activation == 'tanh':
        return np.exp(Z) - np.exp(-Z) / np.exp(Z) + np.exp(-Z), (linear_cache, Z)
    
    else:  # if not specified it would be linear activation function
        return Z, (linear_cache, Z)

In [148]:
A_prev, W, b = linear_activation_forward_test_case()

activations = ['sigmoid', 'relu', 'lrelu', 'tanh']
for activation in activations:
    print(f'with {activation}: A =', linear_activation_forward(A_prev, W, b, activation)[0])

with sigmoid: A = [[0.96890023 0.11013289]]
with relu: A = [[3.43896131 0.        ]]
with lrelu: A = [[3.43896131 0.01      ]]
with tanh: A = [[ 31.18564924 -57.08171647]]


### 4.3 - L-Layer Model
<img src="images/model_architecture_kiank.png" style="width:600px;height:300px;">
<caption><center> **Figure 2** : *[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID* model</center></caption><br>
**Exercise**: Implement the forward propagation of the above model.


In [149]:
def L_model_forward(X, parameters):
    A = X
    size = len(parameters)//2
    caches = []

    for layer in range(1, size):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters[f'W{layer}'], parameters[f'b{layer}'], 'relu')
        caches.append(cache)
    
    A_Last, cache = linear_activation_forward(A, parameters[f'W{size}'], parameters[f'b{size}'], 'sigmoid')
    caches.append(cache)

    return A_Last, caches

In [150]:
X, parameters = L_model_forward_test_case_2hidden()
A_Last, caches = L_model_forward(X, parameters)
print(f"AL = {A_Last}")
print("Length of caches list = " + str(len(caches)))

AL = [[0.03921668 0.70498921 0.19734387 0.04728177]]
Length of caches list = 3


## 5 - Cost Function
**Exercise**: Compute the cross-entropy cost $J$, using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$$

In [151]:
def compute_cost(A_Last, Y):
    return (-1/Y.shape[1]) * (np.dot(Y, np.log(A_Last.T)) + np.dot((1 - Y), np.log(1-A_Last.T))).squeeze()

In [152]:
Y, A_Last = compute_cost_test_case()

print('cost is :', compute_cost(A_Last, Y))

cost is : 0.2797765635793422


## 6 - Backward Propagation module
### 6.1 - Linear backward
<img src="images/linearback_kiank.png" style="width:250px;height:300px;">
<caption><center> **Figure 4** </center></caption>

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$.Here are the formulas you need:
$$ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
**Exercise**: Use the 3 formulas above to implement linear_backward().(given dZ)

In [153]:
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db

In [154]:
dZ, linear_cache = linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print(f"dA_prev = {dA_prev}")
print(f"dW = {dW}")
print(f"db = {db}")

dA_prev = [[-1.15171336  0.06718465 -0.3204696   2.09812712]
 [ 0.60345879 -3.72508701  5.81700741 -3.84326836]
 [-0.4319552  -1.30987417  1.72354705  0.05070578]
 [-0.38981415  0.60811244 -1.25938424  1.47191593]
 [-2.52214926  2.67882552 -0.67947465  1.48119548]]
dW = [[ 0.07313866 -0.0976715  -0.87585828  0.73763362  0.00785716]
 [ 0.85508818  0.37530413 -0.59912655  0.71278189 -0.58931808]
 [ 0.97913304 -0.24376494 -0.08839671  0.55151192 -0.10290907]]
db = [[-0.14713786]
 [-0.11313155]
 [-0.13209101]]


### 6.2 - Linear activation backward
If $g(.)$ is the activation function, 
`sigmoid_backward` and `relu_backward` compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$
which is,$$dZ^{[l]} = W^{[l+1] T} dZ^{[l+1]} * g'(Z^{[l]}) \tag{11}$$

**Exercise**: Implement the backpropagation for the *LINEAR->ACTIVATION* layer. 
**Solution**: By calculating the derivative of each activation function we have,
if $g(Z)=a$ then,
+ **sigmoid**: $$g'(Z)=dZ=a(1-a)$$
+ **tanh**: $$1-a^2$$
+ **relu**:<br> $0$ if $z<0$, $1$ if $z>=0$

In [155]:
def linear_activation_backward(dA, cache, activation):
    if activation == 'sigmoid':
        dZ = dA * (1/(1+np.exp(-cache[1]))) * (1 - 1/(1+np.exp(-cache[1])))
    
    elif activation == 'relu':
        dZ = np.where(cache[1] >= 0, dA, 0)
    
    elif activation == 'lrelu':
        dZ = np.where(cache[1] >= 0, dA, 0.01)
    
    elif activation == 'tanh':
        dZ = dA * 1 - np.power(((np.exp(cache[1]) - np.exp(-cache[1])) / (np.exp(cache[1]) + np.exp(-cache[1]))), 2)
    
    dA_prev, dW, db = linear_backward(dZ, cache[0])

    return dA_prev, dW, db

In [156]:
dAL, linear_activation_cache = linear_activation_backward_test_case()

activations = ['sigmoid', 'relu', 'lrelu', 'tanh']
for activation in activations:
    dA_prev, dW, db = linear_activation_backward(dAL, linear_activation_cache, activation)
    print (f"{activation}:")
    print ("dA_prev = {}".format(dA_prev))
    print ("dW = {}".format(dW))
    print ("db = {}\n-------------------".format(db))

sigmoid:
dA_prev = [[ 0.11017994  0.01105339]
 [ 0.09466817  0.00949723]
 [-0.05743092 -0.00576154]]
dW = [[ 0.10266786  0.09778551 -0.01968084]]
db = [[-0.05729622]]
-------------------
relu:
dA_prev = [[ 0.44090989 -0.        ]
 [ 0.37883606 -0.        ]
 [-0.2298228   0.        ]]
dW = [[ 0.44513824  0.37371418 -0.10478989]]
db = [[-0.20837892]]
-------------------
lrelu:
dA_prev = [[ 0.44090989 -0.01057952]
 [ 0.37883606 -0.00909008]
 [-0.2298228   0.00551454]]
dW = [[ 0.4533396   0.36950544 -0.11101633]]
db = [[-0.20337892]]
-------------------
tanh:
dA_prev = [[ 0.44273331  0.74825519]
 [ 0.38040277  0.64291152]
 [-0.23077325 -0.39002551]]
dW = [[-0.13307594  0.67292997  0.33515262]]
db = [[-0.56287443]]
-------------------


### 6.3 - L-Model Backward
<img src="images/mn_backward.png" style="width:450px;height:300px;">
<caption><center>  **Figure 5** : Backward pass  </center></caption>

**Exercise**: Implement backpropagation for the *[LINEAR->RELU] $\times$ (L-1) -> LINEAR -> SIGMOID* model.

**Solution**: first calculate the dA and dZ of the **last layer** we get,
$$dA = \frac{A - Y}{A(1-A)}$$
then using the estimated derivative of sigmoid activation function we have,
$$dZ = \frac{A - Y}{A(1-A)} * A(1-A) = A-Y$$

In [157]:
def L_model_backward(A_Last, Y, caches):
    grads = {}             # keep the gradients here
    L = len(caches)        # number of layers
    m = A_Last.shape[1]    # number of examples
    Y = Y.reshape(A_Last.shape)  # Y has same shape as A_Last

    grads[f'dA{L-1}'], grads[f'dW{L}'], grads[f'db{L}'] = linear_activation_backward(
        -(np.divide(Y, A_Last) - np.divide(1 - Y, 1 - A_Last)),
        caches[L-1], # last layer
        'sigmoid'
    )

    for l in reversed(range(L-1)):
        grads[f'dA{l}'], grads[f'dW{l+1}'], grads[f'db{l+1}'] = linear_activation_backward(
            grads[f'dA{l+1}'],
            caches[l], # last layer
            'relu'
        )

    return grads

In [158]:
AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print_grads(grads)

dW1 = [[0.41010002 0.07807203 0.13798444 0.10502167]
 [0.         0.         0.         0.        ]
 [0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 = [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
dA1 = [[ 0.12913162 -0.44014127]
 [-0.14175655  0.48317296]
 [ 0.01663708 -0.05670698]]


### 6.4 - Update Parameters
if $\alpha$ is the learning rate then,
$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$

**Exercise**: Implement `update_parameters()` to update your parameters using gradient descent.

**Solution**:

In [159]:
def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2

    for l in range(L):
        parameters[f'W{l+1}'] = parameters[f'W{l+1}'] - learning_rate * grads[f'dW{l+1}']
        parameters[f'b{l+1}'] = parameters[f'b{l+1}'] - learning_rate * grads[f'db{l+1}']
    
    return parameters

In [160]:
parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

L = len(parameters) // 2
for l in range(L):
    print(f"W{l+1} = {parameters[f'W{l+1}']}")
    print(f"b{l+1} = {parameters[f'b{l+1}']}")

W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]
