<img src="cover.png" style="width:800px;height:500px;">

In [1]:
!export PATH=/Library/TeX/texbin:$PATH
#!pip install nbconvert


Welcome to the second ML1 exercise. We will build our first neural network, which will have a hidden layer. We will also see how to train a neural network for various purposes. We will recognise the difference between several activation functions. 

**You will learn how to:**
- Implement a 2-class classification neural network with a single hidden layer
- Use units with a non-linear activation function, such as tanh 
- Compute the cross entropy loss 
- Implement forward and backward propagation

**Notation**:
- Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer. 
    - Example: $a^{[L]}$ is the $L^{th}$ layer activation. $W^{[L]}$ and $b^{[L]}$ are the $L^{th}$ layer parameters.
- Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example. 
    - Example: $x^{(i)}$ is the $i^{th}$ training example.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the $l^{th}$ layer's activations.

## Introduction
In this notebook we will train a Neural Network with a single hidden layer.

**Here is our neural network**:
<img src="NN.png" style="width:800px;height:300px;">

### Input layer
Each node in the input layer corresponds to one feature in the dataset.

### Hidden layers
The hidden layers come after the input layer and are where most of the computation happens. A hidden layer takes the output of the previous layer, applies a linear transformation to it, and then applies a non-linear activation function. Mathematically, we can represent the function of the hidden layer as follows:
$$f(x) =  \sigma(Wx + b) $$
where $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ refers to the non-linear activation function.
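
The formula above can be sketched in a few lines of NumPy. This is only an illustration: the function name `hidden_layer`, the layer sizes, and the choice of `np.tanh` as the activation are assumptions made for the example.

```python
import numpy as np

def hidden_layer(x, W, b, activation=np.tanh):
    """Return activation(W @ x + b) for a column vector x."""
    return activation(W @ x + b)

# Hypothetical sizes: 3 input features, 2 hidden units
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3)) * 0.01   # one row of weights per hidden unit
b = np.zeros((2, 1))                     # one bias per hidden unit
x = np.array([[0.5], [-1.0], [2.0]])     # a single example as a column vector

a = hidden_layer(x, W, b)
print(a.shape)  # (2, 1)
```

Each hidden unit produces one output value, so the result has one row per unit.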

To keep things simple, we will only use one hidden layer in our model for this notebook. Increasing the number of hidden layers tends to increase the model complexity and training time.

### Output layer
This layer holds the output that the network produces after performing its calculations.

### Activation functions
When designing the neural network model architecture, we also need to decide what activation functions to use for each layer. Activation functions have an important role to play in neural networks. You can think of activation functions as transformers in neural networks; they take an input value, transform the input value, and pass the transformed value to the next layer. The purpose of the activation function is to introduce non-linearity into the output of a neuron.

**Why do we need Non-linear activation functions?**  
A neural network without non-linear activation functions is essentially just a linear regression model, no matter how many layers it has. The activation function applies a non-linear transformation to the input, which makes the network capable of learning and performing more complex tasks.
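
A small numerical check of this claim, as a sketch with made-up layer sizes: two stacked layers without an activation function give exactly the same output as one linear layer with weights $W^{[2]} W^{[1]}$ and bias $W^{[2]} b^{[1]} + b^{[2]}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
X = rng.standard_normal((3, 5))  # 5 examples with 3 features each

# Two stacked layers without any activation function
two_layers = W2 @ (W1 @ X + b1) + b2

# The same mapping written as a single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ X + b_eq

print(np.allclose(two_layers, one_layer))  # True
```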

### Import Libraries:
First, we will import all the packages that are required for this exercise.
- [numpy](www.numpy.org) is the main package for scientific computing with Python.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- np.random.seed(1) is used to keep all the random function calls consistent. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(1)

#### 1. Linear 
You can interpret this as no activation function. We usually use it in the output layer in regression networks as we don't need any transformation to the outputs.
$$f(x) = x$$

In [None]:
def linear(x):
    return x

def der_linear(x):
    return np.ones_like(x)

In [None]:
## plot the function 
x  = np.linspace(-10, 10, 100)
f  = linear(x)
df = der_linear(x)
plt.plot(x, f, label="linear") 
plt.plot(x, df, label="derivative")
plt.xlabel("x") 
plt.ylabel("linear(X)")
plt.legend()
plt.show()

#### 2. Sigmoid
The sigmoid activation function takes a value and squashes it into the range between 0 and 1. You can interpret its output as the probability of a positive prediction, which is why we usually use it in the output layer of binary classification networks. It is sometimes used in hidden layers as well. Note, however, that the sigmoid function is monotonic but its derivative is not, and the derivative is close to zero for inputs of large magnitude. As a result, training may slow down or get stuck at a suboptimal solution.
$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$
Its derivative 
$$f'(x) = \sigma(x)\,(1-\sigma(x)) $$

In [None]:
def sigmoid(x):
    return 1/(1 + np.exp(-x))

def der_sigmoid(x):
    return sigmoid(x)* (1- sigmoid(x))

In [None]:
## plot the function 
x  = np.linspace(-10, 10, 100)
f  = sigmoid(x)
df = der_sigmoid(x)
plt.plot(x, f, label="sigmoid") 
plt.plot(x, df, label="derivative")
plt.xlabel("x") 
plt.ylabel("Sigmoid(X)")
plt.legend()
plt.show()

#### When to use sigmoid: ####
* If you want an output value between 0 and 1, use sigmoid in the output layer neuron only.
* If you are solving a binary classification problem, use sigmoid in the output layer.

#### Problem with sigmoid: ####
* When $x$ is very negative or very positive, the slope is close to zero, so the gradient almost vanishes and learning becomes very slow.
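
To make the saturation concrete, the following sketch evaluates the sigmoid derivative at a few points. The functions are redefined here only so that the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def der_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

# The derivative peaks at x = 0 and shrinks quickly as |x| grows
for x in [-10.0, 0.0, 10.0]:
    print(x, der_sigmoid(x))
```

The derivative is 0.25 at $x = 0$ and drops below $10^{-4}$ at $x = \pm 10$, so a unit that operates in those regions passes almost no gradient back.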

#### 3. Hyperbolic Tangent
The tanh function is another function that can be used as a non-linear activation function between the layers of a neural network. It has a few things in common with the sigmoid activation function, and the two look very similar. But while a sigmoid function maps input values to the range between 0 and 1, tanh maps them to the range between -1 and 1.

$$f(x) = \tanh(x) = \frac {\sinh(x)}{\cosh(x)} = \frac {e^x - e^{-x}}{e^x + e^{-x}}$$

Its derivative 
$$f'(x) = 1 - \tanh^2(x)$$

In [None]:
def tanh(x):
    return np.tanh(x)

def der_tanh(x):
    return 1- tanh(x)**2

In [None]:
## plot the function 
x  = np.linspace(-10, 10, 100)
f  = tanh(x)
df = der_tanh(x)
plt.plot(x, f, label="tanh") 
plt.plot(x, df, label="derivative")
plt.xlabel("x") 
plt.ylabel("tanh(x)")
plt.legend()
plt.show()

#### When to use tanh: ####
* Usually used in the hidden layers of a neural network. Its outputs are centered around 0, which brings the mean of the activations close to 0 and makes learning easier for the next layer (the optimization is easier).

#### Problem with tanh: ####
* Vanishing gradient: like sigmoid, tanh saturates for inputs of large magnitude, where its derivative is close to zero.

#### 4. ReLU
Today, ReLU is the most popular choice of activation function for deep neural networks (DNNs) and has become the default choice. Its range is from 0 to infinity, and both the function itself and its derivative are monotonic. One drawback of ReLU is that it cannot represent the negative part of the input, because all negative inputs are mapped to zero. To fix this "dying ReLU" problem, *Leaky ReLU* was introduced, which has a small slope in the negative part.
$$f(x) = \max(0,x)$$

The derivative of the ReLU is :

$$f'(x) = \begin{cases} 1 & \mbox{if } x > 0 \\ 0 & \mbox{if } x \leq 0 \end{cases}$$

**Instructions**:
- You will compare a vector x with the scalar 0 and return a new array containing the element-wise maxima.
    - Use: `numpy.maximum(x, 0)`.
- You will compare the vector x with the scalar 0 and return a new array containing either ones or zeros.
    - Use: `np.where(x <= 0)`.

In [None]:
def relu(x) :
    return np.maximum(x, 0)

def der_relu(x):
    i     = np.where(x <= 0)
    df    = np.ones_like(x)
    df[i] = 0 
    return df

In [None]:
## plot the function 
x = np.linspace(-10, 10, 100)
f = relu(x)
df = der_relu(x)
plt.plot(x, f,  label="relu")
plt.plot(x, df, label="derivative")
plt.xlabel("x") 
plt.ylabel("ReLU(X)")
plt.legend()
plt.show()

#### Problem with ReLU: ####
* The derivative is not defined at $x = 0$, which we can overcome by setting the derivative to $0$ at $x = 0$. A more serious issue is that the gradient is zero for all $x \leq 0$, so a neuron whose input stays negative receives no updates and cannot learn.

#### 5. Leaky ReLU
Leaky ReLU is an improved version of the ReLU function. With ReLU, the gradient is 0 for $x < 0$, which makes neurons die for activations in that region. A leaky ReLU instead has a small slope (of 0.01, or so) for negative inputs.
$$f(x) = \max(0.01x, x)$$

The derivative of the leaky ReLU is :

$$f'(x) = \begin{cases} 1 & \mbox{if } x > 0 \\ 0.01 & \mbox{if } x \leq 0 \end{cases}$$

In [None]:
def leaky_relu(x) :
    return np.maximum(x, 0.01*x)

def der_leaky_relu(x):
    i     = np.where(x <= 0)
    df    = np.ones_like(x)
    df[i] = 0.01 
    return df

In [None]:
## plot the function 
x = np.linspace(-10, 10, 100)
f = leaky_relu(x)
df = der_leaky_relu(x)
plt.plot(x, f,  label="leaky_relu")
plt.plot(x, df, label="derivative")
plt.xlabel("x") 
plt.ylabel("Leaky_ReLU(x)")
plt.legend()
plt.show()
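
The difference between the ReLU and leaky ReLU derivatives for negative inputs can also be checked numerically. The sketch below redefines both derivatives with `np.where` so that it runs on its own; the `slope` parameter is an assumption added for illustration.

```python
import numpy as np

def der_relu(x):
    # 1 for positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

def der_leaky_relu(x, slope=0.01):
    # 1 for positive inputs, a small constant slope otherwise
    return np.where(x > 0, 1.0, slope)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(der_relu(x))
print(der_leaky_relu(x))
```

For the negative entries, ReLU passes back a gradient of exactly zero, while leaky ReLU still passes back 0.01 times the incoming gradient.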

#### 6. Softmax
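Softmax is a generalization of the logistic function used for multiclass classification. Hence, we use it in the output layer of multiclass classification networks. For an input vector $x$ with $K$ entries, it is defined as
$$f(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$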

#### Properties of Softmax Function ####

1. The calculated probabilities will be in the range of 0 to 1.
2. The sum of all the probabilities is equal to 1.

#### Softmax Function Usage ####
1. Used in multiclass logistic regression models.
2. Used in neural networks such as multilayer perceptrons, typically in the output layer of a multiclass classifier.

In [None]:
def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow in np.exp
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

In [None]:
x = np.array([2.1, -0.2 , -1.2])
y_probas = softmax(x)
print('Probabilities:\n', y_probas)
print('The sum of Probabilities: ', np.sum(y_probas))
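
When softmax is used in an output layer, it is normally applied to a whole batch at once. The following is a minimal sketch under the assumption that each column of `Z` holds the scores of one example; the name `softmax_columns` is hypothetical. It subtracts the column maximum before exponentiating, a common trick to avoid overflow that leaves the result unchanged.

```python
import numpy as np

def softmax_columns(Z):
    """Softmax applied to each column of Z (shape: n_classes x m)."""
    # Subtracting the column maximum leaves the result unchanged
    # but avoids overflow in np.exp for large entries
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(Z_shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

Z = np.array([[2.1, 1000.0],
              [-0.2, 1000.0],
              [-1.2, 1000.0]])
P = softmax_columns(Z)
print(P.sum(axis=0))  # each column sums to 1
```

The second column would overflow with a naive `np.exp(1000.0)`; with the shift, it yields three equal probabilities.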

## Steps involved in the implementation of a neural network:
The general methodology to build a Neural Network is to:
1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
    - Forward propagation
    - Compute loss
    - Backward propagation to get the gradients
    - Update parameters (gradient descent, optimizer)

### Forward propagation
In a feedforward neural network, we start with a set of input features and some randomly initialized weights. Forward propagation consists of all the steps from the input to the prediction:

**Mathematically**:   
For one example $x^{(i)}$  
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1]}\tag{1}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}\tag{3}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}$$
$$y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } \hat{y}^{(i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}$$

<img src="NN.png" style="width:800px;height:300px;">

### Backpropagation:
During backpropagation, we calculate the error between predicted output and target output (the loss) $\mathcal{L}$ as follows: 
$$ \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) =  - y^{(i)}  \log(\hat{y}^{(i)}) - (1-y^{(i)} )  \log(1-\hat{y}^{(i)})$$

* It says that if you make a wrong prediction, the loss (error) becomes large.
    * **Example:** our real image is **sign one** and its label is 1 $(y = 1)$, and we predict $\hat{y} = 1$. When we put $y$ and $\hat{y}$ into the loss equation, the result is 0: the prediction is correct, so the loss is 0. However, if we make a wrong prediction such as $\hat{y} = 0$, the loss goes to infinity.
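
The behaviour described in this example can be reproduced with a few lines of NumPy. The helper name `binary_cross_entropy` and the three prediction values are assumptions chosen for illustration; the predictions stay strictly between 0 and 1 so that the logarithm is finite.

```python
import numpy as np

def binary_cross_entropy(y_hat, y):
    """Loss for a single example with label y and prediction y_hat."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

y = 1
for y_hat in [0.99, 0.5, 0.01]:
    print(y_hat, binary_cross_entropy(y_hat, y))
```

The loss is close to 0 for the confident correct prediction and grows without bound as $\hat{y}$ approaches 0.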

The cost function is then the average of the loss function over all inputs: each input image produces its own loss, and the cost combines them.
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})\tag{6}$$

Then we will use an optimization algorithm (gradient descent) to update the weight values so that the cost decreases.


#### What is Gradient Descent? ####
Gradient descent is an optimization algorithm that operates iteratively to find the optimal values for its parameters. It takes a user-defined learning rate and initial parameter values as input.

How it works (iteratively):
1. Start with the initial values.
2. Calculate the cost and its gradient with respect to the parameters.
3. Update the values using the update rule.
4. Repeat steps 2 and 3 until the cost stops decreasing, then return the parameters that minimize the cost function.
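
The steps above can be sketched on a toy problem. This is a minimal illustration, not the training loop of the network: the function name `gradient_descent`, the cost $J(w) = (w - 3)^2$, and the default values are assumptions made for the example.

```python
def gradient_descent(grad, w0, learning_rate=0.1, num_iterations=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(num_iterations):
        w = w - learning_rate * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 (w - 3); the minimum is at w = 3
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_opt)
```

Each iteration shrinks the distance to the minimum by a constant factor, so the returned value ends up very close to 3.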

<img src="GD.jpg" style="width:350px;height:250px;"> 

#### Chain rule ####
<img src="BP.png" style="width:900px;height:200px;">

**Mathematically**: 
The goal of backpropagation is to change the weights and biases of the network to reduce loss; in other words, we tune the network parameters to improve its predictions. We want to get $(dW^{[2]},db^{[2]},dW^{[1]},db^{[1]})$.

$$ dW^{[2]} = \frac{\partial \mathcal{L} }{\partial W^{[2]}} = \frac{\partial \mathcal{L} }{\partial a^{[2]}} \frac{\partial a^{[2]} }{\partial z^{[2]}}\frac{\partial z^{[2]} }{\partial W^{[2]}}\tag{7}$$
$$ db^{[2]} = \frac{\partial \mathcal{L} }{\partial b^{[2]}} = \frac{\partial \mathcal{L} }{\partial a^{[2]}} \frac{\partial a^{[2]} }{\partial z^{[2]}}\frac{\partial z^{[2]} }{\partial b^{[2]}}\tag{8}$$
$$dW^{[1]} = \frac{\partial \mathcal{L} }{\partial W^{[1]}} =  \frac{\partial \mathcal{L} }{\partial a^{[2]}} \frac{\partial a^{[2]} }{\partial z^{[2]}}\frac{\partial z^{[2]} }{\partial a^{[1]}}\frac{\partial a^{[1]} }{\partial z^{[1]}}\frac{\partial z^{[1]} }{\partial W^{[1]}} \tag{9}$$
$$db^{[1]} = \frac{\partial \mathcal{L} }{\partial b^{[1]}} =  \frac{\partial \mathcal{L} }{\partial a^{[2]}} \frac{\partial a^{[2]} }{\partial z^{[2]}}\frac{\partial z^{[2]} }{\partial a^{[1]}}\frac{\partial a^{[1]} }{\partial z^{[1]}}\frac{\partial z^{[1]} }{\partial b^{[1]}} \tag{10}$$



**The Initialization of Backpropagation**  
To backpropagate through this network, we start from the output $a^{[2]} = \sigma(z^{[2]})$. Your code thus needs to compute $da^{[2]}= \frac{\partial \mathcal{L}}{\partial a^{[2]}}$.
$$da^{[2]} = \frac{\partial \mathcal{L} }{\partial a^{[2]}} =  - (\frac{y }{a^{[2]}} - \frac{1-y }{1-a^{[2]}})\tag{11}$$
then,
$$dz^{[2]} = \frac{\partial \mathcal{L} }{\partial z^{[2]}}   = \frac{\partial \mathcal{L} }{\partial a^{[2]}}  \frac{\partial a^{[2]} }{\partial z^{[2]}}  = da^{[2]} \sigma'(z^{[2]}) \tag{12}$$
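
As a side note, substituting equation (11) and $\sigma'(z^{[2]}) = a^{[2]}(1 - a^{[2]})$ into equation (12) gives a compact expression. This is a standard simplification for a sigmoid output combined with the cross-entropy loss; the exercise code further below does not rely on it.

$$\begin{aligned}
dz^{[2]} &= -\left(\frac{y}{a^{[2]}} - \frac{1-y}{1-a^{[2]}}\right) a^{[2]}\,(1-a^{[2]}) \\
&= -y\,(1-a^{[2]}) + (1-y)\,a^{[2]} \\
&= a^{[2]} - y
\end{aligned}$$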

We can now calculate the gradients used to update the parameters $W^{[2]}$ and $b^{[2]}$ in order to reduce the loss:
$$  dW^{[2]} = \frac{\partial \mathcal{L} }{\partial W^{[2]}} =  \frac{\partial \mathcal{L}}{\partial z^{[2]}} \frac{\partial z^{[2]} }{\partial W^{[2]}}= dz^{[2]} a^{[1] T} \tag{13}  $$
$$ db^{[2]} =  \frac{\partial \mathcal{L} }{\partial b^{[2]}} =  \frac{\partial \mathcal{L}}{\partial z^{[2]}} \frac{\partial z^{[2]} }{\partial b^{[2]}}= dz^{[2]} \tag{14}  $$



Next, we compute $da^{[1]}= \frac{\partial \mathcal{L}}{\partial a^{[1]}}$:
$$da^{[1]} = \frac{\partial \mathcal{L} }{\partial a^{[1]}} = \frac{\partial\mathcal{L}}{\partial z^{[2]}} \frac{\partial z^{[2]} }{\partial a^{[1]}} = W^{[2] T} dz^{[2]} \tag{15}$$
then,
$$dz^{[1]} = \frac{\partial \mathcal{L} }{\partial z^{[1]}}   = \frac{\partial \mathcal{L} }{\partial a^{[1]}}  \frac{\partial a^{[1]} }{\partial z^{[1]}} = da^{[1]}  \tanh'(z^{[1]}) \tag{16} $$
To update $W^{[1]}$ and $b^{[1]}$ we can use the equations below:
$$ dW^{[1]} = \frac{\partial \mathcal{L} }{\partial W^{[1]}} =  \frac{\partial \mathcal{L}}{\partial z^{[1]}} \frac{\partial z^{[1]} }{\partial W^{[1]}}= dz^{[1]} x^{T} \tag{17}  $$
$$ db^{[1]} = \frac{\partial \mathcal{L} }{\partial b^{[1]}} =  \frac{\partial \mathcal{L}}{\partial z^{[1]}} \frac{\partial z^{[1]} }{\partial b^{[1]}}= dz^{[1]} \tag{18}  $$
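
A common way to gain confidence in backpropagation formulas is to compare them with finite differences. The sketch below does this for a tiny tanh/sigmoid network. It is self-contained and uses its own hypothetical helpers (`forward`, `cost`, `backward`, `numerical_grad`) and made-up sizes rather than the functions of this exercise. The gradients are those of the cost averaged over the examples, and `a2 - Y` is the algebraically simplified form of $da^{[2]} \sigma'(z^{[2]})$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, p):
    a1 = np.tanh(p["W1"] @ X + p["b1"])
    a2 = sigmoid(p["W2"] @ a1 + p["b2"])
    return a1, a2

def cost(X, Y, p):
    _, a2 = forward(X, p)
    return np.mean(-Y * np.log(a2) - (1 - Y) * np.log(1 - a2))

def backward(X, Y, p):
    m = X.shape[1]
    a1, a2 = forward(X, p)
    dz2 = a2 - Y                             # da2 * sigmoid'(z2), simplified
    dz1 = (p["W2"].T @ dz2) * (1 - a1 ** 2)  # da1 * tanh'(z1)
    return {"W1": dz1 @ X.T / m, "b1": dz1.sum(axis=1, keepdims=True) / m,
            "W2": dz2 @ a1.T / m, "b2": dz2.sum(axis=1, keepdims=True) / m}

def numerical_grad(X, Y, p, key, eps=1e-6):
    """Central finite differences for every entry of p[key]."""
    grad = np.zeros_like(p[key])
    for idx in np.ndindex(p[key].shape):
        old = p[key][idx]
        p[key][idx] = old + eps
        plus = cost(X, Y, p)
        p[key][idx] = old - eps
        minus = cost(X, Y, p)
        p[key][idx] = old                    # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
p = {"W1": rng.standard_normal((4, 3)), "b1": np.zeros((4, 1)),
     "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1))}

analytic = backward(X, Y, p)
max_diff = max(np.max(np.abs(analytic[k] - numerical_grad(X, Y, p, k))) for k in p)
print(max_diff)
```

If the analytic gradients are correct, the largest difference is tiny compared with the gradient values themselves.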

### Update Parameters

In this section you will update the parameters of the model, using gradient descent: 

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{19}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{20}$$

where $\alpha$ is the learning rate and $l \in \{1,2\}$.

## Implementation
In this practical session, we will implement a shallow network from scratch.

### Load Data:
We will use NumPy to load the image and label arrays from `.npy` files.
### Overview the Data Set
For this exercise, we will use the "Sign Language Digit Data Set", which contains 2062 sign language digit images. The digits range from 0 to 9, so there are ten unique signs. To keep things simple, we will use only the signs 0 and 1. In the data, sign zero is between the indices 204 and 408, so there are 205 zero signs. Sign one is between the indices 822 and 1027, so there are 206 one signs. Therefore, we will use 205 samples from each class (label).

**Note:** 205 samples per class is very little for deep learning, but since this is a tutorial, that is not so important.


Let's prepare our X and Y arrays. X is the image array (zero and one signs) and Y is the label array (0 and 1).

In [None]:
# load data set
x_l = np.load('./X.npy')
Y_l = np.load('./Y.npy')
img_size = 64
plt.subplot(1, 2, 1)
plt.imshow(x_l[260].reshape(img_size, img_size),cmap='gray')
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(x_l[900].reshape(img_size, img_size),cmap='gray')
plt.axis('off')
plt.show()

**What we will do to preprocess the dataset:**
* Choose our labels (classes) that are sign zero and sign one
* Create and flatten train and test sets
* Our final inputs(images) and outputs(labels or classes) should look like this:

<img src="example.png" style="width:500px;height:500px;">

To create the image array, we will concatenate the zero sign and one sign arrays (to create a small dataset). Then we will create the label array: 0 for images with a zero sign and 1 for images with a one sign.

In [None]:
# Join a sequence of arrays along the row axis.
X = np.concatenate((x_l[204:409], x_l[822:1027] ), axis=0) # rows 0 to 204 are zero signs and rows 205 to 409 are one signs
z = np.zeros(205)
o = np.ones(205)
Y = np.concatenate((z, o), axis=0).reshape(X.shape[0],1)
print("X shape: " , X.shape)
print("Y shape: " , Y.shape)

The shape of X is (410, 64, 64). 410 means that we have 410 images (zero and one signs), and 64 means that each image is 64x64 pixels. The shape of Y is (410, 1), which means that we have 410 labels (0 and 1).

Let's split X and Y into train and test sets.

In [None]:
# Then lets create x_train, y_train, x_test, y_test arrays
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=42)
number_of_train = X_train.shape[0]
number_of_test = X_test.shape[0]
print("number of samples in training data set: " , number_of_train)
print("number of samples in test data set: " , number_of_test)

* `test_size` = fraction of the data used for the test set (0.15 means 15%).
* `random_state` = seed used while shuffling. If we call `train_test_split` repeatedly with the same `random_state`, it always creates the same train and test split.
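
The effect of a fixed seed can be illustrated without scikit-learn. The helper `simple_split` below is hypothetical and is not how `train_test_split` is implemented; it only shows that shuffling with the same seed reproduces the same split. Rounding the test size up is an assumption chosen so that the sizes match the 62 test images reported in this notebook.

```python
import numpy as np

def simple_split(n_samples, test_size, seed):
    """Return shuffled (train_idx, test_idx) index arrays for a given seed."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(np.ceil(test_size * n_samples))
    return indices[n_test:], indices[:n_test]

train_a, test_a = simple_split(410, test_size=0.15, seed=42)
train_b, test_b = simple_split(410, test_size=0.15, seed=42)

print(len(train_a), len(test_a))        # 348 62
print(np.array_equal(test_a, test_b))   # True: same seed, same split
```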

Our input array (X) is now 3-dimensional, so we need to flatten it to 2D in order to use it as input for our first deep learning model. Our label array (Y) is already 2D, so we leave it as it is.

In [None]:
X_train_flatten = X_train.reshape(number_of_train,X_train.shape[1]*X_train.shape[2])
X_test_flatten = X_test .reshape(number_of_test,X_test.shape[1]*X_test.shape[2])
print("X train flatten",X_train_flatten.shape)
print("X test flatten",X_test_flatten.shape)

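The test array contains 62 images, and each image has 4096 pixels. Next, we transpose the arrays so that every column holds one example, which matches the $W X + b$ convention used in the equations above.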

In [None]:
x_train = X_train_flatten.T
x_test = X_test_flatten.T
y_train = Y_train.T
y_test = Y_test.T
print("x train: ",x_train.shape)
print("x test: ",x_test.shape)
print("y train: ",y_train.shape)
print("y test: ",y_test.shape)

In [None]:
x_train[0].shape

### Initialize the model's parameters:
We start with random weights, which we will then optimize using backward propagation. The shapes of the parameters we have to initialize depend on:
* $n_x$ -- size of the input layer
* $n_h$ -- size of the hidden layer
* $n_y$ -- size of the output layer

**Exercise**: Implement the function `initialize_parameters()` to be used to initialize parameters for a two layer model.

**Remember** that when we compute $W X + b$ in python, it carries out broadcasting. For example, if: 

$$ W = \begin{bmatrix}
    j  & k  & l\\
    m  & n & o \\
    p  & q & r 
\end{bmatrix}\;\;\; X = \begin{bmatrix}
    a  & b  & c\\
    d  & e & f \\
    g  & h & i 
\end{bmatrix} \;\;\; b =\begin{bmatrix}
    s  \\
    t  \\
    u
\end{bmatrix}\tag{21}$$

Then $WX + b$ will be:

$$ WX + b = \begin{bmatrix}
    (ja + kd + lg) + s  & (jb + ke + lh) + s  & (jc + kf + li)+ s\\
    (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\
    (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u
\end{bmatrix}\tag{22}  $$
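
The same broadcasting behaviour can be observed directly in NumPy. The concrete numbers below are placeholders for the symbolic entries of $W$, $X$ and $b$.

```python
import numpy as np

W = np.arange(1, 10).reshape(3, 3)      # stand-in for the 3x3 weight matrix
X = np.ones((3, 3))                     # three examples stored as columns
b = np.array([[10.0], [20.0], [30.0]])  # column vector of shape (3, 1)

Z = W @ X + b   # b is broadcast across the columns of W @ X

# Equivalent computation with b copied explicitly into every column
Z_explicit = W @ X + np.tile(b, (1, X.shape[1]))
print(np.array_equal(Z, Z_explicit))  # True
```

Every column of the result receives the same bias vector, which is exactly what the matrix above shows with $s$, $t$ and $u$.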


**Instructions**:
- Make sure your parameters' sizes are right. Refer to the neural network figure above if needed.
- You will initialize the weights matrices with random values. 
    - Use: `np.random.randn(a,b) * 0.01` to randomly initialize a matrix of shape (a,b).
- You will initialize the bias vectors as zeros. 
    - Use: `np.zeros((a,b))` to initialize a matrix of shape (a,b) with zeros.

In [None]:
def initialize_parameters(layers_dims, init_method ="random"):
    np.random.seed(1)
    n_x, n_h, n_y = layers_dims
    if init_method == "random":
        ### START CODE HERE ### (≈ 4 lines of code)
        W1 = np.random.randn(n_h, n_x) * 0.01
        b1 = np.zeros(shape=(n_h, 1))
        W2 = np.random.randn(n_y, n_h) * 0.01
        b2 = np.zeros(shape=(n_y, 1))
        ### END CODE HERE ###
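    elif init_method == "xavier":
        # Scale each weight matrix by sqrt(1 / fan_in), where fan_in is the number of inputs to the layer
        W1 = np.random.randn(n_h, n_x) * np.sqrt(1 / n_x)
        b1 = np.zeros(shape=(n_h, 1))
        W2 = np.random.randn(n_y, n_h) * np.sqrt(1 / n_h)
        b2 = np.zeros(shape=(n_y, 1))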
    elif init_method == "zeros":
        W1 = np.zeros(shape=(n_h, n_x))
        b1 = np.zeros(shape=(n_h, 1))
        W2 = np.zeros(shape=(n_y, n_h))
        b2 = np.zeros(shape=(n_y, 1))
    
    # Check that the shapes are correct
    assert(W1.shape == (n_h, n_x))
    assert(b1.shape == (n_h, 1))
    assert(W2.shape == (n_y, n_h))
    assert(b2.shape == (n_y, 1))
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters 

Test Code:

In [None]:
n_x=3 
n_h=2 
n_y=1
layers_dims = [n_x,n_h,n_y]
parameters = initialize_parameters(layers_dims,init_method = "random")
print("W1 = \n" + str(parameters["W1"]))
print("b1 = \n" + str(parameters["b1"]))
print("W2 = \n" + str(parameters["W2"]))
print("b2 = \n" + str(parameters["b2"]))

In [None]:
parameters = initialize_parameters(layers_dims,init_method = "xavier")
print("W1 = \n" + str(parameters["W1"]))
print("b1 = \n" + str(parameters["b1"]))
print("W2 = \n" + str(parameters["W2"]))
print("b2 = \n" + str(parameters["b2"]))

### Implementation of Forward propagation :
**Exercise:**
Forward propagation consists of all the steps from input to prediction. Use equations (1)-(4) and the helper functions `tanh` and `sigmoid` to implement it; the thresholding in equation (5) is applied later in `predict`.

In [None]:
def forward_propagation(X, parameters):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    # Implement Forward Propagation to calculate A2 (probabilities)
    ### START CODE HERE ### (≈ 4 lines of code)
    z1 = np.dot(W1, X) + b1
    a1 = tanh(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    ### END CODE HERE ###
        
    # Check that the shape is correct
    assert(a2.shape == (1, X.shape[1]))
    cache = {"z1": z1,
             "a1": a1,
             "W1": W1,
             "b1": b1,
             "z2": z2,
             "a2": a2,
             "W2": W2,
             "b2": b2}    
    return a2, cache

### Implementation of Loss function:
**Exercise:**
Implement the cost function from equation (6). This function takes the output of `forward_propagation` and the true labels as input and returns the averaged loss.

In [None]:
def compute_loss(a2, Y):
    m = Y.shape[1]
    ### START CODE HERE ### (≈ 2 lines of code)
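    # Cross-entropy loss of every example
    logprobs = np.multiply(-np.log(a2),Y) + np.multiply(-np.log(1 - a2), 1 - Y)
    # np.nansum skips NaN terms, which appear when a2 is exactly 0 or 1 (0 * inf)
    loss = 1./m * np.nansum(logprobs)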
    ### END CODE HERE ###
    return loss

### Implementation of Backpropagation :
**Exercise:**
Backpropagation consists of all the steps from the cost calculation to the calculation of the gradients. Use equations (11)-(18) and the helper functions `der_tanh` and `der_sigmoid` to implement it.

In [None]:
def backward_propagation(X, Y, cache):
    
    m = X.shape[1]
    # Retrieve each variables from the dictionary "cache"
    z1 = cache['z1']
    a1 = cache['a1']
    W1 = cache['W1']
    b1 = cache['b1']
    z2 = cache['z2']
    a2 = cache['a2']
    W2 = cache['W2']
    b2 = cache['b2']
    ### START CODE HERE ### (8 lines of code) 
    # Initialize the backpropagation with the derivative of the loss with respect to a2
    da2 =  - (np.divide(Y, a2) - np.divide(1 - Y, 1 - a2))
    
    # 2nd layer (SIGMOID -> LINEAR) gradients
    dz2 = np.multiply(da2,der_sigmoid(z2)) # da2 * sigmoid'(z2)
    
    # Linear portion of backward propagation for the 2nd layer, averaged over the m examples
    dW2 = 1/m * np.dot(dz2, a1.T)
    db2 = 1/m * np.sum(dz2, axis=1, keepdims = True)
    
    # Propagate the gradient back through the linear step z2 = W2 a1 + b2 to get da1
    da1 = np.dot(W2.T, dz2)
    
    # 1st layer (TANH -> LINEAR) gradients
    dz1 = np.multiply(da1, der_tanh(z1))
    
    # Linear portion of backward propagation for the 1st layer, averaged over the m examples
    dW1 = 1/m * np.dot(dz1, X.T)
    db1 = 1/m * np.sum(dz1, axis=1, keepdims = True)
    ### END CODE HERE ###
        
    # Check that the shapes are correct
    assert (da2.shape == a2.shape)
    assert (dz2.shape == z2.shape)
    assert (dW2.shape == W2.shape)
    assert (db2.shape == b2.shape)
    assert (da1.shape == a1.shape)
    assert (dz1.shape == z1.shape)
    assert (dW1.shape == W1.shape)
    assert (db1.shape == b1.shape)
    gradients = {"da2": da2, "dz2": dz2, "dW2": dW2, "db2": db2,
                 "da1": da1, "dz1": dz1, "dW1": dW1, "db1": db1}
    
    return gradients

### Update parameters.

**Exercise**: Implement the update rule. Use gradient descent. You have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

**General gradient descent rule**: $ \theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$ where $\alpha$ is the learning rate and $\theta$ represents a parameter.

**Illustration**: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.

In [None]:
def update_parameters(parameters, grads, learning_rate):
    n = len(parameters) // 2 # number of layers in the neural networks

    # Update rule for each parameter
    for k in range(n):
        parameters["W" + str(k+1)] = parameters["W" + str(k+1)] - learning_rate * grads["dW" + str(k+1)]
        parameters["b" + str(k+1)] = parameters["b" + str(k+1)] - learning_rate * grads["db" + str(k+1)]
        
    return parameters

### Optimization
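Now we want to train our model by combining all the functions: first compute the forward propagation, then evaluate the loss function and the gradients, and finally update the parameters using gradient descent to reduce the loss. Note that `optimize` calls the `predict` function from the Predictions section to report the accuracy, so the cell that defines `predict` has to be run before `optimize` is called.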

**Exercise:** Write down the optimization function. The goal is to learn $w$ and $b$ by minimizing the cost function $J$.

In [None]:
# GRADED FUNCTION: optimize
def optimize(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "random"):
    
    grads      = {}
    costs      = [] # to keep track of the loss
    accuracies = [] # to keep track of the accuracy
    m = X.shape[1] # number of examples
    layers_dims = [X.shape[0], 10, 1]
    
    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters(layers_dims, init_method ="zeros")
    elif initialization == "random":
        parameters = initialize_parameters(layers_dims, init_method ="random")
    elif initialization == "xavier":
        parameters = initialize_parameters(layers_dims, init_method ="xavier")

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> TANH -> LINEAR -> SIGMOID.
        a2, cache = forward_propagation(X, parameters)
        
        # Loss
        cost = compute_loss(a2, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            p = predict(X, parameters)
            accuracy = np.mean(p[0,:] == Y[0,:])
            print("Accuracy after iteration {}: {}".format(i, accuracy))
            costs.append(cost)
            accuracies.append(accuracy)
            
    # plot the loss, accuracy
    fig1, (ax1, ax2) = plt.subplots(figsize=(10,12), nrows=2, ncols=1)
    ax1.plot(costs)
    ax1.set_ylabel('cost')
    ax1.set_xlabel('iterations (per thousands)')
    ax1.set_title("Learning rate =" + str(learning_rate))
    ax2.plot(accuracies)
    ax2.set_ylabel('accuracy')
    ax2.set_xlabel('iterations (per thousands)')
    ax2.set_title("Learning rate =" + str(learning_rate))
    plt.show()
    return parameters

In [None]:
parameters = optimize(x_train, y_train, learning_rate = 0.01, num_iterations = 10000, print_cost = True, initialization = "random")

###  Predictions

Now it is time to use our model to make predictions by building a `predict()` function.
Use forward propagation to compute the results.

**Reminder**: predictions = $y_{prediction} = \mathbb{1}(\text{activation} > 0.5) = \begin{cases}
      1 & \text{if}\ \text{activation} > 0.5 \\
      0 & \text{otherwise}
    \end{cases}$  
    
As an example, if you would like to set the entries of a matrix X to 0 and 1 based on a threshold you would do: ```X_new = (X > threshold)```
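
A short sketch of this thresholding on made-up activations; the array values are placeholders.

```python
import numpy as np

activations = np.array([[0.12, 0.50, 0.51, 0.97]])
threshold = 0.5

# The comparison yields booleans; astype(int) turns them into 0/1 labels
predictions = (activations > threshold).astype(int)
print(predictions)  # [[0 0 1 1]]
```

Note that a value exactly equal to the threshold is mapped to 0, because the comparison is strict.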

In [None]:
def predict(X, parameters):
    # Prediction for one image or for a batch of images (one column per image)
    # Forward propagation
    a2, caches = forward_propagation(X, parameters)
    
    # Convert probabilities to 0/1 predictions with a 0.5 threshold.
    # The built-in int is used because the np.int alias was removed in NumPy 1.24.
    p = (a2 > 0.5).astype(int)
            
    return p

Test Code

In [None]:
plt.imshow(X_test[0].reshape(img_size, img_size),cmap='gray')
plt.show()
real_image = X_test[0]
prep_image = np.expand_dims(x_test[:,0],axis=1)
p = predict(prep_image, parameters) 
print("the model prediction {}".format(p))

In [None]:
p = predict(x_test, parameters)
accuracy = np.mean(p[0,:] == y_test[0,:])
print("Accuracy: {}".format(accuracy))

## Further analysis

#### 1. Choice of learning rate

**Reminder**:
The learning rate $\alpha$  determines how rapidly we update the parameters. If the learning rate is too large we may "overshoot" the optimal value. Similarly, if it is too small we will need too many iterations to converge to the best values. That's why it is crucial to use a well-tuned learning rate.

Let's compare the learning curve of our model with several choices of learning rates. Run the cell below. This should take about 1 minute. Feel free also to try different values than the three we have initialized the `learning_rates` variable to contain, and see what happens. 
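
A minimal, self-contained sketch of the effect described in the reminder, using a toy cost $J(w) = w^2$ instead of the network. The function name `run_gradient_descent` and the three values in `learning_rates` are hypothetical and chosen only to show a rate that is too small, a reasonable one, and one that is too large.

```python
def run_gradient_descent(learning_rate, num_iterations=50, w0=1.0):
    """Minimise the toy cost J(w) = w^2 and return the cost after each step."""
    w, costs = w0, []
    for _ in range(num_iterations):
        w = w - learning_rate * 2 * w   # the gradient of w^2 is 2w
        costs.append(w ** 2)
    return costs

learning_rates = [0.001, 0.1, 1.1]  # hypothetical: too small, reasonable, too large
final_costs = {lr: run_gradient_descent(lr)[-1] for lr in learning_rates}
for lr, c in final_costs.items():
    print(lr, c)
```

With the smallest rate the cost has barely moved after 50 steps, with the middle rate it is practically zero, and with the largest rate every step overshoots and the cost grows.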

#### 2. Varying the hidden layer size ####
In the example above we picked a hidden layer size of 10 (see `layers_dims` in `optimize`). Let's now get a sense of how varying the hidden layer size affects the result.
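
One direct consequence of the hidden layer size is the number of trainable parameters. The sketch below counts them for a network with one hidden layer; the helper name `count_parameters` and the list of sizes are assumptions for illustration.

```python
def count_parameters(n_x, n_h, n_y):
    """Number of trainable values in W1, b1, W2 and b2."""
    return n_h * n_x + n_h + n_y * n_h + n_y

n_x, n_y = 64 * 64, 1   # flattened 64x64 image, one output unit
for n_h in [1, 5, 10, 50]:
    print(n_h, count_parameters(n_x, n_h, n_y))
```

With 4096 inputs, the count grows by 4098 for every additional hidden unit, so larger hidden layers quickly become expensive on a dataset of only 410 images.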

#### 3. Preventing overfitting in neural networks ####

#### Dropout #### 

Dropout means ignoring a randomly chosen set of hidden nodes during the learning phase of a neural network. The nodes are selected at random with a specified probability. In the forward pass of a training iteration, the selected nodes are temporarily not used in calculating the loss; in the backward pass, they are temporarily not updated.
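
A minimal sketch of how dropout could be applied to a layer's activations. It uses the "inverted dropout" variant, which rescales the kept activations during training; this variant, the function names, and the `keep_prob` value are assumptions for the example and not part of the exercise code.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob (inverted dropout)."""
    mask = rng.random(a.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return a * mask / keep_prob, mask

def dropout_backward(da, mask, keep_prob):
    """Dropped units receive no gradient; kept units use the same scaling."""
    return da * mask / keep_prob

rng = np.random.default_rng(0)
a1 = np.ones((10, 1000))
a1_dropped, mask = dropout_forward(a1, keep_prob=0.8, rng=rng)
print(mask.mean(), a1_dropped.mean())
```

About 80% of the units are kept, and thanks to the rescaling the mean activation stays close to its original value of 1.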

#### Early stopping ####
As the name implies, training a network with early stopping will end if the model performance doesn't improve for a certain number of iterations. The model performance is measured on a validation set that is different from the training set, in order to assess how well it generalizes. During training, if the performance degrades after several (let's say 50) iterations, it means the model is overfitting and not able to generalize well anymore. Hence, stopping the learning early in this case helps prevent overfitting.
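
The stopping rule can be sketched as follows. The function name `early_stopping_index`, the `patience` parameter, and the list of validation losses are hypothetical; in practice the losses would come from evaluating the model on the validation set during training.

```python
def early_stopping_index(val_losses, patience):
    """Return (stop index, best index) for a sequence of validation losses."""
    best_loss = float("inf")
    best_index = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index = loss, i
        elif i - best_index >= patience:
            # No improvement for `patience` iterations: stop here
            return i, best_index
    return len(val_losses) - 1, best_index

# Hypothetical validation losses: improving at first, then getting worse
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_at, best_at = early_stopping_index(val_losses, patience=3)
print(stop_at, best_at)  # 6 3
```

Training would stop at iteration 6, and the parameters saved at iteration 3, where the validation loss was lowest, would be the ones to keep.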

### Exercises

1. Instead of batch gradient descent, use minibatch gradient descent to train the network. Minibatch gradient descent typically performs better in practice.
2. We used a fixed learning rate $\alpha$ for gradient descent. Implement an annealing schedule for the gradient descent learning rate.
3. We used a $\tanh$ activation function for our hidden layer. Experiment with other activation functions (some are mentioned above). Note that changing the activation function also means changing the backpropagation derivative.
4. Extend the network from two to three classes. You will need to generate an appropriate dataset for this.
5. Extend the network to four layers. Experiment with the layer size. Adding another hidden layer means you will need to adjust both the forward propagation as well as the backpropagation code.

## References
1. https://www.geeksforgeeks.org/activation-functions/
2. https://medium.com/@omkar.nallagoni/activation-functions-with-derivative-and-python-code-sigmoid-vs-tanh-vs-relu-44d23915c1f4
3. https://maelfabien.github.io/deeplearning/act/#linear-activation
4. Deep Learning Specialization on Coursera
5. https://github.com/Kulbear/deep-learning-coursera
6. https://www.kaggle.com/kanncaa1/deep-learning-tutorial-for-beginners