# Programming a deep feed-forward network

This notebook is based on a fabulous [Kaggle tutorial by DATAI](https://www.kaggle.com/kanncaa1/deep-learning-tutorial-for-beginners) and uses the "sign language digits data set", also found through the link.

We start by loading the relevant packages:

In [None]:
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 1. The dataset (this part is identical to the logistic regression exercise)

The dataset contains 64x64 images of the signs used to represent the ten digits, 0-9. Indexes 204 to 408 of the dataset show the sign for zero and indexes 822 to 1027 show the sign for one.

In [None]:
X = np.load('digits_X.npy')
y = np.load('digits_y.npy')

Each value of `X` is a matrix with pixel values, while each value of `y` is a vector representing the value of the digit (one-hot encoded):

In [None]:
X[204].shape

In [None]:
y[204]

We can, of course, display the images:

In [None]:
img_size = 64
plt.subplot(1, 2, 1)
plt.imshow(X[204].reshape(img_size, img_size))
plt.axis('off')
plt.subplot(1, 2, 2)
plt.imshow(X[822].reshape(img_size, img_size))
plt.axis('off')

We only need the zeros and ones for our purposes. Hence, start by gathering only the relevant X-variables:

In [None]:
X = np.concatenate((X[204:409], X[822:1028] ), axis=0)

For the ys, we also only want the relevant ones. Moreover, we want to make sure that instead of a vector, we simply have 0 if the digit is zero and 1 if it is one:

In [None]:
z = np.zeros(409-204)
o = np.ones(1028-822)
y = np.concatenate((z, o), axis=0).reshape(X.shape[0],1)

With the `reshape`, we make sure that `y` is a vector with two dimensions:

In [None]:
print(X.shape)
print(y.shape)

Next, we split the data into training and testing with 15% in the test set (you know the drill):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=172)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Finally, we need to "flatten" the Xs. Currently, our input is three-dimensional (each observation is a matrix). However, when we run regressions (or train models more generally), we usually have two-dimensional inputs, as it makes things a lot easier to work with. There are exceptions to this of course, specifically when using convolutional neural networks, but let's not get ahead of ourselves.

What we will do is convert each matrix (each observation's X-value) to a vector by laying its rows end to end (NumPy's `reshape` uses row-major order by default). If $X^{(i)} \in \mathbb{R}^{n \times m}$, then the flattened vector $\hat{X}^{(i)} \in \mathbb{R}^{n m}$. So we reshape accordingly:

In [None]:
X_train_flat = X_train.reshape(X_train.shape[0],X_train.shape[1]*X_train.shape[2])
X_test_flat = X_test.reshape(X_test.shape[0],X_test.shape[1]*X_test.shape[2])
print(X_train_flat.shape)
print(X_test_flat.shape)

We have 4096 pixels per observation, neatly stacked in a vector. Stacking all observations together (349 for train, 62 for test) gives us a (two-dimensional) matrix for each set.
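To see what the reshape does, here is a minimal sketch on a toy array (two made-up 2x3 "images", not the real data). The name `toy` is only for illustration; passing `-1` asks NumPy to infer the row length.

```python
import numpy as np

# Toy stand-in for the image data: 2 "images" of size 2x3 (not the real dataset).
toy = np.arange(12).reshape(2, 2, 3)

# Flatten each image into one row; -1 lets NumPy infer the row length (2*3 = 6).
toy_flat = toy.reshape(toy.shape[0], -1)

print(toy[0])          # first image as a matrix
print(toy_flat[0])     # the same image as a vector: rows laid end to end
print(toy_flat.shape)  # (2, 6)
```

With NumPy's default (row-major) order, the first row of each image comes first in the flattened vector, followed by the second row, and so on.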

## 2. A neural network with an (arbitrarily large) hidden layer of neurons

This time, we are creating a model that uses multiple neurons instead of just one. In particular, we will use one hidden layer with `hidden_layer_size` neurons (and the ReLU activation function), and an output layer with a single neuron performing the final binary classification (what activation function do we use here?)

The general approach is the same as before:

0. Choose hyperparameters
1. Initialize the model parameters $\theta$ (random weights!)
2. Until we cannot improve the cost function anymore (or we reach a certain number of iterations):
- Given your current model parameters, compute the cost function $J(\theta)$ (forward propagation)
- From the cost function, go backward to compute all the relevant derivatives (back-propagation)
- Update the parameters: $\theta := \theta - \alpha \nabla_{\theta} J(\theta)$
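The loop above can be sketched on a deliberately tiny problem. The cost below is a made-up one-dimensional quadratic (not the network's cost), and `toy_cost` and `toy_gradient` are hypothetical names used only for this illustration.

```python
# Toy cost J(theta) = (theta - 3)^2 with gradient 2*(theta - 3); the minimum is at theta = 3.
def toy_cost(theta):
    return (theta - 3.0) ** 2

def toy_gradient(theta):
    return 2.0 * (theta - 3.0)

alpha = 0.1   # learning rate (a hyperparameter)
theta = 0.0   # initial parameter value

for step in range(100):
    # theta := theta - alpha * gradient of J at theta
    theta = theta - alpha * toy_gradient(theta)

print(theta, toy_cost(theta))
```

After 100 steps, `theta` should be very close to 3. In the network, $\theta$ is a whole collection of weight matrices and bias vectors, and the gradient comes from back-propagation instead of a closed-form formula.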

### Initialization

Keep in mind that each neuron in the hidden layer has weights for all $m$ incoming edges (i.e. one for each `dimension`), as well as one bias term. The single neuron in the output layer has one weight for each of its incoming edges, as well as its own bias term.
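As a quick sanity check on the description above, the number of trainable parameters can be counted by hand. The helper `count_parameters` is a hypothetical name introduced only for this sketch.

```python
def count_parameters(dimension, hidden_layer_size):
    # Hidden layer: one weight per input feature per neuron, plus one bias per neuron.
    hidden = dimension * hidden_layer_size + hidden_layer_size
    # Output layer: one weight per hidden neuron, plus a single bias.
    output = hidden_layer_size * 1 + 1
    return hidden + output

print(count_parameters(4096, 3))  # 12295
```

Almost all of these parameters sit in the first weight matrix, because each hidden neuron is connected to all 4096 pixels.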

We will usually use dictionaries to store parameters when we have many. The function below is partially completed for you - can you finish it?

In [None]:
def initialize_parameters(seed=392,dimension=4096,hidden_layer_size=3):
    np.random.seed(seed)
    parameters = {'weights1': np.random.rand(dimension,hidden_layer_size)*0.01,  # Use np.random.rand(dim1,dim2)*0.01, inputting the correct dimensions
                  'bias1': np.zeros((1,hidden_layer_size)),   # Use np.zeros(shape), inputting the correct shape
                  'weights2': np.random.randn(hidden_layer_size,1)*0.01,   # Use np.random.randn(dim1,dim2)*0.01, inputting the correct dimensions
                  'bias2': np.zeros((1,1))}    # Use np.zeros(shape), inputting the correct shape
    return parameters

A quick try:

In [None]:
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_size=3)
print(parameters["weights1"])
print(parameters["bias1"])
print(parameters["weights2"])
print(parameters["bias2"])

### Forward propagation

The forward propagation step is quite similar to what we saw in the logistic regression. Of course, our model for $\hat{y}$ is now a whole lot more complex. But that doesn't matter: the neural network is essentially a computation graph, so we just go layer by layer, and computations are quite easy at each layer. We will make use of two helper functions to compute the ReLU activation (at the hidden layer) and the logistic sigmoid activation (at the output layer).

In [None]:
def relu(z):
    return np.maximum(0,z)     # for proper vectorization, you might want to look up `np.maximum`

In [None]:
def sigma(z):
    return 1/(1 + np.exp(-z))  # you have seen this before in logistic regression

Remember that we want to make fast computations. Hence, our functions need to be able to take a whole matrix of values and compute the activation for each element (they need to be "vectorized"). Try it out for both activation functions below:

In [None]:
Z = np.array([[1,-2,3],
              [0,6,-2],
              [3,-1,0]])
print(relu(Z))
print(sigma(Z))
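One way to convince yourself that the vectorized functions do the right thing is to compare them with an element-by-element computation. This is only a sketch: the two activation functions are repeated so that the cell runs on its own, and `relu_loop` and `sigma_loop` are illustrative names.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigma(z):
    return 1 / (1 + np.exp(-z))

Z = np.array([[1, -2, 3],
              [0, 6, -2],
              [3, -1, 0]])

# Element-by-element reference values computed with plain Python loops.
relu_loop = np.array([[max(0, v) for v in row] for row in Z])
sigma_loop = np.array([[1 / (1 + np.exp(-v)) for v in row] for row in Z])

print(np.array_equal(relu(Z), relu_loop))   # True if both agree exactly
print(np.allclose(sigma(Z), sigma_loop))    # True if both agree up to rounding
print(sigma(0))                             # 0.5
```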

Next comes the actual forward propagation step. Remember that we need to compute:
1. For each neuron at the first layer:
- $Z^{[1]}$ = the weighted sum of the inputs X, to which we add the bias
- $A^{[1]}$ = the actual activation: the neuron's activation function applied to $Z^{[1]}$
2. For the neuron at the second layer:
- $Z^{[2]}$ = the weighted sum of the inputs $A^{[1]}$, to which we add the bias
- $A^{[2]}$ = the actual activation: the neuron's activation function applied to $Z^{[2]}$
3. The cost function, given $\hat{y} = A^{[2]}$: We will stick with what we saw before in binary classification, so $J=\frac{1}{n}\sum_{i=1}^n L^{(i)}$ with $L^{(i)} = -y^{(i)} \log \hat y^{(i)} - (1-y^{(i)}) \log (1-\hat y^{(i)})$

Aside from the cost, the forward propagation should return the computed Z's and A's in what we call a "cache". This is important for back-propagation later down the line.
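To make the cost concrete, here is a small worked example of the binary cross-entropy as it is computed in the code cells of this notebook, `-y*log(yHat) - (1-y)*log(1-yHat)`, averaged over the observations. The labels and probabilities are made up for illustration.

```python
import numpy as np

y = np.array([[1.0], [0.0]])      # true labels, shape (n, 1)
yHat = np.array([[0.9], [0.2]])   # toy predicted probabilities, shape (n, 1)

# Per-observation loss: only one of the two terms is non-zero for each label.
losses = -y * np.log(yHat) - (1 - y) * np.log(1 - yHat)
cost = np.sum(losses) / y.shape[0]

print(losses.ravel())   # roughly [0.105, 0.223]
print(cost)             # roughly 0.164
```

A confident, correct prediction (0.9 for a true 1) contributes a small loss, while predictions closer to the wrong label are penalized more heavily.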

We start with a naive implementation, where we compute the activations for each neuron separately. Do you see what happens in the function below? Can you complete the missing pieces?

A few hints:
- you might benefit from adding print statements and trying out the function using `forward_propagation_naive(X_train_flat,y_train,parameters,2)`
- The input matrix $X$ has dimensions $(n,m)$, where $n$ is the number of observations and $m$ the number of features
- The weight matrix $W^{[1]}$ has dimensions $(m,\text{hidden_layer_size})$
- The bias vector $b^{[1]}$ has dimensions $(1,\text{hidden_layer_size})$
- The weight matrix $W^{[2]}$ has dimensions $(\text{hidden_layer_size},1)$
- The bias vector $b^{[2]}$ has dimensions $(1,1)$
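The dimension hints can be checked on toy arrays before touching the real data. The sizes below (`n = 5`, `m = 4`, three hidden neurons) and the random values are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, hidden_layer_size = 5, 4, 3   # toy sizes, not the real data

X = rng.normal(size=(n, m))
W1 = rng.normal(size=(m, hidden_layer_size))
b1 = np.zeros((1, hidden_layer_size))
W2 = rng.normal(size=(hidden_layer_size, 1))
b2 = np.zeros((1, 1))

Z1 = np.dot(X, W1) + b1    # (n, m) times (m, h) gives (n, h); b1 is broadcast over the rows
A1 = np.maximum(0, Z1)     # ReLU keeps the shape
Z2 = np.dot(A1, W2) + b2   # (n, h) times (h, 1) gives (n, 1)

print(Z1.shape, Z2.shape)

# The weights of a single neuron are one column of W1: a 1-D array of length m.
w = W1[:, 0]
print(np.dot(X, w).shape)  # one value of z per observation
```

Adding a bias of shape `(1, hidden_layer_size)` to a matrix of shape `(n, hidden_layer_size)` works because NumPy broadcasting repeats the bias row for every observation.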

In [None]:
def forward_propagation_naive(X,y,parameters,hidden_layer_size):
    # Layer 1 (hidden layer)
    Z1 = np.zeros((X.shape[0],hidden_layer_size))
    A1 = np.zeros((X.shape[0],hidden_layer_size))
    for neuron in range(hidden_layer_size):
        w = parameters['weights1'][:,neuron] # find the right weight (recall that we stacked our weights with shape (in,out))
        b = parameters['bias1'][0,neuron]    # find the right bias term (recall that we stacked our biases with shape (1,out))
        z = np.dot(X,w) + b                  # compute z, using np.dot. Think of the correct dimensions!
        Z1[:,neuron] = z
        A1[:,neuron] = relu(z)
    
    # Layer 2 (output layer)
    Z2 = np.dot(A1,parameters['weights2']) + parameters['bias2'] # at the second layer, there is only one node. Use np.dot again, and watch out for the correct dimensions
    A2 = sigma(Z2)
    
    # Compute the cost
    yHat = A2
    cost = np.sum(-y*np.log(yHat) - (1-y)*np.log(1-yHat))/X.shape[0]
    
    # Compute the cache
    cache = {'Z1': Z1,
             'A1': A1,
             'Z2': Z2,
             'A2': A2}
    
    return cost, cache

Try it out. If there are no mistakes, the code below should print out 0.7387257645994343

In [None]:
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_size=3)
cost, _ = forward_propagation_naive(X_train_flat,y_train,parameters,3)
print(cost)

We know that vectorization is faster, so we will vectorize not just on the observations, but also on the neurons within a layer (and then see that this is quite a bit faster). Can you complete the function?

A few additional hints here:
- Each neuron has its own total input z for each observation. Hence, $Z^{[1]}$ should have dimensions $(n,\text{hidden_layer_size})$
- The same logic holds for the neuron at the second layer. Hence, $Z^{[2]}$ should have dimensions $(n,1)$
- The activation matrix $A^{[l]}$ has the same dimensions as the total input matrix $Z^{[l]}$.

In [None]:
def forward_propagation(X,y,parameters):
    # Layer 1 (hidden layer)
    Z1 = np.dot(X,parameters['weights1']) + parameters['bias1']  # Use np.dot!
    A1 = relu(Z1)
    
    # Layer 2 (output layer)
    Z2 = np.dot(A1,parameters['weights2']) + parameters['bias2'] # Use np.dot!
    A2 = sigma(Z2)
    
    # Compute the cost - this is exactly as before!
    yHat = A2
    cost = np.sum(-y*np.log(yHat) - (1-y)*np.log(1-yHat))/X.shape[0]
    
    # Compute the cache
    cache = {'Z1': Z1,
             'A1': A1,
             'Z2': Z2,
             'A2': A2}
    
    return cost, cache

Try it out. If there are no mistakes, the code below should print out
1. 0.7387257645994343
1. (349, 3)
1. (349, 3)
1. (349, 1)
1. (349, 1)

In [None]:
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_size=3)
cost, cache = forward_propagation(X_train_flat,y_train,parameters)
print(cost)
print(cache['Z1'].shape)
print(cache['A1'].shape)
print(cache['Z2'].shape)
print(cache['A2'].shape)

Let's now compare the computation times. We create 200 sets of initial parameters for a network of width 10 and, for each set, apply forward propagation once with each of the two implementations:

In [None]:
iterations = 200
time_naive = 0
cost_naive = 0
time_vectorized = 0
cost_vectorized = 0

for it in range(iterations):
    parameters = initialize_parameters(seed=it,dimension=4096,hidden_layer_size=10)  # a different seed per iteration (np.random.randint(1) always returns 0)
    # Running things with a for-loop:
    tic = time.process_time()
    cost,_ = forward_propagation_naive(X_train_flat,y_train,parameters,10)
    toc = time.process_time()
    time_naive += 1000*(toc-tic)
    cost_naive += cost
    # Running things "vectorized":
    tic = time.process_time()
    cost,_ = forward_propagation(X_train_flat,y_train,parameters)
    toc = time.process_time()
    time_vectorized += 1000*(toc-tic)
    cost_vectorized += cost

print ("Naive: Cost = " + str(cost_naive/iterations) + ", computation time = " + str(time_naive/iterations) + "ms")
print ("Vectorized: Cost = " + str(cost_vectorized/iterations) + ", computation time = " + str(time_vectorized/iterations) + "ms")
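Speed aside, the two versions should agree numerically. Here is a minimal sketch of that check on toy data (random arrays, not the image data), comparing the neuron-by-neuron computation of a layer's total input with the single matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))   # 6 toy observations with 4 features
W = rng.normal(size=(4, 3))   # weights of 3 neurons, stacked as columns
b = rng.normal(size=(1, 3))   # one bias per neuron

# Neuron by neuron, as in the naive version.
Z_loop = np.zeros((X.shape[0], W.shape[1]))
for neuron in range(W.shape[1]):
    Z_loop[:, neuron] = np.dot(X, W[:, neuron]) + b[0, neuron]

# All neurons at once, as in the vectorized version.
Z_vec = np.dot(X, W) + b

print(np.allclose(Z_loop, Z_vec))
```

`np.allclose` is used instead of an exact comparison because the two computations may round slightly differently.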

### Back-propagation

We move on to the second step of our update: finding the gradients. Make sure you use the chain rule. We will discuss in the tutorial how to derive the derivatives, but for the programming part, the relevant computations can be found below:
- `dZ2` $= \nabla_{Z^{[2]}} J = \frac{1}{n}(A^{[2]} - y)$  (this should give you a $(n,1)$ matrix - why?)
- `dW2` $=\nabla_{W^{[2]}} J  = (A^{[1]})^T  (\nabla_{Z^{[2]}} J)$ (this should give you a $(\text{hidden_layer_size},1)$ matrix - why?)
- `db2` $=\nabla_{b^{[2]}} J = \sum_{i=1}^n \frac{\partial J}{\partial z_1^{[2](i)}}$ (you are summing up over the entries of dZ2)
- `dZ1` $= \nabla_{Z^{[1]}} J = (\nabla_{Z^{[2]}} J) (W^{[2]})^T \circ E^{[1]}$. Here, $\circ$ is element-wise multiplication and $E^{[1]}$ is a matrix of the same dimensions as $Z^{[1]}$ that is 1 when the entry is positive and 0 otherwise (this should give you a $(n,\text{hidden_layer_size})$ matrix - why?)
- `dW1` $=\nabla_{W^{[1]}} J  = (X^T)(\nabla_{Z^{[1]}} J)$ (this should give you a $(m,\text{hidden_layer_size})$ matrix - why?)
- `db1` = $\left[ \nabla_{b^{[1]}_1} J, \nabla_{b^{[1]}_2} J,..., \nabla_{b^{[1]}_{\text{hidden_layer_size}}} J \right] = \left[\sum_{i=1}^n \frac{\partial J}{\partial z_1^{[1](i)}}, \sum_{i=1}^n \frac{\partial J}{\partial z_2^{[1](i)}}, ..., \sum_{i=1}^n \frac{\partial J}{\partial z_{\text{hidden_layer_size}}^{[1](i)}} \right]$ (you are summing up over **one** of the axes of dZ1 - be careful to choose the right one)
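As a preview of where the first formula comes from, here is a short sketch of the derivation of `dZ2`, assuming the binary cross-entropy loss and the sigmoid output used in the code above. For a single observation, write $a = A^{[2]}$ and $z = Z^{[2]}$, so that $L = -y \log a - (1-y) \log (1-a)$ and $a = \sigma(z)$:

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$

$$\frac{\partial L}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = -y(1-a) + (1-y)\,a = a - y$$

Since $J$ is the average of the $n$ individual losses, each entry of `dZ2` picks up the additional factor $\frac{1}{n}$.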

A final hint about computing $E^{[1]}$. See the code below:

In [None]:
Z = np.array([[1,-2,3],
              [0,6,-2],
              [3,-1,0]])
np.where(Z>0,1,0)

We can now define the back-propagation step, which returns the gradients:

In [None]:
def back_propagation(X,y,parameters,cache):
    dZ2 = (cache['A2'] - y)/X.shape[0]
    dW2 = np.dot(cache['A1'].T,dZ2)
    db2 = np.sum(dZ2,axis=0,keepdims=True)
    dZ1 = np.dot(dZ2,parameters['weights2'].T)*np.where(cache['Z1']>0,1,0)
    dW1 = np.dot(X.T,dZ1)
    db1 = np.sum(dZ1,axis=0,keepdims=True) # make sure to sum up over the right axis (see the np.sum documentation), and to set keepdims=True
    grads = {'weights1': dW1,
             'bias1': db1,
             'weights2': dW2,
             'bias2': db2}
    return grads

If everything is programmed correctly, the code below should print out

1. -0.000508086877680383
1. (4096, 3)
1. (1, 3)
1. (3, 1)
1. (1, 1)

In [None]:
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_size=3)
cost, cache = forward_propagation(X_train_flat,y_train,parameters)
grads = back_propagation(X_train_flat,y_train,parameters,cache)
print(grads['weights1'][0,0])
print(grads['weights1'].shape)
print(grads['bias1'].shape)
print(grads['weights2'].shape)
print(grads['bias2'].shape)
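Beyond comparing against known numbers, a common way to test back-propagation is a finite-difference gradient check: nudge one parameter at a time and compare the change in cost with the analytical gradient. The sketch below does this on a tiny random problem. The functions `forward` and `backward` are compact stand-ins written for this example (same formulas as above, different names), and the data, sizes and `eps` are arbitrary assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigma(z):
    return 1 / (1 + np.exp(-z))

def forward(X, y, p):
    # One hidden ReLU layer followed by a sigmoid output neuron.
    Z1 = np.dot(X, p['weights1']) + p['bias1']
    A1 = relu(Z1)
    A2 = sigma(np.dot(A1, p['weights2']) + p['bias2'])
    cost = np.sum(-y * np.log(A2) - (1 - y) * np.log(1 - A2)) / X.shape[0]
    return cost, {'Z1': Z1, 'A1': A1, 'A2': A2}

def backward(X, y, p, cache):
    # Analytical gradients, following the formulas listed above.
    dZ2 = (cache['A2'] - y) / X.shape[0]
    dZ1 = np.dot(dZ2, p['weights2'].T) * np.where(cache['Z1'] > 0, 1, 0)
    return {'weights1': np.dot(X.T, dZ1),
            'bias1': np.sum(dZ1, axis=0, keepdims=True),
            'weights2': np.dot(cache['A1'].T, dZ2),
            'bias2': np.sum(dZ2, axis=0, keepdims=True)}

# Tiny random problem: 8 observations, 5 features, 4 hidden neurons.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
p = {'weights1': rng.normal(scale=0.5, size=(5, 4)),
     'bias1': rng.normal(scale=0.5, size=(1, 4)),
     'weights2': rng.normal(scale=0.5, size=(4, 1)),
     'bias2': rng.normal(scale=0.5, size=(1, 1))}

_, cache = forward(X, y, p)
grads = backward(X, y, p, cache)

eps = 1e-6       # size of the nudge
max_diff = 0.0   # largest gap between numerical and analytical gradient
for key in p:
    for idx in np.ndindex(p[key].shape):
        original = p[key][idx]
        p[key][idx] = original + eps
        cost_plus, _ = forward(X, y, p)
        p[key][idx] = original - eps
        cost_minus, _ = forward(X, y, p)
        p[key][idx] = original   # restore the parameter
        numeric = (cost_plus - cost_minus) / (2 * eps)
        max_diff = max(max_diff, abs(numeric - grads[key][idx]))

print(max_diff)
```

If the formulas are implemented consistently, `max_diff` should be tiny. A large value usually points to a mistake in one of the gradients, although the check can also be thrown off when a hidden unit's input lies almost exactly at zero, where the ReLU is not differentiable.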

### Putting it together: parameter updating and training

In each iteration of our learning procedure, we update the parameters, hopefully moving closer towards the optimum. How we update the parameters is determined by the gradient, as well as the learning rate (a hyper-parameter). Let's define one update step:
1. Compute the `forward_propagation` step (returning `cost` and `cache`)
1. Compute the `back_propagation` step (using `cache` from `forward_propagation`)
1. Update each entry in `parameters` as follows: $\theta := \theta - \alpha \nabla_{\theta} J$ (Because we made sure above that the shapes are "in the right way", we don't have to worry about individual parameters, but can update whole groups - also, we made sure that the parameters and their gradients are referenced in the same way in both dictionaries. Note that $\alpha$ is the learning rate)
1. Return the updated dictionary `parameters` and the `cost` from `forward_propagation`
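Step 3 relies on the two dictionaries sharing their keys. A minimal sketch with made-up numbers (the arrays and the name `updated` are for illustration only):

```python
import numpy as np

parameters = {'weights1': np.array([[0.5, -0.2]]), 'bias1': np.zeros((1, 2))}
grads = {'weights1': np.array([[0.1, 0.3]]), 'bias1': np.array([[0.2, -0.4]])}
learning_rate = 0.1

# Same keys in both dictionaries, so one loop updates every parameter group.
updated = {key: parameters[key] - learning_rate * grads[key] for key in parameters}

print(updated['weights1'])   # approximately [[0.49, -0.23]]
print(updated['bias1'])      # approximately [[-0.02, 0.04]]
```

Each array is updated as a whole, which works because a gradient has the same shape as the parameter it belongs to.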

In [None]:
def parameter_update(X,y,parameters,learning_rate):
    cost, cache = forward_propagation(X,y,parameters)
    grads = back_propagation(X,y,parameters,cache)
    for entry in parameters:
        parameters[entry] = parameters[entry] - learning_rate*grads[entry]
    return parameters, cost

We can now train our model by running the parameter update multiple times. We will use 3 neurons at the hidden layer, a learning rate of 0.01 and run the algorithm for 2,500 iterations. Can you adjust the function below? Make sure to initialize the parameters with our custom-made function. Also, each time you run the parameter update, store the resulting `cost` in a list `cost_list`. At the end, return the final `parameters` dictionary and the `cost_list`.

In [None]:
def model_training(X,y,hidden_layer_size=3,learning_rate=0.01,iterations=2500,verbose=True):
    parameters = initialize_parameters(seed=0,dimension=X.shape[1],hidden_layer_size=hidden_layer_size) # Initialize the parameters (np.random.randint(1) always returns 0, so the seed is written out explicitly)
    cost_list = []
    for it in range(iterations):
        parameters,cost = parameter_update(X,y,parameters,learning_rate) # for each iteration, update the parameters using forward and back propagation
        cost_list.append(cost)  # Also, make sure to add the cost to the cost_list
        if verbose:
            print('Cost after iteration %i: %f' %(it,cost))
    return parameters, cost_list

Now, train the model and display the training loss:

In [None]:
parameters, cost_list = model_training(X_train_flat,y_train)

In [None]:
plt.plot(range(len(cost_list)),cost_list)
plt.show()

### Making predictions

We don't just want to train a neural network, we also want to use it to make predictions. For this purpose, we create a `predict` function, that takes an input X, as well as the parameters of the trained model.

Don't worry about computing the prediction: we already did that when we implemented the forward propagation. Note that forward propagation takes both an `X` and a `y` as input, but here we don't care about the cost (only about `yHat = cache['A2']`), so we can pass a dummy `y` (for example, all zeros), as long as it has the correct shape.

In [None]:
def predict(X,parameters):
    _, cache = forward_propagation(X,np.zeros((X.shape[0],1)),parameters)
    yHat = cache['A2']    # Get yHat from the cache
    y_prediction = (yHat > 0.5)      # Make a prediction - when yHat > 0.5, assume 1, otherwise 0
    return y_prediction

Let's see how well our predictions perform, both on the training set and the test set:

In [None]:
y_prediction_test = predict(X_test_flat,parameters)
y_prediction_train = predict(X_train_flat,parameters)

print("train accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_train - y_train)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))

Getting there! Certainly some overfitting happening, but 95% test accuracy is not bad at all.
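The accuracy expression used above can look a bit cryptic. The sketch below, on made-up probabilities and labels, shows that it is the same as the share of predictions that match the label; `acc_abs` and `acc_eq` are illustrative names.

```python
import numpy as np

yHat = np.array([[0.1], [0.8], [0.6], [0.3]])   # toy predicted probabilities
y_true = np.array([[0.], [1.], [0.], [0.]])     # toy labels

y_pred = (yHat > 0.5)   # boolean array of shape (4, 1)

# Formula used above: each wrong prediction contributes |prediction - label| = 1.
acc_abs = 100 - np.mean(np.abs(y_pred - y_true)) * 100
# Equivalent reading: percentage of predictions that equal the label.
acc_eq = np.mean(y_pred == y_true) * 100

print(acc_abs, acc_eq)   # 75.0 75.0
```

Three of the four toy predictions are correct, so both expressions give 75%.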

## 3. A neural network with arbitrarily many, arbitrarily large hidden layers of ReLUs

Let's go deeper. We are creating a model that uses multiple hidden layers. In particular, the `L-1` hidden layers will have sizes `hidden_layer_sizes` = $[n_1,n_2,...,n_{L-1}]$ (and the ReLU activation function). In addition, there is again an output layer with a single neuron performing the final binary classification (of course, using the logistic sigmoid function).

We use exactly the same approach as before, just that we need to "automate" our computations a bit more.

### Initialization

Keep in mind that each neuron in the hidden layer $l$ has weights for all $n_{l-1}$ incoming edges, as well as one bias term (in the case of layer 1, $n_0$ is the number of incoming edges from the input, so the number of features of `X`).

We will usually use dictionaries to store parameters when we have many. Note the naming convention for layers starting from 1 doesn't really fit well with the typical naming convention of Python. But we can make our life easier by treating the input layer as "Layer 0".
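With the input treated as "Layer 0", the shapes of all parameters follow from one list of layer sizes. A small sketch, assuming two hidden layers of sizes 3 and 2 (the names `sizes`, `weight_shapes` and `bias_shapes` are only for illustration):

```python
dimension = 4096
hidden_layer_sizes = [3, 2]

# Layer 0 is the input; the final entry is the single output neuron.
sizes = [dimension] + hidden_layer_sizes + [1]

weight_shapes = {}
bias_shapes = {}
for l in range(1, len(sizes)):
    weight_shapes['weights' + str(l)] = (sizes[l - 1], sizes[l])   # (in, out)
    bias_shapes['bias' + str(l)] = (1, sizes[l])                   # (1, out)

print(weight_shapes)
print(bias_shapes)
```

Layer $l$ reads its input size from position `l-1` of the list, which is exactly why starting the list with the input dimension is convenient.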

In [None]:
def initialize_parameters(seed=392,dimension=4096,hidden_layer_sizes=[3,3,3]):
    np.random.seed(seed)
    parameters = {}
    hidden_layer_sizes = [dimension] + hidden_layer_sizes
    
    for l in range(1,len(hidden_layer_sizes)+1):
        size_in = hidden_layer_sizes[l-1]
        if l < len(hidden_layer_sizes):
            size_out = hidden_layer_sizes[l]
            parameters['weights' + str(l)] = np.random.rand(size_in,size_out)*0.1 - 0.05    # Note the different initialization of parameters. This is to help learning along
        else:
            size_out = 1
            parameters['weights' + str(l)] = np.random.rand(size_in,size_out)*0.1 - 0.05    # Note the different initialization of parameters. This is to help learning along
        parameters['bias' + str(l)] = np.zeros((1,size_out))
    
    return parameters

A quick try:

In [None]:
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_sizes=[3,3,3])
print(parameters["weights1"])
print(parameters["bias1"])
print(parameters["weights2"])
print(parameters["bias2"])
print(parameters["weights3"])
print(parameters["bias3"])
print(parameters["weights4"])
print(parameters["bias4"])

### Forward propagation

Forward propagation works the same as before, just over a few more layers. We will make use again of two helper functions to compute the ReLU activation (at the hidden layers) and the logistic sigmoid activation (at the output layer).

In [None]:
def relu(z):
    return np.maximum(0,z)

In [None]:
def sigma(z):
    return 1/(1 + np.exp(-z))

Next comes the actual forward propagation step. Remember that we need to compute:
1. For each neuron at the hidden layers $l=1,...,L-1$:
- $Z^{[l]}$ = the weighted sum of the inputs $A^{[l-1]}$, to which we add the bias (in the case of $Z^{[1]}$, we simply have $A^{[0]} = X$)
- $A^{[l]}$ = the actual activation: the neuron's activation function applied to $Z^{[l]}$
2. For the neuron at the output layer $L$:
- $Z^{[L]}$ = the weighted sum of the inputs $A^{[L-1]}$, to which we add the bias
- $A^{[L]}$ = the actual activation: the neuron's activation function applied to $Z^{[L]}$
3. The cost function, given $\hat{y} = A^{[L]}$: We will stick with what we saw before in binary classification, so $J=\frac{1}{n}\sum_{i=1}^n L^{(i)}$ with $L^{(i)} = -y^{(i)} \log \hat y^{(i)} - (1-y^{(i)}) \log (1-\hat y^{(i)})$

In [None]:
def forward_propagation(X,y,parameters,hidden_layer_sizes):
    cache = {}
    
    # Hidden layers
    current_A = X
    for l in range(1,len(hidden_layer_sizes)+1):
        current_Z = np.dot(current_A,parameters['weights'+str(l)]) + parameters['bias'+str(l)]
        current_A = relu(current_Z)
        cache['Z'+str(l)] = current_Z
        cache['A'+str(l)] = current_A

    # Output layer
    L = len(hidden_layer_sizes) + 1
    current_Z = np.dot(current_A,parameters['weights'+str(L)]) + parameters['bias'+str(L)]
    current_A = sigma(current_Z)
    cache['Z'+str(L)] = current_Z
    cache['A'+str(L)] = current_A
    
    # Compute the cost
    yHat = current_A 
    cost = np.sum(-y*np.log(yHat) - (1-y)*np.log(1-yHat))/X.shape[0] 

    return cost, cache

Note: we could technically read the layers out from the parameters, but we make things a bit simpler here and simply give the layers as an input
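For the curious, here is a sketch of how the layer sizes could be read from the parameter dictionary, assuming the `weights<l>` naming used in this notebook. The helper `infer_hidden_layer_sizes` is hypothetical and is not used elsewhere.

```python
import numpy as np

def infer_hidden_layer_sizes(parameters):
    # One 'weights<l>' entry per layer; the last one belongs to the output layer.
    n_layers = sum(1 for key in parameters if key.startswith('weights'))
    # The number of columns of a hidden layer's weight matrix is that layer's width.
    return [parameters['weights' + str(l)].shape[1] for l in range(1, n_layers)]

# Toy dictionary with the shapes of a [3, 2] network (values do not matter here).
toy_parameters = {'weights1': np.zeros((4096, 3)), 'bias1': np.zeros((1, 3)),
                  'weights2': np.zeros((3, 2)), 'bias2': np.zeros((1, 2)),
                  'weights3': np.zeros((2, 1)), 'bias3': np.zeros((1, 1))}

print(infer_hidden_layer_sizes(toy_parameters))   # [3, 2]
```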

Try it out. If there are no mistakes, the code below should print out
1. 0.6931471805599453
1. (349, 3)
1. (349, 3)
1. (349, 2)
1. (349, 2)
1. (349, 1)
1. (349, 1)

In [None]:
hidden_layer_sizes = [3,2]
parameters = initialize_parameters(seed=123,dimension=4096,hidden_layer_sizes=hidden_layer_sizes)
cost, cache = forward_propagation(X_train_flat,y_train,parameters,hidden_layer_sizes)
print(cost)
print(cache['Z1'].shape)
print(cache['A1'].shape)
print(cache['Z2'].shape)
print(cache['A2'].shape)
print(cache['Z3'].shape)
print(cache['A3'].shape)

### Back-propagation

We move on to the second step of our update: finding the gradients. We do essentially the same as before, just that we keep going backward from layer $L-1$. The following generalizations of the derivatives may be helpful for the output layer (note - nothing changes here from before except the indexing):

- `dZ(L)` $= \nabla_{Z^{[L]}} J = \frac{1}{n}(A^{[L]} - y)$  (this should give you a $(n,1)$ matrix - why?)
- `dW(L)` $=\nabla_{W^{[L]}} J  = (A^{[L-1]})^T  (\nabla_{Z^{[L]}} J)$ (this should give you a $(n_{L-1},1)$ matrix - why?)
- `db(L)` $=\nabla_{b^{[L]}} J = \sum_{i=1}^n \frac{\partial J}{\partial z_1^{[L](i)}}$ (you are summing up over the entries of dZ(L))

And for the hidden layers $l=1,...,L-1$ (this is identical to the single hidden layer earlier, just that we generalize the indexes):

- `dZ(l)` $= \nabla_{Z^{[l]}} J = (\nabla_{Z^{[l+1]}} J) (W^{[l+1]})^T \circ E^{[l]}$. Here, $\circ$ is element-wise multiplication and $E^{[l]}$ is a matrix of the same dimensions as $Z^{[l]}$ that is 1 when the entry is positive and 0 otherwise (this should give you a $(n,n_l)$ matrix - why?)
- `dW(l)` $=\nabla_{W^{[l]}} J  = (A^{[l-1]})^T (\nabla_{Z^{[l]}} J)$ (this should give you a $(n_{l-1},n_l)$ matrix - why? Note that $A^{[0]} = X$, with $n_0 = m$)
- `db(l)` = $\left[ \nabla_{b^{[l]}_1} J, \nabla_{b^{[l]}_2} J,..., \nabla_{b^{[l]}_{n_l}} J \right] = \left[\sum_{i=1}^n \frac{\partial J}{\partial z_1^{[l](i)}}, \sum_{i=1}^n \frac{\partial J}{\partial z_2^{[l](i)}}, ..., \sum_{i=1}^n \frac{\partial J}{\partial z_{n_{l}}^{[l](i)}} \right]$ (you are summing up over **one** of the axes of dZ(l) - be careful to choose the right one)

We can now define the back-propagation step, which returns the gradients. It might be helpful, instead of creating new variables `dZ` for each step, to simply overwrite the existing ones.

In [None]:
def back_propagation(X,y,parameters,cache,hidden_layer_sizes):
    grads = {}
    
    # Output layer
    L = len(hidden_layer_sizes) + 1
    dZ = (cache['A' + str(L)] - y)/X.shape[0]
    prev_A = cache['A' + str(L-1)]
    dW = np.dot(prev_A.T,dZ)
    db = np.sum(dZ,axis=0,keepdims=True)
    grads['weights' + str(L)] = dW
    grads['bias' + str(L)] = db
    
    # Hidden layers
    for l in range(L-1,0,-1):
        dZ = np.dot(dZ,parameters['weights' + str(l+1)].T)*np.where(cache['Z' + str(l)]>0,1,0)
        if l > 1:
            prev_A = cache['A' + str(l-1)]
        else:
            prev_A = X
        dW = np.dot(prev_A.T,dZ)
        db = np.sum(dZ,axis=0,keepdims=True)
        grads['weights' + str(l)] = dW
        grads['bias' + str(l)] = db
    
    return grads

If everything is programmed correctly, the code below should print out

1. -0.0014326647564469955
1. (4096, 3)
1. (1, 3)
1. (3, 2)
1. (1, 2)
1. (2, 1)
1. (1, 1)

In [None]:
hidden_layer_sizes = [3,2]
parameters = initialize_parameters(seed=456,dimension=4096,hidden_layer_sizes=hidden_layer_sizes)
cost, cache = forward_propagation(X_train_flat,y_train,parameters,hidden_layer_sizes)
grads = back_propagation(X_train_flat,y_train,parameters,cache,hidden_layer_sizes)
print(grads['bias3'][0,0])
print(grads['weights1'].shape)
print(grads['bias1'].shape)
print(grads['weights2'].shape)
print(grads['bias2'].shape)
print(grads['weights3'].shape)
print(grads['bias3'].shape)

### Putting it together: parameter updating and training

The parameter updating step is identical to before, only that we need to add the `hidden_layer_sizes` input

In [None]:
def parameter_update(X,y,parameters,learning_rate,hidden_layer_sizes):
    cost, cache = forward_propagation(X,y,parameters,hidden_layer_sizes)
    grads = back_propagation(X,y,parameters,cache,hidden_layer_sizes)
    for entry in parameters:
        parameters[entry] = parameters[entry] - learning_rate*grads[entry]
    return parameters, cost

We can now train our model by running the parameter update multiple times. We will use 3 neurons at the first hidden layer and 2 neurons at the second hidden layer. Notice the high number of iterations that will be needed (later on, we will see how one can improve upon that).

Can you adjust the function below? Make sure to initialize the parameters with our custom-made function. Also, each time you run the parameter update, store the resulting `cost` in a list `cost_list`. At the end, return the final `parameters` dictionary and the `cost_list`.

In [None]:
def model_training(X,y,hidden_layer_sizes=[3,2],learning_rate=0.005,iterations=20000,verbose=True):
    parameters = initialize_parameters(seed=0,dimension=X.shape[1],hidden_layer_sizes=hidden_layer_sizes)  # np.random.randint(1) always returns 0, so the seed is written out explicitly
    cost_list = []
    for it in range(iterations):
        parameters,cost = parameter_update(X,y,parameters,learning_rate,hidden_layer_sizes)
        cost_list.append(cost)
        if verbose:
            print('Cost after iteration %i: %f' %(it,cost))
    return parameters, cost_list

Now, train the model and display the training loss. This might take a bit:

In [None]:
hidden_layer_sizes = [3,2]
parameters, cost_list = model_training(X_train_flat,y_train,hidden_layer_sizes)

In [None]:
plt.plot(range(len(cost_list)),cost_list)
plt.show()

We can see already by looking at the report from the model training, and even more clearly in the graph, that we need quite some time to get to any sensible state with the model. If you play around with the initialization, you will notice that it is hard to even find a value for which the model trains at all. This complexity in optimization is inherent to deep neural networks. Luckily, there are advanced optimization algorithms that speed things up (and increase our chances of even getting to a reasonable training result). We will learn about those in the next lectures.
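One way such a stall can arise, offered here as an illustration and not as a diagnosis of the run above: if every unit in a hidden layer outputs zero for every observation, the output neuron sees only zeros. With a zero bias it then predicts 0.5 for everything, the cost equals $\log 2 \approx 0.693$, and the ReLU mask blocks the gradient on its way back. The arrays below are made up for this sketch.

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

y = np.array([[1.], [0.], [1.], [1.]])   # toy labels
Z_prev = np.zeros((4, 2))                # total inputs of the last hidden layer, none positive
A_prev = np.maximum(0, Z_prev)           # ReLU output: all zeros
W = np.array([[0.03], [-0.02]])          # arbitrary output weights
b = np.zeros((1, 1))                     # bias still at its initial value

yHat = sigma(np.dot(A_prev, W) + b)      # sigma(0) = 0.5 for every observation
cost = np.sum(-y * np.log(yHat) - (1 - y) * np.log(1 - yHat)) / y.shape[0]
mask = np.where(Z_prev > 0, 1, 0)        # ReLU derivative used in back-propagation

print(yHat.ravel(), cost, np.log(2))
print(mask)                              # all zeros: no gradient reaches earlier layers
```

In this situation, only the output bias receives a non-zero gradient, so the weights of the earlier layers do not move.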

### Making predictions

As before, we can now use our model to make predictions. The function is basically as before, just that we have to make sure to get the correct activation matrix from the cache (the one at the last layer).

In [None]:
def predict(X,parameters,hidden_layer_sizes=[3,2]):
    _, cache = forward_propagation(X,np.zeros((X.shape[0],1)),parameters,hidden_layer_sizes)
    yHat = cache['A' + str(len(hidden_layer_sizes)+1)]    # Get yHat from the cache (the last layer's activation!)
    y_prediction = (yHat > 0.5)      # Make a prediction - when yHat > 0.5, assume 1, otherwise 0
    return y_prediction

Let's see how well our predictions perform, both on the training set and the test set:

In [None]:
y_prediction_test = predict(X_test_flat,parameters,hidden_layer_sizes)
y_prediction_train = predict(X_train_flat,parameters,hidden_layer_sizes)

print("train accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_train - y_train)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))

A lot of training effort, and we actually did worse. With some regularization, we can probably do better, but we'll leave that for later.