# Bonus: Initialization \[10 points\]

<b><span style="color:#00CED1">Association:</span></b>
<span style="color:#00CED1">Otto-Friedrich University of Bamberg</span>
<span style="color:#00CED1">Chair of Explainable Machine Learning (xAI)</span>
<span style="color:#00CED1">Deep Learning Assignments</span>

<b><span style="color:#00CED1">Description:</span></b>
<span style="color:#00CED1">This notebook introduces different weight initialization methods including random, zeros, and He initialization.</span>
<span style="color:#00CED1">Students will implement all of them and learn how each of them leads to different results.</span>

<span style="color:#00CED1"><b>Author:</b> Sebastian Doerrich</span>
<span style="color:#00CED1"><b>Copyright:</b> Copyright (c) 2022, Chair of Explainable Machine Learning (xAI), Otto-Friedrich University of Bamberg</span>
<span style="color:#00CED1"><b>Credits:</b> Christian Ledig, Sebastian Doerrich</span>
<span style="color:#00CED1"><b>License:</b> CC BY-SA</span>
<span style="color:#00CED1"><b>Version:</b> 1.0</span>
<span style="color:#00CED1"><b>Python:</b> Python 3</span>
<span style="color:#00CED1"><b>Maintainer:</b> Sebastian Doerrich</span>
<span style="color:#00CED1"><b>Email:</b> sebastian.doerrich@uni-bamberg.de</span>
<span style="color:#00CED1"><b>Status:</b> Production</span>

## Context
Welcome to the first part of your third programming assignment.
In this first part, you will implement different weight initialization methods and see how they lead to different results.

## Motivation
Training a neural network requires specifying initial values for its weight and bias parameters. A well-chosen initialization method can help the model learn faster and reach a better result.

## Instructions
- You will be using Python 3.
- After coding your function, run the cell right below it to check if your result is correct.

## Important Notes for Your Submission
Before submitting your assignment, please make sure that all of the following hold:

1. You have not added any _extra_ `print` statement(s) in the assignment.
2. You have not added any _extra_ code cell(s) in the assignment.
3. You have not changed any of the function parameters.
4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.
5. You are not changing the assignment code where it is not required, like creating _extra_ variables.

If you violate any of the points above, our test scripts will fail and you will receive **0 points** for the respective task.

## Table of Contents
- [0 - Import the Necessary Libraries](#0)
- [1 - Visualization](#1)
- [2 - Data Generation](#2)
- [3 - Preprocessing](#3)
- [4 - Initialization Methods](#4)
    - [4.1 - Zero Initialization](#4-1)
        - [4.1.1 - Applying the Zero Initialization](#4-1-1)
        - [4.1.2 - Analyzing the Zero Initialization](#4-1-2)
        - [4.1.3 - Discussing the Zero Initialization](#4-1-3)
    - [4.2 - Random Initialization](#4-2)
        - [4.2.1 - Applying the Random Initialization](#4-2-1)
        - [4.2.2 - Analyzing the Random Initialization](#4-2-2)
        - [4.2.3 - Discussing the Random Initialization](#4-2-3)
    - [4.3 - He Initialization](#4-3)
        - [4.3.1 - Applying the He Initialization](#4-3-1)
        - [4.3.2 - Analyzing the He Initialization](#4-3-2)
        - [4.3.3 - Discussing the He Initialization](#4-3-3)
    - [4.4 - Conclusion](#4-4)
- [5 - What to remember](#5)
- [6 - End of Exercise](#6)

<a id='0'></a>
## 0 - Import the Necessary Libraries ##

In [None]:
# Import packages
import numpy as np
import sklearn
import sklearn.datasets
import pickle

import matplotlib.pyplot as plt
%matplotlib inline

<a id='1'></a>
## 1 - Visualization ##

These functions will be used for visualization. We provide those for you, so you can plot your results. You don't have to code anything here.

In [None]:
class Visualization:
    """
    This class is used to collect all visualizations.
    """

    @staticmethod
    def visualize_the_data(train_X: np.ndarray, train_Y: np.ndarray, test_X: np.ndarray, test_Y: np.ndarray, scale=40, cmap=plt.cm.Spectral):
        """
        Visualize the samples.

        :param train_X: Training samples.
        :param train_Y: Training Labels.
        :param test_X: Test samples.
        :param test_Y: Test labels.
        :param scale: The marker size.
        :param cmap: The Colormap instance or registered colormap name used to map scalar data to colors.
        """

        # Create both plots
        _, (train_ax, test_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(8, 4))

        # Create plot for the training samples
        train_scatter = train_ax.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=scale, cmap=cmap, label=[0, 1])
        train_ax.set_ylabel("Feature #1")
        train_ax.set_xlabel("Feature #0")
        train_ax.set_title("Training data")
        train_ax_legend = train_ax.legend(*train_scatter.legend_elements(), title="Classes")
        train_ax.add_artist(train_ax_legend)

        # Create the plot for the test samples
        test_scatter = test_ax.scatter(test_X[:, 0], test_X[:, 1], c=test_Y, s=scale, cmap=cmap)
        test_ax.set_xlabel("Feature #0")
        test_ax.set_title("Test data")
        test_ax_legend = test_ax.legend(*test_scatter.legend_elements(), title="Classes")
        test_ax.add_artist(test_ax_legend)

    @staticmethod
    def plot_the_cost(iterations: list, costs: list, learning_rate: float):
        """
        Plot the cost of the network during training.

        :param iterations: Training iterations in the form [0, 1000, 2000, ..., 15000].
        :param costs: Costs of the network for each training iteration.
        :param learning_rate: Learning rate of the optimization algorithm.
        """

        plt.plot(iterations, costs)
        plt.ylabel('Cost')
        plt.xlabel('Iterations')
        plt.title("Learning rate = " + str(learning_rate))
        plt.show()

    @staticmethod
    def plot_decision_boundary(Z_train: np.ndarray, X_train: np.ndarray, Y_train: np.ndarray, Z_test: np.ndarray, X_test: np.ndarray, Y_test: np.ndarray):
        """
        Plot the decision boundary of the model.

        :param Z_train: Decision Function value for the whole grid for the training data.
        :param X_train: Training samples.
        :param Y_train: Training Labels.
        :param Z_test: Decision Function value for the whole grid for the test data.
        :param X_test: Test samples.
        :param Y_test: Test Labels.
        """

        # Initialize the hyperparameters
        h = 0.01

        # Set min and max values and give it some padding
        x_min_train, x_max_train = X_train[0, :].min() - 1, X_train[0, :].max() + 1
        y_min_train, y_max_train = X_train[1, :].min() - 1, X_train[1, :].max() + 1

        x_min_test, x_max_test = X_test[0, :].min() - 1, X_test[0, :].max() + 1
        y_min_test, y_max_test = X_test[1, :].min() - 1, X_test[1, :].max() + 1

        # Generate a grid of points with distance h between them
        xx_train, yy_train = np.meshgrid(np.arange(x_min_train, x_max_train, h), np.arange(y_min_train, y_max_train, h))

        xx_test, yy_test = np.meshgrid(np.arange(x_min_test, x_max_test, h), np.arange(y_min_test, y_max_test, h))

        # Create both plots
        _, (train_ax, test_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(8, 4))

        # Plot the decision boundary and training samples
        train_ax.set_xlim([-1.5,1.5])
        train_ax.set_ylim([-1.5,1.5])
        train_ax.set_ylabel("Feature #1")
        train_ax.set_xlabel("Feature #0")
        train_ax.set_title("Training data")

        train_ax.contourf(xx_train, yy_train, Z_train, cmap=plt.cm.Spectral)
        train_scatter = train_ax.scatter(X_train[0, :], X_train[1, :], c=Y_train, cmap=plt.cm.Spectral)

        train_ax_legend = train_ax.legend(*train_scatter.legend_elements(), title="Classes")
        train_ax.add_artist(train_ax_legend)

        # Plot the decision boundary and test samples
        test_ax.set_xlim([-1.5,1.5])
        test_ax.set_ylim([-1.5,1.5])
        test_ax.set_ylabel("Feature #1")
        test_ax.set_xlabel("Feature #0")
        test_ax.set_title("Test data")
        test_ax.contourf(xx_test, yy_test, Z_test, cmap=plt.cm.Spectral)
        test_scatter = test_ax.scatter(X_test[0, :], X_test[1, :], c=Y_test, cmap=plt.cm.Spectral)

        test_ax_legend = test_ax.legend(*test_scatter.legend_elements(), title="Classes")
        test_ax.add_artist(test_ax_legend)

<a id='2'></a>
## 2 - Data Generation \[2 points\] ##

As always, the first task is to obtain the dataset. This time, however, we will generate the samples ourselves instead of loading them from somewhere. For this purpose, we will use a simple toy dataset provided by the scikit-learn package. The respective function creates 2D samples that form a large circle (class 0) containing a smaller circle (class 1), as shown in the figure below:
<div>
<img src="../img/initialization_dataset.png" width="700"/>
</div>


<b><span style="color:teal">TODO:</span></b>
<ol>
<li><span style="color:teal">Implement the method <span style="color:#DC143C"><em>generate_dataset</em></span> which generates our training and test samples.</span></li>
</ol>

In [None]:
def generate_dataset(n_samples_train: int, n_samples_test: int, noise_train: float, noise_test: float, seed_train: int, seed_test: int):
    """
    Generate the data set.

    :param n_samples_train: Number of training samples.
    :param n_samples_test: Number of test samples.
    :param noise_train: Standard deviation of Gaussian noise added to the training samples.
    :param noise_test: Standard deviation of Gaussian noise added to the test samples.
    :param seed_train: Random seed for generation of the training samples.
    :param seed_test: Random seed for generation of the test samples.

    :return: Training samples, training labels, test samples, test labels
    """

    train_X, train_Y = None, None
    test_X, test_Y = None, None

    #############################################################################
    #                            START OF YOUR CODE                             #
    # TODO:                                                                     #
    #    1) Generate the training samples using the function                    #
    #       sklearn.datasets.make_circles() with the given parameters.          #
    #    2) Create the test samples using the function                          #
    #       sklearn.datasets.make_circles() with the given parameters.          #
    #    3) sklearn.datasets.make_circles() uses randomization. Hence, you have #
    #       to use the provided seed, so that your results match mine.          #
    #############################################################################


    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return train_X, train_Y, test_X, test_Y

In [None]:
# Test your code

!python ./tests/test_initialization.py --test_case TestDataGeneration

In [None]:
# Plot the dataset
# Initialize the hyperparameters
n_samples_train, n_samples_test = 300, 100
noise_train, noise_test = .05, .05
seed_train, seed_test = 1, 2

# Generate the dataset
train_X, train_Y, test_X, test_Y = generate_dataset(n_samples_train, n_samples_test, noise_train, noise_test, seed_train, seed_test)

# Plot the dataset
Visualization.visualize_the_data(train_X, train_Y, test_X, test_Y)

<a id='3'></a>
## 3 - Preprocessing \[2 points\] ##

Before we can process our samples further, we need to preprocess them, i.e. bring them into the format the network expects: one column per sample for the inputs and a single row of labels.
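To make the target layout concrete, here is a minimal sketch on a tiny hand-made array. The names `X_toy`, `y_toy`, `X_cols` and `y_row` are purely illustrative, and `np.expand_dims` is only one of several ways to add the extra dimension.

```python
import numpy as np

# Toy stand-ins for generated data: 4 samples with 2 features each.
X_toy = np.array([[0.1, 0.2],
                  [0.3, 0.4],
                  [0.5, 0.6],
                  [0.7, 0.8]])   # shape (4, 2): one row per sample
y_toy = np.array([0, 1, 1, 0])   # shape (4,): one label per sample

# Target layout: one column per sample, labels as a single row.
X_cols = X_toy.T                        # shape (2, 4)
y_row = np.expand_dims(y_toy, axis=0)   # shape (1, 4)

print(X_cols.shape, y_row.shape)  # (2, 4) (1, 4)
```

With this layout, `X_cols[:, i]` is the i-th sample and `y_row[0, i]` is its label.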

<b><span style="color:teal">TODO:</span></b>
<ol>
<li><span style="color:teal">Implement the method <span style="color:#DC143C"><em>preprocess</em></span> which preprocesses our training and test sets.</span></li>
</ol>

In [None]:
def preprocess(train_X: np.ndarray, train_Y: np.ndarray, test_X: np.ndarray, test_Y: np.ndarray):
    """
    Preprocess the data.

    :param train_X: Generated training samples.
    :param train_Y: Generated training labels.
    :param test_X: Generated test samples.
    :param test_Y: Generated test labels.

    :return: preprocessed training samples, preprocessed training labels, preprocessed test samples, preprocessed test labels
    """

    preprocessed_train_X, preprocessed_train_Y = None, None
    preprocessed_test_X, preprocessed_test_Y = None, None

    #############################################################################
    #                            START OF YOUR CODE                             #
    # TODO:                                                                     #
    #    1) Transpose the train samples.                                        #
    #    2) Upscale the train labels by one dimension to a 2D array at index 0. #
    #    3) Transpose the test samples.                                         #
    #    4) Upscale the test labels by one dimension to a 2D array at index 0.  #
    #############################################################################


    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return preprocessed_train_X, preprocessed_train_Y, preprocessed_test_X, preprocessed_test_Y

In [None]:
# Test your code

!python ./tests/test_initialization.py --test_case TestPreprocessing

<a id='4'></a>
## 4 - Initialization Methods \[6 points\] ##

You will use a 3-layer neural network to classify the samples generated above. The model's layer structure is:
- Fully Connected Layer -> ReLU Activation -> Fully Connected Layer -> ReLU Activation -> Fully Connected Layer -> Sigmoid Activation (Classifier)

Since this exercise is about analyzing different initialization methods, you don't have to implement the model yourself; we provide it for you. However, you will experiment with the following initialization methods:
- *Zeros initialization*: Initializes the weights with zeros.
- *Random initialization*: Initializes the weights to large random values.
- *He initialization*: Initializes the weights to random values scaled according to a paper by He et al., 2015.

As we now know, there are two types of parameters to initialize in a neural network:
- the weight matrices $(W^{[1]}, W^{[2]}, ..., W^{[L]})$
- the bias vectors $(b^{[1]}, b^{[2]}, ..., b^{[L]})$
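As a small illustration of how these parameters are laid out, the sketch below lists the shape of every weight matrix and bias vector for a made-up architecture. The layer sizes in `layers_dims_demo` are hypothetical and not the ones used by the provided model; the shape convention follows the docstrings of the initialization functions in this notebook.

```python
# Hypothetical layer sizes: 2 input features, two hidden layers, 1 output unit.
layers_dims_demo = [2, 10, 5, 1]

shapes_demo = {}
# Index 0 is the input layer, which has no parameters of its own.
for l in range(1, len(layers_dims_demo)):
    shapes_demo[f"W{l}"] = (layers_dims_demo[l], layers_dims_demo[l - 1])
    shapes_demo[f"b{l}"] = (layers_dims_demo[l], 1)

for name, shape in shapes_demo.items():
    print(name, shape)
```

A list with four entries therefore describes a network with three parameterized layers, $L = 3$.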

<a id='4-1'></a>
## 4.1 - Zero Initialization \[2 points\] ##

Zero Initialization initializes all parameters to zero.

<b><span style="color:teal">TODO:</span></b>
<ol>
<li><span style="color:teal">Implement the method <span style="color:#DC143C"><em>initialize_parameters_zeros</em></span> which initializes all network weights as well as the bias with zeros.</span></li>
</ol>

In [None]:
def initialize_parameters_zeros(layers_dims: list):
    """
    Initialize the weights and bias of the network with 0.

    :param layers_dims: List containing the size of each layer.

    :return: parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                b1 -- bias vector of shape (layers_dims[1], 1)
                ...
                WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}

    #############################################################################
    #                            START OF YOUR CODE                             #
    # TODO:                                                                     #
    #    1) Iterate through (nearly) all layers.                                #
    #       - Hint: There is one layer, for which we don't want to initialize   #
    #               weights and bias.                                           #
    #    2) Initialize the weight matrix of each layer with zeros.              #
    #    3) Initialize the bias vector of each layer with zeros.                #
    #############################################################################


    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return parameters

In [None]:
# Test your code

!python ./tests/test_initialization.py --test_case TestZeroInitialization

<a id='4-1-1'></a>
## 4.1.1 - Applying the Zero Initialization ##

As mentioned before, we went ahead and implemented the model for you. Run the cell below to print out the cost of a model over the first 15000 iterations, for which:
- The network weights were initialized with zeros.
- The network bias was initialized with zeros.

In [None]:
# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

iterations = data["zero initialization"]["iterations"]
costs = data["zero initialization"]["costs"]
learning_rate = data["zero initialization"]["learning_rate"]

# Print out the cost over the first 15000 iterations
for iteration, cost in zip(iterations, costs):
    print("Cost after iteration {}: {}".format(iteration, cost))

# Plot the cost over the first 15000 iterations
Visualization.plot_the_cost(iterations, costs, learning_rate)

# Print the accuracy of our model
accuracy_train = data["zero initialization"]["accuracy train set"]
accuracy_test = data["zero initialization"]["accuracy test set"]

print ("Accuracy of the model on the train set: ", accuracy_train)
print ("Accuracy of the model on the test set: ", accuracy_test)

<a id='4-1-2'></a>
## 4.1.2 - Analyzing the Zero Initialization ##

As can be seen, the performance is terrible. The cost doesn't decrease, and the algorithm performs no better than random guessing.
Why is that? Take a look at the details of the predictions and the decision boundary by running the cell below.

In [None]:
# Initialize the hyperparameters for generating the samples
n_samples_train, n_samples_test = 300, 100
noise_train, noise_test = .05, .05
seed_train, seed_test = 1, 2

# Generate the dataset
train_X, train_Y, test_X, test_Y = generate_dataset(n_samples_train, n_samples_test, noise_train, noise_test, seed_train, seed_test)

# Preprocess the dataset
train_X, train_Y, test_X, test_Y = preprocess(train_X, train_Y, test_X, test_Y)

# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

predictions_train = data["zero initialization"]["predictions train"]
predictions_test = data["zero initialization"]["predictions test"]

# Print the detailed predictions
print ("Predictions training: " + str(predictions_train))
print ("Predictions test: " + str(predictions_test))

# Plot the decision boundary of our model
decision_train = data["zero initialization"]["decision train"]
decision_test = data["zero initialization"]["decision test"]
Visualization.plot_decision_boundary(decision_train, train_X, train_Y, decision_test, test_X, test_Y)

<a id='4-1-3'></a>
## 4.1.3 - Discussing the Zero Initialization ##

__Note__: For the sake of simplicity, the calculations below are done using only one example at a time.

Since the weights and biases are zero, multiplying the input by the weights yields the zero vector, so $z = 0$ in every hidden layer. With ReLU as the activation function, this gives:

$$a = \mathrm{ReLU}(z) = \max(0, z) = 0$$

At the classification layer, where the activation function is sigmoid, you then get (for any input):

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{2} = y_{pred}$$

Because every example receives the same prediction of 0.5, the cost function is of no help in adjusting the weights.

Your loss function:
$$ \mathcal{L}(y_{pred}, y) =  - y  \ln(y_{pred}) - (1-y)  \ln(1-y_{pred})$$

For `y=1`, `y_pred=0.5` it becomes:

$$ \mathcal{L}(0.5, 1) =  - (1)  \ln\left(\frac{1}{2}\right) = 0.6931471805599453$$

For `y=0`, `y_pred=0.5` it becomes:

$$ \mathcal{L}(0.5, 0) =  - (1 - 0)  \ln\left(1 - \frac{1}{2}\right) = 0.6931471805599453$$
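The two loss values can be checked numerically. The helper `binary_cross_entropy` below is a hypothetical name introduced only for this check; it is not part of the provided model code.

```python
import numpy as np

def binary_cross_entropy(y_pred, y):
    """Cross-entropy loss for a single example."""
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

loss_pos = binary_cross_entropy(0.5, 1)  # true label 1
loss_neg = binary_cross_entropy(0.5, 0)  # true label 0

print(loss_pos, loss_neg)  # both equal ln(2), roughly 0.6931
```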

As you can see, with the prediction being 0.5 you get the same loss value whether the actual value `y` is 1 or 0. More importantly, all hidden activations and all weights are zero, so during backpropagation the gradient of every weight matrix is zero as well: none of the weights get adjusted, and you are stuck with the same old values.

This is why the model predicts the same class, 0, for every example! No wonder it's doing so badly.

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so you might as well be training a neural network with $n^{[l]}=1$ for every layer. This way, the network is no more powerful than a linear classifier like logistic regression.
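The following sketch reproduces this behavior on toy data. It is not the provided model: the layer sizes in `dims`, the random inputs and the balanced toy labels are assumptions made for illustration, and the ReLU derivative at $z = 0$ is taken to be 0.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 6))          # 6 toy samples, one per column
Y = np.array([[0, 1, 0, 1, 0, 1]])   # balanced toy labels
m = X.shape[1]

dims = [2, 4, 3, 1]                  # hypothetical layer sizes
W = {l: np.zeros((dims[l], dims[l - 1])) for l in (1, 2, 3)}
b = {l: np.zeros((dims[l], 1)) for l in (1, 2, 3)}

# Forward pass: every hidden activation is 0, every output is 0.5.
Z1 = W[1] @ X + b[1]
A1 = relu(Z1)
Z2 = W[2] @ A1 + b[2]
A2 = relu(Z2)
Z3 = W[3] @ A2 + b[3]
A3 = sigmoid(Z3)

# Backward pass for the cross-entropy loss.
dZ3 = A3 - Y
dW3 = dZ3 @ A2.T / m                       # zero because A2 is zero
db3 = dZ3.sum(axis=1, keepdims=True) / m   # zero only because labels are balanced
dZ2 = (W[3].T @ dZ3) * (Z2 > 0)            # zero because W3 is zero
dW2 = dZ2 @ A1.T / m
dZ1 = (W[2].T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T / m

print(A3)
print(np.abs(dW1).max(), np.abs(dW2).max(), np.abs(dW3).max())
```

All three weight gradients are exactly zero, so a gradient descent step leaves the weights where they started.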

<a id='4-2'></a>
## 4.2 - Random Initialization \[2 points\] ##

To break symmetry, initialize the weights randomly. After random initialization, each neuron can proceed to learn a different function of its inputs. In this exercise, you'll see what happens when the weights are initialized randomly, but to very large values (scaled by a factor of 10).

<b><span style="color:teal">TODO:</span></b>
<ol>
<li><span style="color:teal">Implement the method <span style="color:#DC143C"><em>initialize_parameters_random</em></span> which initializes all network weights randomly and the bias with zeros.</span></li>
</ol>

In [None]:
def initialize_parameters_random(layers_dims: list, seed: int):
    """
    Initialize the weights and bias of the network randomly.

    :param layers_dims: List containing the size of each layer.
    :param seed: Seed for the random operation(s).

    :return:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}

    #############################################################################
    #                            START OF YOUR CODE                             #
    # TODO:                                                                     #
    #    1) Use the provided seed to make sure your "random" weights match ours.#
    #    2) Iterate through (nearly) all layers.                                #
    #       - Hint: There is one layer, for which we don't want to initialize   #
    #               weights and bias.                                           #
    #    3) Initialize the weight matrix of each layer randomly.                #
    #    4) Multiply each weight with the factor 10 to create large weights.    #
    #    5) Initialize the bias vector of each layer with zeros.                #
    #############################################################################


    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return parameters

In [None]:
# Test your code

!python ./tests/test_initialization.py --test_case TestRandomInitialization

<a id='4-2-1'></a>
## 4.2.1 - Applying the Random Initialization ##

Run the cell below to print out the training costs of a model for the first 15000 iterations, for which:
- The network weights were initialized with large random values (*10).
- The network bias was initialized with zeros.

In [None]:
# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

iterations = data["random initialization"]["iterations"]
costs = data["random initialization"]["costs"]
learning_rate = data["random initialization"]["learning_rate"]

# Print out the cost over the first 15000 iterations
for iteration, cost in zip(iterations, costs):
    print("Cost after iteration {}: {}".format(iteration, cost))

# Plot the cost over the first 15000 iterations
Visualization.plot_the_cost(iterations, costs, learning_rate)

# Print the accuracy of our model
accuracy_train = data["random initialization"]["accuracy train set"]
accuracy_test = data["random initialization"]["accuracy test set"]

print ("Accuracy of the model on the train set: ", accuracy_train)
print ("Accuracy of the model on the test set: ", accuracy_test)

<a id='4-2-2'></a>
## 4.2.2 - Analyzing the Random Initialization ##

If you see "inf" as the cost after iteration 0, this is because of numerical round-off. A numerically more sophisticated implementation would fix this, but for the purposes of this notebook, it isn't really worth worrying about.

In any case, you've now broken the symmetry, and this gives noticeably better accuracy than before. The model is no longer outputting all 0s. Progress!
Let's take a look at the details of the predictions and the decision boundary by running the cell below.

In [None]:
# Initialize the hyperparameters for generating the samples
n_samples_train, n_samples_test = 300, 100
noise_train, noise_test = .05, .05
seed_train, seed_test = 1, 2

# Generate the dataset
train_X, train_Y, test_X, test_Y = generate_dataset(n_samples_train, n_samples_test, noise_train, noise_test, seed_train, seed_test)

# Preprocess the dataset
train_X, train_Y, test_X, test_Y = preprocess(train_X, train_Y, test_X, test_Y)

# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

predictions_train = data["random initialization"]["predictions train"]
predictions_test = data["random initialization"]["predictions test"]

# Print the detailed predictions
print ("Predictions training: " + str(predictions_train))
print ("Predictions test: " + str(predictions_test))

# Plot the decision boundary of our model
decision_train = data["random initialization"]["decision train"]
decision_test = data["random initialization"]["decision test"]
Visualization.plot_decision_boundary(decision_train, train_X, train_Y, decision_test, test_X, test_Y)

<a id='4-2-3'></a>
## 4.2.3 - Discussing the Random Initialization ##

**Observations**:
- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
- Initializing weights to very large random values doesn't work well.
- Initializing with small random values should do better. The important question is: how small should these random values be? Let's find out up next!
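The first observation can be reproduced with a few lines of NumPy. This is a sketch, not the provided model's cost function: the helper `loss_for_negative_example` and the pre-activation values 2.0 and 40.0 are chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_for_negative_example(z):
    """Cross-entropy loss for a true label y = 0, given the pre-activation z."""
    a = sigmoid(z)
    with np.errstate(divide="ignore"):  # silence the log(0) warning
        return -np.log(1 - a)

# Large weights tend to produce large pre-activations in the last layer.
a_large = sigmoid(40.0)
loss_moderate = loss_for_negative_example(2.0)
loss_large = loss_for_negative_example(40.0)

print(a_large)        # 1.0: exp(-40) is too small to register next to 1 in float64
print(loss_moderate)  # finite
print(loss_large)     # inf
```

A confidently wrong prediction therefore produces an infinite loss for that example, which is why the first reported cost can be "inf".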

**Optional Read:**
The main difference between a Gaussian random variable (`numpy.random.randn()`) and a uniform random variable (`numpy.random.rand()`) is the distribution of the generated random numbers:
- `numpy.random.rand()` draws numbers from a [uniform distribution](https://raw.githubusercontent.com/jahnog/deeplearning-notes/master/Course2/images/rand.jpg) over $[0, 1)$.
- `numpy.random.randn()` draws numbers from a standard [normal distribution](https://raw.githubusercontent.com/jahnog/deeplearning-notes/master/Course2/images/randn.jpg).

When used for weight initialization, `randn()` keeps most of the weights away from the extremes and concentrates them around the center of the range.
An intuitive way to see why this helps is to look at, for example, the [sigmoid() activation function](https://raw.githubusercontent.com/jahnog/deeplearning-notes/master/Course2/images/sigmoid.jpg).
You’ll remember that its slope is extremely small where its output is near 0 or near 1, so weights near those extremes converge much more slowly to the solution, while having most of them near the center speeds up convergence.
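Both points can be checked empirically. The sketch below uses an arbitrary seed and sample size, and `sigmoid_slope` is a hypothetical helper that evaluates the derivative $\sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

np.random.seed(0)                       # arbitrary seed for reproducibility
uniform = np.random.rand(100_000)       # uniform on [0, 1)
normal = np.random.randn(100_000)       # mean 0, standard deviation 1

# Uniform samples are spread evenly; normal samples cluster around 0.
frac_within_one_std = np.mean(np.abs(normal) < 1)
print("uniform min/max:", uniform.min(), uniform.max())
print("normal, fraction within one std of 0:", frac_within_one_std)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1 - s)

slope_center = sigmoid_slope(0.0)   # steepest point of the sigmoid
slope_tail = sigmoid_slope(10.0)    # output is already very close to 1
print(slope_center, slope_tail)
```

Roughly two thirds of the normal samples fall within one standard deviation of 0, and the sigmoid slope in the tail is several orders of magnitude smaller than at the center.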

<a id='4-3'></a>
## 4.3 - He Initialization \[2 points\] ##

Finally, try **He Initialization**; this is named after the first author of [He et al., 2015](https://arxiv.org/pdf/1502.01852.pdf). If you have heard of "Xavier initialization", this is similar, except that Xavier initialization scales the weights $W^{[l]}$ by $\sqrt{\frac{1}{\text{layers_dims[l-1]}}}$, whereas He initialization uses $\sqrt{\frac{2}{\text{layers_dims[l-1]}}}$.
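To get a feeling for why the factor 2 matters with ReLU activations, the sketch below pushes random inputs through a stack of ReLU layers and measures the size of the final activations for both scaling factors. The width, depth and sample count are hypothetical and far larger than the network in this notebook; `final_activation_scale` is an illustrative helper, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)                # arbitrary seed
n_units, n_layers, n_samples = 256, 20, 1000  # hypothetical sizes

def final_activation_scale(numerator):
    """RMS of the activations after n_layers ReLU layers whose weights are
    drawn from a standard normal and scaled by sqrt(numerator / n_units)."""
    a = rng.standard_normal((n_units, n_samples))
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * np.sqrt(numerator / n_units)
        a = np.maximum(0, W @ a)
    return np.sqrt(np.mean(a ** 2))

scale_xavier = final_activation_scale(1.0)  # factor sqrt(1 / n)
scale_he = final_activation_scale(2.0)      # factor sqrt(2 / n)

print("Xavier-style scaling:", scale_xavier)
print("He scaling:          ", scale_he)
```

ReLU zeroes out about half of the pre-activations, so with the factor 1 the signal shrinks from layer to layer, while the factor 2 keeps its magnitude roughly constant.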

<b><span style="color:teal">TODO:</span></b>
<ol>
<li><span style="color:teal">Implement the method <span style="color:#DC143C"><em>initialize_parameters_he</em></span> which initializes all network weights the same way as He et al. proposed it in 2015 for layers with a ReLU activation.</span></li>
</ol>

In [None]:
def initialize_parameters_he(layers_dims: list, seed: int):
    """
    Initialize the weights and bias of the network like He et al. proposed it in 2015.

    :param layers_dims: List containing the size of each layer.
    :param seed: Seed for the random operation(s).

    :return:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}

    #############################################################################
    #                            START OF YOUR CODE                             #
    # TODO:                                                                     #
    #    1) Use the provided seed to make sure your "random" weights match ours.#
    #    2) Iterate through (nearly) all layers.                                #
    #       - Hint: There is one layer, for which we don't want to initialize   #
    #               weights and bias.                                           #
    #    3) Initialize the weight matrix of each layer as He et al. proposed it #
    #    4) Initialize the bias vector of each layer with zeros.                #
    #############################################################################


    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return parameters

In [None]:
# Test your code

!python ./tests/test_initialization.py --test_case TestHeInitialization

<a id='4-3-1'></a>
## 4.3.1 - Applying the He Initialization ##

Run the cell below to print out the training costs of a model for the first 15000 iterations, for which:
- The network weights were initialized following the approach of He et al. (2015).
- The network bias was initialized with zeros.

In [None]:
# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

iterations = data["he initialization"]["iterations"]
costs = data["he initialization"]["costs"]
learning_rate = data["he initialization"]["learning_rate"]

# Print out the cost over the first 15000 iterations
for iteration, cost in zip(iterations, costs):
    print("Cost after iteration {}: {}".format(iteration, cost))

# Plot the cost over the first 15000 iterations
Visualization.plot_the_cost(iterations, costs, learning_rate)

# Print the accuracy of our model
accuracy_train = data["he initialization"]["accuracy train set"]
accuracy_test = data["he initialization"]["accuracy test set"]

print ("Accuracy of the model on the train set: ", accuracy_train)
print ("Accuracy of the model on the test set: ", accuracy_test)

<a id='4-3-2'></a>
## 4.3.2 - Analyzing the He Initialization ##

The accuracy of the network using He initialization looks great! Let's take a look at the details of the predictions and the decision boundary by running the cell below.

In [None]:
# Initialize the hyperparameters for generating the samples
n_samples_train, n_samples_test = 300, 100
noise_train, noise_test = .05, .05
seed_train, seed_test = 1, 2

# Generate the dataset
train_X, train_Y, test_X, test_Y = generate_dataset(n_samples_train, n_samples_test, noise_train, noise_test, seed_train, seed_test)

# Preprocess the dataset
train_X, train_Y, test_X, test_Y = preprocess(train_X, train_Y, test_X, test_Y)

# Extract the data
with open("../data/initialization.pkl", "rb") as f:
    data = pickle.load(f)

predictions_train = data["he initialization"]["predictions train"]
predictions_test = data["he initialization"]["predictions test"]

# Print the detailed predictions
print ("Predictions training: " + str(predictions_train))
print ("Predictions test: " + str(predictions_test))

# Plot the decision boundary of our model
decision_train = data["he initialization"]["decision train"]
decision_test = data["he initialization"]["decision test"]
Visualization.plot_decision_boundary(decision_train, train_X, train_Y, decision_test, test_X, test_Y)

<a id='4-3-3'></a>
## 4.3.3 - Discussing the He Initialization ##

**Observations**:
- The model with He initialization separates the blue and the red dots very well in a small number of iterations.

<a id='4-4'></a>
## 4.4 - Conclusion ##

You've tried three different types of initializations. For the same number of iterations and same hyperparameters, the comparison is:

<table>
    <tr>
        <td>
            <b>Model</b>
        </td>
        <td>
            <b>Train accuracy</b>
        </td>
        <td>
            <b>Problem/Comment</b>
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with zeros initialization
        </td>
        <td>
        50%
        </td>
        <td>
        fails to break symmetry
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with large random initialization
        </td>
        <td>
        83%
        </td>
        <td>
        too large weights
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with He initialization
        </td>
        <td>
        99%
        </td>
        <td>
        recommended method
        </td>
    </tr>
</table>

<a id='5'></a>
## 5 - What to remember ##

- Different initializations lead to very different results.
- Random initialization is used to break symmetry and make sure different hidden units can learn different things.
- Resist initializing to values that are too large!
- He initialization works well for networks with ReLU activations.
- It's okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.

<a id='6'></a>
## 6 - End of Exercise ##

<div>
<img src="../img/memes/meme_congrats_02.png" width="700"/>
</div>

Created with and licensed under [Adobe Express](https://www.adobe.com/de/express/)