In [1]:
%load_ext notexbook

In [2]:
%texify -fs 18

# Artificial Neural Networks _from the ground up_

Deep learning allows computational models that are composed of
multiple processing **layers** to learn representations of data
with multiple levels of abstraction.

These methods have dramatically improved the state-of-the-art in speech recognition,
visual object recognition, object detection and many other domains
such as drug discovery and genomics.

# Artificial Neurons

In machine learning and cognitive science, an artificial neural network (ANN)
is a network inspired by **biological** neural networks, which is used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown.

Warren **McCulloch** and Walter **Pitts** published the first concept of a simplified brain cell, the so-called **McCulloch-Pitts** (`MCP`) neuron, in 1943.



>Biological neurons are interconnected nerve cells in the brain that are involved in the processing and
transmitting of chemical and electrical signals

<img src="./figures/artificial_neuron.png" alt="Artificial Neuron Model" class="maxw50" />

> McCulloch and Pitts described such a nerve cell as a simple **logic gate** with binary outputs. 
>
> Multiple signals arrive at the _dendrites_ and are then integrated into the cell body. If the accumulated signal exceeds a certain **threshold**, the neuron **spikes**, and an **output signal** is generated that will be passed on by the axon.
>
> Only a few years later, _Frank Rosenblatt_ published the first concept of the *Perceptron* learning rule based on the `MCP` neuron model.
>
> With his perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal weight coefficients that would then be multiplied with the input features in order to make the decision of whether a neuron spikes or not. 
>
> In the context of supervised learning and classification, such an algorithm could then be used to predict whether a new data point belongs to one class or the other (i.e. Binary Classification problem)

<span class="fn"><b>Adapted from</b> _Machine Learning with Scikit-learn and PyTorch_, S. Raschka, Packt Publishing, 2022 </span>
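The logic-gate view of the `MCP` neuron can be sketched in a few lines of plain Python. This is only an illustration: the function name `mcp_neuron` and the "greater than or equal to" threshold convention are assumptions made here, not part of the original formulation quoted above.

```python
def mcp_neuron(inputs, threshold):
    """Return 1 (the neuron fires) when enough binary inputs are active, else 0."""
    return 1 if sum(inputs) >= threshold else 0

# With two binary inputs, a threshold of 2 behaves like AND and a threshold of 1 like OR.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_gate = [mcp_neuron(p, threshold=2) for p in pairs]
or_gate = [mcp_neuron(p, threshold=1) for p in pairs]
print(and_gate)  # [0, 0, 0, 1]
print(or_gate)   # [0, 1, 1, 1]
```

Note that nothing is learned here: the threshold is set by hand, which is exactly the limitation the Perceptron learning rule addresses.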

An early version of ANN built from one node was called the **Perceptron** (a.k.a. the _Neuron Model_)

<img src="figures/perceptron.jpg" class="maxw30">

The Perceptron is an algorithm for supervised learning of binary classification, deciding whether an input (represented by a vector of numbers)
belongs to one class or another.

<img src="figures/perceptron_binary.png" class="maxw35" />

<span class="fn"><b>Image Source:</b> _Machine Learning with Scikit-learn and PyTorch_, S. Raschka, Packt Publishing, 2022 </span>



In the perceptron model, the net input $z$ is expressed as a linear combination of weights $w$ and input features $x$, plus a bias $b$: $z = w^Tx + b$

A Perceptron Network can be designed to have *multiple layers*, leading to the **Multi-Layer Perceptron** (aka `MLP`)

<img src="figures/mlp.jpg" class="maxw30" />

## Perceptron Model

The Perceptron Learning rule of Rosenblatt's model focuses on mimicking how a _single neuron_ in the brain works[$^1$](#fn1): it either _spikes_ or it does not. 
Therefore, the Perceptron algorithm (_and its learning rule, ed._) is fairly simple, and the model can be summarized by the following pseudo-code:

1. Initialise the weights **$w$** = $(w_1, \ldots, w_m)$ and the bias unit $b$ to (_nearly_) zero;

(for each sample $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_m)$ in the training set $X_{train}$)

2. Compute $z^{(i)} = w^Tx^{(i)} + b$
    
3. Compute the predicted value 
    <span class="inline-math">
    $$
    \widehat{y}^{(i)} = 
\begin{cases}
    1 & \text{if } z^{(i)} \geq 0 \\
    0 & \text{otherwise}
\end{cases}
$$
</span>
    
4. Update the weights and the bias unit, accordingly.

Therefore, the **output** value is the class label $\widehat{y}^{(i)}$ predicted by the **threshold function** _(Heaviside step function)_, 
and each weight $w_{j}$ and the bias unit are simultaneously updated:

$$
w_{j} = w_{j} + \Delta w_{j} \\
b = b + \Delta b
$$

The update values (i.e. the _deltas_) are scaled by a parameter $\eta$, also referred to as the _learning rate_:

$$
\Delta w_{j} = \eta (y^{(i)} - \widehat{y}^{(i)}) x^{(i)}_{j} \\
\Delta b = \eta (y^{(i)} - \widehat{y}^{(i)})
$$



**Note**: _The superscript $(i)$ refers to the i-th sample.
The subscript $j$ refers to the j-th dimension/feature_
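The four steps of the pseudo-code can be sketched with NumPy as follows. This is a minimal illustration, not the scikit-learn implementation used later: the names `perceptron_fit` and `perceptron_predict`, the toy `AND` dataset, and the choice of $\eta = 1$ are all assumptions made for this example.

```python
import numpy as np

def perceptron_fit(X, y, eta=1.0, n_epochs=10):
    """Learn weights and bias with the per-sample perceptron update rule."""
    w = np.zeros(X.shape[1])   # step 1: weights initialised to zero
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            z = w @ x_i + b                   # step 2: net input
            y_hat = 1 if z >= 0 else 0        # step 3: unit step function
            update = eta * (y_i - y_hat)      # step 4: zero when the prediction is right
            w += update * x_i
            b += update
    return w, b

def perceptron_predict(X, w, b):
    return np.where(X @ w + b >= 0, 1, 0)

# Toy, linearly separable data: the logical AND of two binary features
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])

w, b = perceptron_fit(X_toy, y_toy)
print(perceptron_predict(X_toy, w, b))  # [0 0 0 1]
```

Because the update is zero whenever $\widehat{y}^{(i)} = y^{(i)}$, the weights only move on misclassified samples. On linearly separable data such as this toy set, the updates stop after a few epochs.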

<span id="fn1"><b>[1]:</b> For this reason, the Perceptron model is also referred to sometimes as the _neuron model_, as opposed to a more general _neural network_.</span>

In the perceptron model, the predicted value $\hat{y}$ is directly determined by the unit step function:

<img src="figures/perceptron_scheme.png" class="maxw40" alt="Perceptron Schematic" />

<span class="fn"><b>Image Source:</b> _Machine Learning with Scikit-learn and PyTorch_, S. Raschka, Packt Publishing, 2022 </span>

### The Perceptron in action

In [None]:
# Package imports
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
import matplotlib

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

#### Generating a dataset

Let's start by generating a dataset we can play with. 

Fortunately, [scikit-learn](http://scikit-learn.org/) has some useful dataset generators, so we don't need to write the code ourselves. We will go with the [`make_moons`](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html) function.

In [None]:
# Generate a dataset and plot it
np.random.seed(0)
X, y = make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral)
plt.show()

The dataset we generated has **two** classes (i.e. Binary classification problem), plotted as red and blue dots. 

You can think of the blue dots as male patients and the red dots as female patients, with the $x_1$ and $x_2$ axes being medical measurements. 

Our goal is to train a **Perceptron** classifier that predicts the correct class (`male` vs `female`) given the $x_1$ and $x_2$ coordinates. 

**Note**: We intentionally generated a dataset that is **not** *linearly separable* (i.e. _we cannot draw a straight line that separates the two classes_). On the other hand, we know that the Perceptron model is indeed a _linear model_!

<img src="figures/linear_sep.png" alt="Linear vs Non-Linear Separability" class="maxw45" />

Therefore there is _no way_ that the `Perceptron` will be able to correctly predict the classes (_unless you hand-engineer non-linear features (such as polynomials) that work well for the given dataset, ed._).

In fact, this is one of the **major advantages** of Neural Networks: there is **no need** to worry about [feature engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/). 
The _hidden layer of a neural network_ will learn features for you (we will come back to this soon!)

<span class="fn"><b>Image Source:</b> _Machine Learning with Scikit-learn and PyTorch_, S. Raschka, Packt Publishing, 2022 </span>

#### Decision Boundary 

Before proceeding, let's write a utility function that will be used to plot the _decision boundary_ of our classifier:

**Note**: This is a helper function. Feel free to ignore the code if you're not entirely familiar with the syntax.

In [None]:
# Helper function to plot a decision boundary.
# If you don't fully understand this function don't worry, it just generates the contour plot below.
def plot_decision_boundary(pred_func, X, y):
    
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral, edgecolors='k', linewidths=0.5)

### Perceptron Classifier

In [None]:
from sklearn.linear_model import Perceptron

pp = Perceptron()
pp.fit(X, y)

In [None]:
# Plot the decision boundary
plot_decision_boundary(lambda x: pp.predict(x), X=X, y=y)
plt.title("Perceptron")
plt.show()

The graph shows the decision boundary learned by our Perceptron classifier. 

It separates the data as well as it can using a straight line, but it's unable to capture the "moon shape" of our data.

#### Towards a Neural Network Model: The Adaline Model

Another type of single-layer neural network is the `ADAptive LInear NEuron` (`Adaline`)[$^2$](#fn2). 

Adaline was published by Bernard Widrow and his doctoral student Ted Hoff only a few years after Rosenblatt’s perceptron algorithm, and it can be considered an **improvement** on the latter.

The `Adaline` algorithm is particularly interesting because it illustrates the key concepts of defining and **minimizing continuous loss functions**. 

This lays the groundwork for understanding how other machine learning algorithms for classification work (e.g. logistic regression, support vector machines, and **multi-layer neural networks**).

<span id="fn2"><b>[2]</b> _An Adaptive “Adaline” Neuron Using Chemical “Memistors”_, Technical Report Number 1553-2 by B. Widrow and colleagues, Stanford Electron Labs, Stanford, CA, October 1960</span>

The key difference between the `Adaline` rule (also known as the _Widrow-Hoff rule_) and the `Perceptron` rule is that the weights are updated based on a (_linear_, ed.) **activation function** rather than a unit step function like in the perceptron. 

In Adaline, this linear activation function $\sigma(z)$ is simply the **identity function** of the net input, so that $\sigma(z) = z$.
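To make the contrast with the Perceptron concrete, here is a minimal sketch of Adaline trained with full-batch gradient descent on the mean squared error. The function name `adaline_fit`, the toy dataset, and the hyper-parameters are assumptions chosen for illustration. The essential point is that the error is computed on the continuous output $\sigma(z) = z$, not on the thresholded label.

```python
import numpy as np

def adaline_fit(X, y, eta=0.1, n_epochs=50):
    """Full-batch gradient descent on the mean squared error of a linear unit."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(n_epochs):
        output = X @ w + b                 # identity activation: sigma(z) = z
        errors = y - output                # continuous errors, not 0/1 mistakes
        losses.append(np.mean(errors ** 2))
        # Gradient of the mean squared error with respect to w and b
        w += eta * 2.0 * (X.T @ errors) / X.shape[0]
        b += eta * 2.0 * errors.mean()
    return w, b, losses

X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_toy = np.array([0., 0., 0., 1.])

w, b, losses = adaline_fit(X_toy, y_toy)
print(losses[0], losses[-1])  # the loss shrinks as training proceeds
```

The list `losses` is what a continuous loss function buys us: a quantity we can monitor and minimise, which is the idea the rest of this notebook builds on.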

<img src="figures/perc_adaline.png" alt="Perceptron and Adaline Learning models" class="maxw50" />

<span class="fn"><b>Image Source:</b> _Machine Learning with Scikit-learn and PyTorch_, S. Raschka, Packt Publishing, 2022 </span>

## Multi-Layer Perceptron

We will be starting our journey into **Neural Nets** by immediately describing what our _final goal_ is:

**Understanding how the learning process of a Neural Network actually works**

<img src="./figures/learning_process.png" alt="The learning process" class="maxw55" />

<span class="fn"><b>[1]:</b> Image from [Deep Learning with PyTorch, Luca Antiga et al.](https://www.manning.com/books/deep-learning-with-pytorch)</span>

I am sure you can appreciate how Neural Networks are indeed simple machines (in the sense that they are based on a _simple learning process_).
However, as it is always customary to say:
> (Neural Networks)..are indeed that simple. The complexity comes with the details.

### TL;DR: Derivatives, Gradients, and Chain Rule for the Backward Propagation

The learning process of a `NN` is built on a very powerful algorithm (and technique) called **Backward Propagation**.

More specifically, Backward Propagation for a `NN` is a _side-effect_ of the optimisation problem we are trying to solve. 

In optimisation and operations research, it is well known that **gradients** can be used to search for a _minimum_ (or a _maximum_) of a differentiable function $f(x), x \in \mathbb{R}^n$. 

In particular, if we follow the opposite of the direction of the gradient, we will be looking for a minimum of the target function $f(x)$.

<img src="figures/grad_minimum.png" class="maxw40" />

In `NN` terms, $f(x)$ is the **error** (or **loss**) function that we want to minimise. In this sense, **Backward Prop** is a side-effect of the optimisation: it is the method we use to propagate the gradient of the loss back through the connected layers. 
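As a one-dimensional illustration of "following the opposite of the gradient", consider the toy function $f(x) = (x - 3)^2$, whose gradient is $2(x - 3)$. The helper name `gradient_descent`, the step size, and the number of steps below are assumptions picked for this sketch.

```python
def gradient_descent(grad, x0, eta=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * grad(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 4))  # 3.0
```

Training a neural network follows the same loop, with $x$ replaced by all the weights and biases and `grad` supplied by backpropagation.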

Therefore, the question now is: how do we calculate those gradients?

In addition, the calculation must meet two requirements: 
- **computationally exact**;  
- **computationally efficient**.

### MLP: 3-Layer NN

Let's now build a 3-layer neural network with one **input layer**, one **hidden layer**, and one **output layer**. 

A few notes:

1. The number of nodes in the input layer is determined by the dimensionality of our data: `2`; 
2. Similarly, the number of nodes in the output layer is determined by the number of classes we have, also `2`. 
    - Because we only have 2 classes, we could actually just have a single output node predicting `0` or `1`, but having `2` makes it easier to extend the network to more classes later on. 
3. The input to the network will be the $x_1$ and $x_2$ coordinates (i.e. _features_), and its output will be **two probabilities**, one for class `0` (e.g. "female") and one for class `1` (e.g. "male"). 

The schematic of the overall model looks something like this:

<img src='figures/nn-3-layer-network.png' class="maxw50" />

We can **choose** arbitrarily the dimensionality (i.e. the number of nodes) of the **hidden layer**. 

The more nodes (i.e. _model parameters_) we add to the hidden layer, the **more complex** functions we will be able to fit. 

But higher dimensionality comes at a cost! First, more computation is required to make predictions, and learn the network parameters. 

A bigger number of parameters also means we become more prone to **overfitting** our data. 

**Note**: How to choose the size of the hidden layer? 

While there are some general guidelines and recommendations, it always depends on your specific problem and is more of an art than a science. We will play with the number of nodes in the hidden layer later on and see how it affects our output.

#### Activation Function

The next step is to pick an *activation function* for our hidden layer. 

The activation function transforms the inputs of the layer into its outputs. 

A **non-linear activation function** is what allows us to fit non-linear data. 

Common choices for activation functions are [`tanh`](https://reference.wolfram.com/language/ref/Tanh.html), the [`sigmoid`](https://en.wikipedia.org/wiki/Sigmoid_function) (a.k.a. logistic function), or [`ReLU`](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)).

<img src="figures/sigmoid.png" class="maxw40" />

In this example, we will be using `tanh`, which performs quite well in many scenarios. 

A nice **property** of all the aforementioned functions is that their derivative can be computed easily, and it's _numerically stable_.

For example, the derivative of $\tanh x$ is $1-\tanh^2 x$. This is useful because it allows us to compute $\tanh x$ once and re-use its value later on to get the derivative.

Similarly, the derivative of $\mathrm{ReLU}(x) = \max(0, x)$ is equal to `1` if $x > 0$, `0` otherwise. 
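Both derivatives are easy to check numerically. The sketch below (helper names `tanh_and_grad` and `relu_and_grad` are made up for this example) re-uses the already computed $\tanh x$ to get its derivative, and compares the result against a central finite difference.

```python
import numpy as np

def tanh_and_grad(x):
    a = np.tanh(x)
    return a, 1.0 - a ** 2          # derivative re-uses the activation value

def relu_and_grad(x):
    return np.maximum(0.0, x), (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
a, da = tanh_and_grad(x)

# Central finite difference as an independent check of the tanh derivative
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(da, numeric, atol=1e-6))  # True
print(relu_and_grad(x)[1])                  # [0. 0. 1. 1.]
```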



##### Decision Function (i.e. activation function of last layer)

Because we want our network to output **probabilities**, the activation function for the output layer will be the [softmax](https://en.wikipedia.org/wiki/Softmax_function), which is simply a way to convert raw scores to probabilities. 

Another way to think of softmax is as a _generalisation_ of the logistic (sigmoid) function to multiple classes.
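A small sketch makes both statements tangible. The implementation below subtracts the row-wise maximum before exponentiating. That shift is an addition for numerical safety and is not used in the implementation further down in this notebook. It does not change the result, because softmax is invariant to adding a constant to all scores.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; the max-shift avoids overflow for large scores."""
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0],
                   [1000.0, 1001.0]])   # same score gap, much larger magnitude
probs = softmax(scores)
print(probs.sum(axis=1))                # each row sums to 1

# With two classes, softmax reduces to the sigmoid of the score difference
sigmoid_of_gap = 1.0 / (1.0 + np.exp(-(2.0 - 1.0)))
print(np.isclose(probs[0, 1], sigmoid_of_gap))  # True
```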

### The Learning Algorithm

#### Forward Propagation

Our network makes predictions using *forward propagation*, which is just a bunch of matrix multiplications and the application of the activation function(s) we defined above. 

<img src="figures/fwd_step_net.png" class="maxw35" />

If $x$ is the 2-dimensional input to our network then we calculate our prediction $\widehat{y}$ (also two-dimensional) as follows:

$$
\begin{aligned}
z_1 & = xW_1 + b_1 \\
a_1 & = \tanh(z_1) \\
z_2 & = a_1W_2 + b_2 \\
a_2 & = \widehat{y} = \mathrm{softmax}(z_2)
\end{aligned}
$$

$z_i$ is the weighted sum of inputs of layer $i$ (bias included) and $a_i$ is the output of layer $i$ after applying the activation function. 

$W_1, b_1, W_2, b_2$ are  parameters of our network, which we need to learn from our training data. 

You can think of them as matrices transforming data between layers of the network. 

Looking at the matrix multiplications above we can figure out the dimensionality of these matrices. 

If we use `500` nodes for our hidden layer then 

$W_1 \in \mathbb{R}^{2\times500}$, 

$b_1 \in \mathbb{R}^{500}$, 

$W_2 \in \mathbb{R}^{500\times2}$, 

$b_2 \in \mathbb{R}^{2}$. 

$\Rightarrow$ Now you see why we have more parameters if we increase the size of the hidden layer.
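The shapes and the resulting parameter count can be verified directly. In this sketch the weights are random placeholders and the batch of `5` samples is an arbitrary choice. Only the dimensions matter.

```python
import numpy as np

n_input, n_hidden, n_output = 2, 500, 2
rng = np.random.default_rng(0)

W1 = rng.standard_normal((n_input, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_output))
b2 = np.zeros(n_output)

x = rng.standard_normal((5, n_input))   # a batch of 5 two-dimensional samples
a1 = np.tanh(x @ W1 + b1)               # hidden activations
z2 = a1 @ W2 + b2                       # raw scores fed to the softmax

n_params = W1.size + b1.size + W2.size + b2.size
print(a1.shape, z2.shape, n_params)     # (5, 500) (5, 2) 2502
```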

#### Learning the Parameters: Backward Propagation

Learning the parameters for our network means finding the values of ($W_1, b_1, W_2, b_2$) that **minimise the error on our training data**. 

**But how do we define the error ?** 

We call the function that measures our error the **loss function**. 

A common choice with the softmax output is the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression). 

If we have $N$ training samples and $C$ classes, then the loss for our prediction $\widehat{y}$ with respect to the true labels $y$ is given by:

$$
\begin{aligned}
L(y,\widehat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\widehat{y}_{n,i}
\end{aligned}
$$



The formula looks complicated, but all it really does is **sum over our training samples**, adding to the `loss` the negative log of the probability we assigned to the correct class. 

So, the further away $y$ (the correct labels) and $\widehat{y}$ (our predictions) are, the greater our loss will be. 
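A tiny worked example shows this behaviour. The two prediction matrices below are invented for illustration, and labels are one-hot encoded so that the code mirrors the formula term by term.

```python
import numpy as np

def cross_entropy(y_onehot, y_hat):
    """Average cross-entropy over N samples, as in the formula above."""
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y_true = np.array([[1, 0],
                   [0, 1]])               # sample 0 is class 0, sample 1 is class 1

mostly_right = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
mostly_wrong = np.array([[0.1, 0.9],
                         [0.8, 0.2]])

loss_right = cross_entropy(y_true, mostly_right)
loss_wrong = cross_entropy(y_true, mostly_wrong)
print(loss_right < loss_wrong)  # True
```

Only the probability assigned to the true class enters the sum, since every other term is multiplied by a zero in the one-hot label.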

Remember that our goal is to find the parameters that minimize our loss function. 

We can use [gradient descent](http://cs231n.github.io/optimization-1/) to find its minimum. 

<img src="figures/bkwd_step_net.png" class="maxw35" />

**Note**: In the following cells, we will implement the **most vanilla version** of gradient descent, also called _batch gradient descent_ with a fixed learning rate. 

Variations such as `SGD` (Stochastic Gradient Descent) or `minibatch gradient descent` typically perform better in practice. 

So if you are serious, you'll want to use one of these, and ideally you would also [decay the learning rate over time](http://cs231n.github.io/neural-networks-3/#anneal).
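For reference, the two ingredients mentioned in this note can be sketched as follows. Both helpers are hypothetical and are not used by the implementation below: `minibatches` yields shuffled index batches, and `decayed_eta` applies an inverse-time decay, which is just one of several common schedules.

```python
import numpy as np

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    idx = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield idx[start:start + batch_size]

def decayed_eta(eta0, epoch, decay=0.001):
    """Inverse-time decay: the learning rate shrinks as epochs go by."""
    return eta0 / (1.0 + decay * epoch)

rng = np.random.default_rng(0)
batches = list(minibatches(200, 32, rng))
print(len(batches), len(batches[-1]))   # 7 8
print(decayed_eta(0.01, epoch=1000))    # 0.005
```

In a minibatch version of the training loop, the forward and backward passes would run on `X[batch]` and `y[batch]` for each index batch instead of on the full dataset.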

As an input, gradient descent needs the gradients (vector of derivatives) of the loss function with respect to our parameters: $\frac{\partial{L}}{\partial{W_1}}$, $\frac{\partial{L}}{\partial{b_1}}$, $\frac{\partial{L}}{\partial{W_2}}$, $\frac{\partial{L}}{\partial{b_2}}$. 

To calculate these gradients we use the famous (and already mentioned) *backpropagation algorithm*, which is a way to efficiently calculate the gradients starting from the output. 

I won't go into detail on how backpropagation works, but there are many excellent explanations ([here](http://colah.github.io/posts/2015-08-Backprop/) or [here](http://cs231n.github.io/optimization-2/)) on the internet.



Applying the backpropagation formula we find the following (believe me on this 🙃):

$$
\begin{aligned}
& \delta_3 = \widehat{y} - y \\
& \delta_2 = (1 - \tanh^2z_1) \circ \delta_3W_2^T \\
& \frac{\partial{L}}{\partial{W_2}} = a_1^T \delta_3  \\
& \frac{\partial{L}}{\partial{b_2}} = \delta_3\\
& \frac{\partial{L}}{\partial{W_1}} = x^T \delta_2\\
& \frac{\partial{L}}{\partial{b_1}} = \delta_2 \\
\end{aligned}
$$
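If you would rather not take these formulas on trust, a numerical gradient check is a quick way to validate them. The sketch below uses a tiny random network (sizes and seed are arbitrary assumptions) and compares one entry of $\frac{\partial{L}}{\partial{W_1}}$ with a central finite difference. Note that the deltas above correspond to the loss _summed_ over the samples, i.e. without the $\frac{1}{N}$ factor, so the check uses the summed loss too.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    a1 = np.tanh(x @ W1 + b1)
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z2)
    return a1, e / e.sum(axis=1, keepdims=True)

def summed_loss(x, y_onehot, W1, b1, W2, b2):
    _, y_hat = forward(x, W1, b1, W2, b2)
    return -np.sum(y_onehot * np.log(y_hat))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2))
y_onehot = np.eye(2)[[0, 1, 1, 0]]
W1, b1 = rng.standard_normal((2, 3)), np.zeros((1, 3))
W2, b2 = rng.standard_normal((3, 2)), np.zeros((1, 2))

# Analytic gradient from the backpropagation formulas
a1, y_hat = forward(x, W1, b1, W2, b2)
delta3 = y_hat - y_onehot
delta2 = (1 - a1 ** 2) * (delta3 @ W2.T)
dW1 = x.T @ delta2

# Numerical gradient for a single entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (summed_loss(x, y_onehot, W1_plus, b1, W2, b2)
           - summed_loss(x, y_onehot, W1_minus, b1, W2, b2)) / (2 * eps)

print(np.isclose(dW1[0, 0], numeric, atol=1e-5))  # True
```

The finite-difference estimate needs two forward passes per parameter, which is why it is only used for checking: backpropagation delivers all gradients at roughly the cost of one extra pass.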

### Implementation

Now we are ready for our implementation. We start by defining some useful variables and parameters for gradient descent:

In [None]:
# Gradient descent parameters (I picked these by hand)
eta = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength
NUM_EPOCHS = 20000

First let's implement the loss function we defined above. We use this to evaluate how well our model is doing:

In [None]:
def calculate_loss(model, y, X_data=None):
    """
    Helper function to evaluate the average loss on a dataset.

    If X_data is not given, the notebook-level X is used, so y must hold
    the labels aligned with that array. Pass X_data explicitly to evaluate
    on a different split (e.g. a test set).
    """
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    X_in = X if X_data is None else X_data
    N = len(y)
    
    # Forward propagation to calculate our predictions
    z1 = X_in.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    
    # Cross-entropy: negative log-probability assigned to the correct class
    correct_logprobs = -np.log(probs[range(N), y])
    data_loss = np.sum(correct_logprobs)
    
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1/N * data_loss

We also implement a helper function to calculate the output of the network. It does forward propagation as defined above and returns the class with the highest probability.

In [None]:

def predict(model, x):
    """
    Helper function to predict an output in two classes (0 or 1)
    """
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

Finally, here comes the function to train our Neural Network. It implements batch gradient descent using the backpropagation derivatives we found above.

In [None]:
def mlp_fit(nn_hdim, X, y, num_epochs=NUM_EPOCHS, print_loss=False):
    """
    This function learns parameters for the neural network and returns the model.
    
    Parameters
    ----------
    nn_hdim: int, Number of nodes in the hidden layer
    
    num_epochs: int, Number of passes through the training data 
    for gradient descent (default: 20,000)

    print_loss: bool, If True, print the loss every 1000 iterations (default: False)
    
    Returns
    -------
    model: dictionary containing a key for each model parameter (namely W1, b1, W2, b2)
    """
    
    num_examples = X.shape[0] # training set size
    nn_input_dim = X.shape[1] # input layer dimensionality
    nn_output_dim = len(np.unique(y)) # output layer dimensionality (number of classes)
    
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}
    
    # Gradient descent. For each batch...
    for i in range(0, num_epochs):

        # Forward pass
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backward pass
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -eta * dW1
        b1 += -eta * db1
        W2 += -eta * dW2
        b2 += -eta * db2
        
        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model, y)))
    
    return model

### A network with a hidden layer of size 3

Let's see what happens if we train a network with a hidden layer size of 3.


In [None]:
# Build a model with a 3-dimensional hidden layer
model = mlp_fit(3, X=X, y=y, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x), X, y)
plt.title("Decision Boundary for hidden layer size 3")
plt.show()

Yay! This looks pretty good. Our neural network was able to find a decision boundary that successfully separates the classes.

#### Without re-inventing the wheel

In [None]:
from sklearn.neural_network import MLPClassifier as MLP

In [None]:
MLP?
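As a rough counterpart of the hand-written network, the scikit-learn estimator can be configured as sketched below. The hyper-parameters (`lbfgs` solver, `alpha=0.01`, `max_iter=2000`) are assumptions chosen to loosely mirror the settings above, not a tuned configuration. Also note that for a binary problem `MLPClassifier` uses a single logistic output unit rather than a two-unit softmax, which is visible in the shape of the second weight matrix.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_moons(200, noise=0.20, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                    solver="lbfgs", alpha=0.01, max_iter=2000, random_state=0)
clf.fit(X_demo, y_demo)

print(clf.score(X_demo, y_demo))          # mean accuracy on the training data
print([w.shape for w in clf.coefs_])      # [(2, 3), (3, 1)]
```

Since the estimator exposes `predict`, it can be passed to the plotting helper in the same way as the Perceptron, e.g. `plot_decision_boundary(lambda x: clf.predict(x), X=X, y=y)`.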

## Exercise: Varying the hidden layer size

In the example above we picked a hidden layer size of `3`. 

Let's now get a sense of how varying the hidden layer size affects the result.


In [None]:
hidden_layer_dimensions = [1, 3, 5, 10, 30]

In [None]:
plt.figure(figsize=(10, 10))

n_rows = (len(hidden_layer_dimensions) // 2) + (len(hidden_layer_dimensions) % 2)
n_cols = 2

for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(n_rows, n_cols, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = mlp_fit(nn_hdim, X=X, y=y)
    plot_decision_boundary(lambda x: predict(model, x), X=X, y=y)
plt.show()

We can see that while a hidden layer of low dimensionality nicely captures the general trend of our data, higher dimensionalities are prone to overfitting. 

They are "memorising" the data as opposed to fitting the general shape. 

If we were to evaluate our model on a separate **test set** (and you should!) the model with a smaller hidden layer size would likely perform better because it generalizes better. 

We could counteract overfitting with stronger regularization, but picking the correct size for the hidden layer is a much more "economical" solution.

# Exercises

Here are some things you can try to become more familiar with the code:

1. Try to repeat the learning process by generating two different sets, one for training and one for test.

2. We used a $\tanh$ activation function for our hidden layer. Experiment with other activation functions (some are mentioned above). Note that changing the activation function also means changing the backpropagation derivative.

**Optional** (only if you feel confident with Python programming)

3. Extend the network from two to three classes. You will need to generate a new dataset for this. Have a look at the documentation of [`sklearn.datasets.make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification).

4. Extend the network to four layers. Experiment with the layer size. Adding another hidden layer means you will need to adjust both the forward propagation as well as the backpropagation code.


## Exercise 1

### Data Partitioning

In [None]:
# Shuffling & train/test split
shuffle_idx = np.arange(len(y))
shuffle_rng = np.random.RandomState(12345)
shuffle_rng.shuffle(shuffle_idx)

# Slice the shuffled indices into 70/30 partitions
# (X and y themselves are left untouched, so earlier cells still work)
train_size = X.shape[0] * 70 // 100
train_idx, test_idx = shuffle_idx[:train_size], shuffle_idx[train_size:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Optional: Data standardisation (mean zero, unit variance)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

In [None]:
# Build a model with a 3-dimensional hidden layer
model = mlp_fit(3, X=X_train, y=y_train, print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(model, x), X_test, y_test)
plt.title("Decision Boundary for hidden layer size 3")
plt.show()
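A plot is useful, but a number is easier to compare across models. A minimal accuracy helper (the name `accuracy` is introduced here and is not part of the notebook's code) could look like this:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

In this notebook it would be called as `accuracy(y_test, predict(model, X_test))`, and comparing that value with the one obtained on `X_train` gives a first indication of overfitting.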

## Exercise 2

In [None]:
# your code here

## Exercise 3 (Optional)

In [None]:
# your code here

## Exercise 4 (Optional)

In [None]:
# your code here

# Addendum

- Another terrific reference to start is the online book http://neuralnetworksanddeeplearning.com/. Highly recommended!
- Introduction to PyTorch ([notebook](./extra_pytorch_nn.ipynb))