<h1 style="font-size:40px;"><center>Exercise I:<br> Training of simple MLP models
</center></h1>

# Introduction
## Short summary
In this exercise you will: 

* train MLPs for simple classification and regression problems.
* learn how hyper-parameters such as learning rate, batch size and number of epochs affect training.

You should write the report of the exercise within this notebook. The details of how to do that can be found below in section "Writing the report".

**Deadline for submitting the report: See Canvas assignment.**

## The data
We will use two different synthetic datasets in this exercise.

### syn2
The *syn2* dataset represents a binary classification problem. The input data is 2D, which allows for an easy visual inspection of the different classes and the decision boundary implemented by the network. The dataset is generated using random numbers each time you run the cell. This means that each time you generate the data it will be slightly different. You can control this by giving the random number generator a fixed *seed*. The cell "PlotData" will plot the *syn2* dataset.
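
The effect of a fixed seed can be checked with a small sketch. It uses a stand-in generator (`make_points`, a hypothetical helper, not the notebook's `syn2`) so that it runs on its own; the assumption is that the same pattern carries over to any generator built on `np.random`.

```python
import numpy as np

def make_points(n):
    "Hypothetical stand-in for a random data generator such as syn2."
    return 0.8 + np.random.normal(size=(n, 2))

np.random.seed(42)          # fixed seed
a = make_points(5)
np.random.seed(42)          # same seed again
b = make_points(5)
c = make_points(5)          # no re-seeding: continues the random stream

same_seed_equal = np.array_equal(a, b)     # identical draws
no_reseed_equal = np.array_equal(b, c)     # different draws
print(same_seed_equal, no_reseed_equal)
```

Re-seeding right before generating the data is what makes two runs produce the same dataset; generating twice after a single seeding does not.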

### regr1
The *regr1* dataset represents a regression problem. It has one input and one target variable. It is a cosine function, with the possibility to add some noise and damping on the output. Again, see the cell "PlotData" to look at the dataset.

## The exercises
There are 8 questions in this exercise. These 8 questions can be found in three different cells below (see section "The Different Cells"). The first 6 questions will use the *regr1* dataset and questions 7-8 will use *syn2*.

## The different 'Cells'
This notebook contains several cells with python code, together with the markdown cells (like this one) with only text. Each of the cells with python code has a "header" markdown cell with information about the code. The table below provides a short overview of the code cells. 

| #  |  CellName | CellType | Comment |
| :--- | :-------- | :-------- | :------- |
| 2  | Init | Needed | Sets up the environment|
| 3  | Data | Needed | Defines the functions to generate the artificial datasets |
| 4  | PlotData | Information | Plots the *syn2* and *regr1* datasets |
| 5  | MLP | Needed | Defines the MLP model |
| 6  | Training | Needed | Functions for training and testing the MLP model |
| 7  | Boundary | Needed | Functions for showing classification boundaries and errors | 
| 8  | Ex1 | Exercise | For question 1-4 |
| 9  | Ex2 | Exercise | For question 5-6 |
| 10 | Ex3 | Exercise | For question 7-8 |

To start with the exercise you need to run all cells with the celltype "Needed". The very first time we suggest that you enter each of the needed cells, read the cell instruction and run the cell. It is important that you do this in the correct order, starting from the top and working your way down the cells. Later, when you have started to work with the notebook, it may be easier to use the command "Run All" or "Run all above" found in the "Cell" dropdown menu.

## Writing the report
The report should be written within this notebook. We have prepared the last cell in this notebook for you where you should write the report. The report should contain 4 parts:

* Name:
* Introduction: A **few** sentences where you give a short introduction to the content and purpose of the exercise.
* Answers to questions: For each of the questions provide an answer. It can be short answers or longer ones depending on the nature of the questions, but try to be efficient in your writing. (Don't include lots of program output or plots that aren't central to answering the question.)
* Conclusion: Summarize your findings in a few sentences.

It is important that you write the report in this last cell and **not** after each question! 

## Last but not least
Have fun!

---

# CellName: Init (#2)
**CellType: Needed**  
**Cell instruction:**

In the cell below, we will import needed libraries. 

Run the cell by entering it and pressing "CTRL Enter".

In [None]:
import torch
device = 'cpu'
# Uncomment this to use CUDA acceleration if available
# device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"PyTorch: Using {device} device")
# The floating point data type can be changed here
dtype_torch = torch.float32

from torch.utils.data import DataLoader, TensorDataset
from torch import nn, Tensor
from collections import OrderedDict
import torchmetrics

import matplotlib.pyplot as plt
import numpy as np

# CellName: Data (#3)
**CellType: Needed**  
**Cell instruction:**

This cell defines the two synthetic datasets. The last function is used for standardization of the data. 
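
As an illustration of what standardization does, the following sketch shifts each input column to zero mean and scales it to unit standard deviation. The helper name `standardize` and the toy array are assumptions made for this example; the notebook itself computes the mean and standard deviation with `standard` and applies them in the exercise cells.

```python
import numpy as np

def standardize(x):
    "Shift each column to zero mean and scale it to unit standard deviation."
    mu, std = np.mean(x, axis=0), np.std(x, axis=0)
    return (x - mu) / std, mu, std

# Toy input with one column, for illustration only
x = np.array([[0.0], [2.0], [4.0], [6.0]])
z, mu, std = standardize(x)
print(z.ravel(), mu, std)
```

If new data is later passed to a trained model, it should be transformed with the `mu` and `std` computed from the training data, not with its own statistics.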

Run the cell by entering it and pressing "CTRL Enter".

In [None]:
def syn2(N):
    "Generate data for a classification problem in 2D."
    x = np.empty(shape=(N, 2))
    d = np.empty(shape=(N, 1))
    N1 = N // 2

    # Positive samples
    x[:N1,:] = 0.8 + np.random.normal(size=(N1, 2))
    # Negative samples
    x[N1:,:] = -.8 + np.random.normal(size=(N-N1, 2))

    # Target
    d[:N1] = 1
    d[N1:] = 0

    return x, d

def regr1(N, periods=2, damp=False, v=0):
    "Generate data for 1D regression problem with damped cosine and noise"
    dx = 2*periods*np.pi / (N-1)
    x = np.arange(N) * dx

    if damp:
        d = np.cos(x)*np.exp(-x*0.05)
    else:
        d = np.cos(x)
    noise = lambda n : np.random.normal(size=n)
    std_signal = np.std(d)
    d = d + v * std_signal * noise(N)

    return x[:, None], d[:, None]

def standard(x):
    "Mean and stddev across samples"
    return np.mean(x, axis=0), np.std(x, axis=0)

# CellName: PlotData (#4)
**CellType: Information**  
**Cell instruction:**

Here, we generate 100 cases for *syn2* and *regr1* datasets and plot them. 

Run the cell by entering it and pressing "CTRL Enter".

**Note!** This cell is not needed for the actual exercises, it is just to visualize the two datasets.

In [None]:
# seed = 0 means random, seed > 0 means fixed
seed = 0
np.random.seed(seed) if seed else None

x, d = syn2(100)
plt.figure()
plt.scatter(x[:,0], x[:,1], c=d)

# Regression, two periods, no noise
x, d = regr1(100, 2, False, 0)  # two periods, matching periods=2
plt.figure()
plt.scatter(x, d)

# Regression, three periods, exponential damping, some noise
x, d = regr1(100, 3, True, 0.2)
plt.figure()
plt.scatter(x, d)

# CellName: MLP (#5)
**CellType: Needed**  
**Cell instruction:**

This cell defines the MLP model. Several MLP hyperparameters are needed to define a model.
Here is a list of them:  
(**Note:** They can all be specified when you call
this function in later cells. The ones specified in this cell are the default values.)

* inputs: the input dimension (integer)

* outputs: the output dimension (integer)

* nodes: size of the network, e.g. `[5]` for one hidden layer with 5 nodes and `[5, 3]` for a two-layer network with 5 and 3 hidden nodes, respectively.

* activation: the activation function. Most common are
    * `nn.ReLU`
    * `nn.Tanh`
        
* out_activation: the activation function for the output nodes. Most common are
    * `None` (linear activation)
    * `nn.Sigmoid`
    * `nn.Softmax`
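
To see how the input dimension, the `nodes` list and the output dimension determine the size of the model, the sketch below counts the trainable parameters of a fully connected MLP: a layer from `n_in` to `n_out` nodes has `n_in * n_out` weights and `n_out` biases. The helper `count_parameters` is hypothetical and only meant as a cross-check against the summary printed by the model.

```python
def count_parameters(inputs, nodes, outputs):
    "Weights plus biases of a fully connected MLP with the given layer sizes."
    sizes = [inputs] + list(nodes) + [outputs]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(count_parameters(1, [4], 1))      # 1*4 + 4 + 4*1 + 1 = 13
print(count_parameters(2, [5, 3], 1))   # 2*5 + 5 + 5*3 + 3 + 3*1 + 1 = 37
```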
      
Run the cell by entering it and pressing "CTRL Enter".

In [None]:
class Network(nn.Module):
    "A simple MLP with one or more fully connected layers"

    def __init__(self, *, inputs=1, outputs=1, nodes=[4], activation=nn.Tanh, out_activation=None):
        """
        Args:
            inputs (int, optional): The number of input nodes.
            outputs (int, optional): The number of output nodes.
            nodes (list, optional): A list of layer sizes.
            activation: Activation function (or None for linear). Defaults to nn.Tanh
            out_activation (optional): Activation function for output layer.
        """
        super().__init__()

        seqstack = OrderedDict()
        prevn = inputs
        for i, n in enumerate(nodes):
            seqstack[f"layer{i+1}"] = nn.Linear(prevn, n, dtype=dtype_torch)
            prevn = n
            if activation is not None:
                seqstack[f"act{i+1}"] = activation()
        seqstack["layerN"] = nn.Linear(prevn, outputs, dtype=dtype_torch)
        if out_activation is not None:
            seqstack["actN"] = out_activation()
        self.mlp_stack = nn.Sequential(seqstack)

    def forward(self, x : Tensor):
        "Apply the network stack on some input"
        return self.mlp_stack(x)

    def predict(self, input_data):
        """
        Apply the network on a set of input data.

        Args:
            input_data (np.ndarray): Input data

        Returns:
            pred (np.ndarray): Predicted output.
        """
        self.eval()
        inp = torch.tensor(input_data, dtype=dtype_torch, device=device)
        with torch.no_grad():
            pred = self(inp)
        return pred.cpu().numpy()

    def __str__(self):
        s = super().__str__()
        ps = ["Named parameters:"] + [
            f"{name}: {param.numel()}" for name, param in
             self.mlp_stack.named_parameters() if param.requires_grad]
        totp = sum(p.numel() for p in self.mlp_stack.parameters() if p.requires_grad)
        return s + f"\nTrainable parameters: {totp}\n" + "\n  ".join(ps) + "\n"

# CellName: Training (#6)
**CellType: Needed**  
**Cell Instruction:**

This cell defines functions for training the model for a single epoch (`train_epoch`),
evaluating the performance on the validation data (`test`) and training and validating over
many epochs (`train_loop`). Finally, it defines a function (`plot_training`) for plotting
the training progress.

The `train_loop` function takes a previously defined `Network` model, two PyTorch `DataLoader`s
that provide the data for training and test, and several hyperparameters:

* loss_fn: The error function used during training. There are three common ones
    * `nn.MSELoss` (mean squared error)
    * `nn.BCELoss` (binary cross entropy)
    * `nn.CrossEntropyLoss` (categorical cross entropy)

* optimizer: The error minimization method, which is constructed with information about the model and a learning rate. Common choices are
    * `torch.optim.SGD`
    * `torch.optim.Adam`

* metrics: Additional metrics to compute and print besides the loss. We use the [torchmetrics package](https://lightning.ai/docs/torchmetrics/stable/) and pass the metric(s) as a dict with `{name: metric}`. Examples:
    * `{'accuracy': torchmetrics.Accuracy('binary')}`
    * `{'MSE': torchmetrics.MeanSquaredError()}`

Run the cell by entering it and pressing "CTRL Enter".

In [None]:
def train_epoch(*, model : Network, dataloader : DataLoader,
                loss_fn, optimizer : torch.optim.Optimizer):
    """
    Train a model for a single epoch.

    Args:
        model (Network): The network.
        dataloader (DataLoader): Batch DataLoader with training data.
        loss_fn (Loss): Loss function, e.g. nn.MSELoss.
        optimizer (Optimizer): The optimizer used to update the network.

    Returns:
        train_loss (float): Training error over all batches.
    """
    model.train()
    train_loss = 0
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)   # Move data to GPU if necessary
        optimizer.zero_grad()   # Reset the gradients

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        train_loss += loss.item() * len(X)

        # Backpropagation
        loss.backward()
        optimizer.step()
    return train_loss / len(dataloader.dataset)

def test(*, model : Network, dataloader : DataLoader, loss_fn, metrics=[]):
    """
    Test a model on a set of data.

    Args:
        model (Network): The network.
        dataloader (DataLoader): DataLoader with data to test.
        loss_fn (Loss): Loss function, e.g. nn.MSELoss.
        metrics (iterable): Additional metrics from torchmetrics.

    Returns:
        loss (float): Mean error over all batches.
    """
    model.eval()
    loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss += loss_fn(pred, y).item() * len(X)
            for m in metrics:
                m.update(pred, y)
    return loss / len(dataloader.dataset)


def train_loop(*, model : Network, train_dataloader : DataLoader,
               val_dataloader : DataLoader = None, loss_fn,
               optimizer : torch.optim.Optimizer, epochs : int,
               print_every:int = 100, metrics=None, print_final=True):
    """
    Train and optionally test a model.

    Args:
        model (Network): The network.
        train_dataloader (DataLoader): Training data.
        val_dataloader (DataLoader, optional): Validation data.
        loss_fn (Loss): Loss function, e.g. nn.MSELoss.
        optimizer (Optimizer): An optimizer from torch.optim.
        epochs (int): Number of epochs to train for.
        print_every (int, optional): Print loss every so many epochs. Defaults to 100.
        metrics (dict(name: metric), optional): Record/print these additional metrics.
        print_final(bool, optional): Print final metrics. Defaults to True.

    Returns:
        train_losses (list(float)): Training loss during each epoch.
        val_losses (list(float)): Validation loss after each epoch.
        metrics_res (dict(name: list(float))): Values of metrics after each epoch.
    """
    train_losses = []
    val_losses = []
    val_loss = np.nan

    # Move metrics to CPU/GPU and prepare for their output
    metrics = {name: m.to(device) for name, m in (metrics or {}).items()}
    metrics_res = {name: [] for name in metrics.keys()}

    for t in range(epochs):
        train_loss = train_epoch(model=model, dataloader=train_dataloader,
                           loss_fn=loss_fn, optimizer=optimizer)
        train_losses.append(train_loss)
        if val_dataloader is not None:
            for m in metrics.values():
                m.reset()
            val_loss = test(dataloader=val_dataloader, model=model,
                            loss_fn=loss_fn, metrics=metrics.values())
            val_losses.append(val_loss)
            for name, m in metrics.items():
                metrics_res[name].append(m.compute().cpu())
        if (print_every > 0 and t % print_every == 0) or (
                print_every >= 0 and t + 1 == epochs):
            extras = [f" {n} {v[-1]:<7f}" if torch.isreal(v[-1])
                      else f" {n} {v[-1]}"
                      for n, v in metrics_res.items()]
            print(f"Epoch {t+1:<7d} train {train_loss:<7f} "
                  f" validation {val_loss:<7f}", "".join(extras))
    if print_final and val_losses:  # skip the summary when no validation data was given
        print("\n** Validation metrics after training **\n"
              f"Loss {val_losses[-1]:<7g}")
        for n, v in metrics_res.items():
            if torch.isreal(v[-1]):
                print(f"{n} {v[-1]:<7g}")
            else:
                print(f"{n}:")
                print(v[-1])
        print()
    return train_losses, val_losses, metrics_res

def plot_training(train_loss, val_loss, metrics_res={}):
    "Plot the training history"
    plt.figure()
    plt.ylabel('Loss / Metric')
    plt.xlabel('Epoch')
    plt.plot(train_loss, label="Training loss")
    plt.plot(val_loss, label="Validation loss")
    for name, res in metrics_res.items():
        if torch.isreal(res[0]):
            plt.plot(res, label=name)
    plt.legend(loc='best')
    plt.show()

# CellName: Boundary (#7)
**CellType: Needed**  
**Cell Instruction:**

This cell defines a function for presenting the output of binary MLP classifiers, plotting the decision boundary for a problem with 2D input. In brief, this function defines a grid that covers the input data. Each grid point is then used as an input to the trained MLP and to compute an output. If the output is close to 0.5 it is marked as the boundary.

Run the cell by entering it and pressing "CTRL Enter".

In [None]:
def decision_boundary(X : np.ndarray, Y1 : np.ndarray, model):
    """
        Plot classification and decision boundary for binary classification problem

        Args:
            X (np.ndarray): input
            Y1 (np.ndarray): target
            model (Network): the model

        Returns:
            None.
    """
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    # grid stepsize
    h = 0.025

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    Z[Z > .5] = 1
    Z[Z <= .5] = 0

    Y_pr = model.predict(X).flatten()
    Y = Y1.flatten()

    Y_pr[Y_pr > .5] = 1
    Y_pr[Y_pr <= .5] = 0
    Y[(Y != Y_pr) & (Y == 0)] = 2
    Y[(Y != Y_pr) & (Y == 1)] = 3

    plt.figure()
    #plt.contourf(xx, yy, Z, cmap=plt.cm.PRGn, alpha = .9)
    plt.contour(xx, yy, Z, cmap=plt.cm.Paired)

    plt.scatter(X[Y == 1, 0], X[Y == 1, 1], marker='+', c='k')
    plt.scatter(X[Y == 0, 0], X[Y == 0, 1], marker='o', c='k')

    plt.scatter(X[Y == 3, 0], X[Y == 3, 1], marker = '+', c='r')
    plt.scatter(X[Y == 2, 0], X[Y == 2, 1], marker = 'o', c='r')

    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.show()

---
End of "Needed" and "Information" cells. Below are the cells for the actual exercise.

---

# CellName: Ex1 (#8)
**CellType: Exercise**  
**Cell instruction:**

Questions 1-4 look at three essential parameters that control the training process of an MLP: the *learning rate*, *batch size* and *number of epochs* (or epochs for short). By training process we mean here the minimization of the given loss function. The task is to train an MLP with four hidden nodes that can fit the *regr1* dataset. In this version of the dataset, there is no noise. Therefore, we need not specify any seed.

The dataset and network have been selected so that it is possible, but not trivial, to get a good training result.
Successful training here means that the network has reached a loss < 0.01 and visually fits the data accurately. In this exercise we do not care about possible overfitting, only about the minimization of the loss function; we therefore do not have a validation dataset.

## Question 1, variations in pre-defined MLP
For the first question you can simply run the cell below. It will load 50 samples from the *regr1* dataset (no noise added). The network has 4 hidden nodes in a single hidden layer, *tanh* activation function, linear output activation function, *stochastic gradient descent* as minimization method, MSE loss function, and a learning rate of 0.05.
It will train for 4000 epochs using a batch size of 50, meaning that we are effectively using ordinary (full-batch) gradient descent. Run this cell five times. **(a) Do you see the same loss vs epoch behavior each time you run?** If not, **why?** **(b) Do you observe that training fails, i.e. does not reach a low loss, during any of these five runs?**

## Question 2, vary learning rate
You will now study what happens when you train with different learning rates. Test at least 5 different learning rates in the range from 0.001 to 0.5. For each learning rate train the network three times and record the average MSE value over these three runs. **Present your average MSE results and discuss your findings**.

**Note:** You should keep the same settings as for Q1, only vary the learning rate. The learning rate is best investigated with (roughly) proportional changes rather than constant steps: For example, trying 0.5, 0.2, 0.1, 0.05, 0.02 etc. typically gives more interesting results than 0.5, 0.4, 0.3, 0.2, 0.1.
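
One way to get proportionally spaced learning rates is to keep a constant ratio between neighbouring values. The sketch below does this with a hypothetical helper `proportional_grid`; the endpoints and the number of values are just the range mentioned above, not a recommendation.

```python
def proportional_grid(low, high, n):
    "Return n values from low to high with a constant ratio between neighbours."
    ratio = (high / low) ** (1 / (n - 1))
    return [low * ratio ** i for i in range(n)]

rates = proportional_grid(0.001, 0.5, 6)
print([f"{r:.4f}" for r in rates])
```

Rounding the values to convenient numbers such as 0.001, 0.005, 0.02 works just as well; the point is that each step changes the learning rate by roughly the same factor.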

## Question 3, vary (mini)batch size
We have now (hopefully) discovered that the learning rate influences the efficiency of the loss minimization. We will now look at what happens when we use *stochastic gradient descent*, meaning that we will have a "batch size" that is smaller than the size of the training data. (We now adapt to the language of ANN packages such as PyTorch and use the word "batch" where most literature would use "mini-batch".) Use a fixed learning rate of 0.05, but test different batch sizes in the range 1 to 50. Train three different networks for each batch size, but this time record if the training was successful (i.e. MSE < 0.01) and approximately after how many epochs the good solution was found. **Present and discuss your findings**.
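
To read off after how many epochs a run became successful, a small helper can scan the list of losses recorded per epoch. The function `first_epoch_below` and the loss values are made up for this sketch; the assumption is that you pass it a list of per-epoch losses such as the one returned by `train_loop`.

```python
def first_epoch_below(losses, threshold=0.01):
    "Return the 1-based epoch at which the loss first drops below threshold, or None."
    for epoch, loss in enumerate(losses, start=1):
        if loss < threshold:
            return epoch
    return None

# Made-up loss curve, for illustration only
example_losses = [0.5, 0.2, 0.05, 0.012, 0.009, 0.011, 0.004]
print(first_epoch_below(example_losses))
```

With small batch sizes the loss can fluctuate and rise above the threshold again, as in the made-up curve, so it is worth checking the final loss as well.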

**Note:** The batch size should (fairly well) divide the total data size. With a data size of 50, good batch sizes are 50, 25, 13, 10, 5... Sizes that fit poorly, like 40, create a small last batch which could give you the disadvantages of both small and large batch sizes at the same time.
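
The batch layout for a given data size can be worked out with a few lines of arithmetic. The helper `batch_sizes` is hypothetical and assumes that the last, incomplete batch is kept, which is the default behaviour of a PyTorch `DataLoader`.

```python
def batch_sizes(n_samples, batch_size):
    "Sizes of the consecutive batches when the last, smaller batch is kept."
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

for bs in (50, 25, 13, 10, 40):
    print(bs, batch_sizes(50, bs))
```

A batch size of 13 leaves a last batch of 11, which is close to the others, whereas 40 leaves a last batch of only 10.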

## Question 4, select good hyper-parameters
Find a combination of learning rate and batch size that gives a good solution within 1000 epochs. Always remember that two runs with identical hyper-parameters (e.g. learning rate, batch size etc.) will generally give different final results. Your set of parameters should result in a good solution within 1000 epochs *most* of the time. **Present your best combination of learning rate and batch size, and its result**.

In [None]:
%%time

# Generate training data
x_trn, d_trn = regr1(50, 2, 0, 0.0)

# Standardization of inputs
mu, std = standard(x_trn)
x_trn = (x_trn - mu) / std

# Define the network, cost function and training settings
model_ex1 = Network(
    inputs=1,            # number of input nodes
    outputs=1,           # number of output nodes
    nodes=[4],           # number of nodes in hidden layer
    activation=nn.Tanh,  # activation function in hidden layer
    out_activation=None  # activation function in output layer (if not linear)
    ).to(device)         # move the model to the GPU or keep it on the CPU

# Optimization parameters
opt_method = torch.optim.SGD  # minimization method
learning_rate = 0.05          # learning rate
loss_fn = nn.MSELoss()        # loss function, MSE
number_epochs = 4000
minibatch_size = 50

# Additional metrics to print
metrics = {'MSE': torchmetrics.MeanSquaredError()}

# Set up the optimizer
optimizer = opt_method(model_ex1.parameters(), lr=learning_rate)

# Print a summary of the model
print(model_ex1)

# Turn the training data into a dataset with Tensors on the GPU or CPU
dset_trn = TensorDataset(torch.tensor(x_trn, device=device, dtype=dtype_torch),
                         torch.tensor(d_trn, device=device, dtype=dtype_torch))

# Create a batch loader for the training data
dl_trn = DataLoader(dset_trn, batch_size=minibatch_size)

# Train the network and print the progress
train_loss, val_loss, metrics_res = train_loop(
    model=model_ex1,
    train_dataloader=dl_trn,
    val_dataloader=dl_trn, # Test with the training data
    loss_fn=loss_fn,
    metrics=metrics,
    optimizer=optimizer,
    print_every=100,
    epochs=number_epochs)

# Plot the training history
plot_training(train_loss, val_loss, metrics_res)

# Predict output on the training data
d_pred = model_ex1.predict(x_trn)

# Plot the result
plt.figure()
plt.ylabel('Prediction / Target')
plt.xlabel('Input')
plt.scatter(x_trn, d_trn, label='Target')
plt.scatter(x_trn, d_pred, label='Prediction')
plt.title('Prediction vs Target')
plt.legend(loc='best')
plt.show()

# CellName: Ex2 (#9)
**CellType: Exercise**  
**Cell instruction:**  

The number of weights in the network, and of course the complexity of the problem itself, can also influence how long we need to train. The following two questions will highlight this.

## Question 5, vary epochs
The example below will load a slightly more complex *regr1* problem (an additional half period). We will use 10 hidden nodes for this problem. Use your optimal set of learning rate and batch size as found in Q4 and train the network below. **Compare the number of epochs needed to reach a good solution with that of Q4**. Note, you may need to vary the number of epochs a lot! If you cannot find a good solution in a reasonable number of epochs, you can "revert" the problem: optimize learning rate and batch size for Q5, and then see how those hyper-parameters perform on Q4.

## Question 6, vary network size and other hyper-parameters
Use the following line to load the *regr1* data set:

`x_trn, d_trn = regr1(75, 5, 1, 0.0)`

This will create an even more challenging regression task that may need an even larger network. Your task is to find a set of hyper-parameters (learning rate, batch size, epochs, 'size of the network') that result in a good solution. You can use more than one hidden layer for this task (if you want). To create several hidden layers, add more numbers to the `nodes` list, for example: `nodes=[10, 5, 5]`. **Present your set of good hyper-parameters and the result**.

**Note:** If you cannot solve this task in *reasonable* time, present your best attempt!


In [None]:
%%time

# seed = 0 means random, seed > 0 means fixed
seed = 0
np.random.seed(seed) if seed else None

# Generate training data
# For Q5:
x_trn, d_trn = regr1(50, 2.5, 0, 0.0)

# For Q6:
#x_trn, d_trn = regr1(75, 5, 1, 0.0)

# Standardization of inputs
mu, std = standard(x_trn)
x_trn = (x_trn - mu) / std

# Define the network, cost function and training settings
model_ex2 = Network(
    inputs=1,            # number of input nodes
    outputs=1,           # number of output nodes
    nodes=[10],           # number of nodes in hidden layer
    activation=nn.Tanh,  # activation function in hidden layer
    out_activation=None  # activation function in output layer (if not linear)
    ).to(device)         # move the model to the GPU or keep it on the CPU

# Optimization parameters
opt_method = torch.optim.SGD  # minimization method
learning_rate = 0.05          # learning rate
loss_fn = nn.MSELoss()        # loss function, MSE
number_epochs = 1000
minibatch_size = 50

# Additional metrics to print
metrics = {'MSE': torchmetrics.MeanSquaredError()}

# Print a summary of the model
print(model_ex2)

# Set up the optimizer
optimizer = opt_method(model_ex2.parameters(), lr=learning_rate)

# Turn the training data into a dataset with Tensors on the GPU or CPU
dset_trn = TensorDataset(torch.tensor(x_trn, device=device, dtype=dtype_torch),
                         torch.tensor(d_trn, device=device, dtype=dtype_torch))

# Create a batch loader for the training data
dl_trn = DataLoader(dset_trn, batch_size=minibatch_size)

# Train the network and print the progress
train_loss, val_loss, metrics_res = train_loop(
    model=model_ex2,
    train_dataloader=dl_trn,
    val_dataloader=dl_trn, # Test with the training data
    loss_fn=loss_fn,
    metrics=metrics,
    optimizer=optimizer,
    print_every=100,
    epochs=number_epochs)

# Plot the training history
plot_training(train_loss, val_loss, metrics_res)

# Predict output on the training data
d_pred = model_ex2.predict(x_trn)

# Plot the result
plt.figure()
plt.ylabel('Prediction / Target')
plt.xlabel('Input')
plt.scatter(x_trn, d_trn, label='Target')
plt.scatter(x_trn, d_pred, label='Prediction')
plt.title('Prediction vs Target')
plt.legend(loc='best')
plt.show()

# CellName: Ex3 (#10)
**CellType: Exercise**  
**Cell instruction:**  

We will now look at the classification problem defined by the *syn2* dataset.
The cell below defines an MLP with a single hidden node. With this network you can only implement a linear decision boundary. Run the cell below to look at the resulting boundary that the MLP learns. The training accuracy is typically around 87-93%; it varies between runs because the data is generated randomly each time you run the code.
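
Why a single hidden node gives a linear boundary can be illustrated numerically: the input enters the network only through one weighted sum, so all points on a line where that sum is constant get the same output. The sketch below uses plain Python and arbitrary example weights (assumptions, not trained values); `one_hidden_node` is a hypothetical helper.

```python
import math

def one_hidden_node(x, w1, b1, w2, b2):
    "Output of a 2-input MLP with a single tanh hidden node and a sigmoid output."
    z = w1[0] * x[0] + w1[1] * x[1] + b1      # the only place x enters
    h = math.tanh(z)
    return 1.0 / (1.0 + math.exp(-(w2 * h + b2)))

# Arbitrary example weights (assumptions, not trained values)
w1, b1, w2, b2 = (1.5, -0.5), 0.2, 2.0, -0.3

# Two different points on the same line w1 . x + b1 = 1.7
p = (1.0, 0.0)     # 1.5*1.0 - 0.5*0.0 + 0.2 = 1.7
q = (2.0, 3.0)     # 1.5*2.0 - 0.5*3.0 + 0.2 = 1.7
out_p = one_hidden_node(p, w1, b1, w2, b2)
out_q = one_hidden_node(q, w1, b1, w2, b2)
print(out_p, out_q)
```

Since tanh and the sigmoid are both monotonic, the set of points where the output equals 0.5 is one such line. More hidden nodes combine several of these sums, which is what allows a curved boundary.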

## Question 7, optimize hyper-parameters for classification
Your task is now to reach a larger accuracy by fitting a model with more hidden nodes (and possibly more hidden layers). 
Your aim is to reach a training accuracy > 95%. To do that you need to tune the learning rate, batch size, epochs and the size of your MLP. **Present your set of hyper-parameters that reach > 95% accuracy.**

**Note**: To always generate exactly the same dataset each time you run the code you can set the *seed* to a value > 0. 

## Question 8, change learning algorithm
We have so far only used stochastic gradient descent (SGD), but we know that there are modifications of SGD that are very popular, e.g. Adam.
**Try the Adam optimizer for Q7, and compare (qualitatively) the results and the number of epochs needed to get them.**

The interpretation of the learning rate differs a bit between SGD and Adam. Since your learning rate was optimized for SGD in Q7, you could consider optimizing it again for Adam, before you compare SGD with Adam. **Present changes you needed to make to improve the results of the Adam optimizer, if any.**

**Info**: Adam has two extra parameters. The way we call the Adam optimizer, they will be kept at their default values *beta1* = 0.9 and *beta2* = 0.999. 
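
One way to see why the learning rate means something different for Adam is to compare the very first update of a single weight. The sketch below follows the standard Adam update rule with bias correction, written in plain Python; the function names and the learning rate of 0.001 are assumptions made for the example, and it only covers the first step, not the behaviour over a whole training run.

```python
import math

def first_adam_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    "Size of the very first Adam update for a single weight with gradient grad."
    m = (1 - beta1) * grad            # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2       # second-moment estimate after one step
    m_hat = m / (1 - beta1)           # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def first_sgd_step(grad, lr=0.001):
    "Size of a plain SGD update for the same gradient."
    return lr * grad

for g in (0.01, 1.0, 100.0):
    print(g, first_sgd_step(g), first_adam_step(g))
```

The SGD step scales with the gradient, while the first Adam step is close to the learning rate itself regardless of the gradient magnitude. This is one reason a learning rate tuned for SGD may not be a good choice for Adam.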

## Bonus tasks
The bonus tasks are provided if you have extra time and want to continue to explore methods that can further enhance the minimization of the loss function. **These tasks are not required for the course and do not influence any grading**. 

The tasks listed below also mean that you have to change the code in code cell *Training* (#6). There will be links to appropriate documentation below.

* Go back to Q7 and add a momentum term to SGD. **Does momentum help?** (See documentation [here](https://docs.pytorch.org/docs/stable/generated/torch.optim.SGD.html))
* It is common to also introduce a mechanism that can lower the learning rate as we train. If we are using stochastic gradient descent the mini-batch gradients will never be zero, meaning that we will always make some small weight updates. PyTorch has methods that can lower the learning rate as we train (see [here](https://docs.pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)). Again go back to Q7 and now use an exponentially decaying learning rate (`ExponentialLR`). **Does it help?**
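
To get a feeling for a reasonable decay factor, the arithmetic of an exponential schedule can be sketched in plain Python: after each decay step the learning rate is multiplied by a constant `gamma`. The helper names and the start and end values below are assumptions for illustration only.

```python
def decayed_lr(lr0, gamma, epoch):
    "Learning rate after `epoch` multiplicative decay steps."
    return lr0 * gamma ** epoch

def gamma_for(lr0, lr_final, epochs):
    "Decay factor that takes lr0 to lr_final in `epochs` steps."
    return (lr_final / lr0) ** (1 / epochs)

gamma = gamma_for(0.05, 0.0005, 1000)    # example values, not a recommendation
print(gamma, decayed_lr(0.05, gamma, 500), decayed_lr(0.05, gamma, 1000))
```

In PyTorch the corresponding scheduler is `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)`, with `scheduler.step()` called once per epoch. Note that `gamma` has to be very close to 1 when training for many epochs, otherwise the learning rate vanishes long before training ends.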


In [None]:
%%time

# seed = 0 means random, seed > 0 means fixed
seed = 0
np.random.seed(seed) if seed else None

# Generate training data
x_trn, d_trn = syn2(100)

# General standardization of input data
mu, std = standard(x_trn)
x_trn = (x_trn - mu) / std

# Define the network, cost function and training settings
model_ex3 = Network(
    inputs=x_trn.shape[1],      # number of input nodes
    outputs=1,                  # number of output nodes
    nodes=[1],                 # number of nodes in hidden layer
    activation=nn.Tanh,         # activation function in hidden layer
    out_activation=nn.Sigmoid   # activation function in output layer
    ).to(device)                # move the model to the GPU or keep it on the CPU

# Optimization parameters
opt_method = torch.optim.SGD    # minimization method
learning_rate = 0.05             # learning rate
loss_fn = nn.BCELoss()          # loss function, binary cross entropy
number_epochs = 1000
minibatch_size = 100

# Additional metrics to print
metrics = { 'accuracy': torchmetrics.Accuracy('binary') }

# Print a summary of the model
print(model_ex3)

# Set up the optimizer
optimizer = opt_method(model_ex3.parameters(), lr=learning_rate)

# Turn the training data into a dataset with Tensors on the GPU or CPU
dset_trn = TensorDataset(torch.tensor(x_trn, device=device, dtype=dtype_torch),
                         torch.tensor(d_trn, device=device, dtype=dtype_torch))

# Create a batch loader for the training data
dl_trn = DataLoader(dset_trn, batch_size=minibatch_size)

# Train the network and print the progress
train_loss, val_loss, metrics_res = train_loop(
    model=model_ex3,
    train_dataloader=dl_trn,
    val_dataloader=dl_trn, # Test with the training data
    loss_fn=loss_fn,
    metrics=metrics,
    optimizer=optimizer,
    print_every=100,
    epochs=number_epochs)

# Plot the training history
plot_training(train_loss, val_loss, metrics_res)

# Predict output on the training data
d_pred = model_ex3.predict(x_trn)

# Plot the decision boundary
decision_boundary(x_trn, d_trn, model_ex3)

# The report!

We have added instructions inside this report template. As you write your report, remove these instructions.

## Your name

## Introduction
A few sentences about the overall theme of the exercise.

## Answers to questions
Provide enough information to clarify the meaning of your answers, so that they can be understood by someone who does not scroll up and read the entire instruction.

The questions are repeated here, for clarity of what is demanded. If it does not fit your style to quote them verbatim, change the format. 

**Question 1**, variations in pre-defined MLP  
**(a)** Do you see the same loss vs epoch behavior each time you run? If not, why?  
**(b)** Do you observe that training fails, i.e. does not reach a low loss, during any of these five runs?

**Question 2**, vary learning rate  
Present your average MSE results and discuss your findings.

**Question 3**, vary (mini)batch size  
Present and discuss your findings.

**Question 4**, select good hyper-parameters  
Present your best combination of learning rate and batch size, and its result.

**Question 5**, vary epochs  
Compare the number of epochs needed to reach a good solution with that of Q4.  

Note: If you cannot find a good solution in a reasonable number of epochs, you can "revert" the problem: optimize learning rate and batch size for Q5, and then see how those hyper-parameters perform on Q4.

**Question 6**, vary network size and other hyper-parameters  
Present your set of good hyper-parameters and the result. 

Note: If you cannot solve this task in *reasonable* time, present your best attempt!

**Question 7**, optimize hyper-parameters for classification  
Present your set of hyper-parameters that reach > 95% accuracy.

**Question 8**, change learning algorithm  
Try the Adam optimizer for Q7, and compare (qualitatively) the results and the number of epochs needed to get them. Present changes you needed to make to improve the results of the Adam optimizer, if any.

**Bonus tasks** (if you feel inspired)

## Summary
Connect the summary to your introduction, to provide a brief overview of your findings.
  