# Machine learning tutorial - Exercise

Machine learning (ML) has seen rapid growth in the past decade and has revolutionized many fields in both industry and research, achieving state-of-the-art results and solving highly complex problems. Due to the sheer vastness of this topic, it is impossible to give a full overview in a single tutorial.

Instead, we focus on equipping you with the basic tools you need to get started in the field of ML and to use it in your own research project. In particular, we will use one of the most popular ML frameworks, [pytorch](https://pytorch.org/), to solve two typical tasks: classification and regression.

## Choosing a dataset

There exist several curated "benchmark" datasets that have been used in many papers and competitions for comparing the performance of different machine learning algorithms. Since we assume the audience to be interested in scientific datasets in particular, we chose the [Higgs dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) for our classification task. For the regression task, we chose the [life expectancy dataset](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who) by the World Health Organisation (WHO). Let's get started!

**Tip: There are several interesting websites to look for openly available benchmark datasets, like [google dataset search](https://datasetsearch.research.google.com/), [kaggle](https://www.kaggle.com/datasets/) or [paperswithcode](https://paperswithcode.com/datasets). There are also built-in datasets that you can easily access using the pytorch API, see [this Link](https://pytorch.org/vision/stable/datasets.html) for more information or [this Link](https://www.tensorflow.org/datasets/catalog/overview) for the `TensorFlow` equivalent.** 

# Part 1: Regression task

To learn about the fundamentals of `pytorch`, we will start with a multivariate regression task. We use the WHO [life expectancy dataset](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who), where we try to predict the life expectancy of a country given different medical, economic and social factors.


## Downloading the WHO life expectancy dataset

The dataset can be downloaded from the above kaggle link, or simply by typing the following command in a command shell:

```
wget -O life_expectancy_data.csv https://wolke.physnet.uni-hamburg.de/index.php/s/Bdnmoit483Xrnmo/download
```


## Analyzing the WHO dataset

Before we actually start with the regression, we do some initial exploratory data analysis first. This is important to get an idea of what the data looks like and check if we have outliers or missing values.

A useful library for data analysis is [pandas](https://pandas.pydata.org/), which stores data in a so-called "data frame". A data frame is essentially a table, where each row represents a sample of our dataset and each column represents a specific feature of this sample. For example, in the case of the WHO dataset, each row represents a country and its respective data for a particular year, and each column represents a specific economic or demographic feature, such as population or GDP.

The pandas library also comes with built-in plotting functions and allows for efficient exploration, preprocessing and manipulation of our data.
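As a small illustration of this row/column structure, here is a toy data frame. The country names and numbers are made up for this sketch and are not taken from the WHO dataset.

```python
import pandas as pd

# toy data frame with made-up values (not taken from the WHO dataset)
df_toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "GDP": [1000.0, 1100.0, 250.0],
})

# each row is one sample (a country in a given year),
# each column is one feature of that sample
print(df_toy.shape)          # (number of rows, number of columns)
print(df_toy["GDP"].mean())  # summary statistic of a single column
```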


In [None]:
# Import necessary libraries
import torch
from torch import nn
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import time

In [None]:
# load data from csv file
df_who = pd.read_csv("life_expectancy_data.csv")

# take a look at dataframe
df_who

As we can see, this dataset consists of 2938 rows and 22 columns. We also see some of the column names that hint at what information they contain. The first column, `Country`, contains the country name. We see that some countries occur multiple times. This is because the dataset contains data from the years 2000 to 2015 and the data of each year is put in a separate row for each country. The `Status` column tells us about the development status of a country. Then, we see several columns with health-related information, such as the `Life expectancy`, the `Adult Mortality`, the `Alcohol` consumption etc.


Let us explore this dataset a bit before we start with the analysis. We can make use of the plotting functions contained within `pandas` for this. Looking at all the 22 features of this dataset will take quite some time, so for this tutorial, we focus only on a subset of them. However, in your own research we recommend checking all the variables you want to analyze.

In [None]:
# columns of pandas dataframes can be accessed similar to dictionary
# values in python, by stating my_df["column_name"]. If you access a
# single column this way, you get a pandas Series object. If you pass
# multiple column names as a list (i.e.
# my_df[["column_name_1", "column_name_2"]]) it will return a new DataFrame
# object with the selected columns only.

# we will select four variables for this test
selected_columns = ["Life expectancy", "Adult Mortality", "Alcohol", "GDP"]

# get new DataFrame object with the selected columns only
df_subset = df_who[selected_columns]

df_subset

First, let's plot the distributions of these variables. In particular, let's take a look at developed and developing countries separately. We can achieve this using the `Status` column in the original DataFrame object, `df_who`:

In [None]:
df_who["Status"].value_counts()

The `value_counts` method is very useful if we want to know about the unique values in a column and how often they occur. We see that a country can have either the value `Developed` or `Developing`.

We can simply create a selection mask from that column and use `pandas` built-in plotting methods to compare the distributions of our selected column variables for developed and developing countries separately.

In [None]:
# create mask for developed/developing country comparison
dev_mask = (df_who['Status'] == "Developed")

# Rearrange selected columns to a 2x2 matrix (as numpy array)
# to use in the subplots call (this arranges the plots of the
# four selected features in a 2x2 way)
colname_matrix = df_subset.columns.to_numpy().reshape(2,2)
fig, axs = plt.subplots(2,2, figsize=(14, 7))

for row in range(2):
    for col in range(2):
        colname_tmp = colname_matrix[row, col]

        # set current axis to subplot index
        plt.sca(axs[row, col])
        
        # define bin edges and use histogram plotting function
        # of pandas dataframe
        bin_edges = np.linspace(df_subset[colname_tmp].min(),
                                df_subset[colname_tmp].max(), 50)
        df_subset[~dev_mask][colname_tmp].plot.hist(bins=bin_edges,
                                                    label="Developing",
                                                    density=True)

        df_subset[dev_mask][colname_tmp].plot.hist(histtype="step",
                                                   bins=bin_edges,
                                                   label="Developed",
                                                   density=True)

        plt.xlabel(colname_tmp)
        plt.legend(loc="upper right")

plt.show()
plt.close()

As we can see, developed countries have a significantly higher life expectancy and lower adult mortality than most developing countries. Interestingly, the alcohol consumption seems higher in developed countries as well. The GDP distribution of developed countries leans more towards higher values in the "tail", while developing countries usually have a low GDP.

All of these values are more or less what one would naively expect, so we do not see any variable being distributed in a completely counter-intuitive way.

Another important step before starting any analysis is to **look for outliers** in the data. Outliers can sometimes worsen your analysis results, or may point you to faulty datapoints. For example, let's say there is a faulty entry in the WHO dataset whose population value is very high (say, above the entire population on earth). Or, imagine you want to analyze only developing countries and the `Status` label was wrongly assigned such that you have a single developed country in your dataset. In this case, outlier analysis helps you to pinpoint these entries (in the former case by looking at the `Population` column, in the latter case by finding a very high value e.g. in the `GDP` column) and remove them from the dataset.

To visually assess if there are outliers in our WHO dataset, we can use the box plot method of our `DataFrame`.

In [None]:
# as before: arrange plots in 2x2 matrix
colname_matrix = df_subset.columns.to_numpy().reshape(2,2)
fig, axs = plt.subplots(2,2, figsize=(10, 7))

for row in range(2):
    for col in range(2):
        colname_tmp = colname_matrix[row, col]
        
        # set current axis to subplot index
        plt.sca(axs[row, col])
        
        df_subset[colname_tmp].plot.box()
        
plt.show()
plt.close()

Box plots are a great way of investigating important features of a distribution at a glance: The outer edges of the box correspond to the first and third quartiles, Q1 and Q3 (i.e. the 25th and 75th percentiles). The green line represents the median. The "whiskers" extend at most 1.5 $\cdot$ IQR beyond the box edges, where IQR is the "interquartile range", IQR = Q3 - Q1. More precisely, each whisker ends at the most extreme datapoint that still lies within 1.5 $\cdot$ IQR of the box, so if the highest or lowest datapoint of the distribution is closer to the box than that, the whisker simply ends at that datapoint. The points beyond the whiskers are the ones that lie more than 1.5 $\cdot$ IQR away from the box and thus could be considered outliers.
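The whisker limits can also be computed by hand. The following is a minimal sketch on made-up numbers (not WHO data), assuming the default whisker length of 1.5 $\cdot$ IQR that the box plot uses:

```python
import pandas as pd

# made-up values with one obviously large entry
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# first and third quartile and the interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# points outside these limits are drawn as individual markers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = values[(values < lower_limit) | (values > upper_limit)]
print(outliers)
```

The same selection could be applied to a column such as `df_subset["GDP"]` to list the candidate outliers instead of only plotting them.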

We see quite a few outliers in the `GDP` and `Adult Mortality` features. This is expected for distributions with long tails, which is what we saw in the previous distribution plots. Still, the few very high points in the `GDP` column look a bit suspicious, so let's take a look just to be sure.

In [None]:
df_who[["Country", "GDP"]].sort_values(by="GDP", ascending=False)

We see that the highest values come from Luxembourg, which is known to rank amongst the countries with the highest GDP per capita in the world, so this seems to be a correct value and not an actual outlier.

## Preprocessing the WHO dataset

### Step 1: Encode non-numerical features

We have now looked at some of the distributions and checked for outliers. However, in order for the data to be used in a machine learning model, we need to clean it and put it in a representation that such a model can understand. Machine learning models can be trained on **numerical** inputs, which can be integers or floats. However, we see that there are at least some columns (e.g. the `Status` column) that have non-numerical values.

We can easily check the datatypes of all our columns using the `dtypes` attribute:

In [None]:
df_who.dtypes

We see that we are lucky: Most columns have a numeric type, like `int` or `float`. `String` columns typically have the `object` type in pandas dataframes. We see that there are two string-encoded columns: the `Country` column, which simply contains the name of the country and the `Status` column, which tells us about the development status of a country.

For a machine learning model to work on this data, we need to turn these two columns into a numerical representation. This means we need to employ some kind of **encoding**. This is a topic in itself, but in principle there are two major ways to encode strings that are categorical (i.e. there exists a limited number of unique string values, such as the country names):

1. Ordinal encoding: If there is an intrinsic order to the strings, we can simply convert the string values into integers, ranging from 1 to the number of unique string values. For example, say our column was "shirt size" and we had the values "S", "M", "L", "XL" and "XXL", then we could just encode them with integers from 1 to 5. This is possible because the size strings carry meaning: "S" is a smaller size than "XL", which is why it makes sense to assign it a smaller value.

2. One-hot encoding: If there is no intrinsic order to the strings (for example in case of our country names), we can do what is called a [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/). In a one-hot encoding, we **create a new column for each unique string value in the original column** and this column is 1 where this unique value occurs and 0 everywhere else. For example, let's imagine our country names would only contain the names of three countries: France, Germany and Belgium. To one-hot encode this column, we would create three new columns: one column is 1 where the string "France" occurs in the original column and else it is 0. The second column is 1 where the string "Germany" occurs and else it is again 0. And the same for the "Belgium" string. This can of course inflate our dataset quite a bit, in particular if we have many unique values.
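Both encodings can be sketched in a few lines of `pandas`. The toy data frame below reuses the shirt size and country examples from the list above; the data frame and variable names are made up for this illustration.

```python
import pandas as pd

# toy data reusing the shirt size and country examples
df_enc = pd.DataFrame({
    "size": ["S", "XL", "M", "L"],
    "country": ["France", "Germany", "Belgium", "France"],
})

# ordinal encoding: map each size to an integer that respects the order
size_order = {"S": 1, "M": 2, "L": 3, "XL": 4, "XXL": 5}
df_enc["size_encoded"] = df_enc["size"].map(size_order)

# one-hot encoding: one new 0/1 column per unique country name
country_onehot = pd.get_dummies(df_enc["country"], prefix="country", dtype=int)

print(df_enc)
print(country_onehot)
```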

Let's investigate the two `object` columns a bit. We can use the `value_counts` method to get the unique values in a column and how often they occur.

In [None]:
df_who["Country"].value_counts()

We see that there are 193 countries in that column. Some occur 16 times, which means they contain individual rows for each of the years 2000 to 2015, others only occur once (probably due to missing data in several years). If we were to one-hot encode this column, we would add 193 columns, which seems a bit much. For now, we postpone the decision of what to do with this column and just drop it for the first test.

In [None]:
# drop Country column for now
df_who = df_who.drop(columns=["Country"])

Now let's take a look at the `Status` column:

In [None]:
df_who["Status"].value_counts()

The `Status` column only has two unique values: `Developed` or `Developing`. For this column, one could argue to use an ordinal encoding, since developed and developing countries have certain unique quantifiable differences in terms of economics and also in other areas. This could lead us to use a higher integer value for developed countries and a lower one for still developing ones. However, since there are only two unique values in this column, we go for a one-hot encoding here.

In [None]:
# simple pandas function to one-hot encode a column: get_dummies

encoded_status = pd.get_dummies(df_who["Status"], prefix="status")

encoded_status

We see that there are now two new columns: one for developed and one for developing countries. We could actually just drop one of these columns, since they are perfectly (anti-)correlated, so the whole information is already contained in one of them. Note that this would result in a single column that is 0 for one value and 1 for the other one, which could also be interpreted as an ordinal encoding of the original feature. So if there are only two unique values, a one-hot encoding already has a kind of ordinal structure to it.

In [None]:
# drop one of the new columns
encoded_status = encoded_status.drop(columns=["status_Developing"])

# drop the original status columns from the WHO DataFrame
df_who = df_who.drop(columns=["Status"])

# add the new status_Developed column to the WHO DataFrame
df_who["status_Developed"] = encoded_status["status_Developed"]

Let's again check the datatypes in our `DataFrame`. We should see that it doesn't contain `object` dtypes any more.

In [None]:
df_who.dtypes

### Cleaning the data

Another important point in any data analysis task is to check for infinite (`INF`) and not-a-number (`NaN`) values. These values can for example occur if data is missing or if a faulty measurement was obtained. We can use our data frame in combination with the `numpy` function `isfinite`:

In [None]:
# There is a function called isna() that we can call on pandas dataframes
# to check for NaN values. For INFs we could use isin([-np.inf, np.inf]), but
# it is easier to use the numpy function isfinite() to check for both at the
# same time.

# We sum over the boolean array, which is 1 at all occurrences that are
# NaN or INF, per column to count them
(~np.isfinite(df_who)).sum()

We see that we have several columns where there are multiple `NaN` or `INF` values. Let's check specifically for `INF` values.

In [None]:
np.isinf(df_who).sum()

We see that there are no `INF` values, which means all the non-finite values we saw in the previous test are `NaN`.

There are several ways to handle `NaN` values: If only a few rows are affected, we can simply drop these. However, this is usually not ideal since we also remove information from the data. If there are columns where the majority of rows is `NaN`, we can also consider dropping the column, since it doesn't carry much information that we can use.

Often, a better choice than simply dropping the values is **imputing** them, i.e. replacing the missing values with some other value. Typical choices for the replacement value are the mean or the median of the feature distribution. However, these methods are not ideal either, since they do not preserve the relationships among variables. The optimal way would be to use some kind of domain knowledge to impute the values. For example, if you were an expert in the population growth behavior of developing countries and you knew that their population growth is best described by an exponential function, you could simply take the existing `Population` values of a country, do an exponential fit and set the missing values accordingly.
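For reference, mean and median imputation can be sketched with the `fillna` method of `pandas`. The numbers below are made up for this illustration and are not WHO data.

```python
import numpy as np
import pandas as pd

# toy column with one missing entry (made-up numbers)
df_missing = pd.DataFrame({"GDP": [100.0, np.nan, 300.0, 1000.0]})

# mean() and median() ignore NaN values by default, so they are
# computed from the three observed values only
gdp_mean_imputed = df_missing["GDP"].fillna(df_missing["GDP"].mean())
gdp_median_imputed = df_missing["GDP"].fillna(df_missing["GDP"].median())

print(gdp_mean_imputed)
print(gdp_median_imputed)
```

Note how the single large value pulls the mean well above the median here, which is one reason why the median is often preferred for long-tailed features.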

Beyond that, there exist many other methods to impute missing values. You can read more on the topic for example [here](https://odsc.medium.com/data-imputation-beyond-mean-median-and-mode-6c798f3212e3). Another approach is based on the k-nearest neighbors algorithm: Here, a missing value is replaced by the mean of that feature over the $k$ nearest neighboring data points, where "nearest" is measured by the Euclidean distance computed from the features that are not missing. You can read more about this topic [here](https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/). We will use this algorithm here, since it is readily available in scikit-learn (`sklearn`) as the `KNNImputer`.

In [None]:
from sklearn.impute import KNNImputer

# use imputer defaults for now
imputer = KNNImputer()

# transform the dataframe and save it into a new DataFrame object
df_who_clean = pd.DataFrame(imputer.fit_transform(df_who),
                            columns=df_who.columns)

Let's check for `NaN` values again:

In [None]:
(~np.isfinite(df_who_clean)).sum()

It worked! We imputed all the `NaN` values using the kNN imputation approach! Now we are done with the first data cleaning step.


### Preprocess samples for training

Now comes another important step: We need to split our dataset into three parts: The **training** set, the **validation** set and the **test** set. These datasets need to be **completely separate** from each other, in order to guarantee statistically independent results. Let's quickly discuss why this is necessary:

The training set is used for training the deep neural network (DNN). A DNN has many parameters, or weights, that are adjusted during the training to minimize a certain loss function: in case of the regression, this will be the **mean squared error** and in case of the binary classification, this will be the [**binary crossentropy**](https://en.wikipedia.org/wiki/Cross_entropy) loss. So in the end, we just fit a very complex function to our input dataset, which comes with a caveat: At first, the DNN will learn useful patterns in the data, but at some point the network will start to learn the "noise", i.e. the particular statistical fluctuations in our input distributions, to further decrease the loss. The DNN then just "memorizes" the data instead of learning the general pattern that helps solve our statistical problem.

What we are interested in, however, is learning a general function that does not only perform well on the training set, but also *generalizes* well to other, previously unseen datasets or samples. This is why we need a second, statistically independent dataset when we evaluate our model. This is the **validation** set. With the validation set, we can give an estimate of the **generalization performance** of our trained model. The validation set is also useful to detect **overfitting**, which is another word for the above-mentioned phenomenon of the DNN model starting to learn the noise of the training set.

In practice, we regularly evaluate our model both on the training and on the validation set during training. At first, the loss is minimized for both sets, but at some point the validation loss stalls out at a certain value or even increases, while the training loss goes further down. At this point, we can stop the training and select the model with the minimum validation loss, since it will yield the optimal generalization performance. This procedure is also referred to as **model selection**.
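The selection step itself is simple once the losses per epoch are recorded. Here is a minimal sketch with made-up loss values (they are not results from an actual training):

```python
import numpy as np

# made-up loss values per epoch, for illustration only
train_loss = np.array([1.00, 0.60, 0.40, 0.30, 0.22, 0.15])
val_loss = np.array([1.05, 0.70, 0.55, 0.50, 0.53, 0.60])

# model selection: pick the epoch with the lowest validation loss,
# even though the training loss keeps decreasing afterwards
best_epoch = int(np.argmin(val_loss))
print(f"best epoch: {best_epoch}, validation loss: {val_loss[best_epoch]:.2f}")
```

In an actual training one would additionally store the model weights at that epoch, so that the selected model can be restored afterwards.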

However, the model selection also introduces a bias, since we select the model with the best generalization performance *on that particular validation set*. This is why we need a third statistically independent dataset, the **test set**, to finally have an objective performance measure that is affected by neither the training nor the model selection procedure.

Here is a summary figure of what we just discussed:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_4daa0520c9ce87a37f88ee55c5f2f82d.png" width=60%>
</p>

There is no general rule how to optimally split up your dataset, but typically the majority of data will go into the training set, while test and validation set are kept roughly the same. We will choose a 70/15/15 percent split for our first test.

Before we do that, we also split off the life expectancy column from the `DataFrame`, since this is the "target" variable that we want to predict from all the other features.

In [None]:
# Split the cleaned DataFrame into targets and features
targets = df_who_clean['Life expectancy']
data = df_who_clean.drop(columns=['Life expectancy'])

# Split into training/validation/test sets according to 70/15/15 split
from sklearn.model_selection import train_test_split

# split off training set (70% of samples)
x_train, x_remain, y_train, y_remain = train_test_split(data, targets,
                                                        train_size=0.7,
                                                        random_state=42)


# split remainder into test and validation sets
# (15% each, corresponding to half of the remaining non-training samples)
x_val, x_test, y_val, y_test = train_test_split(x_remain, y_remain,
                                                train_size=0.5,
                                                random_state=42)

In [None]:
# let's see if the numbers check out
n_full = data.shape[0]
train_percent = (x_train.shape[0]/n_full)*100
val_percent = (x_val.shape[0]/n_full)*100
test_percent = (x_test.shape[0]/n_full)*100

print(f"Train set corresponds to {train_percent:.2f}% of the full data.")
print(f"Validation set corresponds to {val_percent:.2f}% of the full data.")
print(f"Test set corresponds to {test_percent:.2f}% of the full data.")

It seems everything checks out! Now we also would like to apply some more preprocessing to our data. Depending on your project, preprocessing - or "data wrangling" - can take up a significant proportion of your work. In the WHO dataset example we only apply a "standard scaling", which means that we subtract the mean and divide by the standard deviation in each feature, such that all features end up with a mean of 0 and a standard deviation of 1.

This makes sure our input features are all in a similar range, which can help algorithms based on gradient descent (which our DNN optimization is) converge faster and more reliably. For more information, see e.g. [this blog post](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/) for an in-depth discussion of feature scaling and when to use which scaling method.
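To make the operation concrete, here is a sketch of standard scaling done by hand with `numpy` on made-up numbers. The array names are chosen for this illustration only; in the tutorial itself we rely on the scaler provided by scikit-learn.

```python
import numpy as np

# toy "training" and "validation" features (made-up numbers)
x_toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_toy_val = np.array([[2.0, 400.0]])

# mean and standard deviation per feature, from the training set only
mean = x_toy_train.mean(axis=0)
std = x_toy_train.std(axis=0)

# apply the same shift and scale to both sets
x_toy_train_scaled = (x_toy_train - mean) / std
x_toy_val_scaled = (x_toy_val - mean) / std

print(x_toy_train_scaled.mean(axis=0), x_toy_train_scaled.std(axis=0))
print(x_toy_val_scaled)
```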

In [None]:
# import standard scaler from scikit-learn
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit (i.e. do the mean/std computation based on the training set features)
# and transform training data
x_train = scaler.fit_transform(x_train).astype("float32")

# important: the standard scaling is always with respect to the training set
# values of mean and std! So DO NOT call fit_transform again on the other sets
# but just transform using the already fitted scaler.
x_val = scaler.transform(x_val).astype("float32")
x_test = scaler.transform(x_test).astype("float32")

# Note: We also cast all values to 32 bit floats here, since this is what
# pytorch models expect as inputs by default, while numpy arrays are stored
# in 64 bit precision
y_train = y_train.astype("float32")
y_val = y_val.astype("float32")
y_test = y_test.astype("float32")

In [None]:
# check if it worked
print("Means of training set:\n", x_train.mean(axis=0), "\n")
print("Standard deviations of training set:\n", x_train.std(axis=0), "\n\n")

print("Means of validation set:\n", x_val.mean(axis=0), "\n")
print("Standard deviations of validation set:\n", x_val.std(axis=0), "\n\n")

print("Means of test set:\n", x_test.mean(axis=0), "\n")
print("Standard deviations of test set:\n", x_test.std(axis=0), "\n\n")

It seems it worked! While means and standard deviations are 0 and 1 respectively for the training set, we get slightly higher/lower values in the validation and test sets, which reflects the fact that we used the standard scaler based on the training set values only and then just applied the transformations to the other two sets.

We are now good to go! We loaded our dataset, did some exploratory analysis, checked for outliers and `INF`/`NaN` values and then preprocessed it. Now we will build our neural network using `pytorch`!

## Building the `pytorch` model

Writing custom models in `pytorch` is relatively straightforward. For this tutorial, we will set up what is often referred to as a "dense" or "fully connected" neural network, which is essentially a model that consists of several layers. Each layer contains a certain number of nodes and all nodes from one layer are connected to all the nodes in the adjacent layers.

These fully connected networks (FCNs) have an input layer, an output layer and so-called "hidden layers" in between. The number of nodes in the input layer corresponds to the number of input features of our dataset. The output layer usually has one node for a regression problem, where we try to predict a single variable from different input variables. For binary classification problems there is also a single node in the output layer, in which case the node outputs the probability of an event belonging to the "positive" class. However, there are also multi-class and multi-label classification problems, where multiple output nodes are necessary.

Below you can find a simple picture of such a FCN architecture:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_7f05223770eb84859360fee088b1e88f.png" width=60%>
</p>

The number of hidden layers and of nodes inside them is set by the user, since there is no "optimal" rule how many one should take. For one ML problem, fewer layers with many nodes are beneficial, while for other problems it might be the opposite. Varying and finding the setting of number of layers and nodes per layer that yields the best result for a given problem is often referred to as ["hyperparameter optimization"](https://en.wikipedia.org/wiki/Hyperparameter_optimization) and it is a task that needs to be re-done for each problem or network architecture. One can use several methods to tune hyperparameters, ranging from "brute force" methods such as [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or random search to more sophisticated algorithms based on Bayesian optimization. 

Common layers of neural networks are provided in the [`torch.nn`](https://pytorch.org/docs/stable/nn.html) module of `pytorch`. A fully connected layer can be implemented by using [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) and providing the number of input and output values of this layer. This means that, in addition to the number of nodes in the current layer, you also need to think about how many nodes the next layer should have.

Let's take an example: Assume we had 5 input features and would like to create a layer for that. Our next layer, i.e. the first hidden layer, should have 16 nodes, which is why we would need to create the layer like this: `layer = nn.Linear(5, 16)`. Note that now the next linear layer in your model needs to have the same number of inputs as the output of the previous layer, so `16` in this case. You can now stack many of these layers like this until the output layer, which should have a number of output values of `1` for our regression problem.
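The following sketch shows how the input and output sizes of stacked linear layers have to match. The batch of 8 random samples and the second hidden layer with 16 nodes are arbitrary choices for this illustration.

```python
import torch
from torch import nn

# 5 input features -> 16 hidden nodes -> 16 hidden nodes -> 1 output
layer_in = nn.Linear(5, 16)
layer_hidden = nn.Linear(16, 16)
layer_out = nn.Linear(16, 1)

# a batch of 8 random samples with 5 features each
x_batch = torch.randn(8, 5)

# the output size of each layer matches the input size of the next one
out = layer_out(layer_hidden(layer_in(x_batch)))
print(out.shape)
```

Each of the 8 samples ends up with a single output value, as needed for our regression problem.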

However, this is not the full story yet! In order for the FCN to be able to learn complex non-linear functions of the data, we need to apply an [activation function](https://en.wikipedia.org/wiki/Activation_function) to the outputs of our layers. The most common and useful activations are also contained in the `torch.nn` module and we can simply add them to our model. One common activation function is the ["Rectified Linear Unit"](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) (ReLU), which is zero for negative values of $x$ and equal to $x$ for values $x\ge0$. We can add this activation by placing `torch.nn.ReLU()` after a linear layer.
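As a quick check of this definition, we can apply the ReLU activation to a few hand-picked values:

```python
import torch
from torch import nn

relu = nn.ReLU()

# negative inputs are set to zero, non-negative inputs pass through unchanged
x_values = torch.tensor([-2.0, -0.5, 0.0, 1.5])
relu_values = relu(x_values)
print(relu_values)
```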

`ReLU` activations are especially useful in input and hidden layers. For the output layer, however, we need to choose a different activation function: For a regression problem, we would like to predict a continuous value, which means that we can simply use a linear activation in the end. Since we defined our output layer as a linear layer with one node already, we do not need to add a particular activation in this example.

To put it all together, we can create a fully connected model in `pytorch` by stacking linear layers, each followed by an activation (such as `ReLU`), and making sure that the activation of our output layer fits our problem at hand (e.g. linear activation for regression). In practice, we implement this by appending layers and activations to a python `list` and, once we are done, we create a model by providing the list to a [`torch.nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) object, which is just a sequential container that builds our final model from our provided modules.

In the code below, we have conveniently wrapped this into a separate class called `Regressor` which contains other useful methods that come in handy later in this tutorial.

In [None]:
class Regressor(nn.Module):
    def __init__(self, layers, n_inputs=5):
        super().__init__()

        # the layers variable will be the list that contains the individual
        # linear layers and their activations
        self.layers = []
        for nodes in layers:
            # in the first iteration, n_inputs will correspond to the number
            # of input features in our dataset. After the input layer, the
            # value of n_inputs is set to the number of output features, such
            # that the layers are always stacked in the correct way
            self.layers.append(nn.Linear(n_inputs, nodes))
            self.layers.append(nn.ReLU())
            n_inputs = nodes

        # the final output linear layer. We do not need an activation here,
        # since for regression, we are interested in having a continuous
        # output, which the linear layer already provides by default
        self.layers.append(nn.Linear(n_inputs, 1))

        # build pytorch model as sequence of our layers
        self.model_stack = nn.Sequential(*self.layers)

    def forward(self, x):
        # the forward call just takes data (x) and sends it through the model
        # to produce an output
        return self.model_stack(x)

    def predict(self, x):
        # the predict method sets the model to evaluation mode and only then
        # computes the model prediction of given data (x). For convenience,
        # we already make sure that the output prediction is a numpy array, so
        # we can use it more easily in the final evaluation of the model.
        with torch.no_grad():
            self.eval()
            # as_tensor accepts both numpy arrays and existing tensors
            x = torch.as_tensor(x)
            prediction = self.forward(x).detach().cpu().numpy()
        return prediction

## Running the training

For running the training, we need to implement a "training loop". In Machine Learning, we typically optimize our model in an iterative way. This means that we use our training data multiple times to adjust the weights of the FCN and get better performance. Additionally, we often face the problem that our dataset is too large to fit into the memory at once. This is why usually we need to run the training in a "batched" fashion.

For this, we use a tool called a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) that is included in `pytorch`. The `DataLoader` takes our full dataset and splits it into small batches (in random order if `shuffle=True`). We then loop over all the batches and, after each batch, adjust the weights of the network until we have looped over all the samples in our dataset this way. This is the "inner training loop" and once it concludes, we say that one "epoch" of training is finished.
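
To see the batching in isolation, here is a small sketch with a toy dataset of 10 samples. The tensors are made up for illustration and have nothing to do with the life expectancy data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# toy dataset: 10 samples with 3 features each and one target per sample
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10, dtype=torch.float32).reshape(-1, 1)
toy_set = TensorDataset(features, targets)

# batch_size=4 splits the 10 samples into batches of 4, 4 and 2 samples
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=True)

batch_sizes = []
for data, target in toy_loader:
    batch_sizes.append(data.shape[0])

print(len(toy_loader))  # number of batches per epoch: 3
print(batch_sizes)      # [4, 4, 2]
```

Note that `len(toy_loader)` is the number of batches, not the number of samples. This is the quantity we divide by when we compute the average loss per batch.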

In the inner training loop, we also need to use an [`optimizer`](https://pytorch.org/docs/stable/optim.html), which takes care of adjusting the model parameters to minimize the loss at each training step. There exist many optimization algorithms that one can use to approach the minimum of the loss function. Two very famous algorithms are [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent), which is implemented in `pytorch` as [`torch.optim.SGD`](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD), and [Adam](https://optimization.cbe.cornell.edu/index.php?title=Adam), which is implemented as [`torch.optim.Adam`](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam).
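
The following sketch shows that both optimizers are used in exactly the same way. It minimizes a toy loss $(w-3)^2$ with a single parameter $w$. The helper `minimize_toy_loss`, the toy loss and the learning rates are assumptions made up for this illustration.

```python
import torch
from torch import optim


def minimize_toy_loss(optimizer_class, steps=500, **kwargs):
    # hypothetical helper: a single trainable parameter, started at zero
    w = torch.zeros(1, requires_grad=True)
    optimizer = optimizer_class([w], **kwargs)
    for _ in range(steps):
        loss = ((w - 3.0) ** 2).sum()  # toy loss with its minimum at w = 3
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.item()


w_sgd = minimize_toy_loss(optim.SGD, lr=0.1)
w_adam = minimize_toy_loss(optim.Adam, lr=0.1)
print(f"SGD: w = {w_sgd:.3f}, Adam: w = {w_adam:.3f}")
```

Both runs should end up close to $w=3$. Swapping the optimizer only changes the line where it is created, the rest of the loop stays the same.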

The inner training loop needs to follow a fixed structure in `pytorch` that looks like this:

```python

# Send (batch of) inputs through the model
model_outputs = model(inputs)

# compute the loss of the model outputs using the model predictions and truth labels/target variables
loss = loss_function(model_outputs, targets)

# make sure gradients are reset to zero in each iteration
optimizer.zero_grad()

# backpropagate the loss. This will compute the gradient of the loss
# w.r.t. the weights of our model, which we need to update the weights
loss.backward()

# finally, actually update the weights
optimizer.step()
```

This "inner loop" is a loop over all the batches in our data, and it ends once we used up all of them.
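
As a self-contained sketch, the snippet below runs this inner loop for one epoch on synthetic data. The data ($y = 2x + 1$ plus noise), the one-layer model and the learning rate are assumptions chosen for illustration only.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# synthetic data (for illustration only): y = 2 * x + 1 plus a little noise
x = torch.randn(256, 1)
y = 2 * x + 1 + 0.1 * torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_function = F.mse_loss

# loss on the full toy dataset before any weight update
with torch.no_grad():
    loss_before = loss_function(model(x), y).item()

# one epoch = one pass of the inner loop over all batches
for inputs, targets in loader:
    model_outputs = model(inputs)
    loss = loss_function(model_outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# loss on the full toy dataset after one epoch
with torch.no_grad():
    loss_after = loss_function(model(x), y).item()

print(f"loss before: {loss_before:.3f}, after one epoch: {loss_after:.3f}")
```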

Then there is the "outer training loop", which repeats the inner loop for a user-specified number of epochs. Each epoch consists of two parts:

1. The training part: Here, we just run the model through the inner training loop as described and also keep track of the **average** loss per batch.
2. The validation part: We set the model to **evaluation mode** (which switches layers such as dropout or batch normalization to their inference behaviour), turn off the gradient computation and then send our validation set through the model. Again we compute the loss (also often in a batched fashion) and in the end compute the average loss per batch.

With these two parts, we can plot a **loss curve** of our training: The x axis shows the epoch and the y axis is the loss value. We can draw two curves, one for the training and one for the validation loss. This allows us to detect overfitting (train loss decreases, validation loss stalls out or increases) and also observe the convergence behaviour of our training.
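
Reading such a curve can also be done programmatically. The sketch below uses made-up loss values (they are not results of any real training) to find the epoch with the lowest validation loss and the epochs in which the training loss fell while the validation loss rose.

```python
import numpy as np

# made-up loss values for six epochs, for illustration only
train_losses = [1.00, 0.70, 0.50, 0.38, 0.30, 0.25]
val_losses = [1.05, 0.78, 0.62, 0.60, 0.66, 0.75]

# epoch with the lowest validation loss (epochs counted from 1)
best_epoch = int(np.argmin(val_losses)) + 1

# change of each loss from one epoch to the next
train_change = np.diff(train_losses)
val_change = np.diff(val_losses)

# epochs in which the training loss fell but the validation loss rose;
# entry i of the differences belongs to epoch i + 2
overfit_mask = (train_change < 0) & (val_change > 0)
overfit_epochs = [int(i) + 2 for i in np.where(overfit_mask)[0]]

print(f"best epoch: {best_epoch}")                                # 4
print(f"epochs showing signs of overfitting: {overfit_epochs}")  # [5, 6]
```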

This was a lot of theory, let's put this into practice!

In [None]:
# In order to be able to use the pytorch dataloader, we need to create pytorch
# datasets from our training, validation and test sets (and their respective
# target vectors)

from torch.utils.data import TensorDataset, DataLoader
from torch import optim
import torch.nn.functional as F

train_set = TensorDataset(torch.tensor(x_train),
                          torch.from_numpy(y_train.to_numpy()).reshape(-1, 1))
val_set = TensorDataset(torch.tensor(x_val),
                        torch.from_numpy(y_val.to_numpy()).reshape(-1, 1))

# create DataLoader objects
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

# set number of epochs you want to train.
epochs = 10

# build model using a single layer with 64 neurons
reg_model = Regressor(layers=[64], n_inputs = x_train.shape[1])

# Use the Adam optimizer to update the model parameters.
# The "lr" parameter is the learning rate and describes the "step size"
# that we take at each weight update in the direction of the negative
# gradient. It is again a hyperparameter we need to tune, but we leave it
# at the default of 1e-3 for now.
optimizer = optim.Adam(reg_model.parameters(), lr=1e-3)

# Define loss function. For a regression problem, use mean squared error loss.
loss = F.mse_loss

In [None]:
# define empty lists for storage of training and validation losses
train_losses = []
val_losses = []

start = time.time()

# outer training loop
for epoch in range(epochs):
    
    running_train_loss = 0
    
    # make sure model is in training mode
    reg_model.train()

    # training part of outer loop = inner loop
    for batch in train_loader:
        
        data, targets = batch
        output = reg_model(data)
        tmp_loss = loss(output, targets)
        optimizer.zero_grad()
        tmp_loss.backward()
        optimizer.step()
        
        running_train_loss += tmp_loss.item()
    
    print(f"Train loss after epoch {epoch+1}: {running_train_loss/len(train_loader)}")
    train_losses.append(running_train_loss/len(train_loader))
    
    ## validation part of outer loop
    
    running_val_loss = 0
    # deactivate gradient computation
    with torch.no_grad():
        
        # set model to evaluation mode
        reg_model.eval()
        
        # loop over validation DataLoader
        for batch in val_loader:
            
            data, targets = batch
            output = reg_model(data)
            tmp_loss = loss(output, targets)
            running_val_loss += tmp_loss.item()
        
        mean_val_loss = running_val_loss/len(val_loader)
        print(f"Validation loss after epoch {epoch+1}: {mean_val_loss}")
        
        # If the validation loss of the model is lower than that of all the
        # previous epochs, save the model state
        if epoch == 0:
            torch.save(reg_model.state_dict(), "./min_val_loss_reg_model.pt")
        elif (epoch > 0) and (mean_val_loss < np.min(val_losses)):
            print("Lower loss!")
            torch.save(reg_model.state_dict(), "./min_val_loss_reg_model.pt")
        
        val_losses.append(mean_val_loss)

end = time.time()
print(f"Done training {epochs} epochs!")
print(f"Training took {end-start:.2f} seconds!")
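
The training loop above saves the model's `state_dict` whenever the validation loss improves. As a minimal sketch of what saving and restoring a `state_dict` does, the snippet below performs a round trip with a tiny model. The model and the file name `checkpoint.pt` are hypothetical and only serve this illustration.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(3, 1)

# hypothetical file location inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "checkpoint.pt")

    # save only the parameters (the "state dict"), not the model object
    torch.save(model.state_dict(), checkpoint_path)

    # a freshly created model starts from different random weights ...
    restored = nn.Linear(3, 1)
    # ... until the saved parameters are loaded into it
    restored.load_state_dict(torch.load(checkpoint_path))

same_weights = torch.equal(model.weight, restored.weight)
same_bias = torch.equal(model.bias, restored.bias)
print(same_weights, same_bias)  # True True
```

Because only the parameters are stored, the model that receives them must have the same architecture as the one that was saved.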

## Check training convergence

Our first training is done! Let's check if it converged nicely: We saved all the training and validation losses in a `list` which we can now plot!

In [None]:
plt.plot(np.arange(epochs), train_losses, label="training")
plt.plot(np.arange(epochs), val_losses, label="validation")
plt.ylabel("loss")
plt.xlabel("epochs")
plt.legend(loc="upper right")
plt.show()
plt.close()

The losses decrease steadily, so the training behaves as expected. However, both training and validation losses are still going down at the end, which means the training has not fully converged yet and we could probably have trained a little longer here.

We can now do the final step of our analysis and evaluate our model!

## Evaluate the trained model

The evaluation is done using the **test set**. We send it through the model to get the predictions, and compare them to our truth labels to see how accurate it is. There are several metrics one can use to quantify the model performance. Which metrics to use also depends on the task. For a regression task, we can take a look at the mean squared error (i.e. the loss we were minimizing), the mean absolute error or the median absolute error, which is less affected by very large differences. Another score that is often used is the [R2 score](https://en.wikipedia.org/wiki/Coefficient_of_determination), which is defined as

$$R^{2}=1-\frac{\sum{(y_{i}-\hat{y}_{i})^{2}}}{\sum{(y_{i}-\bar{y})^{2}}}$$

where $y$ is the target value, $\hat{y}$ the value predicted by the model and $\bar{y}$ the mean of the target values. One can understand this variable as the proportion of the total variance which can be explained by the model.
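
To get a feeling for this score, the sketch below evaluates the formula by hand with `numpy` for three made-up sets of predictions. The function name `r2_by_hand` and all numbers are assumptions for illustration.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # numerator of the ratio
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator of the ratio
    return float(1.0 - ss_res / ss_tot)


# made-up target values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_by_hand(y_true, [1.0, 2.0, 3.0, 4.0])  # perfect prediction
r2_mean = r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
r2_bad = r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than the mean

print(r2_perfect, r2_mean, r2_bad)  # 1.0 0.0 -3.0
```

A perfect model reaches 1, a model that always predicts the mean of the targets reaches 0, and a model that does worse than that gets a negative score.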

Let's compute some performance measures for our regression model. We can use the scikit-learn package (`sklearn`) again, which already offers functions to conveniently compute them:

In [None]:
from sklearn.metrics import (mean_absolute_error,
                             mean_squared_error,
                             median_absolute_error,
                             max_error, r2_score)

# make sure the minimum validation loss model is used
reg_model.load_state_dict(torch.load("./min_val_loss_reg_model.pt"))

y_pred = reg_model.predict(torch.tensor(x_test))

print("Regression performance report")

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Median Absolute Error: {median_absolute_error(y_test, y_pred):.2f}")
print(f"Max Error: {max_error(y_test, y_pred):.2f}")
print(f"R2 score: {r2_score(y_test, y_pred):.2f}")

We see that the results aren't too great: we still have large errors. The largest is about 60, which is very significant if we take into account the average lifespan of a human being. Also, the R2 score is negative, which means our model is not a very good predictor: the difference between the predictions and the target values is so large compared to the difference between the target values and their mean that the ratio in the above formula becomes larger than one and we get a negative value overall.

Let's take a look at how well the predicted value correlates with the true value of the life expectancy!

In [None]:
plt.errorbar(y_test, y_pred, fmt='bo', label="Test samples")
plt.xlabel("True Life Expectancy")
plt.ylabel("Predicted Life Expectancy")
plt.legend(loc="upper right")
plt.show()
plt.close()

Well, that doesn't look like a good correlation, either. It will be your task to fix this later!

That's it! We have now finished a full regression task, using a full "analysis pipeline":

- loading the data
- doing some exploratory data analysis
- checking for outliers and cleaning `NaN` and `INF` values
- encoding non-numerical values
- splitting the data into training, validation and test sets
- preprocessing the data for optimised training
- running the training and checking convergence
- selecting the model with the best generalization performance
- evaluating the model


Now it's time for you to put into practice what you have learned!

## Task 1: Improve the regression model

Your first task is to play around with some of the parameters of the analysis pipeline to get better results for the regression. The number of hyperparameters that can be tuned is vast, so there is surely some room for improvement! Here are some first leads:

- In the preprocessing, we used imputation for the missing values. Maybe with a different imputation method to handle `NaN` values, we can improve the performance? One hyperparameter for the kNN imputation is the value of $k$ that is used (default is 5). Maybe this wasn't the optimal value to use here.
- We didn't tune any of the model/training hyperparameters here (number of nodes/layers, learning rate, activation functions, batch size etc.). This would probably be a good starting point to improve the baseline performance.

In [None]:
# your code here

## Task 2 (Bonus): Predict housing prices in California

The second task requires you to put everything you just learned to the test. Using the [California housing prices dataset](https://www.kaggle.com/datasets/camnugent/california-housing-prices) from kaggle (original data from the California 1990 census), your task is to do a regression to predict the median house price (column name is `median_house_value`) in the different districts of California from the different features contained in the dataset (such as the total number of rooms or bedrooms, proximity to the ocean etc.).

Please **do not** download the dataset from kaggle directly, but use the command below in a command shell of your choice to get it (we have slightly changed this dataset to make it a bit more challenging for you):

```
wget -O housing.csv https://wolke.physnet.uni-hamburg.de/index.php/s/XJ7aCZJ44NTxBQX/download
```

Don't forget to go through the whole data analysis/cleaning/preprocessing chain before starting to implement your machine learning model!

In [None]:
# your code here

# Part 2: Classification

## Finding the Higgs Boson

One of the major breakthroughs in Particle Physics in the last decade was the discovery of the Higgs Boson by the CMS and ATLAS collaborations at CERN. One major challenge is to distinguish particle collision events that actually contain the particles we are interested in (such as the Higgs boson) from "uninteresting" events that are already well-known in the Standard Model, but might still look similar to interesting events. The interesting events are also referred to as "signal" events, while the uninteresting ones are referred to as "background" events.

For the classification task, we will use the [Higgs dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS#), where we will try to train a classifier using `pytorch` to distinguish signal and background events from various input features.

## Downloading the Higgs dataset

Since the original files of the Higgs dataset are huge, we use a random subset of 25% of the original dataset for this tutorial. For convenience, we prepared two H5 files that contain the low-level kinematic inputs of the particles and higher level features that are computed from the lower level ones, respectively. These files can be downloaded using the following commands from your favourite command shell:

```
wget -O higgs_highlevels.h5 https://wolke.physnet.uni-hamburg.de/index.php/s/CXBsb63njFZe5J9/download

wget -O higgs_lowlevels.h5 https://wolke.physnet.uni-hamburg.de/index.php/s/nr9SREtXkBmzys9/download
```


## The Higgs dataset: Exploration

Similar to our regression task, we first do some exploratory data analysis before we start with the actual training. For now, we will focus on the high level variables.

In [None]:
# load h5 file into pandas dataframe, we will look at the high level variables
# at first

df_highlvl = pd.read_hdf("higgs_highlevels.h5")

Let's take a look at the features contained in the dataset:

In [None]:
df_highlvl.columns

We see there are several mass and energy variables, as well as kinematic variables of leptons. There is also a `class_label` column, which tells us if an event comes from a Higgs decay ("signal", label=1) or from a known Standard Model particle decay ("background", label=0). In Machine Learning, these truth-level variables are often referred to as *label* or *target* values.

Let us take a look at the data types in our data frame:

In [None]:
df_highlvl.dtypes

We see that all of our variables are already in float format. This is great, because it means that we don't have to deal with encoding non-numerical features here!

Next, let's plot the distributions of these variables for the signal and background events separately:

In [None]:
# create mask for signal/background distribution comparison
sig_mask = (df_highlvl['class_label'] == 1)

# drop first column (class_label) and rearrange other columns 
# to a 6x2 matrix (as numpy array) , which will come in handy
# later, when we access the rows and columns of our plot
colname_matrix = df_highlvl.columns.to_numpy()[1:].reshape(6,2)
fig, axs = plt.subplots(6,2, figsize=(14, 21))

for row in range(6):
    for col in range(2):
        colname_tmp = colname_matrix[row, col]

        # set current axis to subplot index
        plt.sca(axs[row, col])
        
        # define bin edges and use histogram plotting function
        # of pandas dataframe
        bin_edges = np.linspace(df_highlvl[colname_tmp].min(),
                                df_highlvl[colname_tmp].max(), 100)

        df_highlvl[~sig_mask][colname_tmp].plot.hist(bins=bin_edges,
                                                     label="background")

        df_highlvl[sig_mask][colname_tmp].plot.hist(histtype="step", bins=bin_edges,
                                            label="signal")

        plt.xlabel(colname_tmp)

        # the phi angular coordinate is typically flat, which is
        # why it's the only feature we would like to see not in log scale
        if "phi" not in colname_tmp:
            plt.yscale("log")

        plt.legend(loc="upper right")

plt.show()
plt.close()

We can see what is often the case in particle physics: Signal and background distributions are - at least by eye - very similar. Neural networks try to find patterns in the data and utilize the subtle differences between signal and background to learn a function that maps from the features we see here to a single variable that has a much higher discriminative power.

The strength of Deep Neural Networks (DNNs) is that they can learn nonlinear functions over the full input space and thus can also pick up on correlations in higher dimensions, which we cannot easily see by just looking at the one dimensional marginal distributions of the features.

## Preprocessing the data

We have now looked at the distributions and while we could not see any "strange" patterns or outliers in our data, we still need to make sure that there are not too many infinite or NaN values.

In [None]:
# Count NaN or INF values per column
(~np.isfinite(df_highlvl)).sum()

As we can see, we got lucky this time: There are no INF or NaN values in this dataset. This is often the case in particle physics data analysis, since usually the input data comes in through a sophisticated data preprocessing pipeline that is curated by your research collaboration, but it might be different in other fields.

Now we again split our dataset into training, validation and test set.

In [None]:
# At first, we will move the target column to a separate variable and then drop
# this column from the original dataframe

y_full = df_highlvl['class_label']
x_full = df_highlvl.drop(columns=['class_label'])

In [None]:
# import train_test_split function from scikit-learn, then first split off
# training set, and then split remainder into validation and test sets

from sklearn.model_selection import train_test_split

# split off training set (70% of samples)
x_train, x_remain, y_train, y_remain = train_test_split(x_full, y_full,
                                                        train_size=0.7,
                                                        random_state=42)


# split remainder into test and validation sets
# (15% each, corresponding to half of the remaining non-training samples)
x_val, x_test, y_val, y_test = train_test_split(x_remain, y_remain,
                                                train_size=0.5,
                                                random_state=42)


In [None]:
# let's see if the numbers check out
n_full = x_full.shape[0]
train_percent = (x_train.shape[0]/n_full)*100
val_percent = (x_val.shape[0]/n_full)*100
test_percent = (x_test.shape[0]/n_full)*100

print(f"Train set corresponds to {train_percent:.2f}% of the full data.")
print(f"Validation set corresponds to {val_percent:.2f}% of the full data.")
print(f"Test set corresponds to {test_percent:.2f}% of the full data.")

It seems everything checks out! Now we again apply some scaling for our input features to have a mean of zero and unit standard deviation.

In [None]:
# import standard scaler from scikit-learn
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit (i.e. do the mean/std computation based on the training set features)
# and transform training data

x_train = scaler.fit_transform(x_train)

# important: the standard scaling is always with respect to the training set
# values of mean and std! So DO NOT call fit_transform again on the other sets
# but just transform using the already fitted scaler.

x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)

In [None]:
# check if it worked
print("Means of training set:\n", x_train.mean(axis=0), "\n")
print("Standard deviations of training set:\n", x_train.std(axis=0), "\n\n")

print("Means of validation set:\n", x_val.mean(axis=0), "\n")
print("Standard deviations of validation set:\n", x_val.std(axis=0), "\n\n")

print("Means of test set:\n", x_test.mean(axis=0), "\n")
print("Standard deviations of test set:\n", x_test.std(axis=0), "\n\n")

It seems it worked! 
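
To make explicit what the scaler does, here is a minimal `numpy` sketch of the same "fit on the training set, transform everything" logic. The random feature matrices are synthetic and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic feature matrices (for illustration only)
train = rng.normal(loc=10.0, scale=3.0, size=(1000, 2))
val = rng.normal(loc=10.0, scale=3.0, size=(200, 2))

# "fit": mean and standard deviation are computed on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": the same shift and scale is applied to every set
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled.mean(axis=0), val_scaled.std(axis=0))
```

The training set ends up with mean zero and unit standard deviation up to floating point precision. The validation set is only approximately standardised, because it is scaled with the training set statistics. This is expected and not a sign of a mistake.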

Now we are good to go and can build the model.

## Building a classifier model

The good news is: we can keep the exact same architecture as for the regression model and only make some minor changes. The first change is that now we need an activation after the output linear layer. For binary classification, we need the sigmoid activation, which is implemented in [`torch.nn.Sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html#torch.nn.Sigmoid). This activation function makes sure that the output value is a continuous number between zero and one and can thus be interpreted as a probability how "signal-like" the model thinks an event is.
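
As a quick sketch of what the sigmoid does, the snippet below applies it to three arbitrary raw output values (chosen only for illustration):

```python
import torch
from torch import nn

sigmoid = nn.Sigmoid()

# arbitrary raw outputs of a final linear layer, for illustration only
raw_outputs = torch.tensor([-5.0, 0.0, 5.0])
probabilities = sigmoid(raw_outputs)

print(probabilities)  # approximately 0.0067, 0.5000, 0.9933
```

Large negative values are mapped close to zero ("background-like"), large positive values close to one ("signal-like"), and a raw output of zero corresponds to exactly 0.5.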

Secondly, we need to change the loss function. This time, it is not the mean squared error, but the [binary crossentropy](https://en.wikipedia.org/wiki/Cross_entropy) that is minimized. How can we interpret this loss?

A task that occurs very often in science and other fields is [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), where we estimate the parameters of a probability distribution from data by maximizing the likelihood function. For various reasons, one ends up minimizing the negative logarithm of that function ("negative log-likelihood minimization"), which is an equivalent problem.

A binary classification problem only has two outcomes, in our case "signal" and "background". This is similar to a coin throw. As we know, this kind of random variable is described by a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) and the binary crossentropy is simply the negative logarithmic likelihood of that:

$$\mathrm{BCE\_loss}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\cdot\log{(\hat{y}_{i})}+(1-y_{i})\cdot\log{(1-\hat{y}_{i})}\right]$$

where again $\hat{y}$ is the model prediction, $y$ is the target value, $i$ is the index of a particular sample/event and $N$ is the number of samples.

So what we are essentially doing is maximizing the likelihood of our data under a Bernoulli distribution. 
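
As a small check of the formula above, the sketch below evaluates it term by term and compares the result with the `pytorch` implementation. The predictions and labels are made up for illustration.

```python
import torch
import torch.nn.functional as F

# made-up model outputs (after the sigmoid) and truth labels
y_hat = torch.tensor([0.9, 0.2, 0.6, 0.1])
y = torch.tensor([1.0, 0.0, 1.0, 0.0])

# the formula above, written out term by term and averaged over the samples
bce_by_hand = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# pytorch's implementation (the default reduction is the mean over samples)
bce_torch = F.binary_cross_entropy(y_hat, y)

print(bce_by_hand.item(), bce_torch.item())
```

Both numbers should agree up to floating point precision. Confident and correct predictions (such as 0.9 for a signal event) contribute little to the loss, while uncertain ones (such as 0.6) contribute more.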

In the code below, we again coded up a class for the `Classifier` model. It is the same as the previous `Regressor` class, except that it uses a sigmoid activation in the output layer.

In [None]:
class Classifier(nn.Module):
    def __init__(self, layers, n_inputs=5):
        super().__init__()

        self.layers = []
        for nodes in layers:
            self.layers.append(nn.Linear(n_inputs, nodes))
            self.layers.append(nn.ReLU())
            n_inputs = nodes

        # the final linear layer with a Sigmoid activation for binary
        # classification
        self.layers.append(nn.Linear(n_inputs, 1))
        self.layers.append(nn.Sigmoid())

        self.model_stack = nn.Sequential(*self.layers)

    def forward(self, x):
        return self.model_stack(x)

    def predict(self, x):
        with torch.no_grad():
            self.eval()
            x = torch.as_tensor(x)
            prediction = self.forward(x).detach().cpu().numpy()
        return prediction

## Running the training

Here we can use the exact same training loop as in our regression problem. The only thing we need to change beforehand is the loss function: from mean squared error to binary crossentropy.

In [None]:
from torch.utils.data import TensorDataset, DataLoader
from torch import optim
import torch.nn.functional as F

train_set = TensorDataset(torch.tensor(x_train),
                          torch.from_numpy(y_train.to_numpy()).reshape(-1, 1))
val_set = TensorDataset(torch.tensor(x_val),
                        torch.from_numpy(y_val.to_numpy()).reshape(-1, 1))

# create DataLoader objects
train_loader = DataLoader(train_set, batch_size=1024, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1024)

# set number of epochs you want to train.
epochs = 5

# build model using a single layer with 64 neurons
clsf_model = Classifier(layers=[64], n_inputs = x_train.shape[1])

optimizer = optim.Adam(clsf_model.parameters(), lr=1e-3)

# Define loss function. For a binary classification problem,
# use binary crossentropy loss.
loss = F.binary_cross_entropy


In [None]:
# outer training loop

# define empty lists for storage of training and validation losses
train_losses = []
val_losses = []

start = time.time()

for epoch in range(epochs):
    
    running_train_loss = 0
    
    # make sure model is in training mode
    clsf_model.train()

    # training part of outer loop = inner loop
    for batch in train_loader:
        
        data, targets = batch
        output = clsf_model(data)
        tmp_loss = loss(output, targets)
        optimizer.zero_grad()
        tmp_loss.backward()
        optimizer.step()
        
        running_train_loss += tmp_loss.item()
    
    print(f"Train loss after epoch {epoch+1}: {running_train_loss/len(train_loader)}")
    train_losses.append(running_train_loss/len(train_loader))
    
    ## validation part of outer loop
    
    running_val_loss = 0
    # deactivate gradient computation
    with torch.no_grad():
        
        # set model to evaluation mode
        clsf_model.eval()
        
        # loop over validation DataLoader
        for batch in val_loader:
            
            data, targets = batch
            output = clsf_model(data)
            tmp_loss = loss(output, targets)
            running_val_loss += tmp_loss.item()
        
        mean_val_loss = running_val_loss/len(val_loader)
        print(f"Validation loss after epoch {epoch+1}: {mean_val_loss}")
        
        # If the validation loss of the model is lower than that of all the
        # previous epochs, save the model state
        if epoch == 0:
            torch.save(clsf_model.state_dict(), "./min_val_loss_clsf_model.pt")
        elif (epoch > 0) and (mean_val_loss < np.min(val_losses)):
            print("Lower loss!")
            torch.save(clsf_model.state_dict(), "./min_val_loss_clsf_model.pt")
        
        val_losses.append(mean_val_loss)

end = time.time()
print(f"Done training {epochs} epochs!")
print(f"Training took {end-start:.2f} seconds!")

## Check training convergence

Our first classifier training is done! Let's check if it converged nicely:

In [None]:
plt.plot(np.arange(epochs), train_losses, label="training")
plt.plot(np.arange(epochs), val_losses, label="validation")
plt.ylabel("loss")
plt.xlabel("epochs")
plt.legend(loc="upper right")
plt.show()
plt.close()

The losses decrease steadily, so the training behaves as expected. However, both training and validation losses are still going down at the end, which means the training has not fully converged yet and we could probably have trained a little longer here.

We can now do the final step of our analysis and evaluate our model!

## Evaluate the trained model

The evaluation is done again using the **test set**. We send it through the model to get the predictions, and compare them to our truth labels to see how accurate it is. 

The metrics that we use to quantify the model performance are different from the ones we used for the regression problem. The simplest is the accuracy: the number of correctly predicted samples divided by the number of all samples in the test set.

Another more sophisticated but very common method is to plot a [`Receiver Operating Characteristic (ROC) curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). The advantage of this approach is that it takes into account both the correctly classified positive class samples and the wrongly classified negative class samples, and it does so for different thresholds on the classifier output.

A typical approach is to plot this curve and quote its integral, the "Area Under Curve" (AUC) value as a single performance metric. The AUC is also often quoted in machine learning research papers to compare the performance of different state-of-the-art model architectures with each other.
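
To illustrate how the points of a ROC curve and the AUC come about, here is a small `numpy` sketch on four made-up samples. The helper `rates` and all numbers are assumptions for illustration, and the pairwise AUC computation assumes there are no tied scores.

```python
import numpy as np

# made-up truth labels and classifier outputs, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def rates(threshold):
    # samples with a score above the threshold are predicted as signal
    predicted_positive = y_score >= threshold
    tpr = np.sum(predicted_positive & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(predicted_positive & (y_true == 0)) / np.sum(y_true == 0)
    return float(tpr), float(fpr)


# each threshold gives one point of the ROC curve
for threshold in [0.2, 0.5, 0.9]:
    tpr, fpr = rates(threshold)
    print(f"threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")

# the AUC equals the fraction of (signal, background) pairs in which the
# signal sample gets the higher score
sig = y_score[y_true == 1]
bkg = y_score[y_true == 0]
auc = float(np.mean(sig[:, None] > bkg[None, :]))
print(f"AUC = {auc:.2f}")  # 0.75
```

Lowering the threshold increases the true positive rate, but also lets more background events through. The ROC curve summarises this trade-off over all thresholds.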

In [None]:
# import ROC/AUC computation from scikit-learn package
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score

# make sure we take the model with the lowest validation loss
clsf_model.load_state_dict(torch.load("./min_val_loss_clsf_model.pt"))

y_score = clsf_model.predict(x_test)

test_acc = accuracy_score(y_test, np.around(y_score))
print(f"Model accuracy: {test_acc:.2f}")

fpr, tpr, thresholds = roc_curve(y_test, y_score)

test_auc = roc_auc_score(y_test, y_score)
print(f"Model AUC: {test_auc:.2f}")



In [None]:
# plot ROC curve
plt.plot(fpr, tpr)
plt.plot(np.linspace(0,1,100), np.linspace(0,1,100), color="grey",
         linestyle="dashed")

plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.show()
plt.close()

Great! We did a full analysis run, starting with the raw data, using some preprocessing, then training a binary classifier using `pytorch` and evaluating it.

The accuracy is about 70% and the AUC is about 0.8. The dashed line in the ROC curve is simply the diagonal and stands for the performance of random guessing, which corresponds to an AUC of 0.5. Our result is considerably higher than that, but we might still do better. Now it's your turn!

## Task 1: Improve the result!

Try to improve the AUC and also try to understand how varying different parameters of the training affects the result.

Some starting points might be:

- So far, we only considered a single hidden layer with 64 nodes. Maybe you can improve the model architecture a bit?
- We didn't play with other hyperparameters, like:
    - the learning rate
    - batch size
    - the optimizer
    - activations for the hidden layers
- We also didn't use the low-level features that are in the `higgs_lowlevels.h5` file. Using the same preprocessing procedure as before, maybe you can integrate (some of) those and see if they actually improve our performance.

Play around a bit and get to know the `pytorch` library better! Feel free to copy code from anything we used so far!

In [None]:
# your code here

## Task 2: Breast cancer detection

Breast cancer is a serious health issue, since it is the most common form of cancer amongst women in the world and accounts for about 25% of all cancer cases in women. A key challenge is to classify tumors into malignant (i.e. cancerous) tumors and benign (i.e. non-cancerous) tumors.

Your task is to use the [breast cancer dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset) and train a classifier to distinguish between these two kinds of tumors given various tumor properties. You can either download the data from [kaggle](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset) or use the command below in a command shell.

```
wget -O breast_cancer.csv https://wolke.physnet.uni-hamburg.de/index.php/s/2W74iN84D9kCkTc/download
```

Run the whole analysis pipeline (exploratory data analysis, data cleaning and preprocessing, training and evaluation) for this dataset and get a first baseline AUC for your model, then subsequently try to improve it.

In [None]:
# your code here

# Bonus: Computer vision task

We started our journey with two simple problems: binary classification and regression. However, we can use the power of deep learning for far more advanced tasks. In particular, we can use it to detect patterns on image-based data!

We start with a rather simple dataset: the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). This is a dataset of images of hand-written digits, and the task is to assign each image to the digit it shows. This task differs from the previous ones in two ways: firstly, we will use a different network architecture, a `convolutional neural network`, and secondly, we now have a **multi-class** classification task, since we need to classify into 10 classes, i.e. we need to say which digit from 0 to 9 an image shows.

This task is also a good example where we can start using the [`torchvision.datasets`](https://pytorch.org/vision/stable/datasets.html) module to load our data. This is very useful, since we do not need to care too much about data cleaning and preprocessing ourselves. In fact, we even get data that is already split into training and validation sets, though not into a separate test set, unfortunately. For this tutorial, we will make our lives a bit easier and just use training and validation set, "cheat" a bit and do the final evaluation on the validation set too.

In [None]:
# import the mnist dataset and a transformation library
from torchvision.datasets import MNIST
from torchvision import transforms

In [None]:
# create a new MNIST object. This is essentially a pytorch dataset object
# containing the images of the hand written digits.
# The first argument of this object is the root directory where the data is 
# stored. If you did not download it already, you can simply tell torchvision
# to download it for you, by setting the respective parameter to True.
# We also set train to True, which means in this case we get the 
# dataset for the training set.
mnist_dataset_train = MNIST("./mnist", train=True, download=True, transform=transforms.ToTensor())

mnist_dataset_train

As we can see, the dataset object prints out some specific information: we have 60,000 datapoints, the root location is `"./mnist"` and we use the training split of the full dataset. To access the actual data inside the dataset, we need to use the `data` variable. Let's take a look at its shape:

In [None]:
mnist_dataset_train.data.shape

As we can see, we have 60,000 images and each image has 28 x 28 pixels. Let's pick out a single image and use `plt.imshow` to plot the respective digit in greyscale:

In [None]:
plt.imshow(mnist_dataset_train.data[1], cmap="gray_r")
plt.show()
plt.close()

I guess we can safely assume we found a zero here! This is very good. Now let's also create an `MNIST` object for the validation set:

In [None]:
mnist_dataset_val = MNIST("./mnist", train=False, transform=transforms.ToTensor())

Now, we can start to build our first convolutional neural network model. The topic of machine learning for computer vision tasks is so large that you can probably have semester-filling lectures about it. So for this quick tutorial we will try to only cover the very foundations of convolutional neural networks.

## Convolutions: Introduction

Convolutional neural networks are specifically tailored to perform well on a specific type of data, which is **image data**. There are two key insights that motivated the introduction of convolutions for image data: firstly, neighboring pixels in an image are usually highly correlated. Let's say you have an image of a cat lying on a grass hill with some blue sky in the background. If you took any of the blue pixels from the sky at random, the probability that another blue pixel from the sky is next to it is extremely high, while it is very unlikely that any of the neighboring pixels contain information about either the grass or the cat. From this fact, it is only logical that methods operating on image data do not need to capture information from the *whole* image, but rather from coherent *patches* inside the image. In our example image, there are three patches: some area will have a lot of grass, some other area will have a lot of blue sky and the third one contains the cat.

Secondly: Image data is *translation invariant*. Let's go again with the previous example image. Say we would like to predict from some image dataset if an image contains a cat or not. The information *where* the cat is in this image is irrelevant for this task. It can be in the upper left corner of the image or in the center or in any other position. If we shifted all image pixels up or down or left or right, we would end up with an image where the cat is now at a different position, but a machine learning algorithm should still be able to classify the image as containing a cat. This feature of image data suggests that a machine learning algorithm would benefit from a method that is scanned over the image, so that it can recognize patterns at any position in the image.

To summarize:
- images are typically made up of highly correlated, local patches of pixels -> motivates method that looks at smaller region of image rather than trying to capture the information of the image as a whole
- images are translation invariant, the location of an object inside an image should not matter -> motivates method that is scanned over image to recognize patterns everywhere

## Convolutions in practice

Convolutions combine both points: they use small so-called "kernels" that are systematically scanned over the image, and each "pixel value" of the kernel is a parameter of the model that is adjusted during the training. Some of you might have heard of convolutions in mathematics, i.e. the convolution of two functions ( $f*g$ ). The operation used in image processing is a discrete relative of it (strictly speaking, most deep learning libraries compute a cross-correlation, i.e. the kernel is not flipped), but you do not need the mathematical background to follow along. Here is a quick example of how a convolutional layer in a machine learning model operates:

Consider the 3x3 kernel being convolved with the 5x5 pixel image on the right.

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_5aa3677a95d34d07fab6e31f62cb5bfe.png" width=60%>
</p>

The convolution operation is done like this: we overlay the kernel with the image, starting in the top left corner. We multiply each element in the kernel with the corresponding element in the image, sum everything up and set the result as the value of a new output image pixel, again starting from the top left. See the illustration below:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_a2909f900730f9d2312d0573b0454a90.png" width=80%>
</p>

We then repeat this process for each possible overlaid position of the kernel inside the image, going forward in a left-to-right, top-to-bottom manner. Find below the illustration of the second step:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_92689504280872629ae0450e09358996.png" width=80%>
</p>

Finally, after we iterated all possible positions, the output will look like this:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_20cd4749951bdb8153974a160a5f26ca.png" width=30%>
</p>

Convolutions are useful because they are able to detect specific structures in our image: The kernel we just used, being one in the central column but zero everywhere else, produces a high output value if three pixels that are in the same column all have high values (high values in an image meaning a high intensity). This means we can use this kernel as a "feature detector" which detects straight vertical lines. This might be helpful in the MNIST case for detecting hand-written ones.
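
The sliding-window procedure described above can be sketched in a few lines of `numpy`. This is only a minimal illustration and not how `pytorch` implements `nn.Conv2d` internally; the 5x5 image and the helper name `convolve2d_valid` are made up for this example, while the kernel is the vertical-line detector from the text.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply the overlaid patch element-wise with the kernel
            # and sum everything up to get one output pixel
            patch = image[i:i + k_h, j:j + k_w]
            output[i, j] = np.sum(patch * kernel)
    return output

# hypothetical 5x5 image with a bright vertical line in the middle column
image = np.zeros((5, 5))
image[:, 2] = 1.0

# kernel that is one in the central column and zero everywhere else
kernel = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The output is 3x3 because a 3x3 kernel fits into a 5x5 image at 5 - 3 + 1 = 3 positions per direction. Only the middle output column is non-zero, since that is where the central column of the kernel sits on top of the line.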

This is a very simple example, but once we start building a deep convolutional network with many convolutional layers and many kernels per layer, we can train our kernels to detect very specific features, such as a nose or an eye in a face recognition task. So to summarize: kernels can be trained to detect specific features using convolutions, and the output of each convolution tells us if such a feature is present in the convolved image, by showing a high value in any of the output pixels.
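
As a small sanity check of the translation argument, the sketch below shifts a made-up vertical line by one pixel and applies the same kernel to both images. It assumes `numpy` 1.20 or newer for `sliding_window_view`; the image and the helper name `feature_map` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def feature_map(image, kernel):
    # all kernel-sized patches of the image, shape (out_h, out_w, k_h, k_w)
    patches = sliding_window_view(image, kernel.shape)
    # element-wise product with the kernel, summed per patch
    return (patches * kernel).sum(axis=(-2, -1))

kernel = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=float)

# hypothetical 6x6 image with a vertical line in column 2
image = np.zeros((6, 6))
image[:, 2] = 1.0
# the same image with the line moved one pixel to the right
shifted = np.roll(image, shift=1, axis=1)

out_original = feature_map(image, kernel)
out_shifted = feature_map(shifted, kernel)

# column index of the strongest response in each output row
print(out_original.argmax(axis=1))
print(out_shifted.argmax(axis=1))
# strength of the strongest response
print(out_original.max(), out_shifted.max())
```

The response moves along with the line, but its strength stays the same, so one and the same kernel detects the pattern at either position.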

## (Max) Pooling layers

Another common method used in convolutional neural networks is pooling. Pooling layers are often applied after a convolution operation and simply aggregate information in an image. Let's say you tried to classify landscape images and many images contain large patches of blue sky. This means there is a lot of redundant information in the image: of course, you are interested if there is blue sky in the image, but you don't need a huge amount of pixels conveying this information.

This is where pooling comes in. In pooling, you also scan a kernel over the image and then map all the values that are overlaid with the kernel to a single, aggregated value. For example, if you used max pooling layers, you would just take the maximum value in that area. Unlike in convolutions, in pooling you often scan through the image in a non-overlapping way, so you set the "stride" to the respective kernel size such that the kernels are placed exactly next to each other. Here is an image that shows how max pooling works in practice:

<p align="center">
<img src="https://s3.desy.de/hackmd/uploads/upload_575d3be1d9f86cd766423ea1fd0a79e7.png" width=60%>
</p>
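
The same idea can be written down in `numpy`. This is a minimal sketch of non-overlapping max pooling with a made-up 4x4 image; the helper name `max_pool2d` is hypothetical, and incomplete windows at the border are simply dropped here.

```python
import numpy as np

def max_pool2d(image, size=2):
    """Non-overlapping max pooling (stride equal to the kernel size)."""
    h, w = image.shape
    # drop rows/columns that do not fill a complete window
    h_trim, w_trim = h - h % size, w - w % size
    # split the image into blocks of shape (size, size)
    blocks = image[:h_trim, :w_trim].reshape(h_trim // size, size,
                                             w_trim // size, size)
    # take the maximum inside each block
    return blocks.max(axis=(1, 3))

# made-up 4x4 image
image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 6],
                  [2, 2, 7, 8]])

pooled = max_pool2d(image)
print(pooled)
```

Each 2x2 block of the input is reduced to its largest value, so the 4x4 image shrinks to 2x2 while the strongest responses are kept.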


Now that we have learned the foundations of convolutions, let's put the theory into practice and build our first convolutional neural network:

In [None]:
class Convoluter(nn.Module):
    def __init__(self, fcn_layers, conv_layers, kernel_size,
                 input_dim=28, in_channels=1):
        super().__init__()

        # similar to the classifier and regressor models,
        # each number in layers will set the parameters
        # of one model layer.
        self.layers = []
        
        for kernels in conv_layers:
            # structure of Conv2d:
            # first argument: number of input channels
            # (corresponds to number of kernels from 
            # previous layer)
            # second argument: number of kernels in current layer
            # third argument: kernel size (tuple or single integer for
            # quadratic kernels)
            self.layers.append(nn.Conv2d(in_channels, kernels, kernel_size))
            self.layers.append(nn.ReLU())
            
            # add max pooling
            self.layers.append(nn.MaxPool2d(2,2))
            in_channels = kernels
            
            # keep track of image size such that we can easily compute the
            # number of nodes of the first fully connected layer after the
            # convolutions later
            # (no padding: the convolution shrinks the image by
            # kernel_size - 1, the 2x2 pooling halves it, rounding down)
            h = (input_dim - kernel_size + 1) // 2
            input_dim = h
        
        # flatten convoluted outputs and continue with fully connected network
        self.layers.append(nn.Flatten())
        n_inputs = h*h*in_channels
        for nodes in fcn_layers:
            self.layers.append(nn.Linear(n_inputs, nodes))
            self.layers.append(nn.ReLU())
            n_inputs = nodes

        # multi-class classification into 10 digits needs
        # 10 output nodes
        self.layers.append(nn.Linear(n_inputs, 10))
        
        # Note: we do not add a softmax layer here. The cross entropy loss
        # from torch.nn.functional applies a (log-)softmax internally and
        # expects the raw outputs ("logits") of the last linear layer.
        # The softmax is applied in the predict method instead.

        # build pytorch model as sequence of our layers
        self.model_stack = nn.Sequential(*self.layers)

    def forward(self, x):
        # the forward call just takes data (x) and sends it through the model
        # to produce an output (in our case 10 raw scores, one per digit)
        return self.model_stack(x)

    def predict(self, x):
        # the predict method sets the model to evaluation mode and only then
        # computes the model prediction of given data (x). The softmax turns
        # the raw scores into numbers between 0 and 1 that sum to 1, such that
        # we can interpret them as class probabilities. For convenience, we
        # already make sure that the output prediction is a numpy array, so
        # we can use it more easily in the final evaluation of the model.
        self.eval()
        with torch.no_grad():
            x = torch.as_tensor(x)
            prediction = torch.softmax(self.forward(x), dim=1).cpu().numpy()
        return prediction
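
To see where the number of inputs of the first fully connected layer comes from, the size bookkeeping can be reproduced on its own. This is a minimal sketch under the same assumptions as the model above (no padding, stride 1 for the convolution, non-overlapping 2x2 pooling); the helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(input_dim, kernel_size, n_conv_layers, pool_size=2):
    """Side length of a square image after repeated conv + max-pool blocks."""
    size = input_dim
    for _ in range(n_conv_layers):
        size = size - kernel_size + 1   # convolution without padding
        size = size // pool_size        # pooling rounds down
    return size

# one convolutional layer with 3 kernels of size 3x3 on a 28x28 image
side = conv_output_size(input_dim=28, kernel_size=3, n_conv_layers=1)
n_kernels = 3
print(side, side * side * n_kernels)
```

For a 28x28 MNIST image, a 3x3 kernel gives 26x26 pixels after the convolution and 13x13 after the pooling, so with 3 kernels the flattened vector has 13 * 13 * 3 = 507 entries.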

## Prepare model and optimizer for training

The training is done as usual in `pytorch`: we need to create a model and an optimizer, and then we run a training loop. For a multi-class classification task such as MNIST, we use the (categorical) cross entropy loss, which is also defined in `torch.nn.functional`.
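
To get a feeling for what this loss measures, here is a small `numpy` sketch with made-up scores for two images and three classes. It is meant to mirror what `F.cross_entropy` computes with its default settings when given raw scores and integer class labels (a softmax followed by the mean negative log-probability of the true class); the helper names are hypothetical.

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise maximum for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(scores, targets):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(scores)
    true_class_probs = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(true_class_probs))

# made-up raw scores for 2 images and 3 classes
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.2, 3.0]])
targets = np.array([0, 2])        # true classes
wrong_targets = np.array([1, 0])  # deliberately wrong classes

# each row of probabilities sums to 1
print(softmax(scores).sum(axis=1))
# the loss is small if the true class gets the highest score ...
print(cross_entropy(scores, targets))
# ... and larger if it does not
print(cross_entropy(scores, wrong_targets))
```

Since the softmax is already part of this loss, the function expects the raw outputs of the last linear layer rather than probabilities.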

In [None]:
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

# Define dataloaders for MNIST datasets
train_loader = DataLoader(mnist_dataset_train, batch_size=16)
val_loader = DataLoader(mnist_dataset_val, batch_size=16)

epochs = 5

# build model using one convolutional layer with 3 kernels of size 3x3
# and one fully connected layer with 16 nodes
conv_model = Convoluter(fcn_layers=[16], conv_layers=[3], kernel_size=3)

optimizer = optim.Adam(conv_model.parameters(), lr=1e-3)

# Define loss function. For a multi-class classification problem,
# use the (categorical) cross entropy loss.
loss = F.cross_entropy

## Run training loop

In [None]:
# outer training loop

# define empty lists for storage of training and validation losses
train_losses = []
val_losses = []

start = time.time()

for epoch in range(epochs):
    
    running_train_loss = 0
    
    # make sure model is in training mode
    conv_model.train()

    # training part of outer loop = inner loop
    for batch in train_loader:
        
        data, targets = batch
        output = conv_model(data)
        tmp_loss = loss(output, targets)
        optimizer.zero_grad()
        tmp_loss.backward()
        optimizer.step()
        
        running_train_loss += tmp_loss.item()
    
    print(f"Train loss after epoch {epoch+1}: {running_train_loss/len(train_loader)}")
    train_losses.append(running_train_loss/len(train_loader))
    
    ## validation part of outer loop
    
    running_val_loss = 0
    # deactivate gradient computation
    with torch.no_grad():
        
        # set model to evaluation mode
        conv_model.eval()
        
        # loop over validation DataLoader
        for batch in val_loader:
            
            data, targets = batch
            output = conv_model(data)
            tmp_loss = loss(output, targets)
            running_val_loss += tmp_loss.item()
        
        mean_val_loss = running_val_loss/len(val_loader)
        print(f"Validation loss after epoch {epoch+1}: {mean_val_loss}")
        
        # If the validation loss of the model is lower than that of all the
        # previous epochs, save the model state
        if epoch == 0:
            torch.save(conv_model.state_dict(), "./min_val_loss_conv_model.pt")
        elif (epoch > 0) and (mean_val_loss < np.min(val_losses)):
            print("Lower loss!")
            torch.save(conv_model.state_dict(), "./min_val_loss_conv_model.pt")
        
        val_losses.append(mean_val_loss)

end = time.time()
print(f"Done training {epochs} epochs!")
print(f"Training took {end-start:.2f} seconds!")

We did it! We trained our first convolutional neural network!

## Evaluate the training

Let's start by looking at the losses:

In [None]:
plt.plot(np.arange(epochs), train_losses, label="training")
plt.plot(np.arange(epochs), val_losses, label="validation")
plt.ylabel("loss")
plt.xlabel("epochs")
plt.legend(loc="upper right")
plt.show()
plt.close()

This looks good, but we could probably have trained a little longer here.

Now about the performance: For multi-class classification, the performance evaluation isn't actually that easy. The simplest thing we can do is estimate the accuracy, i.e. the fraction of all images that are classified correctly.

What one can also plot is a confusion matrix. The rows of this matrix contain the *actual* truth-labelled classes and the columns the classes predicted by the model. In an ideal classifier, we would get a perfect diagonal: all images that were actually a hand-written 0 are assigned to 0, all images that were a 1 are assigned to 1 and so on. However, the classifier will never be perfect, so we will always have some non-zero values in the off-diagonal elements of that matrix. Let's take a look!

In [None]:
# make sure we take the model with the lowest validation loss
conv_model.load_state_dict(torch.load("./min_val_loss_conv_model.pt"))

# use the full MNIST validation set at once for evaluation
# we need to add a single dimension of 1 as the "channels", cast it to
# float and scale the pixel values from [0, 255] to [0, 1] for the network
# model to be able to digest this (in our actual training, this is taken
# care of by the ToTensor transformation already)
full_mnist_val = mnist_dataset_val.data.reshape((-1, 1, 28, 28)).type(torch.float32) / 255

preds = conv_model.predict(full_mnist_val)

predicted_class = np.argmax(preds, axis=1)
true_class = mnist_dataset_val.targets.numpy()

accuracy = np.count_nonzero(predicted_class == true_class)/true_class.shape[0]

print(f"The achieved accuracy is: {accuracy:.2f}")

Very good! In our first test, we already achieved around 90% accuracy! Let's take a look at the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

# get confusion matrix
conf_matrix = confusion_matrix(true_class, predicted_class)

sns.heatmap(conf_matrix, cmap="Blues", annot=True, fmt='g')
plt.show()
plt.close()


This looks good! We see that the values are highest on the diagonal of the matrix. The rows correspond to the true class and the columns to the predicted class. We also see some interesting off-diagonal elements.
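
To make the row and column convention concrete, here is a small sketch that builds a confusion matrix by hand with `numpy` and reads off the accuracy and the per-class hit rate. The labels are made up for a toy 3-class problem and the variable names are hypothetical.

```python
import numpy as np

# made-up labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])

n_classes = 3
toy_conf_matrix = np.zeros((n_classes, n_classes), dtype=int)
# rows: true class, columns: predicted class
np.add.at(toy_conf_matrix, (y_true, y_pred), 1)

# accuracy: sum of the diagonal divided by the number of all samples
toy_accuracy = np.trace(toy_conf_matrix) / toy_conf_matrix.sum()
# fraction of each true class that was predicted correctly
hit_rate_per_class = np.diag(toy_conf_matrix) / toy_conf_matrix.sum(axis=1)

print(toy_conf_matrix)
print(toy_accuracy)
print(hit_rate_per_class)
```

Each off-diagonal entry in row `i` and column `j` counts how often a sample of true class `i` was mistaken for class `j`, which is exactly what makes those entries interesting to look at.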

Now it's your turn again! 

## Task 1: Improve the result

Similar to the previous tasks, play around with the network and training hyperparameters to improve the accuracy of the model. Once you have the best version of your model, plot the confusion matrix again. Do the misclassifications change compared to the first version?

In [None]:
# your code here

## Task 2: Compare performance on fully connected network

You've already learnt about fully connected classifiers in this tutorial. You might ask the question: Can't I just flatten the image completely and feed the pixel values into a fully connected network?

The answer is: of course! Please do this test and compare the performance of the fully connected network with the convolutional one. Which one is better, and what do you think is the reason? Note that you need to change the output layer and activation as well as the loss function of the original `Classifier` model for this to work!

## Task 3: CIFAR 10 dataset

Your next task is to run a training on a different dataset: the [CIFAR 10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), which is a dataset of 60,000 colour images of 32x32 pixels that need to be assigned to one of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck). Use the full analysis pipeline for convolutional neural networks as we discussed in the tutorial. Train and evaluate your model, compute the accuracy and the confusion matrix, and then improve the model until you get the best possible performance!

Note that this dataset is also available as a `torchvision` dataset: [`torchvision.datasets.CIFAR10`](https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html#torchvision.datasets.CIFAR10)

In [None]:
# your code here