#  06 - Supervised Learning Methods

*HFT Stuttgart, 2024 Summer Term, Michael Mommert (michael.mommert@hft-stuttgart.de)*

This Jupyter Notebook provides a simple introduction to supervised learning methods in Python and is based on Notebooks prepared by the amazing Dr. Marco Schreyer.

Today, we will use your Python skills to implement our first machine learning models. For this purpose, we will utilize the [`scikit-learn`](https://scikit-learn.org/stable/index.html) package, which provides a wide range of functionality for different machine learning tasks, as well as some datasets for learning how to use this functionality.

In this example, we will utilize a *k*NN multi-class classification model.

## Lab Objectives:

The learning objectives for today are based on the supervised learning setup discussed in our lecture.

This Notebook follows this pipeline in its structure:

> 0. Data loading and exploration
> 1. Feature engineering
> 2. Data scaling
> 3. Data splitting
> 4. Define hyperparameters
> 5. Train model on fixed hyperparameters
> 6. Evaluate model on val data set
> 7. Maximize performance on validation data set by tuning hyperparameters
> 8. Evaluate trained model on test data 

We will apply these steps to the $k$-NN classifier. Furthermore, we will look at the following things in this lab:
> 9. Model evaluation with the confusion matrix
> 10. Setting up a random forest classifier

## -1. Setup of the Environment

We install and import necessary packages:

In [None]:
! pip install -r requirements.txt

In [None]:
# import the numpy data science library
import numpy as np

# import sklearn data and data pre-processing libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import the matplotlib and seaborn data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

Set a random seed for all our experiments - this ensures reproducibility.

In [None]:
random_seed = 42

# 0. Data Loading and Exploration

In [None]:
iris = datasets.load_iris()

We can extract the data, targets and target (class) names as follows:

In [None]:
x = iris.data
y = iris.target
y_classes = iris.target_names

**Exercise**: Explore the data yourself!
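
A possible starting point for this exploration is sketched below. It only relies on attributes of the object returned by `datasets.load_iris()` and on basic NumPy functions; the variable name `class_counts` is our own choice for this illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# shapes of the feature matrix and the target vector
print('data shape:', iris.data.shape)
print('target shape:', iris.target.shape)

# names of the features and of the classes
print('features:', iris.feature_names)
print('classes:', iris.target_names)

# number of samples per class
class_counts = np.bincount(iris.target)
print('samples per class:', class_counts)

# simple per-feature statistics
print('feature minima:', iris.data.min(axis=0))
print('feature maxima:', iris.data.max(axis=0))
```

From here, you could continue with histograms or scatter plots of individual features.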

Before we continue, we have a look at a method from Python's **Seaborn** library to create a pairwise plot of all features, referred to as a **Pairplot**. The Seaborn library is a powerful data visualization library based on Matplotlib. It provides a great interface for drawing informative statistical graphics (https://seaborn.pydata.org).

In [None]:
# load the dataset also available in seaborn
iris_plot = sns.load_dataset("iris")

# plot a pairplot of the distinct feature distributions
# (pairplot creates its own figure; its size is set via the height of each panel in inches)
sns.pairplot(iris_plot, diag_kind='hist', hue='species', height=2.5);

It can be observed from the created Pairplot that the feature measurements of the flower class "setosa" are **linearly separable** from those of the remaining flower classes in most feature combinations. In contrast, the measurements of the flower classes "versicolor" and "virginica" overlap in all pairwise feature distributions of the Iris Dataset and are therefore **not linearly separable** in these projections.

# 1. Feature Engineering

Both the **input** data (`iris.data`) and the **output** data (`iris.target`) are already available in the form of quantitative data (continuous input data and discrete class labels), which we can directly feed into our ML models. Therefore, no feature engineering is required for this specific data set.

# 2. Data scaling

We implement data scaling using the `StandardScaler` class, which shifts and rescales the values of each feature so that they have a mean of zero and a standard deviation of one (unit variance).


Let's use the standard scaler implemented in scikit-learn to scale our data. We have to import the corresponding class and initialize it.


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

To apply the scaler and retrieve a transformed dataset, use the `.fit_transform()` method:

In [None]:
data_scaled = scaler.fit_transform(iris.data)

In [None]:
print('original data, mean =', np.mean(iris.data, axis=0))
print('original data, std =', np.std(iris.data, axis=0))
print('scaled data, mean =', np.mean(data_scaled, axis=0))
print('scaled data, std =', np.std(data_scaled, axis=0))
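
To make explicit what the scaler computes, here is a minimal sketch that standardizes the data by hand, using $z = (x - \mu) / \sigma$ for each feature, and compares the result to `StandardScaler`. The variable names (`data`, `data_manual`, `data_sklearn`) are chosen for this illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

data = datasets.load_iris().data

# standardize by hand: subtract the per-feature mean and divide by the per-feature standard deviation
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)
data_manual = (data - mean) / std

# the same transformation using scikit-learn
data_sklearn = StandardScaler().fit_transform(data)

print('identical results:', np.allclose(data_manual, data_sklearn))
```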

We can now use the scaled data in our machine learning model. 

Hint: If you would like to undo the scaling transformation, you can use the `.inverse_transform()` method of `scaler`:

In [None]:
print('original data:', iris.data[0,])
print('scaled data:', data_scaled[0,])
print('unscaled data:', scaler.inverse_transform(data_scaled[0,].reshape(1, -1)))

**Exercise**: implement a different data scaler and compare the results.

# 3. Data splitting

To understand and evaluate the performance of any trained **supervised machine learning** model, it is good practice to divide the dataset into a **training dataset** (the fraction of data records solely used for training purposes), a **validation dataset** (data to evaluate the current settings of your hyperparameters) and a **test dataset** (the fraction of data records solely used for independent evaluation purposes). Please note that both the **validation dataset** and the **test dataset** will never be shown to the model as part of the training process. The **test dataset** is sometimes also referred to as **evaluation set**; both terms refer to the same concept.

We first split our scaled dataset into a training dataset and some other dataset (which we will refer to as *remainder* in the following) that will then be evenly split into a validation and test dataset. We set the fraction of training records to **60%** of the original dataset:

In [None]:
train_fraction = 0.6

Randomly split the scaled dataset into a training set and the remainder using sklearn's `train_test_split` function:

In [None]:
# 60% training and 40% remainder
x_train, x_remainder, y_train, y_remainder = train_test_split(data_scaled, iris.target, test_size=1-train_fraction, 
                                                              random_state=random_seed, stratify=iris.target)

In [None]:
# 50% validation and 50% test
x_val, x_test, y_val, y_test = train_test_split(x_remainder, y_remainder, test_size=0.5, 
                                                random_state=random_seed, stratify=y_remainder)

Note the use of the `stratify` keyword argument here: a stratified split makes sure that approximately the same fraction of samples from each class is present in each dataset. To make this possible, we pass the class labels of the dataset that is being split to this argument.
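
The effect of stratification can be checked by counting the samples per class in each subset. The sketch below repeats the two splits on the class labels only, assuming the same split fractions and random seed as above; the `_demo` suffix is used so that the variables of this Notebook are not overwritten.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

labels = datasets.load_iris().target

# split the labels only: 60% training, 40% remainder, then remainder into validation and test
y_train_demo, y_rem_demo = train_test_split(labels, test_size=0.4,
                                            random_state=42, stratify=labels)
y_val_demo, y_test_demo = train_test_split(y_rem_demo, test_size=0.5,
                                           random_state=42, stratify=y_rem_demo)

# count the number of samples per class in each subset
for name, subset in [('train', y_train_demo), ('val', y_val_demo), ('test', y_test_demo)]:
    print(name, 'samples per class:', np.bincount(subset, minlength=3))
```

Since the Iris dataset contains the same number of samples for each class, each subset should end up with balanced classes, too.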

Check the shapes of the different datasets:

In [None]:
print('original:', iris.data.shape, iris.target.shape)
print('train:', x_train.shape, y_train.shape)
print('val:', x_val.shape, y_val.shape)
print('test:', x_test.shape, y_test.shape)

# 4. Define Hyperparameters

As we learned in the lecture, our *k*NN model has a single hyperparameter: the number of neighbors, *k*. We start by considering a simple *nearest neighbor* model, which, of course, implies that $k=1$.

 

In [None]:
k = 1
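
To recall how the *k*NN classifier works, here is a minimal from-scratch sketch for a single query point: compute the distances to all training points, pick the $k$ closest ones and take a majority vote among their labels. The function `knn_predict` and the toy data are hypothetical and serve illustration purposes only; this is not how `scikit-learn` implements the method internally.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict the class of one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((x_train - x_query) ** 2, axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbor labels (ties go to the smallest label)
    return np.bincount(y_train[nearest]).argmax()

# tiny hand-made example: two clusters in a 2D feature space
x_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(x_toy, y_toy, np.array([0.15, 0.15]), k=1)
pred_b = knn_predict(x_toy, y_toy, np.array([0.95, 1.05]), k=3)
print(pred_a, pred_b)
```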

# 5. Train model on fixed hyperparameters

We start by creating a model instance, which requires passing the chosen hyperparameters to the model: 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=k)

Now we have to train the model on our training dataset. Each model implemented in `scikit-learn` has a `.fit()` method for this purpose. "Fitting" refers here to the same idea that we typically refer to as "training", so don't get confused.

Training the model requires two `arrays`: the training input features ($\mathbf{X}$) and the training target vector ($\mathbf{y}$), such that a well-trained classifier $f$ ideally satisfies: 
$$f(\mathbf{X}) = \mathbf{y}$$

The way we split our dataset into `x_train` and `y_train` already follows this naming convention. We can use those `arrays` readily in the training. Just for reference: `x_train` has to be of shape `(n_samples, n_features)` and `y_train` has to be of shape `(n_samples,)`.

In [None]:
model.fit(x_train, y_train)

`model` is now trained and can be used to make predictions. Let's take one datapoint from our training dataset and see whether it makes a correct prediction:

In [None]:
model.predict([x_train[0]])

In [None]:
y_train[0]

Indeed, it classifies this single datapoint correctly. However, this is not a good way to test or evaluate the performance of your model. Why?
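
One way to approach this question is to compare the accuracy on data the model has seen during training with the accuracy on data it has not seen. For $k=1$, every training sample is its own nearest neighbor, so the accuracy on the training data should be (close to) perfect by construction. The sketch below assumes a simple 60/40 split of the unscaled data; all variable names are chosen for this illustration only.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x_seen, x_unseen, y_seen, y_unseen = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=42, stratify=iris.target)

# train a nearest neighbor model (k=1) on the "seen" part of the data
nn_model = KNeighborsClassifier(n_neighbors=1).fit(x_seen, y_seen)

# compare the accuracy on seen and unseen data
acc_seen = accuracy_score(y_seen, nn_model.predict(x_seen))
acc_unseen = accuracy_score(y_unseen, nn_model.predict(x_unseen))
print('accuracy on data seen during training: {:.2f}%'.format(acc_seen * 100))
print('accuracy on data not seen during training: {:.2f}%'.format(acc_unseen * 100))
```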

# 6. Evaluate model on val data set

Of course, we should use our previously split validation sample for evaluating our model performance:

In [None]:
y_pred = model.predict(x_val)
y_pred

In [None]:
y_val

A quick by-eye check seems to look pretty promising, but of course we need a more quantitative metric for the performance of our model.

In the case of classification, we can use the accuracy metric:

In [None]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, followed by the predictions
accuracy_score(y_val, y_pred)

After evaluation on our validation dataset - which the model has not seen during training - we find that our model makes an accurate prediction in 96.7% of cases.
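
The accuracy is simply the fraction of samples for which the predicted label equals the true label. The following sketch verifies this on a few hand-made labels, which are made up for illustration and are not results from this Notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hand-made labels for illustration only
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred_toy = np.array([0, 0, 1, 2, 2, 2, 2, 1, 0, 1])

# fraction of matching labels, computed by hand and with scikit-learn
acc_manual = np.mean(y_true_toy == y_pred_toy)
acc_sklearn = accuracy_score(y_true_toy, y_pred_toy)
print(acc_manual, acc_sklearn)
```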

This could be it, but there is a good chance that by tuning our sole hyperparameter, $k$, we can achieve a better result. 

# 7. Maximize performance on validation data set by tuning hyperparameters

Let's compile all the relevant code in one cell and try a different value for $k$:

In [None]:
k = 3  # try a different value for k here
model = KNeighborsClassifier(n_neighbors=k)
model.fit(x_train, y_train)
y_pred = model.predict(x_val)
accuracy_score(y_val, y_pred)

We can now use a loop over different choices for $k$ and evaluate the model for these parameters to find the best-performing one. This process is called **hyperparameter tuning**.

However, there is one more technical detail. If we selected $k$ based on the performance on the **test dataset**, information about the test data would influence our choice of model, which is a form of *data leakage*. To avoid that issue, we evaluate our model on the **validation dataset** for different values of $k$ and then, after picking the best-performing $k$, we evaluate that model on the **test dataset**, providing an independent measure of performance.

In [None]:
for k in [1, 3, 5, 7, 10, 15, 20]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_val)
    print('k={:d}, val accuracy={:.2f}%'.format(k, accuracy_score(y_val, y_pred)*100))

It seems that the model performs equally well for $k\sim3$ and $k\sim10$. Based on experience, I would pick $k=10$. Why? For small values of $k$, you are more likely to **overfit** the training data, so choosing a larger value of $k$ increases the chances that the model generalizes well to data it has never seen before.
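
A common way to look for signs of overfitting is to compare the accuracy on the training data to the accuracy on the validation data for each $k$: a large gap between the two suggests that the model has adapted too closely to the training data. The following sketch assumes the same scaling, split fractions and random seed as above; the variable names are chosen such that they do not overwrite those used in this Notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)

# 60% training, 20% validation, 20% test
x_tr, x_rem, y_tr, y_rem = train_test_split(x_scaled, iris.target, test_size=0.4,
                                            random_state=42, stratify=iris.target)
x_va, x_te, y_va, y_te = train_test_split(x_rem, y_rem, test_size=0.5,
                                          random_state=42, stratify=y_rem)

# store (training accuracy, validation accuracy) for each candidate k
results = {}
for k_candidate in [1, 3, 5, 7, 10, 15, 20]:
    candidate = KNeighborsClassifier(n_neighbors=k_candidate).fit(x_tr, y_tr)
    results[k_candidate] = (candidate.score(x_tr, y_tr), candidate.score(x_va, y_va))
    print('k={:d}, train accuracy={:.2f}%, val accuracy={:.2f}%'.format(
        k_candidate, results[k_candidate][0] * 100, results[k_candidate][1] * 100))
```

Here, the `.score()` method of a classifier returns the mean accuracy on the given data.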


# 8. Evaluate trained model on test data set

Let's retrain the model with $k=10$, make predictions on the test data set and we're done:

In [None]:
k = 10
model = KNeighborsClassifier(n_neighbors=k)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('final test accuracy={:.2f}%'.format(accuracy_score(y_test, y_pred)*100))

Indeed, evaluating the model on the test dataset provides the same accuracy as for the validation dataset - but this is not always the case, since both datasets are different from one another.

# 9. Model evaluation with the confusion matrix

So far, we have only considered the accuracy metric to evaluate our predictions. It would be useful to know whether one class of iris is more likely to be mistaken than another. For that purpose, confusion matrices are used:

In [None]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(y_test, y_pred)
mat
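
A small hand-made example may help with reading such a matrix. For `confusion_matrix(y_true, y_pred)`, the rows correspond to the true classes and the columns to the predicted classes. The labels below are made up for illustration; the per-class quantities computed from the matrix are commonly called recall and precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hand-made labels for illustration only: one sample of class 1 is mistaken for class 2
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 0, 1, 2, 2, 2])

# rows correspond to the true classes, columns to the predicted classes
mat_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(mat_toy)

# fraction of each true class that was predicted correctly (recall)
recall_per_class = np.diag(mat_toy) / mat_toy.sum(axis=1)
# fraction of the predictions for each class that are correct (precision)
precision_per_class = np.diag(mat_toy) / mat_toy.sum(axis=0)
print('recall:', recall_per_class)
print('precision:', precision_per_class)
```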

Each row of the matrix (y-axis) corresponds to an actual class, each column (x-axis) to a predicted class. Entries on the diagonal have been predicted correctly. Entries off the diagonal indicate how many flowers have incorrect class predictions.

The ``seaborn`` library provides a method to generate nicely formatted confusion matrices. Let's give it a try:

In [None]:
# init the plot
plt.figure(figsize=(7, 7))

# plot confusion matrix heatmap
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, cmap='Reds', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)

# add plot axis labels (rows of the matrix are the true classes, columns the predictions)
plt.xlabel('Prediction')
plt.ylabel('Ground Truth')

# add plot title
plt.title('Confusion Matrix')

# 10. Setting up a random forest classifier

With `scikit-learn`, it is easy to simply try out a different model, since all algorithms (ML methods, scalers, etc.) are implemented as classes that share a common interface: every one of them provides a `.fit()` method, models additionally provide `.predict()` and scalers provide `.transform()`.
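
The following sketch illustrates this shared interface: the same few lines of code train and evaluate two different classifiers. The decision tree and its `max_depth=3` setting are arbitrary choices for illustration, as are the simple 60/40 split of the unscaled data and all variable names.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_fit, x_eval, y_fit, y_eval = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=42, stratify=iris.target)

# two different models that share the same interface
candidates = {
    'kNN (k=10)': KNeighborsClassifier(n_neighbors=10),
    'decision tree (max_depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42),
}

# train and evaluate each model with identical code
scores = {}
for name, clf in candidates.items():
    clf.fit(x_fit, y_fit)
    scores[name] = clf.score(x_eval, y_eval)
    print('{}: accuracy={:.2f}%'.format(name, scores[name] * 100))
```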

**Exercise**: Implement a random forest classifier! Here, we use two different hyperparameters: the number of trees in the ensemble and the maximum depth of the individual trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier