# Machine Learning with scikit-learn

> A computer program is said to learn from **experience** E with respect to some class of **tasks** T and **performance measure** P, if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell, “Machine Learning”, McGraw-Hill, 1997

#### Task

- **Classification**: determining which of a finite set of categories an input belongs to.
- **Regression**: inferring the relationship between input and output in order to predict a numerical value given some input.
- **Clustering**: discovering groups of similar examples within the data.
- **Density estimation**: determining the distribution of the data within the input space.
- **Synthesis and Sampling**: generating new examples that resemble the ones that compose the machine’s experience.


#### Experience

The experience is the dataset the machine accesses to learn how to perform the task. It therefore depends on the type of task:
- In **Supervised Learning**, the dataset contains the labels representing the ground truth from which the machine can learn.
- In **Unsupervised Learning**, the dataset is a collection of examples with no additional information. It is used to learn the properties or the structure of the data.
- In **Reinforcement Learning**, the machine continuously learns by interacting with the environment.

In general, the dataset is split into two parts, with an optional third one:
- **Training set**: used to teach the model, so it is the one that composes the actual experience of the machine.
- **Test set**: used to assess the model performance. For a fair assessment, the performance must be measured on examples that are not part of the experience, i.e., that are not used during training.
- **Validation set**: sometimes needed to tune the hyper-parameters (parameters that cannot be learned during training); it is held out from the training data for this purpose.
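
A three-way split can be sketched with two calls to `train_test_split` from `scikit-learn`. The toy arrays, the 60/20/20 proportions and the fixed seeds are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy dataset: 100 samples with 2 features each
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = rng.random(100)

# first hold out 20% of the data as the test set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# then hold out 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is separated first so that no later choice (including hyper-parameter tuning on the validation set) can depend on it.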

#### Performance Measure

A machine needs a quantitative measure to assess how well it is doing. 
- It is often _task-specific_. For instance, accuracy is suitable for classification but meaningless for regression, where the predictions are continuous values.
- It must be _indicative of the desired behaviour_. An algorithm that has the same task and learns from the same experience can behave completely differently if different measures are adopted.

In general, the measure used to learn (during training) may differ from the one used to assess the performance (at test time).
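
To illustrate how the chosen measure shapes the behaviour, here is a minimal sketch in which the "model" is a single constant prediction. The data (with one outlier) and the grid of candidates are hypothetical. The best constant under the mean squared error is the mean of the data, while under the mean absolute error it is the median.

```python
import numpy as np

# hypothetical observations with one outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# candidate constant predictions
candidates = np.linspace(0.0, 100.0, 1001)

# error of every candidate under the two measures
errors = candidates[:, None] - y[None, :]
mse = np.mean(errors**2, axis=1)
mae = np.mean(np.abs(errors), axis=1)

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]
print(f'best constant under MSE: {best_mse:.1f} (mean of y: {y.mean():.1f})')
print(f'best constant under MAE: {best_mae:.1f} (median of y: {np.median(y):.1f})')
```

Same task and same data, yet the two measures lead to very different predictions because the squared error weighs the outlier much more heavily.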

## scikit-learn

[`scikit-learn`](https://scikit-learn.org/stable/index.html) is a Python package that implements Machine Learning methods for data analysis and is designed to be simple and efficient. It is open source, commercially usable and built on top of `numpy`, `scipy`, and `matplotlib`. `scikit-learn` not only supports supervised and unsupervised learning but also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

In [None]:
# modules for data manipulation
import numpy as np
import pandas as pd

# modules for random number generation
from numpy import random

# modules for data visualization
import matplotlib.pyplot as plt

## Regression

Let's start with a toy case: we create a dataset from a known model and then use the `LinearRegression` model available in `scikit-learn` to see whether it can learn the true relationship between input and output from the data.

### Dataset

First, we need to build a dataset starting from a given model.

In [None]:
# function implementing the true model
def groud_truth(x):
    y = 4*x + 1
    return y

Once we have the model, we can generate some random input data `x` and compute the labels (or targets) `y`, which in general are affected by noise.

In [None]:
n_samples = 100 # number of samples to generate
x = random.rand(n_samples) # randomly generate samples
y_gt = groud_truth(x) # noise-less observations

noise = random.randn(n_samples)  # additive noise affecting the observations
y_true = y_gt + noise # actual observations

Here is a visualization of the dataset.

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x, y_true, c='C0', label='samples')
ax.plot(x, y_gt, c='C1', lw=2, label='model')
ax.set(xlabel='$x$', ylabel='$y$')
ax.legend()
ax.grid(True)

### Performance metric

Before we delve into machine learning, let us recall that we need a performance measure to tell whether the machine is doing well. A typical measure for regression problems is the **Mean Squared Error** (**MSE**), that is, the mean of the squared differences between the predictions and the true values.

In [None]:
from sklearn.metrics import mean_squared_error

We can compute the MSE of the true model and use it as a reference for our machine learning model.

In [None]:
mse_gt = mean_squared_error(y_true, y_gt)
mse_gt
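
For reference, the MSE over $n$ samples is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the prediction for the $i$-th sample. The sketch below computes it by hand with `numpy` on a few made-up values (hypothetical, chosen only for illustration) and compares the result with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

# mean of the squared differences between predictions and true values
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```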

### Model

We need an estimator that implements the model and is able to learn from the data the parameters that make it perform best.

`scikit-learn` provides dozens of built-in machine learning algorithms and models for regression problems. Here we start with the `LinearRegression` model, which finds the linear function that minimizes the sum of the squared prediction errors (this is why the problem is called **Ordinary Least Squares**).
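
To make the least-squares idea concrete, the following sketch solves the same problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data, the seed and the sample size are assumptions for illustration. This is not a description of how `scikit-learn` computes the solution internally, only a check that both approaches reach the same minimizer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical noisy data generated from y = 4x + 1
rng = np.random.default_rng(0)
x = rng.random(50)
y = 4 * x + 1 + rng.normal(size=50)

# least-squares solution computed directly: the columns of A are [x, 1]
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# the same fit obtained with the scikit-learn estimator
reg = LinearRegression().fit(x.reshape(-1, 1), y)

print(f'lstsq:            slope={slope:.3f}, intercept={intercept:.3f}')
print(f'LinearRegression: slope={reg.coef_[0]:.3f}, intercept={reg.intercept_:.3f}')
```

After fitting, the estimator exposes the learned slope and intercept through the attributes `coef_` and `intercept_`.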

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# create an instance of the LinearRegression estimator
reg = LinearRegression() 
reg

### Training

At this point, the estimator cannot be used because it has not been trained yet, i.e., it has no experience. To train the model we call the method `.fit()`, which takes the entire training set (both samples and labels) as input and learns the best parameters.

In [None]:
X = x.reshape((n_samples, 1)) # scikit-learn requires a 2-dim input: (n_samples, n_features)
reg.fit(X, y_true)

To assess the training we can first see if the estimated model fits the training data.
To do so we need to use the model to predict the labels from the training data, i.e., we use the model for **inference**. In `scikit-learn`, we do it by calling the method `.predict()` which takes the samples as input and returns the predicted labels.

In [None]:
y_pred = reg.predict(X)

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x, y_true, c='C0', label='samples')
ax.plot(x, y_gt, c='C1', lw=2, label='true model')
ax.plot(x, y_pred, c='C2', lw=2, label='pred model')
ax.set(xlabel='$x$', ylabel='$y$')
ax.legend()
ax.grid(True)

Rather than relying on a visual comparison, we should assess the training with a quantitative measure. We have previously chosen the MSE.

In [None]:
mse_pred = mean_squared_error(y_true, y_pred)

print(f'mse ground-truth: {mse_gt:.3f}')
print(f'mse estimator: {mse_pred:.3f}')

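That looks like a paradox! Our trained model should not be able to perform better than the ground truth, yet it does. How is that possible?

The reason is that least squares picks the line with the lowest MSE on the training samples, so on those samples it can only match or beat any other line, including the ground truth: it also fits part of the noise. Hence, we cannot assess the performance on the training set; we need a test set composed of samples that are different from the training ones. Otherwise, it is called cheating!

Checking that the MSE of the predictions on the training set is similar to the one obtained by the ground truth only tells us whether the estimator was trained enough.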
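
The effect can be reproduced in a few lines. The sketch below uses a hypothetical `sample` helper and a fixed seed (both assumptions, not part of the notebook) to compare the fitted line with the true one, first on the training samples and then on fresh ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def sample(n):
    # hypothetical helper: draws n noisy samples from y = 4x + 1
    x = rng.random(n)
    y_gt = 4 * x + 1
    return x.reshape(-1, 1), y_gt, y_gt + rng.normal(size=n)

X_train, gt_train, y_train = sample(100)
X_test, gt_test, y_test = sample(100)

reg = LinearRegression().fit(X_train, y_train)

# on the training set the fitted line can never be worse than the true line
mse_train_fit = mean_squared_error(y_train, reg.predict(X_train))
mse_train_gt = mean_squared_error(y_train, gt_train)
print(f'train -> estimator: {mse_train_fit:.3f}, ground truth: {mse_train_gt:.3f}')

# on fresh samples this guarantee no longer holds
mse_test_fit = mean_squared_error(y_test, reg.predict(X_test))
mse_test_gt = mean_squared_error(y_test, gt_test)
print(f'test  -> estimator: {mse_test_fit:.3f}, ground truth: {mse_test_gt:.3f}')
```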

### Test

We need new samples. Since this is a toy case, we can generate them.
If the dataset is given (as in most real applications), you usually split it into two parts, one for training and the other for testing.

In [None]:
x = random.rand(n_samples)
y_gt = groud_truth(x)

noise = random.randn(n_samples)
y_true = y_gt + noise

In [None]:
X = x.reshape((n_samples, 1))
y_pred = reg.predict(X)

In [None]:
mse_gt = mean_squared_error(y_true, y_gt)
mse_pred = mean_squared_error(y_true, y_pred)

print(f'mse ground-truth: {mse_gt:.3f}')
print(f'mse estimator: {mse_pred:.3f}')

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x, y_true, c='C0', label='samples')
ax.plot(x, y_gt, c='C1', lw=2, label='true model')
ax.plot(x, y_pred, c='C2', lw=2, label='pred model')
ax.set(xlabel='$x$', ylabel='$y$')
ax.legend()
ax.grid(True)

## Classification

Let us now look at a toy classification problem, which consists in finding which class the input data belongs to.

### Dataset

Here, we will make use of one of the datasets available in the `datasets` submodule of `scikit-learn`. In particular, we consider the **hand-written digits dataset**, which contains images of hand-written digits; the task consists in identifying which one of the 10 digits each image displays.

In [None]:
from sklearn import datasets

In [None]:
digits = datasets.load_digits()

In `scikit-learn`, a dataset is a dictionary-like object that holds all the data and some metadata (that is, information about the data). The data is stored under the key `data`, which is a (`n_samples`, `n_features`) array. For supervised problems, the dataset also includes the labels, which are stored under the key `target`.

In [None]:
print('data shape:', digits['data'].shape)
print('labels shape:', digits['target'].shape)
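
As a small sketch of what "dictionary-like" means here: the entries can be listed with `.keys()` and are also reachable as attributes. For the digits dataset, each row of `data` is the corresponding 8x8 entry of `images` flattened into 64 features.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# available keys of the dictionary-like object
print(list(digits.keys()))

# each key is also accessible as an attribute
print(digits.data is digits['data'])

# every row of `data` is an 8x8 image flattened into 64 features
n_samples = len(digits.images)
print(np.array_equal(digits.data, digits.images.reshape(n_samples, -1)))
```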

In [None]:
print(digits['DESCR'])

Let's see some examples.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax in axes:
    i_sample = random.randint(0, len(digits.images))
    ax.imshow(digits.images[i_sample], cmap=plt.cm.gray_r)
    ax.set_axis_off()
    ax.set_title(f"class: {digits.target[i_sample]}" )

Our entire dataset is composed of `n_samples` samples. We cannot use them all for training, otherwise we would not have any sample left for assessing the performance. Therefore, we first need to split the dataset into **Training** and **Test** sets.

To do so, we can use the function `train_test_split` available in the `model_selection` submodule.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, shuffle=True
)
print(f'training -> X: {X_train.shape}, y: {y_train.shape}')
print(f'test     -> X: {X_test.shape }, y: {y_test.shape }')
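
Since the split is random, every run produces different sets. As a hedged variation (not what the cell above does), the sketch below passes `random_state` to make the split reproducible and `stratify` to keep the proportion of each digit similar in both sets. The seed value is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# random_state fixes the shuffling, stratify preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.4, shuffle=True, stratify=digits.target, random_state=0,
)

# fraction of each digit in the two sets
frac_train = np.bincount(y_train) / len(y_train)
frac_test = np.bincount(y_test) / len(y_test)
print(np.round(frac_train, 3))
print(np.round(frac_test, 3))
```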

### Model

As model we choose the **Support Vector Machine** (**SVM**), a common method to address classification problems (it was probably the most common before neural networks).
As before, we need an estimator that implements the model and is able to learn from the data the parameters that make it perform best.

`scikit-learn` dedicates an entire submodule, `svm`, to SVMs since the same underlying principle can be used for other tasks such as regression and unsupervised anomaly detection.
The SVM classifier is implemented by the estimator `svm.SVC`.

In [None]:
from sklearn.svm import SVC

In [None]:
svc = SVC(gamma=0.001)
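
Here `gamma` is a hyper-parameter: it is the coefficient of the RBF kernel that `SVC` uses by default, and it is not learned by `.fit()`. One possible way to choose it, sketched below, is cross-validation on the training set with `GridSearchCV`. The grid of candidate values, the number of folds and the seed are assumptions for illustration, not values taken from this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=0
)

# hypothetical grid of candidate values for the hyper-parameter gamma
param_grid = {'gamma': [1e-4, 1e-3, 1e-2]}

# 3-fold cross-validation on the training set only: the test set stays untouched
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)

print('best gamma:', search.best_params_['gamma'])
print(f'cross-validated accuracy: {search.best_score_:.3f}')
```

Each fold plays the role of a validation set in turn, so the test set remains available for the final, fair assessment.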

### Training

Let's train our model.

In [None]:
svc.fit(X_train, y_train)

### Test

To assess the performance of our model we need to choose a quantitative measure and compute it for the predictions on a test set. In classification there are many possible metrics (nearly all of them are available in the `metrics` submodule); the most common are Accuracy, Precision, Recall, and F1-score.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Here we make the predictions (i.e., inference on the test set) and then compute the metric.

In [None]:
y_pred = svc.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f'accuracy: {100*acc:.2f}%')
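
To make these metrics concrete, the sketch below computes accuracy, and the precision, recall and F1-score of one class, by hand on a tiny made-up example (the labels are hypothetical). Each value is printed next to the one returned by the corresponding `scikit-learn` function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# hypothetical labels and predictions for a 3-class problem
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

# accuracy: fraction of correct predictions
acc = np.mean(y_true == y_pred)

# counts for class 1
tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1 and actually 1
fp = np.sum((y_pred == 1) & (y_true != 1))  # predicted 1 but actually another class
fn = np.sum((y_pred != 1) & (y_true == 1))  # actually 1 but predicted another class

precision = tp / (tp + fp)  # how many predicted 1s are correct
recall = tp / (tp + fn)     # how many actual 1s are found
f1 = 2 * precision * recall / (precision + recall)

print(acc, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred, average=None)[1])
print(recall, recall_score(y_true, y_pred, average=None)[1])
print(f1, f1_score(y_true, y_pred, average=None)[1])
```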

Accuracy is a good metric, but it says little about how the model makes mistakes. For this reason, one may have a look at a more informative report on how the model performs for each class.

In [None]:
print(classification_report(y_test, y_pred))

Another way to visualize more information about the model performance is the Confusion Matrix, which not only shows how good the classification is for each class but also shows which mistakes are made.

In [None]:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=digits.target_names).plot();

Let's have a look at some of the most common mistakes.

In [None]:
# err_classes = {'true': 3, 'pred': 7}
err_classes = {'true': 5, 'pred': 6}
# err_classes = {'true': 5, 'pred': 9}
# err_classes = {'true': 8, 'pred': 1}

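# select the test samples of class 'true' that were predicted as class 'pred'
mask = (y_test == err_classes['true']) & (y_pred == err_classes['pred'])
X_err = X_test[mask]

if len(X_err) == 0:
    # the split is random, so this pair of classes may have no mistakes
    print('no mistakes found for this pair of classes')
else:
    nrows, ncols = 1, min(6, len(X_err))
    fig, axes = plt.subplots(nrows, ncols, figsize=(10, 3), squeeze=False)
    for ax, sample in zip(axes[0], X_err):
        image = sample.reshape((8, 8))  # each sample is a flattened 8x8 image
        ax.imshow(image, cmap=plt.cm.gray_r)
        ax.set_axis_off()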
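
The pair of classes to inspect can also be picked from the confusion matrix itself rather than by hand. A minimal sketch on made-up labels (hypothetical, for illustration only): zero the diagonal, which counts the correct predictions, and look for the largest remaining entry.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 1, 1, 0])

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# zero the diagonal (correct predictions) to keep only the mistakes
mistakes = cm.copy()
np.fill_diagonal(mistakes, 0)

# position of the largest off-diagonal entry
true_cls, pred_cls = np.unravel_index(np.argmax(mistakes), mistakes.shape)
print(f'most common mistake: true {true_cls} predicted as {pred_cls} '
      f'({mistakes[true_cls, pred_cls]} times)')
```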

---