# Regression with Scikit-Learn

In [None]:
import sklearn

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

## Our task

The dataset describes diabetes progression in a set of patients. Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of 442 diabetes patients, along with the response of interest, a quantitative measure of disease progression one year after baseline. The input features are already normalized.

More details are available at https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset

In [None]:
from sklearn.datasets import load_diabetes

In [None]:
X, y = load_diabetes(return_X_y=True)

In [None]:
X.shape,  y.shape

In [None]:
plt.hist(y)

### Construct selection and test sets

Now we create an arbitrary split in order to have a selection set and a test set for the next experiments. Usually these splits are given by the task, e.g. the ML Cup dataset for students and a blind test set for the teacher.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=50, shuffle=True, random_state=42)

In [None]:
X_dev.shape, X_test.shape

## Evaluation metrics

We'll now look at some evaluation metrics for regression, using a dummy baseline as an example.

### A baseline predictor

An extremely naive baseline predictor always returns the *mean target value* of the selection set as its prediction:

In [None]:
np.mean(y_dev)

In [None]:
y_pred = np.ones_like(y_dev) * np.mean(y_dev)

In [None]:
y_test_pred = np.ones_like(y_test) * np.mean(y_dev)
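
scikit-learn also provides this baseline as `DummyRegressor`. The following is a minimal sketch, assuming the same 50-sample test split as above, that checks it reproduces the hand-built mean predictor (the name `y_test_pred_dummy` is introduced here only for illustration):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=50, shuffle=True, random_state=42)

# strategy='mean' ignores the inputs and always predicts the training-set mean
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_dev, y_dev)
y_test_pred_dummy = dummy.predict(X_test)

# Same values as np.ones_like(y_test) * np.mean(y_dev)
print(np.allclose(y_test_pred_dummy, np.mean(y_dev)))
```

Using the estimator form makes it easy to pass the baseline through the same evaluation code as any other model.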

### Evaluating the baseline

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

Mean absolute error (MAE): average of 1-norms of output errors.

$\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \left| y_i - \hat{y}_i \right|$.

In [None]:
mean_absolute_error(y_test, y_test_pred)

Mean squared error (MSE): average of squared 2-norms of output errors.

$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=1}^{n_\text{samples}} (y_i - \hat{y}_i)^2$.

In [None]:
mean_squared_error(y_test, y_test_pred)

Root mean squared error (RMSE): square root of MSE.

$\text{RMSE}(y, \hat{y}) = \sqrt{\text{MSE}(y, \hat{y})}$

In [None]:
sklearn.__version__  # root_mean_squared_error requires scikit-learn >= 1.4

In [None]:
root_mean_squared_error(y_test, y_test_pred)
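
If the installed scikit-learn release is older than 1.4, `root_mean_squared_error` is not available. A possible fallback, sketched here on small made-up arrays (`y_true` and `y_hat` are illustrative values, not taken from the dataset), is to take the square root of the MSE directly:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4 and later
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    def root_mean_squared_error(y_true, y_pred):
        # Fallback: RMSE is the square root of the MSE
        return np.sqrt(mean_squared_error(y_true, y_pred))

# Illustrative values only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

rmse = root_mean_squared_error(y_true, y_hat)
print(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
```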

Other metrics are maximum error, median absolute error, ...

Note that for multi-output targets these functions return, by default, the error averaged uniformly over the targets.
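
To inspect the error of each target separately, the metric functions accept a `multioutput` argument. A small sketch on made-up two-target arrays (`Y_true` and `Y_hat` are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Two targets on very different scales (illustrative values)
Y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Y_hat = np.array([[1.5, 12.0], [2.5, 18.0], [2.0, 33.0]])

# One MAE per target column
per_target = mean_absolute_error(Y_true, Y_hat, multioutput='raw_values')
# Default behaviour: uniform average of the per-target errors
averaged = mean_absolute_error(Y_true, Y_hat)

print(per_target, averaged)
```

The averaged value hides the fact that the second target dominates the error, which is one reason to look at the raw values when targets have different scales.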

Functions like *mean euclidean error* (MEE) must be implemented by the user, and then set as a custom scoring function in model evaluation.

In [None]:
def mean_euclidean_error(y_true, y_pred):
    # Euclidean norm of each sample's error vector, averaged over samples
    errors = y_true - y_pred
    return np.linalg.norm(errors, axis=1).mean()

In [None]:
mean_euclidean_error(np.random.rand(10,2), np.random.rand(10,2))
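
Setting a user-defined metric as a custom scoring function can be done with `make_scorer`. The sketch below assumes a synthetic two-target problem generated with `make_regression` and a `Ridge` model chosen only for illustration; the function is redefined so that the block is self-contained:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def mean_euclidean_error(y_true, y_pred):
    # Euclidean norm of each sample's error vector, averaged over samples
    return np.linalg.norm(y_true - y_pred, axis=1).mean()

# greater_is_better=False makes the scorer return the negated error,
# so that "higher is better" holds as model selection tools expect
mee_scorer = make_scorer(mean_euclidean_error, greater_is_better=False)

# Synthetic data with two targets (illustrative only)
X_toy, Y_toy = make_regression(n_samples=100, n_features=5, n_targets=2,
                               noise=1.0, random_state=0)

scores = cross_val_score(Ridge(alpha=1.0), X_toy, Y_toy, cv=5, scoring=mee_scorer)
mee_per_fold = -scores  # flip the sign back to obtain the MEE of each fold
print(mee_per_fold.mean(), mee_per_fold.std())
```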

## Data transformations

Some models can benefit from target normalization, e.g. when training a multi-output neural network with target variables of different scales.

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [None]:
Y = np.concatenate((np.random.randn(100,1)*10+25, np.random.rand(100,1)), axis=1)

In [None]:
plt.hist(Y[:,0], alpha=.5, label='y0')
plt.hist(Y[:,1], alpha=.5, label='y1')
plt.legend(loc='upper right')
plt.show()

- Rescale values between minimum and maximum:

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(Y)

In [None]:
Y_scaled = scaler.transform(Y)

In [None]:
plt.hist(Y_scaled[:,0], alpha=.5, label='y0')
plt.hist(Y_scaled[:,1], alpha=.5, label='y1')
plt.legend(loc='upper right')
plt.show()

- Normalize values with mean and standard deviation:

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(Y)

In [None]:
Y_scaled = scaler.transform(Y)

In [None]:
plt.hist(Y_scaled[:,0], alpha=.5, label='y0')
plt.hist(Y_scaled[:,1], alpha=.5, label='y1')
plt.legend(loc='upper right')
plt.show()
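
Both transformations are simple column-wise formulas. As a sketch, assuming a synthetic two-column target matrix `Y_demo` similar to the one above, they can be written by hand and compared against the scikit-learn scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Two targets on different scales (illustrative data)
Y_demo = np.concatenate((rng.normal(25, 10, size=(100, 1)),
                         rng.random((100, 1))), axis=1)

# Min-max rescaling: (y - min) / (max - min), computed per column
y_min, y_max = Y_demo.min(axis=0), Y_demo.max(axis=0)
Y_minmax = (Y_demo - y_min) / (y_max - y_min)

# Standardization: (y - mean) / std, with the population std (ddof=0)
Y_standard = (Y_demo - Y_demo.mean(axis=0)) / Y_demo.std(axis=0)

print(np.allclose(Y_minmax, MinMaxScaler().fit_transform(Y_demo)))
print(np.allclose(Y_standard, StandardScaler().fit_transform(Y_demo)))
```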

- **Remember to apply the inverse transform before estimating errors!**

In [None]:
np.linalg.norm(scaler.inverse_transform(Y_scaled) - Y)
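
The sketch below illustrates why the inverse transform matters. It assumes a `Ridge` model and a standardized target, both picked only for illustration: the MAE computed in scaled units differs from the MAE in the original units by the scaling factor. `TransformedTargetRegressor` from `sklearn.compose` is one way to let scikit-learn apply the inverse transform automatically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=50, shuffle=True, random_state=42)

# Manual route: fit the target scaler on the selection set only
scaler_y = StandardScaler().fit(y_dev.reshape(-1, 1))
y_dev_scaled = scaler_y.transform(y_dev.reshape(-1, 1)).ravel()
ridge = Ridge(alpha=1.0).fit(X_dev, y_dev_scaled)

pred_scaled = ridge.predict(X_test)
pred = scaler_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

# Error in scaled units (not comparable with other models) vs original units
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1)).ravel()
mae_scaled_units = mean_absolute_error(y_test_scaled, pred_scaled)
mae_original_units = mean_absolute_error(y_test, pred)

# Wrapper route: the inverse transform is applied inside predict()
wrapped = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                     transformer=StandardScaler())
wrapped.fit(X_dev, y_dev)
mae_wrapped = mean_absolute_error(y_test, wrapped.predict(X_test))

print(mae_scaled_units, mae_original_units, mae_wrapped)
```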

## Nearest neighbour

k-NN predicts by locally interpolating the targets of the $k$ training samples that are closest to the query point $\mathbf{x}$.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
model = KNeighborsRegressor(n_neighbors=3,  # number of neighbours
                            weights='uniform',  # or 'distance': weight proportional to inverse of distance
                            metric='minkowski',  # or other user-defined distances
                            p=2)  # p-norm for 'minkowski' metric

In [None]:
model.fit(X_dev, y_dev)

In [None]:
mean_absolute_error(y_test, model.predict(X_test))
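
With `weights='uniform'`, the local interpolation is simply the average of the targets of the $k$ nearest training samples. A minimal NumPy sketch on random data (all array names here are illustrative) that reproduces the predictions of `KNeighborsRegressor`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
y_train = rng.random(50)
X_query = rng.random((5, 3))

k = 3
# Euclidean distance from each query point to every training point
dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
# Indices of the k closest training samples for each query point
nearest = np.argsort(dists, axis=1)[:, :k]
# Uniform weights: the prediction is the mean of the neighbours' targets
manual_pred = y_train[nearest].mean(axis=1)

knn = KNeighborsRegressor(n_neighbors=k, weights='uniform').fit(X_train, y_train)
print(np.allclose(manual_pred, knn.predict(X_query)))
```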

## Linear models

This class of models has the form $\hat{y} = \mathbf{w}^T \mathbf{x} + w_0$, with parameters $\mathbf{w}$ and $w_0$ to be trained. These models usually minimize the MSE with some form of weight regularization.

### Ridge regression

A linear model $\hat{y} = \mathbf{w}^T \mathbf{x} + w_0$ trained by penalized least squares, $\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2$, i.e. it minimizes the sum of squared errors plus an L2 penalty on the weights.

In [None]:
from sklearn.linear_model import Ridge

In [None]:
model = Ridge(alpha=1.0,  # regularization parameter
              solver='auto')  # choose solving method (e.g. 'svd', 'cholesky', 'sag', ...)

In [None]:
model.fit(X_dev, y_dev)

In [None]:
mean_absolute_error(y_test, model.predict(X_test))
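
The ridge objective has a closed-form minimizer, $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below assumes that the intercept is handled by centering the inputs and the targets and is left unpenalized, and compares the result with the fitted `Ridge` parameters on the full diabetes dataset:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
alpha = 1.0

# Center inputs and targets so that the intercept is not penalized
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Solve (Xc^T Xc + alpha I) w = Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
w0 = y_mean - x_mean @ w

ridge = Ridge(alpha=alpha).fit(X, y)
print(np.allclose(w, ridge.coef_), np.isclose(w0, ridge.intercept_))
```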

### Other linear models

Other variants of linear models are distinguished by the type of regularization that is applied in error minimization:
- `Lasso`: weights have L1 regularization to favor sparsity,
- `ElasticNet`: weights have both L1 and L2 regularization,
- ...
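
The effect of the penalty type on sparsity can be observed by counting the coefficients that are exactly zero. This is a sketch on the full diabetes dataset; the values `alpha=1.0` and `l1_ratio=0.5` are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

models = {
    'Ridge': Ridge(alpha=1.0),  # L2 penalty only
    'Lasso': Lasso(alpha=1.0),  # L1 penalty only
    'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of L1 and L2
}

n_zero = {}
for name, m in models.items():
    m.fit(X, y)
    # Number of coefficients driven exactly to zero
    n_zero[name] = int(np.sum(m.coef_ == 0.0))
    print(name, n_zero[name])
```

The L1 penalty can set some weights exactly to zero, while the L2 penalty only shrinks them.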

## Neural networks

A neural network regressor, where you can choose the hidden layers and their units, the training procedure (SGD, L-BFGS, Adam), regularization, etc. The loss to be minimized is the *MSE loss*.

The L2 regularization parameter is $\alpha$, larger ⇒ more regularization.

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
nn = MLPRegressor(hidden_layer_sizes=(5,),  # input and output layer sizes automatically selected by fit()
                  activation='tanh',  # activation function
                  solver='sgd',  # {'lbfgs', 'sgd', 'adam'}
                  alpha=1e-4,  # L2 regularization (is divided by #samples)
                  max_iter=50,  # epochs
                  batch_size=32,
                  shuffle=True,  # reshuffle samples between epochs
                  learning_rate='constant',  # can also be adaptive
                  learning_rate_init=1e-3,  # (initial) learning rate
                  momentum=0.9,
                  nesterovs_momentum=True,  # if you want to use Nesterov's momentum, only for SGD
                  verbose=True)

In [None]:
nn.fit(X_dev, y_dev)

You can access the learning curve and other training statistics in the `MLPRegressor` object:

In [None]:
plt.plot(nn.loss_curve_)
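
Besides `loss_curve_`, the fitted object exposes attributes such as `n_iter_` (epochs actually run) and `loss_` (last training loss). A small sketch, assuming a short 20-epoch run on the full dataset with a fixed seed and default settings for everything not shown (`nn_demo` is a name introduced here):

```python
import warnings
from sklearn.datasets import load_diabetes
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

X, y = load_diabetes(return_X_y=True)

nn_demo = MLPRegressor(hidden_layer_sizes=(5,), activation='tanh',
                       solver='sgd', max_iter=20, random_state=42)

# A short run usually stops before convergence; silence that warning
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    nn_demo.fit(X, y)

print(nn_demo.n_iter_)                        # number of epochs run
print(len(nn_demo.loss_curve_))               # one loss value per epoch
print(nn_demo.loss_curve_[-1], nn_demo.loss_)  # last entry is the current loss
```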

In [None]:
mean_absolute_error(y_test, nn.predict(X_test))

## Support vector machines

Support vector machines for regression: $(C,\epsilon)$-SVR.

This class solves the soft-margin problem:
$\begin{aligned}
\min_{w, b, \zeta, \zeta^*} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \\
\textrm{subject to} \quad & y_i - (w^T \phi (x_i) + b) \leq \epsilon + \zeta_i, \\
& (w^T \phi (x_i) + b) - y_i \leq \epsilon + \zeta_i^*, \\
& \zeta_i, \zeta_i^* \geq 0, \quad i=1, \ldots, n
\end{aligned}$
where $C$ controls the strength of regularization (larger $C$ ⇒ weaker regularization), and $\epsilon$ is the half-width of the tube of the $\epsilon$-insensitive loss: errors smaller than $\epsilon$ are not penalized.

In the dual form, the kernel trick is applied in the scalar products $K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$.
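
The identity $K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$ can be checked numerically for a kernel whose feature map is easy to write down. The sketch below assumes the homogeneous polynomial kernel of degree 2, $K(x, z) = (x^T z)^2$, whose feature map lists all products $x_i x_j$; the helper `phi` and the random matrices are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((5, 3))

def phi(X):
    # Explicit feature map: all pairwise products x_i * x_j of each row
    return np.einsum('ni,nj->nij', X, X).reshape(len(X), -1)

# Scalar products in the 9-dimensional feature space
K_explicit = phi(A) @ phi(B).T
# Same quantity computed in input space: (gamma * <x, z> + coef0) ** degree
K_kernel = polynomial_kernel(A, B, degree=2, gamma=1.0, coef0=0.0)

print(np.allclose(K_explicit, K_kernel))
```

The kernel evaluation never builds the feature vectors, which is what makes high-dimensional (or infinite-dimensional) feature maps usable in the dual form.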

In [None]:
from sklearn.svm import SVR

In [None]:
svm = SVR(C=1.0,
          epsilon=0.1,
          kernel='linear',
          verbose=True)

For the linear SVR, the class `LinearSVR` can also be used, where the kernel trick is not applied.

In [None]:
svm.fit(X_dev, y_dev)

You can access statistics concerning the fit, such as the number of support vectors:

In [None]:
svm.n_support_

In [None]:
mean_absolute_error(y_test, svm.predict(X_test))

## Exercise

Evaluate the following models with a double (nested) cross-validation: ridge regression, ElasticNet, k-NN, and SVR. Report the average test MAE with its standard deviation.

5 folds of selection/test split; random seed = 42
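
As a starting point, the double cross-validation scheme can be sketched as follows: an outer loop produces the selection/test splits, and an inner grid search performs model selection on each selection set. The sketch uses `Lasso`, which is not one of the exercise models, and a hypothetical hyperparameter grid; using 5 inner folds is also an assumption.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_diabetes(return_X_y=True)

# Outer loop: 5 selection/test splits with random seed 42
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'alpha': [0.01, 0.1, 1.0]}  # hypothetical grid, for illustration

test_maes = []
for dev_idx, test_idx in outer_cv.split(X):
    # Inner loop: model selection using the selection set only
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    search = GridSearchCV(Lasso(max_iter=10000), param_grid,
                          scoring='neg_mean_absolute_error', cv=inner_cv)
    # refit=True (default): the best configuration is retrained on the whole selection set
    search.fit(X[dev_idx], y[dev_idx])
    # The test fold is used once, only for the final assessment
    test_maes.append(mean_absolute_error(y[test_idx], search.predict(X[test_idx])))

print(f'test MAE: {np.mean(test_maes):.2f} +/- {np.std(test_maes):.2f}')
```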

Submit the results here: https://tinyurl.com/ml2025-sklearn

In [None]:
from sklearn.datasets import load_diabetes

In [None]:
X, y = load_diabetes(return_X_y=True)