___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[MEEC](https://ise.ualg.pt/en/curso/1477) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)
___

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import numpy as np

# Linear regression

## Ordinary least squares (OLS)

Let us generate some data and split it before fitting it.

In [None]:
def make_wave(n_samples=100):
    """ builds a sample with n_samples in the form y = x + random()"""
    rnd = np.random.RandomState(1)
    x = rnd.uniform(-10, 10, size=n_samples)
    y_no_noise = x
    y = y_no_noise + (rnd.normal(size=len(x)))
    x = x.reshape(-1, 1) # reshape to a (n,1) shape
    return x, y 

X, y = make_wave(n_samples=100)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

plt.scatter(X, y)
# plt.show()

Use the Ordinary least squares Linear Regression model

In [None]:
ols = LinearRegression().fit(x_train, y_train)

print("ols.coef_: {}".format(ols.coef_))
print("ols.intercept_: {}".format(ols.intercept_))

An $R^2$ of around 0.9 is not bad at all...

(see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)
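For regressors, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The following is a minimal sketch that recomputes it by hand; the synthetic data and the names `x_demo`, `y_demo` and `model` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data, similar in spirit to make_wave above
rng = np.random.RandomState(0)
x_demo = rng.uniform(-10, 10, size=(50, 1))
y_demo = x_demo.ravel() + rng.normal(size=50)

model = LinearRegression().fit(x_demo, y_demo)
y_hat = model.predict(x_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("manual R^2: {:.4f}".format(r2_manual))
print("score():    {:.4f}".format(model.score(x_demo, y_demo)))
```

Both printed values should agree, since `score` evaluates the same formula on the data it is given.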

In [None]:
print("Training set score: {:.2f}".format(ols.score(x_train, y_train)))
print("Test set score: {:.2f}".format(ols.score(x_test, y_test)))

In [None]:
y_pred = ols.predict(x_test)

plt.plot(x_test, y_pred, label='pred')
plt.scatter(x_test, y_test, label='test')
plt.scatter(x_train, y_train, label='train')

plt.legend()
plt.show()

## Diabetes dataset

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

In [None]:
from sklearn.datasets import load_diabetes
diab = load_diabetes()

The dataset description is:

In [None]:
print(diab.DESCR)

The dataset has 442 samples and 10 features. The features are:

In [None]:
diab.feature_names

The target is progression of the disease. 

In [None]:
diab.target

Let us place the data in a pandas dataframe

In [None]:
import pandas as pd

df = pd.DataFrame(diab.data, columns=diab.feature_names)
df['evolution'] = diab.target
df.head()

Now we can use pandas to explore the dataset in a bit more detail. What conclusions can you draw from the data? Why are the feature values so small and centered around 0? (Suggestion: see the dataset description.)

In [None]:
df.describe()

### Applying OLS to the Diabetes data set

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

LinearRegression fits a linear model with coefficients $w = (w_1, \ldots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation, i.e., setting $\hat y = \sum_i w_i x_i + b$, OLS optimizes  $\min_{w}||y - Xw||^2_2$
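As a rough illustration of the minimization above, the same least-squares problem can be solved directly with NumPy. This is only a sketch on hypothetical synthetic data (`X_demo`, `y_demo`, `w_true` and `ols_demo` are made-up names); the intercept $b$ is handled by appending a column of ones to the feature matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with known weights and intercept
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + 4.0 + 0.1 * rng.normal(size=100)

# Append a column of ones so that the intercept b becomes the last weight
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
w_aug, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

ols_demo = LinearRegression().fit(X_demo, y_demo)

print("lstsq:            ", w_aug[:-1], w_aug[-1])
print("LinearRegression: ", ols_demo.coef_, ols_demo.intercept_)
```

The two solutions should coincide up to numerical precision.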

In [None]:
X, y = diab.data, diab.target
x_train, x_test, y_train, y_test = train_test_split(X, y, 
                                                    shuffle=True, 
                                                    train_size=.75,
                                                    random_state=42
                                                   )

In [None]:
ols = LinearRegression().fit(x_train, y_train)

When comparing the training and test set scores, we find that we predict more accurately on the training set than on the test set, as expected.

In [None]:
print("Training set score: {:.2f}".format(ols.score(x_train, y_train)))
print("Test set score: {:.2f}".format(ols.score(x_test, y_test)))

We can also compute other metrics, such as the mean squared error and the mean absolute error:

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = ols.predict(x_test) 

print("Mean squared error: {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("Mean absolute error: {:.2f}".format(mean_absolute_error(y_test, y_pred)))
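To make the two metrics concrete, here is a small sketch that computes them by hand on a toy example (the arrays `y_true_demo` and `y_pred_demo` are made-up values) and compares the results with scikit-learn. The square root of the MSE (RMSE) is also shown, since it is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up targets and predictions
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true_demo - y_pred_demo
mse_manual = np.mean(errors ** 2)      # mean of the squared errors
mae_manual = np.mean(np.abs(errors))   # mean of the absolute errors
rmse_manual = np.sqrt(mse_manual)      # same units as the target

print("MSE:  {:.3f}".format(mse_manual))   # 0.375
print("MAE:  {:.3f}".format(mae_manual))   # 0.500
print("RMSE: {:.3f}".format(rmse_manual))
```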

### Applying Ridge regression to the Diabetes data set

Recall that Ridge regression minimizes the objective function:
$||y - Xw||^2_2 + \alpha ||w||^2_2$

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
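The Ridge objective has a closed-form minimizer, $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below checks this against `Ridge` on hypothetical synthetic data (all `*_demo` names are illustrative). It assumes the intercept is not penalized, which is handled here by centering `X` and `y` before solving.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=80)
alpha_demo = 1.0

# Center the data so that the intercept is left out of the penalty
x_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - x_mean, y_demo - y_mean

# Solve (Xc^T Xc + alpha I) w = Xc^T yc
A = Xc.T @ Xc + alpha_demo * np.eye(Xc.shape[1])
w_manual = np.linalg.solve(A, Xc.T @ yc)
b_manual = y_mean - x_mean @ w_manual

rr_demo = Ridge(alpha=alpha_demo).fit(X_demo, y_demo)

print("closed form:", w_manual, b_manual)
print("Ridge:      ", rr_demo.coef_, rr_demo.intercept_)
```

Larger values of `alpha_demo` shrink `w_manual` towards zero, which is the regularization effect explored in the cells that follow.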



In [None]:
rr = Ridge(alpha=1).fit(x_train, y_train)

print("rr.coef_: {}".format(rr.coef_))
print("rr.intercept_: {}".format(rr.intercept_))


In [None]:
print("Training set score: {:.2f}".format(rr.score(x_train, y_train)))
print("Test set score: {:.2f}".format(rr.score(x_test, y_test)))

### Comparing OLS and Ridge

Let us compare the OLS and Ridge regression models on the Diabetes dataset. For the Ridge regression model, we will vary the value of the regularization parameter $\alpha$. From the plot below, we can see that a suitable value of $\alpha$ slightly improves the performance.

In [None]:
plt.figure(figsize=(15,5))

ridge_scores_train = []
ridge_scores_test = []

alphas = np.arange(0, 2, 0.01)

ols = LinearRegression().fit(x_train, y_train)

for alpha in alphas:
    rr = Ridge(alpha=alpha, ).fit(x_train, y_train)
    ridge_scores_train.append(rr.score(x_train, y_train))
    ridge_scores_test.append(rr.score(x_test, y_test))

plt.plot(alphas, ols.score(x_train, y_train) * np.ones(len(alphas)), '--', label='OLS - train')
plt.plot(alphas, ols.score(x_test, y_test) * np.ones(len(alphas)), '--', label='OLS - test')

plt.plot(alphas, ridge_scores_train, label='Ridge - train')
plt.plot(alphas, ridge_scores_test, label='Ridge - test')

plt.legend()

plt.ylabel('score')
plt.xlabel('alpha')

plt.show()

### Exercises

Fix $\alpha = 0.1$ and then investigate how the size of the training dataset affects the score

## Extended Diabetes dataset

If we look at the data, we can see that the columns have values of different magnitudes.

In [None]:
df.describe()

So, let us do some data transformation, namely:
- scaling
- polynomial combinations of the features

In [None]:
def do_extended_diab():
    from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

    diab = load_diabetes()

    # Transforms features by scaling each feature to a given range ([0, 1] by default).
    X = MinMaxScaler().fit_transform(diab.data)
    
    # Generate a new feature matrix consisting of all polynomial combinations of the features 
    # with degree less than or equal to the specified degree. 
    # For example, if an input sample is two dimensional and of the form [a, b], 
    # the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
    poly_fit = PolynomialFeatures(degree=2, include_bias=False)
    X = poly_fit.fit_transform(X)
    
    return X, diab.target, poly_fit.get_feature_names_out(diab.feature_names)

X, y, feature_names = do_extended_diab()
print(X.shape)

df = pd.DataFrame(X, columns=feature_names)
df.head()

Now, the extended Diabetes dataset has 442 samples and 65 features.
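The count of 65 features can be checked with a short calculation: with 10 original features and degree 2 there are 10 linear terms, 10 squares and 45 pairwise products. A minimal sketch of the general formula (the variable names are illustrative):

```python
from math import comb

n_features, degree = 10, 2

# Number of monomials of degree 1..degree in n_features variables
# (the "- 1" removes the constant term, since include_bias=False)
n_poly = comb(n_features + degree, degree) - 1

# Same count, term by term: linear + squares + pairwise products
n_by_hand = n_features + n_features + comb(n_features, 2)

print(n_poly, n_by_hand)   # 65 65
```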

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42
)

### Applying OLS on the extended Diabetes dataset

In [None]:
ols = LinearRegression().fit(x_train, y_train)

When comparing the training and test set scores, we find that we predict very accurately on the training set (overfitting?), while the $R^2$ on the test set is worse, as expected.

In [None]:
print("Training set score: {:.2f}".format(ols.score(x_train, y_train)))
print("Test set score: {:.2f}".format(ols.score(x_test, y_test)))

### Applying Ridge regression on the extended Diabetes data set



In [None]:
rr = Ridge(alpha=1).fit(x_train, y_train)

print("rr.coef_: {}".format(rr.coef_))
print("rr.intercept_: {}".format(rr.intercept_))


In [None]:
print("Training set score: {:.2f}".format(rr.score(x_train, y_train)))
print("Test set score: {:.2f}".format(rr.score(x_test, y_test)))

### Comparing OLS and Ridge on the extended Diabetes dataset

In [None]:
plt.figure(figsize=(15,5))

ridge_scores_train = []
ridge_scores_test = []

alphas = np.arange(0, 2, 0.01)

ols = LinearRegression().fit(x_train, y_train)

# ols.score(x_test, y_test)

for alpha in alphas:
    rr = Ridge(alpha=alpha).fit(x_train, y_train)
    ridge_scores_train.append(rr.score(x_train, y_train))
    ridge_scores_test.append(rr.score(x_test, y_test))

plt.plot(alphas, ols.score(x_train, y_train) * np.ones(len(alphas)), '--', label='OLS - train')
plt.plot(alphas, ols.score(x_test, y_test) * np.ones(len(alphas)), '--', label='OLS - test')

plt.plot(alphas, ridge_scores_train, label='Ridge - train')
plt.plot(alphas, ridge_scores_test, label='Ridge - test')

plt.legend()

plt.ylabel('score')
plt.xlabel('alpha')

plt.show()

### Exercises

1. Fix $\alpha = 0.1$ and then investigate how the size of the training dataset affects the score
1. Do a similar study, but with "scaling" and "polynomial combinations of the features" applied individually.

## Applying Lasso Regression on the extended Diabetes dataset
Just to recall, the optimization objective for Lasso is:
$\frac{1}{2  n_{samples}}  ||y - Xw||^2_2 + \alpha  ||w||_1$

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
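To connect the formula with the estimator, the sketch below evaluates the Lasso objective by hand on hypothetical synthetic data in which only the first two features carry signal. The helper `lasso_objective` and all `*_demo` names are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_objective(X, y, w, b, alpha):
    """(1 / (2 n)) * ||y - (Xw + b)||_2^2 + alpha * ||w||_1 (illustrative helper)."""
    n = X.shape[0]
    residual = y - (X @ w + b)
    return residual @ residual / (2 * n) + alpha * np.sum(np.abs(w))

# Hypothetical data: features 2, 3 and 4 are pure noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)
alpha_demo = 0.5

lasso_demo = Lasso(alpha=alpha_demo).fit(X_demo, y_demo)
ols_demo = LinearRegression().fit(X_demo, y_demo)

obj_lasso = lasso_objective(X_demo, y_demo, lasso_demo.coef_, lasso_demo.intercept_, alpha_demo)
obj_ols = lasso_objective(X_demo, y_demo, ols_demo.coef_, ols_demo.intercept_, alpha_demo)

print("Lasso coefficients:", lasso_demo.coef_)
print("OLS coefficients:  ", ols_demo.coef_)
print("objective at Lasso solution: {:.3f}".format(obj_lasso))
print("objective at OLS solution:   {:.3f}".format(obj_ols))
```

The Lasso solution should reach a lower objective value than the OLS solution, and the $\ell_1$ penalty tends to set the coefficients of uninformative features exactly to zero, which is why Lasso can be used for feature selection.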

In [None]:
lr = Lasso().fit(x_train, y_train)

print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))

In [None]:
print("Training set score: {:.2f}".format(lr.score(x_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(x_test, y_test)))
print("Number of features used: {}".format(np.sum(lr.coef_ != 0)))

As you can see, with the default parameters Lasso does not do well on either the training or the test set. This indicates that we are underfitting, and we find that it used only 4 of the 65 features.

However, if we lower the alpha parameter, some improvement is achieved.

In [None]:
lr = Lasso(alpha=0.2, max_iter=10000).fit(x_train, y_train)

print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(x_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(x_test, y_test)))
print("Number of features used: {}".format(np.sum(lr.coef_ != 0)))

In [None]:
plt.figure(figsize=(15, 10))

ridge_scores_train = []
ridge_scores_test = []

lasso_scores_train = []
lasso_scores_test = []

# Start at 0.01: Lasso with alpha=0 is plain OLS, and coordinate descent converges poorly there
alphas = np.arange(0.01, 2, .01)

ols = LinearRegression().fit(x_train, y_train)
ols.score(x_test, y_test)

for alpha in alphas:
    rr = Ridge(alpha=alpha).fit(x_train, y_train)
    ridge_scores_train.append(rr.score(x_train, y_train))
    ridge_scores_test.append(rr.score(x_test, y_test))

    lr = Lasso(alpha=alpha, max_iter=100000).fit(x_train, y_train)
    lasso_scores_train.append(lr.score(x_train, y_train))
    lasso_scores_test.append(lr.score(x_test, y_test))

plt.plot(alphas, ols.score(x_train, y_train) * np.ones(len(alphas)), '--', label='OLS - train')
plt.plot(alphas, ols.score(x_test, y_test) * np.ones(len(alphas)), '--', label='OLS - test')

plt.plot(alphas, ridge_scores_train, label='Ridge - train')
plt.plot(alphas, ridge_scores_test, label='Ridge - test')

plt.plot(alphas, lasso_scores_train, label='Lasso - train')
plt.plot(alphas, lasso_scores_test, label='Lasso - test')

plt.legend(loc='lower right')

plt.grid(True)

plt.ylabel('score')
plt.xlabel('alpha')

plt.show()

A lower alpha allowed us to fit a more complex model, which worked better on both the training and the test data. The performance is slightly better than with Ridge, and we are using only some of the 65 features, which makes this model potentially easier to understand.

### Exercises

Make a similar analysis for the Wine Quality dataset, which you can find here: https://archive.ics.uci.edu/dataset/186/wine+quality