# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features
data = data.select_dtypes(np.number)
# Express the sale price in thousands of dollars (k$)
target /= 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [2]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. Make the use of the
$R^2$ score explicit by setting the `scoring` parameter (even though it is
the default score for regressors).

In [5]:
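# Note: `SCORERS` was removed in scikit-learn 1.3; on recent versions,
# `sklearn.metrics.get_scorer_names()` lists the available scorer names.
from sklearn.metrics import SCORERS
SCORERS.keys()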

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei
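
Each name in this listing refers to a scorer object that can also be retrieved and called directly. The following is a minimal sketch, assuming synthetic data from `make_regression` in place of the Ames dataset; the variable names are illustrative only. It shows that error-based scorers carry a `neg_` prefix because they return the negated error, so that a higher value is always better.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic data stands in for the Ames dataset (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

# A scorer is called as scorer(estimator, X, y)
r2_scorer = get_scorer("r2")
neg_mae_scorer = get_scorer("neg_mean_absolute_error")

r2_value = r2_scorer(model, X, y)
neg_mae_value = neg_mae_scorer(model, X, y)

# The plain metric function returns the (positive) error itself
mae_value = mean_absolute_error(y, model.predict(X))

print(f"r2: {r2_value:.3f}")
print(f"neg MAE: {neg_mae_value:.3f}, MAE: {mae_value:.3f}")
```

The `"r2"` scorer matches what `model.score(X, y)` returns for a regressor, which is why it is the default when `scoring` is left unset.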

In [17]:
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=10)

r2_score = cross_val_score(regressor, data, target, cv=cv, scoring='r2')
print(f"r2 score: {r2_score.mean():.3f}")

r2 score: 0.794
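
As a reminder of what this score measures, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the target from its mean. The sketch below checks this formula against `sklearn.metrics.r2_score` on toy values chosen only for illustration. The function is imported under the alias `sklearn_r2_score` to avoid clashing with the `r2_score` variable used in the cell above.

```python
import numpy as np
from sklearn.metrics import r2_score as sklearn_r2_score

# Toy values chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Sum of squared residuals: 0.25 + 0.25 + 0 + 1 = 1.5
ss_res = np.sum((y_true - y_pred) ** 2)
# Sum of squared deviations from the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = sklearn_r2_score(y_true, y_pred)

print(f"manual: {r2_manual:.3f}, sklearn: {r2_sklearn:.3f}")
# prints "manual: 0.925, sklearn: 0.925"
```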

Then, instead of using the $R^2$ score, use the mean absolute error. Refer to
the documentation of the `scoring` parameter to find the corresponding scorer
name.

In [18]:
# The scorer returns the negated MAE, so flip the sign to recover the error
mae_score = -cross_val_score(regressor, data, target, cv=cv, scoring='neg_mean_absolute_error')
print(f"mae: {mae_score.mean():.3f}")


mae: 21.892
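
A single mean hides how much the error varies from fold to fold. The sketch below reports the mean together with the standard deviation across folds. It assumes synthetic data from `make_regression` so that it runs without the dataset file, and its variable names are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for the Ames dataset (assumption for this sketch)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

cv = KFold(n_splits=10)
neg_mae = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)

# One value per fold; negate to obtain the usual (positive) error
mae = -neg_mae

print(f"MAE: {mae.mean():.3f} +/- {mae.std():.3f}")
```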


Finally, use the `cross_validate` function to compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error, for instance. The cell
below uses the mean absolute percentage error and the maximum error.

In [15]:
from sklearn.model_selection import cross_validate

# No `cv` argument is given, so the default 5-fold cross-validation is used
cv_results = cross_validate(regressor, data, target, scoring=['neg_mean_absolute_percentage_error','max_error'])

In [16]:
cv_results

{'fit_time': array([0.0067625 , 0.0374763 , 0.0076139 , 0.00992942, 0.00927162]),
 'score_time': array([0.00495434, 0.01029444, 0.00821972, 0.00822902, 0.00458956]),
 'test_neg_mean_absolute_percentage_error': array([-0.12451987, -0.13837998, -0.12663902, -0.12690628, -0.14102181]),
 'test_max_error': array([-134.45742704, -354.06671337, -306.67550638, -238.41260029,
        -596.32259916])}
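
The dictionary returned by `cross_validate` holds one array per key, with one entry per fold, and each metric is stored under `test_<scorer name>`. A minimal sketch of how such results could be summarized follows, using the $R^2$ score and the mean absolute error. It assumes synthetic data from `make_regression` rather than the Ames dataset, and the column name `test_mean_absolute_error` is introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data stands in for the Ames dataset (assumption for this sketch)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

cv_results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=KFold(n_splits=10),
    scoring=["r2", "neg_mean_absolute_error"],
)

# One row per fold, one column per key of the returned dictionary
results = pd.DataFrame(cv_results)

# Flip the sign of the negated error and store it under a clearer name
results["test_mean_absolute_error"] = -results.pop("test_neg_mean_absolute_error")

summary = results[["test_r2", "test_mean_absolute_error"]].agg(["mean", "std"])
print(summary)
```

Wrapping the results in a `DataFrame` is only a convenience; the same statistics can be computed directly on the arrays in `cv_results`.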