# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features.
data = data.select_dtypes(np.number)
# Express the target in thousands of dollars (k$).
target /= 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [3]:
# Write your code here.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. Make the use of the
$R^2$ score explicit by setting the `scoring` parameter (even though it is
the default score for regressors).
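To make the cross-validation strategy itself explicit, one option is to pass a `KFold` object rather than an integer. The sketch below uses a synthetic dataset from `make_regression` as a stand-in for the Ames data (an assumption, so that the snippet runs on its own). For a regressor, `cv=10` and `KFold(n_splits=10)` are expected to build the same unshuffled folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, not the Ames dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
cv = KFold(n_splits=10)  # 10 contiguous folds, no shuffling

scores_kfold = cross_val_score(model, X, y, cv=cv, scoring="r2")
scores_int = cross_val_score(model, X, y, cv=10, scoring="r2")

print(f"R2 score: {scores_kfold.mean():.3f} +/- {scores_kfold.std():.3f}")
print("Same scores as cv=10:", np.allclose(scores_kfold, scores_int))
```

Passing the splitter object is mostly useful when you later want to shuffle the data or fix a `random_state`, which an integer `cv` does not allow.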

In [13]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(linreg, data, target, cv=10, scoring='r2')
print(f"R2 score: {scores.mean():.3f} +/- {scores.std():.3f}")

R2 score: 0.794 +/- 0.103


In [11]:
# Write your code here.
from sklearn.model_selection import cross_validate
result_linreg_r2 = cross_validate(linreg, data, target, cv=10, scoring="r2")
result_reg_r2_df = pd.DataFrame(result_linreg_r2)
result_reg_r2_df

,fit_time,score_time,test_score
0,0.006994,0.001011,0.843903
1,0.002989,0.001009,0.854974
2,0.003991,0.001009,0.887523
3,0.002992,0.001009,0.749511
4,0.00299,0.00101,0.81698
5,0.003992,0.000998,0.820134
6,0.003996,0.001,0.815541
7,0.002999,0.001,0.814525
8,0.004,0.001013,0.501158
9,0.002988,0.001001,0.833307


In [12]:
print(f"R2 result for linreg: {result_reg_r2_df['test_score'].mean():.3f} +/- {result_reg_r2_df['test_score'].std():.3f}")

R2 result for linreg: 0.794 +/- 0.109
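The two spreads reported above differ (0.103 versus 0.109) even though the fold scores are the same. The likely explanation is the default number of degrees of freedom: NumPy's `std` divides by `n` (`ddof=0`), whereas pandas' `Series.std` divides by `n - 1` (`ddof=1`). The check below reuses the ten `test_score` values printed in the table above.

```python
import numpy as np
import pandas as pd

# Fold-wise R2 scores copied from the table above.
r2_scores = np.array([
    0.843903, 0.854974, 0.887523, 0.749511, 0.816980,
    0.820134, 0.815541, 0.814525, 0.501158, 0.833307,
])

std_numpy = r2_scores.std()              # population std, ddof=0
std_pandas = pd.Series(r2_scores).std()  # sample std, ddof=1

print(f"NumPy  (ddof=0): {std_numpy:.3f}")
print(f"pandas (ddof=1): {std_pandas:.3f}")

# Passing ddof explicitly makes both libraries agree.
std_numpy_ddof1 = r2_scores.std(ddof=1)
print("Agree with ddof=1:", np.isclose(std_numpy_ddof1, std_pandas))
```

With only 10 folds the two conventions differ by about 5%, so it helps to state which one is reported.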


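Then, instead of using the $R^2$ score, use the mean absolute error. You need
to refer to the documentation for the `scoring` parameter: scikit-learn
scorers follow a "higher is better" convention, so errors are exposed as
negated values (for example `neg_mean_absolute_error`) and must be negated
again to recover the error.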
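As a quick check of how the scorer string relates to the metric function, the sketch below compares `mean_absolute_error` with the scorer returned by `get_scorer("neg_mean_absolute_error")`. It uses a synthetic dataset (an assumption, standing in for the Ames data) and evaluates on the training set purely to compare the two values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic stand-in for the housing data (assumption, not the Ames dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

# Metric function: takes true and predicted targets, lower is better.
mae = mean_absolute_error(y, model.predict(X))

# Scorer: takes (estimator, X, y) and returns the negated error,
# so that a higher value always means a better model.
scorer = get_scorer("neg_mean_absolute_error")
neg_mae = scorer(model, X, y)

print(f"mean_absolute_error:     {mae:.3f}")
print(f"neg_mean_absolute_error: {neg_mae:.3f}")
```

The scorer value is the metric with its sign flipped, which is why the cross-validation results need to be negated before being reported as errors.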

In [17]:
# Write your code here.
result_linreg_mae = cross_validate(linreg, data, target, cv=10, scoring="neg_mean_absolute_error")
result_reg_mae_df = pd.DataFrame(result_linreg_mae)
result_reg_mae_df

,fit_time,score_time,test_score
0,0.009,0.002999,-20.480499
1,0.003,0.001,-21.380031
2,0.003999,0.000999,-21.268315
3,0.003002,0.000998,-22.868877
4,0.002999,0.001,-24.799557
5,0.003998,0.001001,-18.958276
6,0.002998,0.001002,-20.117938
7,0.004,0.001,-20.504017
8,0.003997,0.001001,-26.767746
9,0.003,0.001001,-21.778711


In [15]:
scores = cross_val_score(linreg, data, target, cv=10, scoring='neg_mean_absolute_error')
scores = -scores
print(f"Mean Absolute Error: {scores.mean():.3f} +/- {scores.std():.3f}")

Mean Absolute Error: 21.892 +/- 2.225


In [18]:
# Negate the scores once so that both the mean and the spread are positive.
mae = -result_reg_mae_df["test_score"]
print(f"Mean Absolute Error result for linreg: {mae.mean():.3f} +/- {mae.std():.3f}")

Mean Absolute Error result for linreg: 21.892 +/- 2.346


Finally, use the `cross_validate` function and compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error for instance.
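With a list of scorer names, `cross_validate` stores each result under a `test_<scorer name>` key. As a variation, `scoring` can also be a dictionary mapping labels of your choice to scorer names, which gives shorter keys. The sketch below uses a synthetic dataset (an assumption, standing in for the Ames data), and the labels `"R2"` and `"MAE"` are illustrative names, not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the housing data (assumption, not the Ames dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative labels mapped to built-in scorer names.
scoring = {"R2": "r2", "MAE": "neg_mean_absolute_error"}
cv_results = cross_validate(LinearRegression(), X, y, cv=10, scoring=scoring)

print(sorted(cv_results))  # result keys follow the pattern test_<label>

scores_df = pd.DataFrame({
    "R2": cv_results["test_R2"],
    "MAE": -cv_results["test_MAE"],  # negate to recover a positive error
})
print(scores_df.agg(["mean", "std"]))
```

The model is fitted once per fold and every scorer is evaluated on the same predictions, so this costs less than calling `cross_validate` once per metric.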

In [19]:
# Write your code here.
scoring = ["r2", "neg_mean_absolute_error"]
result_linreg_duo = cross_validate(linreg, data, target, cv=10, scoring=scoring)

scores = {
    "R2": result_linreg_duo["test_r2"],
    "MAE": -result_linreg_duo["test_neg_mean_absolute_error"],
}
scores_df = pd.DataFrame(scores)
scores_df

,R2,MAE
0,0.843903,20.480499
1,0.854974,21.380031
2,0.887523,21.268315
3,0.749511,22.868877
4,0.81698,24.799557
5,0.820134,18.958276
6,0.815541,20.117938
7,0.814525,20.504017
8,0.501158,26.767746
9,0.833307,21.778711


In [20]:
result_linreg_duo

{'fit_time': array([0.01000905, 0.00489712, 0.00299668, 0.00354314, 0.0029912 ,
        0.00300097, 0.00398946, 0.00299072, 0.00299072, 0.00299215]),
 'score_time': array([0.00099087, 0.0010004 , 0.00100875, 0.00100684, 0.00100017,
        0.00100946, 0.        , 0.00100899, 0.00100827, 0.00100946]),
 'test_r2': array([0.84390289, 0.85497435, 0.88752303, 0.74951104, 0.81698014,
        0.82013355, 0.81554085, 0.81452472, 0.50115778, 0.83330693]),
 'test_neg_mean_absolute_error': array([-20.48049905, -21.38003105, -21.26831487, -22.86887664,
        -24.79955736, -18.95827641, -20.11793792, -20.5040172 ,
        -26.76774564, -21.77871056])}