# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features.
data = data.select_dtypes(np.number)
# Express the target in thousands of dollars (k$).
target = target / 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">For a deeper overview of this dataset, refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [4]:
from sklearn.linear_model import LinearRegression

model_linear = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. Make the use of the
$R^2$ score explicit by setting the `scoring` parameter (even though it is
the default score for regressors).

In [8]:
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=10)
cv_linear = cross_val_score(model_linear, data, target, cv=kfold, scoring='r2', n_jobs=2)
f"{cv_linear.mean():.3f} +/- {cv_linear.std():.3f}"

'0.794 +/- 0.103'
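
To check that setting `scoring="r2"` only makes the default explicit, the
sketch below compares both calls. It assumes a synthetic dataset generated
with `make_regression` as a stand-in for the housing data; the names `X_demo`,
`y_demo` and `cv_demo` are hypothetical and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# KFold without shuffling yields the same splits on every call.
cv_demo = KFold(n_splits=10)

# Default scoring: falls back to the estimator's `score` method,
# which computes R2 for regressors.
scores_default = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv_demo)

# Explicit scoring: same metric, requested by name.
scores_r2 = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv_demo, scoring="r2"
)

same_scores = np.allclose(scores_default, scores_r2)
print(same_scores)
```

Being explicit mostly helps readability: the metric is visible in the call
without knowing what the estimator's `score` method computes.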

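Then, instead of using the $R^2$ score, use the mean absolute error. Refer to
the documentation of the `scoring` parameter to find the matching string. Note
that scikit-learn exposes errors as negated scores (prefixed with `neg_`) so
that a higher value is always better; negate the result to recover the error.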

In [10]:
cv_linear2 = cross_val_score(model_linear, data, target, cv=kfold, scoring='neg_mean_absolute_error', n_jobs=2)
f"{-cv_linear2.mean():.3f} +/- {cv_linear2.std():.3f}"

'21.892 +/- 2.225'
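
The sign convention can be checked by hand. The sketch below, again on a
synthetic `make_regression` dataset (an assumption; `X_demo`, `y_demo` and
`cv_demo` are hypothetical names), compares the negated
`neg_mean_absolute_error` scores with the mean absolute error computed
manually on each fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
cv_demo = KFold(n_splits=5)

# Scores returned by scikit-learn: the negated error, so all values are <= 0.
neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    scoring="neg_mean_absolute_error",
)

# Same quantity computed manually, fold by fold, as a positive error.
manual_mae = []
for train_idx, test_idx in cv_demo.split(X_demo):
    fitted = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fitted.predict(X_demo[test_idx])
    manual_mae.append(mean_absolute_error(y_demo[test_idx], predictions))
manual_mae = np.array(manual_mae)

print(np.allclose(-neg_mae, manual_mae))
```

Negating the scores flips the sign of the mean but leaves the standard
deviation unchanged, which is why only the mean needs a minus sign when
reporting the error.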

Finally, use the `cross_validate` function to compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. For instance,
compute both the $R^2$ score and the mean absolute error.

In [13]:
from sklearn.model_selection import cross_validate

cv_linear3 = cross_validate(
    model_linear, data, target, cv=kfold,
    scoring=["r2", "neg_mean_absolute_error"], n_jobs=2,
)
# Each scorer is stored under the key "test_<scorer name>".
# Negate the error so that the MAE is reported as a positive value.
pd.DataFrame({"R2": cv_linear3["test_r2"], "MAE": -cv_linear3["test_neg_mean_absolute_error"]})

| | R2 | MAE |
|---|---|---|
| 0 | 0.843903 | 20.480499 |
| 1 | 0.854974 | 21.380031 |
| 2 | 0.887523 | 21.268315 |
| 3 | 0.749511 | 22.868877 |
| 4 | 0.81698 | 24.799557 |
| 5 | 0.820134 | 18.958276 |
| 6 | 0.815541 | 20.117938 |
| 7 | 0.814525 | 20.504017 |
| 8 | 0.501158 | 26.767746 |
| 9 | 0.833307 | 21.778711 |
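
A per-fold table is often condensed into a mean and a standard deviation per
metric. The sketch below shows one possible way to do so from the dictionary
returned by `cross_validate`. It assumes a synthetic `make_regression` dataset
in place of the housing data; `X_demo`, `y_demo`, `scores` and `summary` are
hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the housing data (assumption, for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

cv_results = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=KFold(n_splits=10),
    scoring=["r2", "neg_mean_absolute_error"],
)

# One row per fold; the error is negated to report a positive MAE.
scores = pd.DataFrame({
    "R2": cv_results["test_r2"],
    "MAE": -cv_results["test_neg_mean_absolute_error"],
})

# ddof=0 matches NumPy's default `std`; pandas uses ddof=1 by default.
summary = pd.DataFrame({"mean": scores.mean(), "std": scores.std(ddof=0)})
print(summary)
```

A summary like this hides individual folds, so it is worth looking at the full
table as well: in the results above, fold 8 has a much lower $R^2$ (about
0.50) than the other folds.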


In [None]:
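import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_predict

# Assumption: `target_predicted` holds the out-of-fold predictions of the
# linear model, obtained with the same cross-validation splitter as above.
target_predicted = cross_val_predict(model_linear, data, target, cv=kfold)

predicted_actual = {
    "True values (k$)": target, "Predicted values (k$)": target_predicted}
predicted_actual = pd.DataFrame(predicted_actual)

sns.scatterplot(data=predicted_actual,
                x="True values (k$)", y="Predicted values (k$)",
                color="black", alpha=0.5)
# Points on this diagonal are predicted exactly.
plt.axline((0, 0), slope=1, label="Perfect fit")
plt.axis('square')
plt.legend()
_ = plt.title("Regression using a model without \ntarget transformation")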