# Model Assessment 

Thus far, our assessment of a model's predictive performance has been based on its ability to
predict the same data used to train the model.

This approach typically does a poor job of quantifying how well our model generalises to data
it hasn't seen before, because a model tends to fit its own training data better than new data.
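To make the point concrete, here is a minimal sketch using a deliberately extreme model: a 1-nearest-neighbour predictor, which simply memorises its training data. This is not the model used in this section, and the data, the names `x_seen`, `y_seen`, `x_new`, `y_new` and the helper functions are all hypothetical. It shows only that an error measured on the training data can be badly over-optimistic.

```python
import random

random.seed(0)

# Hypothetical noisy data: y = 2x + Gaussian noise
x_seen = [float(i) for i in range(20)]
y_seen = [2.0 * x + random.gauss(0, 1) for x in x_seen]
x_new = [i + 0.5 for i in range(20)]
y_new = [2.0 * x + random.gauss(0, 1) for x in x_new]


def predict_1nn(x):
    """Predict with the response of the closest training point."""
    nearest = min(range(len(x_seen)), key=lambda i: abs(x_seen[i] - x))
    return y_seen[nearest]


def mse(xs, ys):
    """Mean squared error of the 1-NN predictions on (xs, ys)."""
    return sum((predict_1nn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_mse = mse(x_seen, y_seen)  # every training point is its own nearest neighbour
test_mse = mse(x_new, y_new)     # new points were never memorised

print("training MSE:", train_mse)
print("unseen-data MSE:", test_mse)
```

The training error is exactly zero here, while the error on the new points is not, so the training error tells us nothing useful about how the model behaves on unseen data.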

## Cross validation

Cross validation can be an invaluable tool in assessing the predictive ability of a model.

In cross validation we typically divide our training data into $k$ approximately equally sized
groups.

Each group is left out in turn as a test set, with the other $k-1$ groups being used as a training
set.

Every observation is therefore used for testing exactly once, and averaging the $k$ test scores
gives an estimate of how well our model generalises.
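The splitting step can be sketched in plain Python. The function `k_fold_indices` below is a hypothetical helper written for illustration, not a scikit-learn function. It assumes contiguous folds with no shuffling, and gives the first few folds one extra observation when the data do not divide evenly.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra observation
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current test fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With 10 observations and $k = 3$ the test folds have sizes 4, 3 and 3, and each index appears in exactly one test fold.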

## Cross validation: example

The `sklearn.metrics` module provides a number of scoring functions for different modelling tasks.

Additionally, `sklearn.model_selection.cross_validate` can be used to run a cross validation
procedure.

Let's begin by loading some standard libraries:


In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd

We will again be working with the Ames housing dataset:


In [6]:
from sklearn.datasets import fetch_openml

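# Fetch the Ames housing data from OpenML as pandas objects
housing = fetch_openml(name="house_prices", as_frame=True)
X_train = housing.data
# Rescale sale prices from dollars to units of $100,000
y_train = housing.target / 100000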

Let's select 10 columns to work with:


In [7]:
features = [
    "LotArea",
    "YearBuilt",
    "OverallQual",
    "YearRemodAdd",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageArea",
    "GarageCars",
    "PoolArea",
    "YrSold",
]
X_train = X_train[features]

Now we set up a modelling `Pipeline` for the Ames data:


In [8]:
model = Pipeline(
    [
        ("pre", StandardScaler()),
        ("reg", LinearRegression()),
    ]
)
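One reason to bundle the scaler and the regression into a single `Pipeline` is that the scaler is then fitted only on whatever data is passed to `fit()`. The sketch below checks this on synthetic data. The arrays `X_demo` and `y_demo`, the 80/20 split and the name `demo_model` are assumptions made for illustration and are not part of the Ames example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Hold back the last 20 rows; the pipeline never sees them during fitting
X_fit, X_held_out = X_demo[:80], X_demo[80:]
y_fit = y_demo[:80]

demo_model = Pipeline(
    [
        ("pre", StandardScaler()),
        ("reg", LinearRegression()),
    ]
)
demo_model.fit(X_fit, y_fit)

# The scaler's learned means come from the 80 fitting rows only
scaler_means = demo_model.named_steps["pre"].mean_
print(np.allclose(scaler_means, X_fit.mean(axis=0)))

# predict() reuses those stored means to scale the held-out rows
predictions = demo_model.predict(X_held_out)
```

When such a pipeline is refitted on each set of training folds during cross validation, the held-out fold plays no part in estimating the scaling parameters.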

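The `scoring` argument of `cross_validate()` expects a scorer that is called as
`scorer(estimator, X, y)`, whereas the metric functions take `(y_true, y_pred)`. To use them we
must therefore wrap them with the `make_scorer()` function: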


In [9]:
from sklearn.metrics import mean_squared_error, make_scorer

score_fn = make_scorer(mean_squared_error)
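To see what the wrapped scorer does, the sketch below calls it directly on a fitted model and compares the result with the plain metric. The synthetic data and the names `fitted`, `mse_scorer` and `neg_mse_scorer` are assumptions made for illustration. It also shows the `greater_is_better=False` option of `make_scorer()`, which negates the metric so that larger values are always better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))  # synthetic features
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)

fitted = LinearRegression().fit(X_demo, y_demo)

# Default: the scorer returns the metric value unchanged
mse_scorer = make_scorer(mean_squared_error)
# With greater_is_better=False the scorer returns the negated metric
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

direct = mean_squared_error(y_demo, fitted.predict(X_demo))
via_scorer = mse_scorer(fitted, X_demo, y_demo)
negated = neg_mse_scorer(fitted, X_demo, y_demo)

print(direct, via_scorer, negated)
```

For simply reporting errors the default is convenient because the scores read as ordinary mean squared errors. The negated form matters mainly when a score is used to rank models, since scikit-learn's model selection tools treat higher scores as better.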

We can now perform 10-fold cross validation, which divides the data into 10 approximately equally
sized groups, with:


In [10]:
from sklearn.model_selection import cross_validate

scores = cross_validate(
    model, X_train, y_train, scoring=score_fn, cv=10
)
pd.DataFrame(scores)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 0.00349 | 0.001045 | 0.108807 |
| 1 | 0.002632 | 0.000911 | 0.123177 |
| 2 | 0.001854 | 0.000786 | 0.109383 |
| 3 | 0.002164 | 0.001017 | 0.185055 |
| 4 | 0.002051 | 0.000821 | 0.25355 |
| 5 | 0.001951 | 0.000827 | 0.138271 |
| 6 | 0.001772 | 0.000757 | 0.13271 |
| 7 | 0.002046 | 0.000909 | 0.123239 |
| 8 | 0.001859 | 0.000771 | 0.340745 |
| 9 | 0.001699 | 0.000754 | 0.124159 |
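The ten `test_score` values are usually summarised by their mean and spread. The sketch below copies the values from the output above into a plain list, so it runs without refitting anything. The variable names are assumptions made for illustration. Because the sale prices were divided by 100000, the square root of the mean score is an error in units of $100,000.

```python
from math import sqrt
from statistics import mean, stdev

# test_score column copied from the cross validation output above
test_scores = [
    0.108807, 0.123177, 0.109383, 0.185055, 0.25355,
    0.138271, 0.13271, 0.123239, 0.340745, 0.124159,
]

mean_mse = mean(test_scores)  # average MSE across the 10 folds
sd_mse = stdev(test_scores)   # fold-to-fold variability
rmse = sqrt(mean_mse)         # root of the mean MSE, in units of $100,000

print(f"mean MSE: {mean_mse:.4f}")
print(f"sd of MSE: {sd_mse:.4f}")
print(f"RMSE in dollars: {rmse * 100_000:,.0f}")
```

The mean squared error is about 0.164, which corresponds to a typical prediction error of roughly $40,000. Folds 4 and 8 have noticeably larger errors than the rest, which shows how much a single train/test split can vary.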
