# Introduction
In **cross-validation** we divide our data into multiple subsets to get multiple measures of model quality.

Since machine learning is an iterative process, we run our model in multiple experiments, where each experiment uses a different subset (a *fold*) as the validation data and the remaining data for training.

In [1]:
from IPython.display import Image
Image(url="https://i.imgur.com/9k60cVA.png")
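
To make the fold rotation concrete, here is a minimal sketch using scikit-learn's `KFold` on ten toy samples. The array `X_toy` and the choice of five unshuffled folds are assumptions made for illustration; they are not part of the housing example in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, purely for illustration (not the housing data).
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, each fold is a contiguous block of rows.
kfold = KFold(n_splits=5)
folds = list(kfold.split(X_toy))

# Each experiment holds out a different fold for validation
# and trains on all the remaining rows.
for i, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(f"Experiment {i}: validate on {valid_idx}, train on {train_idx}")
```

Every row ends up in the validation set exactly once, which is why the procedure yields one quality measure per fold.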

While cross-validation provides a more accurate measure of model quality, it requires significantly more computation, because the model is trained once per fold. For this reason, it is best used on small data sets where this extra burden is acceptable.

As a rule of thumb, if the model takes a couple of minutes or less to run, it is worth switching to cross-validation.

Pipelines greatly simplify the cross-validation process and reduce the amount of code required.

# Code

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import pandas as pd

data = pd.read_csv('./ressources/melb_data.csv')
y = data.Price

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50,
                                                              random_state=0))
                             ])

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]
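
A common next step, not shown in the cell above, is to collapse the per-fold scores into a single number. The sketch below copies the five values from the printed output and reports their mean and spread; the variable names `fold_scores`, `mean_mae` and `std_mae` are illustrative.

```python
import numpy as np

# Fold scores copied from the printed output above.
fold_scores = np.array([301628.7893587, 303164.4782723, 287298.331666,
                        236061.84754543, 260383.45111427])

# The mean is the usual single-number summary of model quality;
# the standard deviation hints at how much the estimate varies by fold.
mean_mae = fold_scores.mean()
std_mae = fold_scores.std()

print(f"Average MAE: {mean_mae:,.0f}")
print(f"Std across folds: {std_mae:,.0f}")
```

In the notebook itself, `scores.mean()` on the array returned by `cross_val_score` would give the same average without retyping the values.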


Scikit-learn has a convention where all metrics are defined so that a higher number is better. Reporting the negative MAE keeps this scorer consistent with that convention, even though negative MAE is almost unheard of elsewhere.
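
The sign convention can be checked directly. The following sketch uses synthetic data from `make_regression` and a `LinearRegression` model, both chosen only to keep the example fast; they are assumptions and stand in for the housing data and random forest used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=10.0, random_state=0)

# The scorer returns the negated MAE, so every value is <= 0
# and "higher is better" still holds.
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5,
                          scoring='neg_mean_absolute_error')

# Flip the sign to recover the familiar, non-negative MAE.
mae = -neg_mae

print("Raw scorer output:", neg_mae)
print("MAE per fold:", mae)
```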

Cross-validation gives a much more reliable estimate of the MAE than a single validation split, and the code is much shorter: we no longer need to keep track of separate training and validation sets.
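
To see how much bookkeeping `cross_val_score` saves, here is a sketch of the equivalent manual loop. It again relies on synthetic data and a `LinearRegression` model as stand-ins, and it assumes the unshuffled `KFold` splitting that scikit-learn applies to regressors when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used here only for illustration.
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=10.0, random_state=0)
model = LinearRegression()

# Manual version: split, fit, predict and score for each fold.
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[valid_idx])
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], preds))
manual_scores = np.array(manual_scores)

# One-call version: the same folds and the same metric.
auto_scores = -1 * cross_val_score(model, X_demo, y_demo,
                                   cv=5,
                                   scoring='neg_mean_absolute_error')

print("Manual loop:    ", manual_scores)
print("cross_val_score:", auto_scores)
```

Both approaches should produce matching scores; the one-call version simply hides the splitting and refitting.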