# Model Validation

Okay, we have built a model. But how good is it? We saw something interesting when comparing the predictions with the true values...

You'll want to evaluate every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive **accuracy**. In other words, will the model's predictions be close to what actually happens? This is especially true for regression tasks; it is a bit more complicated for classification tasks.

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. 

**Can you see why this is a problem?**


You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

## Defining accuracy for our case

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is: `error = actual - predicted`

So, if a house cost $150,000 and you predicted it would cost $100,000 the error is $50,000.

With the MAE metric, we take the absolute value of each error. This drops the sign, so over-predictions and under-predictions cannot cancel each other out. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be read as

> On average, our predictions are off by about X.
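
As a quick sanity check of the definition, here is a minimal sketch that computes the MAE by hand for three made-up houses. The prices and the helper name `mean_absolute_error_by_hand` are illustrative assumptions, not values from the housing dataset.

```python
# Made-up actual and predicted prices for three houses (illustrative only)
actual = [150_000, 200_000, 120_000]
predicted = [100_000, 210_000, 120_000]


def mean_absolute_error_by_hand(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


mae_example = mean_absolute_error_by_hand(actual, predicted)
print(f'On average, our predictions are off by about ${mae_example:,.0f}')
```

The individual absolute errors here are $50,000, $10,000 and $0, so the average is $20,000.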

To calculate the MAE, we need two things:

* A model (we did this in our previous notebook)
* Predictions for our data

To calculate the MAE, we will also use `scikit-learn`.

In [None]:
# Don't modify
import pandas as pd

from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('../data/housing/train.csv')

features = [
    'LotArea',
    'YearBuilt',
    '1stFlrSF',
    '2ndFlrSF',
    'FullBath',
    'BedroomAbvGr',
    'TotRmsAbvGrd'
]
target = 'SalePrice'

X = df[features]
y = df[target]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

predictions = model.predict(X)

Now you have trained a model and generated predictions. The function `mean_absolute_error` from the module `sklearn.metrics` will help you calculate the MAE. Try importing it and printing the error:

In [None]:
from _ import _ # Import mean_absolute_error here

mae =  # calculate MAE between y and predictions
print(f'MAE for the model: {mae}')

Looks good, right? Given the prices of the houses, a MAE of ~62 looks amazing!

## The Problem with "In-Sample" Scores
The measure we just computed can be called an _"in-sample"_ score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the real estate market at large, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
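
To make this concrete, here is a toy sketch of a "model" that only remembers the average price it saw for each door color. The data, the `memorized_price` dictionary, and the `toy_predict` and `toy_mae` helpers are all made up for illustration. A real decision tree is more sophisticated, but it can latch onto a spurious pattern in a similar way.

```python
# Made-up training sample: every green-door home happens to be expensive
toy_train = [('green', 500_000), ('green', 520_000), ('red', 150_000), ('red', 170_000)]
# Made-up new homes: here door color says nothing useful about price
toy_new = [('green', 160_000), ('red', 510_000)]

# "Train": remember the average price seen for each door color
prices_by_color = {}
for color, price in toy_train:
    prices_by_color.setdefault(color, []).append(price)
memorized_price = {c: sum(p) / len(p) for c, p in prices_by_color.items()}


def toy_predict(color):
    """Predict the average training price for this door color."""
    return memorized_price[color]


def toy_mae(homes):
    """Mean absolute error of toy_predict over (color, price) pairs."""
    return sum(abs(price - toy_predict(color)) for color, price in homes) / len(homes)


in_sample_mae = toy_mae(toy_train)
new_data_mae = toy_mae(toy_new)
print(f'MAE on the training sample: {in_sample_mae:,.0f}')
print(f'MAE on new homes: {new_data_mae:,.0f}')
```

On the training sample the toy model is off by 10,000 on average, while on the new homes it is off by 350,000: the pattern it learned does not carry over.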

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on examples it hasn't seen before. This held-out data is called **validation data**.
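
The idea of holding data out can be sketched by hand using only the standard library. The eight row indices, the 25% validation share, and the seed below are arbitrary choices for illustration; this is not a description of how any particular library implements its split.

```python
import random

# Pretend we have 8 rows; we only shuffle their positions
row_indices = list(range(8))

rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(row_indices)

# Hold out the last 25% of the shuffled rows for validation
n_val = len(row_indices) // 4
val_rows = row_indices[-n_val:]
train_rows = row_indices[:-n_val]

print(f'Training rows: {sorted(train_rows)}')
print(f'Validation rows: {sorted(val_rows)}')
```

Shuffling before splitting matters when the rows are stored in a meaningful order (for example, sorted by price), because otherwise the validation rows would not resemble the training rows.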

As usual, `scikit-learn` has useful functions to help with this. Specifically, the `train_test_split` function from the `sklearn.model_selection` module does exactly this.

Try importing that function and splitting the data:

In [None]:
from _ import _ # Import train_test_split

train_X, val_X, train_y, val_y =  # use train_test_split to split the data

Now, try retraining the model with only the training data, and run predictions on the validation data. What happens to the MAE?

In [None]:
# 1. Retrain model with train_X and train_y


# 2. Run predictions on val_X


# 3. Calculate MAE between predictions and val_y


print(f'MAE for the model: {mae}')

Wow! That went up quite a lot... What happened?