# Model Validation

## Intro

You've built a model. But how good is it?

In this lesson, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

## What is Model Validation

You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word, error.

The prediction error for each house is:
```
error = actual − predicted
```

So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.

With the **MAE** metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality.

In plain English, it can be read as:
> On average, our predictions are off by about X.
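
As a minimal sketch of the steps just described, the calculation can be done by hand on three houses. The prices below are made up for illustration and are not taken from the Melbourne data.

```python
# Hypothetical actual and predicted prices for three houses (illustrative values only)
actual = [150_000, 200_000, 320_000]
predicted = [100_000, 210_000, 290_000]

# Error for each house: actual - predicted
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value of each error, then average them
absolute_errors = [abs(e) for e in errors]
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 30000]
print(mae)     # 30000.0
```

Here the predictions are off by 30,000 dollars on average, even though one error is negative and the others are positive.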

To calculate **MAE**, we first need a model. That is built in the cell below:

In [3]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Load data
melb_file_path = '../input/melbourne-housing/melb_data.csv'
melb_data = pd.read_csv(melb_file_path)
# Drop rows that contain any missing values
filtered_melb_data = melb_data.dropna(axis=0)
# Choose target and features
y = filtered_melb_data.Price
melb_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                 'YearBuilt', 'Lattitude', 'Longtitude']

X = filtered_melb_data[melb_features]
# Define model
melb_model = DecisionTreeRegressor()
# Fit model
melb_model.fit(X, y)

DecisionTreeRegressor()

Once we have a model, here is how we calculate the mean absolute error:

In [4]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melb_model.predict(X)
print("MAE: {:.2f}".format(mean_absolute_error(y, predicted_home_prices)))

MAE: 434.72


## The Problem with "In-Sample" Scores

The measure we just computed can be called an "**in-sample**" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data. But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, **we need to measure performance on data that wasn't used to build the model**.

The most straightforward way to do this is to exclude some data from the model-building process, and then use that held-out data to test the model's accuracy on data it hasn't seen before. This data is called **validation data**.
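
To make the green-door story concrete, here is a toy sketch in plain Python. The "model" simply predicts the average training price for each door color. All of the prices and the helper names `fit_mean_by_color` and `mae` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical training sample: by chance, every green-door home is expensive
train = [("green", 900_000), ("green", 950_000), ("red", 400_000), ("red", 450_000)]
# Hypothetical new homes, where door color is unrelated to price
new = [("green", 420_000), ("green", 880_000), ("red", 910_000), ("red", 430_000)]

def fit_mean_by_color(rows):
    """Learn the average price for each door color."""
    prices = {}
    for color, price in rows:
        prices.setdefault(color, []).append(price)
    return {color: sum(values) / len(values) for color, values in prices.items()}

def mae(model, rows):
    """Mean absolute error of the per-color predictions on the given rows."""
    return sum(abs(price - model[color]) for color, price in rows) / len(rows)

model = fit_mean_by_color(train)
in_sample_mae = mae(model, train)
new_data_mae = mae(model, new)

print(in_sample_mae)  # 25000.0
print(new_data_mae)   # 260000.0
```

The in-sample score looks excellent because the door-color pattern was learned from those same homes. On the new homes, where the pattern does not hold, the error is roughly ten times larger.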

## Coding It

The **scikit-learn** library has a function `train_test_split` to break up the data into two pieces. We'll use the first piece as training data to fit the model and we'll use the other data as validation data to calculate `mean_absolute_error`.

Here is the code:

In [5]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# Supplying the random_state argument guarantees we get the same split every time
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Define and fit model
melb_model = DecisionTreeRegressor()
melb_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melb_model.predict(val_X)
print("MAE: {:.2f}".format(mean_absolute_error(val_y, val_predictions)))

MAE: 253077.99


### Wow!

Your mean absolute error for the in-sample data was about 435 dollars, but out-of-sample it is more than 250,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes. As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.
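
The "about a quarter" figure can be checked with a line of arithmetic. The sketch below uses the validation MAE printed above and the rounded 1.1 million dollar average quoted in the text, so the result is approximate.

```python
validation_mae = 253077.99      # validation MAE printed above
average_home_value = 1_100_000  # rounded average home value quoted in the text

# Error as a share of the average home value
ratio = validation_mae / average_home_value
print("{:.0%}".format(ratio))  # 23%
```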

There are many ways to improve this model, such as experimenting to find better features or different model types.


## Experimenting With Different Models

Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn's [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that the decision tree model has many options (more than you'll want or need for a long time). 

The most important options determine the tree's depth. Recall from the first lesson in this micro-course that a tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree:

<center>
<img src="Improve_Decision_Tree.png" width="640" align="center"><br/>
</center>

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves.
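
The doubling can be tabulated in a few lines. This sketch assumes a perfectly balanced tree in which every group is split at every level, and it uses a hypothetical training set of 5,000 houses purely to show how quickly the leaves thin out.

```python
# Hypothetical training-set size, chosen only for illustration
n_houses = 5_000

for depth in range(1, 11):
    n_leaves = 2 ** depth                # each level of splits doubles the number of groups
    avg_per_leaf = n_houses / n_leaves   # average houses per leaf if split evenly
    print("depth {:2d}: {:4d} leaves, about {:.1f} houses per leaf".format(
        depth, n_leaves, avg_per_leaf))
```

Under these assumptions, the last row shows 1024 leaves with fewer than 5 houses each on average.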

### Overfitting

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly in validation and other new data. 

### Underfitting

On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups. At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason).

When a model fails to capture important distinctions and patterns in the data, and so performs poorly even on the training data, that is called **underfitting**.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between **underfitting** and **overfitting**. Visually, we want the low point of the (red) validation curve in the graph below:

<center>
<img src="Underfit_Overfit_Trade.png" width="480" align="center"><br/>
</center>
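
The two curves can be approximated numerically. The sketch below is not based on the housing data: it generates a small synthetic dataset (an assumption made so the example is self-contained) and controls tree size with scikit-learn's `max_depth` argument, then prints training and validation MAE side by side.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): a smooth signal plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 50 * np.sin(X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 15, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in [1, 3, 6, 12, None]:  # None lets the tree grow without a depth limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[depth] = (train_mae, val_mae)
    print("max_depth={}: training MAE {:.1f}, validation MAE {:.1f}".format(
        depth, train_mae, val_mae))
```

Training error keeps shrinking as the tree is allowed to grow, while validation error typically stops improving at some point and then worsens. The exact numbers depend on the synthetic data and the random seed.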

## Example: Overfitting vs Underfitting

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the `max_leaf_nodes` argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare **MAE** scores from different values for `max_leaf_nodes`:

In [6]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

The data is loaded into `train_X`, `val_X`, `train_y` and `val_y` using the code we have already written.

We can use a for-loop to compare the accuracy of models built with different values for `max_leaf_nodes`.

In [9]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 100, 500, 1000, 2000, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  363932
Max leaf nodes: 50  		 Mean Absolute Error:  253843
Max leaf nodes: 100  		 Mean Absolute Error:  249184
Max leaf nodes: 500  		 Mean Absolute Error:  241771
Max leaf nodes: 1000  		 Mean Absolute Error:  242625
Max leaf nodes: 2000  		 Mean Absolute Error:  245913
Max leaf nodes: 5000  		 Mean Absolute Error:  248952


We can see that, of the options listed, 500 is the optimal number of leaves. Lower values result in underfitting, while higher values overfit the training data.
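
Rather than reading the best value off the printout, it can be selected programmatically. A minimal sketch, using the scores printed above stored in a dictionary (the names `scores` and `best_tree_size` are illustrative):

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above
scores = {
    5: 363932,
    50: 253843,
    100: 249184,
    500: 241771,
    1000: 242625,
    2000: 245913,
    5000: 248952,
}

# Pick the key (leaf count) whose MAE is smallest
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # 500
```

In a live notebook the dictionary could be filled by calling `get_mae` for each candidate instead of hard-coding the numbers.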

### Conclusion

As a summary, models can suffer from either:
- **Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- **Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

## Improving Performance with Random Forest

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the **random forest** model as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn about other models with even better performance, but many of those are sensitive to getting the right parameters.
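
The averaging can be verified directly. The sketch below fits a small forest on synthetic data (an assumption, so the example runs on its own) and compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predict a few new points with every component tree, then average across trees
X_new = rng.uniform(0, 10, size=(5, 3))
tree_predictions = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))  # True
```

Because each tree is trained on a different random resample of the data, their individual errors partly cancel out in the average.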

### Example

You've already seen the code to load the data a few times. At the end of data-loading, we have the following variables:
- `train_X`, `train_y`
- `val_X`, `val_y`

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Load data
melb_file_path = '../input/melbourne-housing/melb_data.csv'
melb_data = pd.read_csv(melb_file_path)
# Drop rows that contain any missing values
filtered_melb_data = melb_data.dropna(axis=0)
# Choose target and features
y = filtered_melb_data.Price
melb_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melb_data[melb_features]

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.2, random_state = 0)

In scikit-learn, we build and use a **random forest** model the same way we built a **decision tree**, only this time we use the **RandomForestRegressor** class instead of the **DecisionTreeRegressor** class.

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print("Random Forest MAE = {:.2f}".format(mean_absolute_error(val_y, melb_preds)))

Random Forest MAE = 193739.86


### Conclusion

There is clearly room for further improvement, but this MAE of 193,740 is a big improvement over the best decision tree error of about 242,000 (obtained with 500 leaf nodes).

There are parameters that let you change the performance of the **Random Forest**, much as we changed the performance of a single decision tree by adjusting `max_leaf_nodes`. But one of the best features of **Random Forest** models is that they generally work reasonably well, even without this tuning.