# Exercise: Model Validation

## Recap

You've built a model. In this exercise you will test how good your model is.

Run the cell below to set up your coding environment where the previous exercise left off.

In [2]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/iowa_train.csv'
home_data = pd.read_csv(iowa_file_path)

y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
                   'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


## Step 1: Split Your Data

Use the `train_test_split` function to split up your data.

Give it the argument `random_state=1` so you know what to expect when verifying your code.

Recall, your features are loaded in the DataFrame **X** and your target is loaded in **y**.

In [3]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the features and target into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
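
To see what this call does, here is a minimal sketch on toy data. The `toy_*` names and values are made up for illustration and are not the Iowa data. When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` reproduces the same split on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (hypothetical values, not the Iowa data)
toy_X = pd.DataFrame({"LotArea": range(100), "YearBuilt": range(1900, 2000)})
toy_y = pd.Series(range(100), name="SalePrice")

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)

# With the default settings, 75 rows go to training and 25 to validation
print(len(toy_train_X), len(toy_val_X))

# Repeating the call with the same seed selects the same training rows
again_train_X, _, _, _ = train_test_split(toy_X, toy_y, random_state=1)
same_split = toy_train_X.index.equals(again_train_X.index)
print("Same split on a second call:", same_split)
```

The training and validation rows never overlap, which is what makes the validation set a fair test of the model.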

## Step 2: Specify and Fit the Model

Create a **DecisionTreeRegressor** model and fit it to the relevant data. Set `random_state` to 1 again when creating the model.

In [4]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit model
iowa_model.fit(train_X, train_y)
print("First in-sample predictions:", iowa_model.predict(train_X.head()))
print("Actual target values for those homes:", train_y.head().tolist())

First in-sample predictions: [307000. 223500. 145000. 155000. 140000.]
Actual target values for those homes: [307000, 223500, 145000, 155000, 140000]


## Step 3: Make Predictions with Validation data

Predict with validation data and inspect your predictions and actual values from validation data.

In [5]:
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# print the top few validation predictions
print("First validation predictions:", val_predictions[:5])
# print the top few actual prices from validation data
print("Actual validation values:", val_y.head().tolist())

First validation predictions: [186500. 184000. 130000.  92000. 164500.]
Actual validation values: [231500, 179500, 122000, 84500, 142000]


What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell on this page)?

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
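
The gap between in-sample and validation accuracy can be reproduced on synthetic data. The sketch below assumes a made-up noisy linear target (the `demo_*` and `d_*` names are hypothetical). An unconstrained tree memorizes its training rows, so its in-sample error is essentially zero, while its error on rows it has never seen is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(200, 2))
demo_y = 3.0 * demo_X[:, 0] + rng.normal(0, 2.0, size=200)

d_train_X, d_val_X, d_train_y, d_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

# No size limit: the tree keeps splitting until each leaf is pure
demo_tree = DecisionTreeRegressor(random_state=1)
demo_tree.fit(d_train_X, d_train_y)

in_sample_mae = mean_absolute_error(d_train_y, demo_tree.predict(d_train_X))
validation_mae = mean_absolute_error(d_val_y, demo_tree.predict(d_val_X))

print("In-sample MAE: {:.2f}".format(in_sample_mae))
print("Validation MAE: {:.2f}".format(validation_mae))
```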

## Step 4: Calculate the Mean Absolute Error in Validation Data

In [6]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

# Print the validation MAE
print("MAE: {:.2f}".format(val_mae))

MAE: 29652.93


Is that MAE good? There isn't a general rule for what values are good that applies across applications. But you'll see how to use (and improve) this number in the next step.
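
To make the metric concrete, the sketch below computes the MAE by hand for the five validation homes printed in Step 3: take the absolute difference between each prediction and the actual price, then average. It covers only those five rows, so the result differs from the MAE over the full validation set.

```python
# First five validation predictions and actual prices, copied from Step 3
predicted = [186500, 184000, 130000, 92000, 164500]
actual = [231500, 179500, 122000, 84500, 142000]

# Absolute error for each home
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
print(abs_errors)  # [45000, 4500, 8000, 7500, 22500]

# Mean absolute error over these five homes
mae_first_five = sum(abs_errors) / len(abs_errors)
print("MAE of the first five homes: {:.2f}".format(mae_first_five))  # 17500.00
```

`mean_absolute_error` computes the same quantity over every row it is given.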

## Step 5: Compare Different Tree Sizes

The `max_leaf_nodes` argument gives you a straightforward way to control the trade-off between overfitting and underfitting in a decision tree.
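
As a quick illustration of what the argument does, the sketch below fits trees to synthetic data (the `cap_*` names and values are made up) and reports how many leaves each one ends up with. A small cap forces a coarse tree that may underfit, while no cap lets the tree grow until it fits the training rows exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data (illustrative only)
rng = np.random.default_rng(0)
cap_X = rng.uniform(0, 10, size=(300, 1))
cap_y = np.sin(cap_X[:, 0]) + rng.normal(0, 0.1, size=300)

leaf_counts = {}
for cap in [5, 25, None]:  # None means no limit on the number of leaves
    capped_tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    capped_tree.fit(cap_X, cap_y)
    leaf_counts[cap] = capped_tree.get_n_leaves()
    print("max_leaf_nodes={}: {} leaves".format(cap, leaf_counts[cap]))
```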

In this exercise we will use the following utility function to help compare MAE scores from different values for `max_leaf_nodes`:

In [7]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    return mae

Write a loop that tries each value of `max_leaf_nodes` in the list `candidate_max_leaf_nodes`. Call the `get_mae` function on each value and store the output in a way that lets you select the value of `max_leaf_nodes` that gives the most accurate model on your data.

In [8]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
min_mae = 1.0e10
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = 5
for mln in candidate_max_leaf_nodes:
    mae = get_mae(mln, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: {:d}\t\t Mean Absolute Error: {:0.2f}".format(mln, mae))
    if mae < min_mae:
        min_mae = mae
        best_tree_size = mln
print("Best max_leaf_nodes: {:d}".format(best_tree_size))

Max leaf nodes: 5		 Mean Absolute Error: 35044.51
Max leaf nodes: 25		 Mean Absolute Error: 29016.41
Max leaf nodes: 50		 Mean Absolute Error: 27405.93
Max leaf nodes: 100		 Mean Absolute Error: 27282.51
Max leaf nodes: 250		 Mean Absolute Error: 27893.82
Max leaf nodes: 500		 Mean Absolute Error: 29454.19
Best max_leaf_nodes: 100
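
Another way to store the results is a dictionary that maps each tree size to its MAE, after which `min` picks the best key. The sketch below uses the scores printed above as literal values so that it runs on its own. In the notebook, the dictionary could instead be built by calling `get_mae` for each candidate.

```python
# MAE for each candidate tree size, copied from the output above.
# In the notebook this could be built with:
#   {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidate_max_leaf_nodes}
scores = {
    5: 35044.51,
    25: 29016.41,
    50: 27405.93,
    100: 27282.51,
    250: 27893.82,
    500: 29454.19,
}

# Pick the key (tree size) whose value (MAE) is smallest
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_size)  # 100
```

Note the shape of the scores: the error falls as the tree grows from 5 to 100 leaves (less underfitting) and then rises again at 250 and 500 (more overfitting).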


## Step 6: Fit Model Using All Data

Now you know the best tree size. If you were going to deploy this model in practice, you would typically make it more accurate by retraining on all of the data while keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

In [9]:
# Use the best tree size found in Step 5
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)

# Fit the final model on all of the data
final_model.fit(X, y)

DecisionTreeRegressor(max_leaf_nodes=100, random_state=0)

You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. 

In the next step you will learn to use Random Forests to improve your models even more.

## Step 7: Use a Random Forest

Data science isn't always this easy. But replacing the decision tree with a Random Forest is going to be an easy win.

In [11]:
from sklearn.ensemble import RandomForestRegressor

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
# fit your model
rf_model.fit(train_X, train_y)
val_predictions = rf_model.predict(val_X)

# Calculate the mean absolute error of your Random Forest model on the validation data
# (actual values first, then predictions, matching the convention used in Step 4)
rf_val_mae = mean_absolute_error(val_y, val_predictions)

print("Validation MAE for Random Forest Model: {:.2f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 21857.16
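
To put the three validation scores from this exercise side by side, the sketch below computes the percentage reduction in MAE relative to the untuned tree from Step 4. Treat it as a rough comparison only: the trees in Steps 4 and 5 use different `random_state` values, and every score comes from a single validation split. The helper `pct_reduction` is a hypothetical name introduced here.

```python
# Validation MAE values copied from the outputs in this exercise
baseline_mae = 29652.93  # Step 4: decision tree with no size limit
tuned_mae = 27282.51     # Step 5: decision tree with max_leaf_nodes=100
forest_mae = 21857.16    # Step 7: random forest with default settings


def pct_reduction(before, after):
    """Return the percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


tuned_gain = pct_reduction(baseline_mae, tuned_mae)
forest_gain = pct_reduction(baseline_mae, forest_mae)

print("Tuned tree: {:.1f}% lower MAE than the untuned tree".format(tuned_gain))
print("Random forest: {:.1f}% lower MAE than the untuned tree".format(forest_gain))
```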
