*This tutorial is part of the series [Learn Machine Learning](https://www.kaggle.com/learn/machine-learning). At the end of this step, you will understand the concepts of underfitting and overfitting, and you will be able to apply these ideas to optimize your model accuracy.*

# Experimenting With Different Models

Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions.  But what alternatives do you have for models?

You can see in scikit-learn's [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth.  Recall from [page 2](https://www.kaggle.com/dansbecker/first-data-science-scenario-page-2/) that a tree's depth is a measure of how many splits it makes before coming to a prediction.  This is a relatively shallow tree:

![Depth 2 Tree](http://i.imgur.com/R3ywQsR.png)

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf.  As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses.  If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses.  Splitting each of those again creates 8 groups.  If we keep doubling the number of groups by adding more splits at each level, we'll have \\(2^{10}\\) groups of houses by the time we get to the 10th level. That's 1024 leaves.  
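The doubling arithmetic above can be checked with a few lines of Python. This is only a sketch of the idealized case in which every group is split in two at every level; the function name `leaves_at_depth` is made up for illustration and is not part of scikit-learn.

```python
def leaves_at_depth(depth):
    """Leaves in an idealized tree where every group is split in two at each level."""
    return 2 ** depth

# Number of groups after 1, 2, 3 and 10 levels of splits
for depth in [1, 2, 3, 10]:
    print("Depth %d -> %d leaves" % (depth, leaves_at_depth(depth)))
```

Real trees rarely split every group at every level, so this count is best read as an upper bound on the number of leaves at a given depth.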

When we divide the houses amongst many leaves, we also have fewer houses in each leaf.  Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly in validation and other new data.  On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.  

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called **underfitting**.  
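Both failure modes can be reproduced on a small made-up dataset. The sketch below uses synthetic data (a sine curve plus noise, chosen purely for illustration, not the housing data) and compares a tree limited to 2 leaves with a tree that has no leaf limit. You would expect the unrestricted tree to reproduce the training targets almost exactly while doing clearly worse on validation data, and the 2-leaf tree to have a fairly large error on both.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): smooth signal plus noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
# max_leaf_nodes=None places no limit on the number of leaves
for label, leaves in [("shallow", 2), ("deep", None)]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[label] = (train_mae, val_mae)
    print("%s tree: training MAE %.3f, validation MAE %.3f" % (label, train_mae, val_mae))
```

The gap between training and validation error for the deep tree is the signature of overfitting; a large error on both sets for the shallow tree is the signature of underfitting.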

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.  Visually, we want the low point of the (red) validation curve in the figure below:

![underfitting_overfitting](http://i.imgur.com/2q85n9s.png)

# Example
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes.  But the *max_leaf_nodes* argument provides a very sensible way to control overfitting vs underfitting.  The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
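To see how *max_leaf_nodes* relates to depth, you can inspect a fitted tree with `get_n_leaves()` and `get_depth()`, which are available in recent scikit-learn releases. The sketch below uses synthetic data (an assumption for illustration); the cap of 8 leaves is arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing data)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(size=300)

# Cap the number of leaves rather than the depth
model = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
model.fit(X, y)

n_leaves = model.get_n_leaves()
depth = model.get_depth()
print("Leaves: %d, depth: %d" % (n_leaves, depth))
```

Because the limit is on leaves, the tree is free to spend its splits where they help most, so some routes may end up deeper than others.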

We can use a utility function to help compare MAE scores from different values for *max_leaf_nodes*:


In [1]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    # Fit a tree limited to max_leaf_nodes leaves on the training data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    # Score the fitted tree on the held-out validation data
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

The data is loaded into **train_X**, **val_X**, **train_y** and **val_y** using the code you've already seen (and which you've already written).

In [2]:
# Data Loading Code Runs At This Point
import pandas as pd
    
# Load data
melbourne_file_path = './data/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and predictors
y = filtered_melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_predictors]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We can use a for-loop to compare the accuracy of models built with different values for *max_leaf_nodes.*

In [3]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  257829
Max leaf nodes: 500  		 Mean Absolute Error:  243176
Max leaf nodes: 5000  		 Mean Absolute Error:  254915


Of the options listed, 500 leaves gives the lowest validation MAE, so it is the best choice among them.  Apply the function to your Iowa data to find the best decision tree.

---

# Conclusion

Here's the takeaway: Models can suffer from either:
- **Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or 
- **Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions. 

We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one. 
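The "try many candidates and keep the best one" step can be sketched as follows. The data here is synthetic (an assumption for illustration), and the helper name `validation_mae` and the candidate list are hypothetical choices, not part of the tutorial.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing data)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def validation_mae(max_leaf_nodes):
    """Fit on the training split and return the MAE on the validation split."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Score every candidate tree size, then keep the one with the lowest MAE
candidates = [2, 5, 25, 100]
scores = {n: validation_mae(n) for n in candidates}
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes: %d (validation MAE %.3f)" % (best_size, scores[best_size]))
```

Storing the scores in a dictionary keeps each MAE next to the setting that produced it, so picking the winner is a single `min` call.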

But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards. 

---
# Your Turn
In the near future, you'll be proficient at writing functions like `get_mae` yourself.  For now, just copy it over to your work area.  Then use a for-loop that tries different values of *max_leaf_nodes* and calls the *get_mae* function on each to find the ideal number of leaves for your Iowa data.

You should see that the ideal number of leaves for the Iowa data is lower than the ideal number of leaves for the Melbourne data. Remember that a lower MAE is better.

---

# Continue
**[Click here](https://www.kaggle.com/dansbecker/random-forests)** to learn about your first sophisticated Machine Learning model, the Random Forest. It is a clever extension of the decision tree model that typically leads to more accurate predictions.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('./data/house-prices-advanced-regression-techniques/train.csv')

X = data[["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]]
y = data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [5]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    
    return(mae)

In [6]:
import numpy as np

In [7]:
scores = []
for max_leaf_nodes in np.arange(5, 100000, 50):
    mae = get_mae(max_leaf_nodes, X_train, X_test, y_train, y_test)
    scores.append((max_leaf_nodes, mae))
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35099
Max leaf nodes: 55  		 Mean Absolute Error:  25041
Max leaf nodes: 105  		 Mean Absolute Error:  26591
Max leaf nodes: 155  		 Mean Absolute Error:  25916
Max leaf nodes: 205  		 Mean Absolute Error:  27558
Max leaf nodes: 255  		 Mean Absolute Error:  28065
Max leaf nodes: 305  		 Mean Absolute Error:  28438
Max leaf nodes: 355  		 Mean Absolute Error:  28535
Max leaf nodes: 405  		 Mean Absolute Error:  28733
Max leaf nodes: 455  		 Mean Absolute Error:  28679
Max leaf nodes: 505  		 Mean Absolute Error:  29088
Max leaf nodes: 555  		 Mean Absolute Error:  29128
Max leaf nodes: 605  		 Mean Absolute Error:  29186
Max leaf nodes: 655  		 Mean Absolute Error:  29147
Max leaf nodes: 705  		 Mean Absolute Error:  29140
Max leaf nodes: 755  		 Mean Absolute Error:  29103
Max leaf nodes: 805  		 Mean Absolute Error:  29123
Max leaf nodes: 855  		 Mean Absolute Error:  29142
Max leaf nodes: 905  		 Mean Absolute Error:  29114
[... output truncated; every later line captured here, from max_leaf_nodes = 7805 through 94555, reports a Mean Absolute Error of 29130 ...]

In [8]:
sorted_scores = sorted(scores, key=lambda score: score[1])

In [9]:
print("Best: max_leaf_nodes = %d, MAE = %d" %sorted_scores[0])

Best: max_leaf_nodes = 55, MAE = 25041
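One plausible reading of the long run of identical scores in the output above: a tree cannot have more leaves than it has training rows, so once *max_leaf_nodes* exceeds what the tree can use, raising it further changes nothing. The sketch below illustrates this on synthetic data with 200 rows (an assumption for illustration, not the Iowa data) by reading the actual leaf count back with `get_n_leaves()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with only 200 training rows (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X.sum(axis=1) + rng.normal(size=200)

leaf_counts = {}
for max_leaf_nodes in [50, 500, 5000, 50000]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X, y)
    # Leaves actually grown, which can be far fewer than the limit allows
    leaf_counts[max_leaf_nodes] = model.get_n_leaves()
    print("Limit: %d  \t Leaves actually used: %d" % (max_leaf_nodes, leaf_counts[max_leaf_nodes]))
```

If the same thing is happening with the Iowa data, a much shorter list of candidate values would find the same best setting in a fraction of the time.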
