In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves.
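
The doubling argument can be checked with a few lines of Python. This is a small sketch that assumes every group is split at every level (a perfectly balanced tree), which real trees often are not. `leaves_at_depth` is a hypothetical helper name used only for illustration.

```python
def leaves_at_depth(depth):
    """Number of leaves in a fully balanced binary tree after `depth` levels of splits."""
    return 2 ** depth

# Each extra level of splits doubles the number of groups
for depth in (1, 2, 3, 10):
    print(depth, leaves_at_depth(depth))
```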

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

In [26]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error 
from sklearn.model_selection import train_test_split
# Read the training data
home_data = pd.read_csv('train.csv')
# Prediction target
y = home_data.SalePrice
# Features used by the model
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
# Note: fitting on the full X, y means the model has already seen the validation rows,
# so the MAE below measures in-sample error, not performance on new data.
iowa_model.fit(X, y)
val_predictions = iowa_model.predict(val_X)
# mean_absolute_error expects (y_true, y_pred)
mae_val = mean_absolute_error(val_y, val_predictions)
print(f"Mae is {mae_val}")

Mae is 6.164383561643835


#### Overfitting
This is a phenomenon called overfitting, where a <span style="background-color: #FFFF00">model matches the training data almost perfectly</span> but does <span style="background-color: #FFFF00">poorly on validation and other new data</span>. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

#### max_leaf_nodes
- Tree depth can be controlled in several ways.
- The `max_leaf_nodes` argument provides a very sensible way to control overfitting vs. underfitting. The more leaves we allow the model to make, the more we move from underfitting toward overfitting.
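
To see what the cap does in practice, here is a minimal sketch on synthetic data (an assumption for illustration, not the housing data). It fits two trees with different `max_leaf_nodes` values and reads back the number of leaves each one actually grew with `get_n_leaves()`. The names `X_demo`, `y_demo`, and `leaf_counts` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two random features, target depends on the first plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] * 3 + rng.normal(0, 1, size=300)

leaf_counts = {}
for cap in (5, 50):
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    tree.fit(X_demo, y_demo)
    # get_n_leaves() reports how many leaves the fitted tree ended up with
    leaf_counts[cap] = tree.get_n_leaves()

print(leaf_counts)
```

The fitted tree never has more leaves than the cap, so a small cap forces coarse groups and a large cap allows very fine ones.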

In [36]:
def get_mae(max_leaf_nodes, train_X=train_X, val_X=val_X, train_y=train_y, val_y=val_y):
    """Return the validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    # mean_absolute_error expects (y_true, y_pred)
    mae = mean_absolute_error(val_y, preds)
    return mae

#### Underfitting
At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a <span style="background-color: #FFFF00">wide variety of houses</span>. <span style="background-color: #FFFF00">Resulting predictions may be far off for most houses</span>, even in the training data (and they will be bad in validation too, for the same reason). When a <span style="background-color: #FFFF00">model fails to capture important distinctions and patterns in the data</span>, so that it performs poorly even on training data, that is called underfitting.
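
The contrast between the two failure modes can be sketched by comparing training and validation error for a very small tree and an unrestricted one. The data here is synthetic (a noisy sine curve chosen only for illustration), and names such as `tr_X`, `va_X`, and `results` are assumptions of this sketch rather than part of the housing example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for label, cap in [("2 leaves", 2), ("unlimited leaves", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    tree.fit(tr_X, tr_y)
    # Store (training MAE, validation MAE) for this tree size
    results[label] = (
        mean_absolute_error(tr_y, tree.predict(tr_X)),
        mean_absolute_error(va_y, tree.predict(va_X)),
    )

for label, (train_mae, val_mae) in results.items():
    print(f"{label}: train MAE {train_mae:.3f}, validation MAE {val_mae:.3f}")
```

The 2-leaf tree has a large error even on the data it was trained on, which is underfitting. The unrestricted tree reproduces the training targets but does worse on the validation rows, which is overfitting.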

In [44]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Compute the validation MAE for each candidate tree size
scores = {leaf_size: get_mae(leaf_size) for leaf_size in candidate_max_leaf_nodes}
print(scores)
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best = min(scores, key=scores.get)
print(best)

{5: 36986.81844313694, 25: 29382.773183350568, 50: 27486.37338812241, 100: 29542.863076302427, 250: 33826.44190321383, 500: 35130.000092411385}
50


### Conclusion
Here's the takeaway: Models can suffer from either:

- Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.