# Underfitting and overfitting

## Experimenting with different models

Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from the first lesson in this micro-course that a tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree:

![Shallow tree](../data/misc/complex_tree.png)

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves (almost the number of houses that we have in our dataset).
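The doubling described above can be checked with a few lines of Python. This is only the arithmetic, under the idealized assumption that every group is split again at every level (a real tree may stop splitting some branches earlier):

```python
# Number of leaves when every group is split in two at each level
for depth in range(1, 11):
    print(f"depth {depth:2d}: {2 ** depth:4d} leaves")

leaves_at_depth_10 = 2 ** 10
```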

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called **overfitting**, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called **underfitting**.

Since we care about accuracy _on new data_, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below:



![Overfitting](../data/misc/overfitting.png)
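To see the two curves in numbers, here is a minimal sketch that compares training and validation MAE as the tree is allowed more leaves. It uses synthetic data (a noisy sine wave) rather than the housing data, and the sizes tried are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=500)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)

# Map each tree size to its (training MAE, validation MAE)
demo_results = {}
for max_leaf_nodes in [2, 4, 8, 16, 64, 256]:
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(demo_train_X, demo_train_y)
    train_mae = mean_absolute_error(demo_train_y, tree.predict(demo_train_X))
    val_mae = mean_absolute_error(demo_val_y, tree.predict(demo_val_X))
    demo_results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"{max_leaf_nodes:4d} leaves: train MAE {train_mae:.3f}, validation MAE {val_mae:.3f}")
```

Very small trees should show high error on both sets (underfitting), while very large trees should show a low training error that the validation error does not follow (overfitting).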

## Our previous model

In [None]:
# Don't modify
import pandas as pd

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('../data/housing/train.csv')

features = [
    'LotArea',
    'YearBuilt',
    '1stFlrSF',
    '2ndFlrSF',
    'FullBath',
    'BedroomAbvGr',
    'TotRmsAbvGrd'
]
target = 'SalePrice'

X = df[features]
y = df[target]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

How many leaves does our model have?

In [None]:
leaves = model._ # your answer here
print(f'Number of leaves: {leaves}')

To experiment, we can wrap the training, prediction, and MAE calculation in a function. The function takes the maximum number of leaves for the tree as a parameter, so we can call it several times with different values and compare the results. Here is some code for you to complete:

In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Create a DecisionTreeRegressor with a max_leaf_nodes parameter
    model = DecisionTreeRegressor(__, random_state=0)
    
    # Fit the model with train_X and train_y
   
    
    # Predict on the validation set
    
    
    # Return the MAE
    return 

**Compare Different Tree Sizes**

Write a loop that tries each value of *max_leaf_nodes* from a set of candidate values.

Call the *get_mae* function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.

In [None]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Write loop to find the ideal tree size from candidate_max_leaf_nodes


# Print the MAE for each tree


# Store the best value for max_leaf_nodes (that's the one that has min MAE)
best_tree_size = 

In [None]:
print(f'Best tree size is {best_tree_size}, with an MAE of {}')
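One way to store the scores and pick the winner is a dictionary keyed by tree size, combined with `min`. The sketch below shows the pattern on toy data from `make_regression`, not on the housing data. The helper `toy_mae` and all `toy_*` names are hypothetical and exist only for this illustration:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy data (an assumption for illustration, not the housing dataset)
X_toy, y_toy = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    X_toy, y_toy, random_state=0
)

def toy_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves (hypothetical helper)."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(toy_train_X, toy_train_y)
    return mean_absolute_error(toy_val_y, tree.predict(toy_val_X))

# Map each candidate size to its validation MAE
toy_scores = {size: toy_mae(size) for size in [5, 25, 50, 100]}

# The best size is the key with the smallest MAE
toy_best_size = min(toy_scores, key=toy_scores.get)
print(f"Best size on the toy data: {toy_best_size} (MAE {toy_scores[toy_best_size]:.1f})")
```

Keeping the scores in a dictionary makes it easy to print every result and still recover the best size in a single call.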

## Final model
You would now train your model on the full dataset, since you have already made all your modeling decisions and no longer need to hold out validation data. More training data generally gives a better model.
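As a minimal sketch of that last step, the snippet below refits a tree on all available rows once a size has been chosen. The data is synthetic and `chosen_size = 50` is an assumed value for illustration; in the notebook you would use your own features, target, and the best size you found:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the full dataset (an assumption for illustration)
X_all, y_all = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

# Assumed result of the tuning step, not a value taken from the housing data
chosen_size = 50

# No train/validation split here: every row is used for fitting
final_tree = DecisionTreeRegressor(max_leaf_nodes=chosen_size, random_state=0)
final_tree.fit(X_all, y_all)

print(f"Final tree has {final_tree.get_n_leaves()} leaves")
```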

You've tuned this model and improved your results (slightly). But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. Let's have a look at Random Forests.