In [1]:
import pandas as pd

melbourne_data = pd.read_csv('../../data/melb_data.csv')
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Target
y = filtered_melbourne_data.Price
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
            'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[features]

from sklearn.model_selection import train_test_split
# Split both features and target into training and validation sets
train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=0)

## Experimenting With Different Models

Now that we can measure model accuracy, we can experiment with alternative models and search for the one that gives the best predictions.

But how do we make an `alternative model`? The scikit-learn documentation lists many options for the decision tree model. The most important ones determine the tree's depth, which measures how many splits the tree makes before arriving at a prediction.
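To make the link between depth and leaf count concrete: every split in a scikit-learn decision tree is binary, so each extra level can at most double the number of leaves. The sketch below only does that arithmetic; the helper name `max_leaves_at_depth` is made up for this illustration.

```python
def max_leaves_at_depth(depth: int) -> int:
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** depth

for depth in [1, 2, 10]:
    print(f"depth {depth:>2}: at most {max_leaves_at_depth(depth)} leaves")
```

A tree only 10 levels deep can already sort the houses into as many as 1024 groups.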

When we divide the houses among many leaves, each leaf holds fewer houses. Such a tree makes predictions that are very close to the actual values of the houses it was trained on, but its predictions on new data may be very unreliable, because each one is based on only a few houses. This phenomenon is called `overfitting`: the model matches the training data almost perfectly but does poorly on validation data and other new data.

On the other hand, if the tree divides the houses into only 2 or 4 groups, the resulting predictions may be far off for most houses, in both training and validation data. When a model fails to capture important distinctions and patterns in the data, this is called `underfitting`.

<b>How do we solve this?</b> Since we care about accuracy on new data, which we estimate from our validation data, we want to find the balance between underfitting and overfitting.
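One way to see the two extremes is to compare training and validation error for a very small tree and an unrestricted one. The sketch below is only an illustration: it uses synthetic data rather than the Melbourne data, and the names `X_demo`, `y_demo`, and `train_and_valid_mae` are made up for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption, for illustration only):
# a linear signal in the first feature plus random noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

def train_and_valid_mae(max_leaf_nodes):
    """Fit a tree and return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return (mean_absolute_error(tr_y, model.predict(tr_X)),
            mean_absolute_error(va_y, model.predict(va_X)))

# Two leaves: too coarse to follow the signal, even on the training data
shallow_train, shallow_valid = train_and_valid_mae(2)
# No limit on leaves: the tree can memorize the training rows, noise included
deep_train, deep_valid = train_and_valid_mae(None)

print(f"2 leaves:  train MAE {shallow_train:.2f}, validation MAE {shallow_valid:.2f}")
print(f"unlimited: train MAE {deep_train:.2f}, validation MAE {deep_valid:.2f}")
```

A gap between training and validation error for the unrestricted tree is the signature of overfitting, which is why we compare candidate models on validation data rather than training data.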

## Example

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. <br>
But the `max_leaf_nodes` argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from underfitting toward overfitting.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:


In [15]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes: int, train_X, valid_X, train_y, valid_y) -> float:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    predictions = model.predict(valid_X)
    mae = mean_absolute_error(valid_y, predictions)
    return mae
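For reference, the mean absolute error is the average of the absolute differences between actual and predicted values. Here is a minimal hand-rolled version; the function name `mean_absolute_error_by_hand` and the prices are made up for illustration, and in practice we keep using scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Made-up prices, for illustration only
actual_prices = [500_000, 750_000, 1_000_000]
predicted_prices = [520_000, 700_000, 1_000_000]

# Absolute errors are 20000, 50000 and 0, so the MAE is 70000 / 3
mae_demo = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(round(mae_demo))
```

A lower MAE means the predictions are, on average, closer to the real prices.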

We can use a for loop to compare the accuracy of models built with different values for `max_leaf_nodes`.

In [16]:
%%time
for max_leaf_nodes in [5, 50, 5_000, 7_000, 10_000]:
    current_mae = get_mae(max_leaf_nodes, train_X, valid_X, train_y, valid_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, current_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 5000  		 Mean Absolute Error:  254983
Max leaf nodes: 7000  		 Mean Absolute Error:  254983
Max leaf nodes: 10000  		 Mean Absolute Error:  254983
CPU times: user 106 ms, sys: 2.87 ms, total: 109 ms
Wall time: 111 ms
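Rather than reading the best value off the printed output, we could pick it programmatically. This is a minimal sketch using the scores printed above; the names `mae_by_leaf_count` and `best_tree_size` are made up for this example.

```python
# MAE values copied from the output above
mae_by_leaf_count = {5: 347380, 50: 258171, 5_000: 254983, 7_000: 254983, 10_000: 254983}

# min() returns the first key with the lowest MAE; because the keys are listed
# in ascending order, a tie goes to the smallest tree
best_tree_size = min(mae_by_leaf_count, key=mae_by_leaf_count.get)
print(best_tree_size)
```

In a notebook, the same dictionary could be built directly by calling `get_mae` for each candidate value instead of copying the numbers by hand.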


## Conclusion

Models can suffer from:
- <b>Overfitting:</b> capturing spurious patterns that will not recur in real data
- <b>Underfitting:</b> failing to capture relevant patterns, for example because the tree is too shallow.

So, both `overfitting` and `underfitting` lead to less accurate predictions.
    
We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.