## Overfitting and underfitting

Overfitting is the phenomenon where a model matches the training data almost perfectly but does poorly on validation data and other new data. On the flip side, if we make our tree very shallow, it doesn't divide the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called underfitting.

Here's the takeaway: models can suffer from either:

- Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
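To make the two failure modes concrete, here is a minimal sketch that compares training and validation MAE for trees of different sizes. The data is synthetic (a noisy sine curve, an assumption made purely for illustration), and the helper name train_and_val_mae is hypothetical, not part of this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

# Default split: 300 training rows, 100 validation rows.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)


def train_and_val_mae(max_leaf_nodes):
    """Fit a tree of the given size and return (training MAE, validation MAE)."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_va, tree.predict(X_va))
    return train_mae, val_mae


# 2 leaves: very shallow tree; 300 leaves: one leaf per training row.
for leaves in [2, 20, 300]:
    train_mae, val_mae = train_and_val_mae(leaves)
    print(f"max_leaf_nodes={leaves:<4d} train MAE={train_mae:.3f}  val MAE={val_mae:.3f}")
```

A very small tree tends to score poorly on both sets (underfitting), while a tree with as many leaves as training rows drives the training MAE to nearly zero without a matching improvement on the validation set (overfitting).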

The max_leaf_nodes argument provides a sensible way to control the trade-off between overfitting and underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

In [46]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error


In [49]:
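train = pd.read_csv("datasets/training.csv")  # path to the training data
# Drop the rows where min_ANNmuon is at most 0.4
train_cut = train.drop(train[train.min_ANNmuon <= 0.4].index)
# Features: every column except the target and the columns excluded here
x = train_cut.drop(['min_ANNmuon', 'mass', 'production', 'signal', 'id', 'SPDhits'], axis = 1)
# Target: the signal column
y = train_cut['signal']
#x = x.reshape(-1,1)
pd.set_option('display.max_columns', None)  # show every column when displaying DataFrames
# Hold out validation data (default split: 75% training, 25% validation)
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 0)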

In [50]:
def get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y):
    """Fit a model with the given max_leaf_nodes and return its validation MAE."""
    #model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)

    model.fit(train_x, train_y)
    preds_val = model.predict(val_x)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.



In [51]:
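for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y)
    # Format the MAE with %f: %d truncates a float to an integer, which is why
    # the output recorded below (produced with %d) shows 0 for every tree size.
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %f" %(max_leaf_nodes, my_mae))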

Max leaf nodes: 5  		 Mean Absolute Error:  0
Max leaf nodes: 50  		 Mean Absolute Error:  0
Max leaf nodes: 500  		 Mean Absolute Error:  0
Max leaf nodes: 5000  		 Mean Absolute Error:  0
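Rather than reading the best tree size off the printed scores, the comparison can be automated. The following is a minimal sketch under stated assumptions: the data is synthetic stand-in data (not the notebook's training.csv), the helper validation_mae is a hypothetical variant of get_mae, and n_estimators=20 is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): two features and a noisy target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

demo_train_x, demo_val_x, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)


def validation_mae(max_leaf_nodes):
    """Hypothetical helper: fit a small forest and return its validation MAE."""
    model = RandomForestRegressor(
        n_estimators=20,  # assumption: fewer trees than the default, for speed
        max_leaf_nodes=max_leaf_nodes,
        random_state=0,
    )
    model.fit(demo_train_x, demo_train_y)
    return mean_absolute_error(demo_val_y, model.predict(demo_val_x))


candidate_sizes = [5, 50, 500]

# Map each candidate tree size to its validation MAE.
scores = {size: validation_mae(size) for size in candidate_sizes}

# The best size is the one with the lowest validation MAE.
best_size = min(scores, key=scores.get)

for size, mae in scores.items():
    print(f"Max leaf nodes: {size:<5d} Mean Absolute Error: {mae:.4f}")
print(f"Best max_leaf_nodes: {best_size}")
```

Storing the scores in a dictionary keeps every candidate's MAE available for inspection, and printing with a float format such as .4f preserves the decimals that distinguish one candidate from another.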


Once you have identified the best tree size, you can make the model more accurate for deployment by refitting it on all of the data while keeping that tree size. That is, you don't need to hold out the validation data once you've made all your modeling decisions.

In [52]:
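#final_model = DecisionTreeRegressor(max_leaf_nodes=5000,random_state=1)
final_model = RandomForestRegressor(max_leaf_nodes=50,random_state=1)

# Refit on all of the data (x, y), not just the training split.
final_model.fit(x, y)
# Caution: val_x is a subset of x, so the model has already seen these rows.
# The MAE below is therefore optimistic and is not a true validation score.
val_predictions = final_model.predict(val_x)
print(val_predictions)
val_mean = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for Random Forest Model: {}".format(val_mean))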


[0.7830149  0.95663345 0.95081301 ... 0.9524816  0.95077593 0.97001085]
Validation MAE for Random Forest Model: 0.18009334655397594
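One caveat about the score above: val_x is a subset of x, so a model fitted on all of x has already seen the rows it is being scored on. The sketch below illustrates the effect on synthetic stand-in data (an assumption, not the notebook's dataset), with n_estimators=20 chosen only for speed. It scores the same validation rows twice, once with a model that never saw them and once with a model refitted on everything.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): three features and a noisy target.
rng = np.random.RandomState(0)
X_all = rng.uniform(0, 10, size=(200, 3))
y_all = np.sin(X_all[:, 0]) + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, random_state=0)


def make_forest():
    """Build the same small forest for both fits (settings are illustrative)."""
    return RandomForestRegressor(n_estimators=20, max_leaf_nodes=50, random_state=1)


# Model that never sees the validation rows during fitting.
held_out_model = make_forest().fit(X_tr, y_tr)
honest_mae = mean_absolute_error(y_va, held_out_model.predict(X_va))

# Model refitted on all rows, including the validation rows.
refit_model = make_forest().fit(X_all, y_all)
leaky_mae = mean_absolute_error(y_va, refit_model.predict(X_va))

print(f"MAE on unseen validation rows:       {honest_mae:.4f}")
print(f"MAE on rows the model was fitted on: {leaky_mae:.4f}")
```

The second number is typically lower because the model has partly memorized those rows. To estimate how a refitted model will perform on new data, report the score measured before the refit, or set aside a separate test set that is never used for fitting.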
