# overfitting and underfitting

In [1]:
#here, we want to find the tree size that gives the lowest validation error, so we avoid both overfitting and underfitting
#overfitting occurs when the tree has too many splits: each split divides the data further, so every leaf ends up
#holding only a few data points. with so little data per leaf, the model captures patterns (including noise) in those
#few points too well, resulting in a model that fits the training data very closely but does poorly on new validation data
#in contrast, underfitting occurs when the tree has too few splits: each leaf still holds a large, varied group of
#data points, so the model fails to capture the important patterns in the data set
#for more info, check https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting
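
The effect described above can be reproduced on a small synthetic data set. This is only a sketch: the noisy sine data, the variable names, and the three tree sizes are assumptions chosen for illustration and have nothing to do with the housing data. An unrestricted tree typically scores a near-zero error on the data it was trained on and a clearly higher error on held-out data, while a very small tree already does poorly on the training data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing data):
# a noisy sine curve with a single input feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)

Xd_train, Xd_val, yd_train, yd_val = train_test_split(X_demo, y_demo, random_state=0)

# max_leaf_nodes=None leaves the tree unrestricted, so it keeps splitting.
demo_results = {}
for label, leaves in [("underfit", 2), ("balanced", 20), ("overfit", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(Xd_train, yd_train)
    train_mae = mean_absolute_error(yd_train, tree.predict(Xd_train))
    val_mae = mean_absolute_error(yd_val, tree.predict(Xd_val))
    demo_results[label] = (train_mae, val_mae)
    print(f"{label:9s} train MAE = {train_mae:.3f}, validation MAE = {val_mae:.3f}")
```

A large gap between the training and validation error is the usual sign of overfitting; a high error on both is the usual sign of underfitting.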

In [2]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [57]:
#we load the train dataset
house_data_train = pd.read_csv("Data/train.csv")
#we then initialise X and y; recall that X holds the features (the input) and y is our output (the target)
house_data_train_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = house_data_train[house_data_train_features]
y = house_data_train.SalePrice

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

In [59]:
#next, we build the model using the training features and output, then predict on that same training data
house_model = DecisionTreeRegressor(random_state=1)
house_model.fit(X_train,y_train)
price_predictions = house_model.predict(X_train)
price_predictions

array([307000., 223500., 145000., ..., 127000.,  89500.,  81000.])

In [60]:
y_train.head()

6       307000
807     223500
955     145000
1040    155000
701     140000
Name: SalePrice, dtype: int64

In [61]:
#the above predictions were made on the training data, so it's no surprise that they match the actual values
#let's test on validation data

In [62]:
real_predictions = house_model.predict(X_test)
mean_absolute_error(y_test, real_predictions)

29652.931506849316

In [63]:
#a mean absolute error of approximately $30,000
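
To make the metric concrete, here is the same calculation done by hand: take the absolute difference between each actual and predicted value, then average those differences. The three prices below are hypothetical values made up for illustration; they are not taken from train.csv.

```python
# Hypothetical prices for illustration only (not from train.csv).
actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 290000]

# Absolute error per house: 10000, 10000, 30000
absolute_errors = [abs(a - p) for a, p in zip(actual_prices, predicted_prices)]

# Mean absolute error: the average of those absolute errors.
manual_mae = sum(absolute_errors) / len(absolute_errors)
print(round(manual_mae, 2))  # 16666.67
```

Because the errors are averaged in dollars, the result reads directly as "on average, the prediction is off by this much".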

In [64]:
#now that we have everything set up, let's see how the size of the tree (its maximum number of leaf nodes) affects the accuracy of our predictions

# get_mae function

In [65]:
#the function below takes in five arguments: the maximum number of leaf nodes we want, the training and validation
#features, and the training and validation outputs. it returns the validation mae for that tree size

In [124]:
def get_mae(max_leaf_nodes, train_X, test_X, train_y, test_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    predictions = model.predict(test_X)
    mae = mean_absolute_error(test_y, predictions)
    return mae

In [125]:
#the above function fits a tree limited to max_leaf_nodes leaves on the training data, predicts on the validation
#data, and returns the mae. if it is hard to follow, i highly recommend taking a programming course
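
Note that `max_leaf_nodes` limits the number of leaves, which is related to the depth of the tree but not the same thing. The sketch below compares the two controls on synthetic data; the data and the names `X_toy`, `y_toy`, `leaf_cap`, and `depth_cap` are assumptions made up for illustration.

```python
import math

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 5, size=200)

# Limit the number of leaves, as get_mae does.
leaf_cap = 8
leaf_tree = DecisionTreeRegressor(max_leaf_nodes=leaf_cap, random_state=0)
leaf_tree.fit(X_toy, y_toy)

# Limit the depth instead.
depth_cap = 3
depth_tree = DecisionTreeRegressor(max_depth=depth_cap, random_state=0)
depth_tree.fit(X_toy, y_toy)

print("max_leaf_nodes tree:", leaf_tree.get_n_leaves(), "leaves, depth", leaf_tree.get_depth())
print("max_depth tree:     ", depth_tree.get_n_leaves(), "leaves, depth", depth_tree.get_depth())
```

A tree capped by leaf count is free to grow deeper along some branches than others, so the two settings usually produce different trees even when they allow the same number of leaves.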

In [127]:
#next, we store different max_leaf_nodes values in a list and loop over them, calling the get_mae function each time
leaf_nodes = [5, 25, 50, 100, 250, 500]

In [132]:
#there are different ways to write loops in python, let's use a simple one
scores = {}
for node in leaf_nodes:
    score = get_mae(node, X_train, X_test, y_train, y_test)
    scores[node] = score

scores

{5: 35044.51299744237,
 25: 29016.41319191076,
 50: 27405.930473214907,
 100: 27282.50803885739,
 250: 27893.822225701646,
 500: 29454.18598068598}

In [135]:
#just looking at it manually, we can see that the minimum mae comes from using a max_leaf_nodes of 100 (a leaf count, not a depth)
#using 100 leaf nodes on this particular dataset gives us a mae of approx $27,300, down from approx $29,650 with the unrestricted tree
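
Instead of reading the dictionary by eye, the best setting can be picked programmatically. A minimal sketch, with the `scores` values copied from the output above; the names `best_leaf_nodes` and `best_mae` are introduced here for illustration.

```python
# Validation MAE per max_leaf_nodes value, copied from the output above.
scores = {
    5: 35044.51299744237,
    25: 29016.41319191076,
    50: 27405.930473214907,
    100: 27282.50803885739,
    250: 27893.822225701646,
    500: 29454.18598068598,
}

# min() over the keys, ranked by their MAE, returns the key with the lowest error.
best_leaf_nodes = min(scores, key=scores.get)
best_mae = scores[best_leaf_nodes]
print(best_leaf_nodes, round(best_mae))  # 100 27283
```

This scales better than manual inspection when the list of candidate values grows.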