### Datasets:
[melb_data.csv](https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/data?select=melb_data.csv)

### Imports

In [16]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# 1st ML model

In [2]:
melbourne_data = pd.read_csv('data/melb_data.csv')
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [3]:
# dropna(axis=0) drops every row that contains a missing value (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

In [4]:
X = melbourne_data[['Rooms','Bathroom','Landsize','Lattitude','Longtitude']]
y = melbourne_data.Price

In [5]:
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X,y)
print(X.head())
print(melbourne_model.predict(X.head()))

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
[1035000. 1465000. 1600000. 1876000. 1636000.]


# Model validation
*A tree's depth is a measure of how many splits it makes before coming to a prediction.
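
To make the note on depth concrete, here is a minimal sketch that reads the depth of a fitted tree with `get_depth()`. The synthetic data and the names `X_demo`, `y_demo`, `shallow_tree` and `full_tree` are assumptions for illustration only; they are not part of the Melbourne notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption, not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# One tree capped at two levels of splits, one grown without any limit
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# get_depth() returns the longest chain of splits from the root to a leaf
print("shallow tree depth:", shallow_tree.get_depth())
print("full tree depth:   ", full_tree.get_depth())
```

The unrestricted tree keeps splitting until its leaves are pure, so it should come out much deeper than the capped one.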

In [6]:
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X_train, y_train)

predictions = melbourne_model.predict(X_test)
print(mean_absolute_error(y_test, predictions))

274447.0757477943


# Underfitting and Overfitting

- `overfitting` - when a model matches the training data almost perfectly but does poorly on validation and other new data. It has captured spurious patterns that won't recur in the future, which leads to less accurate predictions. <br> <br>
- `underfitting` - when a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data, again leading to less accurate predictions. <br> <br>
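
The two definitions above can be seen side by side by comparing training and validation error for a very small tree and an unrestricted one. This is only a sketch on synthetic data; the noisy sine target and the names `X_demo`, `y_demo` and `results` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

results = {}
# 2 leaves is too few to follow the curve; None places no limit on the tree
for label, leaves in [("underfit", 2), ("overfit", None)]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    results[label] = (train_mae, val_mae)
    print(f"{label:8s} train MAE: {train_mae:.3f}   validation MAE: {val_mae:.3f}")
```

The unrestricted tree memorises the training rows, so its training error is essentially zero while its validation error is not. The two-leaf tree has a large error on both sets.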

**max_leaf_nodes** provides a very sensible way to control overfitting and underfitting in decision trees. The more leaves we allow the model to make, the more we move from underfitting toward overfitting.

In [12]:
def get_mae(n):
    model = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    return mae

for max_leaf_nodes in [5, 25, 50, 100, 250, 500]:
    my_mae = get_mae(max_leaf_nodes)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  385696
Max leaf nodes: 25  		 Mean Absolute Error:  307919
Max leaf nodes: 50  		 Mean Absolute Error:  279794
Max leaf nodes: 100  		 Mean Absolute Error:  269191
Max leaf nodes: 250  		 Mean Absolute Error:  269945
Max leaf nodes: 500  		 Mean Absolute Error:  261718


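The loop above compares candidate tree sizes on held-out data. In this run 500 leaves gives the lowest MAE, though the gain over 100 leaves is under 3%. If you were going to deploy this model in practice, you would refit it on all of the data with the chosen tree size, which typically makes it more accurate, because the validation split is no longer needed once the size is fixed. The final model below uses 100 leaves.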
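
The tree size can also be picked in code rather than by eye. The sketch below hard-codes the MAE values printed above into a dict; in the notebook itself the dict could be built by calling `get_mae` for each candidate. The names `scores` and `best_tree_size` are illustrative assumptions.

```python
# MAE per candidate size, copied from the output printed above
scores = {
    5: 385696,
    25: 307919,
    50: 279794,
    100: 269191,
    250: 269945,
    500: 261718,
}

# The key with the smallest MAE is the best tree size among those tried
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

This only ranks the sizes that were tried, and the differences between the larger sizes are small, so a simpler tree can still be a reasonable choice.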

*In a binary tree, if there are n leaf nodes there are also n-1 "splits".
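
The leaf and split counts can be checked on a fitted tree. This is a minimal sketch on synthetic data (an assumption for illustration); it counts splits as the internal nodes, that is, all nodes minus the leaves, using `tree_.node_count` and `get_n_leaves()`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption, not the Melbourne dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1, size=300)

tree = DecisionTreeRegressor(max_leaf_nodes=10, random_state=0).fit(X_demo, y_demo)

n_leaves = tree.get_n_leaves()
# Every node that is not a leaf is a split
n_splits = tree.tree_.node_count - n_leaves
print("leaves:", n_leaves, "splits:", n_splits)
```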

In [13]:
final_model = DecisionTreeRegressor(max_leaf_nodes=100)
final_model.fit(X,y)

DecisionTreeRegressor(max_leaf_nodes=100)

# Random Forests

In [17]:
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train)
pred = forest_model.predict(X_test)
mean_absolute_error(y_test, pred)

207190.6873773146