# **Ensemble Trees Exercise**

_John Andrew Dixon_

---

**Setup**

In [146]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV

In [147]:
# Remote URL to the data
url ="https://docs.google.com/spreadsheets/d/e/2PACX-1vSQc1CsJ25nPMJcuJD04csFCysrzuInd_IQ_drLza49m_3R4MllPcuhduu4GozMJun3MgUJkGl0cw-d/pub?output=csv"
# Load and verify data
df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


In [148]:
# Function for evaluating regression R^2 scores
def metrics(regressor, training, testing):
    """Function for evaluating a model's R^2 scores"""
    predictions = {
        "training": regressor.predict(training["X"]),
        "testing": regressor.predict(testing["X"])
    }

    metrics_df = pd.DataFrame({
        "R2": [regressor.score(training["X"], training["y"]), regressor.score(testing["X"], testing["y"])],
        "MAE": [mean_absolute_error(training["y"], predictions["training"]), mean_absolute_error(testing["y"], predictions["testing"])],
        "MSE": [mean_squared_error(training["y"], predictions["training"]), mean_squared_error(testing["y"], predictions["testing"])],
        "RMSE": [np.sqrt(mean_squared_error(training["y"], predictions["training"])), np.sqrt(mean_squared_error(testing["y"], predictions["testing"]))]
    }, index=["Training", "Testing"])
    return metrics_df

---

## **Tasks**

### **Try a Decision Tree, Bagged Tree, and Random Forest.**

In [149]:
# Create feature matrix and target vector
X = df.drop(columns="PRICE")
y = df["PRICE"]

In [150]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
training_set = {"X": X_train, "y": y_train}
testing_set = {"X": X_test, "y": y_test}

In [151]:
# Bring default trees and forest regressors into the mix
dt_default = DecisionTreeRegressor(random_state=42)
bt_default = BaggingRegressor(random_state=42)
rf_default = RandomForestRegressor(random_state=42)

In [152]:
# Fit all of them to the training data
dt_default.fit(X_train, y_train)
bt_default.fit(X_train, y_train)
rf_default.fit(X_train, y_train)
print(end='')

In [153]:
display(metrics(dt_default, training_set, testing_set))
display(metrics(bt_default, training_set, testing_set))
metrics(rf_default, training_set, testing_set)

Unnamed: 0,R2,MAE,MSE,RMSE
Training,1.0,0.0,0.0,0.0
Testing,0.619323,3.140945,26.657717,5.163111


Unnamed: 0,R2,MAE,MSE,RMSE
Training,0.960676,1.103325,3.487356,1.867446
Testing,0.820421,2.315512,12.575417,3.546183


Unnamed: 0,R2,MAE,MSE,RMSE
Training,0.977134,0.953546,2.027774,1.423999
Testing,0.833853,2.207858,11.634795,3.410981


### **Tune each model to optimize performance on the test set.**
- After using a loop to tune each model, remember to create the best version of the model using the best hyperparameter values for the model based on the metrics you generated in your loop. The metrics from this best version model are what you will compare to the metrics of the other best version models to determine the overall best model.

#### **Decision Tree Tuning**

In [154]:
# I've arrived at these specific hyperparameter tunings through rote
# trial and error. I literally tried almost all hyperparameters for DecisionTrees 
# and this way, so far, had the highest score (R^2)
param_grid = {
    "criterion": ["squared_error", "friedman_mse", "absolute_error", "poisson"],
    "max_depth": [number for number in range(1, 101, 1)],
    "max_features": [number for number in range(1, 7, 1)]
}

dt_grid_search = GridSearchCV(
    estimator=dt_default,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

dt_grid_search.fit(X_train, y_train)
# dt_grid_search.best_estimator_
metrics(dt_grid_search.best_estimator_, training_set, testing_set)

Unnamed: 0,R2,MAE,MSE,RMSE
Training,0.916706,1.550396,7.386623,2.717834
Testing,0.723474,2.738976,19.364311,4.40049


#### **Bagged Tree Tuning**

In [163]:
param_grid = {
    "n_estimators": [1 00, 200],
    "max_samples": [100, 200]
}

bt_grid_search = GridSearchCV(
    estimator=bt_default,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

bt_grid_search.fit(X_train, y_train)
display(bt_grid_search.best_params_)
metrics(bt_grid_search.best_estimator_, training_set, testing_set)

{'max_samples': 200, 'n_estimators': 200}

Unnamed: 0,R2,MAE,MSE,RMSE
Training,0.942086,1.497902,5.135922,2.266257
Testing,0.822618,2.139587,12.421562,3.524424


### **Evaluate your best model using multiple regression metrics.**

### **Explain in a text cell how your model will perform if deployed by referring to the metrics.  Ex. How close can your stakeholders expect its predictions to be to the true value?**