# **Ensemble Trees Exercise**

_John Andrew Dixon_

---

**Setup**

In [372]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

In [373]:
# Remote URL to the data
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSQc1CsJ25nPMJcuJD04csFCysrzuInd_IQ_drLza49m_3R4MllPcuhduu4GozMJun3MgUJkGl0cw-d/pub?output=csv"
# Load and verify data
df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


In [374]:
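# Helper for computing training and testing R^2 scores
def regressor_eval(regressor, tts_tuple, verbose=False):
    """Return (training R^2, testing R^2) for a fitted regressor.

    tts_tuple is (X_train, X_test, y_train, y_test), in the order
    returned by train_test_split.
    """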
    training_r2 = regressor.score(tts_tuple[0], tts_tuple[2])
    testing_r2 = regressor.score(tts_tuple[1], tts_tuple[3])
    if verbose:
        print(type(regressor))
        print("Training R-squared:", training_r2)
        print("Testing R-squared:", testing_r2)
        print()

    return training_r2, testing_r2

---

## **Tasks**

### **Try a Decision Tree, Bagged Tree, and Random Forest.**

In [375]:
# Create feature matrix and target vector
X = df.drop(columns="PRICE")
y = df["PRICE"]

In [376]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_test_split_tuple = (X_train, X_test, y_train, y_test)

In [377]:
# Instantiate each model with its default hyperparameters
dt_default = DecisionTreeRegressor(random_state=42)
bt_default = BaggingRegressor(random_state=42)
rf_default = RandomForestRegressor(random_state=42)

In [378]:
# Fit all of them to the training data
dt_default.fit(X_train, y_train)
bt_default.fit(X_train, y_train)
rf_default.fit(X_train, y_train)
print(end='')

In [379]:
regressor_eval(dt_default, train_test_split_tuple, True)
regressor_eval(bt_default, train_test_split_tuple, True)
regressor_eval(rf_default, train_test_split_tuple, True)
print(end='')

<class 'sklearn.tree._classes.DecisionTreeRegressor'>
Training R-squared: 1.0
Testing R-squared: 0.6193230918136841

<class 'sklearn.ensemble._bagging.BaggingRegressor'>
Training R-squared: 0.9606756023782893
Testing R-squared: 0.8204208271364619

<class 'sklearn.ensemble._forest.RandomForestRegressor'>
Training R-squared: 0.9771342521069045
Testing R-squared: 0.8338530730048258



### **Tune each model to optimize performance on the test set.**
- After tuning each model in a loop, refit the best version of that model using the hyperparameter values that scored best in the loop. Compare the metrics of these best-version models to determine the overall best model.

#### **Decision Tree Tuning**

In [380]:
# DataFrame to hold the various hyperparameter tunings
depths = pd.DataFrame({
    "Depth": [],
    "Train R-Squared": [],
    "Test R-Squared": []
})
# Loop through potential parameters
for number in range(1, 51):
    temp_decision_tree = DecisionTreeRegressor(max_depth=number, criterion="friedman_mse", random_state=42)
    temp_decision_tree.fit(X_train, y_train)
    train_r2, test_r2 = regressor_eval(temp_decision_tree, train_test_split_tuple)
    depths.loc[-1] = [number, train_r2, test_r2]
    depths.index = depths.index + 1
# Show the three best results by test R-squared
depths.sort_values("Test R-Squared", ascending=False).head(3)

|    | Depth | Train R-Squared | Test R-Squared |
|----|-------|-----------------|----------------|
| 39 | 11.0  | 0.9911          | 0.847881       |
| 43 | 7.0   | 0.958517        | 0.834341       |
| 40 | 10.0  | 0.986796        | 0.822949       |
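The task asks for a best-version model built from the winning hyperparameters, so the step after the tuning table can be sketched as below. This is a minimal sketch, not the notebook's own code: `refit_best_depth` is a hypothetical helper, and the data comes from `make_regression` as a synthetic stand-in (an assumption) so the block runs without the housing CSV.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def refit_best_depth(results, X_train, y_train, random_state=42, **tree_kwargs):
    """Refit a decision tree at the depth with the highest test R^2 in `results`.

    `results` needs "Depth" and "Test R-Squared" columns, like the tuning table.
    """
    best_row = results.loc[results["Test R-Squared"].idxmax()]
    best_depth = int(best_row["Depth"])
    model = DecisionTreeRegressor(max_depth=best_depth,
                                  random_state=random_state,
                                  **tree_kwargs)
    model.fit(X_train, y_train)
    return best_depth, model


# Synthetic stand-in data (assumption: the notebook itself uses the housing CSV)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0,
                                 random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

# Build a small tuning table with the same columns as `depths`
rows = []
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(Xtr, ytr)
    rows.append({"Depth": depth,
                 "Train R-Squared": tree.score(Xtr, ytr),
                 "Test R-Squared": tree.score(Xte, yte)})
demo_results = pd.DataFrame(rows)

best_depth, best_tree = refit_best_depth(demo_results, Xtr, ytr)
print("Best depth:", best_depth)
print("Test R-squared:", best_tree.score(Xte, yte))
```

In the notebook, the equivalent call would pass `depths`, `X_train`, and `y_train`, plus `criterion="friedman_mse"` so that the refit tree matches the settings used in the tuning loop.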


#### **Bagged Tree Tuning**

In [383]:
# DataFrame to hold the various hyperparameter tunings
estimator = pd.DataFrame({
    "Estimator": [],
    "Train R-Squared": [],
    "Test R-Squared": []
})
# Loop through potential parameters
for number in range(1, 51):
    temp_bagged_tree = BaggingRegressor(n_estimators=number, random_state=42)
    temp_bagged_tree.fit(X_train, y_train)
    train_r2, test_r2 = regressor_eval(temp_bagged_tree, train_test_split_tuple)
    estimator.loc[-1] = [number, train_r2, test_r2]
    estimator.index = estimator.index + 1
# Show the three best results by test R-squared
estimator.sort_values("Test R-Squared", ascending=False).head(3)

|    | Estimator | Train R-Squared | Test R-Squared |
|----|-----------|-----------------|----------------|
| 42 | 8.0       | 0.964612        | 0.837922       |
| 45 | 5.0       | 0.958592        | 0.835033       |
| 43 | 7.0       | 0.958544        | 0.834858       |


#### **Random Forest Tuning**

In [385]:
# DataFrame to hold the various hyperparameter tunings
hyperparams = pd.DataFrame({
    "Estimator": [],
    "Depth": [],
    "Train R-Squared": [],
    "Test R-Squared": []
})

# Loop through every combination of forest size and tree depth
# (n_trees avoids shadowing the `estimator` DataFrame from the bagged tree cell)
for n_trees in range(1, 51):
    for depth in range(1, 51):
        temp_random_forest = RandomForestRegressor(max_depth=depth,
                                                   n_estimators=n_trees,
                                                   random_state=42)
        temp_random_forest.fit(X_train, y_train)
        train_r2, test_r2 = regressor_eval(temp_random_forest, train_test_split_tuple)
        hyperparams.loc[-1] = [n_trees, depth, train_r2, test_r2]
        hyperparams.index = hyperparams.index + 1

hyperparams.sort_values("Test R-Squared", ascending=False).head(3)

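|      | Estimator | Depth | Train R-Squared | Test R-Squared |
|------|-----------|-------|-----------------|----------------|
| 2140 | 8.0       | 10.0  | 0.96026         | 0.845709       |
| 2190 | 7.0       | 10.0  | 0.954286        | 0.840736       |
| 2138 | 8.0       | 12.0  | 0.961838        | 0.840439       |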
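The nested loop above scores every combination against the test set, as the exercise asks. A common alternative, shown here only as a hedged sketch, is scikit-learn's `GridSearchCV`, which picks hyperparameters by cross-validation on the training data and leaves the test set for a final check. The synthetic data from `make_regression` and the small grid are assumptions chosen to keep the example fast; they are not the notebook's data or search range.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: the notebook itself uses the housing CSV)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0,
                                 random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

# A deliberately small grid; the notebook's loop covers 1..50 for both values
param_grid = {"n_estimators": [5, 10, 20], "max_depth": [2, 4, 8]}

# 5-fold cross-validation on the training split, scored by R^2
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring="r2")
search.fit(Xtr, ytr)

# best_estimator_ is refit on the full training split with the best parameters
print("Best params:", search.best_params_)
print("Mean CV R-squared:", search.best_score_)
print("Held-out test R-squared:", search.best_estimator_.score(Xte, yte))
```

Because the test split plays no part in the selection, the final test score from this approach is a less optimistic estimate than one taken from the same split used to choose the hyperparameters.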


### **Evaluate your best model using multiple regression metrics.**
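One way to approach this task is a small helper that reports several metrics at once. The sketch below is an assumption about how the evaluation could look, not the notebook's own result: `regression_report` is a hypothetical helper, and the sample values are hand-made for illustration rather than taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mae,                    # average absolute error, in target units
        "MSE": mse,                    # average squared error
        "RMSE": float(np.sqrt(mse)),   # root of MSE, back in target units
        "R2": r2_score(y_true, y_pred),
    }


# Hand-made illustrative values (not from the housing data)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In the notebook, the same helper could be called as `regression_report(y_test, best_model.predict(X_test))`, where `best_model` is a placeholder name for whichever tuned model is selected. MAE and RMSE are expressed in the same units as `PRICE`, which makes them the natural metrics to quote to stakeholders.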

### **Explain in a text cell how your model will perform if deployed, referring to the metrics. For example, how close can your stakeholders expect its predictions to be to the true value?**