# **Ensemble Trees Exercise (Core)**

**Name:** Kellianne Yang

# Assignment:

You will use the [Boston Housing Data](https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8cbwauNV5rkFP_hFp8-ZEgY_r3ZEQDcFVo0QshmP7Z9dGZaSXRE7nwFLg2wM43zIh2biZ40Cbv4Mh/pub?gid=2001589399&single=true&output=csv) that you have used for previous exercises including the Decision Tree Regressor. See if you can improve your results by using these ensemble methods!

Your task is to create the best possible model to predict house prices.

1. Try a Decision Tree, Bagged Tree, and Random Forest.

2. Tune each model to optimize performance on the test set.

-    After using a loop to tune each model, remember to create the best version of the model using the best hyperparameter values for the model based on the metrics you generated in your loop. The metrics from this best version model are what you will compare to the metrics of the other best version models to determine the overall best model.

3. Evaluate your best model using multiple regression metrics.

4. Explain in a text cell how your model will perform if deployed by referring to the metrics.  Ex. How close can your stakeholders expect its predictions to be to the true value?






# Preliminary Steps

In [None]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# import data
path = '/content/drive/MyDrive/Coding Dojo/06 Week 6: Regression Models/Boston_Housing_from_Sklearn.csv'
df = pd.read_csv(path)

In [None]:
# inspect
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


In [None]:
# inspect
df.sample(10)

index,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
196,0.04011,0.404,7.287,34.1,12.6,4.08,33.3
260,0.54011,0.647,7.203,81.8,13.0,9.59,33.8
444,12.8023,0.74,5.854,96.6,20.2,23.79,10.8
212,0.21719,0.489,5.807,53.8,18.6,16.03,22.4
486,5.69175,0.583,6.114,79.8,20.2,14.98,19.1
469,13.0751,0.58,5.713,56.7,20.2,14.76,20.1
248,0.16439,0.431,6.433,49.1,19.1,9.52,24.5
141,1.62864,0.624,5.019,100.0,21.2,34.41,14.4
243,0.12757,0.428,6.393,7.8,16.6,5.19,23.7
144,2.77974,0.871,4.903,97.8,14.7,29.29,11.8


In [None]:
# 'PRICE' is the median home price in thousands of dollars, so I will convert
# the values to dollars to make it easier to interpret errors in model
# predictions later

df['PRICE'] = df['PRICE'] * 1000

In [None]:
# inspect
df.sample(10)

index,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
207,0.25199,0.489,5.783,72.7,18.6,18.06,22500.0
495,0.17899,0.585,5.67,28.8,19.2,17.6,23100.0
126,0.38735,0.581,5.613,95.6,19.1,27.26,15700.0
141,1.62864,0.624,5.019,100.0,21.2,34.41,14400.0
459,6.80117,0.713,6.081,84.4,20.2,14.7,20000.0
66,0.04379,0.398,5.787,31.1,16.1,10.24,19400.0
184,0.08308,0.488,5.604,89.8,17.8,13.98,26400.0
14,0.63796,0.538,6.096,84.5,21.0,10.26,18200.0
168,2.3004,0.605,6.319,96.1,14.7,11.1,23800.0
17,0.7842,0.538,5.99,81.7,21.0,14.67,17500.0


In [None]:
# inspect
df.describe(include = 'number')

statistic,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,0.554695,6.284634,68.574901,18.455534,12.653063,22532.806324
std,8.601545,0.115878,0.702617,28.148861,2.164946,7.141062,9197.104087
min,0.00632,0.385,3.561,2.9,12.6,1.73,5000.0
25%,0.082045,0.449,5.8855,45.025,17.4,6.95,17025.0
50%,0.25651,0.538,6.2085,77.5,19.05,11.36,21200.0
75%,3.677083,0.624,6.6235,94.075,20.2,16.955,25000.0
max,88.9762,0.871,8.78,100.0,22.0,37.97,50000.0


There is no missing data, so we will not need any imputers in the preprocessing step.

All the variables are continuous numeric values, so we will not need ordinal or one-hot encoding.

Since we will be building tree-based models, we will not need to scale our features. 

Taking into account all of the above, we will not need to preprocess this data before using it with our models.
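
The two checks behind this conclusion (no missing values, no non-numeric columns) can be sketched as below. This is a minimal illustration that assumes a small hypothetical frame named `sample`, built from three rows of the sample output above rather than from the full CSV.

```python
import pandas as pd

# Three rows copied from the sample output above (illustrative subset only).
sample = pd.DataFrame(
    {
        "CRIM": [0.04011, 0.54011, 12.8023],
        "NOX": [0.404, 0.647, 0.74],
        "RM": [7.287, 7.203, 5.854],
        "AGE": [34.1, 81.8, 96.6],
        "PTRATIO": [12.6, 13.0, 20.2],
        "LSTAT": [4.08, 9.59, 23.79],
        "PRICE": [33.3, 33.8, 10.8],
    }
)

# Count missing values per column; all zeros means no imputer is needed.
missing_per_column = sample.isna().sum()

# List any non-numeric columns; an empty list means no encoding is needed.
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

Running the same two lines on the full `df` would confirm what `df.info()` already reports.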

# Machine Learning Preprocessing

In [None]:
# assign y and X
target = 'PRICE'

y = df[target]
display(y)

X = df.drop(columns = target)
display(X)

0      24000.0
1      21600.0
2      34700.0
3      33400.0
4      36200.0
        ...   
501    22400.0
502    20600.0
503    23900.0
504    22000.0
505    11900.0
Name: PRICE, Length: 506, dtype: float64

index,CRIM,NOX,RM,AGE,PTRATIO,LSTAT
0,0.00632,0.538,6.575,65.2,15.3,4.98
1,0.02731,0.469,6.421,78.9,17.8,9.14
2,0.02729,0.469,7.185,61.1,17.8,4.03
3,0.03237,0.458,6.998,45.8,18.7,2.94
4,0.06905,0.458,7.147,54.2,18.7,5.33
...,...,...,...,...,...,...
501,0.06263,0.573,6.593,69.1,21.0,9.67
502,0.04527,0.573,6.120,76.7,21.0,9.08
503,0.06076,0.573,6.976,91.0,21.0,5.64
504,0.10959,0.573,6.794,89.3,21.0,6.48


In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
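
Since no `test_size` is passed, `train_test_split` falls back to its default of a 25% test set. The sketch below uses hypothetical placeholder arrays (`X_demo`, `y_demo`) with the same 506 rows and 6 features to show the resulting split sizes. It does not use the housing data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (506) and features (6) as the data.
X_demo = np.zeros((506, 6))
y_demo = np.zeros(506)

# Same call as above: default test_size (0.25) and a fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

With 506 rows this gives 379 training rows and 127 test rows, so every test metric in this notebook is computed on 127 homes.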

# Decision Tree Model

In [None]:
# create (start with default parameters)
dec_tree = DecisionTreeRegressor(random_state = 42)

In [None]:
# fit the model on training data only
dec_tree.fit(X_train, y_train)

In [None]:
# inspect the parameters that can be tuned for this model
dec_tree.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 42,
 'splitter': 'best'}

In [None]:
# to tune the hyperparameter max_depth, we will find the values to loop through
# by inspecting the max_depth of the default tree
dec_tree.get_depth()

20

In [None]:
# set up values for max_depth to loop through
depths = list(range(2, 21))

In [None]:
# create a dataframe 'scores' that will store the r2 scores for each value
# of max_depth
scores = pd.DataFrame(index = depths, columns = ['Test Score', 'Train Score'])

In [None]:
# populate scores by creating a model for each max_depth value in a loop and 
# calculating its test and train scores

# for each value in list depths
for depth in depths:

  # make new decision tree model with that depth
  dec_tree = DecisionTreeRegressor(max_depth = depth, random_state = 42)

  # fit the model on the training data only
  dec_tree.fit(X_train, y_train)

  # get the r2 scores on the training data
  train_score = dec_tree.score(X_train, y_train)

  # get the r2 scores on the testing data
  test_score = dec_tree.score(X_test, y_test)

  # save the training scores in the scores df
  scores.loc[depth, 'Train Score'] = train_score

  # save the testing scores in the scores df
  scores.loc[depth, 'Test Score'] = test_score

In [None]:
# find best max_depth value

# sort the scores df to find the best test_score of the models from the loop
sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)

In [None]:
# view top five test scores
sorted_scores.head()

max_depth,Test Score,Train Score
7,0.835841,0.958517
9,0.820885,0.982104
18,0.81917,0.999999
15,0.81228,0.999476
19,0.780687,1.0
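
Rather than reading the best depth off the sorted table, it can be picked programmatically. The sketch below assumes a hypothetical frame `top_scores` holding the five rows shown above. The cast to float matters because a frame that starts out empty and is filled in a loop stores its values as generic objects.

```python
import pandas as pd

# Top five rows copied from the table above (index = max_depth).
top_scores = pd.DataFrame(
    {
        "Test Score": [0.835841, 0.820885, 0.81917, 0.81228, 0.780687],
        "Train Score": [0.958517, 0.982104, 0.999999, 0.999476, 1.0],
    },
    index=[7, 9, 18, 15, 19],
)

# Index label (max_depth) of the row with the highest test R2.
best_depth = top_scores["Test Score"].astype(float).idxmax()

# Train/test gap at that depth, a rough indicator of overfitting.
gap = top_scores.loc[best_depth, "Train Score"] - top_scores.loc[best_depth, "Test Score"]

print(best_depth, round(gap, 6))
```

The deeper trees in the table reach a train R2 of nearly 1.0 while their test R2 drops, which is the usual overfitting pattern for unconstrained decision trees.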


In [None]:
# create model with best max_depth value
dec_tree_7 = DecisionTreeRegressor(max_depth = 7, random_state = 42)

In [None]:
# fit model on training data only
dec_tree_7.fit(X_train, y_train)

In [None]:
# make function that will calculate all metrics for a model after it has been 
# fitted, and return a df with the metrics

def model_metrics(model):

  # create list of metrics and df to store calculations
  metrics = ['MAE', 'MSE', 'RMSE', 'R2']
  metrics = pd.DataFrame(index = metrics, 
                         columns = ['Train Score', 'Test Score'])

  # create training and testing predictions
  train_pred = model.predict(X_train)
  test_pred = model.predict(X_test)
  
  # calculate mae for training and testing data
  train_mae = mean_absolute_error(y_train, train_pred)
  test_mae = mean_absolute_error(y_test, test_pred)

  # calculate mse
  train_mse = mean_squared_error(y_train, train_pred)
  test_mse = mean_squared_error(y_test, test_pred)

  # calculate r2
  train_r2 = r2_score(y_train, train_pred)
  test_r2 = r2_score(y_test, test_pred)

  # store values in df
  metrics.loc['MAE', 'Train Score'] = train_mae
  metrics.loc['MAE', 'Test Score'] = test_mae
  metrics.loc['MSE', 'Train Score'] = train_mse
  metrics.loc['MSE', 'Test Score'] = test_mse
  metrics.loc['RMSE', 'Train Score'] = np.sqrt(train_mse)
  metrics.loc['RMSE', 'Test Score'] = np.sqrt(test_mse)
  metrics.loc['R2', 'Train Score'] = train_r2
  metrics.loc['R2', 'Test Score'] = test_r2

  return metrics
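
The function above reads `X_train`, `X_test`, `y_train`, and `y_test` from the notebook's global scope. One possible variant, sketched here under the hypothetical name `model_metrics_explicit`, takes the splits as arguments so it can be reused with other data. The demo uses synthetic data from `make_regression`, not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def model_metrics_explicit(model, X_tr, y_tr, X_te, y_te):
    """Return MAE, MSE, RMSE and R2 for a fitted model on both splits."""
    columns = {}
    splits = [("Train Score", X_tr, y_tr), ("Test Score", X_te, y_te)]
    for label, X_part, y_part in splits:
        pred = model.predict(X_part)
        mse = mean_squared_error(y_part, pred)
        columns[label] = {
            "MAE": mean_absolute_error(y_part, pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "R2": r2_score(y_part, pred),
        }
    # Outer keys become columns, inner keys become the row index.
    return pd.DataFrame(columns)


# Synthetic stand-in data so the sketch runs on its own.
X_syn, y_syn = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

tree = DecisionTreeRegressor(max_depth=7, random_state=42).fit(X_tr, y_tr)
demo_metrics = model_metrics_explicit(tree, X_tr, y_tr, X_te, y_te)
print(demo_metrics)
```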

In [None]:
# evaluate model using multiple regression metrics
dec_tree_7_metrics = model_metrics(dec_tree_7)

In [None]:
# display metrics df
display(dec_tree_7_metrics)

Metric,Train Score,Test Score
MAE,1346.729907,2535.43854
MSE,3678789.859786,11495597.386296
RMSE,1918.017169,3390.515799
R2,0.958517,0.835841


# Bagged Tree Model

In [None]:
# create (start with default parameters)
bagreg = BaggingRegressor(random_state = 42)

In [None]:
# inspect the parameters that can be tuned for this model
bagreg.get_params()

{'base_estimator': 'deprecated',
 'bootstrap': True,
 'bootstrap_features': False,
 'estimator': None,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
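
In the parameters above, `'estimator': None` means the bagging regressor falls back to scikit-learn's default base estimator, a decision tree. The sketch below checks this on synthetic data (the names `X_syn`, `y_syn`, and `bag` are illustrative only) by inspecting the fitted `estimators_` list.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data so the sketch runs on its own.
X_syn, y_syn = make_regression(n_samples=100, n_features=6, noise=5.0, random_state=42)

# Default bagging regressor: no estimator specified, n_estimators left at 10.
bag = BaggingRegressor(random_state=42).fit(X_syn, y_syn)

# Each fitted member of the ensemble is stored in estimators_.
n_trees = len(bag.estimators_)
all_trees = all(isinstance(est, DecisionTreeRegressor) for est in bag.estimators_)

print(n_trees, all_trees)
```

This is why the model is called a "bagged tree": it averages the predictions of trees that were each fitted on a bootstrap sample of the training data.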

In [None]:
# fit the model on training data only
bagreg.fit(X_train, y_train)

In [None]:
# to tune the hyperparameter n_estimators, we will run the model with several
# estimator values

# set up values for n_estimators to loop through
estimators = [10, 20, 30, 40, 50, 100]

In [None]:
# create a dataframe 'scores' that will store the r2 scores for each value
# of n_estimators
scores = pd.DataFrame(index = estimators, columns = ['Test Score', 'Train Score'])

In [None]:
# populate scores by creating a model for each n_estimator value in a loop and 
# calculating its test and train scores

# for each value in estimators
for num_estimators in estimators:

  # make new bagging regressor model with that number of estimators
  bagreg = BaggingRegressor(n_estimators = num_estimators,
                            random_state = 42)
  
  # fit the model on the training data only
  bagreg.fit(X_train, y_train)

  # get the r2 scores on the training data
  train_score = bagreg.score(X_train, y_train)

  # get the r2 scores on the testing data
  test_score = bagreg.score(X_test, y_test)

  # save the training scores in the scores df
  scores.loc[num_estimators, 'Train Score'] = train_score

  # save the testing scores in the scores df
  scores.loc[num_estimators, 'Test Score'] = test_score

In [None]:
# find the best number of estimators

# sort the scores df to find the best 'Test Score' of the models from the loop
sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)

In [None]:
# view the top five test scores
sorted_scores.head()

n_estimators,Test Score,Train Score
50,0.837614,0.974496
100,0.836429,0.976849
40,0.83622,0.97336
30,0.836017,0.972274
20,0.829472,0.970288


In [None]:
# create model with best number of estimators
bagreg_50 = BaggingRegressor(n_estimators = 50, 
                             random_state = 42)

In [None]:
# fit model on training data only
bagreg_50.fit(X_train, y_train)

In [None]:
# evaluate model using multiple regression metrics
bagreg_50_metrics = model_metrics(bagreg_50)

In [None]:
# display metrics df
display(bagreg_50_metrics)

Metric,Train Score,Test Score
MAE,973.25066,2228.062992
MSE,2261700.46438,11371441.259843
RMSE,1503.895098,3372.156767
R2,0.974496,0.837614


# Random Forest Model

In [None]:
# create (start with default parameters)
rf = RandomForestRegressor(random_state = 42)

In [None]:
# inspect the parameters that can be tuned for this model
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
# fit the model on training data only
rf.fit(X_train, y_train)

In [None]:
# to tune the hyperparameter max_depth, we will find the values to loop through
# by inspecting the maximum max_depth of the default forest

# find all the depths
est_depths = [estimator.get_depth() for estimator in rf.estimators_]

# get the maximum depth
max(est_depths)


22

In [None]:
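# set up values for max_depth to loop through
# (range(2, 22) stops at 21; a max_depth of 22 would not constrain any tree,
# so it should match the default forest fitted above)
depths = list(range(2, 22))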

In [None]:
# create a dataframe 'scores' that will store the r2 scores for each value
# of max_depth
scores = pd.DataFrame(index = depths, columns = ['Test Score', 'Train Score'])

In [None]:
# for each value in list depths
for depth in depths:

  # make new random forest model with that depth
  rf = RandomForestRegressor(max_depth = depth, random_state = 42)

  # fit the model on the training data only
  rf.fit(X_train, y_train)

  # get the r2 scores on the training data
  train_score = rf.score(X_train, y_train)

  # get the r2 scores on the testing data
  test_score = rf.score(X_test, y_test)

  # save the training scores in the scores df
  scores.loc[depth, 'Train Score'] = train_score

  # save the testing scores in the scores df
  scores.loc[depth, 'Test Score'] = test_score

In [None]:
# find best max_depth value

# sort the scores df to find the best test_score of the models from the loop
sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)

In [None]:
# view top five test scores
sorted_scores.head()

max_depth,Test Score,Train Score
21,0.84253,0.976292
20,0.842491,0.976255
16,0.842052,0.976569
19,0.8414,0.97644
18,0.839177,0.976455


The random forest with max_depth = 21 scored best on the test data, although the top five depths differ by less than 0.004 in test R2.

In [None]:
# to tune the hyperparameter n_estimators, we will run the model with several
# estimator values

# set up values for n_estimators to loop through
estimators = [50, 100, 150, 200, 250]

In [None]:
# create a dataframe 'scores' that will store the r2 scores for each value
# of n_estimators

scores = pd.DataFrame(index = estimators, columns = ['Test Score', 'Train Score'])

In [None]:
# populate scores by creating a model for each n_estimator value in a loop and 
# calculating its test and train scores

# for each value in estimators
for num_estimators in estimators:

  # make new random forest model with that number of estimators
  rf = RandomForestRegressor(n_estimators = num_estimators,
                             max_depth = 21,
                             random_state = 42)
  
  # fit the model on the training data only
  rf.fit(X_train, y_train)

  # get the r2 scores on the training data
  train_score = rf.score(X_train, y_train)

  # get the r2 scores on the testing data
  test_score = rf.score(X_test, y_test)

  # save the training scores in the scores df
  scores.loc[num_estimators, 'Train Score'] = train_score

  # save the testing scores in the scores df
  scores.loc[num_estimators, 'Test Score'] = test_score

In [None]:
# find the best number of estimators

# sort the scores df to find the best 'Test Score' of the models from the loop
sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)

In [None]:
# view the top five test scores
sorted_scores.head()

n_estimators,Test Score,Train Score
100,0.84253,0.976292
150,0.839876,0.975557
250,0.839674,0.976142
200,0.8394,0.97559
50,0.837857,0.973541


The random forest with n_estimators = 100 scored best on the test data.

In [None]:
# further tune this hyperparameter by trying num_estimators close to 100

# set up values for n_estimators to loop through
estimators = list(range(50, 150))

In [None]:
# create a dataframe 'scores' that will store the r2 scores for each value
# of n_estimators

scores = pd.DataFrame(index = estimators, columns = ['Test Score', 'Train Score'])

In [None]:
# populate scores by creating a model for each n_estimator value in a loop and 
# calculating its test and train scores

# for each value in estimators
for num_estimators in estimators:

  # make new random forest model with that number of estimators
  rf = RandomForestRegressor(n_estimators = num_estimators,
                             max_depth = 21,
                             random_state = 42)
  
  # fit the model on the training data only
  rf.fit(X_train, y_train)

  # get the r2 scores on the training data
  train_score = rf.score(X_train, y_train)

  # get the r2 scores on the testing data
  test_score = rf.score(X_test, y_test)

  # save the training scores in the scores df
  scores.loc[num_estimators, 'Train Score'] = train_score

  # save the testing scores in the scores df
  scores.loc[num_estimators, 'Test Score'] = test_score

In [None]:
# find the best number of estimators

# sort the scores df to find the best 'Test Score' of the models from the loop
sorted_scores = scores.sort_values(by = 'Test Score', ascending = False)

In [None]:
# view the top five test scores
sorted_scores.head()

n_estimators,Test Score,Train Score
52,0.844483,0.973523
95,0.844035,0.975779
53,0.843738,0.974053
96,0.843528,0.975995
89,0.843343,0.975141


The random forest with n_estimators = 52 scored best on the test data, though the top five values are within about 0.001 of each other in test R2.
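
The loops above tune `max_depth` first and `n_estimators` second, and both choices are made by looking at test scores, which tends to make the final test metrics somewhat optimistic. One common alternative, sketched here as an assumption and not as the method this assignment asks for, is to search both hyperparameters jointly with cross-validation on the training data only. The data is synthetic and the grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X_syn, y_syn = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Small illustrative grid; every combination is scored with 3-fold CV.
param_grid = {"max_depth": [3, 6, None], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)

# Only the training split is used to choose hyperparameters.
search.fit(X_tr, y_tr)
best_params = search.best_params_

# The test split is touched once, to score the refitted best model.
test_r2 = search.score(X_te, y_te)

print(best_params, round(test_r2, 3))
```

With this approach the test set stays untouched until the final evaluation, so its score is a fairer estimate of performance on unseen homes.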

In [None]:
# create model with best max_depth value AND best n_estimators value
rf_21_52 = RandomForestRegressor(max_depth = 21, 
                               n_estimators = 52,
                               random_state = 42)

In [None]:
# fit model on training data only
rf_21_52.fit(X_train, y_train)

In [None]:
# evaluate model using multiple regression metrics
rf_21_52_metrics = model_metrics(rf_21_52)

In [None]:
# display metrics df
display(rf_21_52_metrics)

Metric,Train Score,Test Score
MAE,980.244063,2206.848122
MSE,2347983.560247,10890399.846902
RMSE,1532.31314,3300.060582
R2,0.973523,0.844483


# The Best Model

In [None]:
# combine all metrics dfs into one df to compare them

# rename cols in DT scores to avoid confusion
dec_tree_7_metrics.rename(columns = {'Train Score': 'DT Train', 
                                     'Test Score': 'DT Test'}, 
                          inplace = True)

In [None]:
# rename cols in BR scores to avoid confusion
bagreg_50_metrics.rename(columns = {'Train Score': 'BR Train', 
                                    'Test Score': 'BR Test'}, 
                         inplace = True)

In [None]:
# rename cols in RF scores to avoid confusion
rf_21_52_metrics.rename(columns = {'Train Score': 'RF Train', 
                                   'Test Score': 'RF Test'}, 
                        inplace = True)

In [None]:
# join dfs
all_metrics = pd.concat([dec_tree_7_metrics,
                         bagreg_50_metrics,
                         rf_21_52_metrics],
                        axis=1)

In [None]:
# display all metrics
display(all_metrics)

Metric,DT Train,DT Test,BR Train,BR Test,RF Train,RF Test
MAE,1346.729907,2535.43854,973.25066,2228.062992,980.244063,2206.848122
MSE,3678789.859786,11495597.386296,2261700.46438,11371441.259843,2347983.560247,10890399.846902
RMSE,1918.017169,3390.515799,1503.895098,3372.156767,1532.31314,3300.060582
R2,0.958517,0.835841,0.974496,0.837614,0.973523,0.844483


In [None]:
# since we will evaluate our models on how they perform on the test data only,
# drop the train scores
test_metrics = all_metrics.drop(columns = ['DT Train', 'BR Train', 'RF Train'])

In [None]:
# display test metrics
display(test_metrics)

Metric,DT Test,BR Test,RF Test
MAE,2535.43854,2228.062992,2206.848122
MSE,11495597.386296,11371441.259843,10890399.846902
RMSE,3390.515799,3372.156767,3300.060582
R2,0.835841,0.837614,0.844483


In [None]:
# choose best model based on metrics
# for the 3 error scores (MAE, MSE, RMSE), the best will have the lowest score
# for R2, the best will have the highest score

The best model based on the metrics on the test data shown above is the Random Forest Regression model. It outperformed the Decision Tree Regression and Bagged Tree Regression models on all metrics. 
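
The comparison rule (lowest value wins for the error metrics, highest wins for R2) can be sketched as follows. The hypothetical frame `test_metrics_demo` holds the values copied from the test metrics table above.

```python
import pandas as pd

# Test metrics copied from the table above.
test_metrics_demo = pd.DataFrame(
    {
        "DT Test": [2535.43854, 11495597.386296, 3390.515799, 0.835841],
        "BR Test": [2228.062992, 11371441.259843, 3372.156767, 0.837614],
        "RF Test": [2206.848122, 10890399.846902, 3300.060582, 0.844483],
    },
    index=["MAE", "MSE", "RMSE", "R2"],
)

# For the error metrics, the best model is the column with the smallest value.
best_by_error = test_metrics_demo.loc[["MAE", "MSE", "RMSE"]].idxmin(axis=1)

# For R2, the best model is the column with the largest value.
best_by_r2 = test_metrics_demo.loc["R2"].idxmax()

print(best_by_error)
print(best_by_r2)
```

Every metric points to the same column, which matches the conclusion above. The margins are small, so a different `random_state` for the split could plausibly change the ranking.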

How close can the stakeholders expect the Random Forest Regression model's predictions to be to the true value?

How this model will perform if deployed:

According to the MAE score, which weights every error in proportion to its size, the model's price predictions were off by about $2,207 on average for the test data. Relative to the median home price, that is an error of about 10% (MAE/median as a percent, which is (2207/21200)*100).

According to the MSE score, which penalizes larger errors more heavily than smaller ones, the model's mean squared error on the test data is about 10,890,400 dollars squared (for scale, the median home price squared is 449,440,000, which is 21,200^2). Because the units are squared dollars, this number is hard to interpret directly.

According to the RMSE score, which penalizes errors similarly to the MSE but is expressed in the units of the original data (dollars) and is therefore easier to interpret, this model is making an error of about $3,300 on our test data. (Relative to the median home price, that is about 16%, RMSE/median as a percent, which is (3300/21200)*100.) RMSE can never be smaller than MAE, so it is the size of the gap that matters: here RMSE is roughly $1,100 above MAE, which suggests that some predictions miss by considerably more than the typical error.

According to the R2 score, which is unitless and can be compared across models, this model accounts for about 84.45% of the variance in price in the test data using the features we have.
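
The percentages quoted above can be reproduced with a few lines of arithmetic. The values are copied from the Random Forest test metrics and from the median price in the `describe()` output. The variable names are illustrative only.

```python
# Random Forest test metrics and the median home price, copied from above.
mae = 2206.848122
rmse = 3300.060582
median_price = 21200.0

# Errors expressed as a percentage of the median home price.
mae_pct = mae / median_price * 100
rmse_pct = rmse / median_price * 100

# Median price squared, for comparison with the MSE (dollars squared).
median_sq = median_price ** 2

# Ratio of RMSE to MAE; values well above 1 point to some large errors.
rmse_mae_ratio = rmse / mae

print(round(mae_pct, 1), round(rmse_pct, 1), median_sq, round(rmse_mae_ratio, 2))
```

Unrounded, the two percentages are about 10.4% and 15.6%, consistent with the "about 10%" and "about 16%" figures in the text.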