## 46-886

## Tree-Based (Ensemble) Models: Random Forests and Boosted Trees

### Application: Predicting Click-Through Rate
 

Amr Farahat

CMU / Tepper

2023-03-25

---

## Setting up

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [3]:
pd.options.display.max_columns = None

## Importing and preparing data

In [4]:
df = pd.read_csv("CTR.csv")

In [5]:
df

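| | CTR | titleWords | adWords | depth | position | advCTR | queryCTR | queryCTRInPos |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0000 | 10 | 17 | 3 | 2 | 0.0136 | 0.0000 | 0.0000 |
| 1 | 0.0761 | 13 | 30 | 2 | 1 | 0.0373 | 0.0382 | 0.0581 |
| 2 | 0.0426 | 12 | 14 | 1 | 1 | 0.0254 | 0.0255 | 0.0323 |
| 3 | 0.0000 | 5 | 19 | 3 | 2 | 0.0178 | 0.0035 | 0.0017 |
| 4 | 0.0068 | 11 | 17 | 2 | 2 | 0.0096 | 0.0294 | 0.0171 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6052 | 0.0182 | 8 | 16 | 1 | 1 | 0.0273 | 0.0040 | 0.0000 |
| 6053 | 0.0536 | 6 | 20 | 1 | 1 | 0.0919 | 0.0000 | 0.0000 |
| 6054 | 0.0141 | 9 | 26 | 3 | 1 | 0.0496 | 0.0467 | 0.0668 |
| 6055 | 0.1339 | 9 | 16 | 2 | 1 | 0.0566 | 0.0618 | 0.1012 |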


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6057 entries, 0 to 6056
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   CTR            6057 non-null   float64
 1   titleWords     6057 non-null   int64  
 2   adWords        6057 non-null   int64  
 3   depth          6057 non-null   int64  
 4   position       6057 non-null   int64  
 5   advCTR         6057 non-null   float64
 6   queryCTR       6057 non-null   float64
 7   queryCTRInPos  6057 non-null   float64
dtypes: float64(4), int64(4)
memory usage: 378.7 KB


In [7]:
y = df["CTR"]
X = df.drop("CTR", axis=1)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=886)

## Random Forests

Scikit-learn reference: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor

### Bagging with default parameters

In [9]:
bag_mod = RandomForestRegressor(random_state=886)

In [10]:
bag_mod.fit(X_train, y_train)

RandomForestRegressor(random_state=886)

In [11]:
print("Bagging Train R2: ", bag_mod.score(X_train, y_train))
print("Bagging Test R2: ", bag_mod.score(X_test, y_test))

Bagging Train R2:  0.9269195230172717
Bagging Test R2:  0.44752654403515424
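
The gap between train and test R2 above suggests that the training score is an optimistic measure of fit. One hedged way to get a more honest estimate without touching the test set is the out-of-bag (OOB) score, which scikit-learn exposes through `oob_score=True`. The sketch below uses synthetic data from `make_regression` as a stand-in for `CTR.csv` (an assumption made only so the example is self-contained); the names `X_demo`, `oob_mod`, and so on are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CTR data (assumption: 7 numeric features)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=7, noise=20.0, random_state=886
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=886
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not include that row
oob_mod = RandomForestRegressor(oob_score=True, random_state=886)
oob_mod.fit(Xd_train, yd_train)

train_r2 = oob_mod.score(Xd_train, yd_train)
oob_r2 = oob_mod.oob_score_
test_r2 = oob_mod.score(Xd_test, yd_test)

print("Train R2:", train_r2)
print("OOB R2:  ", oob_r2)
print("Test R2: ", test_r2)
```

On the real data the same pattern would apply to `bag_mod` by passing `oob_score=True` when constructing it; the OOB score would typically sit much closer to the test R2 than the train R2 does.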


### Random Forest (with max_features = 'sqrt')

In [12]:
rf_mod = RandomForestRegressor(max_features = 'sqrt', random_state=886)

In [13]:
rf_mod.fit(X_train, y_train)

RandomForestRegressor(max_features='sqrt', random_state=886)

In [14]:
print("Random Forest Train R2: ", rf_mod.score(X_train, y_train))
print("Random Forest Test R2: ", rf_mod.score(X_test, y_test))

Random Forest Train R2:  0.9282979210998914
Random Forest Test R2:  0.47063271079922764


### Random Forest with cross validation on max_features

In [15]:
rf_mod.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 886,
 'verbose': 0,
 'warm_start': False}

In [16]:
param_grid = [
  {'max_features': [1, 2, 3, 4, 5, 6, 7]} 
]

In [17]:
cv_grid_rf = GridSearchCV(rf_mod, param_grid, scoring="r2")

In [18]:
cv_grid_rf.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestRegressor(max_features='sqrt',
                                             random_state=886),
             param_grid=[{'max_features': [1, 2, 3, 4, 5, 6, 7]}],
             scoring='r2')

In [19]:
pd.DataFrame(cv_grid_rf.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_features | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.242854 | 0.011702 | 0.013437 | 0.004215 | 1 | {'max_features': 1} | 0.51749 | 0.442261 | 0.456691 | 0.586539 | 0.542613 | 0.509119 | 0.053674 | 2 |
| 1 | 0.32642 | 0.004907 | 0.016312 | 0.001229 | 2 | {'max_features': 2} | 0.519816 | 0.44901 | 0.472707 | 0.585409 | 0.540243 | 0.513437 | 0.048475 | 1 |
| 2 | 0.433656 | 0.008984 | 0.016358 | 0.000876 | 3 | {'max_features': 3} | 0.506343 | 0.455218 | 0.450203 | 0.583606 | 0.538473 | 0.506769 | 0.05053 | 3 |
| 3 | 0.525117 | 0.006614 | 0.015684 | 6.9e-05 | 4 | {'max_features': 4} | 0.501471 | 0.466017 | 0.46489 | 0.575977 | 0.523602 | 0.506391 | 0.041268 | 4 |
| 4 | 0.618208 | 0.006166 | 0.015723 | 0.000146 | 5 | {'max_features': 5} | 0.499632 | 0.456333 | 0.454406 | 0.562154 | 0.528209 | 0.500147 | 0.04158 | 5 |
| 5 | 0.72347 | 0.007954 | 0.0147 | 0.002588 | 6 | {'max_features': 6} | 0.489973 | 0.448298 | 0.457233 | 0.562605 | 0.514675 | 0.494557 | 0.041447 | 6 |
| 6 | 0.822408 | 0.01931 | 0.015729 | 8.6e-05 | 7 | {'max_features': 7} | 0.495129 | 0.448165 | 0.444572 | 0.553033 | 0.51132 | 0.490444 | 0.040662 | 7 |
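
Rather than reading the winning row off the results table by eye, the fitted `GridSearchCV` object can report it directly. The sketch below shows the `best_params_`, `best_score_`, and `best_estimator_` attributes on a small synthetic problem; the data, the reduced `n_estimators=25`, and the names `demo_grid` and `top_row` are assumptions chosen to keep the example quick and self-contained, not values from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the CTR training data (assumption: 7 features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, noise=20.0, random_state=886
)

demo_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=886),
    param_grid={"max_features": [1, 2, 3, 4, 5, 6, 7]},
    scoring="r2",
)
demo_grid.fit(X_demo, y_demo)

# The top-ranked row of cv_results_ carries the same mean score as best_score_
results = pd.DataFrame(demo_grid.cv_results_)
top_row = results.sort_values("rank_test_score").iloc[0]

print("Best parameters:", demo_grid.best_params_)
print("Best mean CV R2:", demo_grid.best_score_)

# With the default refit=True, best_estimator_ is already refit
# on all the data passed to fit(), using the best parameters
best_forest = demo_grid.best_estimator_
```

Applied to the notebook's own search, `cv_grid_rf.best_params_` would be expected to match the rank-1 row above (`max_features` of 2), and `cv_grid_rf.best_estimator_` could be scored on the test set without refitting by hand.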


In [20]:
rf_mod_cvfinal = RandomForestRegressor(max_features=2, random_state=886)
rf_mod_cvfinal.fit(X_train, y_train)

RandomForestRegressor(max_features=2, random_state=886)

In [21]:
print("Random Forest with CV train R2: ", rf_mod_cvfinal.score(X_train, y_train))
print("Random Forest with CV test R2: ", rf_mod_cvfinal.score(X_test, y_test))

Random Forest with CV train R2:  0.9282979210998914
Random Forest with CV test R2:  0.47063271079922764
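
These scores are identical to the `max_features='sqrt'` results because, with 7 features, the square-root rule resolves to 2 features per split. To my understanding scikit-learn computes this as `max(1, int(sqrt(n_features)))`. The sketch below checks that equivalence on synthetic data; the data and the names `sqrt_forest` and `int_forest` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with the same number of features as the CTR data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, noise=20.0, random_state=886
)

n_features = X_demo.shape[1]
resolved = max(1, int(np.sqrt(n_features)))  # sqrt(7) is about 2.65, truncated

sqrt_forest = RandomForestRegressor(
    max_features="sqrt", n_estimators=25, random_state=886
).fit(X_demo, y_demo)
int_forest = RandomForestRegressor(
    max_features=resolved, n_estimators=25, random_state=886
).fit(X_demo, y_demo)

# Same seed and same effective max_features should give the same forest
same_predictions = np.allclose(
    sqrt_forest.predict(X_demo), int_forest.predict(X_demo)
)

print("Resolved max_features:", resolved)
print("Per-tree max_features_:", sqrt_forest.estimators_[0].max_features_)
print("Identical predictions:", same_predictions)
```

So the cross-validated choice here confirms the `'sqrt'` heuristic rather than improving on it.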


## Boosted Trees

Scikit-learn reference: 

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn-ensemble-gradientboostingregressor

### Default parameters

In [22]:
gb_mod_0 = GradientBoostingRegressor(random_state=886)

In [23]:
gb_mod_0.fit(X_train, y_train)

GradientBoostingRegressor(random_state=886)

In [24]:
print("Gradient Boosting (default parameters) Train R2: ", gb_mod_0.score(X_train, y_train))
print("Gradient Boosting (default parameters) Test R2: ", gb_mod_0.score(X_test, y_test))

Gradient Boosting (default parameters) Train R2:  0.6057677224441272
Gradient Boosting (default parameters) Test R2:  0.43179710632316104


### Varying the learning rate parameter

In [25]:
# learning_rate=1 is ten times the default of 0.1, so each tree's correction is applied in full
gb_mod_1 = GradientBoostingRegressor(learning_rate=1, n_estimators=100, random_state=886)

In [26]:
gb_mod_1.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=1, random_state=886)

In [27]:
print("Gradient Boosting (faster learning) Train R2: ", gb_mod_1.score(X_train, y_train))
print("Gradient Boosting (faster learning) Test R2: ", gb_mod_1.score(X_test, y_test))

Gradient Boosting (faster learning) Train R2:  0.8201552703947009
Gradient Boosting (faster learning) Test R2:  0.1902708255698986
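
Raising the learning rate from the default 0.1 to 1 improves the train R2 but sharply lowers the test R2, which is consistent with overfitting. One way to explore how the learning rate interacts with the number of trees is `staged_predict`, which yields predictions after each boosting stage from a single fitted model. The following is a minimal sketch on synthetic data (an assumption, since `CTR.csv` is not bundled here); `curves`, `final_r2`, and `best_n` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CTR data (assumption: 7 numeric features)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=7, noise=20.0, random_state=886
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=886
)

curves = {}    # learning rate -> test R2 after each boosting stage
final_r2 = {}  # learning rate -> test R2 of the full model

for rate in (0.1, 1.0):
    gb = GradientBoostingRegressor(
        learning_rate=rate, n_estimators=100, random_state=886
    )
    gb.fit(Xd_train, yd_train)

    # staged_predict yields test predictions after 1, 2, ..., 100 trees
    curves[rate] = [r2_score(yd_test, pred) for pred in gb.staged_predict(Xd_test)]
    final_r2[rate] = gb.score(Xd_test, yd_test)

    best_n = int(np.argmax(curves[rate])) + 1
    print(f"learning_rate={rate}: best test R2 {max(curves[rate]):.3f} at {best_n} trees")
```

Note that picking the number of trees from test-set curves would leak test information into model selection; on the real data a validation split or cross-validation over `learning_rate` and `n_estimators` would be the safer choice.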


### END