In this worksheet, I run three different tree models. To allow a direct comparison, all three use the same train-test split.

The three models are (1) a simple decision tree, (2) Random Forest, and (3) Extreme Gradient Boosting using XGBoost.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('DataLinkageML.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.drop('StudentNumber', axis=1, inplace=True)
df.head()

Unnamed: 0,DPS_HomeLg,CYI_Lat,CYI_Deg,Disability,GT_C,FRL_C,Sect504_C,SPED_C,t_grade,time_t,time_t1
0,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,4,414,220
1,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,5,220,260
2,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,6,260,503
3,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,7,503,537
4,1.0,1.0,1.0,0.0,0.0,1.0,0.1,0.8,8,537,559


In [3]:
X = df.drop('time_t1', axis=1)
colnames = X.columns.tolist()
X = StandardScaler().fit_transform(X)
y = df['time_t1']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)
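
The split above is shared by all three models within one run, but it will differ from run to run because no seed is set. A minimal sketch of how a fixed seed makes the split repeatable, using small made-up arrays (X_demo, y_demo, and the seed value 42 are illustrative assumptions, not part of this worksheet):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny stand-in arrays (assumption), only to show the behaviour of random_state.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same arbitrary seed produce identical partitions.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Adding a seed to the worksheet's own split would change which rows land in the test set, so the R^2 values reported below would shift somewhat.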



The first is just a simple decision tree regressor. From previous work, I know I get the best results for the test sample with max_depth = 3.

In [4]:
regressor_dt = DecisionTreeRegressor(max_depth = 3)
regressor_dt.fit(X_train, y_train)

y_pred_train_dt = regressor_dt.predict(X_train)
y_pred_test_dt = regressor_dt.predict(X_test)

# Built-in round() works whether r2_score returns a NumPy or a plain Python float
r2train = round(r2_score(y_train, y_pred_train_dt), 3)
r2test = round(r2_score(y_test, y_pred_test_dt), 3)

print('Decision Tree Train R^2:', r2train)
print('Decision Tree Test R^2:', r2test)


Decision Tree Train R^2: 0.751
Decision Tree Test R^2: 0.686


This regressor model hit an R^2 value of 0.751 on the training data, and 0.686 on the test data. 
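
The earlier work that led to max_depth = 3 is not shown here. One way such a comparison could be run is sketched below on synthetic data (the data, the seeds, and the list of depths are all illustrative assumptions): fit one tree per depth and compare train and test R^2.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic stand-in data (assumption), not the worksheet's dataset.
X_demo = rng.normal(size=(400, 5))
y_demo = 3.0 * X_demo[:, 0] + np.abs(X_demo[:, 1]) + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

depths = [1, 2, 3, 5, 8, 12]
train_scores, test_scores = [], []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_scores.append(r2_score(y_tr, tree.predict(X_tr)))
    test_scores.append(r2_score(y_te, tree.predict(X_te)))

# Training R^2 keeps rising with depth; the test column shows where it stops helping.
for depth, tr, te in zip(depths, train_scores, test_scores):
    print(f"max_depth={depth:2d}  train R^2={tr:.3f}  test R^2={te:.3f}")
```

Choosing the depth from test scores like this uses the test set for tuning, so a cross-validated search on the training data alone is usually the safer variant.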

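Next is Random Forest, a bagging method that fits a number of decision trees, each on a bootstrap sample of the training data, and then averages the predictions of those trees. Here each tree is kept shallow (max_depth = 3) to match the single decision tree above.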

In [5]:
regressor_rf = RandomForestRegressor(n_estimators = 100, max_depth = 3) 
regressor_rf.fit(X_train,y_train)

y_pred_train_rf = regressor_rf.predict(X_train)
y_pred_test_rf = regressor_rf.predict(X_test)

# Built-in round() works whether r2_score returns a NumPy or a plain Python float
r2train = round(r2_score(y_train, y_pred_train_rf), 3)
r2test = round(r2_score(y_test, y_pred_test_rf), 3)

print('Random Forest Train R^2:', r2train)
print('Random Forest Test R^2:', r2test)

Random Forest Train R^2: 0.766
Random Forest Test R^2: 0.702


This regressor model hit an R^2 of 0.766 on the training data and 0.702 on the test data. That's better than the result from a single decision tree.
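
To make the bagging idea concrete, here is a minimal sketch on synthetic data. The data, the seed, and the helper name bagged_predict are illustrative assumptions, not part of this worksheet. It fits shallow trees on bootstrap resamples and averages their predictions, which is roughly the core of what RandomForestRegressor does, although the real implementation has more options.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 rows, 3 features, noisy linear target.
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)


def bagged_predict(X_fit, y_fit, X_new, n_trees=25, max_depth=3):
    """Average the predictions of trees fit on bootstrap resamples (hypothetical helper)."""
    n = len(X_fit)
    all_preds = []
    for i in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=i)
        tree.fit(X_fit[idx], y_fit[idx])
        all_preds.append(tree.predict(X_new))
    # The ensemble prediction is the plain mean across trees.
    return np.mean(all_preds, axis=0)


bagged = bagged_predict(X_demo, y_demo, X_demo)
print(bagged.shape)
```

Averaging over resampled trees mainly reduces variance, which is one plausible reason the forest edges out the single tree on the test data here.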

Finally, we have XGBoost, or Extreme Gradient Boosting. This also generates a number of decision trees, but each successive tree builds on the previous ones. Conceptually, the model starts from a constant prediction, such as the mean of the target for all cases. Each subsequent tree is then fit to the errors (residuals) left by the ensemble so far rather than to the target itself. The final prediction is the sum of all of these pieces: the starting value plus the adjustments made by every subsequent tree, each scaled by the learning rate.
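
The residual-fitting loop described above can be sketched by hand with plain scikit-learn trees. This is a simplified illustration of gradient boosting for squared error, not XGBoost's actual algorithm (which also uses second-order gradient information and regularization). The synthetic data, learning_rate, and n_rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption), not the worksheet's dataset.
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.3  # shrinkage applied to each tree's correction (illustrative value)
n_rounds = 50

# Step 0: a constant prediction, the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
mse_start = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)  # each new tree models the residuals
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

mse_end = np.mean((y_demo - prediction) ** 2)
print(mse_start, mse_end)
```

Because every round chases the remaining training error, the training fit keeps improving as rounds are added, which is why boosted models are prone to fitting noise unless they are regularized.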

In [6]:
regressor_xgb = xgb.XGBRegressor(objective = 'reg:squarederror', gamma = 0)
regressor_xgb.fit(X_train, y_train)

y_pred_train_xgb = regressor_xgb.predict(X_train)
y_pred_test_xgb = regressor_xgb.predict(X_test)

# Built-in round() works whether r2_score returns a NumPy or a plain Python float
r2train = round(r2_score(y_train, y_pred_train_xgb), 3)
r2test = round(r2_score(y_test, y_pred_test_xgb), 3)

print('XGBoost Train R^2:', r2train)
print('XGBoost Test R^2:', r2test)

XGBoost Train R^2: 0.873
XGBoost Test R^2: 0.698




This model hit an R^2 of 0.873 on the training data and 0.698 on the test data. The test score is slightly below the Random Forest result (0.702), but in the same ballpark. The much larger gap between training and test R^2 suggests overfitting, which is a common problem for gradient boosted trees in general.

I want to play around a little with the regularization parameters. Those are gamma, reg_alpha, and reg_lambda. Gamma is the minimum loss reduction required to split a leaf node further, and its default is 0. Raising gamma makes the model more conservative: splits that improve the loss by less than gamma are not made, so leaf nodes stay less pure, which can prevent overfitting. Reg_alpha and reg_lambda control the L1 and L2 regularization terms on the leaf weights, respectively. Given the small number of predictors, I'm not really interested in LASSO-style (L1) or Ridge-style (L2) penalties, so I'll focus on tuning gamma. Since gamma indirectly limits how deep the trees grow, let's play around with different values of max_depth at the same time.
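
To see how gamma acts as a threshold, here is a small sketch of the regularized split gain for squared-error loss, following the formula given in the XGBoost documentation's introduction to boosted trees. The function name split_gain and the gradient sums are hypothetical, and the library's internal scaling of the gain may differ from this textbook form.

```python
def split_gain(grad_left, grad_right, n_left, n_right, reg_lambda=1.0, gamma=0.0):
    """Regularized gain of a candidate split for squared-error loss (hypothetical helper).

    For squared error each row has hessian 1, so the hessian sums are
    simply the row counts on each side of the split.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(grad_left + grad_right, n_left + n_right)
    children = score(grad_left, n_left) + score(grad_right, n_right)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient sums for one candidate split (illustrative numbers).
gain_no_penalty = split_gain(-4.0, 6.0, 10, 10, gamma=0.0)
gain_penalized = split_gain(-4.0, 6.0, 10, 10, gamma=3.0)

# A split is only worth making when its gain is positive.
print(gain_no_penalty, gain_penalized)
```

With gamma = 0 this split has a positive gain and would be kept; with gamma = 3 the same split has a negative gain and would be rejected. Because the gain is measured in units of the loss, how large gamma has to be before it changes any split depends on the scale of the target.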

In [7]:
param_test = {'gamma':[0,.05,.1,.15,.2,.25,.3], 'max_depth':[3,6,9,12]}
gsearch = GridSearchCV(estimator = regressor_xgb, param_grid = param_test)
gsearch.fit(X_train, y_train)
gsearch.best_params_, gsearch.best_score_


({'gamma': 0, 'max_depth': 3}, 0.6901535955824847)

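It's both fortunate and unsatisfying that the values the grid search picked, gamma = 0 and max_depth = 3, are the same as the defaults used in the earlier XGBoost fit, and so there's nothing to tune. The reported best score (0.690) is a mean cross-validated R^2 computed on the training data, so it is not directly comparable to the test R^2 values above.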
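
Since best_params_ only reports the winner, it can help to look at how close the other grid points were. A minimal sketch using the cv_results_ attribute of GridSearchCV, with a plain decision tree and synthetic data standing in for the XGBoost model and the worksheet's data (both are assumptions made to keep the example self-contained):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in data (assumption), not the worksheet's dataset.
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=150)

# Stand-in estimator and grid (assumption); any fitted GridSearchCV object
# exposes the same attributes.
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 8]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Pair every parameter setting with its mean and spread across folds.
results = sorted(
    zip(
        search.cv_results_["mean_test_score"],
        search.cv_results_["std_test_score"],
        search.cv_results_["params"],
    ),
    key=lambda row: row[0],
    reverse=True,
)
for mean_score, std_score, params in results:
    print(params, round(mean_score, 3), "+/-", round(std_score, 3))
```

If the top few settings differ by less than the fold-to-fold spread, the grid is effectively flat in that region and the choice among them matters little.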