# Boosted Regression Tree training 
- Author: Mako Shibata (s2471259@ed.ac.uk) 
- Date: 25/06/2024
- Aim: Apply and understand boosted regression tree methods
- Tutorial: https://lost-stats.github.io/Machine_Learning/boosted_regression_trees.html

Boosted regression trees are a supervised machine learning method that trains many weak learners and combines them into one model. Training adjusts each learner's influence on the final prediction, which reduces the uncertainty of the outcome. 
Some characteristics include: 

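- Boosted trees are built sequentially, so each tree can use information from the trees before it (it is fitted to the errors they leave behind). In a Random Forest the trees are built independently of one another. 
- Each weak learner is a simple (shallow) decision tree. 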
- This technique of combining multiple predictors is called "ensembling". 
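
To make the "passing information between trees" idea concrete, here is a minimal sketch of boosting written by hand for squared-error loss. It is an illustration, not scikit-learn's actual implementation: the data sizes, the seed, `n_trees` and the variable names are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic data; sizes, noise level and seed are illustrative assumptions
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1  # how much each tree contributes
n_trees = 50         # number of weak learners (hypothetical choice)

# start from a constant prediction: the mean of the targets
prediction = np.full(y.shape, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                     # the new tree learns from earlier errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)  # training error before and after boosting
```

Each tree is fitted to the residuals of the ensemble so far, which is how later trees benefit from earlier ones. A Random Forest would instead fit every tree to (a resample of) the original targets.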


In [11]:
# requires scikit-learn (e.g. installed with conda: conda install scikit-learn)
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.model_selection import train_test_split

# generate some synthetic data (defaults: 100 samples, 100 features)
X, y = make_regression()

print(X)
print(y)

[[-5.52454344e-02  3.04238144e-01  1.80231839e+00 ... -1.09861210e+00
   8.71592987e-01  1.88425086e+00]
 [-8.17702999e-01 -4.52334011e-01 -6.36446997e-01 ...  5.99830559e-01
   1.86178052e-03 -1.20737347e+00]
 [-5.21170283e-01 -1.01484865e+00 -1.58060160e-02 ...  1.52341773e+00
  -1.37689542e+00  2.00041358e+00]
 ...
 [-9.91916212e-02  6.21172040e-01  6.42234196e-01 ...  1.69870244e+00
  -8.13363678e-01  2.96839626e-01]
 [ 1.35476426e+00  3.07950860e-01 -8.39196289e-01 ... -9.93897092e-01
  -4.45705337e-01 -6.38623791e-01]
 [-7.43009922e-01  6.27902737e-03 -1.59169719e+00 ...  1.35726420e+00
   1.29039993e+00  6.09259688e-01]]
[  28.4533175    10.7089508    48.20977895 -214.81670566 -361.64449262
 -607.15892256   88.63701914  -22.733863    108.31950061   13.79200749
  -67.47596873  264.15290617 -202.45986749  -54.22542142 -224.07035996
  224.15819619 -104.74514213  169.02886463 -209.26685949 -214.43632466
   23.34140176  -85.89736812 -301.45642538  102.5359325   -82.67793307
 -452.291

In [23]:
# split synthetic data into train and test arrays
X_train, X_test, y_train, y_test = train_test_split(X, y)

# X: input features, y: true target values
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[-2.21362212  0.7419133  -0.662548   ... -0.86681467 -0.93700379
  -0.07720543]
 [ 3.30664159 -0.94773874 -0.54197581 ... -0.23229528 -1.78662322
   1.09848952]
 [-0.86824725 -0.35477441  1.66037585 ...  3.49722971 -0.61980381
  -1.23332267]
 ...
 [-1.19984504  0.87254857 -0.09118974 ... -0.70901104 -0.59112535
   0.06460512]
 [ 1.07526607  0.50101922 -1.49543418 ...  1.18775056  0.10510677
  -0.65712335]
 [-0.48604855  2.10984095  0.39428335 ... -0.233032   -0.44325019
   0.36186579]]
[[ 0.2065969  -0.52855856  0.49173243 ... -0.1190151  -0.9976995
  -0.40198868]
 [ 0.27391653  0.03288849  0.46155937 ... -0.45803822  0.5474452
  -0.68471199]
 [ 0.34777477 -0.77001906 -0.38582974 ...  1.84877211  0.14280116
   2.16250197]
 ...
 [-0.67089464 -0.36941866 -0.74938349 ...  1.44818809  0.94807832
  -0.97554056]
 [ 0.17705896  1.29653456 -0.39890742 ...  2.66007333  0.4835973
   0.12850823]
 [ 0.67278084  2.2169736  -0.06263591 ...  0.51205328 -0.86854421
   0.70678252]]
[-249.51380384 -470

In [18]:
# set up the regression model
# the parameters should be tuned for each problem

reg = GradientBoostingRegressor(n_estimators = 100, # the number of trees (boosting stages)
                               max_depth = 3, # the maximum depth of each tree. Deeper trees fit more detail but risk overfitting
                               learning_rate = 0.1, # shrinks the contribution of each tree to the final prediction
                               min_samples_split = 3 # minimum number of samples a node needs before it can be split
                               )
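
One way to experiment with these parameters is a small cross-validated grid search. The sketch below is only a starting point: the grid values, `cv=3`, and the separate demo data (`X_demo`, `y_demo`) are assumptions made so that the example runs on its own, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# separate demo data so the example is self-contained (sizes and seed are assumptions)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# candidate values to try; a deliberately small, illustrative grid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# fit every combination with 3-fold cross-validation, scored by R^2
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the best combination found on this grid
print(search.best_score_)   # its mean cross-validated R^2
```
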

In [19]:
# fit the model ("fit" means to train it on the training data)
reg.fit(X_train, y_train)

In [20]:
# predict the value of the first test case
reg.predict(X_test[:1])

array([27.38868032])

In [22]:
# R^2 score for the model (on the test data)
reg.score(X_test, y_test) # model "reg" predicts 'X_test', and compares these values with the actual values "y_test"

0.30845894718563294
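
For reference, the R^2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares on the data being scored. The sketch below recomputes it by hand and compares it with `score`. It uses its own data, seeds, and names (`X_demo`, `model`, `r2_manual`), which are assumptions for the example, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# self-contained demo data with fixed seeds (assumptions for reproducibility)
X_demo, y_demo = make_regression(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

ss_res = np.sum((y_te - y_pred) ** 2)       # squared error of the model
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_te, y_te))   # the two values should agree
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean. With the defaults of `make_regression` (100 samples, 100 features, of which 10 are informative) there are few samples per feature, which may partly explain a modest test score.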