## 3.2.4 Gradient Boosted Decision Trees (GBDT)

- Author: Phanxuan Phuc
- Project: https://github.com/phanxuanphucnd/TensorFlow-2.0-Tutorial


### Dataset

- Boston Housing Dataset: [Reference](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)

- Description:

    The dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

    The data was originally published by Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

    *For the full features list, please see the link above*

In [1]:
import os
import copy
import numpy as np
import tensorflow as tf

# Ignore all GPUs because the current TF GBDT doesn't support GPU
os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

2021-07-07 01:29:40.669440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-07 01:29:40.669466: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
# Dataset parameters
num_classes = 2    # Total classes: price at least $23,000, or below
num_features = 13  # Data features size

# Training parameters
max_steps = 2000
batch_size = 256
learning_rate = 1.0
l1_regular = 0.0
l2_regular = 0.1

# GBDT parameters
num_batches_per_layer = 1000
num_trees = 10
max_depth = 4

In [3]:
# Prepare Boston Housing Dataset

from tensorflow.keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()


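# For classification purposes, we build 2 classes: price at least $23,000 (label 1)
# or below $23,000 (label 0). Targets are in thousands of dollars, hence 23.0.
def to_binary_class(y):
    # Overwrites y in place, so callers pass a copy to keep the original prices
    for i, label in enumerate(y):
        if label >= 23.0:
            y[i] = 1
        else:
            y[i] = 0

    return y

# Copy first: the unmodified y_train and y_test are reused by the regressor later
y_train_binary = to_binary_class(copy.deepcopy(y_train))
y_test_binary = to_binary_class(copy.deepcopy(y_test))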

### GBDT Classifier

In [4]:
# Build the input functions

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_train}, y=y_train_binary,
    batch_size=batch_size, num_epochs=None, shuffle=True
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_test}, y=y_test_binary,
    batch_size=batch_size, num_epochs=1, shuffle=False
)

test_train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_train}, y=y_train_binary,
    batch_size=batch_size, num_epochs=1, shuffle=False
)

# GBDT models from TF Estimator require the `feature_column` data format
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(num_features, ))]





In [5]:
gbdt_classifier = tf.estimator.BoostedTreesClassifier(
    n_batches_per_layer=num_batches_per_layer,
    feature_columns=feature_columns,
    n_classes=num_classes,
    learning_rate=learning_rate,
    n_trees=num_trees,
    max_depth=max_depth,
    l1_regularization=l1_regular,
    l2_regularization=l2_regular
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpuymq6v22', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [6]:
gbdt_classifier.train(
    train_input_fn,
    max_steps=max_steps
)

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.


2021-07-07 01:29:42.099823: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-07 01:29:42.099851: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-07 01:29:42.099872: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (phucphan-ThinkPad): /proc/driver/nvidia/version does not exist


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.


Exception ignored in: <function CapturableResource.__del__ at 0x7f493e1039d8>
Traceback (most recent call last):
  File "/home/phucphan/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py", line 269, in __del__
    with self._destruction_context():
AttributeError: 'TreeEnsemble' object has no attribute '_destruction_context'
2021-07-07 01:29:42.494282: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-07 01:29:42.535832: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 1999965000 Hz


INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpuymq6v22/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.6931473, step = 0
INFO:tensorflow:loss = 0.6931473, step = 0 (0.385 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.105 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.168 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.192 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.192 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.175 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.163 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.162 sec)
INFO:tensorflow:loss = 0.6931473, step = 0 (0.152 sec)
INFO:tensorflow:loss = 0.

<tensorflow_estimator.python.estimator.canned.boosted_trees.BoostedTreesClassifier at 0x7f4933ac0518>
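
The starting loss of 0.6931473 in the log above is consistent with a model that has not grown any trees yet: with a logit of 0 every example gets a predicted probability of 0.5, and binary cross-entropy then equals ln 2 whatever the label. A minimal sketch of that arithmetic using only the standard library (the helper name `binary_cross_entropy` is illustrative and not part of TensorFlow):

```python
import math

def binary_cross_entropy(label, prob):
    """Cross-entropy of one example for a predicted probability of class 1."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# With a logit of 0 the predicted probability is sigmoid(0) = 0.5,
# so the loss is identical for a positive and a negative example.
prob = 1.0 / (1.0 + math.exp(-0.0))
loss_pos = binary_cross_entropy(1, prob)
loss_neg = binary_cross_entropy(0, prob)

print(loss_pos, loss_neg, math.log(2))
```

Any later loss value can be read against this ln 2 starting point.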

In [7]:
gbdt_classifier.evaluate(test_train_input_fn)

INFO:tensorflow:Calling model_fn.
Instructions for updating:
The value of AUC returned by this may race with the update so this is deprecated. Please use tf.keras.metrics.AUC instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-07-07T01:29:48
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpuymq6v22/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Exception ignored in: <function CapturableResource.__del__ at 0x7f493e1039d8>
Traceback (most recent call last):
  File "/home/phucphan/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py", line 269, in __del__
    with self._destruction_context():
AttributeError: 'TreeEnsemble' object has no attribute '_destruction_context'


INFO:tensorflow:Inference Time : 0.47890s
INFO:tensorflow:Finished evaluation at 2021-07-07-01:29:48
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.87376237, accuracy_baseline = 0.63118815, auc = 0.92280567, auc_precision_recall = 0.9104949, average_loss = 0.38152993, global_step = 2000, label/mean = 0.36881188, loss = 0.38536403, precision = 0.8888889, prediction/mean = 0.37860456, recall = 0.7516779
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmpuymq6v22/model.ckpt-2000


{'accuracy': 0.87376237,
 'accuracy_baseline': 0.63118815,
 'auc': 0.92280567,
 'auc_precision_recall': 0.9104949,
 'average_loss': 0.38152993,
 'label/mean': 0.36881188,
 'loss': 0.38536403,
 'precision': 0.8888889,
 'prediction/mean': 0.37860456,
 'recall': 0.7516779,
 'global_step': 2000}
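
The `accuracy_baseline` entry above is the accuracy of always predicting the more frequent class, so it can be recovered from `label/mean` alone. A small sketch of that check (the function name `majority_baseline` is hypothetical):

```python
def majority_baseline(label_mean):
    """Accuracy of always predicting the more frequent of two classes."""
    return max(label_mean, 1.0 - label_mean)

# label/mean copied from the training-set evaluation above.
train_baseline = majority_baseline(0.36881188)

# Agrees with the reported accuracy_baseline of 0.63118815 up to float32 rounding.
print(round(train_baseline, 4))
```

The training accuracy of about 0.874 is therefore best compared with this baseline of about 0.631 rather than with 0.5.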

In [8]:
gbdt_classifier.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-07-07T01:29:49
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpuymq6v22/model.ckpt-2000


Exception ignored in: <function CapturableResource.__del__ at 0x7f493e1039d8>
Traceback (most recent call last):
  File "/home/phucphan/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py", line 269, in __del__
    with self._destruction_context():
AttributeError: 'TreeEnsemble' object has no attribute '_destruction_context'


INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.42627s
INFO:tensorflow:Finished evaluation at 2021-07-07-01:29:49
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.78431374, accuracy_baseline = 0.5588235, auc = 0.8458089, auc_precision_recall = 0.8628531, average_loss = 0.4937335, global_step = 2000, label/mean = 0.44117647, loss = 0.4937335, precision = 0.87096775, prediction/mean = 0.37429, recall = 0.6
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmpuymq6v22/model.ckpt-2000


{'accuracy': 0.78431374,
 'accuracy_baseline': 0.5588235,
 'auc': 0.8458089,
 'auc_precision_recall': 0.8628531,
 'average_loss': 0.4937335,
 'label/mean': 0.44117647,
 'loss': 0.4937335,
 'precision': 0.87096775,
 'prediction/mean': 0.37429,
 'recall': 0.6,
 'global_step': 2000}
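
The aggregate metrics above are enough to reconstruct the test-set confusion matrix, which makes the precision/recall trade-off easier to see. The sketch below assumes the test set has 102 examples (the default split of the Keras loader, which the output itself does not state); the helper `confusion_counts` is hypothetical.

```python
def confusion_counts(n, label_mean, accuracy, precision, recall):
    """Recover (TP, FP, FN, TN) from aggregate binary-classification metrics."""
    positives = round(label_mean * n)      # examples whose true label is 1
    tp = round(recall * positives)         # recall = TP / (TP + FN)
    fp = round(tp / precision) - tp        # precision = TP / (TP + FP)
    fn = positives - tp
    tn = round(accuracy * n) - tp          # accuracy = (TP + TN) / n
    return tp, fp, fn, tn

# Metrics copied from the test-set evaluation; n=102 is an assumption.
tp, fp, fn, tn = confusion_counts(
    n=102, label_mean=0.44117647, accuracy=0.78431374,
    precision=0.87096775, recall=0.6,
)
print(tp, fp, fn, tn)  # prints: 27 4 18 53
```

Under that assumption the four counts add up to 102, and the 18 false negatives against 4 false positives show that recall, not precision, is what limits test accuracy here.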

### GBDT Regressor

In [9]:
# Build the input function.

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_train}, y=y_train,
    batch_size=batch_size, num_epochs=None, shuffle=True
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': x_test}, y=y_test,
    batch_size=batch_size, num_epochs=1, shuffle=False
)

# GBDT models from TF Estimator require the `feature_column` data format
feature_columns = [tf.feature_column.numeric_column(key='x', shape=(num_features,))]

In [10]:
gbdt_regressor = tf.estimator.BoostedTreesRegressor(
    n_batches_per_layer=num_batches_per_layer,
    feature_columns=feature_columns, 
    learning_rate=learning_rate, 
    n_trees=num_trees,
    max_depth=max_depth,
    l1_regularization=l1_regular, 
    l2_regularization=l2_regular
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpwq46lp_b', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [11]:
gbdt_regressor.train(train_input_fn, max_steps=max_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Exception ignored in: <function CapturableResource.__del__ at 0x7f493e1039d8>
Traceback (most recent call last):
  File "/home/phucphan/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py", line 269, in __del__
    with self._destruction_context():
AttributeError: 'TreeEnsemble' object has no attribute '_destruction_context'


'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpwq46lp_b/model.ckpt.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 569.7765, step = 0
INFO:tensorflow:loss = 523.4455, step = 0 (0.247 sec)
INFO:tensorflow:loss = 632.2858, step = 0 (0.114 sec)
INFO:tensorflow:loss = 620.88464, step = 0 (0.121 sec)
INFO:tensorflow:loss = 582.4016, step = 0 (0.116 sec)
INFO:tensorflow:loss = 568.84973, step = 0 (0.119 sec)
INFO:tensorflow:loss = 628.28766, step = 0 (0.155 sec)
INFO:tensorflow:loss = 596.1156, step = 0 (0.138 sec)
INFO:tensorflow:loss = 538.71844, step = 0 (0.121 sec)
INFO:tensorflow:loss = 552.3838, step = 0 (0.120 sec)
INFO:tensorflow:loss = 564.7405, step = 0 (0.119 sec)
INFO:tensorflow:global_step/sec: 62.3676
INFO:tensorflow:loss = 588.93726, step = 100 (0.235 sec)
INFO:t

<tensorflow_estimator.python.estimator.canned.boosted_trees.BoostedTreesRegressor at 0x7f4912ac6e80>

In [12]:
gbdt_regressor.evaluate(test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-07-07T01:31:29
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpwq46lp_b/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Exception ignored in: <function CapturableResource.__del__ at 0x7f493e1039d8>
Traceback (most recent call last):
  File "/home/phucphan/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py", line 269, in __del__
    with self._destruction_context():
AttributeError: 'TreeEnsemble' object has no attribute '_destruction_context'


INFO:tensorflow:Inference Time : 0.19255s
INFO:tensorflow:Finished evaluation at 2021-07-07-01:31:29
INFO:tensorflow:Saving dict for global step 2000: average_loss = 29.69382, global_step = 2000, label/mean = 23.078432, loss = 29.69382, prediction/mean = 22.49272
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: /tmp/tmpwq46lp_b/model.ckpt-2000


{'average_loss': 29.69382,
 'label/mean': 23.078432,
 'loss': 29.69382,
 'prediction/mean': 22.49272,
 'global_step': 2000}
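
Assuming the regressor uses its default squared-error loss, `average_loss` is a mean squared error in squared thousands of dollars. Taking the square root gives an error in the same units as the target, which is easier to interpret. A quick sketch of the conversion, with both numbers copied from the output above:

```python
import math

average_loss = 29.69382   # mean squared error from the evaluation output
label_mean = 23.078432    # mean test-set price in $1000s, from the same output

# Root mean squared error, in thousands of dollars.
rmse = math.sqrt(average_loss)

# Error relative to the mean price, as a percentage.
relative_error = 100 * rmse / label_mean

print(f"RMSE: {rmse:.2f} thousand dollars ({relative_error:.1f}% of the mean price)")
```

The result is roughly 5.45, that is, a typical error of about $5,450 on homes whose mean price is about $23,100.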