# XGBoost Regression with TensorFlow Pooling and Loss
## Tutorial on early stopping
This tutorial demonstrates how to use the `xgb_tf_metric` decorator for early stopping. For a more comprehensive introduction to the `tf2xgb` library, see [this example](example.ipynb).

In [1]:
import numpy as np
import xgboost as xgb
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

from tf2xgb import get_ragged_nested_index_lists, gen_random_dataset, xgb_tf_loss, xgb_tf_metric
from sklearn.metrics import mean_squared_error


In [2]:
N = 100000
N_SUBGRP = N//2
N_GRP = 0 # we will use only one level of pooling in this tutorial
BETA_TRUE = [2,1,0,0,0]
SIGMA = 1

In [3]:
# Main data frame with features X, subgroup IDs subgrp_id and group IDs grp_id
# (grp_id is absent here because N_GRP = 0).
# The target y is NOT observable on the individual level in real data;
# we keep it here so that we can simulate the target on the group level
# and compare the estimate based on the group-level target
# with the estimate on the individual level.
df_train = gen_random_dataset(N, N_SUBGRP, N_GRP, BETA_TRUE, SIGMA)
df_val = gen_random_dataset(N, N_SUBGRP, N_GRP, BETA_TRUE, SIGMA)

In [4]:
df_train.head()

| | `_row_` | `X` | `y` | `subgrp_id` |
|---|---|---|---|---|
| 0 | 0 | [0.990320213665162, -0.9073304936192282, 1.435...] | 0.702018 | SUBGRP0 |
| 1 | 1 | [1.0007154451546965, 0.11364963136298611, 0.10...] | 2.432624 | SUBGRP1 |
| 2 | 2 | [-0.9489771875967585, 0.5654649942882454, -1.3...] | -0.710235 | SUBGRP2 |
| 3 | 3 | [1.4588152857051915, -0.6525509713800075, -0.1...] | 4.482466 | SUBGRP3 |
| 4 | 4 | [0.007055279881034012, -1.7218724585797522, -0...] | -0.695231 | SUBGRP4 |


In [5]:
X_train = np.asarray(df_train['X'].to_list())
y_train = np.asarray(df_train['y'].to_list())
X_val = np.asarray(df_val['X'].to_list())
y_val = np.asarray(df_val['y'].to_list())

Calculate the simulated target `y` at the `subgrp_id` level by max pooling the individual-level `y` values.

In [6]:
df_train_subgrp_y = (df_train
    .groupby('subgrp_id')
    .agg({'y': 'max'})
    .reset_index()
)
df_train_subgrp_inds = get_ragged_nested_index_lists(df_train, ['subgrp_id'])

In [7]:
df_val_subgrp_y = (df_val
    .groupby('subgrp_id')
    .agg({'y': 'max'})
    .reset_index()
)
df_val_subgrp_inds = get_ragged_nested_index_lists(df_val, ['subgrp_id'])
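
To make the inputs of the decorators in the next sections concrete, here is a minimal sketch on a hypothetical five-row data frame. It assumes that `get_ragged_nested_index_lists` returns one row per subgroup with the member `_row_` indices collected in a list, which is what the `['_row_'].to_list()` calls in this notebook suggest. The plain pandas `groupby` below is only a stand-in for the library function, and `toy` with its values is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: five individuals in two subgroups.
toy = pd.DataFrame({
    "_row_": [0, 1, 2, 3, 4],
    "y": [0.7, 2.4, -0.7, 4.5, -0.6],
    "subgrp_id": ["A", "A", "B", "B", "B"],
})

# Subgroup-level target: max pooling of the individual-level y.
toy_subgrp_y = toy.groupby("subgrp_id").agg({"y": "max"}).reset_index()

# Assumed shape of the ragged index lists: one list of row indices per subgroup.
toy_subgrp_inds = toy.groupby("subgrp_id")["_row_"].agg(list).reset_index()

# Sort both by subgrp_id so that targets and index lists stay aligned.
targets = toy_subgrp_y.sort_values(by=["subgrp_id"])["y"].to_numpy()
index_lists = toy_subgrp_inds.sort_values(by=["subgrp_id"])["_row_"].to_list()

print(targets)      # [2.4 4.5]
print(index_lists)  # [[0, 1], [2, 3, 4]]
```

Sorting both frames by `subgrp_id` is what keeps the i-th target aligned with the i-th index list.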

## Custom TF Pooling and Loss Function

In [8]:
@xgb_tf_loss(df_train_subgrp_inds.sort_values(by=['subgrp_id'])['_row_'].to_list(), 
             df_train_subgrp_y.sort_values(by=['subgrp_id'])['y'].to_numpy())
def max_pooling_mse_loss(target, preds_cube):
    """Custom TF Pooling and Loss function.

    This example function performs max pooling from the individual
    level to subgroups.
    The function takes appropriate care of missing values in preds_cube.

    Inputs:
    = target: 1D tensor with target on the level of groups
    = preds_cube: ND tensor with predictions on the individual level;
    the first dimension is that of groups, the other dimensions reflect
    sub-groups on different levels and individual observations
    (target.shape[0] == preds_cube.shape[0]; 
    preds_cube.shape[-1] == max # indiv observations per the most detailed 
    sub-group).
    Missing values are denoted by np.nan and have to be handled in
    this function body. They occur simply because preds_cube
    typically has many more elements than the original flat predictions
    vector from XGBoost.

    Output: scalar tensor with the MEAN of losses over all dimensions.
    This is the output of e.g. tf.keras.losses.mean_squared_error().
    The mean is translated to a SUM later in tf_d_loss() for
    compatibility with the XGBoost custom objective function.
    """
    x = preds_cube
    # replace NaNs with -Inf: neutral value for reduce_max()
    x = tf.where(tf.math.is_nan(x), tf.constant(-np.inf, dtype=x.dtype), x)
    x = tf.math.reduce_max(x, axis=-1)
    loss = tf.keras.losses.mean_squared_error(target, x)
    return loss
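
The docstring above describes `preds_cube` as a NaN-padded tensor. The following is a minimal NumPy sketch of that idea, assuming the decorator scatters the flat XGBoost predictions into a rectangular array using the ragged index lists. `to_padded_cube` is a hypothetical helper written for this illustration and is not part of `tf2xgb`; NumPy stands in for TensorFlow to keep the sketch light.

```python
import numpy as np

def to_padded_cube(flat_preds, index_lists):
    """Hypothetical helper: place flat predictions into a NaN-padded 2D array."""
    width = max(len(rows) for rows in index_lists)
    cube = np.full((len(index_lists), width), np.nan)
    for g, rows in enumerate(index_lists):
        cube[g, :len(rows)] = flat_preds[rows]
    return cube

flat_preds = np.array([0.5, 2.0, -1.0, 4.0, 0.0])  # made-up individual predictions
index_lists = [[0, 1], [2, 3, 4]]                  # rows belonging to each subgroup
target = np.array([2.4, 4.5])                      # made-up subgroup-level targets

cube = to_padded_cube(flat_preds, index_lists)
# Replace NaN padding with -inf so that it can never win the max.
pooled = np.max(np.where(np.isnan(cube), -np.inf, cube), axis=-1)
mse = float(np.mean((target - pooled) ** 2))

print(cube)    # first row is padded with one NaN
print(pooled)  # [2. 4.]
```

The padding is why the loss function has to neutralize NaNs before pooling: a subgroup with fewer members than the widest one would otherwise yield NaN from the max.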

## Custom Pooling Metric

In [9]:
@xgb_tf_metric(df_val_subgrp_inds.sort_values(by=['subgrp_id'])['_row_'].to_list(), 
               df_val_subgrp_y.sort_values(by=['subgrp_id'])['y'].to_numpy())
def max_pooling_mse_metric(target, preds_cube):
    """Custom Pooling MSE.

    This example function performs max pooling from the individual
    level to subgroups and computes MSE.
    The function takes appropriate care of missing values in preds_cube.

    Inputs:
    = target: 1D tensor with target on the level of groups
    = preds_cube: ND tensor with predictions on the individual level;
    the first dimension is that of groups, the other dimensions reflect
    sub-groups on different levels and individual observations
    (target.shape[0] == preds_cube.shape[0]; 
    preds_cube.shape[-1] == max # indiv observations per the most detailed 
    sub-group).
    Missing values are denoted by np.nan and have to be handled in
    this function body. They occur simply because preds_cube
    typically has many more elements than the original flat predictions
    vector from XGBoost.

    Output: tuple (metric_name, metric_value)
    """
    preds_cube = np.nan_to_num(preds_cube, nan=-np.inf)
    preds = np.max(preds_cube, axis=-1)
    score = mean_squared_error(target, preds)
    return 'max_mse', score

## Estimation

In [10]:
dtrain = xgb.DMatrix(X_train)
dval = xgb.DMatrix(X_val)

In [11]:
%%time
regr_subgrp = xgb.train({'tree_method': 'hist',
                         'seed': 1994,
                         'n_jobs': 20,
                         'learning_rate': 0.12,
                         'disable_default_eval_metric': 1
                        }, 
                        num_boost_round=100,
                        dtrain=dtrain,
                        evals=[(dval, 'val')],
                        obj=max_pooling_mse_loss,
                        feval=max_pooling_mse_metric,
                        early_stopping_rounds=3
                       )

[0]	val-max_mse:4.43615
[1]	val-max_mse:3.65869
[2]	val-max_mse:3.05419
[3]	val-max_mse:2.58527
[4]	val-max_mse:2.22042
[5]	val-max_mse:1.93740
[6]	val-max_mse:1.71603
[7]	val-max_mse:1.54372
[8]	val-max_mse:1.41002
[9]	val-max_mse:1.30572
[10]	val-max_mse:1.22442
[11]	val-max_mse:1.16187
[12]	val-max_mse:1.11270
[13]	val-max_mse:1.07487
[14]	val-max_mse:1.04528
[15]	val-max_mse:1.02249
[16]	val-max_mse:1.00472
[17]	val-max_mse:0.99116
[18]	val-max_mse:0.98074
[19]	val-max_mse:0.97254
[20]	val-max_mse:0.96643
[21]	val-max_mse:0.96143
[22]	val-max_mse:0.95773
[23]	val-max_mse:0.95515
[24]	val-max_mse:0.95302
[25]	val-max_mse:0.95140
[26]	val-max_mse:0.95026
[27]	val-max_mse:0.94930
[28]	val-max_mse:0.94863
[29]	val-max_mse:0.94814
[30]	val-max_mse:0.94787
[31]	val-max_mse:0.94759
[32]	val-max_mse:0.94749
[33]	val-max_mse:0.94742
[34]	val-max_mse:0.94748
[35]	val-max_mse:0.94745
[36]	val-max_mse:0.94749
CPU times: total: 54.6 s
Wall time: 38.3 s
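
The log above ends at iteration 36 because `val-max_mse` last improved at iteration 33 and `early_stopping_rounds=3` allows only three rounds without improvement. The sketch below replays that rule on the last logged values. `first_stop` is a hypothetical helper that mimics the stopping rule for a metric to be minimized; it is not the XGBoost implementation.

```python
def first_stop(scores, patience, start=0):
    """Return (stop_iteration, best_iteration, best_score) for a minimized metric."""
    best_iter, best_score = start, float("inf")
    for i, score in enumerate(scores, start=start):
        if score < best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return i, best_iter, best_score
    return start + len(scores) - 1, best_iter, best_score

# val-max_mse for iterations 28..36, copied from the training log above.
logged = [0.94863, 0.94814, 0.94787, 0.94759, 0.94749,
          0.94742, 0.94748, 0.94745, 0.94749]

stop_iter, best_iter, best_score = first_stop(logged, patience=3, start=28)
print(stop_iter, best_iter, best_score)  # 36 33 0.94742
```

On the trained booster, the same information should be available as `regr_subgrp.best_iteration` and `regr_subgrp.best_score`.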
