# XGBoost feature interactions demo

This demo shows how to use two XGBoost enhancements available through the H2O XGBoost integration: **feature interaction constraints** and extracting **feature interactions** from a trained model.

**More information:**

- H2O XGBoost interaction constraints documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/interaction_constraints.html
- Native XGBoost interaction constraints tutorial: https://xgboost.readthedocs.io/en/latest/tutorials/feature_interaction_constraint.html
- H2O XGBoost feature interaction documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html#xgboost-feature-interactions
- Original XGBFI package: https://github.com/Far0n/xgbfi


## Feature Interaction Constraints

Feature interaction constraints allow users to decide which variables are allowed to interact and which are not.

**Potential benefits include:**

- Better predictive performance from focusing on interactions that work – whether through domain-specific knowledge or algorithms that rank interactions
- Less noise in predictions; better generalization
- More control for the user over what the model can fit. For example, the user may want to exclude some interactions because of regulatory constraints, even if they perform well

(Source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/interaction_constraints.html)
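
To make the idea concrete, here is a minimal pure-Python sketch of what a constraint means for a single branch of a tree. It assumes a simplified reading of the rule: the distinct split features on one root-to-leaf path must all lie within one allowed group, and a feature may always interact with itself. The helper `path_is_allowed` is hypothetical and is not part of H2O or XGBoost; the real implementations also handle overlapping groups, which this sketch ignores.

```python
def path_is_allowed(path_features, constraints):
    """Check one root-to-leaf path against the interaction constraints.

    Simplified assumption: all distinct split features on the path must
    belong to a single constraint group. A path that uses only one
    feature is always allowed.
    """
    used = set(path_features)
    if len(used) <= 1:
        return True
    return any(used <= set(group) for group in constraints)


# the same groups as used in the H2O example in this demo
constraints = [["CAPSULE", "AGE"], ["PSA", "DPROS"]]

print(path_is_allowed(["AGE", "CAPSULE", "AGE"], constraints))  # both features in the first group
print(path_is_allowed(["PSA", "AGE"], constraints))             # features from different groups
print(path_is_allowed(["RACE", "RACE"], constraints))           # a single feature interacting with itself
```

The first and third calls print `True` and the second prints `False`, because `PSA` and `AGE` never appear together in an allowed group.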

In [None]:
# start h2o
import h2o
h2o.init(strict_version_check=False, port=54321)

In [None]:
from h2o.estimators.xgboost import H2OXGBoostEstimator
# check if the H2O XGBoostEstimator is available
assert H2OXGBoostEstimator.available()

# import data
data = h2o.import_file(path="../../smalldata/logreg/prostate.csv")

# predictors: column indices 1 .. ncol-3 (skips the first column and the last two)
x = list(range(1, data.ncol-2))
# response: the last column
y = data.names[-1]

ntree = 5

h2o_params = {
    'eta': 0.3, 
    'max_depth': 3,  
    'ntrees': ntree,
    'tree_method': 'hist'
} 

# define the constraints as a list of lists of column names
# each inner list defines one group of columns that are allowed to interact
# the interaction of a column with itself is always allowed,
# so you cannot specify a list with a single column, e.g. ["PSA"]
h2o_params["interaction_constraints"] = [["CAPSULE", "AGE"], ["PSA", "DPROS"]]

# train h2o XGBoost model
h2o_model = H2OXGBoostEstimator(**h2o_params)
h2o_model.train(x=x, y=y, training_frame=data)

In [None]:
# check that the trees have an allowed structure:
# the split features along each branch should respect the interaction constraints above
from h2o.tree import H2OTree
for tree_index in range(ntree):
    print("Tree index: " + str(tree_index))
    tree = H2OTree(h2o_model, tree_index)
    # use a separate loop variable so the tree index is not shadowed
    for node in range(len(tree)):
        if tree.left_children[node] == -1:
            print("Leaf ID {0}.".format(tree.node_ids[node]))
        else:
            print("Node ID {0} has left child node with index {1} and right child node with index {2}. The split feature is {3}.".format(
                tree.node_ids[node], tree.left_children[node], tree.right_children[node], tree.features[node]))

In [None]:
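try:
    import xgboost as xgb
    import pandas as pd

    # use separate names so the H2O frame in `data` is not overwritten
    prostate_df = pd.read_csv("../../smalldata/logreg/prostate.csv")

    # GLEASON is the response; the ID column is dropped as well
    labels = prostate_df["GLEASON"]
    x_names = [name for name in prostate_df.columns if name not in ("GLEASON", "ID")]
    features = prostate_df[x_names]

    D_train = xgb.DMatrix(features, label=labels)

    param = {
        'eta': 0.3,
        'max_depth': 3,
        # column indices within x_names; same as [["CAPSULE", "AGE"], ["PSA", "DPROS"]]
        'interaction_constraints': '[[0,1], [3, 5]]',
        'tree_method': 'hist'
    }

    steps = ntree

    xgboost_model = xgb.train(param, D_train, steps)
    # compare this table with the H2O trees above to check that
    # H2O XGBoost and native XGBoost produce the same tree structure
    # (printed explicitly: a bare expression inside a try block is not displayed)
    print(xgboost_model.trees_to_dataframe())
except ImportError as e:
    print(e)
    xgboost_model = None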

In [None]:
# plot xgboost trees (plot_tree also needs the graphviz package)
try:
    if xgboost_model is not None:
        from xgboost import plot_tree
        import matplotlib.pyplot as plt

        plt.rcParams['figure.figsize'] = (50, 80)
        for i in range(ntree):
            plot_tree(xgboost_model, num_trees=i)
        plt.show()
except Exception as e:
    print(e)

In [None]:
# show xgboost model variable importance
if xgboost_model is not None:
    print(xgboost_model.get_score(importance_type='total_gain'))

In [None]:
# show h2o xgboost model variable importance
h2o_model.varimp()

## XGBFI-like Feature Interaction

The code below outputs several tables that give comprehensive insight into higher-order interactions between the features used in the XGBoost trees visualized above. It also provides additional useful tables summarizing leaf statistics and split value histograms for each feature. Each measure is one of the following:

**Gain** is the relative contribution of a feature to the model, calculated by taking the feature's contribution in each tree of the model. A higher value than that of another feature means the feature is more important for generating a prediction.

**Cover** measures the number of observations affected by a split. Aggregated over a specific feature, it measures the relative number of observations concerned by that feature.

**Frequency (FScore)** is the number of times a feature is used across all generated trees. Note that it takes into account neither the tree depth nor the tree index of the splits in which a feature occurs, nor the number of possible splits of a feature. Hence, it is often a suboptimal measure of importance.


or one of their averaged, weighted, or ranked variants.
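
As a rough illustration of how the three base measures are aggregated per feature, the sketch below sums gain and cover and counts splits over a small hand-made list. The split values are made up for illustration and do not come from the model trained above. The tables produced by H2O contain more columns than this and also cover multi-feature interactions.

```python
from collections import defaultdict

# hypothetical splits collected from all trees: (feature, gain, cover)
splits = [
    ("PSA", 10.0, 100.0),
    ("DPROS", 4.0, 60.0),
    ("PSA", 6.0, 40.0),
    ("AGE", 2.0, 100.0),
]

summary = defaultdict(lambda: {"gain": 0.0, "cover": 0.0, "fscore": 0})
for feature, gain, cover in splits:
    summary[feature]["gain"] += gain    # total gain of all splits on this feature
    summary[feature]["cover"] += cover  # total cover of all splits on this feature
    summary[feature]["fscore"] += 1     # number of splits on this feature

# rank features by total gain, highest first
ranked = sorted(summary, key=lambda name: summary[name]["gain"], reverse=True)
for name in ranked:
    stats = summary[name]
    average_gain = stats["gain"] / stats["fscore"]
    print(name, stats, "average gain:", average_gain)
```

In this toy data `DPROS` and `AGE` have the same FScore but different gain, which shows why a plain split count can be a weak measure of importance.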

In [None]:
# calculate multi-level feature interactions
h2o_model.feature_interaction()