Speed up Friedman Popescus H #15916

tomasfryda · 2023-11-09T08:15:38Z

Computing the H-statistic takes unusually long in h2o compared to other implementations.
I benchmarked on a 10,000-row dataset generated with sklearn-gbmi (we reference this package): it takes ~30 seconds with sklearn and ~30 minutes with h2o. The results are slightly different, so I'm wondering whether sklearn-gbmi does some sampling. It could be useful to expose a sampling parameter, even if that means slightly less stable results. People typically aren't looking for the exact H anyway; they just want a rough measure of interaction strength.
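To illustrate why a sampling parameter could help, here is a minimal NumPy sketch of the pairwise H² statistic with optional row subsampling. It is not h2o's or sklearn-gbmi's implementation: `partial_dependence`, `h_squared`, and `sample_size` are hypothetical names, and `predict` stands in for any fitted model's prediction function. In this naive form each partial-dependence point needs a full pass over the rows, so the cost grows roughly quadratically with the number of rows, and subsampling cuts it at the price of some variance.

```python
import numpy as np


def partial_dependence(predict, X, cols, points):
    """Centered partial dependence of `predict` on columns `cols`.

    For each row of `points`, the listed columns of X are overwritten with
    that row's values and the predictions are averaged over all rows of X.
    """
    out = np.empty(len(points))
    for i, p in enumerate(points):
        X_mod = X.copy()
        X_mod[:, cols] = p  # broadcast the fixed values to every row
        out[i] = predict(X_mod).mean()
    return out - out.mean()  # center to mean zero


def h_squared(predict, X, j, k, sample_size=None, seed=0):
    """Pairwise H^2 for columns j and k (illustrative sketch).

    `sample_size` is a hypothetical knob: when set, the statistic is
    computed on a random subset of rows instead of the full data.
    """
    if sample_size is not None and sample_size < len(X):
        rng = np.random.default_rng(seed)
        X = X[rng.choice(len(X), size=sample_size, replace=False)]
    pd_jk = partial_dependence(predict, X, [j, k], X[:, [j, k]])
    pd_j = partial_dependence(predict, X, [j], X[:, [j]])
    pd_k = partial_dependence(predict, X, [k], X[:, [k]])
    return ((pd_jk - pd_j - pd_k) ** 2).sum() / (pd_jk ** 2).sum()
```

With a stand-in model such as `predict = lambda X: X[:, 0] * X[:, 1] + X[:, 2]`, calling `h_squared(predict, X, 0, 1, sample_size=200)` should report a clearly non-zero interaction for `x0` and `x1`, while the pair `x0`, `x2` should come out at essentially zero.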

This takes way too long:

import time

import numpy as np
import pandas as pd
DATA_COUNT = 10000
RANDOM_SEED = 137
TRAIN_FRACTION = 0.9

np.random.seed(RANDOM_SEED)

xs = pd.DataFrame(np.random.uniform(size = (DATA_COUNT, 3)))
xs.columns = ['x0', 'x1', 'x2']

y = pd.DataFrame(xs.x0*xs.x1 + xs.x2 + pd.Series(0.1*np.random.randn(DATA_COUNT)))
y.columns = ['y']

train_ilocs = range(int(TRAIN_FRACTION*DATA_COUNT))
test_ilocs = range(int(TRAIN_FRACTION*DATA_COUNT), DATA_COUNT)

merged_data = pd.concat((xs, y), axis=1)

import h2o
h2o.init()
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(max_depth=50)

data_sample = h2o.H2OFrame(merged_data)
gbm.train(x=['x0', 'x1', 'x2'], y='y', training_frame=data_sample)


h2o_single_pair_start_time = time.time()
h2o_single_pair_h = gbm.h(frame=data_sample, variables=['x0', 'x1'])
h2o_single_pair_end_time = time.time()

# => h2o_single_pair_end_time - h2o_single_pair_start_time = ~30 mins

* Initial speed up
* Additional speedups
* Adjust test duration to reflect further improvements
* Increase maximum duration (Jenkins is sometimes slow)
* Rename test file to move it to large stage

tomasfryda added the bug label Nov 9, 2023

tomasfryda added this to the 3.44.0.3 milestone Nov 9, 2023

tomasfryda self-assigned this Nov 9, 2023

tomasfryda mentioned this issue Nov 9, 2023

GH-15916: Speed up Friedman-Popescu's H #15917

Merged

wendycwong closed this as completed Nov 13, 2023

tomasfryda commented Nov 9, 2023 • edited by wendycwong