Speed up Friedman Popescu's H #15916

Closed
tomasfryda opened this issue Nov 9, 2023 · 0 comments
tomasfryda commented Nov 9, 2023

Computing the H-statistic takes unusually long in h2o compared to other available implementations.
I benchmarked on a 10,000-row dataset generated with sklearn-gbmi (we reference this package): it takes ~30 seconds with sklearn and ~30 minutes with h2o. The results are slightly different, so I'm wondering whether some sampling is happening in sklearn-gbmi. It could be useful to expose a sampling parameter, even if that means slightly less stable results. People typically aren't looking for the exact H anyway; a rough measure of interaction strength is usually enough.

This takes way too long:

import time  # needed for the timing calls below

import numpy as np
import pandas as pd
DATA_COUNT = 10000
RANDOM_SEED = 137
TRAIN_FRACTION = 0.9

np.random.seed(RANDOM_SEED)

xs = pd.DataFrame(np.random.uniform(size = (DATA_COUNT, 3)))
xs.columns = ['x0', 'x1', 'x2']

y = pd.DataFrame(xs.x0*xs.x1 + xs.x2 + pd.Series(0.1*np.random.randn(DATA_COUNT)))
y.columns = ['y']

train_ilocs = range(int(TRAIN_FRACTION*DATA_COUNT))
test_ilocs = range(int(TRAIN_FRACTION*DATA_COUNT), DATA_COUNT)

merged_data = pd.concat((xs, y), axis=1)

import h2o
h2o.init()
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(max_depth=50)

data_sample = h2o.H2OFrame(merged_data)
gbm.train(x=['x0', 'x1', 'x2'], y='y', training_frame=data_sample)


h2o_single_pair_start_time = time.time()
h2o_single_pair_h = gbm.h(frame=data_sample, variables=['x0', 'x1'])
h2o_single_pair_end_time = time.time()

# => h2o_single_pair_end_time - h2o_single_pair_start_time = ~30 mins
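
To make the sampling suggestion above concrete, here is a minimal sketch of a pairwise H-statistic computed on a random row subsample. It is not h2o's or sklearn-gbmi's implementation: `partial_dependence`, `h_statistic`, and the `n_sample` parameter are hypothetical names used only for illustration, and `predict` stands for any callable that maps a NumPy array of rows to predictions. The brute-force partial dependence costs roughly `n_sample**2` row predictions per term, which is why shrinking the sample cuts the runtime so sharply.

```python
import numpy as np


def partial_dependence(predict, X, cols, grid):
    """Centered partial dependence on columns `cols`, evaluated at each row of `grid`."""
    pd_vals = np.empty(len(grid))
    for i, point in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, cols] = point  # fix the chosen features, keep the others as observed
        pd_vals[i] = predict(X_mod).mean()
    return pd_vals - pd_vals.mean()


def h_statistic(predict, X, j, k, n_sample=None, seed=0):
    """Pairwise H for features j and k, optionally on a random subsample of rows.

    `n_sample` is a hypothetical sampling knob, not an existing h2o parameter.
    """
    rng = np.random.default_rng(seed)
    if n_sample is not None and n_sample < len(X):
        X = X[rng.choice(len(X), size=n_sample, replace=False)]

    pd_jk = partial_dependence(predict, X, [j, k], X[:, [j, k]])
    pd_j = partial_dependence(predict, X, [j], X[:, [j]])
    pd_k = partial_dependence(predict, X, [k], X[:, [k]])

    numerator = np.sum((pd_jk - pd_j - pd_k) ** 2)
    denominator = np.sum(pd_jk ** 2)
    return float(np.sqrt(numerator / denominator)) if denominator > 0 else 0.0


# Usage on a synthetic function shaped like the benchmark target (x0*x1 + x2).
rng = np.random.default_rng(137)
X_demo = rng.uniform(size=(1000, 3))


def demo_predict(A):
    return A[:, 0] * A[:, 1] + A[:, 2]


h_x0_x1 = h_statistic(demo_predict, X_demo, 0, 1, n_sample=200)  # interacting pair
h_x0_x2 = h_statistic(demo_predict, X_demo, 0, 2, n_sample=200)  # additive pair
```

With a fitted model, `predict` would wrap the model's own prediction call. The subsample changes the estimate slightly from run to run unless the seed is fixed, which is the stability trade-off mentioned above.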
@tomasfryda tomasfryda added the bug label Nov 9, 2023
@tomasfryda tomasfryda added this to the 3.44.0.3 milestone Nov 9, 2023
@tomasfryda tomasfryda self-assigned this Nov 9, 2023
wendycwong pushed a commit that referenced this issue Nov 13, 2023
* Initial speed up

* Additional speedups

* adjust test duration to reflect further improvements

* Increase maximum duration (jenkins is sometimes slow)

* rename test file to move it to large stage