Support for H2O models

Awesome package! Thanks for all the work, @mayer79. Great to see such a solid implementation of H statistics in R. Well documented and compatibile with many modelling packages. 

Would it be feasible to add out-of-the-box support for H2O models? Due to the way how you've set up the code, it's already possible now to use the `pred_fun` argument for H2O models in the various functions. See the code example down below, where I illustrate this for binomial, regression and multinomial H2O models.

I believe overwriting the generics for `hstats()`, `partial_dep()`, `ice()`, and `perm_importance()` would require two main things:

1. Changing the `pred_fun`argument to:

```
# Class "H2OBinomialModel" 
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[3]]
}

# Class "H2ORegressionModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[1]]
}

# Class "H2OMulticlassModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[, -1]
}
```

2. Ensuring the provided data argument (potentially an H2OFrame) gets converted to a data.frame using `as.data.frame()`

Further suggestions / potential improvements:
- Would it be possible/better to set up a generic for the `pred_fun` itself? Then you'd only have to overwrite that one for ranger, Learner, explainer, H2OBinomialModel, H2ORegressionModel, and H2OMulticlassModel classes instead of all of `hstats()`, `partial_dep()`, `ice()`, and `perm_importance()`.
- The prediction function for H2O models requires converting the data to an H2OFrame using `as.h2o()` and then calling `h2o.predict()`. Doing this only once is much faster than calling `pred_fun` multiple times. Hence, particularly for H2O models, further speed improvements are possible by first combining the data and then calling `pred_fun` instead of the other way around. [Try running the below examples using `h2o.show_progress()` to see a progress bar whenever `as.h2o()` and `h2o.predict()` get called]
    - `partial_dep(..., BY = ...)` could be faster by avoiding the for loop over the `BY` argument. 
    - `hstats()` could be faster by avoiding the for loop over the one-way, two-way and three-way effects. [This would be hard to refactor though]
    - `perm_importance()` could be faster by avoiding the for loop over the `v` argument. [This might lead to memory issues though when stacking too many data frames when `v`, `m_rep` and/or `n_max` is large, so potentially having an optional argument for this would be better]
- For H2O models, the default argument when `v = NULL` in `hstats()` and `perm_importance()` could be set to `object@allparameters$x`. That way no unnecessary computations are performed for columns in `X` not used as features.
- For H2O models, the default argument for `y` in `perm_importance()` can be set to `object@allparameters$y`.

Code example for H2O models:

``` r
library(hstats)
library(h2o)
h2o.init()
h2o.no_progress()

# H2O Binomial ----

# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate_df <- as.data.frame(prostate)

# Set the predictors and response; set the factors:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
prostate$RACE <- as.factor(prostate$RACE)
prostate$DPROS <- as.factor(prostate$DPROS)
prostate$DCAPS <- as.factor(prostate$DCAPS)
prostate$GLEASON <- as.factor(prostate$GLEASON)
predictors <- c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
response <- "CAPSULE"

# Build and train the model:
pros_gbm <- h2o.gbm(x = predictors,
                    y = response,
                    nfolds = 5,
                    seed = 1111,
                    keep_cross_validation_predictions = TRUE,
                    training_frame = prostate)

# Class "H2OBinomialModel" 
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[3]]
}

# H stats
s <- hstats(pros_gbm, X = prostate_df, pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#> [1] 0.2818805

# H statistics: overall level of interaction & pairwise
plot(s)
```

![](https://i.imgur.com/NlomcdR.png)

``` r

# Unnormalized pairwise statistics
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))
```

![](https://i.imgur.com/xLPvDAa.png)

``` r

# Stratified PD
plot(partial_dep(pros_gbm, v = "PSA", X = prostate_df, BY = "GLEASON", pred_fun = pred_fun), show_points = FALSE)
```

![](https://i.imgur.com/T0Q1SDq.png)

``` r

# Two-dimensional PDP
pd <- partial_dep(pros_gbm, v = c("PSA", "GLEASON"), X = prostate_df, pred_fun = pred_fun, grid_size = 1000)
plot(pd)
```

![](https://i.imgur.com/CI5mkkU.png)

``` r

# Centered ICE plot with colors
ic <- ice(pros_gbm, v = "PSA", X = prostate_df, BY = "GLEASON", pred_fun = pred_fun)
plot(ic, center = TRUE)
```

![](https://i.imgur.com/329TrPL.png)

``` r

# Variable importance based on permutation
plot(perm_importance(pros_gbm, X = prostate_df, y = pros_gbm@allparameters$y, pred_fun = pred_fun, v = pros_gbm@allparameters$x))
```

![](https://i.imgur.com/Qbv81UJ.png)

``` r

# Variable importance based on PD function
plot(pd_importance(s))
```

![](https://i.imgur.com/cgVkGSY.png)

``` r


# H2O Regression ----

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
prostate_path = system.file("extdata", "prostate.csv", package = "h2o")
prostate = h2o.importFile(path = prostate_path)
predictors <- setdiff(colnames(prostate), c("ID", "DPROS", "DCAPS", "VOL"))
pros_glm <- h2o.glm(y = "VOL", x = predictors, training_frame = prostate, family = "tweedie",
                    nfolds = 0, alpha = 0.1, lambda_search = FALSE,
                    interactions = c("AGE", "GLEASON"))

# Class "H2ORegressionModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[[1]]
}

# H stats
s <- hstats(pros_glm, X = as.data.frame(prostate), pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#> [1] 0.0007039798

# Pairwise statistics 
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))
```

![](https://i.imgur.com/U3AB2gV.png)

``` r

# PD plot 
pd <- partial_dep(pros_glm, v = "AGE", X = as.data.frame(prostate), BY = "GLEASON", pred_fun = pred_fun, grid_size = 500)
plot(pd)
```

![](https://i.imgur.com/BufICyK.png)

``` r


# H2O Multinomial ----

# import the cars dataset
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "cylinders"
cars[, response] <- as.factor(cars[response])

# split into train and validation sets
cars_splits <- h2o.splitFrame(data =  cars, ratios = 0.8, seed = 1234)
train <- cars_splits[[1]]
valid <- cars_splits[[2]]

# build and train the model:
cars_gbm <- h2o.gbm(x = predictors,
                    y = response,
                    training_frame = train,
                    validation_frame = valid,
                    distribution = "multinomial",
                    seed = 1234)

# Class "H2OMulticlassModel"
pred_fun <- function(object, newdata) {
  as.data.frame(h2o::h2o.predict(object, h2o::as.h2o(newdata)))[, -1]
} 

# H stats
s <- hstats(cars_gbm, X = as.data.frame(cars), pred_fun = pred_fun)
s
#> 'hstats' object. Use plot() or summary() for details.
#> 
#> H^2 (normalized)
#>         p3         p4         p5         p6         p8 
#> 0.74177545 0.06181569 0.95886855 0.07111337 0.00996367

# Pairwise statistics 
plot(h2_pairwise(s, normalize = FALSE, squared = FALSE))
```

![](https://i.imgur.com/Sjmz8WO.png)

``` r

# PD plot 
pd <- partial_dep(cars_gbm, v = "displacement", X = as.data.frame(cars), pred_fun = pred_fun, grid_size = 500)
plot(pd)
```

![](https://i.imgur.com/V2TS4W3.png)

<sup>Created on 2023-10-28 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for H2O models #91

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Support for H2O models #91

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions