Add a RuleFit page to the User Guide #7878

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 2 comments

@exalate-issue-sync

We should add a page on RuleFit to the Supervised algorithms section in the User Guide: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#supervised

The new content:

h2. Introduction

The RuleFit algorithm combines tree ensembles and linear models to take advantage of both methods: the accuracy of a tree ensemble and the interpretability of a linear model.

The general algorithm fits a tree ensemble to the data, builds a rule ensemble by traversing each tree, evaluates the rules on the data to build a rule feature set, and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set.
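
The rule-building steps can be sketched in plain Python. This is only an illustration under stated assumptions, not H2O's implementation: the `Node` class, the helper names, and the toy tree are all hypothetical, and every non-root node is treated as one rule (the conjunction of split conditions on the path from the root). The final step, fitting a LASSO model to these rule features joined with the original features, is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary tree node; a leaf has feature=None."""
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when row[feature] < threshold
    right: Optional["Node"] = None   # taken when row[feature] >= threshold


def extract_rules(node, path=()):
    """Traverse the tree and return one rule per non-root node."""
    rules = [path] if path else []
    if node.feature is not None:
        rules += extract_rules(node.left, path + ((node.feature, "<", node.threshold),))
        rules += extract_rules(node.right, path + ((node.feature, ">=", node.threshold),))
    return rules


def rule_holds(rule, row):
    """A rule holds for a row when every condition on its path is satisfied."""
    return all(row[f] < t if op == "<" else row[f] >= t for f, op, t in rule)


def rule_features(rules, rows):
    """Evaluate each rule on each row to build a 0/1 rule feature set."""
    return [[1 if rule_holds(rule, row) else 0 for rule in rules] for row in rows]


# Toy tree: split on age, then split the older branch on fare.
tree = Node("age", 18.0,
            left=Node(),
            right=Node("fare", 50.0, left=Node(), right=Node()))

rules = extract_rules(tree)
rows = [{"age": 10.0, "fare": 20.0}, {"age": 40.0, "fare": 80.0}]
features = rule_features(rules, rows)
```

Each column of `features` is one binary rule feature; in the full algorithm these columns would be joined with the original predictors before the sparse linear fit.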

h2. Defining a RuleFit Model (beta API)

  • model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
  • training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
  • validation_frame: (Optional) Specify the dataset used to evaluate the accuracy of the model.
  • seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
  • y: (Required) Specify the column to use as the dependent variable.
    ** For a regression model, this column must be numeric (Real or Int).
    ** For a classification model, this column must be categorical (Enum or String). If the family is Binomial, the dataset cannot contain more than two levels.
  • x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If x is missing, then all columns except y are used.
  • algorithm: The algorithm to use to fit a tree ensemble. Must be one of: "AUTO", "DRF", "GBM". Defaults to DRF.
  • min_rule_length: Specify the minimum depth of the trees to fit. Defaults to 1.
  • max_rule_length: Specify the maximum depth of the trees to fit. Defaults to 10.
  • max_num_rules: The maximum number of rules to return. Defaults to -1, which means the number of rules is selected by diminishing returns in model deviance.
  • model_type: Specify the type of base learners in the ensemble. Must be one of: "RULES_AND_LINEAR", "RULES", "LINEAR". Defaults to RULES_AND_LINEAR.
  • weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified weights_column must be included in the specified training_frame. Python only: To use a weights column when passing an H2OFrame to x instead of a list of column names, the specified training_frame must contain the specified weights_column. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
  • distribution: Specify the distribution (i.e., the loss function). The options are AUTO, bernoulli, multinomial, gaussian, poisson, gamma, laplace, quantile, huber, or tweedie.
    ** If the distribution is bernoulli, the response column must be 2-class categorical.
    ** If the distribution is quasibinomial, the response column must be numeric and binary.
    ** If the distribution is multinomial, the response column must be categorical.
    ** If the distribution is poisson, the response column must be numeric.
    ** If the distribution is tweedie, the response column must be numeric.
    ** If the distribution is gaussian, the response column must be numeric.
    ** If the distribution is gamma, the response column must be numeric.
    ** If the distribution is fractionalbinomial, the response column must be numeric between 0 and 1.
    ** If the distribution is negativebinomial, the response must be numeric and non-negative.
    ** If the distribution is ordinal, the response must be categorical with at least 3 levels.
    ** If the distribution is AUTO,
    *** and the response is Enum with cardinality = 2, then the distribution is automatically determined as bernoulli.
    *** and the response is Enum with cardinality > 2, then the distribution is automatically determined as multinomial.
    *** and the response is numeric (Real or Int), then the distribution is automatically determined as gaussian.
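
The AUTO rules for distribution can be restated as a small function. This is a sketch of the documented behavior only, not H2O code: the function name and the `"enum"` / `"numeric"` type labels are assumptions made for illustration.

```python
def resolve_auto_distribution(response_type, cardinality=None):
    """Mirror the AUTO rules listed above (hypothetical helper, not an H2O API).

    response_type: "enum" for a categorical response, "numeric" for Real or Int.
    cardinality: number of levels, only meaningful for an "enum" response.
    """
    if response_type == "enum":
        if cardinality == 2:
            return "bernoulli"
        if cardinality is not None and cardinality > 2:
            return "multinomial"
        raise ValueError("a categorical response needs at least 2 levels")
    if response_type == "numeric":
        return "gaussian"
    raise ValueError(f"unsupported response type: {response_type!r}")


binary = resolve_auto_distribution("enum", cardinality=2)
multiclass = resolve_auto_distribution("enum", cardinality=5)
regression = resolve_auto_distribution("numeric")
```

For example, the Titanic `survived` column converted to a factor has two levels, so AUTO would resolve to bernoulli under these rules.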

h2. Interpreting a RuleFit Model

The output for the RuleFit model includes:

  • model parameters
  • rule importance in tabular form
  • training and validation metrics of the underlying linear model.
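
One common way to read a rule importance table is to rank rules by the absolute value of their linear coefficient and ignore rules that the sparse fit set to zero. The sketch below assumes that reading; the table contents are made-up illustrative values, not output from a trained model, and H2O's actual table layout may differ.

```python
# Hypothetical (rule, coefficient) pairs; the numbers are invented for illustration.
rule_table = [
    ("sex = female", 1.2),
    ("age < 18 and pclass = 1", 0.4),
    ("fare >= 50", 0.0),
    ("pclass = 3 and sex = male", -0.9),
]


def rank_rules(table):
    """Drop zero-coefficient rules, then sort by absolute coefficient (largest first)."""
    kept = [(rule, coef) for rule, coef in table if coef != 0.0]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


ranked = rank_rules(rule_table)
for rule, coef in ranked:
    print(f"{coef:+.2f}  {rule}")
```

The sign of a coefficient indicates the direction in which the rule shifts the prediction, while its magnitude is what determines the ordering here.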

h2. Examples

In R:

{noformat}library(h2o)
h2o.init()

# Import the Titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)

# Set the response and predictor columns
response <- "survived"
predictors <- c("age", "sibsp", "parch", "fare", "sex", "pclass")

# Convert the response and pclass columns to factors
titanic[, response] <- as.factor(titanic[, response])
titanic[, "pclass"] <- as.factor(titanic[, "pclass"])

# Train a RuleFit model
rf_h2o <- h2o.rulefit(y = response, x = predictors, training_frame = titanic,
                      max_rule_length = 10, max_num_rules = 100, seed = 1234)

# Print the rule importance table
print(rf_h2o@model$rule_importance){noformat}

In Python:

{noformat}import h2o
from h2o.estimators.rulefit import H2ORuleFitEstimator

h2o.init()

# Import the Titanic dataset, parsing pclass and survived as categorical columns
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv",
                     col_types={'pclass': "enum", 'survived': "enum"})

# Set the predictor columns
x = ["age", "sibsp", "parch", "fare", "sex", "pclass"]

# Train a RuleFit model
rf_h2o = H2ORuleFitEstimator(max_rule_length=10, max_num_rules=100, seed=1234, model_type="rules_and_linear")
rf_h2o.train(training_frame=df, x=x, y="survived")

# Print the rule importance table
print(rf_h2o._model_json['output']['rule_importance']){noformat}

h2. References

Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954.

@exalate-issue-sync
Author

Zuzana Olajcová commented: Hi [~accountid:5d1185d4f46aa30c271c7cc6] , I’ve prepared the content to update docs. Can you please review it from your POV and add it to the User Guide? Thanks!

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-7763
Assignee: hannah.tillman
Reporter: Erin LeDell
State: Resolved
Fix Version: 3.32.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#4928
#4943
