PUBDEV-8461: Standardizing and improving algorithm parameter sections #6251

hannah-tillman · 2022-07-15T17:13:22Z

There are three goals within this PR:

- Restructure the parameters
- Expand and clarify the values of parameters
- Standardize the style

Restructuring: I separated the parameters into common parameters and hyperparameters. The order now more closely follows the order of importance used in the R documentation, and I lead with all required params. For params that don't have a page in the appendix, I went through the schemas and checked for gridable=True to decide whether they were hyperparams. Please let me know if any of these are incorrect.
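The schema check described above could be sketched roughly as follows. This is only an illustration, assuming the parameter metadata has already been copied into plain dictionaries: the `PARAMS` list, its field layout, and the `gridable_names` helper are hypothetical and are not H2O's schema API. The example flags mirror what reviewers report elsewhere in this thread (for instance, `seed` is gridable for Stacked Ensemble but not for PCA).

```python
# Hypothetical parameter metadata; real values would come from the H2O schemas.
PARAMS = [
    {"algo": "stackedensemble", "name": "seed", "gridable": True},
    {"algo": "pca", "name": "seed", "gridable": False},
    {"algo": "pca", "name": "compute_metrics", "gridable": False},
    {"algo": "gam", "name": "num_knots", "gridable": True},
    {"algo": "gam", "name": "bs", "gridable": True},
]


def gridable_names(params, algo):
    """Return the sorted names of parameters flagged gridable for one algo."""
    return sorted(
        p["name"] for p in params if p["algo"] == algo and p["gridable"]
    )


if __name__ == "__main__":
    for algo in ("stackedensemble", "pca", "gam"):
        print(algo, gridable_names(PARAMS, algo))
```

Keeping the flag per algorithm, rather than per parameter name, matters because the same parameter can be gridable in one algorithm and not in another.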

Expanding: I compared the information in the Python & R docs to what was in the user guide to expand on some params that were lacking info. I also got some outside help :)

Standardizing: Because so many different hands have written these parameter lists, their style had little cohesion. I started to standardize it here (e.g. made all "input-able" values code backticks instead of bolded or bare, and created vertical lists when four or more objects were listed in a row).

I would appreciate any and all input. I would especially appreciate algorithm owners double-checking to make sure I got the hyperparameters correct. Please let me know if you have any questions or critiques!

I've included a screenshot of the built Aggregator page to give a feel for the initial idea:

Note: I'm not including Infogram, AutoML, Model Explainability, or miscellaneous algos in this PR since they're structured a little differently. After this initial batch is solidified, I will make a new PR for changes to those if and where needed.

excludes: AutoML, Infogram, Explain, Target Encoding, TF-IDF, Word2vec, Permutation Varimp

h2o-docs/src/product/data-science/stacked-ensembles.rst

h2o-docs/src/product/data-science/gam.rst

SE `seed` hyperparam; spline_orders update

tomasfryda

Stacked Ensemble looks good. Thank you @hannah-tillman !

h2o-docs/src/product/data-science/eif.rst

h2o-docs/src/product/data-science/if.rst

h2o-docs/src/product/data-science/drf.rst

wendycwong · 2022-09-12T17:53:43Z

h2o-docs/src/product/data-science/gam.rst


   -  For a regression model, this column must be numeric (**Real** or **Int**).
-   -  For a classification model, this column must be categorical (**Enum** or **String**). If the family is **Binomial**, the dataset cannot contain more than two levels.
+   -  For a classification model, this column must be categorical (**Enum** or **String**). If the family is ``Binomial``, the dataset cannot contain more than two levels.


Please change:

If the family is "Binomial", the dataset cannot contain more than two levels

to

If the family is "Binomial", the dataset must contain two levels

wendycwong · 2022-09-12T17:55:47Z

h2o-docs/src/product/data-science/gam.rst


-  `remove_collinear_columns <algo-params/remove_collinear_columns.html>`__: Specify whether to automatically remove collinear columns during model-building. When enabled, collinear columns will be dropped from the model and will have 0 coefficient in the returned model. This can only be set if there is no regularization (lambda=0). This option is defaults to false (not enabled).
+-  `interactions <algo-params/interactions.html>`__: Specify a list of predictor column indices to interact. All pairwise combinations will be computed for this list. 


Please add for interactions:

Interactions with and among gamified columns are not supported at the moment.

wendycwong · 2022-09-12T18:01:27Z

h2o-docs/src/product/data-science/gam.rst

-  `prior <algo-params/prior.html>`__: Specify prior probability for p(y==1). Use this parameter for logistic regression if the data has been sampled and the mean of response does not reflect reality. This value must be a value in the range (0,1) or set to -1 (disabled).  This option is set to -1 (disabled) by default.  
-
-     **Note**: This is a simple method affecting only the intercept. You may want to use weights and offset for a better fit.
+These parameters can be used in grid search.



Please also add the following parameters to the grid search:

scale
num_knots
spline_orders
bs

hannah-tillman · 2022-09-12

All of these are already under the grid search section :)

wendycwong · 2022-09-12T18:11:10Z

h2o-docs/src/product/data-science/model_selection.rst


-  `custom_metric_func <algo-params/custom_metric_func.html>`__: Optionally specify a custom evaluation function.
+-  `standardize <algo-params/standardize.html>`__: Specify whether to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option defaults to ``True`` (enabled).


standardize is not gridable.

wendycwong · 2022-09-12T18:11:41Z

h2o-docs/src/product/data-science/model_selection.rst


- **max_predictor_number**: Maximum number of predictors to be considered when building GLM models. Defaults to 1.
+-  `plug_values <algo-params/plug_values.html>`__: When ``missing_values_handling="PlugValues"``, specify a single row frame containing values that will be used to impute missing values of the training/validation frame.


plug_values is not gridable.

wendycwong · 2022-09-12T18:12:01Z

h2o-docs/src/product/data-science/model_selection.rst


- **min_predictor_number**: For ``mode = "backward"`` only.  Minimum number of predictors to be considered when building GLM models starting with all predictors to be included. Defaults to ``1``.
+-  `max_iterations <algo-params/max_iterations.html>`__: Specify the number of training iterations (defaults to ``-1``).


max_iterations is not gridable.

wendycwong · 2022-09-12T23:46:48Z

h2o-docs/src/product/data-science/pca.rst

+-  `max_iterations <algo-params/max_iterations.html>`__: Specify the number of training iterations. The value must be between 1 and 1e6 and the default is ``1000``.
+
+-  `compute_metrics <algo-params/compute_metrics.html>`__: Enable metrics computations on the training data. This option defaults to ``True`` (enabled).
+


compute_metrics is not gridable.

wendycwong · 2022-09-12T23:47:13Z

h2o-docs/src/product/data-science/pca.rst

+-  `compute_metrics <algo-params/compute_metrics.html>`__: Enable metrics computations on the training data. This option defaults to ``True`` (enabled).
+
+-  `seed <algo-params/seed.html>`__: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This value defaults to ``-1`` (time-based random number).
+


seed is not gridable.

tomasfryda · 2022-09-13

I got scared that we had a bug in SE (we have seed as gridable), but after a short inspection I found that seed is usually gridable, except for PCA, SVD, and Aggregator. I'm mentioning this just to avoid confusion.

wendycwong · 2022-09-13T21:29:25Z

h2o-docs/src/product/data-science/gam.rst


   -  For a regression model, this column must be numeric (**Real** or **Int**).
-   -  For a classification model, this column must be categorical (**Enum** or **String**). If the family is **Binomial**, the dataset cannot contain more than two levels.
+   -  For a classification model, this column must be categorical (**Enum** or **String**). If the family is ``Binomial``, the dataset must contain two levels.

 -  `x <algo-params/x.html>`__: Specify a vector containing the names or indices of the predictor variables to use when building the model. If ``x`` is missing, then all columns except ``y`` are used.



@hannah-tillman For GAM, if x is missing, then no predictors will be used. This is different from other algos. Please change the description.

valenad1 · 2022-09-15T17:09:31Z

Extended Isolation Forest ✅

valenad1 · 2022-09-15T17:40:53Z

I have mixed feelings about the parameter sections. No doubt we need to bring some order to them 💯

When I think about the Common parameters and Hyperparameters sections in general, my understanding is that Common parameters should contain only the parameters that are common across all available algos. Is that correct? Without a doubt, common parameters include model_id, x, training_frame, y (if the algo is supervised), and so on. What bothers me are parameters like contamination (Isolation Forest) and pca_method (PCA), because they are algo-specific, to say the least, and IMHO belong in the hyperparameter section (or in some algo-specific hyperparameters section).

The second thought I have is that we already have this page:

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html#supported-grid-search-hyperparameters

about the supported grid search hyperparameters. I would rather see that page updated (e.g. I forgot to add an Extended Isolation Forest section there) and its information included in the algo pages than see the information copied under Hyperparameters, because a hyperparameter is not the same as a "gridable" parameter. Also, not all algos support grid search (e.g. Uplift RF).

I would suggest removing "These parameters can be used in grid search." and instead updating and segmenting the grid search page so that the supported grid search parameters can be included inside the algo documentation pages. We can show the information twice, but we should write it in only one place.

What do you think?
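One possible reading of "write it in only one place" is a single lookup table from which both the grid search page and each algo page are rendered. The sketch below assumes that setup; the `GRIDABLE` table and the `render_rst_list` helper are hypothetical and do not reflect how the H2O docs are actually built. The GAM and Stacked Ensemble entries reuse parameters named as gridable elsewhere in this thread.

```python
# Hypothetical single source of truth: algo -> gridable parameter names.
GRIDABLE = {
    "gam": ["scale", "num_knots", "spline_orders", "bs"],
    "stackedensemble": ["seed"],
}


def render_rst_list(algo, table=GRIDABLE):
    """Render one algo's gridable parameters as an RST bullet list."""
    names = table.get(algo, [])
    if not names:
        # Covers algos that are missing from the table or have no entries.
        return "Grid search parameters are not listed for this algorithm."
    return "\n".join(f"-  ``{name}``" for name in names)


if __name__ == "__main__":
    print(render_rst_list("gam"))
```

Both pages would call the same function, so a correction to the table shows up in both places.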

wendycwong · 2022-09-16T14:32:49Z

Adam:

I agree with you that we need to put in some effort to help Hannah finish her work. Having a common parameters section for all algos is great. Then, for each algo, we list the algo-specific parameters.

Regarding grid search, my first confusion is that there is no consistency. Some parameters are gridable in one algo but not gridable in another. I think grid search parameters can also be divided into two sections: parameters common to all algos, then algorithm-specific parameters.

The problem here is that we need to help Hannah figure this out. I know all of us are busy, but somehow this needs to be done.

W
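The two-section split proposed above can be illustrated with a set intersection. This is a minimal sketch under stated assumptions: the algorithm names and parameter sets in `GRIDABLE` are placeholders rather than real H2O gridability data, and `split_common` is a hypothetical helper.

```python
# Placeholder data: algo -> set of gridable parameter names (not real H2O facts).
GRIDABLE = {
    "algo_a": {"seed", "alpha", "max_iterations"},
    "algo_b": {"seed", "ntrees", "max_depth"},
    "algo_c": {"seed", "max_iterations"},
}


def split_common(table):
    """Split gridable params into those shared by every algo and the rest."""
    common = set.intersection(*table.values()) if table else set()
    specific = {
        algo: sorted(params - common) for algo, params in table.items()
    }
    return sorted(common), specific


if __name__ == "__main__":
    common, specific = split_common(GRIDABLE)
    print("common:", common)
    for algo, names in specific.items():
        print(algo, names)
```

A parameter that is gridable in some algos but not all (here `max_iterations`) ends up in the algorithm-specific lists, which is the inconsistency the comment above points out.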

michalkurka · 2022-09-16T18:13:44Z

agreed @wendycwong - let's make this a priority after the 3.38 release

PUBDEV-8461: first draft standardizing params

a92c4d5

excludes: AutoML, Infogram, Explain, Target Encoding, TF-IDF, Word2vec, Permutation Varimp

hannah-tillman added the docs label Jul 15, 2022

hannah-tillman requested review from ledell, michalkurka, arunaryasomayajula and narasimhard July 15, 2022 17:13

tomasfryda reviewed Jul 18, 2022

h2o-docs/src/product/data-science/stacked-ensembles.rst (resolved)

narasimhard reviewed Jul 18, 2022

h2o-docs/src/product/data-science/gam.rst (outdated, resolved)

requested updates

2a5ded8

SE `seed` hyperparam; spline_orders update

hannah-tillman requested a review from narasimhard July 20, 2022 14:21

requested spline_orders update

afd09a7

hannah-tillman requested review from valenad1, wendycwong and tomasfryda September 12, 2022 12:25

hannah-tillman added the 4RELEASE label Sep 12, 2022

added hyperparam note

6167be7

hannah-tillman requested a review from maurever September 12, 2022 14:07

tomasfryda approved these changes Sep 12, 2022

valenad1 requested changes Sep 12, 2022

h2o-docs/src/product/data-science/eif.rst (resolved)

h2o-docs/src/product/data-science/if.rst (resolved)

h2o-docs/src/product/data-science/drf.rst (resolved)

requested reorder for tree algos

f3b015b

hannah-tillman requested a review from valenad1 September 12, 2022 17:55

wendycwong requested changes Sep 12, 2022

hannah-tillman added 2 commits September 12, 2022 13:32

requested gam updates

eb1f53c

shifting gam non-griddables

db6c2b8

hannah-tillman requested a review from wendycwong September 12, 2022 18:37

wendycwong requested changes Sep 12, 2022

moved seed & compute_metrics to common

b968460

hannah-tillman requested a review from wendycwong September 13, 2022 12:41

wendycwong requested changes Sep 13, 2022

gam x update

9994c49

hannah-tillman requested a review from wendycwong September 13, 2022 22:35

hannah-tillman added 2 commits September 14, 2022 10:16

added new training checkpoint params to gbm

091e8b3

added new VIF/dispersion params for GLM

8fc5299

michalkurka removed the 4RELEASE label Sep 19, 2022

maurever removed their request for review October 24, 2022 14:59

hannah-tillman closed this Apr 27, 2023

hannah-tillman deleted the PUBDEV-8461 branch April 27, 2023 19:18

h2o-ops-ro mentioned this pull request May 14, 2023

Documentation for Generalized Additive Models (GAM) could be improved #7200

Closed

h2o-ops mentioned this pull request May 14, 2023

Restructure algorithm pages parameters section in User Guide #7600

Closed
