PUBDEV-8461: Standardizing and improving algorithm parameter sections #6251

Closed
wants to merge 11 commits into from

@hannah-tillman hannah-tillman commented Jul 15, 2022

For: PUBDEV-8461 & PUBDEV-8049

There are three goals within this PR:

  1. Restructure the parameters
  2. Expand and clarify the values of parameters
  3. Standardize the style

Restructuring: I separated the parameters into common and hyperparameters. The order now follows more closely the R documentation order of importance. I lead with all required params. I went through the schemas to find gridable=True for params that don't have a page in the appendix to figure out whether they were hyperparams or not. Please let me know if any of these are incorrect.
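A minimal sketch of the split described above, assuming each parameter's schema entry exposes a boolean `gridable` field. The `SCHEMA` entries and the `split_by_gridable` helper are hypothetical placeholders for illustration, not real H2O schema output.

```python
# Hypothetical parameter metadata; real entries would come from an algorithm's schema.
SCHEMA = [
    {"name": "param_a", "gridable": False},
    {"name": "param_b", "gridable": True},
    {"name": "param_c", "gridable": True},
    {"name": "param_d"},  # a missing flag is treated as not gridable (assumption)
]

def split_by_gridable(schema):
    """Return (common, hyperparameters) name lists based on the gridable flag."""
    common = [p["name"] for p in schema if not p.get("gridable", False)]
    hyper = [p["name"] for p in schema if p.get("gridable", False)]
    return common, hyper

common, hyper = split_by_gridable(SCHEMA)
print("Common parameters:", common)
print("Hyperparameters:", hyper)
```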

Expanding: I compared the information in the Python & R docs to what was in the user guide to expand on some params that were lacking info. I also got some outside help :)

Standardizing: Because so many different hands have written these parameter lists, there was not a lot of cohesion on their style. I started to standardize it here (e.g. made all "input-able" values code backticks instead of bolded or bare and created vertical lists when four or more objects were listed in a row).

I would appreciate any and all input. I would especially appreciate algorithm owners double-checking to make sure I got the hyperparameters correct. Please let me know if you have any questions or critiques!

I've included a screenshot of what Aggregator looks like when built to get a feel of what the initial idea looks like:
Screen Shot 2022-07-15 at 12 10 12 PM

note: I'm not including Infogram, AutoML, Model Explainability, or miscellaneous algos in this PR since they're structured a little differently. After this initial batch gets solidified, I will make a new PR for changes in those if & where needed.

excludes: AutoML, Infogram, Explain, Target Encoding, TF-IDF, Word2vec, Permutation Varimp
SE `seed` hyperparam; spline_orders update

@tomasfryda tomasfryda left a comment

Stacked Ensemble looks good. Thank you @hannah-tillman !


- For a regression model, this column must be numeric (**Real** or **Int**).
- For a classification model, this column must be categorical (**Enum** or **String**). If the family is **Binomial**, the dataset cannot contain more than two levels.
- For a classification model, this column must be categorical (**Enum** or **String**). If the family is ``Binomial``, the dataset cannot contain more than two levels.

Please change:

If the family is "Binomial", the dataset cannot contain more than two levels

to

If the family is "Binomial", the dataset must contain two levels
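To make the "must contain two levels" wording concrete, here is a small sketch of the check it implies. `check_binomial_response` is a hypothetical helper, not H2O's actual validation, and it assumes missing values (`None`) are ignored when counting levels.

```python
def check_binomial_response(values):
    """Return the two response levels, or raise if there are not exactly two.

    Assumption: missing values (None) do not count as a level.
    """
    levels = sorted({str(v) for v in values if v is not None})
    if len(levels) != 2:
        raise ValueError(
            f"binomial response needs exactly 2 levels, found {len(levels)}: {levels}"
        )
    return levels

print(check_binomial_response(["yes", "no", "yes", None]))
```

A single-level column fails this check too, which "cannot contain more than two levels" would have let through.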


- `remove_collinear_columns <algo-params/remove_collinear_columns.html>`__: Specify whether to automatically remove collinear columns during model-building. When enabled, collinear columns will be dropped from the model and will have 0 coefficient in the returned model. This can only be set if there is no regularization (lambda=0). This option is defaults to false (not enabled).
- `interactions <algo-params/interactions.html>`__: Specify a list of predictor column indices to interact. All pairwise combinations will be computed for this list.

Please add for interaction:

Interactions with and among gamified columns are not supported at the moment.
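For readers unsure what "all pairwise combinations" means for `interactions`, the sketch below only enumerates the column pairs; it does not build the interaction features themselves. The column names are hypothetical.

```python
from itertools import combinations

def pairwise_interactions(columns):
    """List every unordered pair of the given columns, in input order."""
    return list(combinations(columns, 2))

# Three hypothetical predictor columns give 3 * 2 / 2 = 3 pairs.
pairs = pairwise_interactions(["C1", "C2", "C3"])
print(pairs)
```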

- `prior <algo-params/prior.html>`__: Specify prior probability for p(y==1). Use this parameter for logistic regression if the data has been sampled and the mean of response does not reflect reality. This value must be a value in the range (0,1) or set to -1 (disabled). This option is set to -1 (disabled) by default.

**Note**: This is a simple method affecting only the intercept. You may want to use weights and offset for a better fit.
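As a rough illustration of why `prior` affects only the intercept, the sketch below applies the standard prior correction for logistic regression fitted on sampled data. This is an assumption about the general technique, not a statement of H2O's exact implementation; `prior_corrected_intercept` and its arguments are hypothetical names.

```python
import math

def prior_corrected_intercept(intercept, sample_mean, prior):
    """Shift a logistic-regression intercept from the sampled response mean to the true prior.

    intercept:   intercept fitted on the sampled data
    sample_mean: mean of the 0/1 response in the sampled data, in (0, 1)
    prior:       assumed true p(y == 1), in (0, 1)
    """
    if not (0.0 < sample_mean < 1.0) or not (0.0 < prior < 1.0):
        raise ValueError("sample_mean and prior must lie strictly between 0 and 1")
    correction = math.log(((1.0 - prior) / prior) * (sample_mean / (1.0 - sample_mean)))
    return intercept - correction

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Balanced sample (mean 0.5), but the event is assumed to occur 10% of the time.
adjusted = prior_corrected_intercept(0.0, sample_mean=0.5, prior=0.1)
print(sigmoid(adjusted))
```

The slope coefficients are untouched, which is why the note suggests weights and offset when a better fit is needed.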
These parameters can be used in grid search.


Please also add the following parameters to the grid search

scale
num_knots
spline_order
bs

Contributor Author

All of these are already under the grid search section :)
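For context on what being "under the grid search section" implies, here is a sketch of how an exhaustive (Cartesian) grid expands a hyperparameter dictionary into individual model configurations. It does not call H2O; `expand_grid` is a hypothetical helper and the values for `scale` and `num_knots` are made up for illustration.

```python
from itertools import product

def expand_grid(hyper_params):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Hypothetical search space: two candidate values for each of two parameters.
grid = expand_grid({
    "scale": [[0.1], [1.0]],
    "num_knots": [[5], [10]],
})
for config in grid:
    print(config)
```

Two values for each of two parameters give four configurations, so every extra gridable parameter multiplies the number of models built.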


- `custom_metric_func <algo-params/custom_metric_func.html>`__: Optionally specify a custom evaluation function.
- `standardize <algo-params/standardize.html>`__: Specify whether to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option defaults to ``True`` (enabled).

Standardize is not griddable
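As a reminder of what the `standardize` description means by "a mean of zero and unit variance", the sketch below standardizes a single numeric column. It assumes the population standard deviation; the quoted text does not say which estimator H2O uses, and `standardize` here is a hypothetical helper rather than the library's code.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a numeric column to mean 0 and (population) standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# The same measurements in metres and millimetres standardize to the same values,
# so the unit of measure no longer dominates the variance.
metres = [1.0, 2.0, 3.0]
millimetres = [1000.0, 2000.0, 3000.0]
print(standardize(metres))
print(standardize(millimetres))
```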


- **max_predictor_number**: Maximum number of predictors to be considered when building GLM models. Defaults to 1.
- `plug_values <algo-params/plug_values.html>`__: When ``missing_values_handling="PlugValues"``, specify a single row frame containing values that will be used to impute missing values of the training/validation frame.

plug_values is not griddable


- **min_predictor_number**: For ``mode = "backward"`` only. Minimum number of predictors to be considered when building GLM models starting with all predictors to be included. Defaults to ``1``.
- `max_iterations <algo-params/max_iterations.html>`__: Specify the number of training iterations (defaults to ``-1``).

max_iterations not griddable

- `max_iterations <algo-params/max_iterations.html>`__: Specify the number of training iterations. The value must be between 1 and 1e6 and the default is ``1000``.

- `compute_metrics <algo-params/compute_metrics.html>`__: Enable metrics computations on the training data. This option defaults to ``True`` (enabled).

compute_metrics is not gridable.

- `compute_metrics <algo-params/compute_metrics.html>`__: Enable metrics computations on the training data. This option defaults to ``True`` (enabled).

- `seed <algo-params/seed.html>`__: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This value defaults to ``-1`` (time-based random number).

seed is not gridable

I got scared that we have a bug in SE (we have seed as gridable) but after short inspection I found that seed is usually gridable except for PCA, SVD, and Aggregator. I'm mentioning this just to avoid confusion.


- For a regression model, this column must be numeric (**Real** or **Int**).
- For a classification model, this column must be categorical (**Enum** or **String**). If the family is **Binomial**, the dataset cannot contain more than two levels.
- For a classification model, this column must be categorical (**Enum** or **String**). If the family is ``Binomial``, the dataset must contain two levels.

- `x <algo-params/x.html>`__: Specify a vector containing the names or indices of the predictor variables to use when building the model. If ``x`` is missing, then all columns except ``y`` are used.

@hannah-tillman For GAM, if x is missing, then no predictors will be used. This is different from other algos. Please change the description.
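The difference can be sketched as follows. `resolve_predictors` is a hypothetical helper based only on the comment above, not H2O's code, and the column names are made up.

```python
def resolve_predictors(columns, y, x=None, algo="glm"):
    """Return the predictor columns used when x is or is not supplied.

    Assumption from the review comment: GAM uses no predictors when x is
    missing, while other algos fall back to every column except y.
    """
    if x is not None:
        return list(x)
    if algo == "gam":
        return []
    return [c for c in columns if c != y]

cols = ["age", "income", "label"]
print(resolve_predictors(cols, y="label", algo="glm"))
print(resolve_predictors(cols, y="label", algo="gam"))
```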

@valenad1

Extended Isolation Forest ✅

@valenad1

I have mixed feelings about the parameter sections. No doubt that we need to put an order in it 💯

When I think about the Common parameters and Hyperparameters sections in general, my understanding is that Common parameters should contain only the parameters shared by all available algos. Is that correct? Without a doubt, common parameters are model_id, x, training_frame, y (if the algo is supervised), ... What bothers me are parameters like contamination (Isolation Forest), pca_method (PCA), ... because they are, to say the least, algo-specific and IMHO belong in the hyperparameter section (or some algo-specific hyperparameters section).

The second thought I have is that we already have this page:

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html#supported-grid-search-hyperparameters

about the supported grid search hyperparameters. I would rather see that page updated (e.g. I forgot to add an Extended Isolation Forest section there) and its information included in the algo pages than copy this information under Hyperparameters, because a hyperparameter is not the same as a "gridable" parameter, and not all algos support grid search (e.g. Uplift RF).

I would suggest removing "These parameters can be used in grid search." and instead updating and segmenting the grid search page so that the supported grid search parameters can be included inside the algo documentation pages. We can show the information twice, but we should only have to write it in one place.

What do you think?

@wendycwong

Adam:

I agree with you that we need to put in some effort to help Hannah finish her work. Having a common parameters section for all algos is great. Then, for each algo, we list out the algo specific parameters.

Regarding grid search, my first confusion is that there is no consistency. Some parameters are gridable in one algo but not gridable in another. I think grid search parameters can also be divided into two sections: parameters common to all algos, then algorithm-specific parameters.

The problem here is that we need to help Hannah figure this out. I know all of us are busy, but somehow this needs to be done.

W

@michalkurka

agreed @wendycwong - let's make this a priority after the 3.38 release
