AutoML: offer the possibility to specify the order in which training steps will be executed #8793

Closed
exalate-issue-sync bot opened this issue May 12, 2023 · 3 comments

Comments

@exalate-issue-sync

After discussing Epsilon's needs with [~accountid:557058:9328661f-241f-4a0f-9d9a-d4e78ef05ba0] and [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] regarding case 94685, we decided, for now, to let AutoML users specify the order in which training steps are executed.
This can be done at a coarse-grained level (order of default algos, default grids):

  • XGB_defaults, GBM_defaults, … XGB_grid, …

Or at a more fine-grained level (order of each hardcoded model):

  • XGB_default_1, XGB_default_2, …, GBM_def_1, …

h1. Proposal

The suggested parameter name for this specification is {{modeling_plan}}.

Here is the suggested JSON representation to specify those steps in an ordered way:

{code:json}[
{"name":"XGBoost", "steps":[{"id":"def_1"}, {"id":"def_2"}, {"id":"def_3"}]},
{"name":"GLM"},
{"name":"DRF", "alias":"all"},
{"name":"GBM", "alias":"defaults"},
{"name":"XRT"},
{"name":"XGBoost", "steps":[{"id":"grid_1"}]},
{"name":"GBM", "alias":"grids"},
{"name":"StackedEnsemble", "steps":[{"id":"best"}, {"id":"all"}]}
]{code}

Unfortunately, JSON doesn’t guarantee that the order of object keys is preserved, so we can’t use a JSON object for this and have to express the ordering with arrays only.

The semantics of the example above are as follows:

  • start with the {{XGBoost}} algorithm, but train only the hardcoded models with ids {{def_1}}, {{def_2}}, {{def_3}}, in the given order.
  • then train all the {{GLM}} models (default models and/or grids), followed by all the {{DRF}} models (using the alias {{all}} in the latter case).
  • then train all the default {{GBM}} models (using the alias {{defaults}} to avoid typing all the model ids explicitly).
  • then train all the {{XRT}} models.
  • then train the {{XGBoost}} step with id {{grid_1}} (probably a grid…).
  • then train all the {{GBM}} grids (using the alias {{grids}} to avoid listing them explicitly).
  • then train the {{StackedEnsemble}} models with ids {{best}} and {{all}}, in this order.
  • the {{DeepLearning}} algo isn’t mentioned in this example, so it will be skipped.
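
To make these semantics concrete, here is a minimal Python sketch that expands a plan written in the JSON form into a flat, ordered list of steps. It is not the AutoML implementation: the {{REGISTRY}} dictionary, the step ids it contains and the {{expand_plan}} function are all hypothetical names introduced for illustration, and the sketch assumes that an entry without {{steps}} or {{alias}} behaves like the alias {{all}}.

```python
# Hypothetical registry of step ids per algo (illustrative values only,
# not the real ids used by the AutoML backend).
REGISTRY = {
    "XGBoost": {"defaults": ["def_1", "def_2", "def_3"], "grids": ["grid_1"]},
    "GLM": {"defaults": ["def_1"], "grids": []},
    "DRF": {"defaults": ["def_1"], "grids": []},
    "GBM": {"defaults": ["def_1", "def_2"], "grids": ["grid_1"]},
    "XRT": {"defaults": ["def_1"], "grids": []},
    "DeepLearning": {"defaults": ["def_1"], "grids": ["grid_1"]},
    "StackedEnsemble": {"defaults": ["best", "all"], "grids": []},
}


def expand_plan(plan, registry=REGISTRY):
    """Expand a plan (JSON form) into an ordered list of (algo, step_id)."""
    ordered = []
    for entry in plan:
        algo = entry["name"]
        known = registry[algo]
        if "steps" in entry:
            # Explicit ids: keep exactly the order given by the user.
            ids = [step["id"] for step in entry["steps"]]
        else:
            # Assumption: no alias means "all" (defaults first, then grids).
            alias = entry.get("alias", "all")
            if alias == "all":
                ids = known["defaults"] + known["grids"]
            else:
                ids = list(known[alias])
        ordered.extend((algo, step_id) for step_id in ids)
    return ordered


example_plan = [
    {"name": "XGBoost", "steps": [{"id": "def_1"}, {"id": "def_2"}, {"id": "def_3"}]},
    {"name": "GLM"},
    {"name": "DRF", "alias": "all"},
    {"name": "GBM", "alias": "defaults"},
    {"name": "XRT"},
    {"name": "XGBoost", "steps": [{"id": "grid_1"}]},
    {"name": "GBM", "alias": "grids"},
    {"name": "StackedEnsemble", "steps": [{"id": "best"}, {"id": "all"}]},
]

steps = expand_plan(example_plan)
for algo, step_id in steps:
    print(algo, step_id)
```

Since {{DeepLearning}} never appears in {{example_plan}}, none of its steps end up in the expanded list, which matches the last bullet above.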

If an algo or a model id (e.g. {{def_3}}) is present in this order specification but no longer exists in the new {{AutoML}} version, it will be ignored with a warning message.
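
The "ignore with a warning" behaviour could look roughly like the following Python sketch. The {{KNOWN_STEPS}} table and the {{drop_unknown}} function are hypothetical, and dropping an entry whose explicit steps are all unknown is an assumption of this sketch rather than something the proposal specifies.

```python
import warnings

# Hypothetical table of the step ids still supported by the current version.
KNOWN_STEPS = {
    "XGBoost": {"def_1", "def_2", "grid_1"},
    "GLM": {"def_1"},
}


def drop_unknown(plan, known_steps=KNOWN_STEPS):
    """Return a copy of the plan without unknown algos or step ids."""
    kept = []
    for entry in plan:
        algo = entry["name"]
        if algo not in known_steps:
            warnings.warn(f"Unknown algo '{algo}' in modeling plan: ignored.")
            continue
        if "steps" not in entry:
            # Alias-based or bare entries carry no explicit ids to validate.
            kept.append(entry)
            continue
        valid = []
        for step in entry["steps"]:
            if step["id"] in known_steps[algo]:
                valid.append(step)
            else:
                warnings.warn(
                    f"Unknown step '{step['id']}' for algo '{algo}': ignored."
                )
        # Assumption: an entry left with no valid step is dropped entirely.
        if valid:
            kept.append({**entry, "steps": valid})
    return kept


old_plan = [
    {"name": "XGBoost", "steps": [{"id": "def_1"}, {"id": "def_3"}]},
    {"name": "Foo"},
    {"name": "GLM"},
]
cleaned = drop_unknown(old_plan)
```

Here {{def_3}} and the made-up algo {{Foo}} each trigger one warning, and the rest of the plan is kept in its original order.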

The representation is also easily extensible: we can add new algos, new default models, new grids, new hyperparameter search methods…

If the user also specifies the {{exclude_algos}} parameter, it applies on top of the order specification: this lets the user keep the specification in one variable, without having to change it later. For example, {{exclude_algos=["XRT"]}} in combination with {{modeling_plan=the_example_above}} will execute the steps defined in the example except {{XRT}}. The same applies when using {{include_algos}} instead.
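
As an illustration of how these filters could sit on top of the plan without modifying it, here is a small Python sketch. The {{apply_algo_filters}} function is hypothetical, and rejecting the case where both parameters are given is an assumption of the sketch.

```python
def apply_algo_filters(plan, exclude_algos=None, include_algos=None):
    """Filter a plan (JSON form) by algo name, preserving its order.

    The input plan is left untouched, so the same variable can be reused
    with different filters.
    """
    # Assumption: the two parameters are treated as mutually exclusive.
    if exclude_algos is not None and include_algos is not None:
        raise ValueError("Use either exclude_algos or include_algos, not both.")
    if include_algos is not None:
        allowed = set(include_algos)
        return [entry for entry in plan if entry["name"] in allowed]
    excluded = set(exclude_algos or [])
    return [entry for entry in plan if entry["name"] not in excluded]


modeling_plan = [
    {"name": "XGBoost", "steps": [{"id": "def_1"}]},
    {"name": "XRT"},
    {"name": "GBM", "alias": "grids"},
]

# Same plan, minus the XRT entry; modeling_plan itself is unchanged.
filtered = apply_algo_filters(modeling_plan, exclude_algos=["XRT"])
```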

After running {{AutoML}}, the detailed {{modeling_steps}} specification (with all step ids) will be available from the automl instance so that the user can save it for later use.

Python representation examples (lists or tuples can be used):

{code:python}# the JSON example translated to Python using simple syntax:
modeling_plan=[
('XGBoost', ['def_1', 'def_2', 'def_3']),
'GLM',  # a bare algo name; note that ('GLM') without a trailing comma is just the string 'GLM'
('DRF', 'all'),
('GBM', 'defaults'),
'XRT',
('XGBoost', ['grid_1']),
('GBM', 'grids'),
('StackedEnsemble', ['best', 'all'])
]

# specify only the algos ordering: in this case it will always execute
# all default models first (if any),
# immediately followed by the algo grids (if any):
modeling_plan=['XGBoost', 'GLM', 'DRF', 'GBM', 'DeepLearning']

# specify only the algos order, making the distinction between default models and grids
# (the order of each individual model is the default one defined by the backend):

modeling_plan=[
('XGBoost', 'defaults'),
('GLM', 'grids'),
('DRF', 'defaults'),
('GBM', 'defaults'),
('XGBoost', 'grids'),
('GBM', 'grids'),
('StackedEnsemble', 'all')
]{code}
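
One way the simple Python syntax could be mapped onto the JSON form is sketched below. This is only an illustration: {{normalize_plan}} is a hypothetical helper, not a client function, and it assumes each entry is either a bare algo name or a (name, alias-or-id-list) pair.

```python
import json


def normalize_plan(plan):
    """Convert the simple Python syntax into the JSON-style list of dicts."""
    normalized = []
    for entry in plan:
        if isinstance(entry, str):
            # Bare algo name, e.g. 'XRT' (or ('GLM'), which is the same string).
            normalized.append({"name": entry})
            continue
        name, *rest = entry
        if not rest:
            # One-element list or tuple, e.g. ('GLM',).
            normalized.append({"name": name})
        elif isinstance(rest[0], str):
            # Second element is an alias such as 'all', 'defaults' or 'grids'.
            normalized.append({"name": name, "alias": rest[0]})
        else:
            # Second element is an ordered collection of step ids.
            normalized.append(
                {"name": name, "steps": [{"id": step_id} for step_id in rest[0]]}
            )
    return normalized


simple_plan = [
    ('XGBoost', ['def_1', 'def_2', 'def_3']),
    ('GLM'),
    ('DRF', 'all'),
    'XRT',
]
print(json.dumps(normalize_plan(simple_plan)))
```

Because the result is a list of dictionaries, serializing it with {{json.dumps}} keeps the entries in the order the user wrote them.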

And an equivalent representation in R:

{code:r}modeling_plan=list(
list(name='XGBoost', steps=c('def_1', 'def_2', 'def_3')),
list(name='GLM'),
list(name='DRF', alias='all'),
list(name='GBM', alias='defaults'),
'XRT',
list(name='XGBoost', steps=c('grid_1')),
list(name='GBM', alias='grids'),
list(name='StackedEnsemble', steps=c('best', 'all'))
){code}

@exalate-issue-sync

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] [~accountid:557058:9328661f-241f-4a0f-9d9a-d4e78ef05ba0] this is the new ticket for training order specification.

Please have a look at the detailed proposal.

@exalate-issue-sync

Ruslan Dautkhanov commented: Thank you Erin and Sebastien

@h2o-ops

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-6840
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.28.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#3867
