Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end #1612

Merged: 7 commits, Dec 23, 2021

Conversation

@amholler (Collaborator) commented Dec 21, 2021

Update hyperopt as discussed recently in the AutoML community.

  • Choose best model from Validation data
  • For stopped Ray Tune trials, run evaluate at search end
  • Set epochs appropriately when max_t is expressed in seconds
  • Handle case where no trial reports any results [rare]

Details for the first item:

  • Update the Ludwig hyperopt operation to compute metric_score on the validation stats produced during training.
  • This means that the best model will be chosen based on validation statistics, rather than on the test
    statistics corresponding to the best validation statistics, as is done currently.
  • An error will be reported if no validation set is provided.
  • N.B.:
    ** The metric_score is constrained to stats computed during training; it cannot be drawn from
    stats that are only computed during the post-train overall stats model evaluation. If needed,
    additional stats can be added to those computed during training.
    ** The post-train overall stats evaluation for the best model is computed on the validation set.
    An overall stats evaluation of the best model on the test set can be performed by the
    user as a separate step after the hyperparameter optimization job completes.
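
To make the first item concrete, here is a minimal sketch of validation-based trial scoring. This is not Ludwig's actual implementation; the function name, the assumed train_stats layout (split -> output feature -> metric -> per-epoch values), and the lower-is-better assumption are all illustrative.

    TRAINING, VALIDATION, TEST = "training", "validation", "test"

    def get_metric_score(train_stats, output_feature, metric, split=VALIDATION):
        """Score a trial on stats computed during training for the given split.
        Sketch only: assumes train_stats[split][output_feature][metric] is a list
        of per-epoch values and that lower is better (e.g. a loss)."""
        if split not in train_stats or not train_stats[split]:
            raise ValueError("A validation set is required to score hyperopt trials.")
        epoch_values = train_stats[split][output_feature][metric]
        return min(epoch_values)

    # Picking the best trial on validation score rather than test score (illustrative):
    # best_trial = min(trials, key=lambda t: get_metric_score(t["training_stats"], "out", "loss"))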

Details for the second item:

  • For Ray Tune trials that are stopped before training/evaluation completes, load
    and evaluate the trial's best model after the overall Ray Tune run completes.
  • Currently, Ludwig calls tune.report to report a trial's intermediate results, which include
    training_stats set to train_stats[TRAINING] and eval_stats set to train_stats[VALIDATION].
    When the trial's training completes normally, it then calls tune.report to report the
    trial's final results, which include training_stats set to all 3 train_stats
    (train_stats[TRAINING], train_stats[VALIDATION], train_stats[TEST]) and eval_stats set to
    the output of running an overall stats evaluation of the trial's best model on the eval_set.
  • For Ray Tune trials that are stopped before training/evaluation completes, the final
    tune.report is never executed, meaning that the overall stats evaluation is not computed
    and reported as eval_stats; train_stats[TEST] is also not reported in training_stats.
  • This PR changes the intermediate tune.report calls so that training_stats includes all 3
    train_stats (train_stats[TRAINING], train_stats[VALIDATION], train_stats[TEST]), with eval_stats
    set to empty. When the overall Ray Tune run completes, for any stopped trials, it loads
    and evaluates the trial's best model, setting eval_stats for that trial in ordered_trials,
    which is returned and persisted in hyperopt_statistics.json.
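
A condensed sketch of the reporting flow described above. tune.report is Ray Tune's real API; everything else (the helper names, the shape of ordered_trials entries) is a placeholder standing in for Ludwig's internals, not the actual code.

    from ray import tune

    TRAINING, VALIDATION, TEST = "training", "validation", "test"

    def report_intermediate_results(trial_params, metric_score, train_stats):
        # Intermediate reports now carry all 3 train_stats splits; eval_stats stays
        # empty and is back-filled after the search for trials that were stopped early.
        tune.report(
            parameters=trial_params,
            metric_score=metric_score,
            training_stats={
                TRAINING: train_stats.get(TRAINING, {}),
                VALIDATION: train_stats.get(VALIDATION, {}),
                TEST: train_stats.get(TEST, {}),
            },
            eval_stats={},
        )

    def backfill_stopped_trials(ordered_trials, load_best_model, evaluate_fn, eval_set):
        # After the overall Ray Tune run completes: for any trial stopped before its
        # final report, load its best model and fill in eval_stats so that it is
        # persisted in hyperopt_statistics.json.
        for trial in ordered_trials:
            if not trial.get("eval_stats"):
                model = load_best_model(trial)                       # placeholder helper
                trial["eval_stats"] = evaluate_fn(model, eval_set)   # placeholder helper
        return ordered_trials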

@github-actions bot commented Dec 21, 2021

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 33m 5s ⏱️ (−10m 41s)
1 216 tests ±0 · 1 192 ✔️ ±0 · 24 💤 ±0 · 0 ±0
3 648 runs ±0 · 3 576 ✔️ ±0 · 72 💤 ±0 · 0 ±0

Results for commit 18adfeb. ± Comparison against base commit 2c31bc0.

♻️ This comment has been updated with latest results.

@amholler changed the title from "For stopped Ray Tune trials, run evaluate at search end" to "Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end" on Dec 22, 2021
The review comments below were left on this excerpt of the changed hyperopt execution code:

            debug=debug
        )
        trial['eval_stats'] = json.dumps(eval_stats, cls=NumpyEncoder)
    except NotImplementedError:
Collaborator:

Curious, when does this actually happen?

Collaborator (Author):

It doesn't actually happen the way we run by default, which is to use the local backend to do batch evaluation.

If one tries to use the ray backend to do batch evaluation, control goes through this code:
    def batch_evaluation(self, model, dataset, collect_predictions=False, **kwargs):
        raise NotImplementedError(
            'Ray backend does not support batch evaluation at this time.'
        )

@w4nderlust (Collaborator) left a comment:

Thanks for the PR, @amholler!
