Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end #1612

Merged: 7 commits, Dec 23, 2021

Conversation

@amholler (Collaborator) commented Dec 21, 2021

Update hyperopt as discussed recently in the AutoML community.

  • Choose best model from Validation data
  • For stopped Ray Tune trials, run evaluate at search end
  • Set epochs appropriately when max_t is expressed in seconds
  • Handle case where no trial reports any results [rare]

Details for the first item:

  • Update the Ludwig hyperopt operation to compute metric_score on the validation stats produced during training.
  • This means that the best model will be chosen based on validation statistics, rather than on the test
    statistics corresponding to the best validation statistics, as is done currently.
  • An error will be reported if no validation set is provided.
  • N.B.:
    ** The metric_score is constrained to stats computed during training; it cannot be drawn from
    stats that are only computed during the post-train overall stats model evaluation. If needed,
    additional stats can be added to those computed during training.
    ** The post-train overall stats evaluation for the best model is computed on the validation set.
    An overall stats evaluation of the best model on the test set can be performed by the
    user as a separate step after the hyperparameter optimization job completes.
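
To make the first item concrete, here is a minimal sketch of validation-based trial scoring. This is not Ludwig's actual implementation; the function name, the assumed train_stats layout (split -> output feature -> metric -> per-epoch values), and the lower-is-better assumption are all illustrative.

    TRAINING, VALIDATION, TEST = "training", "validation", "test"

    def get_metric_score(train_stats, output_feature, metric, split=VALIDATION):
        """Score a trial on stats computed during training for the given split.
        Sketch only: assumes train_stats[split][output_feature][metric] is a list
        of per-epoch values and that lower is better (e.g. a loss)."""
        if split not in train_stats or not train_stats[split]:
            raise ValueError("A validation set is required to score hyperopt trials.")
        epoch_values = train_stats[split][output_feature][metric]
        return min(epoch_values)

    # Picking the best trial on validation score rather than test score (illustrative):
    # best_trial = min(trials, key=lambda t: get_metric_score(t["training_stats"], "out", "loss"))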

Details for the second item:

  • For Ray Tune trials that are stopped before training/evaluation completes, load
    and evaluate the trial's best model after the overall Ray Tune run completes.
  • Currently, Ludwig calls tune.report to report a trial's intermediate results, which include
    training_stats set to train_stats[TRAINING] and eval_stats set to train_stats[VALIDATION].
    When the trial's training completes normally, it then calls tune.report to report the
    trial's final results, which include training_stats set to all 3 train_stats
    (train_stats[TRAINING], train_stats[VALIDATION], train_stats[TEST]) and eval_stats set to
    the output of running an overall stats evaluation of the trial's best model on the eval_set.
  • For Ray Tune trials that are stopped before training/evaluation completes, the final
    tune.report is never executed, meaning that the overall stats evaluation is not computed
    and reported as eval_stats; train_stats[TEST] is also not reported in training_stats.
  • This PR changes the intermediate tune.report calls so that training_stats includes all 3
    train_stats (train_stats[TRAINING], train_stats[VALIDATION], train_stats[TEST]), with eval_stats
    set to empty. When the overall Ray Tune run completes, for any stopped trials, it loads
    and evaluates the trial's best model, setting eval_stats for that trial in ordered_trials,
    which is returned and persisted in hyperopt_statistics.json.
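
A condensed sketch of the reporting flow described above. tune.report is Ray Tune's real API; everything else (the helper names, the shape of ordered_trials entries) is a placeholder standing in for Ludwig's internals, not the actual code.

    from ray import tune

    TRAINING, VALIDATION, TEST = "training", "validation", "test"

    def report_intermediate_results(trial_params, metric_score, train_stats):
        # Intermediate reports now carry all 3 train_stats splits; eval_stats stays
        # empty and is back-filled after the search for trials that were stopped early.
        tune.report(
            parameters=trial_params,
            metric_score=metric_score,
            training_stats={
                TRAINING: train_stats.get(TRAINING, {}),
                VALIDATION: train_stats.get(VALIDATION, {}),
                TEST: train_stats.get(TEST, {}),
            },
            eval_stats={},
        )

    def backfill_stopped_trials(ordered_trials, load_best_model, evaluate_fn, eval_set):
        # After the overall Ray Tune run completes: for any trial stopped before its
        # final report, load its best model and fill in eval_stats so that it is
        # persisted in hyperopt_statistics.json.
        for trial in ordered_trials:
            if not trial.get("eval_stats"):
                model = load_best_model(trial)                       # placeholder helper
                trial["eval_stats"] = evaluate_fn(model, eval_set)   # placeholder helper
        return ordered_trials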

@github-actions bot commented Dec 21, 2021

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 33m 5s ⏱️ (−10m 41s)
1 216 tests ±0 · 1 192 ✔️ ±0 · 24 💤 ±0 · 0 ±0
3 648 runs ±0 · 3 576 ✔️ ±0 · 72 💤 ±0 · 0 ±0

Results for commit 18adfeb. ± Comparison against base commit 2c31bc0.

♻️ This comment has been updated with latest results.

@amholler changed the title from "For stopped Ray Tune trials, run evaluate at search end" to "Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end" on Dec 22, 2021
The review comments below were left on this excerpt of the changed hyperopt execution code:

            debug=debug
        )
        trial['eval_stats'] = json.dumps(eval_stats, cls=NumpyEncoder)
    except NotImplementedError:
Collaborator:

Curious, when does this actually happen?

Collaborator (Author):

It doesn't actually happen the way we run by default, which is to use the local backend to do batch evaluation.

If one tries to use the ray backend to do batch evaluation, control goes through this code:
    def batch_evaluation(self, model, dataset, collect_predictions=False, **kwargs):
        raise NotImplementedError(
            'Ray backend does not support batch evaluation at this time.'
        )

@w4nderlust (Collaborator) left a comment:

Thanks for the PR, @amholler!
