This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Improve recovery of preempted jobs #633

Merged
merged 13 commits into from
Jan 17, 2022
Conversation

ant0nsc
Contributor

@ant0nsc ant0nsc commented Jan 12, 2022

This PR autosaves checkpoints every epoch by default, always to a fixed file name. It retires the "top k" recovery checkpoint notion, because that was tied to specific models that needed more than one checkpoint.
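
To illustrate the idea of a single autosave file that is overwritten every epoch, here is a minimal, framework-free sketch. It is not the InnerEye or PyTorch Lightning implementation: the file name `autosave.json`, the helpers `save_autosave` and `load_autosave`, and the toy `train` loop are all hypothetical names chosen for this example, and the "model state" is just a small dictionary.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical fixed file name; the real name used by the PR is not shown here.
AUTOSAVE_FILE_NAME = "autosave.json"


def save_autosave(folder: Path, state: dict) -> Path:
    """Write the state to the single autosave file, replacing any older copy."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / AUTOSAVE_FILE_NAME
    # Write to a temporary file first, then rename, so that a preemption
    # in the middle of a write never leaves a half-written autosave file.
    fd, tmp_name = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_name, target)
    return target


def load_autosave(folder: Path) -> Optional[dict]:
    """Return the autosaved state, or None if no autosave file exists."""
    target = folder / AUTOSAVE_FILE_NAME
    if not target.is_file():
        return None
    return json.loads(target.read_text())


def train(folder: Path, num_epochs: int, stop_after: Optional[int] = None) -> dict:
    """Toy training loop that resumes from the autosave file if one is present.

    `stop_after` simulates a preemption after that many epochs in this run.
    """
    state = load_autosave(folder) or {"epoch": 0, "weight": 0.0}
    epochs_this_run = 0
    while state["epoch"] < num_epochs:
        if stop_after is not None and epochs_this_run >= stop_after:
            break
        # Stand-in for one epoch of real training.
        state = {"epoch": state["epoch"] + 1, "weight": state["weight"] + 0.5}
        # Autosave after every epoch, always to the same file name.
        save_autosave(folder, state)
        epochs_this_run += 1
    return state
```

Because the file name is fixed, a restarted job only needs to check for one file, and the checkpoint folder never accumulates more than one recovery checkpoint.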

Please follow the guidelines for PRs contained here. Checklist:

  • Ensure that your PR is small, and implements one change.
  • Add unit tests for all functions that you introduced or modified.
  • Run PyCharm's code cleanup tools on your Python files.
  • Link the correct GitHub issue for tracking.
  • Update the Changelog file: Describe your change in terms of
    Added/Changed/Removed/... in the "Upcoming" section.
  • When merging your PR, replace the default merge message with a description of your PR,
    and if needed a motivation why that change was required.

@ant0nsc ant0nsc enabled auto-merge (squash) January 13, 2022 14:13
InnerEye/ML/common.py (outdated, resolved)
Tests/ML/util.py (outdated, resolved)
Shruthi42 previously approved these changes Jan 13, 2022
InnerEye/ML/model_training.py (outdated, resolved)
@ant0nsc ant0nsc merged commit ccb53d0 into main Jan 17, 2022
@ant0nsc ant0nsc deleted the antonsc/recovery2 branch January 17, 2022 12:05
@ant0nsc ant0nsc linked an issue Jan 17, 2022 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Enable the new recovery functionality of PL 1.5
3 participants