Override max_epochs on resume from checkpoint with SSR #56

Closed
ajaysaini725 opened this issue Nov 2, 2021 · 1 comment

@ajaysaini725
Contributor

When resuming from a checkpoint, max_epochs currently defaults to the original max_epochs, which prevents users from training for more than the original max_epochs.

It would be good to be able to resume from a checkpoint and train for more epochs than the original max_epochs. However, we need a scheme that makes this work with scale_schedule_ratio (SSR), because scale schedule ratios are computed on the assumption that max_epochs does not change.

How should we go about handling this?
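
To make the conflict concrete, here is a minimal sketch of one way a schedule ratio could interact with max_epochs. The helper `scaled_schedule` and the rule "multiply the duration and every milestone by the ratio" are assumptions for illustration, not the library's actual implementation.

```python
def scaled_schedule(max_epochs, milestones, ssr):
    """Hypothetical helper: scale the total duration and each
    learning-rate milestone by the scale schedule ratio (ssr)."""
    scaled_max = int(max_epochs * ssr)
    scaled_milestones = [int(m * ssr) for m in milestones]
    return scaled_max, scaled_milestones


# Original run: 90 nominal epochs at ssr=0.5 trains for 45 real epochs.
original = scaled_schedule(90, [30, 60, 80], 0.5)
print(original)

# On resume, a user asks for max_epochs=60 to "train longer".
# If the ratio is applied again, the run would stop at 30 real epochs,
# which is fewer than the 45 already completed.
resumed = scaled_schedule(60, [30, 60, 80], 0.5)
print(resumed)
```

Under these assumptions, the new max_epochs is ambiguous: it could mean 60 real epochs or 60 nominal epochs before scaling, and the two readings give different stopping points.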

ravi-mosaicml added this to the Backlog milestone on Feb 15, 2022
@ravi-mosaicml
Contributor

Fixed via #445. The SSR and max duration are not persisted in a checkpoint, so whatever values are passed to the trainer upon checkpoint resume will be used.
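
The behavior described above can be sketched with a toy trainer. `ToyTrainer` and its methods are hypothetical stand-ins, not the real trainer API; the only point carried over from the comment is that the checkpoint holds progress but not the max duration or the ratio.

```python
class ToyTrainer:
    """Hypothetical trainer: duration settings come from the constructor,
    never from the checkpoint."""

    def __init__(self, max_epochs, scale_schedule_ratio=1.0):
        # Assumption: the ratio simply scales the total number of epochs.
        self.max_epochs = int(max_epochs * scale_schedule_ratio)
        self.epoch = 0

    def state_dict(self):
        # Only progress is saved; max_epochs and the ratio are left out.
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]

    def fit(self):
        while self.epoch < self.max_epochs:
            self.epoch += 1  # stand-in for one epoch of training


first = ToyTrainer(max_epochs=10, scale_schedule_ratio=0.5)
first.fit()
checkpoint = first.state_dict()

# Resume with a longer duration and a different ratio.
resumed = ToyTrainer(max_epochs=20, scale_schedule_ratio=1.0)
resumed.load_state_dict(checkpoint)
resumed.fit()
print(checkpoint, resumed.epoch)
```

In this sketch the resumed run continues from the saved epoch and stops at the newly supplied duration, because nothing in the checkpoint overrides the constructor arguments.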
