Override `max_epochs` on resume from checkpoint with SSR #56

ajaysaini725 · 2021-11-02T06:33:15Z

When resuming from a checkpoint, `max_epochs` currently defaults to the value from the original run, so users cannot train for more epochs than originally configured.

It would be good to be able to resume from a checkpoint and train for more epochs than the original `max_epochs`. However, we need a scheme that works with `scale_schedule_ratio` (SSR), because the scaled schedule is computed on the assumption that `max_epochs` does not change.

How should we go about handling this?
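
To make the concern concrete, here is a minimal sketch of how a schedule could shift when `max_epochs` changes on resume. It assumes SSR simply multiplies the run length and that the learning rate follows a linear decay over the scaled run. The helpers `scaled_max_epochs` and `linear_decay_factor` are hypothetical names for illustration, not the library's API.

```python
def scaled_max_epochs(max_epochs: int, ssr: float) -> int:
    """Hypothetical helper: run length after applying the scale schedule ratio."""
    return int(max_epochs * ssr)


def linear_decay_factor(epoch: int, total_epochs: int) -> float:
    """Assumed schedule: LR multiplier decays linearly from 1 to 0 over the run."""
    return max(0.0, 1.0 - epoch / total_epochs)


ssr = 0.5
original_total = scaled_max_epochs(90, ssr)   # length of the original run
extended_total = scaled_max_epochs(120, ssr)  # length after raising max_epochs

# Suppose the checkpoint was saved at this epoch of the original run.
resume_epoch = 40
before = linear_decay_factor(resume_epoch, original_total)
after = linear_decay_factor(resume_epoch, extended_total)

# The multiplier at the same epoch differs, so the schedule is discontinuous
# at the resume point unless it is recomputed consistently.
print(f"LR factor at epoch {resume_epoch}: {before:.3f} (original) vs {after:.3f} (extended)")
```

Under these assumptions, the LR multiplier jumps upward at the resume point, which is the kind of inconsistency a resume scheme would need to handle.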

ravi-mosaicml · 2022-02-24T07:10:46Z

Fixed via #445. The SSR and max duration are not persisted in a checkpoint, so whatever values are passed to the trainer upon checkpoint resume will be used.
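
A rough sketch of the behavior described above, assuming a checkpoint that stores only training progress. `TrainerArgs` and `resume` are hypothetical stand-ins for illustration and do not reflect the actual trainer code in #445.

```python
from dataclasses import dataclass


@dataclass
class TrainerArgs:
    """Hypothetical container for values passed to the trainer at resume time."""
    max_epochs: int
    scale_schedule_ratio: float = 1.0


def resume(checkpoint: dict, args: TrainerArgs) -> dict:
    """Rebuild run state: progress comes from the checkpoint, duration from args."""
    # Assumption: the checkpoint holds no max duration or SSR, so nothing
    # in it can override what the caller passes in.
    return {
        "epoch": checkpoint["epoch"],
        "max_epochs": int(args.max_epochs * args.scale_schedule_ratio),
    }


# A checkpoint saved at epoch 10 of a run originally configured for 10 epochs.
checkpoint = {"epoch": 10}

# Resuming with a larger max_epochs extends training past the original limit.
state = resume(checkpoint, TrainerArgs(max_epochs=20))
remaining = state["max_epochs"] - state["epoch"]
print(f"Resuming at epoch {state['epoch']}, {remaining} epochs remaining")
```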

ajaysaini725 added the question label Nov 2, 2021

hanlint assigned ajaysaini725 Dec 2, 2021

ravi-mosaicml removed the question label Jan 20, 2022

ravi-mosaicml assigned jbloxham and unassigned ajaysaini725 Feb 15, 2022

ravi-mosaicml added this to the Backlog milestone Feb 15, 2022

ravi-mosaicml closed this as completed Feb 24, 2022
