When resuming from a checkpoint, max_epochs currently defaults to the original max_epochs, which prevents users from training for more epochs than the original run allowed.
It would be good to be able to resume from a checkpoint and train for more epochs than the original max_epochs. However, we need a scheme that makes this work with scale_schedule_ratio, because scale schedule ratios are computed on the assumption that max_epochs does not change.
How should we go about handling this?
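To make the concern concrete, here is a minimal sketch. It assumes that scale_schedule_ratio multiplies both the training length and any epoch-based schedule milestones. The helper `scale_schedule` and the milestone values are hypothetical and are not taken from the library.

```python
def scale_schedule(max_epochs, milestones, ssr):
    """Scale the run length and schedule milestones by ssr (hypothetical helper)."""
    scaled_max = int(max_epochs * ssr)
    scaled_milestones = [int(m * ssr) for m in milestones]
    return scaled_max, scaled_milestones


# Original run: 100 epochs with schedule milestones at epochs 50 and 75,
# shortened to half length via ssr=0.5.
original = scale_schedule(100, [50, 75], 0.5)

# Resumed run: max_epochs raised to 200 with the same ratio. The scaled run
# length changes, but the scaled milestones stay where they were.
resumed = scale_schedule(200, [50, 75], 0.5)

print(original)
print(resumed)
```

Under these assumptions, changing max_epochs on resume alters only the scaled run length, so any schedule derived from the original max_epochs no longer spans the full run.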
Fixed via #445. The SSR (scale_schedule_ratio) and max duration are not persisted in a checkpoint, so whatever values are passed to the trainer upon checkpoint resume will be used.
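The behavior described above can be sketched with a toy trainer. This is an illustration only: `ToyTrainer` and its methods are hypothetical stand-ins, not the library's actual classes, and the only assumption carried over is that the checkpoint stores training progress but not max_epochs or scale_schedule_ratio.

```python
from dataclasses import dataclass


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the real library API."""
    max_epochs: int
    scale_schedule_ratio: float = 1.0
    epoch: int = 0

    def state_dict(self):
        # Only progress is saved. max_epochs and scale_schedule_ratio are
        # deliberately left out of the checkpoint.
        return {"epoch": self.epoch}

    def load_state_dict(self, state):
        # Restores progress; the duration settings keep their constructor values.
        self.epoch = state["epoch"]

    def fit(self):
        # Train until the (scaled) duration passed to this trainer is reached.
        while self.epoch < int(self.max_epochs * self.scale_schedule_ratio):
            self.epoch += 1


# First run: train for 10 epochs and take a checkpoint.
first = ToyTrainer(max_epochs=10)
first.fit()
checkpoint = first.state_dict()

# Resume with a larger max_epochs; the new value is the one that applies.
resumed = ToyTrainer(max_epochs=20)
resumed.load_state_dict(checkpoint)
start_epoch = resumed.epoch
resumed.fit()
```

In this sketch the resumed run starts at epoch 10 and continues to epoch 20, because the duration comes from the arguments given at resume time rather than from the checkpoint.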