Modify save_steps when resuming from a checkpoint.#36303
dignfei wants to merge 2 commits into huggingface:main
Conversation
```python
self.state.compute_steps(args, max_steps)
self.compare_trainer_and_checkpoint_args(self.args, self.state)
```
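The intent of the added `compute_steps` call can be illustrated with a minimal sketch. The `Args` and `State` classes below are hypothetical stand-ins for `TrainingArguments` and `TrainerState` (the real transformers classes carry many more fields); the point is that recomputing the step-dependent fields from the *current* args lets a restarted run use the new values instead of the ones restored from the checkpoint.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for TrainingArguments / TrainerState.
@dataclass
class Args:
    save_steps: int
    eval_steps: int
    logging_steps: int

@dataclass
class State:
    save_steps: int = 0
    eval_steps: int = 0
    logging_steps: int = 0

    def compute_steps(self, args, max_steps):
        # Recompute step-dependent fields from the current args,
        # overriding whatever was restored from the checkpoint.
        self.save_steps = args.save_steps
        self.eval_steps = args.eval_steps
        self.logging_steps = args.logging_steps

# State as restored from a checkpoint, still holding the old settings:
state = State(save_steps=500, eval_steps=500, logging_steps=500)
# The user restarts training with new settings:
args = Args(save_steps=100, eval_steps=100, logging_steps=100)
state.compute_steps(args, max_steps=10_000)
print(state.save_steps)  # 100
```

Without the `compute_steps` call, `state.save_steps` would remain at the checkpointed value of 500.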
Could you explain a bit better why this is needed? On the next line, we are already comparing whether the args are the same.
The comparison in compare_trainer_and_checkpoint_args does not stop the program; it only issues a warning message.
After restoring from the checkpoint, we need to modify the save_steps. If it's not modified here, where else should it be modified?
Then I would suggest renaming compare_trainer_and_checkpoint_args to something like maybe_update_checkpoint_from_trainer_args, performing the changes there, and raising a meaningful warning. Also, could you update the tests related to that?
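The reviewer's suggestion could look roughly like the sketch below. The function name comes from the review comment; the `Args`/`State` classes and the exact field list are assumptions standing in for the real `TrainingArguments`/`TrainerState`, and this is not the actual transformers implementation.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for TrainingArguments / TrainerState.
@dataclass
class Args:
    save_steps: int
    eval_steps: int
    logging_steps: int

@dataclass
class State:
    save_steps: int
    eval_steps: int
    logging_steps: int

def maybe_update_checkpoint_from_trainer_args(args, state):
    """Overwrite checkpoint-restored step settings with the current
    trainer args, warning about each mismatch (sketch of the reviewer's
    suggestion, not the real transformers code)."""
    for field in ("save_steps", "eval_steps", "logging_steps"):
        old, new = getattr(state, field), getattr(args, field)
        if old != new:
            logger.warning(
                "%s in the checkpoint (%s) differs from the trainer args; "
                "using the new value %s", field, old, new,
            )
            setattr(state, field, new)

state = State(save_steps=500, eval_steps=500, logging_steps=500)
args = Args(save_steps=100, eval_steps=500, logging_steps=50)
maybe_update_checkpoint_from_trainer_args(args, state)
print(state.save_steps, state.logging_steps)  # 100 50
```

Folding the update into the comparison function keeps a single place where checkpoint state and trainer args are reconciled, and the warning still tells the user which values changed.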
What does this PR do?
When training is halfway through, it is common to find that the settings for save_steps, eval_steps, or logging_steps are not optimal and need to be adjusted. With this change, after stopping the training, modifying the parameters, and restarting, the new parameters are no longer overwritten by the values stored in the checkpoint.