
Modify save_steps when resuming from a checkpoint. #36303

Open

dignfei wants to merge 2 commits into huggingface:main from dignfei:modify_save_steps

Conversation

@dignfei dignfei commented Feb 20, 2025

What does this PR do?

When training is halfway through, it's common to find that the settings for save_steps, eval_steps, or logging_steps are not optimal and need to be adjusted. After stopping the training, modifying these arguments, and restarting, the new values should take effect rather than being overwritten by the values stored in the checkpoint.
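To make the issue concrete, here is a minimal, hypothetical sketch (the class names are simplified stand-ins, not the real transformers.TrainingArguments / transformers.TrainerState): when resuming, the TrainerState restored from the checkpoint carries the old step intervals, so the freshly supplied arguments are ignored unless they are re-applied after loading.

```python
from dataclasses import dataclass

@dataclass
class TrainingArgsSketch:   # stand-in for TrainingArguments
    save_steps: int

@dataclass
class TrainerStateSketch:   # stand-in for TrainerState (restored from checkpoint)
    save_steps: int

def resume(checkpoint_state: TrainerStateSketch,
           new_args: TrainingArgsSketch,
           apply_new_args: bool) -> TrainerStateSketch:
    """Sketch of resuming from a checkpoint.

    Without `apply_new_args`, the stale interval saved in the checkpoint
    keeps governing when checkpoints are written; with it, the freshly
    supplied argument wins (the behavior this PR is after)."""
    if apply_new_args:
        checkpoint_state.save_steps = new_args.save_steps
    return checkpoint_state

# User trained with save_steps=500, stopped, edited the args to 100, restarted.
stale = resume(TrainerStateSketch(save_steps=500),
               TrainingArgsSketch(save_steps=100), apply_new_args=False)
fresh = resume(TrainerStateSketch(save_steps=500),
               TrainingArgsSketch(save_steps=100), apply_new_args=True)
print(stale.save_steps)  # 500 — new setting silently ignored
print(fresh.save_steps)  # 100 — new setting applied on resume
```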

@Rocketknight1
Member

cc @SunMarc @muellerzr

Comment on lines +2429 to 2430
self.state.compute_steps(args, max_steps)
self.compare_trainer_and_checkpoint_args(self.args, self.state)
Member

Could you explain a bit better why this is needed? In the next line, we are comparing whether we have the same args.

Author

In the method compare_trainer_and_checkpoint_args, the comparison does not stop the program; it only issues a warning message.

Author

After restoring from the checkpoint, we need to modify the save_steps. If it's not modified here, where else should it be modified?

Member

Then I would suggest renaming the function compare_trainer_and_checkpoint_args to something like maybe_update_checkpoint_from_trainer_args, performing the changes there, and raising a meaningful warning. Also, could you update the tests related to that?
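A minimal sketch of what the suggested maybe_update_checkpoint_from_trainer_args could look like, assuming simplified stand-in classes for TrainingArguments and TrainerState (the real signature and fields in transformers may differ): it warns on every mismatch and then lets the freshly supplied args win.

```python
import warnings
from dataclasses import dataclass

# Hypothetical, simplified stand-ins; the real transformers classes
# carry many more fields than the three step intervals shown here.
@dataclass
class Args:
    save_steps: int
    eval_steps: int
    logging_steps: int

@dataclass
class State:
    save_steps: int
    eval_steps: int
    logging_steps: int

def maybe_update_checkpoint_from_trainer_args(args: Args, state: State) -> None:
    """For each step-interval setting, warn when the checkpoint disagrees
    with the freshly supplied args, then overwrite the checkpoint value."""
    for field in ("save_steps", "eval_steps", "logging_steps"):
        old, new = getattr(state, field), getattr(args, field)
        if old != new:
            warnings.warn(
                f"{field} changed from {old} (checkpoint) to {new} (args); "
                "using the new value for the resumed run."
            )
            setattr(state, field, new)

# Resuming with a smaller save interval than the checkpoint recorded:
state = State(save_steps=500, eval_steps=500, logging_steps=50)
args = Args(save_steps=100, eval_steps=500, logging_steps=50)
maybe_update_checkpoint_from_trainer_args(args, state)
print(state.save_steps)  # 100
```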
