
Modify save_steps when resuming from a checkpoint. #36303

Open

dignfei wants to merge 2 commits into huggingface:main from dignfei:modify_save_steps

Conversation

@dignfei dignfei commented Feb 20, 2025

What does this PR do?

When training is halfway through, it's common to find that the settings for save_steps, eval_steps, or logging_steps are not optimal and need to be adjusted. After stopping the training, modifying these arguments, and restarting, the new values should take effect rather than being overwritten by the values stored in the checkpoint.
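To make the issue concrete, here is a minimal, hypothetical sketch (the class names are simplified stand-ins, not the real transformers.TrainingArguments / transformers.TrainerState): when resuming, the TrainerState restored from the checkpoint carries the old step intervals, so the freshly supplied arguments are ignored unless they are re-applied after loading.

```python
from dataclasses import dataclass

@dataclass
class TrainingArgsSketch:   # stand-in for TrainingArguments
    save_steps: int

@dataclass
class TrainerStateSketch:   # stand-in for TrainerState (restored from checkpoint)
    save_steps: int

def resume(checkpoint_state: TrainerStateSketch,
           new_args: TrainingArgsSketch,
           apply_new_args: bool) -> TrainerStateSketch:
    """Sketch of resuming from a checkpoint.

    Without `apply_new_args`, the stale interval saved in the checkpoint
    keeps governing when checkpoints are written; with it, the freshly
    supplied argument wins (the behavior this PR is after)."""
    if apply_new_args:
        checkpoint_state.save_steps = new_args.save_steps
    return checkpoint_state

# User trained with save_steps=500, stopped, edited the args to 100, restarted.
stale = resume(TrainerStateSketch(save_steps=500),
               TrainingArgsSketch(save_steps=100), apply_new_args=False)
fresh = resume(TrainerStateSketch(save_steps=500),
               TrainingArgsSketch(save_steps=100), apply_new_args=True)
print(stale.save_steps)  # 500 — new setting silently ignored
print(fresh.save_steps)  # 100 — new setting applied on resume
```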

@Rocketknight1
Member

cc @SunMarc @muellerzr

Comment on lines +2429 to 2430
self.state.compute_steps(args, max_steps)
self.compare_trainer_and_checkpoint_args(self.args, self.state)
Member

Could you explain a bit better why this is needed? In the next line, we are comparing whether we have the same args.

Author

In the method compare_trainer_and_checkpoint_args, the comparison does not stop the program; it only issues a warning message.

Author

After restoring from the checkpoint, we need to modify the save_steps. If it's not modified here, where else should it be modified?

Member

Then I would suggest renaming the function compare_trainer_and_checkpoint_args to something like maybe_update_checkpoint_from_trainer_args, performing the changes there, and raising a meaningful warning. Also, could you update the tests related to that?
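A minimal sketch of what the suggested maybe_update_checkpoint_from_trainer_args could look like, assuming simplified stand-in classes for TrainingArguments and TrainerState (the real signature and fields in transformers may differ): it warns on every mismatch and then lets the freshly supplied args win.

```python
import warnings
from dataclasses import dataclass

# Hypothetical, simplified stand-ins; the real transformers classes
# carry many more fields than the three step intervals shown here.
@dataclass
class Args:
    save_steps: int
    eval_steps: int
    logging_steps: int

@dataclass
class State:
    save_steps: int
    eval_steps: int
    logging_steps: int

def maybe_update_checkpoint_from_trainer_args(args: Args, state: State) -> None:
    """For each step-interval setting, warn when the checkpoint disagrees
    with the freshly supplied args, then overwrite the checkpoint value."""
    for field in ("save_steps", "eval_steps", "logging_steps"):
        old, new = getattr(state, field), getattr(args, field)
        if old != new:
            warnings.warn(
                f"{field} changed from {old} (checkpoint) to {new} (args); "
                "using the new value for the resumed run."
            )
            setattr(state, field, new)

# Resuming with a smaller save interval than the checkpoint recorded:
state = State(save_steps=500, eval_steps=500, logging_steps=50)
args = Args(save_steps=100, eval_steps=500, logging_steps=50)
maybe_update_checkpoint_from_trainer_args(args, state)
print(state.save_steps)  # 100
```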
