Auto-resume training from checkpoint #9776

Merged · 3 commits · Jan 25, 2021
Conversation

@sgugger (Collaborator) commented Jan 25, 2021

What does this PR do?

Feature suggested by @madlag.
In the examples scripts, change the current behavior so if checkpoints are present in the output_dir passed to the script command, the training resumes from there. If the output_dir exists, is nonempty but has no checkpoint, the same behavior as before is applied (error). If it exists with checkpoints inside, the last checkpoint is grabbed to resume training from there. If --overwrite_output_dir is passed, the folder is destroyed as before.

This avoids user having to pass output_dir/checkpoint-xxx as their model name or path to resume training from a checkpoint, which is a nice improvement. The bad surprise can be if you set that output_dir with a trained model you like by mistake, but at the same time, the training is resumed from the last checkpoint so shouldn't be too long (and will converge to the same model) and interrupting before the end will not erase the model inside the folder, so I think the pros outweigh the cons.

Tweak the run_glue example script for now, will expand it to all scripts if accepted.
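For reference, a minimal sketch of the detection logic described above, as it could sit near the top of run_glue.py's main(). The variable training_args and the exact error message are illustrative, not the exact diff; get_last_checkpoint is the helper added in trainer_utils.py.

```python
import os

from transformers.trainer_utils import get_last_checkpoint

# Detect a checkpoint to resume from, following the behavior described above.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        # Nonempty output_dir with no checkpoint inside: keep the old behavior and error out.
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to train from scratch."
        )

# Later on, last_checkpoint (or None) is handed to trainer.train(...) so training resumes from it.
```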

@LysandreJik (Member) left a comment

Great, LGTM!

Review comment on examples/text-classification/run_glue.py (outdated, resolved)
sgugger and others added 2 commits January 25, 2021 08:04
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@LysandreJik (Member) left a comment

Great, thanks for taking care of the template too!

@sgugger sgugger merged commit caf4abf into master Jan 25, 2021
@sgugger sgugger deleted the auto_resume_training branch January 25, 2021 17:03
@jncasey (Contributor) commented Jan 26, 2021

I was having a problem getting this to actually resume training on my system, and I had to make three small changes to the new code in `trainer_utils.py`:

  1. `checkpoints = [path for path in content if _re_checkpoint.search(path) is not None and os.path.isdir(path)]` was returning an empty list. I changed `os.path.isdir(path)` to `os.path.isdir(os.path.join(folder, path))`, and now it returns a list of the checkpoint folders as expected.
  2. Similarly, the `get_last_checkpoint` function was returning the basename of the checkpoint folder, not the full path, which is what the updated example scripts seem to expect. I changed the last line of the function to `return os.path.join(folder, max(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0])))`.
  3. After I made those updates, it was resuming from the oldest checkpoint, not the newest. I noticed the checkpoint regex was only capturing the final digit in the directory name. I changed it to `_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"\-(\d+)$")`, with the `+` inside the capture group, and now `get_last_checkpoint` gives me the newest checkpoint as expected.

I'm just a novice, so I'm not sure whether those tweaks would break anything on other systems. Does `os.listdir()` return full paths instead of basenames under some OS/Python combinations?

But with these changes I'm able to resume an aborted training from within the same folder.
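Putting the three tweaks above together, a sketch of the patched helper in trainer_utils.py might look like the following (assuming PREFIX_CHECKPOINT_DIR is "checkpoint", as in the library; this is illustrative, not the exact upstream diff):

```python
import os
import re

PREFIX_CHECKPOINT_DIR = "checkpoint"
# Fix 3: put the `+` inside the capture group so the full step number is captured,
# not just the final digit of the directory name.
_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"\-(\d+)$")


def get_last_checkpoint(folder):
    content = os.listdir(folder)
    checkpoints = [
        path
        for path in content
        # Fix 1: join with `folder`, since os.listdir returns basenames, not full paths.
        if _re_checkpoint.search(path) is not None and os.path.isdir(os.path.join(folder, path))
    ]
    if len(checkpoints) == 0:
        return None
    # Fix 2: return the full path of the newest checkpoint, not just its basename.
    return os.path.join(folder, max(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0])))
```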

@sgugger (Collaborator, Author) commented Jan 26, 2021

Oh, I went a bit too fast and all the issues you list are completely valid! You should make a PR with all your changes since you were the one to find the problems and fix them :-)

@jncasey (Contributor) commented Jan 26, 2021

Got sucked back into work for my day job, but I'll try getting to that soonish.

Unrelated, do you guys have an office in DUMBO? I think we're in the same building (at least until we all started working from home)

@sgugger (Collaborator, Author) commented Jan 26, 2021

The DUMBO offices are at 20 Jay Street; you should come say hi for a coffee when the pandemic is over if you're around :-)

@jncasey (Contributor) commented Jan 26, 2021

Yeah, I run a small animation studio up on the 10th floor. Turns out the fancy GPUs I have for animation are also good for messing around with machine learning :) And so my pandemic side project became teaching my computer to write poetry.

Someday when it's safe again I'll definitely stop by.

@julien-c (Member) commented

Haha small world @jncasey! I'm missing 20 Jay right now. What is your company's name?

@jncasey (Contributor) commented Jan 27, 2021

We're called Mixtape Club. We moved into the building from Manhattan last January, and were only there for about 10 weeks before everyone started working from home. Now my business partner is the only one who goes in, since he can't bring our whole sound studio back to his apartment.
