
Better support for resuming training #8878

Merged: 1 commit merged into master on Dec 1, 2020

Conversation

@sgugger (Collaborator) commented on Dec 1, 2020

What does this PR do?

This PR adds two things linked to resuming training:

  1. It brings full reproducibility when resuming an interrupted training from a checkpoint (i.e., resuming from a checkpoint gives exactly the same results as a training run from the beginning with the same seed). This was not the case before because the dataloader shuffle was not triggered epochs_already_trained times, so the shuffle was the same as at epoch 0; full reproducibility was therefore only guaranteed for trainings resumed from an early checkpoint (one saved during the first epoch).

  2. It also adds an option to skip that data replay, which can take a very long time on a large dataset. Resuming then goes faster but yields different results from a training run from scratch. A minimal sketch of the resulting control flow is shown below.

Fixes #8874 and #8876
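
For illustration only, here is a minimal sketch of the control flow this change aims for. It is not the actual Trainer code: it assumes a DistributedSampler-style sampler whose shuffle is driven by set_epoch(), and the names epochs_trained, steps_trained_in_current_epoch, training_step and ignore_data_skip are placeholders for the checkpoint state and the new option.

```python
# Minimal sketch of resuming with reproducible data order (not the real Trainer).
from torch.utils.data import DataLoader, DistributedSampler


def training_step(model, batch):
    """Placeholder for the usual forward/backward/optimizer update."""


def resume_loop(model, dataset, sampler, num_train_epochs, epochs_trained,
                steps_trained_in_current_epoch, batch_size=8,
                ignore_data_skip=False):
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)
    for epoch in range(num_train_epochs):
        # Re-trigger the epoch-dependent shuffle so epoch N uses the same
        # permutation it would have used in an uninterrupted run.
        if isinstance(sampler, DistributedSampler):
            sampler.set_epoch(epoch)
        if epoch < epochs_trained:
            # This epoch was fully trained before the interruption.
            continue
        for step, batch in enumerate(dataloader):
            # Fast-forward past the batches already consumed in the interrupted
            # epoch, unless the user opted out of this replay (faster, but the
            # results will differ from an uninterrupted run).
            if (not ignore_data_skip and epoch == epochs_trained
                    and step < steps_trained_in_current_epoch):
                continue
            training_step(model, batch)
```

With the replay enabled, the resumed run sees exactly the batches it would have seen without the interruption; opting out trades that guarantee for a faster resume.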

@LysandreJik (Member) left a comment:


This looks good to me. Thanks!

Comment on lines -668 to +673
logger.info(" Num examples = %d", num_examples)
logger.info(" Num Epochs = %d", num_train_epochs)
logger.info(" Instantaneous batch size per device = %d", self.args.per_device_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", total_train_batch_size)
logger.info(" Gradient Accumulation steps = %d", self.args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", max_steps)
logger.info(f" Num examples = {num_examples}")
logger.info(f" Num Epochs = {num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {self.args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
logger.info(f" Gradient Accumulation steps = {self.args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {max_steps}")

The fight for f-strings has started!

@sgugger merged commit 7c10dd2 into master on Dec 1, 2020 and deleted the better_training_resume branch on Dec 1, 2020 at 18:45.
stas00 pushed a commit to stas00/transformers that referenced this pull request on Dec 5, 2020.
Linked issue: Results are different when fine-tuning continues after loading model from checkpoint