
Better support for resuming training #8878

Merged: 1 commit merged into master on Dec 1, 2020

Conversation

@sgugger (Collaborator) commented on Dec 1, 2020

What does this PR do?

This PR adds two things linked to resuming training:

  1. It brings full reproducibility when resuming an interrupted training from a checkpoint (i.e., resuming from a checkpoint gives exactly the same results as a training run from the beginning with the same seed). This was not the case before because the dataloader shuffle was not triggered epochs_already_trained times, so the shuffle was the same as at epoch 0; full reproducibility was therefore only guaranteed for trainings resumed from an early checkpoint (one saved during the first epoch).

  2. It also adds an option to skip that data replay, which can take a very long time on a large dataset. Resuming then goes faster but yields different results from a training run from scratch. A minimal sketch of the resulting control flow is shown below.

Fixes #8874 and #8876
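
For illustration only, here is a minimal sketch of the control flow this change aims for. It is not the actual Trainer code: it assumes a DistributedSampler-style sampler whose shuffle is driven by set_epoch(), and the names epochs_trained, steps_trained_in_current_epoch, training_step and ignore_data_skip are placeholders for the checkpoint state and the new option.

```python
# Minimal sketch of resuming with reproducible data order (not the real Trainer).
from torch.utils.data import DataLoader, DistributedSampler


def training_step(model, batch):
    """Placeholder for the usual forward/backward/optimizer update."""


def resume_loop(model, dataset, sampler, num_train_epochs, epochs_trained,
                steps_trained_in_current_epoch, batch_size=8,
                ignore_data_skip=False):
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)
    for epoch in range(num_train_epochs):
        # Re-trigger the epoch-dependent shuffle so epoch N uses the same
        # permutation it would have used in an uninterrupted run.
        if isinstance(sampler, DistributedSampler):
            sampler.set_epoch(epoch)
        if epoch < epochs_trained:
            # This epoch was fully trained before the interruption.
            continue
        for step, batch in enumerate(dataloader):
            # Fast-forward past the batches already consumed in the interrupted
            # epoch, unless the user opted out of this replay (faster, but the
            # results will differ from an uninterrupted run).
            if (not ignore_data_skip and epoch == epochs_trained
                    and step < steps_trained_in_current_epoch):
                continue
            training_step(model, batch)
```

With the replay enabled, the resumed run sees exactly the batches it would have seen without the interruption; opting out trades that guarantee for a faster resume.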

@LysandreJik (Member) left a comment:


This looks good to me. Thanks!

Comment on lines -668 to +673
logger.info(" Num examples = %d", num_examples)
logger.info(" Num Epochs = %d", num_train_epochs)
logger.info(" Instantaneous batch size per device = %d", self.args.per_device_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", total_train_batch_size)
logger.info(" Gradient Accumulation steps = %d", self.args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", max_steps)
logger.info(f" Num examples = {num_examples}")
logger.info(f" Num Epochs = {num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {self.args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
logger.info(f" Gradient Accumulation steps = {self.args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {max_steps}")

The fight for f-strings has started!

@sgugger merged commit 7c10dd2 into master on Dec 1, 2020 and deleted the better_training_resume branch on Dec 1, 2020 at 18:45.
stas00 pushed a commit to stas00/transformers that referenced this pull request on Dec 5, 2020.
Linked issue: Results are different when fine-tuning continues after loading model from checkpoint