Multiple fixes in SageMakerTrainer #10687
Conversation
if self.is_model_parallel_enabled:
    self._save_smp(output_dir)
We need to use a special save here because model parallelism requires us to:
- gather the state dict on all processes of dp_rank 0 (it triggers a sync across those processes)
- save it only on process 0 (see the sketch below)
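For illustration, here is a minimal sketch of what such a helper could look like, assuming the `smdistributed.modelparallel.torch` API (`smp.dp_rank()`, `smp.rank()`). The name `_save_smp` comes from the diff above; `model_wrapped`, the output file name, and the overall body are illustrative assumptions, not the actual implementation.

```python
import os

import torch
import smdistributed.modelparallel.torch as smp


def _save_smp(self, output_dir=None):
    # Only the processes of dp_rank 0 take part in the save.
    if smp.dp_rank() != 0:
        return
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    # Calling state_dict() on the smp-wrapped model gathers the partitioned
    # parameters, synchronizing across all dp_rank 0 processes, so every one
    # of them must reach this line (assumption based on the comment above).
    state_dict = self.model_wrapped.state_dict()
    # Writing to disk happens on process 0 only.
    if smp.rank() == 0:
        torch.save(state_dict, os.path.join(output_dir, "pytorch_model.bin"))
```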
# Consolidate the state dict on all processes of dp_rank 0
opt_state_dict = self.optimizer.state_dict()
The method is overridden for this particular line/behavior: as for the model, the optimizer's state dict needs to be gathered on all dp_rank 0 processes.
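A hedged sketch of the same pattern applied to the optimizer state, again assuming `smdistributed.modelparallel.torch`; the helper name `_save_optimizer_state` is hypothetical, and only the two lines from the diff are taken from the PR itself.

```python
import os

import torch
import smdistributed.modelparallel.torch as smp


def _save_optimizer_state(self, output_dir):
    # Hypothetical helper: every dp_rank 0 process must call state_dict()
    # so the optimizer state can be consolidated (the call synchronizes
    # across those processes) ...
    if smp.dp_rank() == 0:
        opt_state_dict = self.optimizer.state_dict()
        # ... but only the global process 0 writes the result to disk.
        if smp.rank() == 0:
            torch.save(opt_state_dict, os.path.join(output_dir, "optimizer.pt"))
```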
    os.path.join(checkpoint, "scheduler.pt")
):
    self.optimizer.load_state_dict(
        torch.load(os.path.join(checkpoint, "optimizer.pt"), map_location="cpu")
The method is overridden for this particular line: the state dict needs to be loaded on the CPU, not the device.
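A sketch of what the overridden loading could look like; the method name and the `map_location="cpu"` call appear in the diff, while the surrounding structure (including the `lr_scheduler` handling) is an illustrative reconstruction.

```python
import os

import torch


def _load_optimizer_and_scheduler(self, checkpoint):
    if checkpoint is None:
        return
    if os.path.isfile(os.path.join(checkpoint, "optimizer.pt")) and os.path.isfile(
        os.path.join(checkpoint, "scheduler.pt")
    ):
        # Load on the CPU rather than the device, per the review comment
        # above; the wrapped optimizer then distributes the state itself.
        self.optimizer.load_state_dict(
            torch.load(os.path.join(checkpoint, "optimizer.pt"), map_location="cpu")
        )
        self.lr_scheduler.load_state_dict(
            torch.load(os.path.join(checkpoint, "scheduler.pt"))
        )
```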
@@ -927,6 +924,9 @@ def train(
         if delay_optimizer_creation:
             self.create_optimizer_and_scheduler(num_training_steps=max_steps)

+        # Check if saved optimizer or scheduler states exist
+        self._load_optimizer_and_scheduler(resume_from_checkpoint)
This move should be harmless: the optimizer and scheduler must exist before their states can be reloaded, so the call now happens right after the (possibly delayed) optimizer creation.
-elif self.args.local_rank != -1:
-    world_size = dist.get_world_size()
-world_size = max(1, world_size)
+world_size = max(1, self.args.world_size)
Refactor; it also allows world_size to be overridden in SageMakerTrainingArguments.
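For instance, such an override could look like the sketch below, assuming `smp.dp_size()` returns the data-parallel world size and reusing the `is_model_parallel_enabled` flag seen earlier in the diff; the exact body is an assumption, not the actual implementation.

```python
import smdistributed.modelparallel.torch as smp
from transformers import TrainingArguments


class SageMakerTrainingArguments(TrainingArguments):
    # Illustrative flag; the real class derives this from its SageMaker
    # configuration (the diff above only shows that such a flag exists).
    is_model_parallel_enabled: bool = False

    @property
    def world_size(self):
        # With model parallelism, the effective world size for data-parallel
        # bookkeeping is the number of model replicas (smp.dp_size()),
        # not the raw process count.
        if self.is_model_parallel_enabled:
            return smp.dp_size()
        return super().world_size
```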
LGTM. Tested on run_glue.py with mnli and mrpc.
LGTM
* Handle save differently
* Missing imports
* Fix typo
* Adapt to recent changes in save_pretrained
* Forgotten brackets
* Optimizer load
* Fix world size
* Deal with None
* Remove needless self
What does this PR do?
This PR adds quite a few fixes to the SageMakerTrainer to make sure the example scripts run fully. In particular, it fixes several issues, among them the use of drop_last=True, which is not something anyone wants.

The goal is now to test that functionality a bit more before merging the SageMakerTrainer into the main Trainer (otherwise one can't use model parallelism in the seq2seq or QA examples). The plan is to have them merged in v4.5.0.