
Introduce breakpoint API #1940

Merged
muellerzr merged 13 commits into main from breakpoint on Sep 13, 2023

Conversation

@muellerzr (Collaborator) commented on Sep 7, 2023

Introduce breakpoint API

What does this add?

This PR adds two new functions to Accelerator: set_breakpoint and check_breakpoint, based on https://discuss.pytorch.org/t/how-to-use-break-in-distributeddataparallel-training/88296/7

Who is it for?

https://discuss.huggingface.co/t/early-stopping-for-eval-loss-causes-timeout/51349

Why is it needed?

When doing early stopping in DDP, each process may check a condition that is not synchronized across all of them, so a break can happen on process 0 but not on process 1. As a result, the code hangs indefinitely until a timeout occurs.
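
For illustration, a minimal sketch of the pattern that hangs; `training_step` and `should_stop` here are hypothetical stand-ins for the usual forward/backward step and the early-stopping condition:

```python
# Each process evaluates `should_stop` independently, so one rank may break out
# while the others keep training and then block forever in the next collective
# (e.g. the gradient all-reduce of the following backward pass).
for batch in train_dataloader:
    loss = training_step(model, batch)  # hypothetical helper
    if should_stop(loss):               # may be True on rank 0 only
        break                           # ranks that don't break will hang
```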

To address that, this PR introduces a new set/check breakpoint API, which should be used in tandem with such conditionals to ensure that the breakpoint is reached on every process. This API can be used with any distributed type, not just DDP/multi-GPU.

What parts of the API does this impact?

User-facing:

Accelerator.set_breakpoint and Accelerator.check_breakpoint

Internal structure:

The Accelerator object now keeps track of a tensor at self.flag_tensor
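
Roughly, the idea is a per-process flag tensor that is reduced across processes when checked. A minimal sketch of that mechanism (not the PR's exact code), assuming an existing `accelerator` object and using `Accelerator.reduce`:

```python
import torch

flag_tensor = torch.zeros(1, device=accelerator.device)

def set_flag():
    global flag_tensor
    flag_tensor = torch.ones(1, device=accelerator.device)

def check_flag():
    global flag_tensor
    # Summing the flag over all processes acts as a distributed `any`:
    # if any rank set it, every rank sees a value >= 1.
    synced_flag = accelerator.reduce(flag_tensor, reduction="sum")
    if synced_flag.item() >= 1:
        flag_tensor = torch.zeros(1, device=accelerator.device)  # reset for reuse
        return True
    return False
```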

Basic Usage Example(s):

```python
# Assume `should_do_breakpoint` is a custom defined function that returns a conditional,
# and that conditional might be true only on process 1
if should_do_breakpoint(loss):
    accelerator.set_breakpoint()

# Later in the training script when we need to check for the breakpoint
if accelerator.check_breakpoint():
    break
```

When would I use it, and when wouldn't I?

A prime example would be triggering early stopping, where a stop condition met on process 0 should trigger stopping across all processes.
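
For example, a sketch of an early-stopping loop built on this API; `patience_exceeded` and `training_step` are hypothetical helpers standing in for real validation-loss tracking and the usual forward/backward step:

```python
should_stop = False
for epoch in range(num_epochs):
    for batch in train_dataloader:
        loss = training_step(model, batch)
        if patience_exceeded(loss):          # may be True on a single rank only
            accelerator.set_breakpoint()
        if accelerator.check_breakpoint():   # same result on every rank
            should_stop = True
            break
    if should_stop:
        break
```

Since the check synchronizes across processes, it should be reached the same number of times on every rank; here that holds because each rank sees the same number of batches from the prepared dataloader.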

@HuggingFaceDocBuilderDev commented on Sep 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@BenjaminBossan (Member) left a comment

Thanks for investigating that issue and coming up with a solution.

This is not an in-depth review yet, but maybe these comments can already help a bit. Apart from those, I also wondered about the term "breakpoint". I mostly associate it with debugging, which is not what is happening here, but maybe my view is too narrow. In reality, this feature is really a synchronized flag and could in theory be used as a basis for all kinds of features, right?
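
For instance, something like this (purely hypothetical) could be built on the same primitive, reacting when any rank sees a non-finite loss:

```python
# Hypothetical reuse of the synchronized flag beyond early stopping:
# save a rescue checkpoint as soon as *any* process sees a non-finite loss.
if not torch.isfinite(loss).all():
    accelerator.set_breakpoint()
if accelerator.check_breakpoint():
    accelerator.save_state("rescue_checkpoint")
```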

Review comments (resolved): src/accelerate/accelerator.py, docs/source/concept_guides/deferring_execution.md
@muellerzr (Collaborator, Author) commented:

@BenjaminBossan I'd say it's more of a conditional trigger setup than synchronizing flags. "Synchronizing flags" isn't quite accurate either: it checks whether a particular condition that affects all processes was met on a single process (or on more than one of them).

Agreed the naming could use some work, though. If anyone has ideas, lmk; otherwise I'll think on it :)

@BenjaminBossan (Member) commented:

> I'd say it's more of a conditional trigger setup than synchronizing flags

True, it's basically an any operator that works across processes.
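
In bare torch.distributed terms it is roughly this (a sketch, assuming an initialized process group and, for NCCL, the right CUDA device already set):

```python
import torch
import torch.distributed as dist

def any_across_processes(local_condition: bool) -> bool:
    # MAX all-reduce over a 0/1 flag: True everywhere if the condition
    # holds on at least one rank.
    flag = torch.tensor(float(local_condition), device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```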

> Agreed the naming could use some work, though. If anyone has ideas, lmk; otherwise I'll think on it :)

Always the hard questions. Since you used the word yourself, I wonder if it could be something with "trigger".

@BenjaminBossan (Member) left a comment

Thanks a lot, LGTM. I have a few suggestions, but no blockers for merging. Feel free to follow or ignore them.

Review comments (resolved): docs/source/concept_guides/deferring_execution.md, examples/by_feature/early_stopping.py, src/accelerate/accelerator.py, src/accelerate/test_utils/scripts/test_script.py
@SunMarc (Member) left a comment

LGTM! Left a few suggestions ;)

@muellerzr merged commit 40a73e0 into main on Sep 13, 2023
26 checks passed
@muellerzr deleted the breakpoint branch on September 13, 2023 at 16:42
@beyondguo commented:

Hi @muellerzr, is set_breakpoint disabled now? I got AttributeError: 'Accelerator' object has no attribute 'set_breakpoint' with the latest accelerate (v0.31.0).

Actually I am very excited to find this PR, since it seems quite relevant to my problem (similar to the early stopping problem). I'm using 2 GPUs to train a RoBERTa model, and I want to calculate the validation loss every eval_steps. Here's the core code:

```python
...
model, optimizer, train_dataloader, val_dataloader = accelerator.prepare(model, optimizer, train_dataloader, val_dataloader)
...
best_eval_loss = float('inf')

# Training loop:
for epoch in range(num_epochs): 
    for batch in train_dataloader:
        model.train()
        outputs = model(**batch)
        loss = outputs.loss 
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        progress_bar.update(1)

        # validation when reach 'eval_steps'
        if progress_bar.n > 10 and progress_bar.n % args.eval_steps == 0:
            model.eval()
            losses = []
            for val_batch in val_dataloader:
                # ----------------> stuck here!
                with torch.no_grad():
                    outputs = model(**val_batch)
                e_loss = outputs.loss
                losses.append(accelerator.gather_for_metrics(e_loss.repeat(val_bs)))
            losses = torch.cat(losses)
            eval_loss = torch.mean(losses)
            print("eval_loss", eval_loss.item())
            
            # save the current best model
            if eval_loss.item() < best_eval_loss:
                print("Current best model!, Steps:", progress_bar.n, "Eval loss:", eval_loss.item())
                # save
                ...
```

The process just gets stuck when computing the first validation batch. Both GPUs are at 100% utilization, so I guess the process is hanging.

After a long time, an error occurs:

```
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1808546 milliseconds before timing out.
...
Traceback (most recent call last):
  File "rolling_model_train.py", line 289, in <module>
    train()
  File "rolling_model_train.py", line 195, in train
    losses.append(accelerator.gather_for_metrics(loss.repeat(val_bs)))
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/accelerate/accelerator.py", line 2217, in gather_for_metrics
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1808546 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 89866) of binary: /home/guoby/app/Anaconda3-2021.05/envs/news/bin/python
Traceback (most recent call last):
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/guoby/app/Anaconda3-2021.05/envs/news/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
rolling_model_train.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-20_15:02:22
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 89867)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 89867
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-20_15:02:22
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 89866)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 89866
======================================================
```

Note that if I move the validation code to after each epoch, instead of running it during an epoch, the program runs fine! It looks like this:

```python
# Training loop:
for epoch in range(num_epochs): 
    for batch in train_dataloader:
        model.train()
        outputs = model(**batch)
        loss = outputs.loss 
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # validation:
    model.eval()
    losses = []
    for val_batch in val_dataloader:
        with torch.no_grad():
            outputs = model(**val_batch)
        e_loss = outputs.loss
        losses.append(accelerator.gather_for_metrics(e_loss.repeat(val_bs)))
    losses = torch.cat(losses)
    eval_loss = torch.mean(losses)
    print("eval_loss", eval_loss.item())

Could you help me with this? Thanks a lot!
