
Fully remove subprocess from the multi-gpu launcher #623

Merged
muellerzr merged 9 commits into main from subprocess-remover
Aug 10, 2022

Conversation

@muellerzr
Contributor

@muellerzr muellerzr commented Aug 9, 2022

Completely remove all subprocess from multi_gpu_launcher

What does this add?

This PR removes the subprocess parts of the multi_gpu launcher code and replaces them with raw torchrun. This reduces the number of subprocesses spawned (from 2 to 1, just torchrun's) and finally gives us a readable stack trace that isn't overwhelmed with subprocess bits (so we can play with them now! 🥳 )

Who is it for?

Users of Accelerate

Why is it needed?

Old stack traces were very hard to read and parse through, and especially hard to work with, since they came from a double nesting of subprocess calls. Now only one instance of subprocess is used, leaving us with the following new stack trace:

(base) zach_mueller_huggingface_co@zach-multi-gpu:~/accelerate/examples$ accelerate launch nlp_example.py 
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `2` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "nlp_example.py", line 193, in <module>
    main()
  File "nlp_example.py", line 189, in main
    training_function(config, args)
  File "nlp_example.py", line 100, in training_function
    raise ValueError()
ValueError
Traceback (most recent call last):
  File "nlp_example.py", line 193, in <module>
    main()
  File "nlp_example.py", line 189, in main
    training_function(config, args)
  File "nlp_example.py", line 100, in training_function
    raise ValueError()
ValueError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 165573) of binary: /opt/conda/bin/python3.7
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 33, in <module>
    sys.exit(load_entry_point('accelerate', 'console_scripts', 'accelerate')())
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/launch.py", line 823, in launch_command
    multi_gpu_launcher(args)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/launch.py", line 442, in multi_gpu_launcher
    distrib_run.run(distrib_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
nlp_example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-09_23:17:00
  host      : zach-multi-gpu.c.huggingface-ml.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 165574)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-09_23:17:00
  host      : zach-multi-gpu.c.huggingface-ml.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 165573)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

A follow-up PR will help with the repeating structure of this stack trace via rich.

It would also be a good idea to see what it would take to run the other launchers via native Python as well, rather than calling subprocess to invoke the script/entrypoint.
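Conceptually, the change looks something like the following sketch. The function names here (`build_distrib_argv`, `multi_gpu_launch`) are illustrative stand-ins, not Accelerate's actual code: instead of spawning `python -m torch.distributed.launch ...` as a subprocess, the launcher builds the argument list and hands it straight to `torch.distributed.run` in-process, as the `distrib_run.run(distrib_args)` frame in the trace above shows.

```python
# Hedged sketch of the in-process launch; function names are illustrative,
# not Accelerate's actual API.
def build_distrib_argv(num_processes, num_machines, training_script, script_args):
    # The same flags torchrun itself would parse from the command line.
    return [
        "--nproc_per_node", str(num_processes),
        "--nnodes", str(num_machines),
        training_script, *script_args,
    ]

def multi_gpu_launch(num_processes, num_machines, training_script, script_args):
    import torch.distributed.run as distrib_run  # requires a recent PyTorch
    parser = distrib_run.get_args_parser()
    distrib_args = parser.parse_args(
        build_distrib_argv(num_processes, num_machines, training_script, script_args)
    )
    # One process tree instead of two: only torchrun's workers are spawned.
    distrib_run.run(distrib_args)
```

Because the training workers are now direct children of the `accelerate` process, their tracebacks surface without an extra layer of subprocess plumbing in between.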

What parts of the API does this impact?

User-facing:

Absolutely nothing

Internal structure:

Functionally nothing

TODO:

  • Support non-torchrun launches (`torch.distributed`)

@muellerzr muellerzr requested review from pacman100 and sgugger August 9, 2022 23:30
@muellerzr muellerzr added the enhancement New feature or request label Aug 9, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 9, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment


Very clean, thanks!

I just have one concern in terms of PyTorch versions. `torch.distributed.run` is pretty recent, I think only 1.10 and above (which is why we had the `get_launch_prefix` util). So I think we should keep the old launcher as is for earlier versions and dispatch to the correct launcher depending on the PyTorch version.
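The dispatch suggested above could look roughly like this minimal sketch. The names `torchrun_launcher` and `legacy_gpu_launcher` are hypothetical stand-ins for the two code paths; only the version gate itself is the point.

```python
# Hedged sketch of version-based dispatch; the launcher names are
# hypothetical stand-ins for the new in-process path and the old
# subprocess-based path.
def parse_version(v: str):
    # "1.10.2+cu113" -> (1, 10); good enough for a major.minor gate.
    return tuple(int(p) for p in v.split("+")[0].split(".")[:2])

def pick_launcher(torch_version: str) -> str:
    if parse_version(torch_version) >= (1, 10):
        return "torchrun_launcher"   # call torch.distributed.run in-process
    return "legacy_gpu_launcher"     # old subprocess-based launch
```

In practice the version string would come from `torch.__version__`, stripped of any local build suffix before comparison.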

return cmd


def _filter_args(args):
Collaborator


Clean!

@muellerzr muellerzr requested a review from sgugger August 10, 2022 12:27
Comment on lines +35 to +38
# These are the args for `torch.distributed.launch` for pytorch < 1.9
TORCH_LAUNCH_PARAMS = """nnodes,nproc_per_node,rdzv_backend,rdzv_endpoint,rdzv_id,rdzv_conf,standalone,max_restarts,monitor_interval,start_method,role,module,m,no_python,run_path,log_dir,r,redirects,t,tee,node_rank,master_addr,master_port""".split(
","
)
Collaborator


Please use a regular list here, it's going to be easier to maintain (in terms of diff)
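The plain-list form being asked for would carry the same 23 entries as the `.split(",")` version, one per line, so future additions or removals show up as single-line diffs:

```python
# Same contents as the `.split(",")` string, written as a regular list
# so that diffs touch one line per changed parameter.
TORCH_LAUNCH_PARAMS = [
    "nnodes",
    "nproc_per_node",
    "rdzv_backend",
    "rdzv_endpoint",
    "rdzv_id",
    "rdzv_conf",
    "standalone",
    "max_restarts",
    "monitor_interval",
    "start_method",
    "role",
    "module",
    "m",
    "no_python",
    "run_path",
    "log_dir",
    "r",
    "redirects",
    "t",
    "tee",
    "node_rank",
    "master_addr",
    "master_port",
]
```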

@muellerzr muellerzr merged commit 9fd08d7 into main Aug 10, 2022
@muellerzr muellerzr deleted the subprocess-remover branch August 10, 2022 15:00

Labels

enhancement New feature or request


3 participants