
Fully remove subprocess from the multi-gpu launcher #623

Merged
muellerzr merged 9 commits into main from subprocess-remover
Aug 10, 2022

Conversation

@muellerzr
Contributor

@muellerzr muellerzr commented Aug 9, 2022

Completely remove all subprocess from multi_gpu_launcher

What does this add?

This PR removes the subprocess parts of the multi_gpu launcher code and replaces them with raw torchrun. This reduces the number of subprocesses spawned (from 2 to 1, just torchrun's) and finally gives us a readable stack trace that isn't overwhelmed with subprocess bits (so we can play with them now! 🥳 )

Who is it for?

Users of Accelerate

Why is it needed?

Old stack traces were very hard to read and parse through, and especially hard to work with, since they came from a double nesting of subprocess calls. Now only one instance of subprocess is used, leaving us with the following new stack trace:

(base) zach_mueller_huggingface_co@zach-multi-gpu:~/accelerate/examples$ accelerate launch nlp_example.py 
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `2` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "nlp_example.py", line 193, in <module>
    main()
  File "nlp_example.py", line 189, in main
    training_function(config, args)
  File "nlp_example.py", line 100, in training_function
    raise ValueError()
ValueError
Traceback (most recent call last):
  File "nlp_example.py", line 193, in <module>
    main()
  File "nlp_example.py", line 189, in main
    training_function(config, args)
  File "nlp_example.py", line 100, in training_function
    raise ValueError()
ValueError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 165573) of binary: /opt/conda/bin/python3.7
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 33, in <module>
    sys.exit(load_entry_point('accelerate', 'console_scripts', 'accelerate')())
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/launch.py", line 823, in launch_command
    multi_gpu_launcher(args)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/commands/launch.py", line 442, in multi_gpu_launcher
    distrib_run.run(distrib_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
nlp_example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-09_23:17:00
  host      : zach-multi-gpu.c.huggingface-ml.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 165574)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-09_23:17:00
  host      : zach-multi-gpu.c.huggingface-ml.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 165573)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

A follow-up PR will help with the repeating structure of this stack trace via rich.

It would also be a good idea to see what it would take to run the other launchers via native Python as well, rather than calling subprocess to invoke the script/entrypoint.
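Conceptually, the change looks something like the following sketch. The function names here (`build_distrib_argv`, `multi_gpu_launch`) are illustrative stand-ins, not Accelerate's actual code: instead of spawning `python -m torch.distributed.launch ...` as a subprocess, the launcher builds the argument list and hands it straight to `torch.distributed.run` in-process, as the `distrib_run.run(distrib_args)` frame in the trace above shows.

```python
# Hedged sketch of the in-process launch; function names are illustrative,
# not Accelerate's actual API.
def build_distrib_argv(num_processes, num_machines, training_script, script_args):
    # The same flags torchrun itself would parse from the command line.
    return [
        "--nproc_per_node", str(num_processes),
        "--nnodes", str(num_machines),
        training_script, *script_args,
    ]

def multi_gpu_launch(num_processes, num_machines, training_script, script_args):
    import torch.distributed.run as distrib_run  # requires a recent PyTorch
    parser = distrib_run.get_args_parser()
    distrib_args = parser.parse_args(
        build_distrib_argv(num_processes, num_machines, training_script, script_args)
    )
    # One process tree instead of two: only torchrun's workers are spawned.
    distrib_run.run(distrib_args)
```

Because the training workers are now direct children of the `accelerate` process, their tracebacks surface without an extra layer of subprocess plumbing in between.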

What parts of the API does this impact?

User-facing:

Absolutely nothing

Internal structure:

Functionally nothing

TODO:

  • Support non-torchrun launches (`torch.distributed`)

@muellerzr muellerzr requested review from pacman100 and sgugger August 9, 2022 23:30
@muellerzr muellerzr added the enhancement New feature or request label Aug 9, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 9, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment


Very clean, thanks!

I just have one concern in terms of PyTorch versions. `torch.distributed.run` is pretty recent, I think only 1.10 and above (which is why we had the `get_launch_prefix` util). So I think we should keep the old launcher as is for earlier versions and dispatch to the correct launcher depending on the PyTorch version.
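The dispatch suggested above could look roughly like this minimal sketch. The names `torchrun_launcher` and `legacy_gpu_launcher` are hypothetical stand-ins for the two code paths; only the version gate itself is the point.

```python
# Hedged sketch of version-based dispatch; the launcher names are
# hypothetical stand-ins for the new in-process path and the old
# subprocess-based path.
def parse_version(v: str):
    # "1.10.2+cu113" -> (1, 10); good enough for a major.minor gate.
    return tuple(int(p) for p in v.split("+")[0].split(".")[:2])

def pick_launcher(torch_version: str) -> str:
    if parse_version(torch_version) >= (1, 10):
        return "torchrun_launcher"   # call torch.distributed.run in-process
    return "legacy_gpu_launcher"     # old subprocess-based launch
```

In practice the version string would come from `torch.__version__`, stripped of any local build suffix before comparison.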

return cmd


def _filter_args(args):
Collaborator


Clean!

@muellerzr muellerzr requested a review from sgugger August 10, 2022 12:27
Comment on lines +35 to +38
# These are the args for `torch.distributed.launch` for pytorch < 1.9
TORCH_LAUNCH_PARAMS = """nnodes,nproc_per_node,rdzv_backend,rdzv_endpoint,rdzv_id,rdzv_conf,standalone,max_restarts,monitor_interval,start_method,role,module,m,no_python,run_path,log_dir,r,redirects,t,tee,node_rank,master_addr,master_port""".split(
","
)
Collaborator


Please use a regular list here, it's going to be easier to maintain (in terms of diff)
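The plain-list form being asked for would carry the same 23 entries as the `.split(",")` version, one per line, so future additions or removals show up as single-line diffs:

```python
# Same contents as the `.split(",")` string, written as a regular list
# so that diffs touch one line per changed parameter.
TORCH_LAUNCH_PARAMS = [
    "nnodes",
    "nproc_per_node",
    "rdzv_backend",
    "rdzv_endpoint",
    "rdzv_id",
    "rdzv_conf",
    "standalone",
    "max_restarts",
    "monitor_interval",
    "start_method",
    "role",
    "module",
    "m",
    "no_python",
    "run_path",
    "log_dir",
    "r",
    "redirects",
    "t",
    "tee",
    "node_rank",
    "master_addr",
    "master_port",
]
```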

@muellerzr muellerzr merged commit 9fd08d7 into main Aug 10, 2022
@muellerzr muellerzr deleted the subprocess-remover branch August 10, 2022 15:00

Labels

enhancement New feature or request


3 participants