Enable NCCL_ASYNC_ERROR_HANDLING in torchelastic (pytorch#133)
Summary:
Pull Request resolved: pytorch#133

NCCL Async Error Handling is a new mechanism implemented in ProcessGroupNCCL to provide reliability for DDP training runs using NCCL. See here for a more detailed background and implementation details: pytorch/pytorch#46874.

At a high level, this system is designed to ensure that desynchronization, hangs with high GPU utilization, and NCCL errors do not cause distributed training runs to hang indefinitely. It catches these failures without any perf impact and brings down the training process, so torchelastic can detect the failure and restart training from the previous checkpoint. The time after which stuck collectives are detected can be tuned using the `timeout` argument to `init_process_group`.
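As a hedged sketch of what this looks like from a training script's perspective: the environment variable is what this commit sets automatically for each worker, and the `timeout` value below is a hypothetical example, not a recommended default.

```python
import os
from datetime import timedelta

# torchelastic sets this for each worker (this commit); shown here
# only to illustrate what a script would do when run standalone.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Stuck collectives are detected after `timeout` elapses.
# 30 minutes is an illustrative value, not a recommendation.
timeout = timedelta(minutes=30)

# In a real DDP run, the timeout is passed to process group init:
# torch.distributed.init_process_group(backend="nccl", timeout=timeout)
```

With both in place, a hung or errored collective tears down the worker process after `timeout` instead of blocking forever, and the elastic agent can schedule a restart.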

Reviewed By: kiukchung

Differential Revision: D23610237

fbshipit-source-id: 0183ce0fe0f7c3e6d615c352183ae74fd0bee854
osalpekar authored and facebook-github-bot committed Nov 10, 2020
1 parent 1b8ca31 commit bd65a04
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions torchelastic/agent/server/local_elastic_agent.py
@@ -100,6 +100,7 @@ def _get_worker_env(dist_info: _DistInfo, local_rank: int) -> Dict[str, str]:
worker_env["TORCHELASTIC_MAX_RESTARTS"] = str(dist_info.max_restarts)
worker_env["TORCHELASTIC_RUN_ID"] = dist_info.run_id
worker_env["TORCHELASTIC_ERROR_DIR"] = get_error_dir()
worker_env["NCCL_ASYNC_ERROR_HANDLING"] = str(1)
if "OMP_NUM_THREADS" in os.environ:
worker_env["OMP_NUM_THREADS"] = os.environ["OMP_NUM_THREADS"]
return worker_env
