check port availability only in main deepspeed/torchrun launcher #2078

Merged 3 commits on Nov 17, 2023
10 changes: 8 additions & 2 deletions src/accelerate/utils/launch.py
@@ -128,7 +128,10 @@ def prepare_multi_gpu_env(args: argparse.Namespace) -> Dict[str, str]:
     if main_process_port is None:
         main_process_port = 29500

-    if is_port_in_use(main_process_port):
+    # only need to check port availability in main process, in case we have to start multiple launchers on the same machine
+    # for some reasons like splitting log files.
+    need_port_check = num_machines <= 1 or int(args.machine_rank) == 0
+    if need_port_check and is_port_in_use(main_process_port):
pacman100 marked this conversation as resolved.
Comment on lines +133 to +134

Collaborator:
Just a small nit, though. Thinking on this more, can we add a comment here mentioning how/why we should only check on the main node, in case others want to learn why as well? :)

Collaborator:
cc @Jingru :)

Contributor Author (@Jingru, Nov 4, 2023):
> cc @Jingru :)

@muellerzr I have already added some comments.

         raise ConnectionError(
             f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
             "Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file)"
@@ -272,7 +275,10 @@ def prepare_deepspeed_cmd_env(args: argparse.Namespace) -> Tuple[List[str], Dict
     if main_process_port is None:
         main_process_port = 29500

-    if is_port_in_use(main_process_port):
+    # only need to check port availability in main process, in case we have to start multiple launchers on the same machine
+    # for some reasons like splitting log files.
+    need_port_check = num_machines <= 1 or int(args.machine_rank) == 0
+    if need_port_check and is_port_in_use(main_process_port):
         raise ConnectionError(
             f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
             "Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file)"
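
To illustrate the gating condition this change introduces, here is a minimal, self-contained Python sketch. It is not the code in `src/accelerate/utils/launch.py`: `probe_port`, `ensure_port_available`, and the function form of `need_port_check` are hypothetical names introduced for illustration, and the probe is only a stand-in for the library's `is_port_in_use`. The assumption is that only the rank-0 launcher listens on `main_process_port`, so a second launcher on the same machine with a non-zero `machine_rank` would otherwise find the port busy precisely because the main launcher already holds it.

```python
import socket
from typing import Callable


def probe_port(port: int, host: str = "127.0.0.1") -> bool:
    """Hypothetical stand-in probe: True if something accepts connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return s.connect_ex((host, port)) == 0


def need_port_check(num_machines: int, machine_rank) -> bool:
    """Same condition as in the diff: check on single-machine runs or on the rank-0 launcher."""
    return num_machines <= 1 or int(machine_rank) == 0


def ensure_port_available(
    num_machines: int,
    machine_rank,
    port: int = 29500,
    port_probe: Callable[[int], bool] = probe_port,
) -> None:
    """Raise ConnectionError only when this launcher is the one expected to own the port."""
    # Short-circuit: non-main launchers never run the probe at all.
    if need_port_check(num_machines, machine_rank) and port_probe(port):
        raise ConnectionError(f"Port {port} is already in use by another process.")
```

With this shape, `ensure_port_available(2, 1, 29500)` returns without probing the port, while `ensure_port_available(2, 0, 29500)` raises `ConnectionError` if the port is occupied. The probe is passed in as a parameter only so the logic can be exercised without opening real sockets.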