check port availability only in main deepspeed/torchrun launcher #2078

Merged: 3 commits merged into huggingface:main on Nov 17, 2023

Conversation

Contributor

@Jingru Jingru commented Oct 24, 2023

What does this PR do?

Currently, main_process_port availability is checked by every deepspeed launcher (in the multi-node scenario), but this check is only necessary for the main launcher, whose machine rank is 0.
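
For readers unfamiliar with the check being discussed, here is a minimal sketch of how a "port in use" probe can be written with the standard library. The helper name matches the one called in the reviewed code, but the body and the `host` parameter are assumptions for illustration, not necessarily how the project implements it.

```python
import socket


def is_port_in_use(port: int, host: str = "localhost") -> bool:
    """Return True if something is accepting TCP connections on (host, port).

    Illustrative sketch only: the `host` parameter and this exact body are
    assumptions, not taken from the pull request.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when the connection succeeds, which means a
        # listener already occupies the port.
        return s.connect_ex((host, port)) == 0
```

A probe like this only describes the machine it runs on. That is presumably why it matters only where the main process port will actually be opened; on the other machines the local result says nothing about the main machine.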

Who can review?

@muellerzr
Collaborator

This is great @Jingru, can we also do this for the torchrun/multi-gpu launcher since it would be good to have the same thing there? Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Jingru Jingru changed the title from "check port availability only in main deepspeed launcher" to "check port availability only in main deepspeed/torchrun launcher" Oct 25, 2023
Contributor Author

Jingru commented Oct 25, 2023

> This is great @Jingru, can we also do this for the torchrun/multi-gpu launcher since it would be good to have the same thing there? Thanks!

Sure thing. Already updated.

Collaborator

@muellerzr muellerzr left a comment

Thanks! This looks great to me.

Comment on lines +131 to +132
need_port_check = num_machines <= 1 or int(args.machine_rank) == 0
if need_port_check and is_port_in_use(main_process_port):
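
The two lines above can be read as a small, self-contained decision rule. The sketch below restates it with hypothetical function names (`should_check_port`, `ensure_port_free`) and an injected `port_in_use` callable so that it runs without touching the network. The error type and message are assumptions, not taken from the diff.

```python
from typing import Callable, Union


def should_check_port(num_machines: int, machine_rank: Union[int, str]) -> bool:
    """Same condition as in the reviewed lines.

    Single-machine runs always check; multi-machine runs check only on the
    machine whose rank is 0.
    """
    return num_machines <= 1 or int(machine_rank) == 0


def ensure_port_free(
    num_machines: int,
    machine_rank: Union[int, str],
    port: int,
    port_in_use: Callable[[int], bool],
) -> None:
    # `port_in_use` is passed in so the rule can be exercised with a stub.
    if should_check_port(num_machines, machine_rank) and port_in_use(port):
        # Hypothetical error; the real launcher may report this differently.
        raise ConnectionError(f"Port {port} is already in use on the main machine.")
```

With a stub that always reports the port as busy, `ensure_port_free(2, 1, 29500, lambda p: True)` returns quietly, while the same call with `machine_rank=0` raises.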
Collaborator

Just a small nit, thinking on this more: can we add a comment here explaining how/why we only need to check on the main node, in case others want to learn why as well? :)

Collaborator

cc @Jingru :)

Contributor Author

@Jingru Jingru Nov 4, 2023

> cc @Jingru :)

@muellerzr I've already added some comments.

Member

@BenjaminBossan BenjaminBossan left a comment

Apart from Zach's comment, LGTM, thanks.

add comments
Contributor Author

Jingru commented Nov 4, 2023

comments added

Contributor

@pacman100 pacman100 left a comment

Thank you @Jingru for fixing this issue. I left a comment.

src/accelerate/utils/launch.py (resolved)
@muellerzr muellerzr merged commit cf745c9 into huggingface:main Nov 17, 2023
8 of 24 checks passed
5 participants