
Add example for Gloo backend #67

Closed
johnugeorge opened this issue Sep 5, 2018 · 11 comments

Comments

@johnugeorge
Member

Related to #7

@royxue

royxue commented Sep 5, 2018

I just tested the gloo backend and it works; there is no need to change the existing mnist example much.
Just replace the backend, add GPU support, and add destroy_process_group.

If you need help, I can make a PR.
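
For reference, a minimal sketch of the three changes described above, assuming the operator injects the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables; the model and training loop are placeholders, not the example's actual code:

```python
import torch
import torch.distributed as dist


def main():
    # Replace the backend: initialize gloo instead of the previous backend.
    # Assumes the operator injects MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE, so env:// initialization picks them up.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Add GPU support: use CUDA when it is available, otherwise stay on CPU
    # (gloo handles both).
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # ... build the mnist model on `device` and run the existing training
    # loop here; the loop itself does not need to change ...

    # Add destroy_process_group so gloo tears down cleanly; skipping it is
    # what triggers the EnforceNotMet error discussed later in this thread.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```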

@royxue

royxue commented Sep 5, 2018

However, it is a bit strange: the gloo backend only works with the v1alpha2 version and fails on v1alpha1.

@gaocegege
Member

We do not maintain v1alpha1 anymore, so that works for us, IMO.

@royxue

royxue commented Sep 5, 2018

There is still one issue left with the gloo backend that I was looking into:
the worker pod reaches Completed status after destroy_process_group, but the master pod does not and keeps restarting.

This might be related to the PyTorch version and the NCCL version.

@gaocegege
Member

gaocegege commented Sep 5, 2018

We can dive into the issue. If we cannot fix it, we can follow tf-operator and add a CleanPodPolicy to delete the master pod after the job finishes.

@johnugeorge
Member Author

@royxue Why does the master pod restart? Currently the RestartPolicy is OnFailure. Does it fail?

@royxue

royxue commented Sep 5, 2018

Yep, it throws an error, which is treated as a failure. I can provide the error info later.

Also, for example, if you don't call destroy_process_group, gloo throws an EnforceNotMet error, as shown in pytorch/pytorch#2530, which causes all of the pods to restart.
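
One hedged way to avoid that missing-teardown failure mode is to guard destroy_process_group in a finally block, so gloo is shut down even when training raises. A sketch under that assumption, not the example's actual code:

```python
import torch.distributed as dist


def run():
    dist.init_process_group(backend="gloo", init_method="env://")
    try:
        # ... the mnist training loop goes here ...
        pass
    finally:
        # Tear the process group down even if training raises, so gloo
        # shuts down cleanly and the pod can exit instead of failing with
        # the EnforceNotMet / connection-abort errors mentioned in this thread.
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    run()
```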

@royxue

royxue commented Sep 5, 2018

@johnugeorge
The error is:

```
terminate called after throwing an instance of 'std::system_error'
  what():  Software caused connection abort
```

@johnugeorge
Member Author

Looks like a PyTorch bug.

@royxue

royxue commented Sep 5, 2018

I think PyTorch didn't stop the gloo process correctly when the distributed training finished.

@johnugeorge
Member Author

This issue is not reproduced with PyTorch 0.4.
