Add example for Gloo backend #67
Comments
I just tested it, and the gloo backend works without many changes to the existing mnist test. If you need help, I could make a PR. |
However, strangely, the gloo backend only works with the v1alpha2 version; it fails on v1alpha1. |
We do not maintain v1alpha1 anymore, so IMO that works for us. |
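For readers following along, here is a minimal sketch of what selecting the gloo backend can look like in a training script. It is not the mnist example from this repository. The helper names `init_gloo` and `all_reduce_sum` are hypothetical, and the default address and port are assumptions for a single-process local run; inside a real job these values would come from the environment.

```python
import os

import torch
import torch.distributed as dist


def init_gloo(rank=0, world_size=1, master_addr="127.0.0.1", master_port="29500"):
    """Initialise the default process group with the gloo backend.

    Hypothetical helper: the defaults are assumptions for a local,
    single-process run. Existing MASTER_ADDR / MASTER_PORT values win.
    """
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)
    # With no init_method given, PyTorch uses the env:// rendezvous.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)


def all_reduce_sum(value):
    """Sum a scalar across all ranks (all_reduce defaults to SUM)."""
    t = torch.tensor([float(value)])
    dist.all_reduce(t)
    return t.item()
```

With `world_size=1` the reduction returns the input unchanged, which makes it a quick smoke test that the backend initialises on a CPU-only machine.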
There is still one issue left with the gloo backend that I am looking into. It might be related to the pytorch and NCCL versions. |
We can dig into the issue. If we cannot fix it, we can follow tf-operator and add a cleanpodpolicy to delete the master pod after the job finishes. |
@royxue Why does the master pod restart? Currently RestartPolicy is OnFailure. Does it fail? |
Yep, it throws an error, which is regarded as a failure. I can provide the error info later. Also, for example, if you didn't call
@johnugeorge
|
Looks like a pytorch bug. |
I think pytorch didn't stop the gloo process correctly when distributed training finished. |
This issue does not reproduce in pytorch:04
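One pattern sometimes used when a worker does not exit cleanly is to tear the process group down explicitly once training returns. The sketch below illustrates that general pattern only. It is an assumption that it is relevant here, and it has not been verified against the error discussed in this thread. `run_distributed` and `train_fn` are hypothetical names, and the address and port defaults are placeholders for a local run.

```python
import os

import torch.distributed as dist


def run_distributed(train_fn, rank=0, world_size=1):
    """Run train_fn inside a gloo process group and always tear it down.

    Hypothetical wrapper: train_fn is any zero-argument callable.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    try:
        return train_fn()
    finally:
        # Release the group even if train_fn raises, so the process
        # does not exit with backend resources still open.
        dist.destroy_process_group()
```

If `train_fn` raises, the exception still propagates after cleanup, so a RestartPolicy of OnFailure would still see a genuine training failure as a failure.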
Related to #7