
Add example for Gloo backend #67

Closed
johnugeorge opened this issue Sep 5, 2018 · 11 comments

Comments

@johnugeorge
Member

Related to #7

@royxue

royxue commented Sep 5, 2018

I just tested the gloo backend and it works; there is no need to change the existing mnist example much.
Just replace the backend, add GPU support, and add destroy_process_group.

If you need help, I can make a PR.
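
For reference, a minimal sketch of the three changes described above, assuming the operator injects the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables; the model and training loop are placeholders, not the example's actual code:

```python
import torch
import torch.distributed as dist


def main():
    # Replace the backend: initialize gloo instead of the previous backend.
    # Assumes the operator injects MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE, so env:// initialization picks them up.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Add GPU support: use CUDA when it is available, otherwise stay on CPU
    # (gloo handles both).
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # ... build the mnist model on `device` and run the existing training
    # loop here; the loop itself does not need to change ...

    # Add destroy_process_group so gloo tears down cleanly; skipping it is
    # what triggers the EnforceNotMet error discussed later in this thread.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```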

@royxue

royxue commented Sep 5, 2018

However, it is a bit strange: the gloo backend only works with the v1alpha2 version and fails on v1alpha1.

@gaocegege
Member

We do not maintain v1alpha1 anymore, so that works for us, IMO.

@royxue

royxue commented Sep 5, 2018

There is still one issue left with the gloo backend that I was looking into:
the worker pod reaches Completed status after destroy_process_group, but the master pod does not and keeps restarting.

This might be related to the PyTorch version and the NCCL version.

@gaocegege
Member

gaocegege commented Sep 5, 2018

We can dive into the issue. If we cannot fix it, we can follow tf-operator and add a CleanPodPolicy to delete the master pod after the job finishes.

@johnugeorge
Member Author

@royxue Why does the master pod restart? Currently the RestartPolicy is OnFailure. Does it fail?

@royxue

royxue commented Sep 5, 2018

Yep, it throws an error, which is treated as a failure. I can provide the error info later.

Also, for example, if you don't call destroy_process_group, gloo throws an EnforceNotMet error, as shown in pytorch/pytorch#2530, which causes all of the pods to restart.
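
One hedged way to avoid that missing-teardown failure mode is to guard destroy_process_group in a finally block, so gloo is shut down even when training raises. A sketch under that assumption, not the example's actual code:

```python
import torch.distributed as dist


def run():
    dist.init_process_group(backend="gloo", init_method="env://")
    try:
        # ... the mnist training loop goes here ...
        pass
    finally:
        # Tear the process group down even if training raises, so gloo
        # shuts down cleanly and the pod can exit instead of failing with
        # the EnforceNotMet / connection-abort errors mentioned in this thread.
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    run()
```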

@royxue

royxue commented Sep 5, 2018

@johnugeorge
The error is:

```
terminate called after throwing an instance of 'std::system_error'
  what():  Software caused connection abort
```

@johnugeorge
Member Author

Looks like a PyTorch bug.

@royxue

royxue commented Sep 5, 2018

I think PyTorch didn't stop the gloo process correctly when the distributed training finished.

@johnugeorge
Member Author

This issue is not reproduced with PyTorch 0.4.
