MXNet distributed training #122

vandanavk · 2019-06-25T19:01:51Z

Adding an example of distributed training using Apache MXNet.

googlebot · 2019-06-25T19:01:53Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

k8s-ci-robot · 2019-06-25T19:01:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jlewi
You can assign the PR to them by writing /assign @jlewi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2019-06-25T19:02:05Z

Hi @vandanavk. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rongou · 2019-06-25T21:22:33Z

examples/mxnet/Dockerfile

@@ -0,0 +1,27 @@
+FROM horovod/horovod:0.16.2-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5 AS build


You don't need the AS build part.

rongou · 2019-06-25T21:28:29Z

examples/mxnet/Dockerfile

@@ -0,0 +1,27 @@
+FROM horovod/horovod:0.16.2-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5 AS build
+
+# Create a wrapper for OpenMPI to allow running as root by default


These lines seem to have been removed in horovod's Dockerfile in favor of horovodrun. Have you tried to use it?

vandanavk · 2019-07-01

I haven't tried horovodrun. Should this example be using horovodrun?

rongou · 2019-07-01

I guess it's up to you. We can certainly do it at a later time.

wackxu · 2019-06-26T09:28:37Z

examples/mxnet/mxnet-mnist.yaml

@@ -0,0 +1,50 @@
+apiVersion: kubeflow.org/v1alpha1


Could you add this example for v1alpha2 version?

gaocegege · 2019-06-26

Thanks for your contribution! 🎉 👍

Can you add a README to illustrate how to use the example?

terrytangyuan · 2019-06-26T12:52:59Z

Thank you @vandanavk!

terrytangyuan · 2019-06-28T17:49:26Z

examples/mxnet/mxnet_mnist.py

+                     train_acc, name, val_acc)
+
+    if hvd.rank() == 0 and epoch == args.epochs - 1:
+        assert val_acc > 0.96, "Achieved accuracy (%f) is lower than expected\


Avoid using '\' for line continuation.
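One way to drop the backslash is to wrap the message in parentheses and rely on Python's implicit concatenation of adjacent string literals. The sketch below is only an illustration: `accuracy_message`, `check_final_accuracy`, the `threshold` parameter, and the tail of the message are stand-in names and wording, not taken from the PR, and `rank` stands in for the value `hvd.rank()` would return.

```python
def accuracy_message(val_acc, threshold):
    # Adjacent string literals inside parentheses are joined by the parser,
    # so the message can span lines without a trailing backslash.
    # The wording after "expected" is hypothetical.
    return (
        "Achieved accuracy (%f) is lower than "
        "expected (%f)" % (val_acc, threshold)
    )


def check_final_accuracy(rank, epoch, total_epochs, val_acc, threshold=0.96):
    # Only the first worker checks, and only after the last epoch,
    # mirroring the condition in the diff above.
    if rank == 0 and epoch == total_epochs - 1:
        assert val_acc > threshold, accuracy_message(val_acc, threshold)
    return True
```

A side benefit of parentheses over a backslash inside the string is that the indentation of the continuation line does not leak into the message text.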

terrytangyuan · 2019-06-28T17:51:23Z

examples/mxnet/mxnet-mnist.yaml

@@ -0,0 +1,50 @@
+apiVersion: kubeflow.org/v1alpha2


Have you verified that this is working? The spec still looks like v1alpha1.

googlebot · 2019-07-01T17:32:43Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

gaocegege · 2019-07-03T09:26:41Z

/ok-to-test

MXNet distributed training

8cacda2

k8s-ci-robot requested review from cheyang and fisherxu June 25, 2019 19:01

k8s-ci-robot added the needs-ok-to-test label Jun 25, 2019

k8s-ci-robot added the size/L label Jun 25, 2019

rongou reviewed Jun 25, 2019

wackxu reviewed Jun 26, 2019

gaocegege reviewed Jun 26, 2019

change apiVersion

163aed7

terrytangyuan reviewed Jun 28, 2019

Addressed some review comments

4ee4c23

Newline related comments

Revert "change apiVersion"

668ab59

This reverts commit 163aed7.

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jul 3, 2019

terrytangyuan approved these changes Jul 3, 2019

terrytangyuan merged commit 6f627a8 into kubeflow:master Jul 3, 2019
