This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

Support BytePS #82

Merged
merged 2 commits into from Jul 4, 2020


jasonliu747
Member

BytePS is a high-performance, generic framework for distributed DNN training. To start a job, it needs a scheduler, servers, and workers, which is similar to MXNet.

After a short discussion with @suleisl2000, we agreed it would be OK to launch BytePS jobs using mxnet-operator. This minor change only injects the DMLC_WORKER_ID environment variable into each worker; everything else remains unchanged.
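
A minimal sketch of the idea, not the code in this PR: each worker replica gets the same DMLC settings plus its own DMLC_WORKER_ID. The helper name `worker_env`, the placeholder values, and the assumption that the ID equals the zero-based replica index are all hypothetical and only for illustration.

```python
def worker_env(base_env, worker_index):
    """Return a copy of base_env with DMLC_WORKER_ID set for one worker.

    Assumption for illustration: the worker ID is the zero-based replica index.
    """
    env = dict(base_env)  # copy so the shared settings stay untouched
    env["DMLC_WORKER_ID"] = str(worker_index)  # env values are strings
    return env


# Shared DMLC settings; every value here is a placeholder.
base = {
    "DMLC_PS_ROOT_URI": "scheduler-host",
    "DMLC_PS_ROOT_PORT": "9091",
    "DMLC_NUM_SERVER": "1",
    "DMLC_NUM_WORKER": "2",
}

# One environment per worker replica.
envs = [worker_env(base, i) for i in range(int(base["DMLC_NUM_WORKER"]))]
```

Scheduler and server replicas would keep using `base` as-is, which matches the "everything else stays the same" intent described above.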

@kubeflow-bot

This change is Reviewable

@k8s-ci-robot

Hi @jasonliu747. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@terrytangyuan terrytangyuan left a comment

Since BytePS is similar to Horovod, would it be better suited to run with MPI Operator?

@jasonliu747
Member Author

Exactly what I was thinking at first! Although BytePS is invoked in much the same way as Horovod, it's totally different at its core: BytePS is based on PS-Lite, while Horovod is based on MPI concepts. Therefore, it might be better suited to run with MXNet-Operator.

Member

@terrytangyuan terrytangyuan left a comment

I am fine with this for now since BytePS relies on DMLC envs, similar to MXNet. We can continue discussing this in the future.

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit fe5865a into kubeflow:master Jul 4, 2020
3 of 4 checks passed
@jasonliu747 jasonliu747 deleted the byteps branch July 4, 2020 02:37
@GuoHaiqing GuoHaiqing mentioned this pull request Apr 7, 2021