Hi @jasonliu747. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Since BytePS is similar to Horovod, would it be better suited to run with MPI Operator?
Exactly what I was thinking at first! Although BytePS is invoked much like Horovod, it is totally different at its core: BytePS is based on PS-Lite, while Horovod is based on MPI concepts. Therefore, it might be better suited to run with MXNet-Operator.
I am fine with this for now, since BytePS relies on DMLC envs, similar to MXNet. We can continue discussing this in the future.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: terrytangyuan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
BytePS is a high-performance, generic framework for distributed DNN training. To start successfully, it needs a scheduler, servers, and workers, which is similar to MXNet.
After a short discussion with @suleisl2000, we both agreed it would be OK to launch a BytePS job using mxnet-operator. This minor change only injects the env DMLC_WORKER_ID into each worker, while the other replicas remain unchanged.
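To illustrate the idea only, here is a minimal Python sketch of the per-replica env injection described above. It is not the operator's actual Go code: the helper name `build_replica_env`, the base env values, and the use of the replica index as the worker ID are all assumptions made for this example.

```python
from typing import Dict


def build_replica_env(base_env: Dict[str, str], role: str, index: int) -> Dict[str, str]:
    """Return a copy of base_env for one replica (hypothetical helper).

    Every replica gets DMLC_ROLE; only workers also get DMLC_WORKER_ID.
    """
    env = dict(base_env)  # copy so the shared base env is never mutated
    env["DMLC_ROLE"] = role
    if role == "worker":
        # Assumption: the worker ID is simply the replica index.
        env["DMLC_WORKER_ID"] = str(index)
    return env


# Placeholder values for the DMLC envs shared by all replicas.
base = {
    "DMLC_PS_ROOT_URI": "scheduler-0",
    "DMLC_PS_ROOT_PORT": "9091",
    "DMLC_NUM_SERVER": "1",
    "DMLC_NUM_WORKER": "2",
}

if __name__ == "__main__":
    for role, count in (("scheduler", 1), ("server", 1), ("worker", 2)):
        for i in range(count):
            print(role, i, build_replica_env(base, role, i))
```

In this sketch the scheduler and server envs differ from the base only by DMLC_ROLE, which mirrors the "others remain unchanged" intent of the change.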