-
Notifications
You must be signed in to change notification settings - Fork 228
create launcher and worker togther #132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @davidstack. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign @terrytangyuan
Did you have a try? We create launcher and workers at the same time in the PR. I am wondering what will happen if the launcher cannot find all workers.
| : ${TARGET_DIR:?"Need to set TARGET_DIR, e.g. /opt/kube"} | ||
|
|
||
| cp $(which kubectl) ${TARGET_DIR} | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the script here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to check the worker is running, if the worker is in other state, the init contianer will be in loop,just waiting for the worker running
|
@gaocegege in my dev env, i test ok ,it will create lancher and worker togther (waiting kube-batch schedule ). caffe-mpi-worker-0 slots=2 the status of mpi woker and lancher is `[root@node2 mpi]# kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES |
|
Thanks, I understand the meaning of the script. BTW, @terrytangyuan @wackxu Please take a look. |
|
@davidstack The CI is failed, Could you fix the CI first? |
| return err | ||
| } | ||
| } | ||
| // If the worker is ready, start the launcher. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the commented code since it's no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
@wackxu i find three test function failed. but it always has this error could you give me some suggestion to change this ci test? thanks |
|
@wackxu i change the CI ,and all tests PASSED,but it still have this error
|
|
@davidstack You can try to run |
|
@wackxu thanks, everything is ok. |
| // If the worker is ready, start the launcher. | ||
| workerReady := worker.Status.ReadyReplicas == workerReplicas | ||
| if workerReady && launcher == nil { | ||
| _, lancherErr := c.kubeClient.BatchV1().Jobs(namespace).Get(mpiJob.Name+launcherSuffix, metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need get mpijob from kube-apiserver again, we have just get launcher above from the lister, so just judge whether the launcher is nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
| return err | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
| func (f *fixture) expectUpdateJobAction(d *batchv1.Job) { | ||
| f.kubeActions = append(f.kubeActions, core.NewUpdateAction(schema.GroupVersionResource{Resource: "jobs"}, d.Namespace, d)) | ||
| } | ||
| func (f *fixture) expectGetJobAction(d *kubeflow.MPIJob) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add one blank line before the func
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
|
@wackxu have any other problem? |
gaocegege
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ok-to-test
|
@davidstack Could you rebase the commit to one? then lgtm |
|
/assign @terrytangyuan @rongou for approval |
|
@wackxu: GitHub didn't allow me to assign the following users: for, approval. Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wackxu the commit is one commit now,thanks |
|
/lgtm |
| done; | ||
|
|
||
| #/etc/mpi/hostfile | ||
| #caffe-mpi-worker-0 slots=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? Please remove other unneeded code/comment as well.
|
New changes are detected. LGTM label has been removed. |
terrytangyuan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
for issue #131 ,this commit will create lancher and worker togther, in the init container of lancher, the init container will wait the worker untill running
This change is