@davidstack
Contributor

@davidstack davidstack commented Jul 17, 2019

For issue #131: this commit creates the launcher and workers together. The launcher's init container waits until the workers are running.


@k8s-ci-robot

Hi @davidstack. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@gaocegege gaocegege left a comment

/assign @terrytangyuan

Did you try it out? This PR creates the launcher and workers at the same time, so I am wondering what happens if the launcher cannot find all the workers.

: ${TARGET_DIR:?"Need to set TARGET_DIR, e.g. /opt/kube"}

cp $(which kubectl) ${TARGET_DIR}

Member

What's the purpose of the script here?

Contributor Author

It checks that the worker is running. If the worker is in any other state, the init container keeps looping, waiting for the worker to be running.

@davidstack
Contributor Author

@gaocegege I tested it in my dev environment and it works: the launcher and workers are created together (waiting for kube-batch to schedule them).
The init container log is:
`[root@node3 ~]# docker logs -f e953b1fef837

caffe-mpi-worker-0 slots=2
1
caffe-mpi-worker-1 slots=2
2
worker name is caffe-mpi-worker-0
Unable to use a TTY - input is not a terminal or the right kind of file
hello is hello
caffe-mpi-worker-0 is running..
worker name is caffe-mpi-worker-1
Unable to use a TTY - input is not a terminal or the right kind of file
hello is hello
caffe-mpi-worker-1 is running..`

The status of the MPI workers and launcher is:

`[root@node2 mpi]# kubectl get pods -o wide

NAME                       READY   STATUS              RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
caffe-mpi-launcher-pssbh   0/1     Init:0/1            0          2s    <none>   node2   <none>           <none>
caffe-mpi-worker-0         0/1     ContainerCreating   0          2s    <none>   node2   <none>           <none>
caffe-mpi-worker-1         0/1     ContainerCreating   0          2s    <none>   node3   <none>           <none>
`

@gaocegege
Member

@davidstack

Thanks, I understand the meaning of the script.

BTW, @terrytangyuan @wackxu Please take a look.

@wackxu
Contributor

wackxu commented Jul 23, 2019

@davidstack The CI failed. Could you fix the CI first?

return err
}
}
// If the worker is ready, start the launcher.
Member

Remove the commented code since it's no longer needed?

Contributor Author

ok

@davidstack
Contributor Author

@wackxu I found that three test functions fail:
TestLauncherRestarting
TestLauncherDoesNotExist
TestLauncherActive

They always fail with this error:
error syncing mpi job: jobs.batch "test-launcher" already exists

Could you give me some suggestions on how to change these CI tests? Thanks.

@davidstack
Contributor Author

@wackxu I changed the CI tests and all tests passed, but there is still this error:

The command "gometalinter --config=linter_config.json --vendor ./..." exited with 1.
Any suggestions? Thanks.

@wackxu
Contributor

wackxu commented Jul 29, 2019

@davidstack You can try to run gofmt -s -w yourfile to fix this.

@davidstack
Contributor Author

@wackxu thanks, everything is ok.

// If the worker is ready, start the launcher.
workerReady := worker.Status.ReadyReplicas == workerReplicas
if workerReady && launcher == nil {
_, lancherErr := c.kubeClient.BatchV1().Jobs(namespace).Get(mpiJob.Name+launcherSuffix, metav1.GetOptions{})
Contributor

We do not need to fetch the job from kube-apiserver again. We have just got the launcher from the lister above, so just check whether the launcher is nil.

Contributor Author

got it
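
The suggestion in this thread (rely on the launcher already read from the lister instead of issuing a second Get) can be sketched as follows. This is a minimal, self-contained illustration, not the controller code from this PR: `job`, `lister`, `apiClient`, and `syncLauncher` are hypothetical stand-ins for the real Job type, the informer-backed lister, and the Kubernetes client.

```go
package main

import "fmt"

// job is a minimal stand-in for a batch Job object, used only in this sketch.
type job struct{ name string }

// lister is a hypothetical in-memory cache standing in for an
// informer-backed lister.
type lister map[string]*job

// get returns the cached job, or nil when the cache has no entry for it.
func (l lister) get(name string) *job { return l[name] }

// apiClient is a hypothetical stand-in for the API server client.
// It only counts create calls so the behaviour can be observed.
type apiClient struct{ creates int }

func (c *apiClient) create(j *job) { c.creates++ }

// syncLauncher creates the launcher only when the lister has no entry
// for it. It never issues a separate Get against the API server.
func syncLauncher(l lister, c *apiClient, name string) *job {
	launcher := l.get(name)
	if launcher == nil {
		launcher = &job{name: name}
		c.create(launcher)
		// Assumption: in a real controller the informer would add the
		// new object to the cache; the sketch does it by hand.
		l[name] = launcher
	}
	return launcher
}

func main() {
	cache := lister{}
	client := &apiClient{}
	syncLauncher(cache, client, "test-launcher")
	syncLauncher(cache, client, "test-launcher")
	fmt.Println("create calls:", client.creates)
}
```

The second call finds the launcher in the cache and skips the create, so a single nil check is enough to decide whether the launcher has to be created.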

return err
}
}

Contributor

Remove this blank line

Contributor Author

got it

func (f *fixture) expectUpdateJobAction(d *batchv1.Job) {
f.kubeActions = append(f.kubeActions, core.NewUpdateAction(schema.GroupVersionResource{Resource: "jobs"}, d.Namespace, d))
}
func (f *fixture) expectGetJobAction(d *kubeflow.MPIJob) {
Contributor

Add one blank line before the func

Contributor Author

got it

@davidstack
Contributor Author

@wackxu Are there any other problems?

Member

@gaocegege gaocegege left a comment

/ok-to-test

@wackxu
Contributor

wackxu commented Aug 6, 2019

@davidstack Could you squash the commits into one? Then LGTM.

@wackxu
Contributor

wackxu commented Aug 6, 2019

/assign @terrytangyuan @rongou for approval

@k8s-ci-robot

@wackxu: GitHub didn't allow me to assign the following users: for, approval.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @terrytangyuan @rongou for approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davidstack
Contributor Author

@wackxu It is a single commit now, thanks.

@wackxu
Contributor

wackxu commented Aug 6, 2019

/lgtm

@terrytangyuan terrytangyuan changed the title from "create lancher and worker togther" to "create launcher and worker togther" on Aug 13, 2019
done;

#/etc/mpi/hostfile
#caffe-mpi-worker-0 slots=1
Member

Is this needed? Please remove other unneeded code/comment as well.

@k8s-ci-robot

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 15, 2019
Member

@terrytangyuan terrytangyuan left a comment

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
