
[Discussion] Switch to use Job API with ArrayJob semantics #315

Closed
terrytangyuan opened this issue Jan 6, 2021 · 21 comments · Fixed by #360

@terrytangyuan
Member

From Aldo Culquicondor:

I left you a message in k8s Slack, but posting it here too in case you are not so active there.

I'm a sig-scheduling maintainer and I'm now looking into supporting array Jobs: kubernetes/kubernetes#97169
We want this to enable MPI workloads on the Job API. I noticed that the MPI operator uses a StatefulSet. That seems like the best idea today. I'm wondering what it would take to migrate to Job, other than having a stable host name.

For extra context, we are also looking into having Job level orchestration for provisioning and scheduling. We have published this doc on how we envision it to work: bit.ly/k8s-job-management

I hope these topics interest you. Feel free to answer here or on Slack.

Related issue: kubernetes/kubernetes#97169

cc @alculquicondor @ahg-g @kubeflow/wg-training-leads

Let's use this issue to discuss with the community.

@alculquicondor
Collaborator

One of the things I'm thinking could be a problem is stopping the Job.

In the current architecture, the driver-job triggers a scale-down of the StatefulSet. Is there a way for the runners to know when the task is done and finish by themselves?

@alculquicondor
Collaborator

Side note: we won't hold the KEP on solving all the problems for MPI. Static partitioning is enough justification for the Array Job proposal.

@gaocegege
Member

/cc @carmark

@ahg-g

ahg-g commented Jan 7, 2021

Another thing to think about is readiness status; I am not sure the Job status tracks the number of "ready" replicas like StatefulSets and ReplicaSets do. This makes sense since the Job API is not supposed to be used as a service, but it is a difference we may want to keep in mind. It may not be a major issue, though, since rank-0 could directly check the status of the individual pods before starting the MPI job.
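
For illustration, a minimal Go sketch of the kind of per-pod readiness check rank-0 (or the controller) could perform; how the pods are listed is omitted, and none of this is the operator's actual code:

```go
package main

import corev1 "k8s.io/api/core/v1"

// isPodReady reports whether a Pod has the Ready condition set to True.
// This is the kind of per-pod check rank-0 could perform before starting
// the MPI job; listing the worker pods is left out of this sketch.
func isPodReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```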

@gaocegege
Member

This makes sense since the Job API is not supposed to be used as a service, but it is a difference we may want to keep in mind.

Then I am wondering how to set the job status to running/succeeded/failed.

The driver needs to check the status (IP) of all workers in Horovod. To support elastic training, the driver needs to maintain a list of the workers' pod IPs.

Right now we plan to use a ConfigMap for that, but I do not think it is related to this new Job API.

@alculquicondor
Collaborator

Would it make sense for the driver to obtain the pod IPs by listing the worker Pods that belong to the Job?

OTOH, how do workers use the index in the Pod name coming from the StatefulSet?
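
As a rough sketch of that idea, the driver could list the worker pods with a label selector and collect their IPs; the label names here are assumptions, not the operator's real ones:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// workerIPs lists the worker pods that belong to the MPI job and returns
// their pod IPs, e.g. for building a Horovod host list.
func workerIPs(ctx context.Context, client kubernetes.Interface, ns, jobName string) ([]string, error) {
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		// Assumed labels; the real operator may use different keys.
		LabelSelector: fmt.Sprintf("mpi-job-name=%s,mpi-job-role=worker", jobName),
	})
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, p := range pods.Items {
		if p.Status.PodIP != "" {
			ips = append(ips, p.Status.PodIP)
		}
	}
	return ips, nil
}
```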

@carmark
Member

carmark commented Jan 19, 2021

@alculquicondor @gaocegege

Currently, the mpi-operator obtains status from the launcher pod (which runs the mpirun command); all the workers always stay in sync with the launcher in an MPI program. If any worker fails, the launcher will know and can react.

The driver needs to check the status (IP) of all workers in Horovod. To support elastic training, the driver needs to maintain a list of the workers' pod IPs.

The pod IPs are not necessary. By default, an MPI program uses ssh with a list of worker IPs to communicate, but we can set an environment variable (rsh_agent) to force it to use another tool. In mpi-operator, we use kubectl exec with the pod name to set up each rank.
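
Roughly, the launcher's environment points Open MPI's plm_rsh_agent at a kubectl-exec wrapper instead of ssh. A hedged Go sketch (the /etc/mpi paths and script name are illustrative, not the operator's exact layout):

```go
package main

import corev1 "k8s.io/api/core/v1"

// launcherEnv returns environment variables that steer Open MPI away from ssh
// and toward a kubectl-exec based wrapper script.
func launcherEnv() []corev1.EnvVar {
	return []corev1.EnvVar{
		// Replace the default rsh/ssh launcher with a wrapper that runs
		// `kubectl exec <pod-name> -- ...` for each remote rank.
		{Name: "OMPI_MCA_plm_rsh_agent", Value: "/etc/mpi/kubexec.sh"},
		// Default hostfile listing the stable worker pod names.
		{Name: "OMPI_MCA_orte_default_hostfile", Value: "/etc/mpi/hostfile"},
	}
}
```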

@alculquicondor
Collaborator

In mpi-operator, we use kubectl exec with the pod name to set up each rank.

That's interesting. In that case, what are the reasons to use StatefulSet instead of Deployment?

@rongou
Member

rongou commented Jan 19, 2021

The main thing we need from StatefulSet is the stable, unique pod names. When a new MPI job starts, a hostfile ConfigMap is created to list the pod names from the StatefulSet, and then the mpi command can simply use that hostfile without any further assistance from the operator. The launcher that runs the mpi command is a Job so that it can retry, report status, etc.

I suppose you could use a Deployment for the workers, wait for them to start up, query the pod IPs, put them in the hostfile, and then start the launcher. However, we would then need to monitor the pods: if one restarts, we need to find its new IP, update the hostfile, and restart the mpi command. That seems like a lot of extra work without much benefit.

The original design is outlined in this yaml file that might be easier to see: https://github.com/rongou/k8s-openmpi/blob/master/openmpi-test.yaml.
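
For reference, a minimal Go sketch of how such a hostfile ConfigMap could be assembled from the stable pod names (the naming pattern and slots value are placeholders, not the operator's exact output):

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostfileConfigMap builds a ConfigMap whose "hostfile" key lists the
// stable worker pod names, one per line, as mpirun expects.
func hostfileConfigMap(ns, jobName string, workerReplicas, slotsPerWorker int) *corev1.ConfigMap {
	var b strings.Builder
	for i := 0; i < workerReplicas; i++ {
		// StatefulSet-style stable names: <job>-worker-0, <job>-worker-1, ...
		fmt.Fprintf(&b, "%s-worker-%d slots=%d\n", jobName, i, slotsPerWorker)
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: jobName + "-hostfile", Namespace: ns},
		Data:       map[string]string{"hostfile": b.String()},
	}
}
```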

@ahg-g

ahg-g commented Jan 19, 2021

Wouldn't MPI jobs typically fail if any of the pods restart? I thought MPI is not tolerant of failures.

@rongou
Member

rongou commented Jan 19, 2021

Yes, that's why the launcher is a Job so that it can retry if one of the pods fails.

@alculquicondor
Collaborator

In that case, it sounds like the requirement for mpi-operator to migrate to Job, as a replacement for the StatefulSet, is for Job to support stable pod names.

It doesn't sound like a Pod annotation containing an index would help in any way, would it?
Any other requirements?

@rongou
Member

rongou commented Jan 20, 2021

It doesn't sound like a Pod annotation containing an index would help in any way, would it?

As long as it can be translated into a stable hostname, it should work.

@alculquicondor
Collaborator

We noticed that the operator no longer uses StatefulSet (#203).

Can you clarify the motivations for that?

@terrytangyuan
Member Author

@carmark Would you like to clarify the motivation for that change?

@carmark
Member

carmark commented Mar 18, 2021

Sure, @alculquicondor.

The StatefulSet creates pods one by one, but the worker pods do not need that. Besides, with a StatefulSet, we cannot be sure of the real status of the pods.

@ahg-g

ahg-g commented Mar 18, 2021

We are looking to expand the k8s Job API to support indexed Jobs with stable pod names. Unlike StatefulSets, it will not need to scale up and down incrementally, so the problem of slow scale-up should go away. When that is ready, mpi-operator can avoid creating independent pods and rely on the k8s Job.

What pod status are we interested in exactly? We want to make sure we capture that in pods created by k8s Job.
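
To make the direction concrete, here is a hedged sketch of what an indexed worker Job could look like, assuming the proposed completionMode: Indexed semantics land roughly as drafted (field names follow the proposal and are not final):

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerJob sketches the worker set as an indexed Job: N pods, each with a
// stable identity derived from its completion index.
func workerJob(jobName string, replicas int32) *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: jobName + "-worker"},
		Spec: batchv1.JobSpec{
			Completions:    &replicas,
			Parallelism:    &replicas,
			CompletionMode: &mode,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "mpi-worker:latest", // placeholder image
					}},
				},
			},
		},
	}
}
```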

@carmark
Member

carmark commented Mar 18, 2021

@ahg-g

We need to know the Pending/Running/Failed/Succeeded status of each Pod. With those statuses, the operator can react and take appropriate action.
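
For example, the per-phase handling could be sketched roughly like this (the event type and names are hypothetical, not the operator's actual code):

```go
package main

import corev1 "k8s.io/api/core/v1"

// WorkerEvent is a hypothetical summary of what the operator derives from a
// worker pod's phase; the real controller updates MPIJob status and decides
// whether to restart or fail the job.
type WorkerEvent int

const (
	WorkerWaiting WorkerEvent = iota
	WorkerRunning
	WorkerFinished
	WorkerFailed
)

// classifyWorker maps a worker pod's phase to the event the operator cares about.
func classifyWorker(pod *corev1.Pod) WorkerEvent {
	switch pod.Status.Phase {
	case corev1.PodRunning:
		return WorkerRunning
	case corev1.PodSucceeded:
		return WorkerFinished
	case corev1.PodFailed:
		return WorkerFailed
	default: // PodPending, PodUnknown
		return WorkerWaiting
	}
}
```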

@alculquicondor
Collaborator

Your input is welcome in kubernetes/kubernetes#99497

@alculquicondor
Collaborator

With those statuses, the operator can react and take appropriate action.

Can you expand on this?

If a Pod Fails, since we are planning "stable pod names", we have to delete it before we can replace it with another pod with the same name. How are you handling this today?
