
[Discussion] Switch to use Job API with ArrayJob semantics #315

Closed
terrytangyuan opened this issue Jan 6, 2021 · 21 comments · Fixed by #360

@terrytangyuan
Member

From Aldo Culquicondor:

I left you a message in k8s Slack, but posting it here too in case you are not so active there.

I'm a sig-scheduling maintainer and I'm now looking into supporting array Jobs: kubernetes/kubernetes#97169
We want this to enable MPI workloads on the Job API. I noticed that the MPI operator uses a StatefulSet. That seems like the best idea today. I'm wondering what it would take to migrate to Job, other than having a stable host name.

For extra context, we are also looking into having Job level orchestration for provisioning and scheduling. We have published this doc on how we envision it to work: bit.ly/k8s-job-management

I hope these topics interest you. Feel free to answer here or on Slack.

Related issue: kubernetes/kubernetes#97169

cc @alculquicondor @ahg-g @kubeflow/wg-training-leads

Let's use this issue to discuss with the community.

@alculquicondor
Collaborator

One of the things I'm thinking could be a problem is stopping the Job.

In the current architecture, the driver-job triggers a scale-down of the StatefulSet. Is there a way for the runners to know when the task is done and finish by themselves?

@alculquicondor
Collaborator

Side note: we won't hold the KEP on solving all the problems for MPI. Static partitioning is enough justification for the Array Job proposal.

@gaocegege
Member

/cc @carmark

@ahg-g

ahg-g commented Jan 7, 2021

Another thing to think about is readiness status; I am not sure the Job status tracks the number of "ready" replicas like StatefulSets and ReplicaSets do. This makes sense since the Job API is not supposed to be used as a service, but it is a difference we may want to keep in mind. It may not be a major issue, though, since rank-0 could directly check the status of the individual pods before starting the MPI job.
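
For illustration, a minimal Go sketch of the kind of per-pod readiness check rank-0 (or the controller) could perform; how the pods are listed is omitted, and none of this is the operator's actual code:

```go
package main

import corev1 "k8s.io/api/core/v1"

// isPodReady reports whether a Pod has the Ready condition set to True.
// This is the kind of per-pod check rank-0 could perform before starting
// the MPI job; listing the worker pods is left out of this sketch.
func isPodReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```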

@gaocegege
Member

This makes sense since the Job API is not supposed to be used as a service, but it is a difference we may want to keep in mind.

Then I am wondering how to set the job status to running/succeeded/failed.

The driver needs to check the status (IP) of all workers in Horovod. To support elastic training, the driver needs to maintain a list of the workers' pod IPs.

Right now we plan to use a ConfigMap for that, but I do not think it is related to this new Job API.

@alculquicondor
Collaborator

Would it make sense for the driver to obtain the pod IPs by listing the worker Pods that belong to the Job?

OTOH, how do workers use the index in the Pod name coming from the StatefulSet?
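
As a rough sketch of that idea, the driver could list the worker pods with a label selector and collect their IPs; the label names here are assumptions, not the operator's real ones:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// workerIPs lists the worker pods that belong to the MPI job and returns
// their pod IPs, e.g. for building a Horovod host list.
func workerIPs(ctx context.Context, client kubernetes.Interface, ns, jobName string) ([]string, error) {
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		// Assumed labels; the real operator may use different keys.
		LabelSelector: fmt.Sprintf("mpi-job-name=%s,mpi-job-role=worker", jobName),
	})
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, p := range pods.Items {
		if p.Status.PodIP != "" {
			ips = append(ips, p.Status.PodIP)
		}
	}
	return ips, nil
}
```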

@carmark
Member

carmark commented Jan 19, 2021

@alculquicondor @gaocegege

Currently, the mpi-operator obtains status from the launcher pod (which runs the mpirun command); all the workers always stay in sync with the launcher in an MPI program. If any worker fails, the launcher will know and can react.

The driver needs to check the status (IP) of all workers in Horovod. To support elastic training, the driver needs to maintain a list of the workers' pod IPs.

The pod IPs are not necessary. By default, an MPI program uses ssh with a list of worker IPs to communicate, but we can set an environment variable (rsh_agent) to force it to use another tool. In mpi-operator, we use kubectl exec with the pod name to set up each rank.
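
Roughly, the launcher's environment points Open MPI's plm_rsh_agent at a kubectl-exec wrapper instead of ssh. A hedged Go sketch (the /etc/mpi paths and script name are illustrative, not the operator's exact layout):

```go
package main

import corev1 "k8s.io/api/core/v1"

// launcherEnv returns environment variables that steer Open MPI away from ssh
// and toward a kubectl-exec based wrapper script.
func launcherEnv() []corev1.EnvVar {
	return []corev1.EnvVar{
		// Replace the default rsh/ssh launcher with a wrapper that runs
		// `kubectl exec <pod-name> -- ...` for each remote rank.
		{Name: "OMPI_MCA_plm_rsh_agent", Value: "/etc/mpi/kubexec.sh"},
		// Default hostfile listing the stable worker pod names.
		{Name: "OMPI_MCA_orte_default_hostfile", Value: "/etc/mpi/hostfile"},
	}
}
```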

@alculquicondor
Collaborator

In mpi-operator, we use kubectl exec with the pod name to set up each rank.

That's interesting. In that case, what are the reasons to use StatefulSet instead of Deployment?

@rongou
Member

rongou commented Jan 19, 2021

The main thing we need from StatefulSet is the stable, unique pod names. When a new MPI job starts, a hostfile ConfigMap is created to list the pod names from the StatefulSet, and then the mpi command can simply use that hostfile without any further assistance from the operator. The launcher that runs the mpi command is a Job so that it can retry, report status, etc.

I suppose you could use a Deployment for the workers, wait for them to start up, query the pod IPs, put them in the hostfile, and then start the launcher. However, we would then need to monitor the pods: if one restarts, we need to find its new IP, update the hostfile, and restart the mpi command. That seems like a lot of extra work without much benefit.

The original design is outlined in this yaml file that might be easier to see: https://github.com/rongou/k8s-openmpi/blob/master/openmpi-test.yaml.
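
For reference, a minimal Go sketch of how such a hostfile ConfigMap could be assembled from the stable pod names (the naming pattern and slots value are placeholders, not the operator's exact output):

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostfileConfigMap builds a ConfigMap whose "hostfile" key lists the
// stable worker pod names, one per line, as mpirun expects.
func hostfileConfigMap(ns, jobName string, workerReplicas, slotsPerWorker int) *corev1.ConfigMap {
	var b strings.Builder
	for i := 0; i < workerReplicas; i++ {
		// StatefulSet-style stable names: <job>-worker-0, <job>-worker-1, ...
		fmt.Fprintf(&b, "%s-worker-%d slots=%d\n", jobName, i, slotsPerWorker)
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: jobName + "-hostfile", Namespace: ns},
		Data:       map[string]string{"hostfile": b.String()},
	}
}
```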

@ahg-g

ahg-g commented Jan 19, 2021

Wouldn't MPI jobs typically fail if any of the pods restart? I thought MPI is not tolerant of failures.

@rongou
Member

rongou commented Jan 19, 2021

Yes, that's why the launcher is a Job so that it can retry if one of the pods fails.

@alculquicondor
Collaborator

In that case, it sounds like the requirement for mpi-operator to migrate to Job, as a replacement for the StatefulSet, is for Job to support stable pod names.

It doesn't sound like a Pod annotation containing an index would help in any way, would it?
Any other requirements?

@rongou
Member

rongou commented Jan 20, 2021

It doesn't sound like a Pod annotation containing an index would help in any way, would it?

As long as it can be translated into a stable hostname, it should work.

@alculquicondor
Collaborator

We noticed that the operator no longer uses StatefulSet (#203).

Can you clarify the motivations for that?

@terrytangyuan
Member Author

@carmark Would you like to clarify the motivation for that change?

@carmark
Member

carmark commented Mar 18, 2021

Sure, @alculquicondor.

The StatefulSet creates pods one by one, but the worker pods do not need that. Besides, with a StatefulSet, we cannot be sure of the real status of the pods.

@ahg-g

ahg-g commented Mar 18, 2021

We are looking to expand the k8s Job API to support indexed Jobs with stable pod names. Unlike StatefulSets, it will not need to scale up and down incrementally, so the problem of slow scale-up should go away. When that is ready, mpi-operator can avoid creating independent pods and rely on the k8s Job.

What pod status are we interested in exactly? We want to make sure we capture that in pods created by k8s Job.
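
To make the direction concrete, here is a hedged sketch of what an indexed worker Job could look like, assuming the proposed completionMode: Indexed semantics land roughly as drafted (field names follow the proposal and are not final):

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerJob sketches the worker set as an indexed Job: N pods, each with a
// stable identity derived from its completion index.
func workerJob(jobName string, replicas int32) *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: jobName + "-worker"},
		Spec: batchv1.JobSpec{
			Completions:    &replicas,
			Parallelism:    &replicas,
			CompletionMode: &mode,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "mpi-worker:latest", // placeholder image
					}},
				},
			},
		},
	}
}
```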

@carmark
Member

carmark commented Mar 18, 2021

@ahg-g

We need to know the Pending/Running/Failed/Succeeded status of each Pod. With those statuses, the operator can react and take appropriate action.
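
For example, the per-phase handling could be sketched roughly like this (the event type and names are hypothetical, not the operator's actual code):

```go
package main

import corev1 "k8s.io/api/core/v1"

// WorkerEvent is a hypothetical summary of what the operator derives from a
// worker pod's phase; the real controller updates MPIJob status and decides
// whether to restart or fail the job.
type WorkerEvent int

const (
	WorkerWaiting WorkerEvent = iota
	WorkerRunning
	WorkerFinished
	WorkerFailed
)

// classifyWorker maps a worker pod's phase to the event the operator cares about.
func classifyWorker(pod *corev1.Pod) WorkerEvent {
	switch pod.Status.Phase {
	case corev1.PodRunning:
		return WorkerRunning
	case corev1.PodSucceeded:
		return WorkerFinished
	case corev1.PodFailed:
		return WorkerFailed
	default: // PodPending, PodUnknown
		return WorkerWaiting
	}
}
```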

@alculquicondor
Collaborator

Your input is welcome in kubernetes/kubernetes#99497

@alculquicondor
Collaborator

With those statuses, the operator can react and take appropriate action.

Can you expand on this?

If a Pod Fails, since we are planning "stable pod names", we have to delete it before we can replace it with another pod with the same name. How are you handling this today?
