[Discussion] Switch to use Job API with ArrayJob semantics #315
One of the things I'm thinking could be a problem is stopping the Job. In the current architecture, the driver-job triggers a scale-down of the StatefulSet. Is there a way for the runners to know when the task is done and finish by themselves?
Side note: we won't hold the KEP on solving all the problems for MPI. Static partitioning is enough justification for the Array Job proposal.
/cc @carmark
Another thing to think about is readiness status: I am not sure the Job status tracks the number of "ready" replicas the way StatefulSets and ReplicaSets do. This makes sense, since the Job API is not supposed to be used as a service, but it is a difference that we may want to keep in mind. It may not be a major issue though, since rank-0 could directly check the status of the individual pods before starting the MPI job.
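As an illustration of that last point, here is a minimal sketch (not the operator's actual code) of rank-0, or the launcher, polling the individual worker pods before starting the MPI job; the namespace handling and the `training.kubeflow.org/job-role=worker` label selector are assumptions.

```go
// Sketch: wait until every selected worker pod reports the Running phase
// before launching mpirun. Illustrative only, not the mpi-operator's code.
package sketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func waitForWorkers(ctx context.Context, c kubernetes.Interface, ns string, want int) error {
	// Hypothetical label selector for the worker pods of one MPI job.
	const sel = "training.kubeflow.org/job-role=worker"
	for {
		pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: sel})
		if err != nil {
			return err
		}
		running := 0
		for _, p := range pods.Items {
			if p.Status.Phase == corev1.PodRunning {
				running++
			}
		}
		if running >= want {
			return nil
		}
		fmt.Printf("waiting: %d/%d workers running\n", running, want)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}
```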
Then I am wondering how to set the job status to Running/Succeeded/Failed. The driver needs to check all the worker statuses (IPs) in Horovod. To support elastic training, the driver needs to maintain a list of the workers' pod IPs; now we want to have a ConfigMap to do it. But I think that is not related to this new Job API.
Would it make sense for the driver to obtain the pod IPs by listing the worker Pods that belong to the Job? OTOH, how do workers use the index in the Pod name coming from the StatefulSet?
Now, the mpi-operator obtains the status from the launcher pod (which runs the mpirun command). The pod IPs are not necessary; by default, the MPI program will use the worker hostnames.
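As a rough illustration of why hostnames are enough, a hostfile can be generated purely from predictable worker names, with no pod IPs involved; the `<job>-worker-<index>` naming scheme and the slot count below are assumptions for illustration.

```go
// Sketch: build an OpenMPI-style hostfile from predictable worker hostnames,
// so the launcher never needs to look up pod IPs.
package main

import (
	"fmt"
	"strings"
)

// hostfile returns one "host slots=N" line per worker, e.g.
//   my-job-worker-0 slots=1
//   my-job-worker-1 slots=1
func hostfile(jobName string, workers, slotsPerWorker int) string {
	var b strings.Builder
	for i := 0; i < workers; i++ {
		fmt.Fprintf(&b, "%s-worker-%d slots=%d\n", jobName, i, slotsPerWorker)
	}
	return b.String()
}

func main() {
	fmt.Print(hostfile("my-job", 2, 1))
}
```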
That's interesting. In that case, what are the reasons to use StatefulSet instead of Deployment?
The main thing we need from the StatefulSet is stable hostnames. I suppose you could use a Deployment, but then the worker hostnames would not be stable. The original design is outlined in this yaml file that might be easier to see: https://github.com/rongou/k8s-openmpi/blob/master/openmpi-test.yaml
Wouldn't MPI jobs typically fail if any of the pods restart? I thought MPI is not tolerant of failures.
Yes, that's why the launcher is a Job.
In that case, it sounds like the requirement for the mpi-operator to migrate to Job, as a replacement for the StatefulSet, is for Job to support stable names. It doesn't sound like a Pod annotation containing an index would help in any way, would it?
As long as it can be translated into a stable hostname then it should work.
It cannot. But we will consider that as a follow-up feature: https://github.com/kubernetes/enhancements/pull/2245/files#diff-6e034c24686b4aca732ab8c705102d41d5661943e62cb4996d8b89d3dbf7a6c5R540
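For context, here is a minimal sketch, using the upstream Go API types, of roughly what an Indexed worker Job with resolvable hostnames could look like once that follow-up lands; the headless-Service/subdomain wiring and the image are illustrative assumptions, not part of the KEP or the mpi-operator's actual design.

```go
// Sketch: an Indexed Job for the MPI workers. Illustrative only.
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerJob builds an Indexed Job running `replicas` workers in parallel.
// The subdomain assumes a matching headless Service exists so that worker
// hostnames can resolve over DNS.
func workerJob(name string, replicas int32) *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-worker"},
		Spec: batchv1.JobSpec{
			Completions:    &replicas,
			Parallelism:    &replicas,
			CompletionMode: &mode,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Subdomain:     name + "-worker", // headless Service name (assumed)
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "example.com/mpi-worker:latest", // placeholder
					}},
				},
			},
		},
	}
}
```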
We noticed that the operator no longer uses StatefulSet (#203). Can you clarify the motivations for that?
@carmark Would you like to clarify the motivation for that change?
Sure, @alculquicondor. The StatefulSet creates the pods one by one, which is slow for a large job; creating the worker pods directly lets them all start in parallel.
We are looking to expand the k8s Job API to support indexed Jobs with stable pod names. Unlike StatefulSets, it will not need to scale up and down incrementally, so the problem of slow scale-up should go away. When that is ready, the mpi-operator can avoid creating independent pods and rely on the k8s Job. What pod statuses are we interested in exactly? We want to make sure we capture those in pods created by the k8s Job.
We need to know the Pending/Running/Failed/Succeeded status for each Pod. With those statuses, the operator can handle them and do some work.
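As a sketch of the kind of handling meant here, the operator could derive a single job-level state from the worker pod phases; the state names below simply mirror this comment and are not the operator's actual types.

```go
// Sketch: derive one job-level state from the worker pods' phases.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

type JobState string

const (
	StatePending   JobState = "Pending"
	StateRunning   JobState = "Running"
	StateFailed    JobState = "Failed"
	StateSucceeded JobState = "Succeeded"
)

func aggregate(phases []corev1.PodPhase) JobState {
	if len(phases) == 0 {
		return StatePending
	}
	pending, succeeded := 0, 0
	for _, ph := range phases {
		switch ph {
		case corev1.PodFailed:
			// A single failed worker fails the whole (non-elastic) MPI job.
			return StateFailed
		case corev1.PodSucceeded:
			succeeded++
		case corev1.PodPending, corev1.PodUnknown:
			pending++
		}
	}
	switch {
	case succeeded == len(phases):
		return StateSucceeded
	case pending > 0:
		return StatePending
	default:
		return StateRunning
	}
}

func main() {
	fmt.Println(aggregate([]corev1.PodPhase{corev1.PodRunning, corev1.PodPending})) // Pending
}
```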
Your input is welcome in kubernetes/kubernetes#99497
Can you expand on this? If a Pod fails, since we are planning "stable pod names", we have to delete it before we can replace it with another pod with the same name. How are you handling this today?
From Aldo Culquicondor:
Related issue: kubernetes/kubernetes#97169
cc @alculquicondor @ahg-g @kubeflow/wg-training-leads
Let's use this issue to discuss with the community.