Propose a new architecture with focus on scalability and robustness #360
The launcher only has exec permission on the worker pods belonging to its job; you can find it in the code.
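For context, a minimal sketch (assuming a namespaced RBAC Role with illustrative names, not the operator's actual code) of how exec permission can be scoped to a single job's worker pods via resourceNames:

```go
package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launcherRole builds a Role that lets the launcher exec only into the
// named worker pods of one job. Function and argument names are
// illustrative, not taken from mpi-operator.
func launcherRole(namespace, jobName string, workerPodNames []string) *rbacv1.Role {
	return &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: jobName + "-launcher", Namespace: namespace},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Resources: []string{"pods", "pods/exec"},
			Verbs:     []string{"get", "create"},
			// ResourceNames ties the permission to this job's workers only.
			ResourceNames: workerPodNames,
		}},
	}
}
```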
Ah, missed that. However, that doesn't scale as the number of workers increases (k8s objects have a size limit).
Updated.
If we replace kubectl exec with ssh, we can also support the elastic feature, so I don't think this is a problem for us, if I understand correctly.
What are the use cases for launcher pod retries?
Running jobs on spot/preemptible VMs.
From my perspective, I don't think the launcher pod should be scheduled onto spot/preemptible VMs. For the worker pods, yes, because worker pods are stateless. However, the launcher pod seems to be a stateful instance, whose restart means a new job.
Still, when running the workers on preemptible VMs, if any of them fails, the entire job fails, including the launcher. So we need launcher retries.
@alculquicondor Which MPI implementation are you expecting users to use so that, after the launcher gets restarted, the workers are aware of it and can reconnect to the new launcher?
The process goes like this:
This is independent of the MPI implementation.
From what I have observed in other HPC systems, like Slurm, for jobs that are not fault-tolerant, when some of the workers fail, the system does not retry via a launcher restart. Instead, the entire job is marked as failed and its resources are released. The system re-queues the failed job, creating a new one, if the user asked for retries. This process is compatible with the current design of mpi-operator.
OTOH, it kind of goes against the declarative and fault-tolerant approach of the Kubernetes APIs, including the Job API. Retries are what Kubernetes users would expect, and if they don't need them, they can always disable them.
I think it is useful if we make automatic retries an option that defaults to 0.
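For illustration only, a rough sketch of what such an opt-in retry knob could look like; the field and type names here are assumptions, not the actual MPIJob API:

```go
package sketch

// RunPolicySketch is a hypothetical snippet, not the real MPIJob spec.
type RunPolicySketch struct {
	// BackoffLimit is the number of launcher retries before the job is
	// marked failed. Leaving it unset means the proposed default of 0,
	// i.e. no automatic retries.
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}

// effectiveBackoffLimit applies the proposed default of 0 retries.
func effectiveBackoffLimit(p *RunPolicySketch) int32 {
	if p == nil || p.BackoffLimit == nil {
		return 0
	}
	return *p.BackoffLimit
}
```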
Do you mean the init-container kubectl-delivery will be removed? If so, how could we make sure the launcher starts after the workers?
There are 2 options:
I prefer option 1.
I don't think so. You would want that to speed up startup time, but it doesn't help if you want the launcher to retry on failures. And the scheduler should not need to know about the job's startup strategy.
At the same time, how do you verify the real reason the launcher failed in option 1? For example,
What do you mean by "job startup strategy"?
As for differentiating failures, I don't think we can or should do that. There is another type of failure which is kind of a mix of the two you described: a worker pod gets evicted by kubelet after the job already started. We cannot easily differentiate this one from a case where the user's code had a crash, for the purpose of retrying.
But perhaps option 2 is reasonable. WDYT?
If we want to support retries, then we want option 1 as a by-product, correct?
We can still use option 2. It might make startups less flaky, but it's not a guarantee anyway: a worker can simply fail between the time it reported running and the time the launcher started running.
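For illustration, a hedged sketch of how a controller could implement option 2 as described here, i.e. only create the launcher once every worker pod reports Ready; the function names are hypothetical, not the operator's actual code:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// maybeCreateLauncher creates the launcher only after all workers are
// Ready; otherwise it returns without doing anything so the controller
// can requeue and check again later.
func maybeCreateLauncher(workers []corev1.Pod, createLauncher func() error) error {
	for i := range workers {
		if !podReady(&workers[i]) {
			return nil // not all workers are ready yet
		}
	}
	return createLauncher()
}

// podReady reports whether the Pod has the Ready condition set to True.
func podReady(p *corev1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```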
I am wondering how to support elastic mode with a StatefulSet, since we may add/remove workers on the fly.
You just change `.spec.replicas`.
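For example, a minimal client-go sketch (the worker StatefulSet name and namespace are assumptions, not taken from the operator) of changing `.spec.replicas` through the scale subresource:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleWorkers resizes the workers' StatefulSet by reading its current
// scale and writing back the desired replica count.
func scaleWorkers(ctx context.Context, c kubernetes.Interface, namespace, name string, replicas int32) error {
	s, err := c.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	s.Spec.Replicas = replicas
	_, err = c.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, s, metav1.UpdateOptions{})
	return err
}
```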