Add batch support to openmpi package #671
* introduce "cmd" parameter
* use redis for synchronization between master and worker
/assign @jlewi
cc @alsrgv
/ok-to-test
/retest
/retest
What's the fault tolerance model? If we're using pods and a pod gets preempted (e.g. the node is under memory pressure or unhealthy), what happens to the job? Will it exit or recover? @jiezhang and I chatted briefly in Slack and, IIUC, there is an internal system that will manage the job and handle these failures, so Kubeflow might need another solution (e.g. a CRD). I don't think we need to resolve this now, but it would be good to open an issue to track it.
@jlewi In those cases, the job would fail. We probably need a job management service to monitor job status and reschedule the job as needed. Note that even if the containers are managed by a StatefulSet, the job will fail if one of the worker containers gets evicted while it is running, and it cannot recover by itself after the StatefulSet provisions a new container. We would still need an external service to monitor job status and provide fault tolerance. Given that MPI jobs involve multiple containers, I think it would be challenging to make each container fault tolerant. It's easier to re-deploy all the components if something goes wrong.
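To make the "re-deploy all the components" idea concrete, here is a minimal sketch of the decision an external monitor could make from pod phases. It is only an illustration under stated assumptions: the names `Action` and `decide`, the pod name `master`, and the policy itself are hypothetical and not part of this PR. The phase strings are the standard Kubernetes pod phases.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                  # job still running, nothing to do
    DONE = "done"                  # master finished successfully
    REDEPLOY_ALL = "redeploy_all"  # tear down every pod and resubmit the job


def decide(pod_phases, master="master"):
    """Pick an action from a mapping of pod name -> Kubernetes pod phase.

    Hypothetical policy: a failed or missing pod fails the whole MPI job,
    so the monitor redeploys everything instead of repairing a single pod.
    """
    # A missing master pod means the job cannot make progress.
    if master not in pod_phases:
        return Action.REDEPLOY_ALL
    # Treat a successful master as a finished job.
    if pod_phases[master] == "Succeeded":
        return Action.DONE
    # An evicted pod is reported with phase "Failed".
    if any(phase in ("Failed", "Unknown") for phase in pod_phases.values()):
        return Action.REDEPLOY_ALL
    return Action.WAIT
```

A real monitor would obtain the phases from the Kubernetes API and act on the result; this sketch only covers the decision step.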
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jlewi, pdmack
* Add batch support to openmpi package
* introduce "cmd" parameter
* use redis for synchronization between master and worker
* Use pod instead of stateful set to run workloads
* Swallow error in exec command
* Preserve error from exec command
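The commits do not spell out how redis is used for synchronization, so the following is only a sketch of one plausible scheme, not the PR's implementation: workers announce readiness by incrementing a counter, the master polls until all workers have checked in, and the master sets a flag when the job is finished so workers can exit. The key names and all function names are hypothetical. `InMemoryStore` is a stand-in that mimics the `incr`, `get`, and `set` calls of a redis-py client so the example runs without a server.

```python
import time


class InMemoryStore:
    """Stand-in for a Redis client, exposing only incr(), get() and set()."""

    def __init__(self):
        self._data = {}

    def incr(self, key):
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]

    def get(self, key):
        # Like redis-py, return bytes, or None when the key is absent.
        value = self._data.get(key)
        return None if value is None else str(value).encode()

    def set(self, key, value):
        self._data[key] = value
        return True


def worker_ready(store, job):
    """Called by each worker once it is ready to accept work."""
    return store.incr(f"{job}:ready")


def wait_for_workers(store, job, expected, timeout=5.0, poll=0.05):
    """Called by the master: poll until `expected` workers are ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if int(store.get(f"{job}:ready") or 0) >= expected:
            return True
        time.sleep(poll)
    return False


def master_done(store, job):
    """Called by the master after the command finishes."""
    store.set(f"{job}:done", "1")


def job_done(store, job):
    """Polled by workers to learn when they may exit."""
    return store.get(f"{job}:done") is not None
```

Swapping `InMemoryStore()` for a real client object with the same three methods should leave the functions unchanged.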
Introduce "exec" parameter to run adhoc command
Use redis for synchronization between master and worker
Use pod instead of stateful set to run the workloads. This is needed to set restartPolicy to Never.
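For illustration, here is a rough sketch of a bare pod manifest with `restartPolicy: Never`, built as a plain Python dict. The function name `mpi_pod`, the container name, and the argument values are hypothetical; the package itself generates its manifests differently. The point is only that the field sits at the pod `spec` level, which a bare pod lets you set freely.

```python
def mpi_pod(name, image, command):
    """Build a minimal Pod manifest that runs `command` once and never restarts."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # With Never, a finished or failed container stays terminated,
            # so a batch command is not re-run after it exits.
            "restartPolicy": "Never",
            "containers": [
                {"name": "openmpi", "image": image, "command": list(command)},
            ],
        },
    }
```

With this policy, the pod ends in phase `Succeeded` or `Failed` once the command exits, which is what lets an outside observer tell whether the batch run worked.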