Add batch support to openmpi package #671

Merged
merged 4 commits into from Apr 19, 2018

@jiezhang jiezhang commented Apr 17, 2018

  • Introduce an "exec" parameter to run an ad hoc command

  • Use Redis for synchronization between master and worker

  • Use a pod instead of a StatefulSet to run the workloads. This is needed to set restartPolicy to Never.
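
To illustrate the last point, here is a minimal sketch of a bare Pod manifest with restartPolicy set to Never, built as a Python dict. It is not the manifest this PR generates: the function name, its arguments, the container name, the image, and the command are all hypothetical placeholders.

```python
import json


def build_pod_manifest(name, image, exec_cmd):
    """Return a bare Pod manifest (as a dict) that runs exec_cmd once.

    All argument names are hypothetical; they are not the package's parameters.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # With Never, the kubelet does not restart the container after it
            # exits, so a finished or failed command stays finished or failed.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "openmpi",  # placeholder container name
                    "image": image,
                    "command": ["sh", "-c", exec_cmd],
                }
            ],
        },
    }


# Placeholder values, for illustration only.
manifest = build_pod_manifest(
    "openmpi-master", "example/openmpi:latest", "mpiexec -n 2 hostname"
)
print(json.dumps(manifest, indent=2))
```

A pod template inside a StatefulSet is expected to use restartPolicy Always, which is why a one-shot batch command is a better fit for a bare pod.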


* introduce "cmd" parameter

* use redis for synchronization between master and worker
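
One way such synchronization could work is a readiness barrier: each worker writes a key when it is ready, and the master polls until every key exists. The sketch below is an assumption about the general pattern, not the protocol this PR implements; the key names and function names are hypothetical. It uses an in-memory stand-in so it runs without a server; a real client such as redis-py's `redis.Redis` offers `set` and `get` calls that could be passed in instead.

```python
import time


class InMemoryStore:
    """Stand-in for a Redis client; only set/get are used here."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Like a Redis GET, return None when the key does not exist.
        return self._data.get(key)


def signal_ready(store, worker_id):
    """Called by a worker once it is ready (hypothetical key layout)."""
    store.set(f"worker:{worker_id}:ready", "1")


def wait_for_workers(store, worker_ids, timeout=5.0, poll_interval=0.1):
    """Called by the master: poll until every worker has signalled.

    Returns True if all workers are ready, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = [w for w in worker_ids if store.get(f"worker:{w}:ready") is None]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


store = InMemoryStore()
for worker_id in (0, 1):
    signal_ready(store, worker_id)

print(wait_for_workers(store, [0, 1]))                   # both workers signalled
print(wait_for_workers(store, [0, 1, 2], timeout=0.3))   # worker 2 never signals
```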
@jiezhang
Author

/assign @jlewi

@jiezhang
Author

cc @alsrgv

@jiezhang
Author

/ok-to-test

@jiezhang
Author

/retest

@jiezhang
Author

/retest

@jlewi
Contributor

jlewi commented Apr 18, 2018

What's the fault tolerance model? If we're using pods and a pod gets preempted (e.g. the node is under memory pressure or unhealthy), what happens to the job? Will it exit or recover?

@jiezhang and I chatted briefly on Slack and, IIUC, there is an internal system that will be managing the job and handling these failures. So Kubeflow might need another solution (e.g. a CRD).

I don't think we need to resolve this now, but it would be good to open an issue to track it.

@jiezhang
Author

@jlewi In those cases, the job would fail. We probably need a job management service to monitor job status and reschedule the job as needed.

Note that even if the container is managed by a StatefulSet, if one of the worker containers gets evicted while the job is running, the job will fail and cannot recover by itself after the StatefulSet provisions a new container. We still need an external service to monitor job status and provide fault tolerance.

Given that MPI jobs involve multiple containers, I think it would be challenging to make each container fault tolerant. It's easier to redeploy all the components if something goes wrong.
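
As a rough sketch of the "redeploy everything" policy described above, an external monitor could reduce the phases of all pods in a job to a single action for the whole job. This is only an illustration under assumptions: the function name, the action strings, and the pod names are hypothetical, and no such service exists in this PR. The phase strings are the standard Kubernetes pod phases.

```python
def decide_action(pod_phases):
    """Map the phases of a job's pods to one action for the whole job.

    pod_phases: dict of pod name -> Kubernetes pod phase string
    ("Pending", "Running", "Succeeded", "Failed", "Unknown").
    """
    phases = set(pod_phases.values())
    if "Failed" in phases:
        # One failed (for example evicted) pod fails the MPI job, so all
        # components are redeployed rather than repairing a single pod.
        return "redeploy_all"
    if phases == {"Succeeded"}:
        return "done"
    return "wait"


print(decide_action({"master": "Running", "worker-0": "Running"}))
print(decide_action({"master": "Running", "worker-0": "Failed"}))
print(decide_action({"master": "Succeeded", "worker-0": "Succeeded"}))
```

Treating the job as one unit keeps the monitor simple: it never has to reason about which subset of MPI ranks can rejoin a running job.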

@jiezhang
Author

@jlewi I have opened #677 to track it.

@jlewi
Contributor

jlewi commented Apr 19, 2018

/approve

@pdmack
Member

pdmack commented Apr 19, 2018

/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi, pdmack

@k8s-ci-robot k8s-ci-robot merged commit 48fdc87 into kubeflow:master Apr 19, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
* Add batch support to openmpi package

* introduce "cmd" parameter

* use redis for synchronization between master and worker

* Use pod instead of stateful set to run workloads

* Swallow error in exec command

* Preserve error from exec command