Add batch support to openmpi package #671

jiezhang · 2018-04-17T00:04:05Z

Introduce "exec" parameter to run adhoc command
Use redis for synchronization between master and worker
Use pod instead of stateful set to run the workloads. This is needed to set restartPolicy to Never.
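
To illustrate the last point: a StatefulSet's pod template only accepts `restartPolicy: Always`, whereas a bare pod can set `Never`, so a finished or failed MPI process is not restarted in place. Below is a minimal sketch of such a manifest built as a Python dict. The helper name `make_worker_pod`, the image, the pod name, and the labels are hypothetical and not taken from this PR.

```python
def make_worker_pod(name, image, command):
    """Build a bare Pod manifest (as a dict) that runs once and is never restarted.

    All field values passed in are illustrative assumptions, not the ones used by the PR.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "openmpi"}},
        "spec": {
            # Only a bare pod (or a Job) may use "Never"; a StatefulSet
            # pod template is restricted to "Always".
            "restartPolicy": "Never",
            "containers": [
                {"name": "openmpi", "image": image, "command": list(command)}
            ],
        },
    }


pod = make_worker_pod(
    "openmpi-worker-0",        # hypothetical pod name
    "example/openmpi:latest",  # hypothetical image
    ["sh", "-c", "echo hello"],
)
print(pod["spec"]["restartPolicy"])
```

The resulting dict could be serialized to YAML or JSON and applied with the usual tooling.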

* introduce "cmd" parameter
* use redis for synchronization between master and worker

jiezhang · 2018-04-17T00:05:18Z

/assign @jlewi

jiezhang · 2018-04-17T00:10:06Z

cc @alsrgv

jiezhang · 2018-04-17T00:10:19Z

/ok-to-test

jiezhang · 2018-04-17T16:55:55Z

/retest

jiezhang · 2018-04-17T21:46:26Z

/retest

jlewi · 2018-04-18T19:46:06Z

What's the fault tolerance model? If we're using pods and a pod gets preempted (e.g. the node is under memory pressure or unhealthy), what happens to the job? Will it exit or recover?

@jiezhang and I chatted briefly in Slack and IIUC there is an internal system that will be managing the job and handling these failures. So Kubeflow might need another solution (e.g. a CRD).

I don't think we need to resolve this now, but it would be good to open an issue to track it.

jiezhang · 2018-04-18T20:36:47Z

@jlewi In those cases, the job would fail. We probably need a job management service to monitor job status and reschedule the job as needed.

Note that even if the containers are managed by a StatefulSet, if one of the worker containers gets evicted while the job is running, the job will fail and cannot recover on its own after the StatefulSet provisions a new container. We would still need an external service to monitor job status and provide fault tolerance.

Given MPI jobs involve multiple containers, I think it would be challenging to make each container fault tolerant. It's easier to re-deploy all the components if something goes wrong.
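
As a rough sketch of the "re-deploy everything" policy described above: an external monitor could look at the phase of every pod in the job and, if any one has failed (an evicted pod reports phase `Failed`), tear down and re-create the whole job rather than repair a single container. The function name `decide_action` and the action strings are hypothetical. This is not an existing Kubeflow API, only an illustration of the decision rule.

```python
FAILED = "Failed"
SUCCEEDED = "Succeeded"


def decide_action(pod_phases):
    """Return the action a (hypothetical) job monitor would take.

    pod_phases maps pod name -> Kubernetes pod phase
    ("Pending", "Running", "Succeeded", "Failed", "Unknown").
    """
    phases = list(pod_phases.values())
    # One failed pod breaks the whole MPI job, so redeploy all components.
    if any(phase == FAILED for phase in phases):
        return "redeploy_all"
    # The job is finished only when every pod has exited successfully.
    if phases and all(phase == SUCCEEDED for phase in phases):
        return "done"
    # Otherwise the job is still starting or running.
    return "wait"


print(decide_action({"master": "Running", "worker-0": "Failed"}))
print(decide_action({"master": "Succeeded", "worker-0": "Succeeded"}))
print(decide_action({"master": "Running", "worker-0": "Pending"}))
```

A real monitor would also need a retry limit and a way to read pod status from the API server, both of which are left out here.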

jiezhang · 2018-04-18T21:31:34Z

@jlewi I have opened #677 to track it.

jlewi · 2018-04-19T01:15:24Z

/approve

pdmack · 2018-04-19T14:20:14Z

/lgtm

k8s-ci-robot · 2018-04-19T14:20:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi, pdmack

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi,pdmack]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Add batch support to openmpi package
* introduce "cmd" parameter
* use redis for synchronization between master and worker
* Use pod instead of stateful set to run workloads
* Swallow error in exec command
* Preserve error from exec command

Add batch support to openmpi package

8172106

* introduce "cmd" parameter
* use redis for synchronization between master and worker

k8s-ci-robot requested review from jimexist and wbuchwalter April 17, 2018 00:04

k8s-ci-robot added the size/L label Apr 17, 2018

k8s-ci-robot requested a review from jlewi April 17, 2018 00:04

k8s-ci-robot assigned jlewi Apr 17, 2018

Jie Zhang added 2 commits April 17, 2018 12:27

Use pod instead of stateful set to run workloads

1ed1683

Swallow error in exec command

5d11cf3

Preserve error from exec command

5b33e5b

k8s-ci-robot added the approved label Apr 19, 2018

k8s-ci-robot assigned pdmack Apr 19, 2018

k8s-ci-robot added the lgtm label Apr 19, 2018

k8s-ci-robot merged commit 48fdc87 into kubeflow:master Apr 19, 2018

jiezhang mentioned this pull request Apr 19, 2018

openmpi: make 'schedulerName' configurable to use custom schedulers. #683

Merged

goswamig mentioned this pull request Oct 11, 2018

Unable to use -x option with mca_base_env_list #1729

Closed
