Allow launcher to start after workers are ready #386
Comments
SGTM, I also prefer the 2nd option.
Agreed. The second option looks better and could leverage a Job.
SGTM. +1 to using the Job API wherever possible.
But maybe it is solvable just by using another type of deployment for the launcher Job, so that it restarts in place?
You can definitely use |
Could it be a default then?
The default in k8s is |
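To make the restart-in-place idea concrete, here is a minimal sketch in Go of a launcher wrapped in a batch/v1 Job: with `restartPolicy: OnFailure` the kubelet restarts the launcher container inside the same Pod, and the Job's `backoffLimit` caps how many retries are allowed. The helper name, image, and command below are illustrative assumptions, not the controller's actual code.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildLauncherJob sketches a launcher run as a Job. restartPolicy: OnFailure
// restarts the launcher container in place; backoffLimit bounds the retries.
// Names, image, and command are placeholders for illustration.
func buildLauncherJob(name, image string, backoffLimit int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-launcher"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:    "launcher",
						Image:   image,
						Command: []string{"mpirun", "-n", "2", "hostname"},
					}},
				},
			},
		},
	}
}

func main() {
	job := buildLauncherJob("pi", "example.com/mpi-launcher:latest", 3)
	fmt.Printf("job %s with backoffLimit %d\n", job.Name, *job.Spec.BackoffLimit)
}
```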
Wilco.
Which version should I use as the base for the PR? Should I change the v2 code only?
Yes. The change is not backwards compatible, so we can't do it for older versions.
@xhejtman are you still working on that PR? I think that would be the only change pending before we can release v2.
I already sent it. Maybe to the wrong place?
Yes, that's my fork. Do it for this repository.
FYI, here is the way I waited for the worker Pods to be ready before launching the main container of the Launcher Pod. This avoids running into errors from the main container:
(see also my original post)
Alternatively, you can adapt the entry-point that we have for Intel (Intel doesn't do retries, so it's absolutely necessary to wait): https://github.com/kubeflow/mpi-operator/blob/master/examples/base/intel-entrypoint.sh
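In the same spirit, here is a minimal sketch in Go of the waiting approach (the linked entrypoint itself is a shell script): poll the worker Pods and only continue once they all report Ready. The label selector, namespace, job name, and replica count are assumptions for illustration and may not match the operator's actual labels.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForWorkers polls until `replicas` worker Pods report the Ready condition.
// The label selector is an assumption; adjust it to whatever labels the
// controller actually puts on the worker Pods.
func waitForWorkers(ctx context.Context, c kubernetes.Interface, ns, jobName string, replicas int) error {
	selector := fmt.Sprintf("training.kubeflow.org/job-name=%s,training.kubeflow.org/job-role=worker", jobName)
	for {
		pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err == nil {
			ready := 0
			for _, p := range pods.Items {
				for _, cond := range p.Status.Conditions {
					if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
						ready++
					}
				}
			}
			if ready >= replicas {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the launcher Pod
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForWorkers(context.Background(), client, "default", "pi", 2); err != nil {
		panic(err)
	}
	fmt.Println("all workers ready, starting mpirun")
}
```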
+1 to this. Depending on the launcher failing is pretty hackish.
Part of #373. This was initially discussed in #360, but it didn't have a final conclusion.
In the v1 controller, an init container was responsible for holding back the start of the launcher until all the workers were ready.
In the v2 controller, we removed this init container for scalability and performance reasons (each init container is a controller that has to build a full cache of all the Pods in the cluster).
Most of the time, this just works: the workers and launcher start at roughly the same time, and ssh itself does some retries before giving up. However, there are scenarios where the launcher fails because it can't find the workers.
This problem can show up consistently in the following scenario: the launcher lands on a Node where the image is already present, but the workers land on Nodes where the image is missing and has to be pulled first, so they start noticeably later.
There are 2 high-level solutions to this problem:
1. Hold the start of the launcher until all the workers are ready (as the v1 init container did).
2. Let the launcher fail and retry, leveraging the Job API and its backoff limit.
The two solutions are not mutually exclusive (we can implement both).
Additional questions come with both solutions:
I propose we do 2 first, exposing the number of retries in the API, and then re-evaluate whether we also need to do 1, based on user feedback after the 2.0 release.
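A rough sketch of what exposing the number of retries in the API could look like, assuming a Job-backed launcher; the field and package names here are illustrative, not the final v2 API:

```go
// Package sketch illustrates one possible shape of the API field; it is not
// the published v2beta1 types.
package sketch

// RunPolicy shows how the number of launcher retries could be surfaced in the
// MPIJob spec. The value would map directly onto the backoffLimit of the Job
// that runs the launcher.
type RunPolicy struct {
	// BackoffLimit is the number of launcher retries before marking the MPIJob failed.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}
```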