Allow launcher to start after workers are ready #386
Comments
SGTM, I also prefer the 2nd option.
Agreed. The second option looks better and could leverage a Job.
SGTM. +1 to using the Job API wherever possible.
But maybe it is solvable just by using another type of deployment for the launcher Job, so that it restarts in place?
You can definitely use |
Could it be a default then?
The default in k8s is |
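To make the restart-in-place idea concrete, here is a minimal sketch in Go of a launcher wrapped in a batch/v1 Job: with `restartPolicy: OnFailure` the kubelet restarts the launcher container inside the same Pod, and the Job's `backoffLimit` caps how many retries are allowed. The helper name, image, and command below are illustrative assumptions, not the controller's actual code.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildLauncherJob sketches a launcher run as a Job. restartPolicy: OnFailure
// restarts the launcher container in place; backoffLimit bounds the retries.
// Names, image, and command are placeholders for illustration.
func buildLauncherJob(name, image string, backoffLimit int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-launcher"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:    "launcher",
						Image:   image,
						Command: []string{"mpirun", "-n", "2", "hostname"},
					}},
				},
			},
		},
	}
}

func main() {
	job := buildLauncherJob("pi", "example.com/mpi-launcher:latest", 3)
	fmt.Printf("job %s with backoffLimit %d\n", job.Name, *job.Spec.BackoffLimit)
}
```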
Wilco.
Which version should I use as the base for the PR? Should I change the v2 code only?
Yes. The change is not backwards compatible, so we can't do it for older versions.
@xhejtman are you still working on that PR? I think that would be the only change pending before we can release v2.
I already sent it. Maybe to the wrong place?
Yes, that's my fork. Do it for this repository.
FYI, here is the way I waited for the worker Pods to be ready before launching the main container of the Launcher Pod. This avoids running into errors from the main container:
(see also my original post)
Alternatively, you can adapt the entry-point that we have for Intel (Intel doesn't do retries, so it's absolutely necessary to wait): https://github.com/kubeflow/mpi-operator/blob/master/examples/base/intel-entrypoint.sh
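In the same spirit, here is a minimal sketch in Go of the waiting approach (the linked entrypoint itself is a shell script): poll the worker Pods and only continue once they all report Ready. The label selector, namespace, job name, and replica count are assumptions for illustration and may not match the operator's actual labels.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForWorkers polls until `replicas` worker Pods report the Ready condition.
// The label selector is an assumption; adjust it to whatever labels the
// controller actually puts on the worker Pods.
func waitForWorkers(ctx context.Context, c kubernetes.Interface, ns, jobName string, replicas int) error {
	selector := fmt.Sprintf("training.kubeflow.org/job-name=%s,training.kubeflow.org/job-role=worker", jobName)
	for {
		pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err == nil {
			ready := 0
			for _, p := range pods.Items {
				for _, cond := range p.Status.Conditions {
					if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
						ready++
					}
				}
			}
			if ready >= replicas {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the launcher Pod
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForWorkers(context.Background(), client, "default", "pi", 2); err != nil {
		panic(err)
	}
	fmt.Println("all workers ready, starting mpirun")
}
```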
+1 to this. Depending on the launcher failing is pretty hackish.
Part of #373. This was initially discussed in #360, but it didn't have a final conclusion.
In the v1 controller, an init container was responsible for holding back the start of the launcher until all the workers were ready.
In the v2 controller, we removed this init container for scalability and performance reasons (each init container is a controller that has to build a full cache of all the Pods in the cluster).
Most of the time, this just works: the workers and launcher start at roughly the same time, and ssh itself does some retries before giving up. However, there are scenarios where the launcher fails because it can't find the workers.
This problem can show up consistently in the following scenario: the launcher lands on a Node where the image is already present, but the workers land on Nodes where the image is missing and has to be pulled first, so they start noticeably later.
There are 2 high-level solutions to this problem:
1. Hold the start of the launcher until all the workers are ready (as the v1 init container did).
2. Let the launcher fail and retry, leveraging the Job API and its backoff limit.
The two solutions are not mutually exclusive (we can implement both).
Additional questions come with both solutions:
I propose we do 2 first, exposing the number of retries in the API, and then re-evaluate whether we also need to do 1, based on user feedback after the 2.0 release.
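A rough sketch of what exposing the number of retries in the API could look like, assuming a Job-backed launcher; the field and package names here are illustrative, not the final v2 API:

```go
// Package sketch illustrates one possible shape of the API field; it is not
// the published v2beta1 types.
package sketch

// RunPolicy shows how the number of launcher retries could be surfaced in the
// MPIJob spec. The value would map directly onto the backoffLimit of the Job
// that runs the launcher.
type RunPolicy struct {
	// BackoffLimit is the number of launcher retries before marking the MPIJob failed.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
}
```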