
hook-image-puller: -pod-scheduling-wait-duration flag added for reliability during helm upgrades #1763

Merged

Conversation

@consideRatio (Member) commented Sep 1, 2020

PR Summary

This PR addresses a risk of helm upgrade failing for a bad reason: the hook-image-puller pods failing to schedule because the nodes cannot accept more pods.

How helm upgrade can currently fail for a bad reason

When helm upgrade is run, certain k8s resources are rendered and deployed before others using Chart hooks. We deploy a hook-image-awaiter Pod and a hook-image-puller DaemonSet. The hook-image-awaiter pod waits for the pods of the hook-image-puller DaemonSet to become ready, which they do once they have been scheduled and have pulled all images in dummy init containers.

The failure we can avoid is when the hook-image-awaiter pod gets stuck waiting for one of the image puller pods that fails to schedule on a node because the node is out of capacity to schedule more pods, most likely because of a maximum-pods-per-node constraint.

GKE, for example, has a maximum of 110 pods per node, and as far as I know there is no workaround for this limit.

How this PR addresses the issue

This PR updates the logic of the hook-image-awaiter pod's binary, which we have written ourselves in Go. Before this PR, it waited for all image puller pods to schedule, start, and get the main container running/ready. With this PR, after an initial duration configurable with the -pod-scheduling-wait-duration flag, it awaits only the pods that have actually been scheduled.

We can accomplish this by using the status set on the hook-image-puller DaemonSet resource that describes CurrentNumberScheduled, DesiredNumberScheduled, and NumberReady.
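To make the decision concrete, here is a minimal Go sketch of how such a check could look. This is an illustrative reconstruction, not the actual images/image-awaiter/main.go; the function name `isDone` and its signature are hypothetical, while the status fields come from `k8s.io/api/apps/v1` as named above.

```go
package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// isDone decides whether the image-awaiter can stop waiting, based on
// the hook-image-puller DaemonSet's status and the elapsed wait time.
func isDone(ds *appsv1.DaemonSet, elapsed, podSchedulingWait time.Duration) bool {
	s := ds.Status

	// A negative wait duration means the criteria is never relaxed:
	// all desired pods must schedule and become ready (old behavior).
	if podSchedulingWait < 0 || elapsed < podSchedulingWait {
		// Initially: every desired pod must be scheduled and ready.
		return s.NumberReady == s.DesiredNumberScheduled
	}
	// Afterwards: only pods that actually got scheduled must be ready.
	return s.NumberReady == s.CurrentNumberScheduled
}

func main() {
	ds := &appsv1.DaemonSet{
		Status: appsv1.DaemonSetStatus{
			DesiredNumberScheduled: 6,
			CurrentNumberScheduled: 5,
			NumberReady:            5,
		},
	}
	// After 30s with a 10s wait duration, scheduled == ready counts as done.
	fmt.Println(isDone(ds, 30*time.Second, 10*time.Second)) // prints: true
}
```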

Configuration added

This duration is configurable with prePuller.hook.podSchedulingWaitDuration and defaults to 10 seconds. It can be configured to be an infinite duration with -1.
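For example, in a Helm values override file this could look like the following (the key is the one named above; the inline comment summarizes the semantics described here):

```yaml
prePuller:
  hook:
    podSchedulingWaitDuration: 10  # seconds; set to -1 to wait indefinitely for scheduling
```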

It can be a problem to require a pod to be successfully scheduled on all nodes. If we require this, then chart upgrades will fail whenever any node has insufficient capacity for more pods. This is not because the --wait part of the Helm upgrade requires daemonset pods to be scheduled and ready (it doesn't), but because the image-awaiter pod requires that the desired number of pods are scheduled and running.

This PR introduces a flag and sets a default value of 10 for it. With this, the criterion to successfully have awaited image pulling is, during the initial ten seconds, to have as many ready image puller pods as desired. After this initial duration, the criterion changes to something new: waiting only for the currently scheduled pods to become ready.
@consideRatio changed the title from "hook-image-puller: -strict-check-duration flag added" to "Relax hook-image-puller for reliability: -strict-check-duration flag added" on Sep 9, 2020
@yuvipanda (Collaborator) left a comment


Minor suggestions, but otherwise LGTM. Thanks for working on this, @consideRatio

(Inline review threads on images/image-awaiter/main.go and jupyterhub/values.yaml — outdated, resolved)
@GeorgianaElena (Member) left a comment


Everything looks super nice @consideRatio 👍 🎉 ❤️

Is it possible for other pods to get scheduled after the strict-check-duration runs out and the hook-image-awaiter gets "unblocked"?

@consideRatio (Member, Author) commented

@yuvipanda @GeorgianaElena thank you for your reviews ❤️ 🎉

@GeorgianaElena yes, a pod that was pending for, say, 30 seconds can still get scheduled. The scheduling of these pods is done by the Kubernetes pod scheduler and isn't influenced by the image-awaiter pod, which only monitors the status of the daemonset object. That status in turn contains information about the number of scheduled pods (scheduled), the number of pods that should be scheduled (desired), and the number of ready pods (ready).

So assuming a strict-check-duration of 10, what could happen is...

  1. After 5 seconds --- desired: 5, scheduled: 4, ready: 4 --- keep waiting
  2. After 9 seconds --- desired: 5, scheduled: 5, ready: 4 --- keep waiting
  3. After 10 seconds --- image-awaiter will now accept scheduled==ready rather than desired==ready
  4. After 15 seconds --- desired: 6, scheduled: 5, ready: 4 --- keep waiting
  5. After 30 seconds --- desired: 6, scheduled: 5, ready: 5 --- completed!

@consideRatio changed the title from "Relax hook-image-puller for reliability: -strict-check-duration flag added" to "Relax hook-image-puller for reliability: -pod-scheduling-wait-duration flag added" on Sep 23, 2020
@consideRatio changed the title from "Relax hook-image-puller for reliability: -pod-scheduling-wait-duration flag added" to "hook-image-puller: -pod-scheduling-wait-duration flag added for reliability during helm upgrades" on Sep 23, 2020
@consideRatio merged commit f000e94 into jupyterhub:master on Sep 23, 2020