kubeadm: add an upgrade health check that deploys a Job #81319
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: neolit123
Thanks @neolit123 !
We can hold for N seconds and see if the Pods are still running after this period (roughly sketched below), but then the crash loop period is unknown, so this is yet another arbitrary constant.
IMO, CNI demands a separate check that ensures that the Pods have networking.
Then again, if the liveness probes are working as expected, the Pods should be in a Running state.
I must admit my biggest concern is not that this change is checking a Running state, but rather that it checks the Running state of all CP Pods, which can break potential parallel CP node upgrades.
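For illustration only, the "hold for N seconds and then re-check" idea could look roughly like the sketch below; the hold period, the `tier=control-plane` label selector and the helper name are assumptions for this example, not code from this PR (client-go calls use the pre-1.17 signatures without a context argument):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// controlPlanePodsStillRunning waits out an (arbitrary) hold period and then
// verifies that every control-plane Pod in kube-system is still Running.
func controlPlanePodsStillRunning(client kubernetes.Interface, hold time.Duration) error {
	time.Sleep(hold)

	pods, err := client.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{
		LabelSelector: "tier=control-plane",
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return fmt.Errorf("pod %s is in phase %q, expected Running", pod.Name, pod.Status.Phase)
		}
	}
	return nil
}
```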
@rosti do you have a proposal on how to better approach the security audit ticket, or are your comments suggesting that you do not agree with it and that we should take no action?
That sounds a bit like reimplementing the liveness probes in a different way.
That's my point. We should see what we need to check and what not. Obviously, we cannot ensure a fully operational cluster and it's not our job to verify that.
If a liveness probe fails, it will restart the container in question according to the restart policy. The pod will be in the Running state if there is at least one container that's starting, running or restarting.
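To make that concrete (a hypothetical check, not part of this PR): a Pod whose container is crash looping still reports the Running phase, so detecting the crash loop requires looking at the container statuses rather than the phase alone:

```go
package main

import corev1 "k8s.io/api/core/v1"

// podNotCrashLooping shows why the Running phase alone is a weak signal: the
// phase stays Running while a container is restarting, so a crash loop only
// becomes visible in the container statuses (Waiting state + restart count).
func podNotCrashLooping(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		// A crash-looping container is typically Waiting (e.g. CrashLoopBackOff)
		// with a non-zero RestartCount, while the Pod phase remains Running.
		if cs.State.Waiting != nil && cs.RestartCount > 0 {
			return false
		}
	}
	return true
}
```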
+1 to that.
Indeed, only a single instance of every component is required at a minimum, to claim that we are operational.
Actually, I deliberated quite a lot on this one and I couldn't find a solution that seemed reliable enough to me.
Thanks @neolit123 !
I don't like the idea of using a DaemonSet; a Job would be much better in my opinion. We just want to verify that we can start something on the cluster.
A simple Job, like this one, should do the trick:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: health-check
  namespace: kube-system
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 30
  template:
    spec:
      restartPolicy: Never
      terminationGracePeriodSeconds: 0
      securityContext:
        runAsUser: 999
        runAsGroup: 999
        runAsNonRoot: true
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: health-check
        image: k8s.gcr.io/pause:3.1
        args: ["-v"]
```
We may also want to have the health checks with a random suffix, to ensure name uniqueness.
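One possible way to get that uniqueness (an illustrative sketch using pre-1.17 client-go signatures, not this PR's code) is to let the API server append the suffix via `generateName` instead of hard-coding the `health-check` name:

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createUniqueHealthCheckJob asks the API server to generate the final Job
// name ("health-check-" plus a random suffix), avoiding name collisions.
func createUniqueHealthCheckJob(client kubernetes.Interface, spec batchv1.JobSpec) (*batchv1.Job, error) {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "health-check-",
			Namespace:    metav1.NamespaceSystem,
		},
		Spec: spec, // same spec as in the manifest above
	}
	// The returned object carries the server-generated name, e.g. "health-check-x7k2p".
	return client.BatchV1().Jobs(metav1.NamespaceSystem).Create(job)
}
```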
I wish you had expressed your comments during our discussion with @fabriziopandini at the office hours last week (was it?).
- Add a new preflight check for upgrade that runs the pause container with `-v` in a Job.
- Wait for the Job to complete and return an error after N seconds.
- Manually clean the Job because we don't have the TTL controller enabled in kubeadm yet (it's still alpha).
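A rough end-to-end sketch of those three steps with client-go (1.16-era signatures, no context argument); the helper name, poll interval and timeout handling are assumptions for illustration, not the actual implementation in this PR:

```go
package main

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// runUpgradeHealthCheckJob creates the health check Job, waits for it to
// complete within the timeout, and always deletes it afterwards because the
// TTL controller cannot be relied on yet.
func runUpgradeHealthCheckJob(client kubernetes.Interface, job *batchv1.Job, timeout time.Duration) error {
	jobs := client.BatchV1().Jobs(metav1.NamespaceSystem)

	// 1. Create the Job (the pause container run with -v, as in the manifest above).
	created, err := jobs.Create(job)
	if err != nil {
		return err
	}

	// 3. Manually clean up the Job (and its Pods, via foreground propagation).
	defer func() {
		propagation := metav1.DeletePropagationForeground
		_ = jobs.Delete(created.Name, &metav1.DeleteOptions{PropagationPolicy: &propagation})
	}()

	// 2. Poll until the Job reports the Complete condition, or give up after the timeout.
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		current, err := jobs.Get(created.Name, metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and retry
		}
		for _, cond := range current.Status.Conditions {
			if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}
```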