Add kubelet CA to no-reboot action list/Do not drain for non-reboot actions #2398

Conversation

yuqi-zhang
Contributor

Incorporates #2294 and #2290, such that a kubelet cert signer rotation causes a "none" action.

In 4.7 we introduced non-reboot updates for select cases, but still drained nodes for safety. This is still a major workload disruption, and non-reboot updates do not require draining for correctness. This PR will also introduce the ability to not drain at all for non-reboot updates, and in case of error, return the error directly instead of rebooting in an attempt to fix it.

This will cause such updates to effectively not cordon nodes at all, and on AWS a change to e.g. the kubelet CA would take ~20 seconds to apply to a 3-node pool.

One last thing to consider is what behaviour we want for changes to /etc/containers/registries.conf, since not draining may cause a mismatch of image references between existing and new containers.
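
As a reading aid, here is a minimal sketch of the drain decision described above. This is not the PR's actual code: only the action names come from the PR, while needsDrain, the constant values, and the overall shape are illustrative assumptions.

package main

import "fmt"

// Post-config-change action names referenced in this PR; the string values
// are illustrative placeholders, not the MCO's real constants.
const (
	postConfigChangeActionNone       = "none"
	postConfigChangeActionReloadCrio = "reload crio"
)

// needsDrain is a hypothetical helper: skip the drain only when every
// computed action is "none"; reboots and crio reloads still drain.
func needsDrain(actions []string) bool {
	for _, action := range actions {
		if action != postConfigChangeActionNone {
			return true
		}
	}
	return false
}

func main() {
	// Kubelet CA rotation computes only a "none" action: no cordon, no drain.
	fmt.Println(needsDrain([]string{postConfigChangeActionNone})) // false
	// A registries.conf change still drains so reload errors surface immediately.
	fmt.Println(needsDrain([]string{postConfigChangeActionReloadCrio})) // true
}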

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2021
@yuqi-zhang
Contributor Author

#2294 and #2290 can still be merged separately; this exists as a consolidated PR if we want to merge everything together. The commits are still separate for cherry-picking, if needed.

/hold pending investigation of the last point above

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2021
@yuqi-zhang
Contributor Author

After some discussion and consideration, it is likely better for us to still drain on registries.conf changes (crio reload), to surface errors more quickly if the file is bad. See the commit message for details.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2021
@@ -132,8 +132,7 @@ func (dn *Daemon) performPostConfigChangeAction(postConfigChangeActions []string
 		if dn.recorder != nil {
 			dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeWarning, "FailedServiceReload", fmt.Sprintf("Reloading %s service failed. Error: %v", serviceName, err))
 		}
-		dn.logSystem("Reloading %s configuration failed, node will reboot instead. Error: %v", serviceName, err)
-		dn.reboot(fmt.Sprintf("Node will reboot into config %s", configName))
+		return fmt.Errorf("Could not apply update: reloading %s configuration failed. Error: %v", serviceName, err)
Contributor

I am not convinced that we shouldn't reboot to try to auto-fix the problem. These errors may not be user initiated; the user will be clueless anyway and will end up creating a bug report or rebooting the node explicitly to fix it. Since the MCO has already drained the node, attempting to auto-recover by rebooting makes sense, and if the problem still remains we will at least have logs showing that rebooting doesn't help, and we can investigate further.

Contributor Author

So I have two considerations here, although I do not feel super strongly either way:

  1. In terms of failure conditions of the reload, it could be due to
    1a) a bad config provided by the user causing the reload to fail, in which case rebooting does not help anyway
    1b) a transient issue with e.g. systemd, but since our logic keeps retrying through a new sync, it will resolve itself eventually if it's just a transient issue
    1c) an issue that isn't transient but is resolvable by a reboot. This would be covered by a reboot, but I'm not sure what an example would be
  2. Since we are saying this change is non-reboot, a customer may be confused if it rebooted (due to an error) and would not necessarily have insight into what the error was. Since we make calls to systemctl elsewhere in the MCD anyway, I thought it would be consistent to display the same failure mode (return the error, log it, and retry upon a later sync), as sketched below.
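
A compressed sketch of that failure mode, with illustrative names only (this is not the MCD's real sync code): the reload error is logged and returned so a later sync retries, instead of rebooting the node.

package main

import (
	"errors"
	"fmt"
	"log"
)

// applyReload is a hypothetical stand-in for the daemon's reload path: on
// failure it logs and returns the error so a later sync can retry, rather
// than rebooting.
func applyReload(serviceName string, reload func() error) error {
	if err := reload(); err != nil {
		log.Printf("Reloading %s configuration failed. Error: %v", serviceName, err)
		return fmt.Errorf("could not apply update: reloading %s configuration failed: %w", serviceName, err)
	}
	return nil
}

func main() {
	// A transient failure surfaces as an error; the surrounding sync loop
	// (not shown) would simply call applyReload again on the next pass.
	err := applyReload("crio", func() error { return errors.New("transient systemd hiccup") })
	fmt.Println(err)
}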

Contributor

Thanks, Jerry, for providing the reasoning; I am convinced now 🙂

Contributor

@sinnykumari sinnykumari left a comment

This looks good to me; I will let others review as well.

Since #2290 and #2294 are open, we may want to bring everyone onto this PR or get those PRs merged first.

deads2k and others added 5 commits February 17, 2021 17:40
This file should contain the 10-year root CA, but currently it
also syncs the 1-year kube-apiserver-to-kubelet-signer cert.

Remove the kube-apiserver-to-kubelet-signer cert, since it is
not needed in this file.
This file syncs the contents of the kube-apiserver-to-kubelet-signer cert
managed by the openshift-kubeapi operator. This is a 1-year cert that
is auto-rotated, which would normally cause a non-user-triggered
update + reboot in live clusters, which is undesirable in many
customer cases.

Since the kubelet has the ability to live-reload the cert when the
file is changed, add this file to the "None" post-config-change action,
thus avoiding an automatic reboot every time the cert is rotated.
In 4.7 we introduced non-reboot updates for select cases, but
still drained nodes for safety. This is still a major workload
disruption, and non-reboot updates do not require draining for
correctness. This commit will instead not drain at all for
postConfigChangeActionNone updates, and in case of error, return
the error directly instead of rebooting in an attempt to fix it.

Note that a change to /etc/containers/registries.conf, i.e.
postConfigChangeActionReloadCrio triggering a crio reload, will
still cause a drain. This is to surface registries.conf errors
immediately: if we did not perform the drain, existing
containers would not attempt any new image pulls, so a "bad"
registries.conf would not cause any errors until a future
update, and the source of the error would be harder to find.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
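
As a reading aid for the second commit above, here is a rough sketch of the file-to-action mapping it describes. The file path, helper name, and fallback behaviour are assumptions for illustration, not taken from the MCO's code; only the "None" action idea comes from the commit.

package main

import "fmt"

const (
	postConfigChangeActionNone   = "none"   // illustrative value
	postConfigChangeActionReboot = "reboot" // illustrative value
)

// kubeletCABundlePath is an assumed path for the synced CA bundle.
const kubeletCABundlePath = "/etc/kubernetes/kubelet-ca.crt"

// actionForChangedFile sketches the idea: files the kubelet live-reloads need
// no post-config-change action; everything else falls back to a reboot here.
func actionForChangedFile(path string) string {
	if path == kubeletCABundlePath {
		return postConfigChangeActionNone
	}
	return postConfigChangeActionReboot
}

func main() {
	// A rotated kube-apiserver-to-kubelet-signer cert lands in the CA bundle,
	// which now maps to "none" instead of triggering a reboot.
	fmt.Println(actionForChangedFile(kubeletCABundlePath))
}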
@yuqi-zhang
Contributor Author

I've updated the commit messages (keeping the original authors) so the two commits are more easily understood.

0561404
3aabfec

If the original authors are ok, we can go ahead with this as the main PR. Will ask on the separate PRs.

@rphillips
Contributor

/approve
Thanks!

@yuqi-zhang
Contributor Author

/retest

Did some manual testing; the behaviour is as expected.

@sinnykumari
Contributor

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: darkmuggle, rphillips, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [darkmuggle,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@openshift-ci
Contributor

openshift-ci bot commented Feb 22, 2021

@yuqi-zhang: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-workers-rhel7
Commit: 22da53e
Details: link
Rerun command: /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
