Add kubelet CA to no-reboot action list/Do not drain for non-reboot actions #2398

Conversation

yuqi-zhang
Contributor

Incorporates #2294 and #2290, such that a kubelet cert signer rotation causes a "none" action.

In 4.7 we introduced non-reboot updates for select cases, but still drained nodes for safety. This is still a major workload disruption, and non-reboot updates do not require draining for correctness. This PR will also introduce the ability to not drain at all for non-reboot updates, and in case of error, return the error directly instead of rebooting in an attempt to fix it.

This will cause such updates to effectively not cordon nodes at all, and on AWS a change to e.g. the kubelet CA would take ~20 seconds to apply to a 3-node pool.

One last thing to consider is what behaviour we want for changes to /etc/containers/registries.conf, since not draining may cause a mismatch of image references between existing and new containers.
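
As a reading aid, here is a minimal sketch of the drain decision described above. This is not the PR's actual code: only the action names come from the PR, while needsDrain, the constant values, and the overall shape are illustrative assumptions.

package main

import "fmt"

// Post-config-change action names referenced in this PR; the string values
// are illustrative placeholders, not the MCO's real constants.
const (
	postConfigChangeActionNone       = "none"
	postConfigChangeActionReloadCrio = "reload crio"
)

// needsDrain is a hypothetical helper: skip the drain only when every
// computed action is "none"; reboots and crio reloads still drain.
func needsDrain(actions []string) bool {
	for _, action := range actions {
		if action != postConfigChangeActionNone {
			return true
		}
	}
	return false
}

func main() {
	// Kubelet CA rotation computes only a "none" action: no cordon, no drain.
	fmt.Println(needsDrain([]string{postConfigChangeActionNone})) // false
	// A registries.conf change still drains so reload errors surface immediately.
	fmt.Println(needsDrain([]string{postConfigChangeActionReloadCrio})) // true
}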

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2021
@yuqi-zhang
Contributor Author

#2294 and #2290 can still be merged separately; this exists as a consolidated PR if we want to merge everything together. The commits are still separate for cherry-picking, if needed.

/hold pending investigation of the last point above

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2021
@yuqi-zhang
Contributor Author

After some discussion and consideration, it is likely better for us to still drain on registries.conf changes (crio reload), to surface errors more quickly if the file is bad. See the commit message for details.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2021
@@ -132,8 +132,7 @@ func (dn *Daemon) performPostConfigChangeAction(postConfigChangeActions []string
 		if dn.recorder != nil {
 			dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeWarning, "FailedServiceReload", fmt.Sprintf("Reloading %s service failed. Error: %v", serviceName, err))
 		}
-		dn.logSystem("Reloading %s configuration failed, node will reboot instead. Error: %v", serviceName, err)
-		dn.reboot(fmt.Sprintf("Node will reboot into config %s", configName))
+		return fmt.Errorf("Could not apply update: reloading %s configuration failed. Error: %v", serviceName, err)
Contributor

I am not convinced that we shouldn't reboot to try to auto-fix the problem. These errors may not be user initiated; the user will be clueless anyway and will end up creating a bug report or rebooting the node explicitly to fix it. Since the MCO has already drained the node, attempting to auto-recover by rebooting makes sense, and if the problem still remains we will at least have logs showing that rebooting doesn't help, and we can investigate further.

Contributor Author

So I have two considerations here, although I do not feel super strongly either way:

  1. In terms of failure conditions of the reload, it could be due to
    1a) a bad config provided by the user causing the reload to fail, in which case rebooting does not help anyway
    1b) a transient issue with e.g. systemd, but since our logic keeps retrying through a new sync, it will resolve itself eventually if it's just a transient issue
    1c) an issue that isn't transient but is resolvable by a reboot. This would be covered by a reboot, but I'm not sure what an example would be
  2. Since we are saying this change is non-reboot, a customer may be confused if it rebooted (due to an error) and would not necessarily have insight into what the error was. Since we make calls to systemctl elsewhere in the MCD anyway, I thought it would be consistent to display the same failure mode (return the error, log it, and retry upon a later sync), as sketched below.
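
A compressed sketch of that failure mode, with illustrative names only (this is not the MCD's real sync code): the reload error is logged and returned so a later sync retries, instead of rebooting the node.

package main

import (
	"errors"
	"fmt"
	"log"
)

// applyReload is a hypothetical stand-in for the daemon's reload path: on
// failure it logs and returns the error so a later sync can retry, rather
// than rebooting.
func applyReload(serviceName string, reload func() error) error {
	if err := reload(); err != nil {
		log.Printf("Reloading %s configuration failed. Error: %v", serviceName, err)
		return fmt.Errorf("could not apply update: reloading %s configuration failed: %w", serviceName, err)
	}
	return nil
}

func main() {
	// A transient failure surfaces as an error; the surrounding sync loop
	// (not shown) would simply call applyReload again on the next pass.
	err := applyReload("crio", func() error { return errors.New("transient systemd hiccup") })
	fmt.Println(err)
}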

Contributor

Thanks, Jerry, for providing the reasoning; I am convinced now 🙂

Contributor

@sinnykumari sinnykumari left a comment

This looks good to me; I will let others review as well.

Since #2290 and #2294 are open, we may want to bring everyone onto this PR or get those PRs merged first.

deads2k and others added 5 commits February 17, 2021 17:40
This file should contain the 10-year root CA, but currently it
also syncs the 1-year kube-apiserver-to-kubelet-signer cert.

Remove the kube-apiserver-to-kubelet-signer cert, since it is
not needed in this file.
This file syncs the contents of the kube-apiserver-to-kubelet-signer cert
managed by the openshift-kubeapi operator. This is a 1-year cert that
is auto-rotated, which would normally cause a non-user-triggered
update + reboot in live clusters, which is undesirable in many
customer cases.

Since the kubelet has the ability to live-reload the cert when the
file is changed, add this file to the "None" post-config-change action,
thus avoiding an automatic reboot every time the cert is rotated.
In 4.7 we introduced non-reboot updates for select cases, but
still drained nodes for safety. This is still a major workload
disruption, and non-reboot updates do not require draining for
correctness. This commit will instead not drain at all for
postConfigChangeActionNone updates, and in case of error, return
the error directly instead of rebooting in an attempt to fix it.

Note that a change to /etc/containers/registries.conf, i.e.
postConfigChangeActionReloadCrio triggering a crio reload, will
still cause a drain. This is to surface registries.conf errors
immediately: if we did not perform the drain, existing
containers would not attempt any new image pulls, so a "bad"
registries.conf would not cause any errors until a future
update, and the source of the error would be harder to find.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
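
As a reading aid for the second commit above, here is a rough sketch of the file-to-action mapping it describes. The file path, helper name, and fallback behaviour are assumptions for illustration, not taken from the MCO's code; only the "None" action idea comes from the commit.

package main

import "fmt"

const (
	postConfigChangeActionNone   = "none"   // illustrative value
	postConfigChangeActionReboot = "reboot" // illustrative value
)

// kubeletCABundlePath is an assumed path for the synced CA bundle.
const kubeletCABundlePath = "/etc/kubernetes/kubelet-ca.crt"

// actionForChangedFile sketches the idea: files the kubelet live-reloads need
// no post-config-change action; everything else falls back to a reboot here.
func actionForChangedFile(path string) string {
	if path == kubeletCABundlePath {
		return postConfigChangeActionNone
	}
	return postConfigChangeActionReboot
}

func main() {
	// A rotated kube-apiserver-to-kubelet-signer cert lands in the CA bundle,
	// which now maps to "none" instead of triggering a reboot.
	fmt.Println(actionForChangedFile(kubeletCABundlePath))
}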
@yuqi-zhang
Contributor Author

I've updated the commit messages (keeping the original authors) so the two commits are more easily understood.

0561404
3aabfec

If the original authors are ok, we can go ahead with this as the main PR. Will ask on the separate PRs.

@rphillips
Contributor

/approve
Thanks!

@yuqi-zhang
Contributor Author

/retest

Did some manual testing; the behaviour is as expected.

@sinnykumari
Contributor

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: darkmuggle, rphillips, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [darkmuggle,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@openshift-ci
Contributor

openshift-ci bot commented Feb 22, 2021

@yuqi-zhang: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-workers-rhel7
Commit: 22da53e
Details: link
Rerun command: /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
