BZ1886712 - Document disabling autoreboot after MCO update #26829

Merged
merged 1 commit into from Oct 29, 2020
112 changes: 112 additions & 0 deletions modules/troubleshooting-disabling-autoreboot-mco.adoc
@@ -0,0 +1,112 @@
// Module included in the following assemblies:
//
// * support/troubleshooting/troubleshooting-operator-issues.adoc

[id="troubleshooting-disabling-autoreboot-mco_{context}"]
= Disabling Machine Config Operator from automatically rebooting

When configuration changes are made by the Machine Config Operator, {op-system-first} must reboot for the changes to take effect. Whether the configuration change is automatic, such as when a `kube-apiserver-to-kubelet-signer` CA is rotated, or manual, such as when a registry or SSH key is updated, an {op-system} node reboots automatically unless it is paused.

To avoid unwanted disruptions, you can modify the machine config pool to prevent automatic rebooting after the Operator makes changes to the machine config.

[NOTE]
====
Pausing a machine config pool stops all system reboot processes and prevents all configuration changes from being applied.
====

.Prerequisites

* You have access to the cluster as a user with the `cluster-admin` role.
* You have installed the OpenShift CLI (`oc`).
* You have root access in {product-title}.

.Procedure
. To pause the autoreboot process after machine config changes are applied:

* As root, set the `spec.paused` field to `true` in the MachineConfigPool custom resource (CR).
+
.Control plane (master) nodes
[source,terminal]
----
# oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/master
----
+
.Worker nodes
[source,terminal]
----
# oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
----

. To verify that the machine config pool is paused:
+
.Control plane (master) nodes
[source,terminal]
----
# oc get machineconfigpool/master --template='{{.spec.paused}}'
----
+
.Worker nodes
[source,terminal]
----
# oc get machineconfigpool/worker --template='{{.spec.paused}}'
----
+
The `spec.paused` field is `true` and the machine config pool is paused.

. Alternatively, to unpause the autoreboot process:

* As root, set the `spec.paused` field to `false` in the MachineConfigPool custom resource (CR).
+
.Control plane (master) nodes
[source,terminal]
----
# oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/master
----
+
.Worker nodes
[source,terminal]
----
# oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker
----
+
[NOTE]
====
When you unpause a machine config pool, all paused changes are applied at reboot.
====
. To verify that the machine config pool is unpaused:
+
.Control plane (master) nodes
[source,terminal]
----
# oc get machineconfigpool/master --template='{{.spec.paused}}'
----
+
.Worker nodes
[source,terminal]
----
# oc get machineconfigpool/worker --template='{{.spec.paused}}'
----
+
The `spec.paused` field is `false` and the machine config pool is unpaused.

. To see if the machine config pool has pending changes:
+
[source,terminal]
----
# oc get machineconfigpool
----
+
.Example output
----
NAME CONFIG UPDATED UPDATING
master rendered-master-546383f80705bd5aeaba93 True False
worker rendered-worker-b4c51bb33ccaae6fc4a6a5 True False
----
+
When `UPDATED` is `True` and `UPDATING` is `False`, there are no pending changes. When `UPDATED` is `False` and `UPDATING` is `True`, there are pending changes.
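
The interpretation of the `UPDATED` and `UPDATING` columns can be sketched as a small shell filter. This is a minimal illustration, not part of the documented procedure: `pending_pools` is a hypothetical helper name, and the sketch assumes that `UPDATED` and `UPDATING` are the third and fourth whitespace-separated columns, as in the example output. The sample data is copied from that example output so that the sketch runs without a cluster.

```shell
#!/bin/sh
# Sketch: print the name of every pool that is not in the
# "UPDATED=True, UPDATING=False" state, which means it has pending changes.
# pending_pools is a hypothetical helper, not an oc subcommand.
pending_pools() {
  # Skip the header row. Assumed columns: NAME CONFIG UPDATED UPDATING.
  awk 'NR > 1 && !($3 == "True" && $4 == "False") { print $1 }'
}

# Sample input copied from the example output. On a live cluster you
# would pipe the real command instead:
#   oc get machineconfigpool | pending_pools
sample='NAME CONFIG UPDATED UPDATING
master rendered-master-546383f80705bd5aeaba93 True False
worker rendered-worker-b4c51bb33ccaae6fc4a6a5 True False'

pending=$(printf '%s\n' "$sample" | pending_pools)
if [ -z "$pending" ]; then
  echo "no pending changes"
else
  echo "pending changes in: $pending"
fi
```

With the sample data, both pools are fully updated, so the filter returns no pool names.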

[IMPORTANT]
====
Schedule a maintenance window for a reboot as early as possible. During that window, set `spec.paused` to `false` so that the changes queued since the last reboot take effect.

Contributor: Would it make sense to also document how to check if a machineconfigpool has pending changes?

Contributor Author: Yes, I think so. I added a step that I borrowed from another machineconfigpool procedure. @palonsoro PTAL, I'm not sure if it's the right command and example output.

Contributor: Command is correct but it needs some explanation on how to interpret it. It would be important to explicitly explain that having True in UPDATING column and False in UPDATED column means that there are pending changes and vice-versa (having True in UPDATED column and False in UPDATING column means there are no pending changes).

Contributor Author: Resolved.

====
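
The pause and unpause steps in the procedure differ only in the value written to `spec.paused`, so they can be sketched as one helper that covers both pools. This is an illustration under stated assumptions: `set_mcp_paused` and the `DRY_RUN` variable are hypothetical names introduced here, and the only `oc` invocation used is the `oc patch` command from the procedure. By default the sketch prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: pause or unpause the master and worker pools with one call.
# set_mcp_paused and DRY_RUN are hypothetical names for this example.
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = run them

set_mcp_paused() {
  paused=$1   # expected value: true or false
  case "$paused" in
    true|false) ;;
    *) echo "usage: set_mcp_paused true|false" >&2; return 1 ;;
  esac
  for pool in master worker; do
    patch="{\"spec\":{\"paused\":$paused}}"
    if [ "$DRY_RUN" = "1" ]; then
      # Print the same command that the procedure shows.
      echo "oc patch --type=merge --patch='$patch' machineconfigpool/$pool"
    else
      oc patch --type=merge --patch="$patch" "machineconfigpool/$pool"
    fi
  done
}

# Print the commands that would pause both pools.
set_mcp_paused true
```

Defaulting to a dry run is a deliberate choice for a sketch like this, because unpausing a pool can trigger reboots. Setting `DRY_RUN=0` would run the commands against the cluster that `oc` is logged in to.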
2 changes: 1 addition & 1 deletion modules/understanding-control-plane.adoc
@@ -9,4 +9,4 @@ The control plane, which is composed of master machines, manages the
{product-title} cluster. The control plane machines manage workloads on the
compute machines, which are also known as worker machines. The cluster itself manages all upgrades to the
machines by the actions of the Cluster Version Operator, the
Machine Config Operator, and set of individual Operators.
Machine Config Operator, and a set of individual Operators.
15 changes: 10 additions & 5 deletions modules/understanding-machine-config-operator.adoc
@@ -19,15 +19,15 @@ constructs. They include:
plane. It monitors all of the cluster nodes and orchestrates their configuration
updates.
* The `machine-config-daemon` DaemonSet, which runs on
each node in the cluster and updates a machine to configuration defined by
MachineConfig as instructed by the MachineConfigController. When the node sees
each node in the cluster and updates a machine to configuration as defined by
MachineConfig and as instructed by the MachineConfigController. When the node detects
a change, it drains off its pods, applies the update, and reboots. These changes
come in the form of Ignition configuration files that apply the specified
machine configuration and control kubelet configuration. The update itself is
delivered in a container. This process is key to the success of managing
{product-title} and {op-system} updates together.
* The `machine-config-server` DaemonSet, which provides the Ignition config files
to master nodes as they join the cluster.
to control plane nodes as they join the cluster.

The machine configuration is a subset of the Ignition configuration. The
`machine-config-daemon` reads the machine configuration to see if it needs to do
@@ -36,5 +36,10 @@ configuration changes, or other changes to the operating system or {product-titl
configuration.

When you perform node management operations, you create or modify a
KubeletConfig Custom Resource (CR).
//See https://github.com/openshift/machine-config-operator/blob/master/docs/KubeletConfigDesign.md[KubeletConfigDesign] for details.
KubeletConfig custom resource (CR).
//See https://github.com/openshift/machine-config-operator/blob/master/docs/KubeletConfigDesign.md[KubeletConfigDesign] for details.

[IMPORTANT]
====
To prevent control plane nodes from autorebooting after machine config changes are applied, you must pause the autoreboot process by setting the `spec.paused` field to `true` in the machine config pool.
====
5 changes: 4 additions & 1 deletion support/troubleshooting/troubleshooting-operator-issues.adoc
@@ -5,7 +5,7 @@ include::modules/common-attributes.adoc[]

toc::[]

Operators are a method of packaging, deploying, and managing an {product-title} application. They act like an extension of the software vendor’s engineering team, watching over an {product-title} environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, like skipping a software backup process to save time.
Operators are a method of packaging, deploying, and managing an {product-title} application. They act like an extension of the software vendor’s engineering team, watching over an {product-title} environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, such as skipping a software backup process to save time.

{product-title} {product-version} includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).

@@ -24,3 +24,6 @@ include::modules/querying-operator-pod-status.adoc[leveloffset=+1]

// Gathering Operator logs
include::modules/gathering-operator-logs.adoc[leveloffset=+1]

// Disabling Machine Config Operator from autorebooting
include::modules/troubleshooting-disabling-autoreboot-mco.adoc[leveloffset=+1]