OSDOCS-2427: Update best practices

openshift · Dec 6, 2023 · 63ffbfd
1 parent 8f1c0b3
commit 63ffbfd
Showing 2 changed files with 67 additions and 1 deletion.
diff --git a/modules/update-best-practices.adoc b/modules/update-best-practices.adoc
@@ -0,0 +1,63 @@
+// Module included in the following assemblies:
+//
+// * updating/preparing_for_updates/updating-cluster-prepare.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="update-best-practices_{context}"]
+= Best practices for cluster updates
+
+{product-title} is designed to provide a robust update experience that allows clusters to update with minimal disruptions to workloads.
+Updates will not begin unless the cluster is determined to be in an upgradeable state at the time of the update request.
+
+This design enforces some key conditions before an update is initiated, but you can take several additional actions to increase the chance of a successful cluster update.
+
+[discrete]
+[id="recommended-versions_{context}"]
+=== Choose versions recommended by the OpenShift Update Service
+
+The OpenShift Update Service (OSUS) provides update recommendations based on cluster characteristics such as the cluster's subscribed channel.
+The Cluster Version Operator saves these recommendations as either recommended or conditional updates.
+While it is possible to attempt an update to a version that is not recommended by OSUS, following a recommended update path protects users from encountering known issues or unintended consequences on the cluster.
+
+Choose only update targets that are recommended by OSUS to increase the chance of a successful update.
+
+[discrete]
+[id="critical-alerts_{context}"]
+=== Address all critical alerts on the cluster
+
+Critical alerts must always be addressed as soon as possible, but it is especially important to address these alerts and resolve any problems before initiating a cluster update.
+Failing to address critical alerts before beginning an update can cause problematic conditions for the cluster.
+
+[discrete]
+[id="cluster-upgradeable_{context}"]
+=== Ensure that the cluster is in an Upgradeable state
+
+When one or more Operators have not reported their `Upgradeable` condition as `True` for more than an hour, the `ClusterNotUpgradeable` warning alert is triggered in the cluster.
+In most cases, patch updates are not blocked by this alert, but you cannot perform a minor version update until this alert is resolved and all Operators report `Upgradeable` as `True`.
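Illustration only, not part of the module text: a minimal Python sketch of the check described above. It assumes operator status has already been exported as dictionaries with `metadata.name` and `status.conditions` entries carrying `type` and `status` fields. The function name `not_upgradeable`, the operator names, and the sample data are hypothetical. Operators that do not report an `Upgradeable` condition at all are skipped here, which is a simplifying assumption.

```python
def not_upgradeable(operators):
    """Return names of operators that report Upgradeable with a status other than "True"."""
    blocked = []
    for op in operators:
        for cond in op.get("status", {}).get("conditions", []):
            if cond.get("type") == "Upgradeable" and cond.get("status") != "True":
                blocked.append(op["metadata"]["name"])
    return blocked


# Hypothetical sample data; a real cluster reports many more operators and conditions.
sample = [
    {"metadata": {"name": "operator-a"},
     "status": {"conditions": [{"type": "Upgradeable", "status": "True"}]}},
    {"metadata": {"name": "operator-b"},
     "status": {"conditions": [{"type": "Upgradeable", "status": "False"}]}},
    {"metadata": {"name": "operator-c"},
     "status": {"conditions": [{"type": "Available", "status": "True"}]}},
]

print(not_upgradeable(sample))  # ['operator-b']
```

An empty result from a check like this suggests that no Operator is explicitly blocking a minor version update.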
+
+[discrete]
+[id="nodes-ready_{context}"]
+=== Ensure that enough spare nodes are available
+
+// Completely guessing the explanation in this section just to have something to start with when this is reviewed by an SME.
+Do not run a cluster with little or no spare node capacity, especially when initiating a cluster update.
+Nodes that are not running and available might limit a cluster's ability to perform an update with minimal disruption to cluster workloads.
+
+Depending on the `maxUnavailable` value configured for the cluster's machine config pools, an unavailable node can also prevent itself and other nodes from having machine configuration changes applied during a cluster update.
+Additionally, if compute nodes do not have enough spare capacity, workloads might not be able to temporarily shift to another node while the first node is taken offline for an update.
+
+Make sure that you have enough available nodes in each worker pool, as well as enough spare capacity on your compute nodes, to increase the chance of successful node updates.
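Illustration only, not part of the module text: a simplified Python model of the two constraints described above. It assumes `maxUnavailable` is an integer and that nodes that are already unavailable count against that budget. The capacity check compares totals only and ignores per-node scheduling constraints. The function names and the numbers are hypothetical.

```python
def update_slots(max_unavailable, unavailable_nodes):
    """Nodes in a pool that can still be taken offline for an update."""
    return max(0, max_unavailable - unavailable_nodes)


def fits_elsewhere(drained_requests, free_per_node):
    """Crude check: total free capacity on the other nodes covers the drained node's requests."""
    return sum(free_per_node) >= drained_requests


# With a budget of 1, a single node that is already unavailable leaves no room to update others.
print(update_slots(1, 0))  # 1
print(update_slots(1, 1))  # 0

# CPU requests in millicores: 6000m to move, but only 3500m free on the remaining nodes.
print(fits_elsewhere(6000, [2000, 1500]))  # False
print(fits_elsewhere(3000, [2000, 1500]))  # True
```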
+
+[discrete]
+[id="pod-disruption-budget_{context}"]
+=== Ensure that PodDisruptionBudget objects in the cluster are properly configured
+
+The `PodDisruptionBudget` object allows you to define the minimum number or percentage of pod replicas that must be available at any given time.
+This configuration allows workloads to be protected from disruptions during maintenance tasks such as cluster updates.
+
+However, it is possible to configure the `PodDisruptionBudget` for a given topology in a way that prevents nodes from being drained and updated during a cluster update.
+
+When planning a cluster update, check the configuration of the `PodDisruptionBudget` object for the following factors:
+
+* For highly available workloads, make sure there are replicas that can be temporarily taken offline without being prohibited by the `PodDisruptionBudget`.
+
+* For workloads that are not highly available, make sure they are either not protected by a `PodDisruptionBudget` or have an alternative mechanism that eventually allows them to be drained, such as periodic restart or guaranteed eventual termination.
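Illustration only, not part of the module text: a minimal Python sketch of how a minimum-availability budget limits evictions, which is why a single-replica workload with a minimum of one available pod can block a node drain. It assumes the budget is expressed as a minimum number or percentage of available pods and that a percentage is rounded up to the next whole pod. The function name `allowed_disruptions` is hypothetical.

```python
def allowed_disruptions(healthy_pods, expected_pods, min_available):
    """Pods that can be evicted right now without violating the budget."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Integer ceiling of expected_pods * percent / 100.
        required = -(-expected_pods * percent // 100)
    else:
        required = int(min_available)
    return max(0, healthy_pods - required)


# Highly available workload: one of three replicas can be taken offline.
print(allowed_disruptions(3, 3, 2))      # 1

# Single replica protected by a minimum of 1: the node cannot be drained.
print(allowed_disruptions(1, 1, 1))      # 0

# Percentage budget: 50% of 7 rounds up to 4 required pods.
print(allowed_disruptions(7, 7, "50%"))  # 3
```

A result of `0` for a workload corresponds to the case in the list above where a drain is prohibited by the `PodDisruptionBudget`.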
diff --git a/updating/preparing_for_updates/updating-cluster-prepare.adoc b/updating/preparing_for_updates/updating-cluster-prepare.adoc
@@ -51,4 +51,7 @@ include::modules/update-preparing-conditional.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
 .Additional resources
-* xref:../../updating/understanding_updates/how-updates-work.adoc#update-evaluate-availability_how-updates-work[Evaluation of update availability]
+* xref:../../updating/understanding_updates/how-updates-work.adoc#update-evaluate-availability_how-updates-work[Evaluation of update availability]
+
+// Best practices for cluster updates
+include::modules/update-best-practices.adoc[leveloffset=+1]