This repository has been archived by the owner on Apr 30, 2020. It is now read-only.
Document strategy for operator-based upgrades
Signed-off-by: John Strunk <jstrunk@redhat.com>
1 parent 624533c, commit fba1dc0
Showing 1 changed file with 146 additions and 0 deletions.

The operator is responsible for incorporating the domain-specific expertise
necessary to perform upgrades of the Gluster cluster. An administrator, either
directly or through an upgrade service, will choose the desired version of
Gluster (GCS) that should be running on the cluster, and the operator must
properly and autonomously upgrade all components to match that request.

# Problem overview

A version of GCS consists of a specific set of container images plus the
associated object manifests required to deploy them. In order to upgrade, the
container images must be changed to those in a subsequent release. These
updates, however, must be mindful of version compatibility between the
components. Particularly when upgrading across a series of releases, it may be
necessary to take intermediate upgrade steps to ensure client (CSI driver)
versions remain compatible with server (Gluster container) versions.

## Interaction with Operator Lifecycle Manager

[OLM](https://github.com/operator-framework/operator-lifecycle-manager) is
designed to manage the life cycle of operators (i.e., ensure dependencies,
install, upgrade, etc.). It accomplishes this by maintaining a set of
ClusterServiceVersion objects that describe the available versions of an
operator. These objects are then assembled to create an upgrade path for a
given operator.

OLM attempts to install the latest version of an operator by walking the CSV
versions. It does this by replacing the running operator with the next version,
waiting for it to become ready, then repeating until the operator is
up-to-date. This rapid-succession replacement does not have a mechanism to
pause on a version while underlying services are updated. This has two
implications:

- The operator need only support upgrading its internal state from version
  *n-1* to version *n*. (OLM walks through each version.)
- Every version of the operator must be capable of upgrading the Gluster
  cluster from an arbitrarily old version to the current version. (OLM will not
  wait for lower-level upgrades.)

Because of these constraints, we must have a method for representing upgrade
and deployment actions that can be maintained indefinitely as a part of the
operator's logic.

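As a rough illustration of the CSV walk described in this section, the sketch
below follows a chain of CSVs from the installed one to the newest. It assumes
each CSV names its predecessor (as the `replaces` field of a CSV does); the
operator names and the `olm_walk` helper are hypothetical, and real OLM
behavior involves more than this.

```python
# Hypothetical CSV names; each maps to the CSV it replaces (None for the first).
REPLACES = {
    "gcs-operator.v1.0.0": None,
    "gcs-operator.v1.1.0": "gcs-operator.v1.0.0",
    "gcs-operator.v1.2.0": "gcs-operator.v1.1.0",
}


def olm_walk(installed, replaces):
    """Return the CSVs that get installed, in order, after `installed`."""
    # Invert the mapping: old CSV name -> the CSV that replaces it.
    successor = {old: new for new, old in replaces.items() if old is not None}
    steps = []
    current = installed
    while current in successor:
        current = successor[current]
        steps.append(current)
    return steps


print(olm_walk("gcs-operator.v1.0.0", REPLACES))
```

Nothing in this walk waits for the Gluster cluster itself, which is why every
operator version has to cope with an arbitrarily old cluster version.
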
# Proposed solution

The operator will contain a manifest of all known versions and legal version
transitions. The operator will drive the system toward the most recent (i.e.,
highest) version in the manifest by stepping through them. The format of the
manifest must remain in sync with the operator code-base, but it could be either
in the operator container or in a sidecar container within the same pod.

The version sequence manifest would look something like:

```yaml
---

# List of versions we know about
- version: 5.0  # describes GCS v5
  # The components that go into the GCS "system"
  components:
    - name: gluster-csi-file
      # Template is probably a Deployment or DaemonSet, etc.
      template: ...
    - name: ...
  # The minimum version that we can upgrade from directly
  upgradeMinVersion: 4.0
- version: 4.1
  components:
  upgradeMinVersion: 3.0
- version: 4.0
  components:
  upgradeMinVersion: 3.0
...
```

In the above example, there are 3 versions defined: 4.0, 4.1, and 5.0. If 3.0
is currently deployed, the operator could upgrade from 3.0 to either 4.0 or
4.1, then to 5.0. Each new release of the operator would add the latest version
to the manifest and retain all old ones to preserve the upgrade path.

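A minimal sketch of how an upgrade path could be derived from such a manifest
follows. It is not the operator's implementation: the `plan_upgrade` helper,
the tuple encoding of versions, and the choice to always take the largest legal
step are assumptions made for illustration. It also assumes that
`upgradeMinVersion` permits a direct upgrade from any version at or above it.

```python
# The example manifest, reduced to the fields needed for path planning.
# Versions are (major, minor) tuples so that they compare numerically.
MANIFEST = [
    {"version": (5, 0), "upgradeMinVersion": (4, 0)},
    {"version": (4, 1), "upgradeMinVersion": (3, 0)},
    {"version": (4, 0), "upgradeMinVersion": (3, 0)},
]


def plan_upgrade(current, manifest):
    """Return the versions to apply, in order, to reach the highest one."""
    path = []
    target = max(entry["version"] for entry in manifest)
    while current < target:
        # Versions newer than `current` that accept a direct upgrade from it.
        reachable = [
            entry["version"]
            for entry in manifest
            if entry["version"] > current
            and entry["upgradeMinVersion"] <= current
        ]
        if not reachable:
            raise ValueError(f"no upgrade path from {current} to {target}")
        current = max(reachable)  # assumption: take the largest legal step
        path.append(current)
    return path


print(plan_upgrade((3, 0), MANIFEST))
```

Starting from 3.0, this sketch steps to 4.1 and then to 5.0; going through 4.0
instead would be equally legal under the manifest.
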
The information necessary to deploy a given component resides in
`components.template`. This could be a Deployment, a DaemonSet, or some other
description of the component. It is up to the operator implementation to
determine exactly what is necessary, but anything about how the component is
deployed that is likely to change should be reflected here instead of buried in
the operator source code.

## Tracking upgrades

The operator may get killed, crash, or be upgraded while it is in the process
of upgrading resources. When this happens, the restarted operator may have
difficulty recognizing the current version of the cluster and picking up at the
correct place in the upgrade process. To account for this, we will perform
intent logging in the form of an entry in the Cluster CRD status field that
contains the version that should be applied. This field would be updated once
the operator decides it is OK to upgrade to a given version, but before any
upgrade actions have taken place.

Example:

```yaml
# Status field in the top-level Cluster CRD
status:
  clusterVersion: 4.1
```

Once the `clusterVersion` field has been updated, the operator is committed to
rolling out that version of the system. As such, part of the natural
reconciliation loop of the operator is to confirm that all containers are
running the Deployments/images/etc. that correspond to this release version in
the manifest. The field will naturally remain at the last applied version and
serve as an easy starting point for future upgrade resolution.

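The intent-logging order (record the version first, then act on it) can be
sketched as below. This is only an illustration: the `reconcile` function, the
dict standing in for the Cluster resource, and the `deploy` callback are all
hypothetical, and a real operator would have to persist the status update to
the API server before rolling anything out.

```python
def reconcile(cluster, next_version, deploy):
    """One pass of a hypothetical reconcile loop with intent logging.

    cluster: dict standing in for the Cluster custom resource
    next_version: version the operator has decided is OK to apply, or None
    deploy: callable that rolls out the components for a given version
    """
    status = cluster.setdefault("status", {})
    if next_version is not None and status.get("clusterVersion") != next_version:
        # Record the intent before any upgrade action takes place.
        status["clusterVersion"] = next_version
    # Always converge on the recorded version, so a restart after a crash
    # resumes the same rollout instead of guessing the cluster's version.
    if "clusterVersion" in status:
        deploy(status["clusterVersion"])


# Simulate a crash in the middle of a rollout, then a restart.
cluster = {"status": {"clusterVersion": "4.0"}}
applied = []


def crashing_deploy(version):
    raise RuntimeError("operator killed mid-upgrade")


try:
    reconcile(cluster, "4.1", crashing_deploy)
except RuntimeError:
    pass

# The restarted operator has no new decision to make; it reads the intent.
reconcile(cluster, None, applied.append)
print(cluster["status"]["clusterVersion"], applied)
```

Because the intent is written before the rollout starts, the second pass picks
up the interrupted upgrade without inspecting individual containers.
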
## Disadvantages

Embedding the version manifest in the operator container(s) requires a new
release of the operator any time a new release of any GCS sub-component is
created.

Note that for development, symbolic image tags could be used (e.g., `:dev` or
`:latest`) to avoid having to repeatedly build the operator container just to
update the manifest.

# Alternative solutions considered

Below is a list of other approaches that were considered.

## Container versions in CRD

The desired version of each container could be directly placed into the
operator CRDs. This would simplify the operator logic as it would simply need
to deploy the given image/version. The unfortunate consequence is that an
administrator would be forced to update the appropriate fields on each upgrade,
making sure to choose versions that are compatible. This would need to be done
multiple times, in sequence, if upgrading several versions. There is also no
way of saying "take me to release *X*."

## Manifest as ConfigMap

The manifest described above could be represented as a ConfigMap to provide the
upgrade sequencing to the operator. In order to fit with OLM, it must be
possible to express a "version" via a CSV. Currently the CSV expects to
describe the operator as a Deployment, and there appears to be no facility to
add a proper ConfigMap object. This means the version payload needs to be a
part of the operator container(s). Additionally, since the format and contents
of the manifest are tied to the operator version, separating them introduces a
source of error.