Commit fba1dc0: Document strategy for operator-based upgrades
JohnStrunk committed Aug 3, 2018
Signed-off-by: John Strunk <jstrunk@redhat.com>

docs/Developers/Design/Upgrade_design.md
The operator is responsible for incorporating the domain-specific expertise
necessary to perform upgrades of the Gluster cluster. An administrator, either
directly or through an upgrade service, will choose the desired version of
Gluster (GCS) that should be running on the cluster, and the operator must
properly and autonomously upgrade all components to match that request.

# Problem overview

A version of GCS comprises a specific set of container images plus the
associated object manifests required to deploy them. To upgrade, the
container images must be changed to those of a subsequent release. These
updates, however, must be mindful of version compatibility between the
components. Particularly when upgrading across a series of releases, it may
be necessary to take intermediate upgrade steps to ensure client (CSI driver)
versions remain compatible with server (Gluster container) versions.

## Interaction with Operator Lifecycle Manager

[OLM](https://github.com/operator-framework/operator-lifecycle-manager) is
designed to manage the life cycle of operators (i.e., ensure dependencies,
install, upgrade, etc.). It accomplishes this by maintaining a set of
ClusterServiceVersion objects that describe the available versions of an
operator. These objects are then assembled to create an upgrade path for a
given operator.

OLM attempts to install the latest version of an operator by walking the CSV
versions. It does this by replacing the running operator with the next version,
waiting for it to become ready, then repeating until the operator is up to
date. This rapid-succession replacement has no mechanism to pause on a
version while underlying services are updated. This has two implications:

- The operator is required to support upgrading its internal state from version
*n-1* to version *n* only. (OLM walks through each version.)
- Every version of the operator must be capable of upgrading the Gluster
cluster from an arbitrarily old version to the current version. (OLM will not
wait for lower-level upgrades.)

Because of these constraints, we must have a method for representing upgrade
and deployment actions that can be maintained indefinitely as part of the
operator's logic.

# Proposed solution

The operator will contain a manifest of all old versions and the legal
transitions between them. The operator will drive the system toward the most
current (i.e., highest) version in the manifest by stepping through the
intermediate versions. The format of the manifest must remain in sync with
the operator codebase, but the manifest itself could live either in the
operator container or in a sidecar container within the same pod.

The version sequence manifest would look something like:

```yaml
---

# List of versions we know about
- version: 5.0 # describes GCS v5
  # The components that go into the GCS "system"
  components:
    - name: gluster-csi-file
      # Template is probably a Deployment or DaemonSet, etc.
      template: ...
    - name: ...
  # The minimum version that we can upgrade from directly
  upgradeMinVersion: 4.0
- version: 4.1
  components:
  upgradeMinVersion: 3.0
- version: 4.0
  components:
  upgradeMinVersion: 3.0
...
```

In the above example, three versions are defined: 4.0, 4.1, and 5.0. If 3.0
is currently deployed, the operator could upgrade from 3.0 to either 4.0 or
4.1, then to 5.0. Each new release of the operator would add the latest
version to the manifest and retain all old ones to preserve the upgrade path.
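
To make the stepping logic concrete, below is a minimal sketch of how the
operator might compute an upgrade path from the manifest. It is written in
Go, since Kubernetes operators are typically implemented in Go; the `Version`
type and `versionLT` helper are illustrative stand-ins, not existing code.

```go
package upgrade

import "fmt"

// Version is one entry in the embedded version manifest. The fields mirror
// the YAML above; component templates are omitted for brevity.
type Version struct {
	Name              string // e.g., "5.0"
	UpgradeMinVersion string // lowest version we may upgrade from directly
}

// PlanUpgrade returns the ordered list of versions to step through to move
// a cluster from `current` to the newest version in `manifest`. The
// manifest is assumed to be sorted newest-first, as in the YAML example.
func PlanUpgrade(manifest []Version, current string) ([]string, error) {
	var path []string
	for current != manifest[0].Name {
		next := ""
		for _, v := range manifest {
			// Greedy choice: the newest version directly reachable from
			// `current` (manifest is newest-first, so first match wins).
			if versionLT(current, v.Name) && !versionLT(current, v.UpgradeMinVersion) {
				next = v.Name
				break
			}
		}
		if next == "" {
			return nil, fmt.Errorf("no upgrade path from version %s", current)
		}
		path = append(path, next)
		current = next
	}
	return path, nil
}

// versionLT compares simple "major.minor" version strings.
func versionLT(a, b string) bool {
	var am, an, bm, bn int
	fmt.Sscanf(a, "%d.%d", &am, &an)
	fmt.Sscanf(b, "%d.%d", &bm, &bn)
	return am < bm || (am == bm && an < bn)
}
```

Given the example manifest and a cluster at 3.0, this greedy strategy yields
the path 4.1, then 5.0; preferring the newest directly reachable version
keeps the number of intermediate steps to a minimum.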

The information necessary to deploy a given component resides in
`components.template`. This could be a Deployment, a DaemonSet, or some other
resource definition. It is up to the operator implementation to determine
exactly what is necessary, but anything that is likely to change about how a
component is deployed should be captured here rather than buried in the
operator source code.
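
As one illustration (this design does not prescribe it), the operator could
decode each `template` into a generic Kubernetes object so the concrete kind
never needs to be hard-coded. The hypothetical `decodeTemplate` helper below
uses the standard apimachinery and sigs.k8s.io/yaml libraries:

```go
import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

// decodeTemplate parses a component's template (raw YAML taken from the
// version manifest) into a generic object that can be handed to the API
// server, whether it is a Deployment, a DaemonSet, or something else.
func decodeTemplate(raw []byte) (*unstructured.Unstructured, error) {
	obj := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(raw, obj); err != nil {
		return nil, err
	}
	return obj, nil
}
```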

## Tracking upgrades

The operator may get killed, crash, or be upgraded while it is in the process
of upgrading resources. When this happens, the restarted operator may have
difficulty recognizing the current version of the cluster and picking up at the
correct place in the upgrade process. To account for this, we will perform
intent logging in the form of an entry in the Cluster CRD status field that
contains the version that should be applied. This field would be updated once
the operator decides it is safe to upgrade to a given version, but before any
upgrade actions have taken place.

Example:

```yaml
# Status field in the top-level Cluster CRD
status:
  clusterVersion: 4.1
```

Once the `clusterVersion` field has been updated, the operator is committed
to rolling out that version of the system. As such, part of the natural
reconciliation loop of the operator is to confirm that all containers are
running the Deployments, images, etc. that the manifest specifies for this
version. This field will naturally remain at the last applied version and
serve as an easy starting point for future upgrade resolution.
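
Continuing the earlier sketch, the version portion of the reconciliation
loop might look like the following. `GlusterCluster`, `componentsReady`,
`updateStatus`, and `applyComponents` are stand-ins for the operator's real
CRD types and Kubernetes API plumbing:

```go
// GlusterCluster is a stand-in for the operator's top-level Cluster CRD.
type GlusterCluster struct {
	Status struct {
		ClusterVersion string
	}
}

// reconcileVersion applies the commit-then-act pattern described above.
func reconcileVersion(cluster *GlusterCluster, manifest []Version) error {
	current := cluster.Status.ClusterVersion

	// Finish rolling out the committed version before taking another step.
	if !componentsReady(cluster, current) {
		return applyComponents(cluster, current)
	}
	if current == manifest[0].Name {
		return nil // already at the newest version
	}

	path, err := PlanUpgrade(manifest, current)
	if err != nil {
		return err
	}
	// Intent log: persist the decision *before* any upgrade action, so a
	// restarted operator knows which version it is committed to.
	cluster.Status.ClusterVersion = path[0]
	if err := updateStatus(cluster); err != nil {
		return err
	}
	return applyComponents(cluster, cluster.Status.ClusterVersion)
}

// Stubs for the operator's Kubernetes API interactions.
func componentsReady(c *GlusterCluster, version string) bool  { return true }
func updateStatus(c *GlusterCluster) error                    { return nil }
func applyComponents(c *GlusterCluster, version string) error { return nil }
```

On restart, the operator re-reads `clusterVersion` and re-enters this loop,
which is exactly the recovery property the intent log is meant to provide.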

## Disadvantages

Embedding the version manifest in the operator container(s) requires a new
release of the operator any time a new release of any GCS sub-component is
created.

Note that for development, symbolic image tags could be used (e.g., `:dev` or
`:latest`) to avoid having to repeatedly build the operator container just to
update the manifest.

# Alternative solutions considered

Below is a list of other approaches that were considered.

## Container versions in CRD

The desired version of each container could be placed directly into the
operator CRDs. This would simplify the operator logic, as it would only need
to deploy the given image/version. The unfortunate consequence is that an
administrator would be forced to update the appropriate fields on each
upgrade, taking care to choose versions that are compatible. This would need
to be done multiple times, in sequence, when upgrading across several
versions. There is also no way of saying "take me to release *X*."

## Manifest as ConfigMap

The manifest described above could be represented as a ConfigMap to provide
the upgrade sequencing to the operator. To fit with OLM, however, a "version"
must be expressible via a CSV. Currently, the CSV expects to describe the
operator as a Deployment, and there appears to be no facility for adding a
separate ConfigMap object. This means the version payload needs to be part of
the operator container(s). Additionally, since the format and contents of the
manifest are tied to the operator version, separating them introduces a
source of error.
