Skip to content
Browse files
Merge pull request #67 from batmat/JENKINS-49406-JEP-submission
[JENKINS-49406] Evergreen snapshotting data safety system JEP
  • Loading branch information
rtyler committed Mar 22, 2018
2 parents b5b57a9 + 825ee33 commit 949cbdb6bb2823a0a780e1005cf86a9b815f48b6
Showing with 327 additions and 0 deletions.
  1. +324 −0 jep/302/README.adoc
  2. +3 −0 jep/README.adoc
@@ -0,0 +1,324 @@
= JEP-302: Evergreen snapshotting data safety system
:toc: preamble
:toclevels: 3
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:

| 302

| Title
| Evergreen snapshotting data safety system

| Sponsor

// Use the script `set-jep-status <jep-number> <status>` to update the status.
| Status
| Draft :speech_balloon:

| Type
| Standards

| Created
| 2018-03-21
// Uncomment if there is an associated placeholder JIRA issue.
// Uncomment if there will be a BDFL delegate for this JEP.
//| BDFL-Delegate
//| :bulb: Link to github user page :bulb:
// Uncomment if discussion will occur in forum other than jenkinsci-dev@ mailing list.
//| Discussions-To
//| :bulb: Link to where discussion and final status announcement will occur :bulb:
// Uncomment if this JEP depends on one or more other JEPs.
| Requires
| JEP-300, JEP-301
// Uncomment and fill if this JEP is rendered obsolete by a later JEP
//| Superseded-By
//| :bulb: JEP-NUMBER :bulb:
// Uncomment when this JEP status is set to Accepted, Rejected or Withdrawn.
//| Resolution
//| :bulb: Link to relevant post in the jenkinsci-dev@ mailing list archives :bulb:


== Abstract

link:[Jenkins Essentials], and more specifically its link:[Evergreen] _component_, aims at providing an link:[automatically updating distribution] of Jenkins.

Continuous Delivery is about making small incremental changes, making failures much more easily recoverable. In the context here, it means Jenkins Essentials must to be able to seamlessly upgrade Jenkins, _but also_ roll back to the previously running version if an upgrade goes wrong.
As Jenkins does not support downgrading alone, this document introduces the snapshotting system which enables that auto-downgrade capability.

== Specification

_Evergreen_ works with two main components: Jenkins itself, and the _evergreen-client_.

=== Upgrading and downgrading

Once an _evergreen client_ has been instructed to perform an upgrade, it is responsible for the following operations:

. (If needed) Initialize the git repository.
. Stop Jenkins
. Take a snapshot (see <<snapshot>> below)
. Perform the instructed upgrade to the given Essentials BOM footnote:[Bill Of Materials: this format is currently being designed, but will list everything constituting a version of Essentials: WAR and exact versions of all plugins]
. Start new Jenkins version and check Jenkins state (see below <<healthcheck>>).
. If a rollback is decided:
.. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot].
.. Roll back to the previous data snapshot and Essentials BOM version.
Doing so, we will create an actual new commit using `revert` (i.e. avoid `git reset --hard HEAD~`), to keep a durable track of where we went through, accessible through `git log`.
.. Start (previous) Jenkins version
// what if starting the previous version doesn't work either?
. Report the outcome to the Essentials backend.

==== Take a snapshot

Behind the scenes, this system uses `git` for the purpose of snapshotting, excluding everything not ignored by `.gitignore` in Git's index.
In other words, `git status` just after a **snapshot** action, would read: `nothing to ocmmit. working tree clean`

This system is **not** responsible for snapshotting data which is too large or is not sensible to store.
To support snapshotting, Jenkins is configured in a way that physically separates as much as possible between those files which must be snapshotted, and those which must not. See <<data_segregation>> for more details below.

I think .gitignore content must be designed to be able to evolve over time.
To allow more flexibility, I think the content should be associated between an essentials release/bom to a given .gitignore content.

An manual example of the process can be visualized with the following:

. Update `.gitignore` content with current Essentials release.
. `git add --all`
. `git commit -m '[Upgrade] From BOM x.y.z to a.b.c'`
The commit log should ideally be made understandable for humans.
We will use tags to be able to revert/switch between snapshots in a programmatic reliable way.
Each tag name should be designed so that it is clear and easy to link it to a given version of Essentials.
We need to finish up the work on the BOM to be more precise here.

===== Segregating configuration from binaries, build data, logs, etc

To support snapshotting only appropriate files, Jenkins must be configured with a few non-default values to better separate the more static, and critically important, files which must be preserved between upgrades.
These files include:

* Jenkins global configuration
* Job/Pipeline configuration

Compared to the more ephemeral files, such as: build data, workspaces, exploded plugin files, and exploded core files.
To keep things simple, Evergreen uses a single Docker volume, but introduces an additional level to separate files that must be snapshotted, and those which don't require snapshotting.
Incidentally, this keeps `.gitignore` short.

Basically, instead of the typical `/var/jenkins_home`, Evergreen introduces two subdirectories under `/evergreen`, referred to hereafter as `$EVERGREEN_HOME`.

* `/evergreen/jenkins/home` (=`$JENKINS_HOME`) for to-be-snapshotted content, and
* `/evergreen/jenkins/var` for the rest.

On the filesystem, for example, this would be laid out as such:

├── home
│   ├── jobs
│   │   └── the_job # configuration file only
│   ├── nodes
│   ├── plugins
│   ├── secrets
│   ├── updates
│   ├── userContent
│   └── users
└── var
   ├── logs # JENKINS-50291
   │   └── tasks
├── plugins # exploded plugins, using --pluginroot switch
├── jobs # JENKINS-50164
│   └── the_job
│   ├── builds
│   └── workspace
└── war # using --webroot
├── ...

===== Files to store

Using the data segregation explained above, Evergreen snapshots _almost_ everything under `/evergreen/jenkins/home`.

Evergreen must have a `.gitignore` file for some files that either cannot be moved elsewhere, or that should not be stored in the Git repository.
As mentioned above, this file will likely need to be iterated upon as needs change:


Regarding `$JENKINS_HOME/plugins`, this directory contains the hpi/jpi files before extraction.
Ideally, Evergreen would move this elsewhere under `$EVERGREEN_HOME/jenkins/var/plugins`, but this is currently not yet doable, as
`--pluginsroot` only configures a different location for exploded plugins.

=== Checking Jenkins health

From the perspective of this proposal, health checking Jenkins itself is out of scope.
But the _driver_ of the upgrade, _evergreen client_, requires a way to determine whether or not a rollback should be executed.

For reference, the Jira issue tracking this design work is: link:[JENKINS-50294].

== Motivation

Jenkins has never supported downgrading by itself, and it's unlikely the core constructs will change in this regard anytime soon.
The official way to revert an upgrade if something went wrong is to restore a previous backup.

In the context of _Essentials_, it cannot rely on external backups to revert to the _N-1_ version as this would require regular manual user intervention, which is clearly not the desired user experience.

== Reasoning

=== Scope of the data snapshotting

Snapshotting data is **not** a backup system.

The practical time frame where the snapshots are designed to be used is within the seconds or minutes after an upgrade has been initiated.
If Jenkins, after it has been restarted, is deemed unhealthy, then an auto-rollback _can_ be initiated.

If a version is determined to be problematic after a few days, the data snapshotting system will **not** be used.
After a longer time period, where Jenkins has executed user-motivated workloads, generating new data, the snapshots can no longer be treated as a source of truth.
Therefore rolling back outside of the "upgrade window" would risk data loss.

Errors discovered outside of this "upgrade window" should instead be resolved by new changes to Jenkins core, or an erring plugin, in order to solve the user's issue.

=== Why Git

Using filesystem-level tools offering a snapshotting feature, like LVM, ZFS or btrfs to give a few examples, was considered.
But this was discounted because _Essentials_ vision is about providing an link:["_easier to use_ and _easier to manage_ Jenkins environment"].
As per the link:[targeted audience], we obviously do not want to expect _Essentials_ users to be system experts able to set up a dedicated filesystem to operate Jenkins.
And even with system expert, doing so would not make Essentials a very easy and quick to use distribution of Jenkins.

Git offers in this matter a powerful user-space tool that allows Evergreen to version,
and quickly roll back to some previous state if need be.

Git is also a very common tool nowadays for developers,
hence it makes Evergreen more accessible to contributors.

=== Why not use compatibleSinceVersion metadata

For context, a plugin can indicate a link:[`compatibleSinceVersion`] information, i.e. what is "the oldest version [...] configuration-compatible with.". For example:

* a plugin is being upgraded from version `1.4` to `1.5`
* it specifies `compatibleSinceVersion`=`1.5`

In such case, *if* this plugin wrote configuration files, this means you cannot safely roll back to the `1.4` version of the plugin.

Conversely, with the following situation:

* a plugin is being upgraded from version `1.4` to `1.5`
* `compatibleSinceVersion` is `1.4` or less, or absent.

In such case, _even_ if the plugin did write its updated configuration files on the disk, we can expect being able to safely rollback the plugin to the previous `1.4` version, _while leaving_ the configuration file content that was just updated for `1.5` version.

This situation is not specifically handled in this design.
In other words, Evergreen *will* also roll back those files.

For two reasons:

* this looks like an _optimization_.
Hence as such, this is probably premature to try and be very smart with the way the downgrade will work ;
* First, work must be done on the link:[JEP to define criteria for selecting plugins to include in Jenkins Essentials], so that there is a clear process and automated tests in place to check for correct `compatibleSinceVersion` usage.

== Backwards Compatibility

There are no backwards compatibility concerns related to this proposal.

== Security

=== Secrets

Versioning secrets should not be an issue per se, as the data snapshotting system is designed to be local to the running instance.
The Git repository data will never be pushed _outside_ by the _Essentials_ code, so no data leak is normally expected from this side.

As users may have the unfortunate idea to push that repository elsewhere, not being aware they could leak secrets, Evergreen conservatively adds `secrets/master.key` to the `.gitignore` file.

=== Man In The Middle

The main issue here is that an attacker could for instance instruct the _evergreen client_ to ignore everything (by putting `*` in `.gitignore`), hence make it impossible to roll back.

But this would mean someone was able to talk with connected instances.
So even if this is a valid concern, this is considered a larger scope issue that will be addressed through link:[JENKINS-49844].

Hence there are no *specific* security risks related to this proposal.

== Infrastructure Requirements

There are no new infrastructure requirements related to this proposal.

== Testing

We must create an image of _Essentials_ preconfigured with a complete set of representative data.

Creating/defining this data clearly requires human work, but the following checks are deemed automatable.

=== Upgrading/downgrading

Before delivering updates on real connected instances, testing must occur in at least the following scenarios:

* Apply the upgrade or downgrade, then check the instance is _running fine_
footnote:[See again <<healthcheck>>]

Ad-hoc testing tools should be developed to be able to automatically assess the health of a Jenkins Essentials instance after an upgrade or a downgrade.

Automatically giving some kind of health grade to a running instance is definitely a critical part of Jenkins Essentials.
Detailing this here is out of scope for this proposal.
This logic however, should be centralized and used in both during automated tests, and in production for the _evergreen-client_ to automatically analyze if a product instance is healthy or is not (and decide to roll back or not, for the current matter here).

Evergreen should leverage the link:[Jenkins Acceptance Test Harness project] for this purpose.

=== Leveraging Telemetry and live instances data

_Essentials_ is a link:[connected] system.
That means we are able to know exactly what versions are running in production.
This information must be used to test the *actual* possible upgrade paths.

Along the way, that also means Evergreen should continuously be able to adjust and enrich what is reported by the __Evergreen client__s from live instances to improve the associated combinations of tests we run.

== Prototype Implementation

A prototype implementation is available in the link:[jenkins-infra/evergreen] repository.

== References

* link:[JEP-300: _Jenkins Essentials_]
* link:[JEP-301: Evergreen packaging for _Jenkins Essentials_]
* Threads on the dev mailing list about this
link:[1] and
@@ -25,6 +25,9 @@
| Draft :speech_balloon:
| link:301/[JEP-301: Evergreen packaging for Jenkins Essentials]

| Draft :speech_balloon:
| link:302/[JEP-302: Evergreen snapshotting data safety system]

| Draft :speech_balloon:
| link:400/[JEP-400: Jenkins X: Jenkins for Kubernetes CD]

0 comments on commit 949cbdb

Please sign in to comment.