[JENKINS-49406] Evergreen snapshotting data safety system JEP

batmat committed Mar 8, 2018
1 parent 701cf86 commit 6773edbc06488de4c2fa7371f54c79df38672861
Showing with 324 additions and 0 deletions.
  1. +324 −0 jep/0000/README.adoc
@@ -0,0 +1,324 @@
= JEP-0000: Evergreen snapshotting data safety system
:toc: preamble
:toclevels: 3
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

.Metadata
[cols="2"]
|===
| JEP
| 0000

| Title
| Evergreen snapshotting data safety system

| Sponsor
| https://github.com/batmat

// Use the script `set-jep-status <jep-number> <status>` to update the status.
| Status
| Not Submitted :information_source:

| Type
| Standards

| Created
| 2018-03-21
//
//
// Uncomment if there is an associated placeholder JIRA issue.
| JIRA
| https://issues.jenkins-ci.org/browse/JENKINS-49406[JENKINS-49406]
//
//
// Uncomment if there will be a BDFL delegate for this JEP.
//| BDFL-Delegate
//| :bulb: Link to github user page :bulb:
//
//
// Uncomment if discussion will occur in forum other than jenkinsci-dev@ mailing list.
//| Discussions-To
//| :bulb: Link to where discussion and final status announcement will occur :bulb:
//
//
// Uncomment if this JEP depends on one or more other JEPs.
| Requires
| JEP-300, JEP-301
//
//
// Uncomment and fill if this JEP is rendered obsolete by a later JEP
//| Superseded-By
//| :bulb: JEP-NUMBER :bulb:
//
//
// Uncomment when this JEP status is set to Accepted, Rejected or Withdrawn.
//| Resolution
//| :bulb: Link to relevant post in the jenkinsci-dev@ mailing list archives :bulb:

|===

== Abstract

link:https://github.com/jenkinsci/jep/tree/master/jep/300[Jenkins Essentials], and more specifically its link:https://github.com/jenkinsci/jep/tree/master/jep/301[Evergreen] _component_, aims to provide an link:https://github.com/jenkinsci/jep/tree/master/jep/300#auto-update[automatically updating distribution] of Jenkins.

Continuous Delivery is about making small incremental changes, which makes failures much easier to recover from. In this context, it means we need to be able to seamlessly upgrade Jenkins, _but also_ roll back to the previously running version if an upgrade goes wrong.
As Jenkins does not support downgrading on its own, we need to introduce a system that provides this auto-downgrade capability.

This document outlines the design of the snapshotting system we plan to use.

== Specification

_Evergreen_ works with two main components: Jenkins itself, and what we call the evergreen-client.

=== Upgrading & downgrading

Once an evergreen client has been instructed to perform an upgrade, it is responsible for the following operations:

1. (If needed) Initialize the git repository.
2. Stop Jenkins
3. Take a snapshot (see <<snapshot>> below)
4. Perform the instructed upgrade to the given Essentials BOM
footnote:[Bill Of Materials: this format is currently being designed, but will list everything constituting a version of Essentials: WAR and exact versions of all plugins]
5. Start the new Jenkins version and check the state of Jenkins (see <<healthcheck>> below).
6. If a rollback is decided:
.. Take a snapshot footnote:[this way, if new files were created, we don't just delete them in an unrecoverable way when going back to the previous snapshot].
.. Roll back to the previous data snapshot and Essentials BOM version.
Doing so, we will create an actual new commit using `revert` (i.e. avoid `git reset --hard HEAD~`), so that `git log` keeps a durable record of every state the instance went through.
.. Start (previous) Jenkins version
// what if starting the previous version doesn't work either?
7. Report the outcome to the Essentials backend.
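
The rollback in step 6 can be sketched in shell as follows. This is only an illustration under stated assumptions: the function name `rollback`, the committer identity, and the idea of passing the pre-upgrade snapshot as a tag or commit are hypothetical and not defined by this proposal.

```shell
#!/bin/sh
# Assumption: a fixed committer identity for commits made by the client.
GIT_ID="-c user.name=evergreen-client -c user.email=evergreen@localhost"

# rollback <jenkins_home> <good>
# <good> is the tag or commit of the snapshot taken before the upgrade.
rollback() {
    home=$1; good=$2
    # Step 6a: snapshot whatever the new version wrote, so nothing is lost.
    git -C "$home" add --all || return 1
    if ! git -C "$home" diff --cached --quiet; then
        git -C "$home" $GIT_ID commit --quiet \
            -m "[Rollback] State before returning to $good" || return 1
    fi
    # Nothing to undo when no commit was made since the good snapshot.
    if [ "$(git -C "$home" rev-parse HEAD)" = \
         "$(git -C "$home" rev-parse "$good^{commit}")" ]; then
        return 0
    fi
    # Step 6b: add revert commits instead of rewriting history.
    git -C "$home" $GIT_ID revert --no-edit "$good..HEAD"
}
```

After `rollback /var/jenkins/home some-tag`, the working tree matches the tagged snapshot again, while `git log` still shows the state that was abandoned.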

[[snapshot]]
==== Take a snapshot

We use `git` for this purpose.
The intent is that every file that is not ignored gets staged and committed.
In other words, `git status` just after a snapshot should report `nothing to commit, working tree clean`.

We will *not* snapshot data that is too big or does not make sense to store.
We will configure Jenkins in a way that physically separates as much as possible the things that must be snapshotted, and things that will not (see <<data_segregation>> for details).

////
I think .gitignore content must be designed to be able to evolve over time.
To allow more flexibility, I think the content should be associated between an essentials release/bom to a given .gitignore content.
////

. Update `.gitignore` content with current Essentials release.
. `git add --all`
. `git commit -m '[Upgrade] From BOM x.y.z to a.b.c'`
+
[NOTE]
====
The commit log should ideally be made understandable for humans.
We will use tags to be able to revert/switch between snapshots in a programmatically reliable way.
Each tag name should be designed so that it is clear and easy to link it to a given version of Essentials.
////
We need to finish up the work on the BOM to be more precise here.
////
====
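
As an illustration, the snapshot steps could be wrapped in a small shell function. This is a sketch, not the client's implementation: the function name `take_snapshot`, the committer identity, and the `bom-<version>` tag scheme are assumptions made for the example, and updating `.gitignore` is left out.

```shell
#!/bin/sh
# Assumption: a fixed committer identity for commits made by the client.
GIT_ID="-c user.name=evergreen-client -c user.email=evergreen@localhost"

# take_snapshot <jenkins_home> <from_bom> <to_bom>
take_snapshot() {
    home=$1; from=$2; to=$3
    # Initialize the repository on first use.
    [ -d "$home/.git" ] || git -C "$home" init --quiet || return 1
    # Stage every file that .gitignore does not exclude, including deletions.
    git -C "$home" add --all || return 1
    # --allow-empty records a commit per upgrade even when nothing changed.
    git -C "$home" $GIT_ID commit --quiet --allow-empty \
        -m "[Upgrade] From BOM $from to $to" || return 1
    # Hypothetical tag scheme: the snapshot holds the data of the old BOM.
    # -f moves the tag if a snapshot for that BOM already exists.
    git -C "$home" tag -f "bom-$from" >/dev/null
}
```

Calling `take_snapshot /var/jenkins/home 1.0.0 1.1.0` leaves `git status` clean and makes the pre-upgrade state reachable through the `bom-1.0.0` tag.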

[[data_segregation]]
===== Segregate configuration from binaries, variable build data, logs...

We will configure Jenkins to use non-default values to better separate the more static and critically important things that must be snapshotted, like the configuration for jobs or for Jenkins itself, from the more fluctuating data like build data, workspaces, exploded content for WAR or plugins, etc.

To keep things simple, we will keep a single Docker volume, but introduce an additional directory level to separate things that must be snapshotted from things that do not need to be.
Incidentally, this will help keep `.gitignore` short.

Basically, instead of the now usual `/var/jenkins_home`, we will introduce two subdirectories under `/var/jenkins` (which we will refer to as `$JENKINS_DIR`):

* `/var/jenkins/home` (=`$JENKINS_HOME`) for the part to be snapshotted,
* and `/var/jenkins/var` for the rest.

////
I'm not a 100% sure about /var/jenkins/var. We could put all those directly under /var/jenkins, but I fear it becomes a bit dirty over time.
////

Here is what it will look like:

```
/var/jenkins/
├── home
│   ├── jobs
│   │   └── the_job # configuration file only
│   ├── nodes
│   ├── plugins
│   ├── secrets
│   ├── updates
│   ├── userContent
│   └── users
└── var
    ├── logs # JENKINS-50291
    │   └── tasks
    ├── plugins # exploded plugins, using --pluginroot switch
    ├── jobs # JENKINS-50164
    │   └── the_job
    │       ├── builds
    │       └── workspace
    └── war # using --webroot
        ├── META-INF
        ├── WEB-INF
        └── ...
```
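
To make the layout concrete, here is a hedged shell sketch that creates the top-level directories and prints a matching launch command. Only `JENKINS_HOME`, `--webroot` and `--pluginroot` come from the text above; the function names and the WAR path are hypothetical, and relocating logs and build data depends on JENKINS-50291 and JENKINS-50164, so it is not shown.

```shell
#!/bin/sh
# prepare_layout <jenkins_dir>: create the snapshotted and non-snapshotted areas.
prepare_layout() {
    mkdir -p "$1/home" "$1/var/logs" "$1/var/plugins" "$1/var/jobs" "$1/var/war"
}

# launch_command <jenkins_dir> <path to jenkins.war>
# Prints (does not run) a command line pointing Jenkins at that layout.
launch_command() {
    printf 'JENKINS_HOME=%s java -jar %s --webroot=%s --pluginroot=%s\n' \
        "$1/home" "$2" "$1/var/war" "$1/var/plugins"
}
```

For example, `launch_command /var/jenkins /usr/share/jenkins/jenkins.war` prints a command whose exploded WAR and exploded plugins both land under `/var/jenkins/var`.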

===== What to back up

Thanks to the data segregation explained above, we will be snapshotting (almost) everything under `/var/jenkins/home`.

We still need to have a `.gitignore` file for some things that either cannot be moved elsewhere, or that we do not want to store in the Git repository.
As said above, this will likely be improved as we go.

[source,gitignore,title=.gitignore]
----
plugins/
updates/
secrets/master.key
----
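
The effect of these rules can be checked with `git check-ignore`. The sketch below is illustrative only: the helper names `write_gitignore` and `is_snapshotted` are hypothetical and not part of the evergreen-client.

```shell
#!/bin/sh
# write_gitignore <jenkins_home>: (re)write the ignore rules listed above.
write_gitignore() {
    cat > "$1/.gitignore" <<'EOF'
plugins/
updates/
secrets/master.key
EOF
}

# is_snapshotted <jenkins_home> <path relative to it>
# Succeeds when git would include the path in a snapshot.
is_snapshotted() {
    if git -C "$1" check-ignore --quiet "$2"; then
        return 1        # ignored: left out of snapshots
    else
        [ $? -eq 1 ]    # 1 means "not ignored"; anything else is a git error
    fi
}
```

With these rules, `jobs/the_job/config.xml` is snapshotted while `plugins/git.hpi` and `secrets/master.key` are not.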

====== About `/var/jenkins/home/plugins`

This directory contains the hpi/jpi files before extraction.
Ideally, we would move these files elsewhere, under `/var/jenkins/var/plugins`, but this is not currently possible (`--pluginroot` only configures a different location for exploded plugins).

[[healthcheck]]
=== Checking Jenkins health

From the perspective of this proposal, this is out of scope.
But the outer _controller_ of the upgrade, the evergreen client, will need a way to decide if a rollback must be triggered or not.

For reference, the dedicated JIRA issue for this is link:https://issues.jenkins-ci.org/browse/JENKINS-50294[JENKINS-50294].

== Motivation

Jenkins has never supported downgrading by itself, and it's unlikely the core constructs will change in this regard anytime soon.
The official way to revert an upgrade if something went wrong is to restore a previous backup.

In the context of _Essentials_, we cannot rely on external backups to revert to the _N-1_ version: this would require some manual user intervention, which is clearly not the user experience _Essentials_ wants to provide.

== Reasoning

=== Scope of the data snapshotting: not a backup system

This system is designed to be used in the seconds or minutes that follow an upgrade.
If Jenkins, after it has been restarted, is deemed unhealthy, then an auto-rollback _can_ be initiated.

If a version proves to be problematic only after a few days, the data snapshotting system will **not** be used.

This would be quite impractical because the instance probably generated actual work items during this timeframe.
So rolling back that much later would risk data loss.

The way we will correct things discovered later will instead be by delivering a new version of Jenkins core or the problematic plugin to fix the issue, thereby leveraging the main goal of _Jenkins Essentials_ to make upgrades seamless.

=== Why Git

Using filesystem-level tools offering a snapshotting feature, like LVM, ZFS or btrfs to give a few examples, was considered.
This was discounted because the _Essentials_ vision is about providing an link:https://github.com/jenkinsci/jep/tree/71d9391744c8cc7d6595805f7fdd327eedf6811a/jep/300#automatically-updated-distribution["_easier to use_ and _easier to manage_ Jenkins environment"].
As per the link:https://github.com/jenkinsci/jep/tree/71d9391744c8cc7d6595805f7fdd327eedf6811a/jep/300#target-audience[targeted audience], we do not want to expect _Essentials_ users to be system experts able to set up a dedicated filesystem to operate Jenkins.
Even for system experts, such a requirement would not make Essentials an easy and quick to use distribution of Jenkins.

Git offers in this matter a powerful user-space tool that allows us to version,
and quickly roll back to some previous state if need be.

Git is also a very common tool nowadays for developers,
hence it will help make Essentials more accessible to contributors.

=== Why not use compatibleSinceVersion metadata

[TIP]
====
A given plugin can declare link:https://jenkinsci.github.io/maven-hpi-plugin/hpi-mojo.html#compatibleSinceVersion[`compatibleSinceVersion`] metadata, i.e. what is "the oldest version [...] configuration-compatible with.". For example:

* a plugin is being upgraded from version `1.4` to `1.5`
* it specifies `compatibleSinceVersion`=`1.5`

In such a case, *if* this plugin wrote configuration files, you cannot safely roll back to the `1.4` version of the plugin.
====

Conversely, with the following situation:

* a plugin is being upgraded from version `1.4` to `1.5`
* `compatibleSinceVersion` is `1.4` or less, or absent.

In such a case, _even_ if the plugin did write its updated configuration files to disk, we can expect to be able to safely roll back the plugin to the previous `1.4` version, _while leaving_ the configuration file content that was just updated for the `1.5` version.

We decided to not specifically handle this situation for now.
In other words, we *will* also roll back those files.

We made this choice for two reasons:

* this looks like an _optimization_.
As such, it is probably premature to try to be very smart about the way the downgrade works;
* we need to first work on the link:https://issues.jenkins-ci.org/browse/JENKINS-49806[JEP to define criteria for selecting plugins to include in Jenkins Essentials], so that we have clear process and automated tests in place to check for correct `compatibleSinceVersion` usage.
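
For reference, the `compatibleSinceVersion` rule discussed in this section, which this proposal deliberately does *not* implement, could be expressed as the following shell sketch. The function names are hypothetical, and the comparison assumes plain dotted numeric versions and a `sort` that supports `-V` (as GNU coreutils does).

```shell
#!/bin/sh
# version_le <a> <b>: succeeds when version a is lower than or equal to b.
version_le() {
    [ "$1" = "$2" ] && return 0
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# config_survives_downgrade <target_version> [compatible_since]
# Succeeds when configuration written by the newer plugin version is expected
# to remain readable by <target_version>, i.e. compatibleSinceVersion is
# absent or not newer than the version we would downgrade to.
config_survives_downgrade() {
    target=$1; since=${2:-}
    [ -z "$since" ] && return 0
    version_le "$since" "$target"
}
```

With the examples above, `config_survives_downgrade 1.4 1.5` fails (the configuration must be rolled back too), while `config_survives_downgrade 1.4 1.4` and `config_survives_downgrade 1.4` succeed.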

== Backwards Compatibility

There are no backwards compatibility concerns related to this proposal.

== Security

=== Secrets

Versioning secrets should not be an issue per se, as the data snapshotting system is designed to be local to the running instance.
In other words, the Git repository data will never be pushed _outside_ by the _Essentials_ code, so no data leak is normally expected from this side.

But as users may have the unfortunate idea to push that repository elsewhere, not being aware they are leaking all secrets, we will conservatively add `secrets/master.key` to the `.gitignore` file.

=== Man In The Middle

The main issue here is that an attacker could for instance instruct the evergreen client to ignore everything (by putting `*` in `.gitignore`), hence make it impossible to roll back.

But this would mean someone was able to talk with connected instances.
So even if this is a valid concern, this is considered a larger scope issue that will be addressed through link:https://issues.jenkins-ci.org/browse/JENKINS-49844[JENKINS-49844].

Hence there are no *specific* security risks related to this proposal.

== Infrastructure Requirements

There are no new infrastructure requirements related to this proposal.

== Testing

We must create an image of _Essentials_ preconfigured with a complete set of representative data.

Creating/defining this data clearly requires human work, but the following checks are deemed automatable.

=== Upgrading/downgrading

Before delivering updates on real connected instances, we must test at least the following scenarios.

* Apply the upgrade or downgrade, then check the instance is _running fine_
footnote:[See again <<healthcheck>>]

We will need to develop ad-hoc testing tools to be able to automatically assess the health of a Jenkins Essentials instance after an upgrade or a downgrade.

Automatically giving some kind of health grade to a running instance is definitely a critical part of Jenkins Essentials.
Detailing this here is out of scope for this proposal.
It is however highly desirable that we centralize this logic and use it both during automated tests and in production, where the evergreen-client needs to automatically determine whether an instance is healthy (and, for the purposes of this proposal, whether to roll back).

We will leverage the link:https://github.com/jenkinsci/acceptance-test-harness[Jenkins Acceptance Test Harness project] for this purpose.

=== Leveraging Telemetry and live instances data

_Essentials_ is a link:https://github.com/jenkinsci/jep/tree/master/jep/300#connected[connected] system.
That means we are able to know exactly what versions are running in production.
We will leverage this to test the *actual* possible upgrade paths.

Along the way, that also means we will continuously be able to adjust and enrich what is reported by the __Evergreen client__s from live instances to improve the associated combinations of tests we run.

== Prototype Implementation

This will be implemented in https://github.com/jenkins-infra/evergreen.

== References

* link:https://github.com/jenkinsci/jep/tree/master/jep/300[JEP-300: _Jenkins Essentials_]
* link:https://github.com/jenkinsci/jep/tree/master/jep/301[JEP-301: Evergreen packaging for _Jenkins Essentials_]
* Threads on the dev mailing list about this
link:https://groups.google.com/d/msg/jenkinsci-dev/XdXuMFLXKPw/GM9T-jGbAgAJ[1] and
link:https://groups.google.com/d/msg/jenkinsci-dev/xiaHpfGPTZ8/ifABXq7yAgAJ[2]
