JEP-306: Evergreen Instance Client Health Checking

Abstract

The first pillar of Jenkins Evergreen is that it is an Automatically Updated Distribution.

To achieve this goal in a durable way, we need to automatically assess the health of a given instance. The scope of this proposal is to design how we decide whether or not to roll back automatically.

The health status will also be reported regularly to the backend so that we can compute global health statistics for a given setup, but that is out of scope for the current document.

Specification

We expect the health-checking process to evolve as we learn, but because the local healthcheck is a critical part of the overall Evergreen story, we are starting small on purpose. Once we deem that we have learned enough, we will create new proposals to discuss and document the new checks we want to add.

We will check two URLs:

  • the /instance-identity/ page

  • the /metrics/evergreen/healthcheck URL

Instance Identity URL

We check that:

  • it is reachable,

  • and returns a 200 HTTP status code.
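
The two conditions above can be sketched as a small probe. This is only an illustration, not the evergreen-client implementation: the function name, the `base_url` parameter, the timeout value, and the injectable `opener` are all assumptions made for the example.

```python
import urllib.error
import urllib.request


def instance_identity_ok(base_url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True when <base_url>/instance-identity/ is reachable and answers 200.

    `base_url`, `timeout` and `opener` are hypothetical parameters; `opener`
    exists only so the probe can be exercised without a running Jenkins.
    """
    url = base_url.rstrip("/") + "/instance-identity/"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Unreachable (connection refused, DNS failure, timeout) or an
        # HTTP error status raised by urlopen: both count as a failed check.
        return False
```

Treating every network error as a failed check keeps the probe's answer binary, which is what a rollback decision needs.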

/metrics/evergreen/healthcheck URL

We configure the Metrics Jenkins plugin to provide a healthcheck under the specified URL. The returned format, prettified, is the following:

/metrics/evergreen/healthcheck URL output
{
  "disk-space": {
    "healthy": true
  },
  "plugins": {
    "healthy": true,
    "message": "No failed plugins"
  },
  "temporary-space": {
    "healthy": true
  },
  "thread-deadlock": {
    "healthy": true
  }
}

From this URL, we check that:

  • it returns a 200 HTTP status code

  • On the produced JSON

    • it is valid JSON

    • plugins.healthy attribute is true

    • thread-deadlock.healthy attribute is true

We are deliberately not checking the space-related attributes, at least for now. The rationale is that the upgrade to a new Evergreen BOM [1] could consume a bit more disk space and trigger a disk space warning. We probably do not want to revert an entire upgrade because of this.
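
As a minimal sketch of the checks listed above, the following function evaluates a healthcheck response given its status code and raw body. The function and constant names are hypothetical, and treating a missing `plugins` or `thread-deadlock` entry as a failure is an assumption of this example, not something the proposal specifies.

```python
import json

# Only these entries influence the verdict. "disk-space" and
# "temporary-space" are deliberately left out, as explained above.
REQUIRED_CHECKS = ("plugins", "thread-deadlock")


def healthcheck_passes(status_code, body):
    """Apply the healthcheck rules to an HTTP status code and response body."""
    if status_code != 200:
        return False
    try:
        report = json.loads(body)
    except (TypeError, ValueError):
        # Not valid JSON (json.JSONDecodeError is a ValueError subclass).
        return False
    if not isinstance(report, dict):
        return False
    for name in REQUIRED_CHECKS:
        entry = report.get(name)
        # Assumption: a missing or malformed entry counts as unhealthy.
        if not isinstance(entry, dict) or entry.get("healthy") is not True:
            return False
    return True
```

With this shape, a response reporting `"disk-space": {"healthy": false}` still passes as long as the two required entries are healthy.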

Absence of the metrics plugin

Making this plugin part of the healthchecking story obviously makes it a required plugin, so the evergreen-client should make sure it is always present and active when upgrading. For instance, if it is disabled or removed from the disk, it must be forcibly reinstalled and enabled automatically the next time.

If, for some reason, the plugin fails to start, the healthcheck should fall back to checking only the /instance-identity/ URL, and report this issue as critical to the backend.
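
The fallback rule can be sketched as a pure decision function. How the client detects that the plugin failed to start is not specified here, so `metrics_plugin_started` is a hypothetical input, as are the `Assessment` type and the wording of the reported issue.

```python
from typing import NamedTuple, Optional


class Assessment(NamedTuple):
    healthy: bool
    # Hypothetical field: message to report to the backend as critical.
    critical_issue: Optional[str]


def assess_instance(identity_ok: bool,
                    metrics_plugin_started: bool,
                    healthcheck_ok: bool) -> Assessment:
    """Combine both checks, falling back when the Metrics plugin is down."""
    if not metrics_plugin_started:
        # Fallback mode: only /instance-identity/ decides the verdict, and
        # the plugin failure itself is flagged as a critical issue.
        return Assessment(identity_ok, "Metrics plugin failed to start")
    return Assessment(identity_ok and healthcheck_ok, None)
```

Keeping the decision free of I/O makes the fallback path easy to cover in tests without a running instance.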

Metrics plugin Configuration

The plugin is configured using the Configuration As Code Jenkins plugin, using the following syntax:

Evergreen Configuration-as-code file
---
jenkins:
  # [snip other configurations]
  metricsaccesskey:
    accessKeys:
      - key:            "evergreen"
        description:    "Key for evergreen health-check"
        canHealthCheck: true
        canPing:        false
        canThreadDump:  false
        canMetrics:     false
        origins:        "*"

Motivation

Nothing currently exists in this area.

Reasoning

Why not leverage the error logging

JEP-304, Evergreen Client Error Telemetry Logging, describes how the Jenkins instance publishes its error logs.

We are not going to use those logs for now, for the reason stated previously: we do not think we know enough yet about how to use them correctly. So we are taking a careful path here. In any case, those logs are going to be sent to the backend as one of the data points for assessing the quality of given releases.

Over time, once we have a better idea of what they typically look like and how to use them, it is likely that we will write a new proposal to enrich the healthchecking process of the evergreen-client.

Backwards Compatibility

There are no backwards compatibility concerns related to this proposal.

Security

Accessing the /metrics/evergreen/healthcheck URL from outside the container

Making this URL accessible to anyone who can already reach the server is probably not a problematic data leak. Still, we plan to use the origins field to restrict requesters to localhost, so that only the evergreen-client can access it.

🔥
It seems that this field is actually not designed for source IP filtering. If so, we will either add this feature to the Metrics plugin or adjust the proposal to confirm the statement above: that we do not deem it critical for security that this URL is accessible from outside the container.

Using evergreen as the metrics access key

Normally, a metrics plugin healthcheck URL is of the format SERVER/metrics/<access-key>/healthcheck.

We set the access key value to evergreen for clarity and simplicity: this makes it unnecessary to write logic that initializes a random access key and has the client store it or retrieve it from somewhere.

Once access to the healthcheck endpoint is restricted to localhost only, this is deemed to no longer be an issue.
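
To illustrate the URL format described above, here is a small helper that builds the healthcheck URL from a server address and an access key. The helper name and the `server` parameter are hypothetical; only the path layout and the fixed `evergreen` key come from this proposal.

```python
from urllib.parse import quote


def healthcheck_url(server, access_key="evergreen"):
    """Build SERVER/metrics/<access-key>/healthcheck for the given server."""
    # quote() guards against keys containing characters unsafe in a path;
    # the fixed "evergreen" key passes through unchanged.
    return f"{server.rstrip('/')}/metrics/{quote(access_key, safe='')}/healthcheck"
```

Because the key is fixed, the client can compute this URL without any shared state or key lookup.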

Absence of the metrics plugin

An attacker could try to make the plugin fail, for instance through a badly implemented extension.

If this ends up making the plugin fail to start, the evergreen-client should detect it and fall back to the simpler mode where only the /instance-identity/ URL is checked.

Infrastructure Requirements

There are no new infrastructure requirements related to this proposal.

Testing

This component, and any change to it, should be tested very aggressively: if broken, it could trigger unneeded rollbacks in production, or worse.

In particular, there should be a test case that checks the behaviour in the absence of the Metrics plugin, or more generally with failed plugins.

Prototype Implementation

References

❗️

When moving this JEP from a Draft to "Accepted" or "Final" state, include links to the pull requests and mailing list discussions which were involved in the process.


1. Bill Of Materials: the configuration file describing what an Evergreen release is made of: what exact WAR version, which plugins, etc.