Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
README.adoc

README.adoc

JEP-308: Evergreen Error Telemetry API

Abstract

One critical aspect of Jenkins Evergreen is that the connected instances must be upgraded automatically in a safe way. To that end, we detect the errors that occur in these instance, and push those to a central place to participate to the assessment of the health level of a given instance, and by transitivity of a given Evergreen update level.

The current document describes what the server side API for this error logging looks like [1].

Specification

Authentication and linking log entries to an instance

It is obviously required that the instance is registered and authenticated to be allowed to push data to this endpoint, as described in JEP-303.

HTTP status code will hence be the ones defines in that specification.

Endpoint and expected format

We are taking a simplistic approach defining a single endpoint for error logging.

The /telemetry/error endpoint will simply receive a JSON message sent via a POST request, with a single log entry, containing in turn the JSON structure emitted locally as defined in JEP-304.

No other verb than POST will be accepted.

Error telemetry payload
{
  "log": {
    "version": 1,
    "timestamp": 1522840762769,
    "name": "io.jenkins.plugins.SomeTypicalClass",
    "level": "WARNING",
    "message": "the message\nand another line",
    "exception": {
      "raw": "serialized exception\n many \n many \n lines"
    }
  }
}
Registration Headers
Content-Type: application/json

When things are sent as expected, then the endpoint will answer with an HTTP-201 and the following payload:

Error Telemetry successful creation payload
{
  "status": "OK"
}
ℹ️
the log is not sent back as this often can be seen to avoid sending big logs back and forth when the client is most likely not interested by what it just sent.

Maximum message size

The maximum size of a the global JSON message is 1 million characters (~1 MB). This should be enough to allow receiving enough information when a stack trace is included, and defining a maximum size seems critical to operate the service. Hence the sender is expected to truncate the message if it is larger than this.

If the received message exceeds that maximum, the HTTP status code will be an HTTP-413.

Error Telemetry failed creation payload
{
  "status": "ERROR"
}

Malformed data

The client will receive a more generic HTTP-400. An optional message field can be provided by the server in the payload to help understand what is wrong.

Error Telemetry failed creation payload
{
  "status": "ERROR",
  "message": "(optional) message to explain what made the server reject this message."
}

Meta Error Logging

When the server rejects data, be it because of exceeded message size, or malformed message, it should reject it in some non silent way. The message can be dropped, but the fact there was a rejection shall be logged on the server side.

This is important to detect attacks, but also more simply bugs that might have made the reporting system broken and we need to fix expeditely.

Rate limiting

We may define in the future the use of rate limiting. In that case, the server will send an HTTP-429.

If so, the client is expected to retry later (the exact meaning of later will be clarified if we decide to go that path).

Motivation

There is no existing code base or process for this feature.

Reasoning

A dedicated /telemetry/error endpoint only for error logging

Despite we will define in the future endpoints for reporting other telemetry types, like metrics telemetry, for instance like Pipeline related metrics, we are defining a dedicated entrypoint for error logging, and will define others for other types.

We are not using the same endpoint, for instance using a type field as those different Telemetry communications are very likely to be very different, and it will make this easier to define router-level rules if needed.

Reliability concerns

Though the service is expected to be always available, the client should be designed to handle a temporary unavailability.

Send logs one by one

For the current design, the client will use a single POST HTTP request for each log entry to send. We expect that the number of error or warning logs emitted from the Jenkins instance to be rare (i.e. less than a few dozens per day).

So, at that stage of the project, we keep things simple. If it proves wrong, we will be able to evolve the API to accept for instance either log as currently, or logs to directly accept an array of multiple logs in one go.

Backwards Compatibility

As the log field is somehow an opaque blob content, the compatibility concerns are more the same as defined in the JEP 304 logging format section. But as also discussed there, using the version field of the message should be enough to accomadate any schema evolution.

Security

There are no security risks related to this proposal.

Infrastructure Requirements

That service will need to be integrated and operated in the current Jenkins Infrastructure.

This will most likely be integrated with the existing setup for error logging, but that aspect will need more prototyping to make this clearer.

Testing

Rejecting bad data

We must check that the backend does reject exceedingly big messages, or malformed logs.

Load Testing

The system must be tested against a reasonable amount of data, by evaluating the expected volume in 3 to 6 months that the service is likely to receive. This should especially be done by sending the right amount in number, but also in sizes (mimicking clients that would be sending a lot of stack traces for example).

Store UUID with logs

It is critical to the quality of the telemetry system to be able to find and remove some logs originating from a rogue instance. Be it because it is controlled by an attacker, or for any other valid reasons.

So, though not a pure API contract concern, it is important that the API stores a way to link back a log entry to its origin.

It is recommended to store the UUID, so that the log can be linked back to not only a given instance, but a period of time where that instance was connected.


1. Basically sending the Jenkins logs defined in the JEP-304
You can’t perform that action at this time.