JEP-210: External log storage for Pipeline
External log storage for Pipeline
JEP-206: UTF-8 for Pipeline logs
- Backwards Compatibility
- Infrastructure Requirements
- Prototype Implementation
This project overhauls the storage of logs for Jenkins Pipeline builds. (The log can be seen from the Console link, but also from Blue Ocean, Pipeline Steps, and REST and CLI.)
There are four stages of changes:
Minimize Remoting channel usage for durable task (
Replace aggregation of per-step log files with a single log stream annotated by step ID.
Make the traditional Console view use semantic markup of steps and CSS.
Permit the entire log to be sent to a pluggable external service: in the reference implementation, Amazon CloudWatch Logs via Fluentd.
#1 is implemented as a set of independently releasable branches. (It is tracked by JENKINS-52165.) #2, #3, and #4 constitute the complete change, and depend on #1 as well as JEP-206 for use of UTF-8 in Pipeline log files.
The scope does not include other project types (freestyle), or other types of logging (system, branch indexing). JEP-207 covers at least freestyle build logs, and is expected to be integrated with these changes via this unsubmitted JEP.
Together, the effect is that the
TaskListener associated with the overall build,
as well as with each individual step,
is potentially a “remotable” object which can autonomously stream log messages
directly from an agent computer to a cloud log service.
If and when a Jenkins-based UI needs to observe build or step logs,
the Jenkins master will retrieve messages from that service,
and either filter them by step,
or collate them by timestamp into a whole-build log annotated with step metadata.
Invert control of durable task polling
When running durable tasks such as
sh on an agent,
which involves “tailing” an
output.txt file in a hidden directory,
the agent process begins checking for growth in the file at short recurring intervals.
Any new content is streamed to the master.
By using a
last-location.txt file to record the last-observed file length,
the state is persistent, so that log streaming can resume after a restart of the agent process
(including a restart of the Jenkins master, which forces agents to restart as well).
The principal new API here, in
This takes a newly defined interface
capable of receiving text content and an exit event.
DurableTaskStep calls the new APIs when available
(which it will be for all standard steps),
otherwise falling back to the old polling mode.
The step will still poll at infrequent intervals
in case the agent became unresponsive and the step needs to abort.
Handler implementation serializes the
TaskListener coming from the
allowing an external log sink to bypass all master communications.
Replace aggregation with single log stream
There is a default
FileLogStorage implementation based on the traditional
log file inside the build directory.
All log messages associated with the build are sent to the
BuildListener which writes to this file.
Steps which call
getContext().get(TaskListener.class) for logging
trigger addition of a
LogStorageAction (a stateless marker) to the
this supplies a
TaskListener which tracks the
FlowNode.id (typically an integer)
used at certain byte offsets in the log, in a separate
1597 7 1985 2285 8 2686 2983 10 3034 4245 14 4287 10 4296 14 4328 10 4337 14 4360 10 4369 14 4392 10 4401 14 4424 5318 10 5329 14 7537
Conversely, calls to
LogAction.getLogText() produce an implementation
which concatenates segments of the
log for text corresponding to that step.
Console annotations (which are ignored by Blue Ocean)
continue to be written to the log file
using serialized, signed, and Base64-encoded
Semantic markup in HTML console
Every sequence of consecutive lines in the “classic UI” HTML output (
produced by the same step (technically,
FlowNode.id, which is a little more specific as it tracks body blocks)
is marked with an HTML
<span> identifying that ID.
(The absence of such a span indicates log lines not produced by any node.)
<span class="pipeline-node-12">+ whoami jglick </span>
LogStorage.overallLog such as in
FileLogStorage must call
NewNodeConsoleNote is used to print the introduction of a new node into the log, typically displayed like
with variants for block-scoped steps.
The generated HTML includes another
<span> which encodes not just the node ID,
but also the start node ID (in the case of a block end node),
the enclosing (start) node ID,
any the value of any
which, together with both kinds of spans,
implements a richer GUI in the build log view.
In particular, (show) and (hide) links are displayed
which allow output from individual steps or entire blocks
to be selectively displayed.
Pluggable log storage
By default the traditional
log file is used for build logs.
Plugins may override this storage in both the read and write aspects.
LogStorageFactory is the entry point for such an override;
currently a plugin may pick builds to provide storage for,
but this decision is not persisted
(pending work in JEP-207).
On the write side, the plugin is able to supply a custom
This interface was already defined by Jenkins core to be remotable.
An implementation which streams to external storage
therefore needs only to ensure that all fields are truly serializable
and that the code to connect to a storage service can be run on a remote node.
There is a variant which records a
The reference implementation creates JSON-format records to be sent to Fluentd.
On the read side, the plugin can supply an
for either the build as a whole or one node.
(While this Jenkins core interface supports HTML rendering for the “classic” UI,
it is also responsible for generating plain-text content as consumed by Blue Ocean.)
The reference implementation makes API calls to CloudWatch Logs to serve content
based on JSON filter patterns to select messages by build and optionally node.
A subtle issue is the use of
LargeText.isCompleted by UI callers,
which determines whether a given log is considered finalized,
in which case no further “AJAX” calls need be made to fetch subsequent content.
Yet Fluentd does not guarantee that a given record
has been received by CloudWatch Logs when the log event is sent,
and in the standard configuration in fact delays log flushes up to a second,
so without any special effort a build log would sometimes stop refreshing before the end.
This is solved with a utility class
(which could if necessary be pushed into a lower layer)
which records the last (master-side) log message sent for a given scope
and declines to mark the text block as completed
unless the last timestamp observed in CloudWatch Logs
matches the last-delivered timestamp.
Another feature of the reference implementation is to store
ConsoleNotes separately in JSON.
This is accomplished by the
(again, potentially extractable to a shared API layer)
which keeps opaque notes (serialized, signed, and Base64-encoded)
in a separate JSON field,
so that external log viewers can access the plain text easily.
The plugin also supplies a sidebar link in builds
which jumps to a suitably constructed CloudWatch Logs search URL
displayed in the AWS Console.
The overall goal is to minimize the load placed on the Jenkins master process in the common case that the build log is written but not read (or read only via an external log browser).
Changes to durable task polling, log aggregation, and especially log pluggability contribute directly to this goal. Semantic console markup is a small extension to log aggregation.
Push vs. pull for durable task output
Historically, when running durable tasks (
sh and kin),
running output was handled by having the master send a callable to the agent at intervals,
initially short (¼s) but growing exponentially up to some maximum (15s) if the process seems to be idle,
and resetting to short again if and when fresh output is detected.
Not only is there up to a 15s delay in displaying new output,
this is wasteful of master and network resources when the process is idle for a long time;
and sending a
UserRequest and corresponding response involves a fair amount of Java serialization.
By contrast, non-durable processes (such as those created by
Launcher in a freestyle build)
RemoteOutputStream to send content from the agent to master as soon as it is available, minimizing network traffic;
Remoting is also able to optimize this kind of traffic by sending low-overhead
Chunk packets of tailored sizes.
Another minor benefit is that
DurableTaskStep no longer needs to call
StepContext.saveState every time new output is observed,
which was potentially expensive since it involves a fresh serialization to
Therefore durable tasks should switch from the policy of pulling log output to having the agent push log output. The content cannot be detected immediately, since we are effectively tailing a log file, but it can be detected quickly after the log file is updated with minimal overhead.
Log file per step vs. single log stream
The original implementation of Pipeline (then “Workflow”) used a separate log file for each step
as a rough-and-ready solution to the problem of allowing clients of the flow graph
to determine which log lines came from which steps.
To provide support for the various methods in
Run which expect to read a single
WorkflowRun.copyLogs periodically checked for new output in the step log files
and synchronized it to the master log file.
The last-read location for each active step was saved in
build.xml to provide durability.
This system had numerous flaws. Most obviously, it requires almost double the disk space.
The copying had an inherent delay,
ameliorated by eager copying at the time of step completion,
which can lead to flaky tests if care is not taken to wait for content.
parallel steps was also not interleaved in real time.
A heavily loaded system could wind up consuming considerable CPU and IOPS
running copy tasks for numerous concurrent builds.
Not only did many small step log files need to be read frequently,
but in the safest durability modes every update forced a new
which besides I/O requires Java serialization of possibly large unrelated objects.
The required synchronization also introduced bottlenecks and occasional deadlocks.
All the background tasks also sometimes consumed all available threads in an executor pool,
leading to starvation of more critical operations.
Using a single
log file and streaming all data directly there
is considerably simpler, even accounting for the need to handle step ID prefixes.
It may be less efficient at read time,
but the primary consideration is minimizing overhead at write time.
The original Pipeline log display hard-coded markup for new node notes and failed to expose any of the node association information to potential UIs. Usability issues in the “classic” log build log UI continue to be brought up by users as annoyances, despite the availability of an alternative UI in Blue Ocean.
Other behaviors, like hiding new node displays
or hiding all but the first
parallel branch initially,
could be added later or even perhaps contributed by plugins.
The key goals of the JEP are addressed by external-storage implementations: the use of appropriate long-term storage systems for critical log data; and the ability to stream content from an agent JVM without consuming bandwidth on the Remoting channel.
An earlier draft implementation exposed a simpler SPI to plugins:
they could only supply a
TaskListener for the overall build,
InputStream for the overall build content.
This SPI is effectively still available via
but it has proven inadequate for the CloudWatch Logs implementation at least.
Most obviously, the
InputStream interface forces the implementation to serve a complete build log
even when the text for only a single step (~
FlowNode.id) has been requested.
The CloudWatch Logs implementation can do better by using a server-side filter.
This avoids any need for the
¦ separator used by
More subtly, the
InputStream interface lacked any room for indicating
that the build content was incomplete.
Annotating lines by step using
For the default filesystem-based storage,
originally it was attempted to use a special
ConsoleNote to mark which step produced a given line.
This worked, but resulted in unacceptably bloated raw logs:
the serialized form, after GZIPping and Base64-encoding, was over 200 characters per line;
and rendered typical raw logs more or less unreadable.
¦ separator (later used only by
StreamLogStorage) was introduced,
as it adds minimal space overhead and does not interfere with legibility.
Some aspects of the implementation would be easier given certain API changes in Jenkins core (or Stapler).
ConsoleAnnotators could be replaced by a proper API;
AnnotatedLargeText methods could be better designed for subclassing;
and some overrides in
WorkflowRun would make sense pulled up into
For now, these considerations were outweighed by the convenience of running on stock versions of Jenkins LTS.
When the synchronous
Launcher interface is used to start non-durable remote processes,
as happens for example from typical
SCM implementations delegating to a command-line tool,
historically the remotability of any supplied
TaskListener is ignored
and all log lines are sent over the Remoting channel to be processed on the master side:
This was fixed as a simple patch to
which would also benefit JEP-207 by removing any need to use
DecoratedLaunchers for freestyle build steps.
A related change could perhaps also fix encoding issues with such synchronous steps for JEP-206.
TaskListener was long ago designed to be remotable,
StreamTaskListener in fact handled that by using
until now it was not noticeable that
Launcher fails to remote the listener
since the effect is the same if the instance is in fact a
Explicit log mirroring
Some existing plugins such as Logstash or AWS CloudWatch Logs Publisher support redirecting or mirroring log messages to cloud services. To the extent that these are even compatible with Pipeline, they nonetheless suffer from fundamental limitations compared to the approach in this JEP: job configurers may have to opt-in to the publishing; log messages may still be kept on disk in the Jenkins master; the existing Jenkins UI gestures to display logs do not pick up data from the cloud; Remoting channels are still clogged with log-related traffic.
ConsoleLogFilter implementations must be safely remotable in order to work correctly on the agent side.
Since any implementations available for use in Pipeline jobs
must already have been
Serializable (to be saved in
this is not as significant a restriction as it might at first appear.
It does mean that besides being careful about state (non-
transient instance fields),
implementations may not assume they are running inside the master JVM.
Integration testing is likely to uncover any critical problems in widely-used filters.
FlowNodes using the old
LogActionImpl will continue to serve log text from the per-step file.
This applies both to completed historical builds,
and to steps completed prior to the resume of a build which spanned the upgrade.
For the special case of a step running across the upgrade,
LogActionImpl will stream new content to the overall build log,
as well as to the per-step log.
(For that purpose, the upgrade is detected as an update to the
Historical builds using
WorkflowRunConsoleNote should continue to render logs,
but without the new semantic markup features.
ConsoleLogFilter with security-sensitive fields
(notably the password masking by the
must take into account that it will now be sent to the agent side,
where that data is vulnerable to retrieval or even manipulation by rogue builds.
In the case of
withCredentials this is not an issue,
since the agent already received these same secrets as environment variables.
Currently the Fluentd logger in the reference implementation assumes that the Fluentd server is accessible anonymously. A production-grade implementation should prevent a rogue build from writing log lines to a build of an unrelated job. This would presume some kind of Fluentd authentication plugin capable of processing generated tokens scoped to a particular JSON field, which is not yet known to exist. Alternately, logs could be sent directly to CloudWatch Logs, but this would then perhaps require the master to be able to use IAM to create temporary roles and tokens.
There are no new infrastructure requirements related to this proposal, beyond what may arise in the course of testing external log implementations based on live services such as CloudWatch Logs.
Automated functional tests verify the basic aspects of the change,
such as the fact that with a suitably remotable
sh step will in fact deliver messages to the log sink constructed on the agent side.
Functional tests for open-source, cluster-based implementations such as ElasticSearch could be run using
Tests for SaaS-based implementations such as CloudWatch would require either mocks,
and/or live tests run on restricted CI machines.
Integration testing against uncommon plugins and usage modes will be needed,
which will likely use standard mechanisms such as
and some exploratory testing is expected.
The nature of performance testing remains to be defined. The principal constraint is that the production of logs during a build should be efficient; Jenkins-based display of logs during a running build or of a completed build may involve some overhead to retrieve and collate messages, but this is assumed to be a relatively infrequent event.
The reference implementation is a
which depends on a series of Pipeline-related pull requests.