
[JENKINS-52165] Controller.watch API #60

Merged
merged 37 commits into jenkinsci:master from jglick:watch-JENKINS-38381 on Sep 13, 2018

Conversation

@jglick
Member

jglick commented Feb 9, 2018

JENKINS-52165 Just the minimum changes pulled out of #29 to enable push rather than pull logging from durable tasks.

private static ScheduledExecutorService watchService;
private synchronized static ScheduledExecutorService watchService() {
if (watchService == null) {
watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));

@svanoort

svanoort Feb 9, 2018

Member

TODO Use the new ClassLoaderSanityThreadFactory when available from core
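A JDK-only sketch of the kind of wrapper that TODO refers to (the class name and details here are hypothetical stand-ins; Jenkins core's actual ClassLoaderSanityThreadFactory may differ). The assumed idea: pin each new thread's context class loader, so pooled threads do not retain whatever loader happened to be current when the pool lazily spawned them, a common source of class-loader leaks.

```java
import java.util.concurrent.ThreadFactory;

// Hypothetical stand-in for a context-class-loader-sanitizing factory:
// wraps another ThreadFactory and overrides the context class loader
// on every thread it creates.
public class SaneContextThreadFactory implements ThreadFactory {
    private final ThreadFactory delegate;

    public SaneContextThreadFactory(ThreadFactory delegate) {
        this.delegate = delegate;
    }

    @Override public Thread newThread(Runnable r) {
        Thread t = delegate.newThread(r);
        // Pin to a known loader instead of inheriting the caller's.
        t.setContextClassLoader(SaneContextThreadFactory.class.getClassLoader());
        return t;
    }

    public static void main(String[] args) {
        Thread t = new SaneContextThreadFactory(Thread::new).newThread(() -> {});
        System.out.println(t.getContextClassLoader() == SaneContextThreadFactory.class.getClassLoader());
    }
}
```

Such a wrapper composes with the NamingThreadFactory/DaemonThreadFactory chain shown in the snippet above.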

jglick added some commits Feb 13, 2018

@jglick jglick referenced this pull request Feb 13, 2018

Closed

Watching plus Unicode #62

try (FileChannel ch = FileChannel.open(Paths.get(logFile.getRemote()), StandardOpenOption.READ)) {
CountingInputStream cis = new CountingInputStream(Channels.newInputStream(ch.position(lastLocation)));
handler.output(cis);
lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);

@jglick

jglick Feb 13, 2018

Author Member

Seems that this would not suffer from JENKINS-37575-like problems. We intentionally write lastLocationFile only after a successful call to output, so that if there is an agent channel outage we do not lose output. In such a case this Watcher simply dies; if and when the agent is reconnected, a new one should start at the same position.

jglick added some commits Mar 6, 2018

@oleg-nenashev
Member

oleg-nenashev left a comment

I do not see blocker issues here, but IMHO the code could be improved a bit. I will re-review after discussing comments with Jesse.

* @param handler a remotable callback
* @param listener a remotable destination for messages
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

IOException needs to be documented. jenkinsci/workflow-durable-task-step-plugin#65 suggests that it happens iff the node is offline, but there is no evidence of that here

@jglick

jglick Mar 27, 2018

Author Member

Yes that could be clarified.

@@ -86,7 +100,7 @@
* Intended for use after {@link #exitStatus(FilePath, Launcher)} has returned a non-null status.
* The result is undefined if {@link DurableTask#captureOutput} was not called before launch; generally an {@link IOException} will result.
* @param workspace the workspace in use
* @param launcher a way to start processes
* @param launcher a way to start processes (currently unused)

@jglick

jglick Aug 9, 2018

Author Member

(ditto)

@@ -55,7 +69,7 @@
/**
* Checks whether the task has finished.
* @param workspace the workspace in use
* @param launcher a way to start processes
* @param launcher a way to start processes (currently unused)

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

The method is overridable, and the downstream implementation may actually use it but inherit Javadocs. So I would rather not touch it

@jglick

jglick Mar 27, 2018

Author Member

The parameter is no longer used in any existing implementation and likely never will be, but it harms nothing to leave it there.

/**
* Tails a log file and watches for an exit status file.
* Must be remotable so that {@link #watch} can transfer the implementation.
*/
protected static class FileMonitoringController extends Controller {

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

SerializableOnlyOverRemoting once it gets to 2.107+ ? Not the best API, but we do not have better annotations now (should be fixed eventually once we get to any kind of static analysis)

@jglick

jglick Mar 27, 2018

Author Member

Well SerializableOnlyOverRemoting indicates that it should not be serializable except when remoted. For this purpose I would want to emphasize that it must be serializable when remoted; whether it is also serializable in other contexts is of no relevance. We do not have such a marker interface.

(For this feature I also need a way of emphasizing that a ConsoleLogFilter must be serializable over Remoting.)

/**
* Like {@link #exitStatus(FilePath, Launcher, TaskListener)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected Integer exitStatus(FilePath workspace, TaskListener listener) throws IOException, InterruptedException {

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

Add CheckForNull so nobody unboxes it to int with FindBugs being silent?

@jglick

jglick Mar 27, 2018

Author Member

Could do that, but actually I think I want to mark this

@Restricted(ProtectedExternally.class)

as a reminder that only this class is expected to be calling it.

@jglick

jglick Jun 8, 2018

Author Member

Never mind—plain protected already accomplishes that. Can annotate nullness.

lastLocation = Long.parseLong(lastLocationFile.readToString());
}
FilePath logFile = controller.getLogFile(workspace);
long len = logFile.length();

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

Maybe check the file's existence just in case? Otherwise the watcher will give up immediately (IOException). But IIUC the code, it may want to wait to avoid race conditions. Or not?

@jglick

jglick Mar 27, 2018

Author Member

If the file does not exist (because the wrapper script is still launching, or failed to launch), length() will be zero, thus nothing will happen on this iteration and we check again in another 100ms.
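A small JDK demonstration of the behavior relied on in the reply above, assuming FilePath#length behaves like java.io.File#length here (as the comment states): a nonexistent file reports length 0 rather than throwing, so the "nothing new to copy" branch also covers the not-yet-created case.

```java
import java.io.File;

// A missing file yields length() == 0, with no exception, so a poller
// comparing lengths can treat "not created yet" and "no new output"
// identically.
public class LengthDemo {
    public static void main(String[] args) {
        File missing = new File("definitely-does-not-exist-12345.log");
        System.out.println(missing.length());
        System.out.println(missing.exists());
    }
}
```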

* If it needs to communicate with the master, you may use {@link VirtualChannel#export}.
* @see Controller#watch
*/
public abstract class Handler implements Serializable {

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

SerializableOnlyOverRemoting after 2.107+

@jglick

jglick Mar 27, 2018

Author Member

Perhaps, yes. As discussed above.

* though in the typical case of line-oriented output it is likely that it will end in a newline.
* <p>Buffering is the responsibility of the caller, and {@link InputStream#markSupported} may be false.
* @param stream a way to read process output which has not already been handled
* @throws Exception if anything goes wrong, this watch is deactivated

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

I would rather use a more narrow exception type in throws so that Handler implementers do not start throwing RuntimeExceptions. But you will have to catch them when invoking Handler methods anyway...

@jglick

jglick Mar 27, 2018

Author Member

I do not really care what type is thrown. If anything is caught, the watch task shuts down.

try (InputStream is = getOutputFile(workspace).read()) {
return IOUtils.toByteArray(is);
}
}

@Override public final void stop(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
launcher.kill(Collections.singletonMap(COOKIE, cookieFor(workspace)));
// TODO after 10s, if the control dir still exists, write a flag file and have the Watcher shut down (interrupting any ongoing handler.output call if possible)

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

Makes sense to add logging at least

watchService().schedule(this, 100, TimeUnit.MILLISECONDS);
}
} catch (Exception x) {
// note that LOGGER here is going to the agent log, not master log

@oleg-nenashev

oleg-nenashev Mar 14, 2018

Member

so Runtime exceptions will be non-recoverable here... Looks fine assuming that we want Handler implementations to be written properly.

@jglick

jglick Mar 27, 2018

Author Member

If there is an error because the channel is closed, then if and when a new agent is started, a new watcher should be created; so, fine. If there is some other kind of error, then yes this is fatal—we stop sending status updates from this process.

@dwnusbaum

dwnusbaum Aug 10, 2018

Member

If there is a non-remoting error, then it looks like the build will keep running until controller.exitStatus returns null in DurableTaskStep.Execution#check. Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken? A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.

If the channel is closed, then everything should retry in DurableTask#getWorkspace during the next call to DurableTaskStep.Execution#check so that seems fine to me, although the 5 minute recurrence period seems a little long.

@jglick

jglick Aug 10, 2018

Author Member

Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken?

Possibly. I will think about another method in Handler for this purpose. As you note, in the case of a closed channel, the agent likely cannot communicate that error by definition, and getWorkspace is responsible for fixing things up a few minutes later. But in the case of a cloudy log implementation it could be useful to indicate to the master that some other error occurred, logs may be faulty, and we should rewatch when possible. Whether anything else in the build gets successfully logged at that point, who knows.

A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.

Yeah it seems a bit of a toss-up. My general expectation is that if something strange happens without a clear recovery procedure, we should err on the side of just failing the build.

@jglick

jglick Aug 10, 2018

Author Member

jenkinsci/workflow-durable-task-step-plugin@c859378 may produce better outcomes for cloudy log sinks. Without a test case it is hard to predict exactly what the failure modes will be, much less what response a user would actually find helpful.

ExecutorStepTest.shellOutputAcrossRestart does now exercise the code path under discussion in the case of the stock Remoting sink. The aforementioned commit does not change the behavior in that test, because it is indeed a ChannelClosedException that you get when the agent is disconnected from the master side. (If you just kill the agent JVM via either SIGTERM or SIGKILL, then the watcher task dies with it and you are also OK.)

@oleg-nenashev oleg-nenashev self-requested a review Mar 14, 2018

@oleg-nenashev oleg-nenashev referenced this pull request Mar 14, 2018

Closed

Watching plus Unicode #65

@oleg-nenashev
Member

oleg-nenashev left a comment

Requesting changes to indicate there are comments to be discussed before merging.

@jglick
Member Author

jglick left a comment

Some responses. Certain things will need to be cleaned up when I revisit these PRs.

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {

@jglick

jglick Mar 27, 2018

Author Member

Not sure I follow what you are saying.

To be clear, this overload needed to be introduced so it can be called from the watch task, which cannot produce a (usable) Launcher; the original overload was not using its launcher parameter anyway.


jglick added some commits Jun 8, 2018

assertEncoding(description, file, charset, expected, expectedEncoding, true, dockerWS, dockerLauncher);
}
private void assertEncoding(String description, String file, String charset, String expected, String expectedEncoding, boolean output, FilePath dockerWS, Launcher dockerLauncher) throws Exception {
System.err.println(description + " (output=" + output + ")"); // TODO maybe this should just be moved into a new class and @RunWith(Parameterized.class) for clarity

@jglick

jglick Aug 7, 2018

Author Member

Did that refactoring, adding the watch parameter in the process.

jglick added some commits Aug 7, 2018

[JENKINS-37575] As in #74, last-location.txt should record only what Handler.output actually received, in case of some exception halfway followed by a fresh agent connection.
* @throws IOException if initiating the watch fails, for example with a {@link ChannelClosedException}
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/
public void watch(@Nonnull FilePath workspace, @Nonnull Handler handler, @Nonnull TaskListener listener) throws IOException, InterruptedException, UnsupportedOperationException {

@svanoort

svanoort Aug 8, 2018

Member

Speculative, but what's the expected behavior in the face of an InterruptedException here?

@jglick

jglick Aug 8, 2018

Author Member

During DurableTaskStep.Execution.start: the sh step would fail. During getWorkspace: we retry later.

try {
handler.output(utf8EncodedStream);
} finally {
lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);

@dwnusbaum

dwnusbaum Aug 8, 2018

Member

Could you use ch.position() instead and get rid of CountingInputStream entirely, or am I missing something?

@jglick

jglick Aug 8, 2018

Author Member

Probably. My concern here is to have last-location.txt record the last position that was actually delivered to the Handler, in case something broke the streaming midway. It is maybe clearer that this would correspond to how much of the locallyEncodedStream was processed as opposed to the behavior of newInputStream, but OTOH that method does claim that the stream is unbuffered so I guess it should work and be simpler. (And anyway I have not managed to test this scenario.) I will try it.
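The property being relied on can be demonstrated with a JDK-only sketch (file names here are throwaway temp files): Channels.newInputStream is documented as unbuffered, so after reading N bytes from the stream the underlying FileChannel's position() has advanced by exactly N, which is why ch.position() can replace the CountingInputStream bookkeeping.

```java
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.nio.charset.StandardCharsets;

// Resume reading a "log" at a saved offset and observe that the channel
// position tracks exactly what the stream consumed.
public class PositionDemo {
    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("log", ".txt");
        Files.write(log, "hello world".getBytes(StandardCharsets.UTF_8));
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ)) {
            // position(6) plays the role of the saved lastLocation
            InputStream in = Channels.newInputStream(ch.position(6));
            byte[] buf = new byte[5];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            System.out.println(ch.position()); // 6 + bytes consumed
        }
    }
}
```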

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {

@svanoort

svanoort Aug 8, 2018

Member

If we're going to rework the way output is done, I'd really like to return a stream-able data implementation rather than raw byte[] - this ensures we're not allocating giant wads of memory to collect all the data if something produces logs quickly. Whether it's a Stream, Channel, whatever I do not care as much.

Consider also something like 'writeOutput' that internally feeds data to a sink.

It's also just plain more elegant and easier to work with in general.

@svanoort

svanoort Aug 8, 2018

Member

Actually... we could still count location that way using a CountingOutputStream I believe.

@jglick

jglick Aug 8, 2018

Author Member

Was already discussed earlier. This method is only used when you specify returnStdout: true, a fairly specialized case which is anyway only suited for small amounts of output—in the case of a workflow-cps caller, the entire resulting String gets saved to program.dat.

@svanoort

svanoort Aug 8, 2018

Member

Damnit, we've gone in circles on these PRs so long that I'm repeating myself -- at least I'm consistent though! 😆

private synchronized static ScheduledExecutorService watchService() {
if (watchService == null) {
// TODO 2.105+ use ClassLoaderSanityThreadFactory
watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));

@svanoort

svanoort Aug 8, 2018

Member

Wish we could use a thread pool that grows as needed for load (with an upper limit) but still supports scheduling. Sigh.

@svanoort

svanoort Aug 8, 2018

Member

( I know, I know, not part of the standard library )

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
@Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
byte[] buf = FileUtils.readFileToByteArray(f);

@svanoort

svanoort Aug 8, 2018

Member

Sigh, so much wish for streaming use.

@jglick

jglick Aug 8, 2018

Author Member

Ditto, unnecessary here.

try {
handler.output(utf8EncodedStream);
} finally {
lastLocationFile.write(Long.toString(ch.position()), null);

@svanoort

svanoort Aug 8, 2018

Member

🐜 Noted that without atomic writes here, if the agent is restarted during writing, then we lose the position within the log and, I think, re-send the entire log. Potential for looping if somehow we get repeated failures (very unlikely though).

@svanoort

svanoort Aug 8, 2018

Member

What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved? Suppose, for example that we can't connect to whatever is supposed to receive the log output.

@svanoort

svanoort Aug 8, 2018

Member

Wait, at that point will we bomb out completely and never try to send the log again? Gah, this code is literally giving me headaches to follow the flow of execution because it's so indirect.

@jglick C'mon man, a few more explanatory comments please? I almost think the overall flow merits a flow chart so you can see what's passed back and forth.

@svanoort

svanoort Aug 8, 2018

Member

Let me say what I expect: if the handler throws an exception (probably IOException) while trying to handle output, or takes longer than some timeout period, we break out (without mutating the lastLocationFile), and retry at least once after an appropriate delay.

At some point if issues are resolved so the handler passes, we should (via mechanism unspecified) be able to get the entire log re-sent from the point of failure, including any new content that was added.

@jglick

jglick Aug 9, 2018

Author Member

What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved?

Then we assume the log sink is hopeless and the Watcher task dies, after recording the last position it at least attempted to deliver. If and when the agent is reconnected, a new watch call will be made and we will try to resume.

a few more explanatory comments please

ack

@jglick

jglick Aug 9, 2018

Author Member

if the handler throws an exception (probably IOException)

I need to note in a comment that the default handler does not currently ever throw IOException as a result of a sink issue (only as a result of a local file error).

InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
try {
handler.output(utf8EncodedStream);

@svanoort

svanoort Aug 8, 2018

Member

@jglick Is there an enclosing Timeout somewhere here so we can't hang indefinitely? May be somewhere non-obvious since this has so many levels of indirection that my brain is throwing a StackOverflowException trying to process it, but if there isn't one there should be, and if there is one please note it somewhere obvious so this doesn't get flagged again.

Note also: as long as you haven't advanced the location marker it should be safe to try re-sending later... assuming this doesn't cause the whole thing to bomb out.

@jglick

jglick Aug 9, 2018

Author Member

Is there an enclosing Timeout somewhere here so we can't hang indefinitely?

No. Unlike the code path in polling mode, which wraps everything of interest in a Timeout, this is happening on the agent, so no master thread is being blocked. We no more apply a timeout to that than we would to any plugin code which anywhere calls something like

listener.getLogger().println("Whatever");

If it hangs for a while, it hangs for a while. If and when it recovers, it will proceed, and reschedule itself after 100ms as usual. If the master-side 5m sanity poll comes around (twice) and sees that the process has already exited yet no exit event was received, the step gets aborted.
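The "reschedule itself after 100ms" pattern described here can be sketched with the JDK alone (the counter and iteration limit are artificial, just to make the demo terminate): unlike scheduleAtFixedRate, a task that reschedules itself only after finishing cannot pile up behind a slow iteration, which is why a hang simply delays the next poll.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A task that reschedules itself after each run, as the watcher does,
// rather than being driven by a fixed-rate schedule.
public class SelfRescheduling implements Runnable {
    static final ScheduledExecutorService SVC = Executors.newScheduledThreadPool(1);
    final CountDownLatch done = new CountDownLatch(3); // stop after 3 iterations

    @Override public void run() {
        done.countDown(); // stand-in for "copy any new log output to the handler"
        if (done.getCount() > 0) {
            SVC.schedule(this, 100, TimeUnit.MILLISECONDS); // next poll in 100ms
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SelfRescheduling task = new SelfRescheduling();
        SVC.schedule(task, 0, TimeUnit.MILLISECONDS);
        task.done.await();
        SVC.shutdown();
        System.out.println("iterations=" + (3 - task.done.getCount()));
    }
}
```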

@svanoort

svanoort Aug 9, 2018

Member

What if it halts? I'm not clear what the result is there, and what if that happens a bunch of times in a row?

@jglick

jglick Aug 10, 2018

Author Member

Not sure what you mean by “it halts” in this context.

@svanoort
Member

svanoort left a comment

Blocking to reflect comments that need addressing -- and, uh, a very unclear and perhaps incorrect handling of cases where you encounter a transitory exception that prevents sending log output via Handler.output -- but which is recoverable.

Which sounds obscure except that with networks, it could happen pretty regularly due to hiccups.

Also: my review isn't done per se yet, this is kind of the first pass (I will need to sit down and actually diagram the flow from start to end probably).

@jglick
Member Author

jglick left a comment

It is instructive to watch ExecutorStepTest.buildShellScriptAcrossDisconnect in jenkinsci/workflow-durable-task-step-plugin#63, as I walked @dwnusbaum through.


Reverting 399e3eb; seems to be better to write last-location iff the current write succeeded (err on the side of duplication).
InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
handler.output(utf8EncodedStream);
lastLocationFile.write(Long.toString(ch.position()), null);

@dwnusbaum

dwnusbaum Aug 10, 2018

Member

I think a comment explaining that this is intentionally not in a finally block so that we duplicate data rather than skip it when an error occurs would be helpful.

@jglick

jglick Aug 10, 2018

Author Member

Just added that.
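The semantics being settled in this thread (advance the saved offset only after the handler accepted the bytes, so a failed delivery re-sends the same chunk rather than silently skipping it) can be sketched with the JDK alone; the Handler interface, deliver method, and file layout below are hypothetical simplifications of the plugin's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Offset file is updated only on successful delivery: errors duplicate
// output on retry instead of losing it.
public class DeliverThenRecord {
    interface Handler { void output(byte[] chunk) throws Exception; }

    static void deliver(Path log, Path lastLocationFile, Handler handler) throws Exception {
        long last = Long.parseLong(Files.readAllLines(lastLocationFile).get(0));
        byte[] all = Files.readAllBytes(log);
        byte[] chunk = Arrays.copyOfRange(all, (int) last, all.length);
        handler.output(chunk); // may throw; then the offset stays put
        // Intentionally NOT in a finally block: record only delivered bytes.
        Files.write(lastLocationFile, Long.toString(all.length).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("log", ".txt");
        Path loc = Files.createTempFile("last-location", ".txt");
        Files.write(log, "abcdef".getBytes(StandardCharsets.UTF_8));
        Files.write(loc, "0".getBytes(StandardCharsets.UTF_8));
        try {
            deliver(log, loc, chunk -> { throw new Exception("sink down"); });
        } catch (Exception x) {
            System.out.println("after failure offset=" + Files.readAllLines(loc).get(0));
        }
        StringBuilder sink = new StringBuilder();
        deliver(log, loc, chunk -> sink.append(new String(chunk, StandardCharsets.UTF_8)));
        System.out.println("after success offset=" + Files.readAllLines(loc).get(0) + " sink=" + sink);
    }
}
```

The failure path leaves the offset at 0, so the retry re-delivers the whole chunk; putting the write in a finally block would instead skip it.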

@jglick jglick requested review from svanoort and dwnusbaum Aug 10, 2018

@dwnusbaum
Member

dwnusbaum left a comment

I don't have any unaddressed concerns now that we attempt to retry after non-remoting failures in the default Handler in workflow-durable-task.

@svanoort
Member

svanoort left a comment

Seems fine to me now, AFAICT.

@svanoort svanoort merged commit a224295 into jenkinsci:master Sep 13, 2018

2 checks passed: continuous-integration/jenkins/incrementals (Deployed to Incrementals) and continuous-integration/jenkins/pr-merge (This commit looks good).

@jglick jglick deleted the jglick:watch-JENKINS-38381 branch Sep 13, 2018
