[JENKINS-52165] Controller.watch API #60

Merged: 37 commits, Sep 13, 2018

Conversation

@jglick (Member) commented Feb 9, 2018:

JENKINS-52165 Just the minimum changes pulled out of #29 to enable push rather than pull logging from durable tasks.

private static ScheduledExecutorService watchService;
private synchronized static ScheduledExecutorService watchService() {
    if (watchService == null) {
        watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));
Member:
TODO Use the new ClassLoaderSanityThreadFactory when available from core
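For reference, that TODO would presumably end up looking something like the following sketch, assuming the hudson.util.ClassLoaderSanityThreadFactory(ThreadFactory) wrapper available in newer cores; the eventual change may differ:

import hudson.util.ClassLoaderSanityThreadFactory;
import hudson.util.DaemonThreadFactory;
import hudson.util.NamingThreadFactory;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;

class WatchServiceSketch {
    // Sketch only: same five-thread daemon pool, but with the delegate factory wrapped
    // so watcher threads get a sane context classloader rather than whatever created them.
    static ScheduledExecutorService create() {
        return new ScheduledThreadPoolExecutor(5, new NamingThreadFactory(
                new ClassLoaderSanityThreadFactory(new DaemonThreadFactory()),
                "FileMonitoringTask watcher"));
    }
}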

@jglick mentioned this pull request, Feb 13, 2018
try (FileChannel ch = FileChannel.open(Paths.get(logFile.getRemote()), StandardOpenOption.READ)) {
    CountingInputStream cis = new CountingInputStream(Channels.newInputStream(ch.position(lastLocation)));
    handler.output(cis);
    lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);
jglick (author):
Seems that this would not suffer from JENKINS-37575-like problems. We intentionally write lastLocationFile only after a successful call to output, so that if there is an agent channel outage we do not lose output. In such a case this Watcher simply dies; if and when the agent is reconnected, a new one should start at the same position.

oleg-nenashev (Member) left a comment:
I do not see any blocking issues here, but IMHO the code could be improved a bit. I will re-review after discussing the comments with Jesse.

* @param handler a remotable callback
* @param listener a remotable destination for messages
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/
Member:
IOException needs to be documented. jenkinsci/workflow-durable-task-step-plugin#65 suggests that it happens iff the node is offline, but there is no evidence of that here.

jglick (author):
Yes that could be clarified.

@@ -86,7 +100,7 @@
* Intended for use after {@link #exitStatus(FilePath, Launcher)} has returned a non-null status.
* The result is undefined if {@link DurableTask#captureOutput} was not called before launch; generally an {@link IOException} will result.
* @param workspace the workspace in use
* @param launcher a way to start processes
* @param launcher a way to start processes (currently unused)
Member:
same

jglick (author):
(ditto)

@@ -55,7 +69,7 @@
/**
* Checks whether the task has finished.
* @param workspace the workspace in use
* @param launcher a way to start processes
* @param launcher a way to start processes (currently unused)
Member:
The method is overridable, and a downstream implementation may actually use the parameter while inheriting the Javadoc, so I would rather not touch it.

jglick (author):
The parameter is no longer used in any existing implementation and likely never will be, but it harms nothing to leave it there.

/**
* Tails a log file and watches for an exit status file.
* Must be remotable so that {@link #watch} can transfer the implementation.
*/
protected static class FileMonitoringController extends Controller {
Member:
SerializableOnlyOverRemoting once this gets to 2.107+? Not the best API, but we do not have better annotations right now (should be fixed eventually once we get to some kind of static analysis).

jglick (author):
Well SerializableOnlyOverRemoting indicates that it should not be serializable except when remoted. For this purpose I would want to emphasize that it must be serializable when remoted; whether it is also serializable in other contexts is of no relevance. We do not have such a marker interface.

(For this feature I also need a way of emphasizing that a ConsoleLogFilter must be serializable over Remoting.)
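To make the distinction concrete, the missing marker would be something like the following purely hypothetical interface; nothing of the sort exists in core today:

import java.io.Serializable;

/**
 * Hypothetical marker, not an existing Jenkins API: it would assert that
 * instances must stay serializable when sent over a Remoting channel,
 * while saying nothing about serialization in other contexts.
 */
interface MustBeSerializableOverRemoting extends Serializable {
}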

/**
* Like {@link #exitStatus(FilePath, Launcher, TaskListener)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected Integer exitStatus(FilePath workspace, TaskListener listener) throws IOException, InterruptedException {
Member:
Add @CheckForNull so that nobody unboxes it to int while FindBugs stays silent?

jglick (author):
Could do that, but actually I think I want to mark this

@Restricted(ProtectedExternally.class)

as a reminder that only this class is expected to be calling it.

jglick (author):
Never mind: plain protected already accomplishes that. I can still annotate the nullness.
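For illustration, the annotated overload would look roughly like this; a sketch that assumes the FindBugs annotations already used across Jenkins and simplifies the parameters:

import edu.umd.cs.findbugs.annotations.CheckForNull;

class ExitStatusSketch {
    /**
     * Sketch: a null return means "the process has not exited yet", so @CheckForNull
     * keeps callers from blindly unboxing the result to int.
     */
    @CheckForNull
    Integer exitStatus(java.io.File controlDir) {
        // the real controller reads an exit-status file here if it exists
        return null;
    }
}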

lastLocation = Long.parseLong(lastLocationFile.readToString());
}
FilePath logFile = controller.getLogFile(workspace);
long len = logFile.length();
Member:
Maybe check the file's existence just in case? Otherwise the watcher will give up immediately (IOException). But IIUC the code may want to wait anyway to avoid race conditions. Or not?

jglick (author):
If the file does not exist (because the wrapper script is still launching, or failed to launch), length() will be zero, thus nothing will happen on this iteration and we check again in another 100ms.
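For intuition, here is a minimal standalone sketch (plain JDK, not the plugin's actual Watcher class) of the tail-and-reschedule loop being described: a missing or still-empty file means nothing is delivered on this iteration, and the task simply reschedules itself after 100 ms.

import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy tailer: copies newly appended bytes of a log file to stdout every 100 ms. */
public class TailSketch implements Runnable {
    private final Path logFile;
    private final ScheduledExecutorService service;
    private long lastLocation; // bytes already delivered

    TailSketch(Path logFile, ScheduledExecutorService service) {
        this.logFile = logFile;
        this.service = service;
    }

    @Override public void run() {
        try {
            long len = Files.exists(logFile) ? Files.size(logFile) : 0; // a missing file behaves like length 0
            if (len > lastLocation) {
                try (FileChannel ch = FileChannel.open(logFile, StandardOpenOption.READ)) {
                    // deliver only the bytes we have not yet seen
                    Channels.newInputStream(ch.position(lastLocation)).transferTo(System.out);
                    lastLocation = ch.position(); // advance only after successful delivery
                }
            }
            // nothing new (or no file yet): just check again shortly
            service.schedule(this, 100, TimeUnit.MILLISECONDS);
        } catch (IOException x) {
            x.printStackTrace(); // a real watcher would give up here and rely on a later re-watch
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService service = Executors.newScheduledThreadPool(1);
        service.schedule(new TailSketch(Paths.get(args[0]), service), 100, TimeUnit.MILLISECONDS);
    }
}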

* If it needs to communicate with the master, you may use {@link VirtualChannel#export}.
* @see Controller#watch
*/
public abstract class Handler implements Serializable {
Member:
SerializableOnlyOverRemoting after 2.107+

jglick (author):
Perhaps, yes. As discussed above.

* though in the typical case of line-oriented output it is likely that it will end in a newline.
* <p>Buffering is the responsibility of the caller, and {@link InputStream#markSupported} may be false.
* @param stream a way to read process output which has not already been handled
* @throws Exception if anything goes wrong, this watch is deactivated
Member:
I would rather use a narrower exception type in throws so that Handler implementers do not start throwing RuntimeExceptions. But you will have to catch them when invoking Handler methods anyway...

jglick (author):
I do not really care what type is thrown. If anything is caught, the watch task shuts down.

    try (InputStream is = getOutputFile(workspace).read()) {
        return IOUtils.toByteArray(is);
    }
}

@Override public final void stop(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
    launcher.kill(Collections.singletonMap(COOKIE, cookieFor(workspace)));
    // TODO after 10s, if the control dir still exists, write a flag file and have the Watcher shut down (interrupting any ongoing handler.output call if possible)
Member:
It makes sense to at least add logging here.

        watchService().schedule(this, 100, TimeUnit.MILLISECONDS);
    }
} catch (Exception x) {
    // note that LOGGER here is going to the agent log, not master log
Member:
So runtime exceptions will be non-recoverable here... Looks fine, assuming we want Handler implementations to be written properly.

jglick (author):
If there is an error because the channel is closed, then if and when a new agent is started, a new watcher should be created; so that is fine. If there is some other kind of error, then yes, this is fatal: we stop sending status updates from this process.

dwnusbaum (Member), Aug 10, 2018:
If there is a non-remoting error, then it looks like the build will keep running until controller.exitStatus returns null in DurableTaskStep.Execution#check. Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken? A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.

If the channel is closed, then everything should retry in DurableTask#getWorkspace during the next call to DurableTaskStep.Execution#check so that seems fine to me, although the 5 minute recurrence period seems a little long.

jglick (author):
Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken?

Possibly. I will think about another method in Handler for this purpose. As you note, in the case of a closed channel, the agent likely cannot communicate that error by definition, and getWorkspace is responsible for fixing things up a few minutes later. But in the case of a cloudy log implementation it could be useful to indicate to the master that some other error occurred, logs may be faulty, and we should rewatch when possible. Whether anything else in the build gets successfully logged at that point, who knows.

A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.

Yeah it seems a bit of a toss-up. My general expectation is that if something strange happens without a clear recovery procedure, we should err on the side of just failing the build.

jglick (author):
jenkinsci/workflow-durable-task-step-plugin@c859378 may produce better outcomes for cloudy log sinks. Without a test case it is hard to predict exactly what the failure modes will be, much less what response a user would actually find helpful.

ExecutorStepTest.shellOutputAcrossRestart does now exercise the code path under discussion in the case of the stock Remoting sink. The aforementioned commit does not change the behavior in that test, because it is indeed a ChannelClosedException that you get when the agent is disconnected from the master side. (If you just kill the agent JVM via either SIGTERM or SIGKILL, then the watcher task dies with it and you are also OK.)
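To make the idea of an extra Handler method (floated a couple of comments up) concrete, here is a purely hypothetical sketch; nothing like outputFailed exists in the actual API:

import java.io.InputStream;
import java.io.Serializable;

/** Hypothetical sketch only, not the plugin's actual Handler class. */
public abstract class HandlerSketch implements Serializable {
    /** Called with process output that has not yet been handled, as in the real API. */
    public abstract void output(InputStream stream) throws Exception;

    /**
     * Hypothetical hook: the watcher could call this when output delivery failed
     * for a non-Remoting reason, so the master side could warn the user that logs
     * may be incomplete and attempt to re-watch when possible.
     */
    public void outputFailed(Throwable cause) {
        // default no-op so that existing implementations would keep compiling
    }
}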

oleg-nenashev (Member) left a comment:
Requesting changes to indicate there are comments to be discussed before merging.

jglick (author) left a comment:
Some responses. Certain things will need to be cleaned up when I revisit these PRs.

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
jglick (author):
Not sure I follow what you are saying.

To be clear, this overload needed to be introduced so it can be called from the watch task, which cannot produce a (usable) Launcher; the original overload was not using its launcher parameter anyway.

assertEncoding(description, file, charset, expected, expectedEncoding, true, dockerWS, dockerLauncher);
}
private void assertEncoding(String description, String file, String charset, String expected, String expectedEncoding, boolean output, FilePath dockerWS, Launcher dockerLauncher) throws Exception {
System.err.println(description + " (output=" + output + ")"); // TODO maybe this should just be moved into a new class and @RunWith(Parameterized.class) for clarity
jglick (author):
Did that refactoring, adding the watch parameter in the process.
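The Parameterized shape hinted at in the TODO might look roughly like this sketch; the parameter names are made up here and the actual test class is organized differently:

import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameter;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class EncodingSketchTest {
    @Parameters(name = "output={0} watch={1}")
    public static Collection<Object[]> data() {
        return Arrays.asList(new Object[][] {
            {false, false}, {false, true}, {true, false}, {true, true},
        });
    }

    @Parameter(0) public boolean output;
    @Parameter(1) public boolean watch;

    @Test public void encoding() throws Exception {
        // would launch the task with the given flags and assert on the resulting log encoding
    }
}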

…nly what Handler.output actually received, in case of some exception halfway followed by a fresh agent connection.
* @throws IOException if initiating the watch fails, for example with a {@link ChannelClosedException}
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/
public void watch(@Nonnull FilePath workspace, @Nonnull Handler handler, @Nonnull TaskListener listener) throws IOException, InterruptedException, UnsupportedOperationException {
Member:
Speculative, but what's the expected behavior in the face of an InterruptedException here?

jglick (author):
During DurableTaskStep.Execution.start: the sh step would fail. During getWorkspace: we retry later.

try {
    handler.output(utf8EncodedStream);
} finally {
    lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);
Member:
Could you use ch.position() instead and get rid of CountingInputStream entirely, or am I missing something?

jglick (author):
Probably. My concern here is to have last-location.txt record the last position that was actually delivered to the Handler, in case something broke the streaming midway. Counting with CountingInputStream arguably corresponds more clearly to how much of the locallyEncodedStream was processed than relying on the behavior of newInputStream, but OTOH that method does claim that the stream is unbuffered, so I guess ch.position() should work and be simpler. (And anyway I have not managed to test this scenario.) I will try it.
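A sketch of the two variants being weighed, assuming Commons IO on the classpath (the existing code already uses CountingInputStream); the sink type is a stand-in for Handler.output:

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.apache.commons.io.input.CountingInputStream;

class ResumePositionSketch {
    /** Stand-in for Handler.output. */
    interface Sink {
        void output(InputStream stream) throws IOException;
    }

    /** Variant 1: count the bytes the sink actually read. */
    static long deliverCounting(Path log, long lastLocation, Sink sink) throws IOException {
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ)) {
            CountingInputStream cis = new CountingInputStream(Channels.newInputStream(ch.position(lastLocation)));
            sink.output(cis);
            return lastLocation + cis.getByteCount();
        }
    }

    /** Variant 2: rely on the channel position; equivalent so long as the stream really is unbuffered. */
    static long deliverByPosition(Path log, long lastLocation, Sink sink) throws IOException {
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ)) {
            sink.output(Channels.newInputStream(ch.position(lastLocation)));
            return ch.position();
        }
    }
}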

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
Member:
If we're going to rework the way output is done, I'd really like to return a streamable data implementation rather than a raw byte[]; this ensures we're not allocating giant wads of memory to collect all the data if something produces logs quickly. Whether it's a Stream, Channel, or whatever, I do not care as much.

Consider also something like 'writeOutput' that internally feeds data to a sink.

It's also just plain more elegant and easier to work with in general.

Member:
Actually... we could still track the location that way using a CountingOutputStream, I believe.

jglick (author):
This was already discussed earlier. The method is only used when you specify returnStdout: true, a fairly specialized case which is anyway only suited for small amounts of output; in the case of a workflow-cps caller, the entire resulting String gets saved to program.dat.

Member:
Damnit, we've gone in circles on these PRs so long that I'm repeating myself -- at least I'm consistent though! 😆
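For reference, the streamable/sink-based shape being suggested in this thread could look roughly like the following hypothetical sketch; writeOutput is a made-up name here, not part of the plugin API:

import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical sketch: instead of materializing captured output as a byte[],
 * the controller would copy it straight into a caller-supplied sink.
 */
interface StreamingOutputSketch {
    void writeOutput(OutputStream sink) throws IOException, InterruptedException;
}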

private synchronized static ScheduledExecutorService watchService() {
    if (watchService == null) {
        // TODO 2.105+ use ClassLoaderSanityThreadFactory
        watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));
Member:
Wish we could use a thread pool that grows as needed for load (with an upper limit) but still supports scheduling. Sigh.

Member:
( I know, I know, not part of the standard library )

/**
* Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
*/
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
    return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
        @Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
            byte[] buf = FileUtils.readFileToByteArray(f);
Member:
Sigh, so much wish for streaming use.

jglick (author):
Ditto, unnecessary here.

try {
    handler.output(utf8EncodedStream);
} finally {
    lastLocationFile.write(Long.toString(ch.position()), null);
Member:
🐜 Note that without atomic writes here, if the agent is restarted during the write, we lose the position within the log and, I think, re-send the entire log. There is potential for looping if we somehow get repeated failures (very unlikely, though).
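As a general illustration of the atomic-write pattern being referred to (plain NIO here, not the plugin's FilePath API; just the standard temp-file-plus-rename trick):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicPositionWrite {
    /** Writes the new position to a sibling temp file, then atomically renames it into place. */
    static void writePosition(Path lastLocationFile, long position) throws IOException {
        Path tmp = lastLocationFile.resolveSibling(lastLocationFile.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(position).getBytes(StandardCharsets.UTF_8));
        // readers never observe a half-written file, so a crash mid-write
        // cannot corrupt the recorded position
        Files.move(tmp, lastLocationFile, StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }
}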

Member:
What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved? Suppose, for example that we can't connect to whatever is supposed to receive the log output.

svanoort (Member), Aug 8, 2018:
Wait, at that point will we bomb out completely and never try to send the log again? Gah, this code is giving me headaches to follow because the flow of execution is so indirect.

@jglick C'mon man, a few more explanatory comments please? I almost think the overall flow merits a flow chart so you can see what's passed back and forth.

Member:
Let me say what I expect: if the handler throws an exception (probably IOException) while trying to handle output, or takes longer than some timeout period, we break out (without mutating the lastLocationFile) and retry at least once after an appropriate delay.

At some point if issues are resolved so the handler passes, we should (via mechanism unspecified) be able to get the entire log re-sent from the point of failure, including any new content that was added.

jglick (author):
What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved?

Then we assume the log sink is hopeless and the Watcher task dies, after recording the last position it at least attempted to deliver. If and when the agent is reconnected, a new watch call will be made and we will try to resume.

a few more explanatory comments please

ack

jglick (author):
if the handler throws an exception (probably IOException)

I need to note in a comment that the default handler does not currently ever throw IOException as a result of a sink issue (only as a result of a local file error).

InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
try {
    handler.output(utf8EncodedStream);
Member:
@jglick Is there an enclosing Timeout somewhere here so we can't hang indefinitely? It may be somewhere non-obvious, since this has so many levels of indirection that my brain is throwing a StackOverflowException trying to process it; but if there isn't one, there should be, and if there is one, please note it somewhere obvious so this doesn't get flagged again.

Note also: as long as you haven't advanced the location marker it should be safe to try re-sending later... assuming this doesn't cause the whole thing to bomb out.

jglick (author):
Is there an enclosing Timeout somewhere here so we can't hang indefinitely?

No. Unlike the code path in polling mode, which wraps everything of interest in a Timeout, this is happening on the agent, so no master thread is being blocked. We no more apply a timeout to this than we would to any plugin code that calls something like

listener.getLogger().println("Whatever");

If it hangs for a while, it hangs for a while. If and when it recovers, it will proceed, and reschedule itself after 100ms as usual. If the master-side 5m sanity poll comes around (twice) and sees that the process has already exited yet no exit event was received, the step gets aborted.

Member:
What if it halts? I'm not clear what the result is there, and what if that happens a bunch of times in a row?

jglick (author):
Not sure what you mean by “it halts” in this context.

svanoort (Member) left a comment:
Blocking to reflect comments that need addressing, and, uh, a very unclear and perhaps incorrect handling of the case where you encounter a transitory exception that prevents sending log output via Handler.output but which is recoverable.

Which sounds obscure except that with networks, it could happen pretty regularly due to hiccups.

Also: my review isn't done per se yet; this is kind of a first pass (I will probably need to sit down and actually diagram the flow from start to end).

jglick (author) left a comment:
It is instructive to watch ExecutorStepTest.buildShellScriptAcrossDisconnect in jenkinsci/workflow-durable-task-step-plugin#63, as I walked @dwnusbaum through.

…current write succeeded (err on the side of duplication).
InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
handler.output(utf8EncodedStream);
lastLocationFile.write(Long.toString(ch.position()), null);
Member:
I think it would be helpful to add a comment explaining that this is intentionally not in a finally block, so that on an error we duplicate data rather than skip it.

jglick (author):
Just added that.

dwnusbaum (Member) left a comment:
I don't have any unaddressed concerns now that we attempt to retry after non-remoting failures in the default Handler in workflow-durable-task.

svanoort (Member) left a comment:
Seems fine to me now, AFAICT.
