[JENKINS-52165] Controller.watch API #60
Conversation
private static ScheduledExecutorService watchService;
private synchronized static ScheduledExecutorService watchService() {
    if (watchService == null) {
        watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));
TODO Use the new ClassLoaderSanityThreadFactory when available from core.
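The lazy-initialization pattern in the quoted code can be sketched with only the JDK, standing in for Jenkins' `NamingThreadFactory`/`DaemonThreadFactory` and approximating the `ClassLoaderSanityThreadFactory` mentioned in the TODO. Class and thread names here are illustrative, not the plugin's actual code:

```java
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WatchServiceSketch {
    private static ScheduledExecutorService watchService;

    // Lazily creates a small scheduled pool whose threads are named daemons,
    // so they neither block JVM shutdown nor show up anonymously in thread dumps.
    static synchronized ScheduledExecutorService watchService() {
        if (watchService == null) {
            final AtomicInteger count = new AtomicInteger();
            ThreadFactory factory = r -> {
                Thread t = new Thread(r, "FileMonitoringTask watcher-" + count.incrementAndGet());
                t.setDaemon(true);
                // Rough stand-in for ClassLoaderSanityThreadFactory: pin the
                // context loader rather than inheriting the creating thread's.
                t.setContextClassLoader(WatchServiceSketch.class.getClassLoader());
                return t;
            };
            watchService = new ScheduledThreadPoolExecutor(5, factory);
        }
        return watchService;
    }

    public static void main(String[] args) throws Exception {
        Future<String> name = watchService().schedule(
                () -> Thread.currentThread().getName(), 10, TimeUnit.MILLISECONDS);
        System.out.println(name.get());
    }
}
```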
try (FileChannel ch = FileChannel.open(Paths.get(logFile.getRemote()), StandardOpenOption.READ)) {
    CountingInputStream cis = new CountingInputStream(Channels.newInputStream(ch.position(lastLocation)));
    handler.output(cis);
    lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);
Seems that this would not suffer from JENKINS-37575-like problems. We intentionally write lastLocationFile only after a successful call to output, so that if there is an agent channel outage we do not lose output. In such a case this Watcher simply dies; if and when the agent is reconnected, a new one should start at the same position.
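The write-position-only-after-successful-delivery idea can be illustrated with a minimal JDK-only sketch (the `Output` interface, file layout, and helper names are hypothetical stand-ins for `Handler.output` and `lastLocationFile`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PositionAfterDelivery {
    interface Output { void output(byte[] chunk) throws IOException; }

    // Delivers the unread tail of the log, advancing the recorded position
    // only after output() returns normally. If the sink throws (e.g. a channel
    // outage), the position file is left untouched, so a restarted watcher
    // resumes from the last successfully delivered byte and nothing is lost.
    static void deliver(Path log, Path positionFile, Output sink) throws IOException {
        long last = Files.exists(positionFile)
                ? Long.parseLong(Files.readString(positionFile).trim()) : 0;
        byte[] all = Files.readAllBytes(log);
        if (all.length <= last) {
            return; // nothing new yet
        }
        byte[] chunk = Arrays.copyOfRange(all, (int) last, all.length);
        sink.output(chunk); // may throw; position is then deliberately not updated
        Files.writeString(positionFile, Long.toString(all.length));
    }
}
```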
I do not see blocker issues here, but IMHO the code could be improved a bit. I will re-review after discussing the comments with Jesse.
* @param handler a remotable callback
* @param listener a remotable destination for messages
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/
IOException needs to be documented. jenkinsci/workflow-durable-task-step-plugin#65 suggests that it happens iff the node is offline, but there is no evidence of that here
Yes that could be clarified.
@@ -86,7 +100,7 @@
  * Intended for use after {@link #exitStatus(FilePath, Launcher)} has returned a non-null status.
  * The result is undefined if {@link DurableTask#captureOutput} was not called before launch; generally an {@link IOException} will result.
  * @param workspace the workspace in use
- * @param launcher a way to start processes
+ * @param launcher a way to start processes (currently unused)
same
(ditto)
@@ -55,7 +69,7 @@
 /**
  * Checks whether the task has finished.
  * @param workspace the workspace in use
- * @param launcher a way to start processes
+ * @param launcher a way to start processes (currently unused)
The method is overridable, and a downstream implementation may actually use the parameter while inheriting the Javadoc. So I would rather not touch it.
The parameter is no longer used in any existing implementation and it likely never will be, but harms nothing to leave it there.
/**
 * Tails a log file and watches for an exit status file.
 * Must be remotable so that {@link #watch} can transfer the implementation.
 */
protected static class FileMonitoringController extends Controller {
SerializableOnlyOverRemoting once it gets to 2.107+? Not the best API, but we do not have better annotations now (should be fixed eventually once we get to any kind of static analysis).
Well, SerializableOnlyOverRemoting indicates that it should not be serializable except when remoted. For this purpose I would want to emphasize that it must be serializable when remoted; whether it is also serializable in other contexts is of no relevance. We do not have such a marker interface.
(For this feature I also need a way of emphasizing that a ConsoleLogFilter must be serializable over Remoting.)
/**
 * Like {@link #exitStatus(FilePath, Launcher, TaskListener)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
 */
protected Integer exitStatus(FilePath workspace, TaskListener listener) throws IOException, InterruptedException {
Add CheckForNull so nobody unboxes it to int with FindBugs being silent?
Could do that, but actually I think I want to mark this @Restricted(ProtectedExternally.class) as a reminder that only this class is expected to be calling it.
Never mind: plain protected already accomplishes that. Can annotate nullness.
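The hazard the reviewer is pointing at (a null Integer silently unboxed to int) is easy to show in a tiny JDK-only sketch; the method here is a hypothetical stand-in for the Integer-returning exitStatus, where null means "still running":

```java
public class UnboxingHazard {
    // Stand-in for an exitStatus-style method: null means "still running".
    // Annotating the real method @CheckForNull would make FindBugs flag any
    // caller that unboxes the result without a null check.
    static Integer exitStatus(boolean finished) {
        return finished ? Integer.valueOf(0) : null;
    }

    static String describe(boolean finished) {
        Integer status = exitStatus(finished);
        // int s = status;  // would throw NullPointerException when status is null
        return status == null ? "still running" : "exit code " + status;
    }
}
```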
lastLocation = Long.parseLong(lastLocationFile.readToString());
}
FilePath logFile = controller.getLogFile(workspace);
long len = logFile.length();
Maybe check the file's existence just in case? Otherwise the watcher will give up immediately (IOException). But IIUC the code may want to wait to avoid race conditions. Or not?
If the file does not exist (because the wrapper script is still launching, or failed to launch), length() will be zero, thus nothing will happen on this iteration and we check again in another 100ms.
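One iteration of that length-gated tail poll can be sketched with the JDK alone (method and parameter names are illustrative; the real code uses Jenkins' FilePath, whose length() reports 0 for a missing file, hence the explicit exists check below):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TailPollSketch {
    // One poll iteration: if the log has grown past lastLocation, return the
    // new bytes. A missing or not-yet-grown file is treated as length 0, so
    // the iteration is a harmless no-op and the caller simply reschedules
    // itself (the real watcher waits another 100ms).
    static byte[] pollOnce(Path log, long lastLocation) throws IOException {
        long len = Files.exists(log) ? Files.size(log) : 0;
        if (len <= lastLocation) {
            return new byte[0]; // nothing to do this iteration
        }
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) (len - lastLocation));
            ch.position(lastLocation);
            while (buf.hasRemaining() && ch.read(buf) >= 0) { }
            return buf.array();
        }
    }
}
```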
* If it needs to communicate with the master, you may use {@link VirtualChannel#export}.
* @see Controller#watch
*/
public abstract class Handler implements Serializable {
SerializableOnlyOverRemoting after 2.107+
Perhaps, yes. As discussed above.
* though in the typical case of line-oriented output it is likely that it will end in a newline.
* <p>Buffering is the responsibility of the caller, and {@link InputStream#markSupported} may be false.
* @param stream a way to read process output which has not already been handled
* @throws Exception if anything goes wrong, this watch is deactivated
I would rather use a narrower exception type in throws so that Handler implementers do not start throwing RuntimeExceptions. But you will have to catch them when invoking Handler methods anyway...
I do not really care what type is thrown. If anything is caught, the watch task shuts down.
try (InputStream is = getOutputFile(workspace).read()) {
    return IOUtils.toByteArray(is);
}
}

@Override public final void stop(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
    launcher.kill(Collections.singletonMap(COOKIE, cookieFor(workspace)));
    // TODO after 10s, if the control dir still exists, write a flag file and have the Watcher shut down (interrupting any ongoing handler.output call if possible)
Makes sense to add logging at least
watchService().schedule(this, 100, TimeUnit.MILLISECONDS);
}
} catch (Exception x) {
    // note that LOGGER here is going to the agent log, not master log
So RuntimeExceptions will be non-recoverable here... Looks fine assuming that we want Handler implementations to be written properly.
If there is an error because the channel is closed, then if and when a new agent is started, a new watcher should be created; so, fine. If there is some other kind of error, then yes this is fatal—we stop sending status updates from this process.
If there is a non-remoting error, then it looks like the build will keep running until controller.exitStatus returns null in DurableTaskStep.Execution#check. Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken? A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.
If the channel is closed, then everything should retry in DurableTask#getWorkspace during the next call to DurableTaskStep.Execution#check, so that seems fine to me, although the 5 minute recurrence period seems a little long.
Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken?
Possibly. I will think about another method in Handler for this purpose. As you note, in the case of a closed channel, the agent likely cannot communicate that error by definition, and getWorkspace is responsible for fixing things up a few minutes later. But in the case of a cloudy log implementation it could be useful to indicate to the master that some other error occurred, logs may be faulty, and we should rewatch when possible. Whether anything else in the build gets successfully logged at that point, who knows.
A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.
Yeah it seems a bit of a toss-up. My general expectation is that if something strange happens without a clear recovery procedure, we should err on the side of just failing the build.
jenkinsci/workflow-durable-task-step-plugin@c859378 may produce better outcomes for cloudy log sinks. Without a test case it is hard to predict exactly what the failure modes will be, much less what response a user would actually find helpful.
ExecutorStepTest.shellOutputAcrossRestart does now exercise the code path under discussion in the case of the stock Remoting sink. The aforementioned commit does not change the behavior in that test, because it is indeed a ChannelClosedException that you get when the agent is disconnected from the master side. (If you just kill the agent JVM via either SIGTERM or SIGKILL, then the watcher task dies with it and you are also OK.)
Requesting changes to indicate there are comments to be discussed before merging.
Some responses. Certain things will need to be cleaned up when I revisit these PRs.
/**
 * Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
 */
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
Not sure I follow what you are saying. To be clear, this overload needed to be introduced so it can be called from the watch task, which cannot produce a (usable) Launcher; the original overload was not using its launcher parameter anyway.
assertEncoding(description, file, charset, expected, expectedEncoding, true, dockerWS, dockerLauncher);
}
private void assertEncoding(String description, String file, String charset, String expected, String expectedEncoding, boolean output, FilePath dockerWS, Launcher dockerLauncher) throws Exception {
    System.err.println(description + " (output=" + output + ")"); // TODO maybe this should just be moved into a new class and @RunWith(Parameterized.class) for clarity
Did that refactoring, adding the watch parameter in the process.
…nly what Handler.output actually received, in case of some exception halfway followed by a fresh agent connection.
* @throws IOException if initiating the watch fails, for example with a {@link ChannelClosedException}
* @throws UnsupportedOperationException when this mode is not available, so you must fall back to polling {@link #writeLog} and {@link #exitStatus(FilePath, Launcher)}
*/
public void watch(@Nonnull FilePath workspace, @Nonnull Handler handler, @Nonnull TaskListener listener) throws IOException, InterruptedException, UnsupportedOperationException {
Speculative, but what's the expected behavior in the face of an InterruptedException here?
During DurableTaskStep.Execution.start: the sh step would fail. During getWorkspace: we retry later.
try {
    handler.output(utf8EncodedStream);
} finally {
    lastLocationFile.write(Long.toString(lastLocation + cis.getByteCount()), null);
Could you use ch.position() instead and get rid of CountingInputStream entirely, or am I missing something?
Probably. My concern here is to have last-location.txt record the last position that was actually delivered to the Handler, in case something broke the streaming midway. It is maybe clearer that this would correspond to how much of the locallyEncodedStream was processed as opposed to the behavior of newInputStream, but OTOH that method does claim that the stream is unbuffered, so I guess it should work and be simpler. (And anyway I have not managed to test this scenario.) I will try it.
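The equivalence being debated (channel position vs. CountingInputStream byte count) can be checked with a small JDK-only sketch; the method name is illustrative. Because Channels.newInputStream is unbuffered, the channel position tracks exactly what the stream has handed out:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionVsCounting {
    // Drains the stream the way a handler would consume it and returns
    // ch.position(). Since Channels.newInputStream does no read-ahead, the
    // final position equals lastLocation plus the bytes consumed, i.e. the
    // same number lastLocation + CountingInputStream.getByteCount() yields.
    static long drainAndReportPosition(Path log, long lastLocation) throws IOException {
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ)) {
            InputStream in = Channels.newInputStream(ch.position(lastLocation));
            long consumed = 0;
            while (in.read() != -1) {
                consumed++;
            }
            // invariant: the channel advanced exactly as far as the stream read
            assert ch.position() == lastLocation + consumed;
            return ch.position();
        }
    }
}
```

The caveat from the comment stands: if the handler fails midway, ch.position() reflects bytes read from the file, not bytes the handler successfully processed, so the two approaches differ only in that failure window.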
/**
 * Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
 */
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
If we're going to rework the way output is done, I'd really like to return a stream-able data implementation rather than raw byte[]; this ensures we're not allocating giant wads of memory to collect all the data if something produces logs quickly. Whether it's a Stream, Channel, whatever, I do not care as much.
Consider also something like 'writeOutput' that internally feeds data to a sink.
It's also just plain more elegant and easier to work with in general.
Actually... we could still count location that way using a CountingOutputStream I believe.
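A counting sink along the lines suggested is easy to sketch with only the JDK; this is a minimal stand-in for Commons IO's CountingOutputStream (class name and usage are illustrative, not part of the PR):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Wraps a sink and records how many bytes were actually written through it.
// In a streaming 'writeOutput' design this count could serve as the "last
// delivered location", replacing the byte[]-returning getOutput overloads.
public class CountingSink extends FilterOutputStream {
    private long count;

    public CountingSink(OutputStream out) {
        super(out);
    }

    @Override public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bypass FilterOutputStream's byte-at-a-time default
        count += len;
    }

    public long getByteCount() {
        return count;
    }
}
```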
Was already discussed earlier. This method is only used when you specify returnStdout: true, a fairly specialized case which is anyway only suited for small amounts of output; in the case of a workflow-cps caller, the entire resulting String gets saved to program.dat.
Damnit, we've gone in circles on these PRs so long that I'm repeating myself -- at least I'm consistent though! 😆
private synchronized static ScheduledExecutorService watchService() {
    if (watchService == null) {
        // TODO 2.105+ use ClassLoaderSanityThreadFactory
        watchService = new /*ErrorLogging*/ScheduledThreadPoolExecutor(5, new NamingThreadFactory(new DaemonThreadFactory(), "FileMonitoringTask watcher"));
Wish we could use a thread pool that grows as needed for load (with an upper limit) but still supports scheduling. Sigh.
( I know, I know, not part of the standard library )
/**
 * Like {@link #getOutput(FilePath, Launcher)} but not requesting a {@link Launcher}, which would not be available in {@link #watch} mode anyway.
 */
protected byte[] getOutput(FilePath workspace) throws IOException, InterruptedException {
    return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
        @Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
            byte[] buf = FileUtils.readFileToByteArray(f);
Sigh, so much wish for streaming use.
Ditto, unnecessary here.
…CountingInputStream.
try {
    handler.output(utf8EncodedStream);
} finally {
    lastLocationFile.write(Long.toString(ch.position()), null);
🐜 Noted that without atomic writes here, if the agent is restarted during writing, then we lose the position within the log and, I think, re-send the entire log. Potential for looping if somehow we get repeated failures (very unlikely though).
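The atomic-write concern is usually addressed with a write-to-temp-then-rename pattern; a JDK-only sketch (file names and the helper are illustrative, not the plugin's API):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPositionWrite {
    // Writes the position to a sibling temp file first, then renames it over
    // the target. A crash mid-write then leaves the old value intact instead
    // of a truncated file that would force re-sending the whole log.
    static void writePosition(Path positionFile, long position) throws Exception {
        Path tmp = positionFile.resolveSibling(positionFile.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(position).getBytes(StandardCharsets.US_ASCII));
        try {
            Files.move(tmp, positionFile,
                    StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
        } catch (AtomicMoveNotSupportedException e) {
            // some filesystems cannot combine these options; fall back non-atomically
            Files.move(tmp, positionFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```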
What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved? Suppose, for example that we can't connect to whatever is supposed to receive the log output.
Wait, at that point will we bomb out completely and never try to send the log again? Gah, this code is literally giving me headaches to follow the flow of execution because it's so indirect.
@jglick C'mon man, a few more explanatory comments please? I almost think the overall flow merits a flow chart so you can see what's passed back and forth.
Let me say what I expect: if the handler throws an exception (probably IOException) while trying to handle output, or takes longer than some timeout period, we break out (without mutating the lastLocationFile), and retry at least once after an appropriate delay.
At some point if issues are resolved so the handler passes, we should (via mechanism unspecified) be able to get the entire log re-sent from the point of failure, including any new content that was added.
What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved?
Then we assume the log sink is hopeless and the Watcher task dies, after recording the last position it at least attempted to deliver. If and when the agent is reconnected, a new watch call will be made and we will try to resume.
a few more explanatory comments please
ack
if the handler throws an exception (probably IOException)
I need to note in a comment that the default handler does not currently ever throw IOException as a result of a sink issue (only as a result of a local file error).
Changed as of jenkinsci/workflow-durable-task-step-plugin@62aeb45.
InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
try {
    handler.output(utf8EncodedStream);
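The transcoding step quoted above (Commons IO's ReaderInputStream re-encoding the agent's local charset as UTF-8) can be approximated with a JDK-only sketch. Unlike the original, which streams, this version buffers the whole chunk in memory; the helper name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    // Decodes the agent's locally encoded bytes and re-encodes them as UTF-8,
    // mirroring what the quoted ReaderInputStream wrapper does before the
    // bytes are handed to Handler.output. A null charset means the log is
    // already UTF-8 and is passed through untouched.
    static byte[] toUtf8(byte[] locallyEncoded, Charset cs) throws IOException {
        if (cs == null) {
            return locallyEncoded;
        }
        Reader r = new InputStreamReader(new ByteArrayInputStream(locallyEncoded), cs);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        for (int n; (n = r.read(buf)) != -1; ) {
            w.write(buf, 0, n);
        }
        w.flush();
        return out.toByteArray();
    }
}
```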
@jglick Is there an enclosing Timeout somewhere here so we can't hang indefinitely? May be somewhere non-obvious since this has so many levels of indirection that my brain is throwing a StackOverflowException trying to process it, but if there isn't one there should be, and if there is one please note it somewhere obvious so this doesn't get flagged again.
Note also: as long as you haven't advanced the location marker it should be safe to try re-sending later... assuming this doesn't cause the whole thing to bomb out.
Is there an enclosing Timeout somewhere here so we can't hang indefinitely?
No. Unlike the code path in polling mode, which wraps everything of interest in a Timeout, this is happening on the agent, so no master thread is being blocked. We no more apply a timeout to that than we would to any plugin code which anywhere calls something like
listener.getLogger().println("Whatever");
If it hangs for a while, it hangs for a while. If and when it recovers, it will proceed, and reschedule itself after 100ms as usual. If the master-side 5m sanity poll comes around (twice) and sees that the process has already exited yet no exit event was received, the step gets aborted.
What if it halts? I'm not clear what the result is there, and what if that happens a bunch of times in a row?
Not sure what you mean by “it halts” in this context.
Blocking to reflect comments that need addressing -- and, uh, a very unclear and perhaps incorrect handling of cases where you encounter a transitory exception that prevents sending log output via Handler.output -- but which is recoverable.
Which sounds obscure except that with networks, it could happen pretty regularly due to hiccups.
Also: my review isn't done per se yet, this is kind of the first pass (I will need to sit down and actually diagram the flow from start to end probably).
It is instructive to watch ExecutorStepTest.buildShellScriptAcrossDisconnect in jenkinsci/workflow-durable-task-step-plugin#63, as I walked @dwnusbaum through.
@@ -86,7 +100,7 @@
     * Intended for use after {@link #exitStatus(FilePath, Launcher)} has returned a non-null status.
     * The result is undefined if {@link DurableTask#captureOutput} was not called before launch; generally an {@link IOException} will result.
     * @param workspace the workspace in use
-    * @param launcher a way to start processes
+    * @param launcher a way to start processes (currently unused)
(ditto)
InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
try {
    handler.output(utf8EncodedStream);
Is there an enclosing Timeout somewhere here so we can't hang indefinitely?
No. Unlike the code path in polling mode, which wraps everything of interest in a `Timeout`, this is happening on the agent, so no master thread is being blocked. We no more apply a timeout to that than we would to any plugin code which anywhere calls something like `listener.getLogger().println("Whatever");`. If it hangs for a while, it hangs for a while. If and when it recovers, it will proceed, and reschedule itself after 100ms as usual. If the master-side 5m sanity poll comes around (twice) and sees that the process has already exited yet no exit event was received, the step gets aborted.
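The self-rescheduling pattern described above can be sketched as follows. This is a standalone approximation, not the plugin's actual code: the class name `WatcherSketch`, the latch, and the demo run count are invented for illustration. The point is that each run does its work with no timeout wrapper, then schedules its own next run after 100ms; if a run throws, the task simply dies without rescheduling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WatcherSketch implements Runnable {
    private static final ScheduledExecutorService SERVICE = Executors.newScheduledThreadPool(1);
    private final CountDownLatch runs = new CountDownLatch(3); // stop after three runs, for the demo

    @Override public void run() {
        try {
            // ... copy any new log output to the handler here, with no Timeout wrapper;
            // a slow handler just delays the next reschedule, blocking no master thread ...
            runs.countDown();
            if (runs.getCount() > 0) {
                SERVICE.schedule(this, 100, TimeUnit.MILLISECONDS); // reschedule as usual
            }
        } catch (Exception x) {
            // on an unrecoverable error the task just dies: no reschedule
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WatcherSketch w = new WatcherSketch();
        SERVICE.schedule(w, 0, TimeUnit.MILLISECONDS);
        w.runs.await(); // wait for the demo runs to complete
        SERVICE.shutdown();
        System.out.println("done");
    }
}
```

A real watcher would loop indefinitely until the wrapped process exits; the latch here only bounds the demo.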
try {
    handler.output(utf8EncodedStream);
} finally {
    lastLocationFile.write(Long.toString(ch.position()), null);
What if the handler throws an exception (likely IOException) while writing the output and the output isn't actually saved?
Then we assume the log sink is hopeless and the `Watcher` task dies, after recording the last position it at least attempted to deliver. If and when the agent is reconnected, a new `watch` call will be made and we will try to resume.
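The resume-from-last-position behavior can be sketched in a self-contained way. Assume the following: `copyNewOutput` and the temp-file setup are invented names for illustration, the sink is an in-memory buffer standing in for the remote `Handler`, and the new offset plays the role of `lastLocationFile`. Each call streams from the saved offset and returns the new offset only after the delivery attempt, so a reconnected watcher resumes where the last one left off.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ResumeSketch {
    // Copy everything after lastLocation to the sink; return the new offset.
    static long copyNewOutput(Path logFile, long lastLocation, StringBuilder sink) throws IOException {
        try (FileChannel ch = FileChannel.open(logFile, StandardOpenOption.READ)) {
            InputStream in = Channels.newInputStream(ch.position(lastLocation));
            sink.append(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // delivery; may throw
            return ch.position(); // recorded only after the delivery attempt, like lastLocationFile
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("log", ".txt");
        Files.writeString(log, "first\n");
        StringBuilder sink = new StringBuilder();
        long pos = copyNewOutput(log, 0, sink);              // delivers "first"
        Files.writeString(log, "second\n", StandardOpenOption.APPEND);
        pos = copyNewOutput(log, pos, sink);                 // resumes at the saved offset
        System.out.print(sink);                              // no duplication, no gap
        Files.delete(log);
    }
}
```

If the delivery in the first call had thrown, the caller would still hold the old offset and the next attempt would re-send the same bytes, duplicating output rather than losing it.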
a few more explanatory comments please
ack
…current write succeeded (err on the side of duplication).
InputStream locallyEncodedStream = Channels.newInputStream(ch.position(lastLocation));
InputStream utf8EncodedStream = cs == null ? locallyEncodedStream : new ReaderInputStream(new InputStreamReader(locallyEncodedStream, cs), StandardCharsets.UTF_8);
handler.output(utf8EncodedStream);
lastLocationFile.write(Long.toString(ch.position()), null);
I think a comment explaining that this is intentionally not in a finally block so that we duplicate data rather than skip it when an error occurs would be helpful.
Just added that.
            watchService().schedule(this, 100, TimeUnit.MILLISECONDS);
        }
    } catch (Exception x) {
        // note that LOGGER here is going to the agent log, not master log
If there is a non-remoting error, then it looks like the build will keep running until `controller.exitStatus` returns `null` in `DurableTaskStep.Execution#check`. Would it be helpful/possible to add some way to indicate to DurableTaskStep that logging appears to be broken? A passing build with broken logs would be strange, although failing the build because logging is broken doesn't seem any better.
If the channel is closed, then everything should retry in `DurableTask#getWorkspace` during the next call to `DurableTaskStep.Execution#check`, so that seems fine to me, although the 5 minute recurrence period seems a little long.
I don't have any unaddressed concerns now that we attempt to retry after non-remoting failures in the default Handler in workflow-durable-task.
Seems fine to me now, AFAICT.
JENKINS-52165 Just the minimum changes pulled out of #29 to enable push rather than pull logging from durable tasks.