New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle empty port to fix RealJenkinsRule flakiness #465
Conversation
@@ -620,7 +620,8 @@ public void stopJenkins() throws Throwable { | |||
if (!proc.waitFor(60, TimeUnit.SECONDS) ) { | |||
System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process"); | |||
proc.destroyForcibly(); | |||
throw new AssertionError("Jenkins failed to terminate within 60 seconds"); | |||
proc = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when test flakes and fails:
- test runs well and going to
stopJenkins
at the end:stopJenkins(); stopJenkins
hit timeout:if (!proc.waitFor(60, TimeUnit.SECONDS) ) { - AssertExceptions thown as a result
finally
block executedjenkins-test-harness/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
Lines 389 to 396 in 6a1fdad
if (proc != null) { stopJenkins(); } try { tmp.dispose(); } catch (Exception x) { LOGGER.log(Level.WARNING, null, x); } stopJenkins(); - ConnectionException thrown:
endpoint("exit").openStream().close();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to set proc = null like
jenkins-test-harness/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
Lines 594 to 595 in 6a1fdad
proc.destroyForcibly(); | |
proc = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting proc = null
seems uncontroversial, just a bug fix. Replacing the AssertionError
with a potentially passing test seems like it is just masking an actual problem in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Pldi23, thank you for your contribution to the open-source Jenkins test harness.
RealJenkinsRule#then
always stops Jenkins after it is started, so the only reason for the stopJenkins
call in RealJenkinsRule#apply
must be as a fallback if someone started Jenkins by some other means but forgot to stop it: for example, by invoking RealJenkinsRule#startJenkins
directly from a test but forgetting to call RealJenkinsRule#stopJenkins
. It is certainly better to stop Jenkins late rather than never. (It might be good to log a warning before doing so, since this indicates a programming error.)
It does seem wrong that we would ever try to stop Jenkins via the /exit
endpoint twice: it is likely to fail the second time as you observed. Taking inspiration from the Java Closeable#close
interface, I think these semantics would be desirable:
If the stream is already closed then invoking this method has no effect.
In other words, the way close()
is usually implemented in Java Closeables
is to have some internal state regarding whether the resource has already been closed and to do nothing if it already has been closed. This is, in some way, indicated by proc
being non-null
. However, if we take the above design as the goal, then the current logic has two problems:
- The check for
proc
beingnull
happens outside the call tostopJenkins
rather than within it, violating the contract that invoking the method a second time has no effect - Once we are past the critical point (calling
/exit
) no state is recorded that would prevent us from trying to invoke that endpoint a second time (erroneously!)
My concrete suggestion would be something like this:
diff --git a/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java b/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
index d856b02..ca04515 100644
--- a/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
+++ b/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
@@ -180,12 +180,14 @@ public final class RealJenkinsRule implements TestRule {
- private Process proc;
+ private volatile Process proc;
// TODO may need to be relaxed for Gradle-based plugins
private static final Pattern SNAPSHOT_INDEX_JELLY = Pattern.compile("(file:/.+/target)/classes/index.jelly");
private transient boolean supportsPortFileName;
+ private volatile boolean stopping = false;
+
/**
* Add some plugins to the test classpath.
*
@@ -386,9 +388,7 @@ public final class RealJenkinsRule implements TestRule {
System.out.println("Will load plugins: " + Stream.of(plugins.list()).filter(n -> n.matches(".+[.][hj]p[il]")).sorted().collect(Collectors.joining(" ")));
base.evaluate();
} finally {
- if (proc != null) {
- stopJenkins();
- }
+ stopJenkins();
try {
tmp.dispose();
} catch (Exception x) {
@@ -615,18 +615,31 @@ public final class RealJenkinsRule implements TestRule {
}
}
+ /*
+ * Stops Jenkins and releases any system resources associated
+ * with it. If Jenkins is already stopped then invoking this
+ * method has no effect.
+ */
public void stopJenkins() throws Throwable {
- endpoint("exit").openStream().close();
- if (!proc.waitFor(60, TimeUnit.SECONDS) ) {
- System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process");
- proc.destroyForcibly();
- throw new AssertionError("Jenkins failed to terminate within 60 seconds");
+ if (!stopping) {
+ stopping = true;
+ endpoint("exit").openStream().close();
}
- int exitValue = proc.exitValue();
- if (exitValue != 0) {
- throw new AssertionError("nonzero exit code: " + exitValue);
+ if (proc != null) {
+ try {
+ if (!proc.waitFor(60, TimeUnit.SECONDS)) {
+ System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process");
+ proc.destroyForcibly();
+ throw new AssertionError("Jenkins failed to terminate within 60 seconds");
+ }
+ int exitValue = proc.exitValue();
+ if (exitValue != 0) {
+ throw new AssertionError("nonzero exit code: " + exitValue);
+ }
+ } finally {
+ proc = null;
+ }
}
- proc = null;
}
This accomplishes the following:
- Prevents
/exit
from being called twice by guarding it on astopping
boolean. We don't want it to be called more than once because the first call will prevent the second from succeeding. - Sets
proc
to null even when an assertion is thrown if Jenkins failed to terminate within 60 seconds or terminates with a non-zero exit code. This allows the call tostopJenkins
from#then
to fail as expected and then allows the second call tostopJenkins
fromRealJenkinsRule#apply
to not try to kill the process a second time (which is unnecessary and might confuse the user with a secondary failure mode).
Thanks @basil , I agree with your suggestion and applied it in 70722b9. With this changes we could clearly see AssertionError (it is not masked now). The only question is, do we want |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code change looks good. Regarding synchronization, I do not think it is needed but it could not hurt. Simplest would be to use public synchronized void stopJenkins
and remove volatile from stopping
. Or keep it simple and do nothing, since we do not expect this code to be used in a multithreaded context.
Now that stopJenkins
is doing the check for proc != null
internally the same check is redundant in the caller in RealJenkinsRule#apply
and can be deleted:
@@ -386,9 +388,7 @@ public final class RealJenkinsRule implements TestRule {
System.out.println("Will load plugins: " + Stream.of(plugins.list()).filter(n -> n.matches(".+[.][hj]p[il]")).sorted().collect(Collectors.joining(" ")));
base.evaluate();
} finally {
- if (proc != null) {
- stopJenkins();
- }
+ stopJenkins();
try {
tmp.dispose();
} catch (Exception x) {
However given my earlier point …
RealJenkinsRule#then
always stops Jenkins after it is started, so the only reason for thestopJenkins
call inRealJenkinsRule#apply
must be as a fallback if someone started Jenkins by some other means but forgot to stop it: for example, by invokingRealJenkinsRule#startJenkins
directly from a test but forgetting to callRealJenkinsRule#stopJenkins
. It is certainly better to stop Jenkins late rather than never. (It might be good to log a warning before doing so, since this indicates a programming error.)
… it may be wise to log a warning in this case. This is basically serving the purpose of a finalizer as described in Joshua Bloch's Effective Java:
So what, if anything, are finalizers good for? There are perhaps two legitimate uses. One is to act as a “safety net” in case the owner of an object forgets to call its explicit termination method. While there’s no guarantee that the finalizer will be invoked promptly, it may be better to free the resource late than never, in those (hopefully rare) cases when the client fails to call the explicit termination method. But the finalizer should log a warning if it finds that the resource has not been terminated, as this indicates a bug in the client code, which should be fixed. If you are considering writing such a safety-net finalizer, think long and hard about whether the extra protection is worth the extra cost.
Putting Joshua Bloch's advice into practice:
@@ -387,8 +387,9 @@ public final class RealJenkinsRule implements TestRule {
base.evaluate();
} finally {
if (proc != null) {
- stopJenkins();
+ System.err.println("Jenkins still running: Prefer #then, or ensure that every call to #startJenkins has a matching call to #stopJenkins");
}
+ stopJenkins();
try {
tmp.dispose();
} catch (Exception x) {
@@ -607,6 +616,9 @@ private static boolean supportsPortFileName(String war) throws IOException { | |||
} | |||
|
|||
private static int readPort(File portFile) throws IOException { | |||
if (!portFile.canRead()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s just my assumption, because I still can’t reproduce the failure, but perhaps something wrong with the file read, could it be the case when file read happens at the same time when port value written and we receive IOException? (reading the logs more carefully I see that when flakes happens the test overall time about 2-3 seconds (usual time 15-20 sec) it means that the Jenkins is just above the start and it’s almost immediately stops)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove this log before merge, I just want somehow to run failing tests on a ci using my fork and examine the logs then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I managed to reproduce locally flaky failure and the reason is because port value is still empty we catch NumberFormatException and throw AssertionError (it was lost before and now visible because of 61659b8), what are your thoughts if we retry to read it until port gets populated? see my idea in 8106c20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well that is an interesting finding :)
https://github.com/jenkinsci/winstone/blob/4754da67bf797b906a5c858b2de63b97ac609543/src/main/java/winstone/Launcher.java#L232-L234 is indeed a source of potential flakyness (that may or may not be your flake) but it looks like it could be.
The file and its contents are not created atomically - so if you get very unlucky the file will exist before it is being written to.
The upstream could be changed (maybe it should) to use an atomic write (write and flush to a intermediate file, then move the file in place with an atomic flag set, but this may really be overkill). if you want to fix the flakes in the interim then probably if the port number is not valid sleep 5 seconds and try again once only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the right fix would be 49689e7 . It is the while loop which is waiting until Jenkins complete the start procedure, so the condition when file is ready to provide the port value is not fully correct. (I'm going to test my changes and run flaky tests with this changes to see if they will be still flaky).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that should do the trick!
I would probably move that logic to readPort and return 0 if the file is empty which would cause the loop to retry as currently there is a bit of a mix of concerns between the two functions.
Ok, seems the issue is fixed with this changes, I’ve run flaky RemoteArtifactCopyBuilderTest hundreds of times and all runs are good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be mixing together several different changes:
- Only attempt to stop
proc
once. Seems like a clear bug fix. - Handle an empty port file. OK, though why would the Surefire JVM be seeing a half-written file to begin with? Did whatever code writes this file neglect to save to a temporary file and atomically rename it?
- Suppressing some exceptions, which sounds wrong.
- Some other changes of unclear importance.
} catch (Throwable t) { | ||
LOGGER.log(Level.WARNING, "Something went wrong", t); | ||
throw t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand the purpose of this. If an error is thrown out of a test, the test fails and the error is displayed by the test runner right?
} catch (Throwable t) { | |
LOGGER.log(Level.WARNING, "Something went wrong", t); | |
throw t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lost a lot of time understanding root cause with empty port because first exception was lost when finally throw it’s own exception, I still suggest leave it to at least log an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lost a lot of time understanding root cause with empty port because first exception was lost when finally throw it’s own exception
Sounds like a bug somewhere, the first exception should be a supressed
exception of whatever was thrown in the finally
Perhaps a bug in junit runner or surefire or somewhere else where it is ignoring supressed exceptions (I expect there could be multiple cases of this).
https://issues.apache.org/jira/browse/SUREFIRE-1772 possibly.
@@ -528,7 +538,7 @@ public void startJenkins() throws Throwable { | |||
new StreamCopyThread(description.toString(), proc.getErrorStream(), System.err).start(); | |||
int tries = 0; | |||
while (true) { | |||
if (port == 0 && portFile != null && portFile.exists()) { | |||
if (port == 0 && portFile != null && portFile.canRead()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the practical difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation for this change is simple, Jenkins-test-harness does not control the way portFile gets created, potentially it could be created with some security/access options, so I suggest this change, if we are sure that it always accessible for a read once created we could revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if it exists and we can not read it - we won;t ever be able to read it and rather than throwing an IOException
when we attempt to read it we wait forever in the hope that we can read it (when we never would be able to)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially it could be created with some security/access options
Why would that happen? Is this something you observed, or just speculated on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
speculatively, I've not observed it. I don't mind and will revert it
In general this PR fixes 3 different problems:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concerns have been addressed here.
@@ -528,7 +538,7 @@ public void startJenkins() throws Throwable { | |||
new StreamCopyThread(description.toString(), proc.getErrorStream(), System.err).start(); | |||
int tries = 0; | |||
while (true) { | |||
if (port == 0 && portFile != null && portFile.exists()) { | |||
if (port == 0 && portFile != null && portFile.canRead()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially it could be created with some security/access options
Why would that happen? Is this something you observed, or just speculated on?
if (s.isEmpty()) { | ||
LOGGER.warning(() -> String.format("PortFile: %s exists, but value is still not written", portFile.getAbsolutePath())); | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, where is the code that writes this file, and can it be improved to do so atomically? Please fix here if in JTH. This logic is OK as a workaround if that code is in Winstone and you file a patch for it but it would take a long time for actual tests to pick it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so then that should be patched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a PR was filed to fix the underlying problem in Winstone, I could accept this workaround as a temporary measure. But since no PR has been filed to fix the underlying problem in Winstone, I am concerned that accepting this workaround will decrease the incentive to implement a fix for the underlying problem in Winstone. In the absence of such a fix to Winstone this workaround may not be temporary but rather it may become permanent (i.e., technical debt). The Winstone fix seems trivial and could likely be coded in less than an hour, so my preference would be for it to be implemented prior to the addition of this workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, and thank you! Could we now add a comment to the (indeed, temporary!) workaround being added in this PR that says something to the effect of:
Work around […] in Winstone releases prior to […]. When Winstone […] has been widely adopted, this can be deleted.
The reason I find it important to leave such comments is that without them, it is difficult to know when it is safe to remove the workaround in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment added in 1479165
yes, as per #465 (comment)
given the lag for getting a winstone change then integrated into jenkins and then LTS etc, having a workaround here is believe ok. Could this PR be split so it was more obvious what each individual change was doing, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced of the change to canRead
from exists
for the port file, but given I do not expect to be trying to read a port file with a different security setting I also do not expect to be timing out for ever without the diagnosis of the IOException
That would be OK, though I am also OK with the larger PR so long as the title explains that this is fixing flakiness related to port reading and the description enumerates the changes contributing to that fix. Given that there have been significant changes since #465 (comment) I guess it would be prudent to reverify effectiveness. |
|
Given the patch is likely that big for winstone I think it would be worth attempting a real fix there, we can get it out quickly and probably into the next LTS as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has it been (re-)tested at scale yet?
In progress, no failures at that moment https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/49/testReport/ . RemoteArtifactCopyBuilderTest are slow, I expect it to finish tomorrow. |
Re-Tested successfully, all of 7200 runs are green |
Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
Cloudbees CI reported a number of flaky tests encountering in operations-center-context plugin
Logs:
RemoteArtifactCopyBuilderTest.txt
RemoteArtifactCopyBuilderTest_stacktrace.txt
Changes made in scope of this PR: