Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty port to fix RealJenkinsRule flakiness #465

Merged
merged 4 commits into from Aug 8, 2022

Conversation

Pldi23
Copy link
Contributor

@Pldi23 Pldi23 commented Jul 26, 2022

Cloudbees CI reported a number of flaky tests encountering in operations-center-context plugin

Logs:
RemoteArtifactCopyBuilderTest.txt
RemoteArtifactCopyBuilderTest_stacktrace.txt

Changes made in scope of this PR:

  1. Handle empty port. 514bfbf
  2. Stop proc once. 38b7261

@@ -620,7 +620,8 @@ public void stopJenkins() throws Throwable {
if (!proc.waitFor(60, TimeUnit.SECONDS) ) {
System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process");
proc.destroyForcibly();
throw new AssertionError("Jenkins failed to terminate within 60 seconds");
proc = null;
Copy link
Contributor Author

@Pldi23 Pldi23 Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when test flakes and fails:

  1. test runs well and going to stopJenkins at the end:
  2. stopJenkins hit timeout:
    if (!proc.waitFor(60, TimeUnit.SECONDS) ) {
  3. AssertExceptions thown as a result
  4. finally block executed
    if (proc != null) {
    stopJenkins();
    }
    try {
    tmp.dispose();
    } catch (Exception x) {
    LOGGER.log(Level.WARNING, null, x);
    }
    and because proc != null tries to stop Jenkins again:
  5. ConnectionException thrown:

Copy link
Contributor Author

@Pldi23 Pldi23 Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to set proc = null like

proc.destroyForcibly();
proc = null;
instead of throwing AssertError

@jtnord jtnord requested a review from jglick July 26, 2022 13:32
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting proc = null seems uncontroversial, just a bug fix. Replacing the AssertionError with a potentially passing test seems like it is just masking an actual problem in the test.

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Pldi23, thank you for your contribution to the open-source Jenkins test harness.

RealJenkinsRule#then always stops Jenkins after it is started, so the only reason for the stopJenkins call in RealJenkinsRule#apply must be as a fallback if someone started Jenkins by some other means but forgot to stop it: for example, by invoking RealJenkinsRule#startJenkins directly from a test but forgetting to call RealJenkinsRule#stopJenkins. It is certainly better to stop Jenkins late rather than never. (It might be good to log a warning before doing so, since this indicates a programming error.)

It does seem wrong that we would ever try to stop Jenkins via the /exit endpoint twice: it is likely to fail the second time as you observed. Taking inspiration from the Java Closeable#close interface, I think these semantics would be desirable:

If the stream is already closed then invoking this method has no effect.

In other words, the way close() is usually implemented in Java Closeables is to have some internal state regarding whether the resource has already been closed and to do nothing if it already has been closed. This is, in some way, indicated by proc being non-null. However, if we take the above design as the goal, then the current logic has two problems:

  • The check for proc being null happens outside the call to stopJenkins rather than within it, violating the contract that invoking the method a second time has no effect
  • Once we are past the critical point (calling /exit) no state is recorded that would prevent us from trying to invoke that endpoint a second time (erroneously!)

My concrete suggestion would be something like this:

diff --git a/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java b/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
index d856b02..ca04515 100644
--- a/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
+++ b/src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java
@@ -180,12 +180,14 @@ public final class RealJenkinsRule implements TestRule {
 
-     private Process proc;
+    private volatile Process proc;
 
     // TODO may need to be relaxed for Gradle-based plugins
     private static final Pattern SNAPSHOT_INDEX_JELLY = Pattern.compile("(file:/.+/target)/classes/index.jelly");
     private transient boolean supportsPortFileName;
 
+    private volatile boolean stopping = false;
+
     /**
      * Add some plugins to the test classpath.
      *
@@ -386,9 +388,7 @@ public final class RealJenkinsRule implements TestRule {
                     System.out.println("Will load plugins: " + Stream.of(plugins.list()).filter(n -> n.matches(".+[.][hj]p[il]")).sorted().collect(Collectors.joining(" ")));
                     base.evaluate();
                 } finally {
-                    if (proc != null) {
-                        stopJenkins();
-                    }
+                    stopJenkins();
                     try {
                         tmp.dispose();
                     } catch (Exception x) {
@@ -615,18 +615,31 @@ public final class RealJenkinsRule implements TestRule {
         }
     }
 
+    /*
+     * Stops Jenkins and releases any system resources associated
+     * with it. If Jenkins is already stopped then invoking this
+     * method has no effect.
+     */
     public void stopJenkins() throws Throwable {
-        endpoint("exit").openStream().close();
-        if (!proc.waitFor(60, TimeUnit.SECONDS) ) {
-            System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process");
-            proc.destroyForcibly();
-            throw new AssertionError("Jenkins failed to terminate within 60 seconds");
+        if (!stopping) {
+            stopping = true;
+            endpoint("exit").openStream().close();
         }
-        int exitValue = proc.exitValue();
-        if (exitValue != 0) {
-            throw new AssertionError("nonzero exit code: " + exitValue);
+        if (proc != null) {
+            try {
+                if (!proc.waitFor(60, TimeUnit.SECONDS)) {
+                    System.err.println("Jenkins failed to stop within 60 seconds, attempting to kill the Jenkins process");
+                    proc.destroyForcibly();
+                    throw new AssertionError("Jenkins failed to terminate within 60 seconds");
+                }
+                int exitValue = proc.exitValue();
+                if (exitValue != 0) {
+                    throw new AssertionError("nonzero exit code: " + exitValue);
+                }
+            } finally {
+                proc = null;
+            }
         }
-        proc = null;
     }

This accomplishes the following:

  • Prevents /exit from being called twice by guarding it on a stopping boolean. We don't want it to be called more than once because the first call will prevent the second from succeeding.
  • Sets proc to null even when an assertion is thrown if Jenkins failed to terminate within 60 seconds or terminates with a non-zero exit code. This allows the call to stopJenkins from #then to fail as expected and then allows the second call to stopJenkins from RealJenkinsRule#apply to not try to kill the process a second time (which is unnecessary and might confuse the user with a secondary failure mode).

@Pldi23
Copy link
Contributor Author

Pldi23 commented Jul 27, 2022

Thanks @basil , I agree with your suggestion and applied it in 70722b9. With this changes we could clearly see AssertionError (it is not masked now). The only question is, do we want Process to be volatile? From what I know it will not take any effect and do we really need proc to be thread safe here? If yes we need to use ThreadLocal or other mechanism to synchronise Process, but from what I see it is not needed to be atomic, but I might be missing something.

basil
basil previously approved these changes Jul 27, 2022
Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change looks good. Regarding synchronization, I do not think it is needed but it could not hurt. Simplest would be to use public synchronized void stopJenkins and remove volatile from stopping. Or keep it simple and do nothing, since we do not expect this code to be used in a multithreaded context.

Now that stopJenkins is doing the check for proc != null internally the same check is redundant in the caller in RealJenkinsRule#apply and can be deleted:

@@ -386,9 +388,7 @@ public final class RealJenkinsRule implements TestRule {
                     System.out.println("Will load plugins: " + Stream.of(plugins.list()).filter(n -> n.matches(".+[.][hj]p[il]")).sorted().collect(Collectors.joining(" ")));
                     base.evaluate();
                 } finally {
-                    if (proc != null) {
-                        stopJenkins();
-                    }
+                    stopJenkins();
                     try {
                         tmp.dispose();
                     } catch (Exception x) {

However given my earlier point …

RealJenkinsRule#then always stops Jenkins after it is started, so the only reason for the stopJenkins call in RealJenkinsRule#apply must be as a fallback if someone started Jenkins by some other means but forgot to stop it: for example, by invoking RealJenkinsRule#startJenkins directly from a test but forgetting to call RealJenkinsRule#stopJenkins. It is certainly better to stop Jenkins late rather than never. (It might be good to log a warning before doing so, since this indicates a programming error.)

… it may be wise to log a warning in this case. This is basically serving the purpose of a finalizer as described in Joshua Bloch's Effective Java:

So what, if anything, are finalizers good for? There are perhaps two legitimate uses. One is to act as a “safety net” in case the owner of an object forgets to call its explicit termination method. While there’s no guarantee that the finalizer will be invoked promptly, it may be better to free the resource late than never, in those (hopefully rare) cases when the client fails to call the explicit termination method. But the finalizer should log a warning if it finds that the resource has not been terminated, as this indicates a bug in the client code, which should be fixed. If you are considering writing such a safety-net finalizer, think long and hard about whether the extra protection is worth the extra cost.

Putting Joshua Bloch's advice into practice:

@@ -387,8 +387,9 @@ public final class RealJenkinsRule implements TestRule {
                     base.evaluate();
                 } finally {
                     if (proc != null) {
-                        stopJenkins();
+                        System.err.println("Jenkins still running: Prefer #then, or ensure that every call to #startJenkins has a matching call to #stopJenkins");
                     }
+                    stopJenkins();
                     try {
                         tmp.dispose();
                     } catch (Exception x) {

src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
@@ -607,6 +616,9 @@ private static boolean supportsPortFileName(String war) throws IOException {
}

private static int readPort(File portFile) throws IOException {
if (!portFile.canRead()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s just my assumption, because I still can’t reproduce the failure, but perhaps something wrong with the file read, could it be the case when file read happens at the same time when port value written and we receive IOException? (reading the logs more carefully I see that when flakes happens the test overall time about 2-3 seconds (usual time 15-20 sec) it means that the Jenkins is just above the start and it’s almost immediately stops)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove this log before merge, I just want somehow to run failing tests on a ci using my fork and examine the logs then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to reproduce locally flaky failure and the reason is because port value is still empty we catch NumberFormatException and throw AssertionError (it was lost before and now visible because of 61659b8), what are your thoughts if we retry to read it until port gets populated? see my idea in 8106c20.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well that is an interesting finding :)

https://github.com/jenkinsci/winstone/blob/4754da67bf797b906a5c858b2de63b97ac609543/src/main/java/winstone/Launcher.java#L232-L234 is indeed a source of potential flakyness (that may or may not be your flake) but it looks like it could be.

The file and its contents are not created atomically - so if you get very unlucky the file will exist before it is being written to.
The upstream could be changed (maybe it should) to use an atomic write (write and flush to a intermediate file, then move the file in place with an atomic flag set, but this may really be overkill). if you want to fix the flakes in the interim then probably if the port number is not valid sleep 5 seconds and try again once only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the right fix would be 49689e7 . It is the while loop which is waiting until Jenkins complete the start procedure, so the condition when file is ready to provide the port value is not fully correct. (I'm going to test my changes and run flaky tests with this changes to see if they will be still flaky).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that should do the trick!
I would probably move that logic to readPort and return 0 if the file is empty which would cause the loop to retry as currently there is a bit of a mix of concerns between the two functions.

@Pldi23
Copy link
Contributor Author

Pldi23 commented Aug 2, 2022

Ok, seems the issue is fixed with this changes, I’ve run flaky RemoteArtifactCopyBuilderTest hundreds of times and all runs are good.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be mixing together several different changes:

  • Only attempt to stop proc once. Seems like a clear bug fix.
  • Handle an empty port file. OK, though why would the Surefire JVM be seeing a half-written file to begin with? Did whatever code writes this file neglect to save to a temporary file and atomically rename it?
  • Suppressing some exceptions, which sounds wrong.
  • Some other changes of unclear importance.

Comment on lines +392 to +390
} catch (Throwable t) {
LOGGER.log(Level.WARNING, "Something went wrong", t);
throw t;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the purpose of this. If an error is thrown out of a test, the test fails and the error is displayed by the test runner right?

Suggested change
} catch (Throwable t) {
LOGGER.log(Level.WARNING, "Something went wrong", t);
throw t;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lost a lot of time understanding root cause with empty port because first exception was lost when finally throw it’s own exception, I still suggest leave it to at least log an exception.

Copy link
Member

@jtnord jtnord Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lost a lot of time understanding root cause with empty port because first exception was lost when finally throw it’s own exception

Sounds like a bug somewhere, the first exception should be a supressed exception of whatever was thrown in the finally

Perhaps a bug in junit runner or surefire or somewhere else where it is ignoring supressed exceptions (I expect there could be multiple cases of this).

https://issues.apache.org/jira/browse/SUREFIRE-1772 possibly.

src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
src/main/java/org/jvnet/hudson/test/RealJenkinsRule.java Outdated Show resolved Hide resolved
@@ -528,7 +538,7 @@ public void startJenkins() throws Throwable {
new StreamCopyThread(description.toString(), proc.getErrorStream(), System.err).start();
int tries = 0;
while (true) {
if (port == 0 && portFile != null && portFile.exists()) {
if (port == 0 && portFile != null && portFile.canRead()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the practical difference here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation for this change is simple, Jenkins-test-harness does not control the way portFile gets created, potentially it could be created with some security/access options, so I suggest this change, if we are sure that it always accessible for a read once created we could revert this change.

Copy link
Member

@jtnord jtnord Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if it exists and we can not read it - we won;t ever be able to read it and rather than throwing an IOException when we attempt to read it we wait forever in the hope that we can read it (when we never would be able to)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially it could be created with some security/access options

Why would that happen? Is this something you observed, or just speculated on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speculatively, I've not observed it. I don't mind and will revert it

@Pldi23
Copy link
Contributor Author

Pldi23 commented Aug 3, 2022

This seems to be mixing together several different changes:

  • Only attempt to stop proc once. Seems like a clear bug fix.
  • Handle an empty port file. OK, though why would the Surefire JVM be seeing a half-written file to begin with? Did whatever code writes this file neglect to save to a temporary file and atomically rename it?
  • Suppressing some exceptions, which sounds wrong.
  • Some other changes of unclear importance.

In general this PR fixes 3 different problems:

  1. stop proc once.
  2. fix empty port failure (the main reason tests are flaky)
  3. avoid lose an exception when finally throws it’s own exception (I open for a discussion, if we don't want that change I could revert it)

@Pldi23 Pldi23 changed the title set process to null when stop Jenkins hit timeout stop proc once and wait until port is not empty Aug 3, 2022
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concerns have been addressed here.

@@ -528,7 +538,7 @@ public void startJenkins() throws Throwable {
new StreamCopyThread(description.toString(), proc.getErrorStream(), System.err).start();
int tries = 0;
while (true) {
if (port == 0 && portFile != null && portFile.exists()) {
if (port == 0 && portFile != null && portFile.canRead()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially it could be created with some security/access options

Why would that happen? Is this something you observed, or just speculated on?

Comment on lines +612 to +615
if (s.isEmpty()) {
LOGGER.warning(() -> String.format("PortFile: %s exists, but value is still not written", portFile.getAbsolutePath()));
return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, where is the code that writes this file, and can it be improved to do so atomically? Please fix here if in JTH. This logic is OK as a workaround if that code is in Winstone and you file a patch for it but it would take a long time for actual tests to pick it up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so then that should be patched.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a PR was filed to fix the underlying problem in Winstone, I could accept this workaround as a temporary measure. But since no PR has been filed to fix the underlying problem in Winstone, I am concerned that accepting this workaround will decrease the incentive to implement a fix for the underlying problem in Winstone. In the absence of such a fix to Winstone this workaround may not be temporary but rather it may become permanent (i.e., technical debt). The Winstone fix seems trivial and could likely be coded in less than an hour, so my preference would be for it to be implemented prior to the addition of this workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@basil I created a PR#249 in Winstone to unblock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, and thank you! Could we now add a comment to the (indeed, temporary!) workaround being added in this PR that says something to the effect of:

Work around […] in Winstone releases prior to […]. When Winstone […] has been widely adopted, this can be deleted.

The reason I find it important to leave such comments is that without them, it is difficult to know when it is safe to remove the workaround in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment added in 1479165

@jglick jglick requested a review from jtnord August 3, 2022 13:07
@jtnord
Copy link
Member

jtnord commented Aug 3, 2022

  • Handle an empty port file. OK, though why would the Surefire JVM be seeing a half-written file to begin with? Did whatever code writes this file neglect to save to a temporary file and atomically rename it?

yes, as per #465 (comment)

https://github.com/jenkinsci/winstone/blob/4754da67bf797b906a5c858b2de63b97ac609543/src/main/java/winstone/Launcher.java#L232-L234 is indeed a source of potential flakyness (that may or may not be your flake) but it looks like it could be.

The file and its contents are not created atomically - so if you get very unlucky the file will exist before it is being written to.
The upstream could be changed (maybe it should) to use an atomic write (write and flush to a intermediate file, then move the file in place with an atomic flag set, but this may really be overkill). if you want to fix the flakes in the interim then probably if the port number is not valid sleep 5 seconds and try again once only.

given the lag for getting a winstone change then integrated into jenkins and then LTS etc, having a workaround here is believe ok.

Could this PR be split so it was more obvious what each individual change was doing,

Copy link
Member

@jtnord jtnord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced of the change to canRead from exists for the port file, but given I do not expect to be trying to read a port file with a different security setting I also do not expect to be timing out for ever without the diagnosis of the IOException

@jglick
Copy link
Member

jglick commented Aug 3, 2022

Could this PR be split so it was more obvious what each individual change was doing,

That would be OK, though I am also OK with the larger PR so long as the title explains that this is fixing flakiness related to port reading and the description enumerates the changes contributing to that fix.

Given that there have been significant changes since #465 (comment) I guess it would be prudent to reverify effectiveness.

@jglick jglick added the bug label Aug 3, 2022
@Pldi23 Pldi23 changed the title stop proc once and wait until port is not empty Handle empty port to fix RealJenkinsRule flakiness Aug 4, 2022
@Pldi23
Copy link
Contributor Author

Pldi23 commented Aug 4, 2022

  1. History squashed into 2 commits, PR name and description was also updated. (I decided to leave 1 PR and don't split it into 2 to make it easier to run flakebusters automations for flaky test)
  2. Submitted a bug for Winstone to track.
  3. Plan to re-run flaky tests with a patch.
  4. canRead change reverted to exists

@timja
Copy link
Member

timja commented Aug 4, 2022

Given the patch is likely that big for winstone I think it would be worth attempting a real fix there, we can get it out quickly and probably into the next LTS as well.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has it been (re-)tested at scale yet?

@Pldi23
Copy link
Contributor Author

Pldi23 commented Aug 4, 2022

Has it been (re-)tested at scale yet?

In progress, no failures at that moment https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/49/testReport/ .

RemoteArtifactCopyBuilderTest are slow, I expect it to finish tomorrow.

@Pldi23
Copy link
Contributor Author

Pldi23 commented Aug 8, 2022

Re-Tested successfully, all of 7200 runs are green

Co-authored-by: Tim Jacomb <21194782+timja@users.noreply.github.com>
@jglick jglick merged commit 6bcc1a7 into jenkinsci:master Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants