[JENKINS-61103] Retry on class resource load failures and introduce timeouts #379

jeffret-b · 2020-04-21T19:26:30Z

See JENKINS-61103. This is an alternative approach to #372.

There have been a couple of previous efforts to introduce retries into the RemoteClassLoader. These have reportedly resolved some situations; however, we have continued to receive reports of class loading failures. As noted in JENKINS-61103 and the earlier PR, one area that isn't covered by a retry is resource loading as part of a class.

This PR does three things:

Fixes some of the tests so that they work correctly. Some of them weren't really testing what they claimed to. Some could be simplified a little.
Introduces timeouts and sleep to the retries. This should slow down further attempts that may quickly fail and provide a way to eventually terminate if the problem doesn't resolve. I add this to the two existing retry areas and to the new one. I arbitrarily chose 10 minutes as the total timeout time. It's much smaller than the previous value (infinite), but still a significant amount of time.
Adds retries to findResource() using the same pattern.
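
A minimal sketch of the retry-with-pause-and-deadline pattern described in the list above. This is not the code from this PR: `RetrySketch`, `callWithRetry`, the pause length, and the choice to restore the interrupt flag on exit are all assumptions made for illustration. Only the 10-minute total budget comes from the description.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper: names and structure are illustrative, not the PR's actual code. */
final class RetrySketch {
    private RetrySketch() {}

    /**
     * Calls {@code action} until it succeeds or {@code totalTimeout} has elapsed,
     * pausing between attempts. Only InterruptedException triggers a retry;
     * any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> action, Duration totalTimeout, Duration pause)
            throws Exception {
        final long deadline = System.nanoTime() + totalTimeout.toNanos();
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return action.call();
                } catch (InterruptedException e) {
                    interrupted = true;
                    if (System.nanoTime() - deadline >= 0) {
                        throw e; // total timeout reached: stop retrying
                    }
                }
                try {
                    Thread.sleep(pause.toMillis()); // slow down attempts that fail quickly
                } catch (InterruptedException e) {
                    interrupted = true; // interrupted while pausing: retry right away
                }
            }
        } finally {
            if (interrupted) {
                // Assumption: leave the interrupt visible to the caller once we are done.
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (calls[0]++ < 2) throw new InterruptedException("simulated interruption");
            return "loaded";
        }, Duration.ofMinutes(10), Duration.ofMillis(100));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Running `main` prints `loaded after 3 attempts`: two simulated interruptions are retried and the third call succeeds. With an action that never succeeds, the loop ends once the total timeout has passed and the last InterruptedException is rethrown.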

These changes only get involved in truly exceptional conditions. I've never been able to reproduce them directly. Other reports of class loading failures have similarly lacked for reproducibility. This has two major impacts:

It's very difficult to test how much this fix will impact the situation, how well it improves things, or whether it will correct any of the reported situations.
Any problems introduced will not have widespread effect. If the retry period is not ideal, it won't affect the great majority of situations. This makes me feel like it is worthwhile to try out this change to see how well it helps.

This change might help for JENKINS-51854 and JENKINS-514910.

jvz

Looks like a nice improvement. If this turns out to be helpful, perhaps this can be enhanced later to use a more generic retry strategy?

jeffret-b · 2020-04-21T22:00:36Z

If this turns out to be helpful, perhaps this can be enhanced later to use a more generic retry strategy?

Could be. I'm trying to keep it understandable and not too much of a change. There are other patterns that could be used, including the approach in the earlier PR. I couldn't get that one to work out right, including with the tests.

I'd also like to investigate some retries on channel failures, some sort of automated reconnection. That seems to be an issue that comes up fairly frequently. Might be kind of complicated.

jeffret-b · 2020-04-22T14:24:44Z

I'm hoping to get a few more reviews before proceeding on this, so please take a look.

jvz · 2020-04-22T14:49:49Z

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

basil · 2020-04-22T14:54:34Z

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

Failsafe is another option. Since we're on the subject, I love Tenacity's documentation on retrying. It so clearly and concisely explains the pros and cons of various retrying strategies.

res0nance

LGTM, infinite retries stop being infinite but it doesn't feel like retries being infinite would have ever helped. Hopefully this can help surface reproducible cases.

jvz · 2020-04-22T14:57:52Z

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

Failsafe is another option. Since we're on the subject, I love Tenacity's documentation on retrying. It so clearly and concisely explains the pros and cons of various retrying strategies.

Failsafe looks really neat!

jeffret-b · 2020-04-22T15:00:48Z

Thanks for the comments and reviews. As Matt mentioned earlier, I'm trying to keep this change relatively simple for now. Those libraries might be useful later. At a minimum, maybe we can start iterating towards more information on failures and reproducible cases.

thomasgl-orange · 2020-04-22T23:24:36Z

I have done a few manual tests of a 4.4-SNAPSHOT built from this PR (with a patched Jenkins 2.222.1), using the procedure I had described in JENKINS-61103:

disconnect the agent
delete its jar cache (not sure it really matters, I've just assumed it would give me more opportunities to interrupt class loading while doing remoting stuff)
reconnect it
run a simple Maven job (here I'm running the "clean" target on the jenkinsci/remoting repository, but it doesn't matter)
interrupt the job when it starts talking about the POM

After a few tries, I've reproduced a LinkageError caused by a RemotingSystemException of InterruptedException. Following attempts at running Maven job then fail with a NoClassDefFoundError, for the reasons described in t.
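
The two-step failure reported above (an ExceptionInInitializerError on the interrupted build, then NoClassDefFoundError on later builds) matches standard JVM behavior: once a static initializer has failed, the class stays unusable in that class loader. A self-contained sketch, where `InitFailureDemo` and `Fragile` are hypothetical stand-ins and an IllegalStateException plays the role of the interrupted remote class load:

```java
/** Hypothetical demo class; not part of Remoting or the Maven plugin. */
final class InitFailureDemo {
    private InitFailureDemo() {}

    /** Stand-in for a class whose static initializer fails, e.g. on an interrupted load. */
    static final class Fragile {
        static final Object STATE = init();

        private static Object init() {
            throw new IllegalStateException("simulated interrupted class load");
        }

        static void use() {
            // Calling any static method forces class initialization first.
        }
    }

    /** Tries to use Fragile and reports the type of the failure, if any. */
    static String attempt() {
        try {
            Fragile.use();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println("first:  " + attempt());
        System.out.println("second: " + attempt());
    }
}
```

Run on its own, the first attempt reports `java.lang.ExceptionInInitializerError` and the second reports `java.lang.NoClassDefFoundError`. The class is never initialized again, which is why only restarting the agent (and so getting a fresh class loader) clears the error.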

Log of the interrupted build:

[...]
23:43:16  > git checkout -f 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:43:18 Commit message: "Merge pull request #378 from jeffret-b/updateRCL"
23:43:18  > git rev-list --no-walk 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:43:20 Parsing POMs
23:43:20 using global settings config with name FaaS network settings
23:43:20 Replacing all maven server entries not found in credentials list is true
23:43:23 Build was aborted
23:43:23 Aborted by Thomas De Grenier De La Tour
23:43:23 Started calculate disk usage of build
23:43:23 Finished Calculation of disk usage of build in 0 seconds
23:43:23 Started calculate disk usage of workspace
23:43:24 Stop LogSizeTimerTask
23:43:24 Finished Calculation of disk usage of workspace in  1 second
23:43:24 [WS-CLEANUP] Deleting project workspace...
23:43:25 [WS-CLEANUP] done
23:43:25 Finished: ABORTED

Exception in the agent log:

Apr 22, 2020 11:43:23 PM hudson.remoting.UserRequest perform
WARNING: LinkageError while performing UserRequest:hudson.maven.MavenModuleSetBuild$PomParser@4480bf1e
java.lang.ExceptionInInitializerError
	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:182)
	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:168)
	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:166)
	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:159)
	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:110)
	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:137)
	at hudson.maven.MavenUtil.createEmbedder(MavenUtil.java:211)
	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1323)
	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1126)
	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3069)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:368)
	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: hudson.remoting.RemotingSystemException: java.lang.InterruptedException
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:301)
	at com.sun.proxy.$Proxy5.fetch(Unknown Source)
	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:294)
	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:257)
	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:216)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.eclipse.sisu.inject.Weak.concurrentKeys(Weak.java:89)
	at org.eclipse.sisu.inject.Weak.concurrentKeys(Weak.java:79)
	at org.eclipse.sisu.plexus.ClassRealmManager.<clinit>(ClassRealmManager.java:66)
	... 18 more
Caused by: java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:176)
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)
	... 27 more

Log of the next build attempt:

[...]
23:45:04  > git checkout -f 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:45:04 Commit message: "Merge pull request #378 from jeffret-b/updateRCL"
23:45:04  > git rev-list --no-walk 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:45:04 Parsing POMs
23:45:04 using global settings config with name FaaS network settings
23:45:04 Replacing all maven server entries not found in credentials list is true
23:45:04 ERROR: Failed to parse POMs
23:45:04 java.io.IOException: Remote call on faas-tmp-thomas-21158-2ezsv225r failed
23:45:04 	at hudson.remoting.Channel.call(Channel.java:1004)
23:45:04 	at hudson.FilePath.act(FilePath.java:1069)
23:45:04 	at hudson.FilePath.act(FilePath.java:1058)
23:45:04 	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.parsePoms(MavenModuleSetBuild.java:987)
23:45:04 	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:691)
23:45:04 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
23:45:04 	at hudson.model.Run.execute(Run.java:1856)
23:45:04 	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
23:45:04 	at hudson.model.ResourceController.execute(ResourceController.java:97)
23:45:04 	at hudson.model.Executor.run(Executor.java:428)
23:45:04 Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.eclipse.sisu.plexus.ClassRealmManager
23:45:04 	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:182)
23:45:04 	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:168)
23:45:04 	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:166)
23:45:04 	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:159)
23:45:04 	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:110)
23:45:04 	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:137)
23:45:04 	at hudson.maven.MavenUtil.createEmbedder(MavenUtil.java:211)
23:45:04 	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1323)
23:45:04 	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1126)
23:45:04 	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3069)
23:45:04 	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
23:45:04 	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
23:45:04 	at hudson.remoting.Request$2.run(Request.java:368)
23:45:04 	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
23:45:04 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
23:45:04 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
23:45:04 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
23:45:04 	at java.lang.Thread.run(Thread.java:748)
23:45:04 	Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to faas-tmp-thomas-21158-2ezsv225r
23:45:04 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1788)
23:45:04 		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
23:45:04 		at hudson.remoting.Channel.call(Channel.java:998)
23:45:04 		at hudson.FilePath.act(FilePath.java:1069)
23:45:04 		at hudson.FilePath.act(FilePath.java:1058)
23:45:04 		at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.parsePoms(MavenModuleSetBuild.java:987)
23:45:04 		at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:691)
23:45:04 		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
23:45:04 		at hudson.model.Run.execute(Run.java:1856)
23:45:04 		at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
23:45:04 		at hudson.model.ResourceController.execute(ResourceController.java:97)
23:45:04 		at hudson.model.Executor.run(Executor.java:428)
23:45:05 Started calculate disk usage of build
23:45:05 Finished Calculation of disk usage of build in 0 seconds
23:45:06 Started calculate disk usage of workspace
23:45:06 Finished Calculation of disk usage of workspace in 0 seconds
23:45:06 [WS-CLEANUP] Deleting project workspace...
23:45:06 [WS-CLEANUP] Skipped based on build state FAILURE
23:45:06 Finished: FAILURE

That's one difference with the approach I had taken in #372 (the dynamic proxy would have retried the interrupted fetch call - although, as mentioned in that PR, I had no unit test to prove it, because it was not obvious how to even reach that line).

Of course, this case can be fixed with your approach too. In master, this RemotingSystemException would come from here:
RemoteClassLoader.java#L264
There is some try/catch/retry logic (that's one you've modified in this PR), but it only applies in case of plain InterruptedException, not in case of "RemotingSystemException of InterruptedException". Catching the right exception there might be enough.
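
To illustrate "catching the right exception", here is a small sketch of a predicate that treats both a plain InterruptedException and an unchecked wrapper around one as retryable. `WrappedRemotingException` is a hypothetical stand-in for hudson.remoting.RemotingSystemException so that the example compiles on its own, and `isInterruption` is an invented name, not a Remoting API.

```java
/** Hypothetical sketch; the real RemotingSystemException lives in hudson.remoting. */
final class InterruptCauseSketch {
    private InterruptCauseSketch() {}

    /** Stand-in for an unchecked wrapper such as RemotingSystemException. */
    static final class WrappedRemotingException extends RuntimeException {
        WrappedRemotingException(Throwable cause) {
            super(cause);
        }
    }

    /**
     * True for a plain InterruptedException, or for the wrapper when its
     * direct cause is an InterruptedException. Everything else is not retried.
     */
    static boolean isInterruption(Throwable e) {
        if (e instanceof InterruptedException) {
            return true;
        }
        return e instanceof WrappedRemotingException
                && e.getCause() instanceof InterruptedException;
    }

    public static void main(String[] args) {
        Throwable wrapped = new WrappedRemotingException(new InterruptedException());
        System.out.println(isInterruption(wrapped));
    }
}
```

A retry loop that only has `catch (InterruptedException e)` never sees the wrapped form, because the wrapper is a RuntimeException; that is the gap described above.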

One more random thought regarding this PR vs #372: with this PR come proper unit tests of the retry logic itself. You can simulate repeated interruptions, because the try/catch/retry blocks are at a high level in some RemoteClassLoader methods, "around" the test plugs (the TESTING_CLASS_LOAD and TESTING_CLASS_REFERENCE_LOAD runnables). In #372, I've not been able to find a way to do the same: all that these test plugs can do is set the interrupted flag before a remote class-loading operation, leading to one exception, but I couldn't test more than one retry this way, because it was implemented at a lower level. It would have required a different approach, and I had no clear intuition of what (something at the Channel level? one more layer of proxy?).

jeffret-b · 2020-04-23T20:48:59Z

There is some try/catch/retry logic (that's one you've modified in this PR), but it only applies in case of plain InterruptedException, not in case of "RemotingSystemException of InterruptedException". Catching the right exception there might be enough.

I wondered about why that catch was different than the others, but I had poor information as to why things might be different or where the failures occurred. I've just extended the catch in that area in my latest commit to also handle RemotingSystemException similar to how it's being done elsewhere. That may give some nice additional protection.

In #372, I've not been able to find a way to do the same: all that these test plugs can do is set the interrupted flag before a remote class-loading operation, leading to one exception, but I couldn't test more than one retry this way, because it was implemented at a lower level.

I like the cleanness and thoroughness of your approach, but I've struggled with a couple of things with it. I never could get the tests or MAX_RETRIES to work with it. Given the level at which that approach behaves, I wasn't as confident in exactly what it was doing or how much it might apply to other situations. I ended up pursuing this approach because I had more confidence in what it was doing, partly because I was able to get decent tests around it.

thomasgl-orange · 2020-04-23T22:51:35Z

I've just extended the catch in that area in my latest commit to also handle RemotingSystemException similar to how it's being done elsewhere. That may give some nice additional protection.

Yes, 5c0bec8 should fix it.
I've not tried yet, but does it by chance allow getting rid of the throw new InterruptedException in invokeClassLoadTestingHookIfNeeded()?

Reviewing the code, I find a few other call paths from ClassLoader public methods to the remoting proxy (IClassLoader) with no safeguard (no try/catch/retry for RemotingSystemException(InterruptedException)), which could thus theoretically lead to similar errors:

findClass(String) => fetchFromProxy(String, Channel) => IClassLoader.fetch(String)
- RemoteClassLoader.java#L218 => RemoteClassLoader.java#L225
- only when !channel.remoteCapability.supportsMultiClassLoaderRPC(), so probably not worth fixing
findClass(String) => loadWithMultiClassLoader(String, Channel) => IClassLoader.fetch2(String)
- RemoteClassLoader.java#L216 => RemoteClassLoader.java#L249
- only when !channel.remoteCapability.supportsPrefetch(), so probably not worth fixing
findResources(String) => IClassLoader.getResources2(String)
- RemoteClassLoader.java#L611
- this one, I think, deserves to be safeguarded, in a way similar to findResource(String)

Also, I see that in findResource(String), there is no retry in case of InterruptedException, whereas there are retries for that exception in loadRemoteClass(...) (i.e., in findClass(String)). That doesn't seem very consistent, what do you think? (To be clear, things were already like that before your PR, and were still like that with my PR too; I was just wondering if that's something you would consider cleaning up.)

And two last thoughts/questions:

what about handling InterruptedIOException? That's also an exception which can be thrown when the thread is interrupted, thus similar to an InterruptedException, except it is not an InterruptedException because it is an IOException and Java is Java. I tend to think it makes sense to handle it in a similar way (I was doing that in #372 when it was the cause of a RemotingSystemException), but I've never been 100% sure it was right.
an idea I did mention (but not implement) in #372 was, for resource loading (findResource/findResources), to only retry in case the call was part of some class initialization (<clinit> in the stack). The overall goal of this retry mechanism is to avoid errors which could leave the classloader in an unrecoverable state with half-loaded classes, whereas there's no harm in immediately interrupting resource loading in another context. What do you think?
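
One way the "<clinit> in the stack" check could look, as a rough sketch only. `ClinitDetectionSketch`, `inStaticInitializer`, and `Holder` are hypothetical names; the approach assumes that walking the current thread's stack trace is acceptable on this code path, which has not been measured here.

```java
/** Hypothetical sketch of detecting that a call happens during class initialization. */
final class ClinitDetectionSketch {
    private ClinitDetectionSketch() {}

    /** True if any frame on the current thread's stack is a static initializer. */
    static boolean inStaticInitializer() {
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            // Static initializers show up in stack traces under the name "<clinit>".
            if ("<clinit>".equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Helper class whose static initializer performs the check on itself. */
    static final class Holder {
        static final boolean SEEN_DURING_INIT = inStaticInitializer();
    }

    public static void main(String[] args) {
        System.out.println("from main:    " + inStaticInitializer());
        System.out.println("during init:  " + Holder.SEEN_DURING_INIT);
    }
}
```

A resource-loading method could consult such a check to decide whether an interruption should be retried (class initialization in progress) or allowed to fail fast (any other context).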

I ended up pursuing this approach because I had more confidence in what it was doing, partly because I was able to get decent tests around it.

Sure, I get that, testability is a big plus here.

jeffret-b · 2020-04-24T19:33:11Z

I've not tried yet, but does it by chance allow getting rid of the throw new InterruptedException in invokeClassLoadTestingHookIfNeeded()?

No, that's still needed.

* only when `!channel.remoteCapability.supportsMultiClassLoaderRPC()`, so probably not worth fixing

Yes, I see no need to do anything about those. I noticed at least one of those when I was putting my PR together but there's no reason to expend effort there.

* findResources(String)

I agree that we should do something about this one. I'll take a look.

(More responses to come.)

jeffret-b · 2020-04-24T21:10:48Z

* `findResources(String)` => `IClassLoader.getResources2(String)`

This one should now be covered with the same pattern in e6ddc55.

jeffret-b · 2020-04-24T22:20:55Z

* an idea I did mention (but not implement) in #372 was, for resources loading (`findResource`/`findResources`), to only retry in case the call was part of some class initialization (`<clinit>` in the stack). Because the overall goal of this retry mechanism is to avoid errors which could leave the classloader in an unrecoverable state with half-loaded classes, but there's no harm in immediately interrupting resources loading in another context. What do you think?

On the idea that it's easier to evaluate ideas like this with working code, I threw together an implementation. It's on another branch because I'm not sure what to do with it yet. So far I'm inclined to not merge it in and keep the retries for resource loading generally. When only catching the set of exceptions we're looking at, these are mostly involved with channel operations. We're not likely to catch ones that aren't involved in channel stuff anyway. Regular resource loading problems should continue to fail quickly. I'm not sure the special handling helps, though it probably doesn't hurt, either. See the branch and the commit diff.

thehustler088 · 2020-11-11T08:53:47Z

@jeffret-b On some nodes (not all), I see the error below when I abort a running job:
Nov 11, 2020 12:52:04 AM hudson.util.ProcessTree get
WARNING: Error while determining if vetoers exist
java.lang.InterruptedException
at java.base/java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:177)
at hudson.remoting.Channel.call(Channel.java:1000)
at hudson.util.ProcessTree.get(ProcessTree.java:434)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:1164)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:1155)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:376)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

jeffret-b · 2020-11-11T15:42:33Z

Hmmm ... this PR includes some changes in how interrupts are handled. Looks like one now bubbles up in a different way.

Do you see any indication that this causes a problem? (Besides the stack trace. Maybe we can just handle it differently if it doesn't cause any actual issues.)

thehustler088 · 2020-11-11T17:50:16Z

@jeffret-b I haven't seen any issues; all looks good except the above traces on some nodes. Would increasing the max retries avoid hitting these exceptions, as a temporary workaround?

jeffret-b · 2020-11-11T17:54:14Z

This sounds like a specific case that needs to be handled in the PR. I'll see if I can scrape together enough time in the next day or two to try and see how to handle it. I doubt increasing the max retries would change it, but you can always give it a try.

jeffret-b · 2020-11-13T18:22:23Z

@thehustler088, are you using the msbuild or gradle-daemon plugins? (Just checking on something.)

thehustler088 · 2020-11-14T02:40:07Z

@jeffret-b I am running a bunch of Python scripts in the background that do deployment and testing in my environment.

jeffret-b · 2020-11-19T16:37:58Z

@thehustler088, I've gotten another incrementals build of the Jenkins war that may cover the follow-on issue: https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/ . If you can give that a try and report on the results that would help.

thehustler088 · 2020-11-20T10:27:38Z

Sure @jeffret-b. I may not be able to load it today since weekend runs are in progress. I will load it on Sunday and let you know the test results by Monday or Tuesday. Thanks!

thehustler088 · 2020-11-24T09:01:19Z

@jeffret-b I have loaded the war file (https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/) in my environment and have not seen any issues or traces so far. I tried interrupting running jobs on 3 nodes (running Ubuntu 18.04 LTS). I may try on different nodes this week.

thehustler088 · 2020-11-24T09:05:31Z

Is there a way to make these draft changes (the remoting jar) compatible with LTS versions? I see some UI bugs with the currently loaded build.

jeffret-b · 2020-11-24T15:26:08Z

@thehustler088, thanks for the testing! This gives more confidence that this change improves the situation, which will help push it forward.

The core build uses the incrementals capability, which requires it to be based on the latest state of the master branch. This can indeed bring in unstable changes. If you want to have a different version you can build jenkinsci/jenkins#5054 on top of whatever branch you want. I don't know where I would host a build like that for you.

thehustler088 · 2020-12-07T17:04:24Z

@jeffret-b Thanks for the changes. I am trying to build pipeline jobs for a new requirement with 2.268-rc30582.2308bd563998, and the UI becomes unmanageable with active choice parameters. Is there a way to make the remoting jar (which includes your draft changes) compatible with Jenkins 2.263.1? I did not find a way to make it work.

Also, may I know when these changes will be officially available in Jenkins?

jeffret-b · 2020-12-08T19:12:08Z

@thehustler088, I cherry-picked the two relevant commits from jenkinsci/jenkins#5054 on top of the jenkinsci/jenkins branch stable-2.263. Then I built it, skipping the tests. I've uploaded the resulting war file to https://www.dropbox.com/s/1ez4t5pxp30q8tw/jenkins.war?dl=0 . Not sure if you can access that, but you can give it a try or build it yourself.

I have no plan yet for merging these in. If you report success in your environment that gives more confidence. I could see about integrating these in sometime in January 2021.

thehustler088 · 2020-12-09T12:39:51Z

@jeffret-b Thanks very much for building on the stable branch. I have downloaded it and will load it sometime tomorrow. If this solves the UI issues, I will load this build on multiple master servers.

thehustler088 · 2020-12-17T09:14:18Z

@jeffret-b I have been using this build for about a week and I did not notice any issues with interrupting running jobs. Also, the UI is working fine with active choice parameters.

jeffret-b · 2020-12-17T17:44:15Z

Thanks for testing that. The results are very promising. I'll try to get this moving forward early next year.

In the meantime, if anyone else wants to review or especially test, that would be great.

thehustler088 · 2021-01-15T08:49:22Z

@jeffret-b Any progress on pushing the changes to an official Jenkins release?

jeffret-b · 2021-01-16T01:06:51Z

@thehustler088 , I've been busy, which may continue through the end of this month, but I want to get this done as soon as I can.

jeffret-b · 2021-02-08T19:15:29Z

I'm moving this officially out of draft and work-in-progress. There are a number of reviews. Some people have tested and verified improvements using this reworked approach.

I'd like to get this merged and released soon. I appreciate any other reviews or tests. I'd like to get it merged in and ready for release within the next couple of days.

jeffret-b · 2021-02-16T21:56:57Z

@thehustler088 , this is now released in Remoting 4.7. I'm not going to push it into the Jenkins packaging immediately. Any testing or use results you can share will help increase confidence and speed of getting it packaged into core.

I'm hoping to get it into a weekly Jenkins release within the next couple of releases.

Prabhu088 · 2021-02-17T13:08:09Z

@jeffret-b Thanks very much for pushing the changes to the official Remoting 4.7. May I know which versions of Jenkins are compatible with the Remoting 4.7 jar? Can I use it with Jenkins 2.263.1? If yes, please let me know which branches to pull in order to build the Jenkins 2.263.1 war with Remoting 4.7.

jeffret-b · 2021-02-17T16:23:54Z

This Remoting version should work fine with older versions, especially including 2.263.x. Even older versions should still work fine, though they get little testing. Newer features may not work -- for example, Jenkins 2.217 is required for WebSocket capabilities.

Prabhu088 · 2021-02-23T04:44:08Z

@jeffret-b I have added the Remoting 4.7 jar to Jenkins 2.263.1. I will load it and let you know by the end of this week.

thehustler088 · 2021-03-02T09:32:42Z

I have loaded this on a couple of my testing machines running Ubuntu 18.04 LTS and CentOS 7, and did not see any issues or exceptions when I abort running jobs.

jeffret-b added 5 commits April 21, 2020 10:36

WIP initial work

02f15c0

Get timeouts working for reference and class.

45acb17

Separate out the tests that don't work correctly in all the test runners. Add the checks for retries. Open question about whether to sleep. Or wait. Or anything.

Add retries for resource loading as part of the class.

adb3b80

A little more cleanup.

11edae4

Add a little logging.

23e8617

jeffret-b requested a review from jvz April 21, 2020 19:26

jvz approved these changes Apr 21, 2020

Change while to for.

08ce1b4

jeffret-b requested a review from res0nance April 22, 2020 14:23

oleg-nenashev self-requested a review April 22, 2020 14:25

res0nance approved these changes Apr 22, 2020

Also handle the RemotingSystemException for loadRemoteClass.

5c0bec8

thomasgl-orange reviewed Apr 23, 2020

Fix spacing.

8099a97

Co-Authored-By: Thomas de Grenier de Latour <thomas.degrenierdelatour@orange.com>

jeffret-b added 2 commits April 24, 2020 15:07

Apply the same retry pattern to findResources().

e6ddc55

Merge branch 'retryRCL' of github.com:jeffret-b/remoting into retryRCL

8d8321e

jeffret-b added 2 commits April 24, 2020 15:21

Add a test for the normal case.

780bdde

Fix copy/paste mistake.

24a1f89

jeffret-b changed the title ~~[WIP] [JENKINS-61103] Retry on class resource load failures and introduce timeouts~~ [JENKINS-61103] Retry on class resource load failures and introduce timeouts Feb 8, 2021

jeffret-b added the bug label and removed the work-in-progress label Feb 8, 2021

jeffret-b marked this pull request as ready for review February 8, 2021 19:13

jeffret-b added the ready-to-merge label Feb 8, 2021

jeffret-b merged commit 8e71970 into jenkinsci:master Feb 10, 2021
