[JENKINS-61103] Retry on class resource load failures and introduce timeouts #379

Merged
merged 19 commits into from
Feb 10, 2021

Conversation

jeffret-b
Contributor

See JENKINS-61103. This is an alternative approach to #372.

There have been a couple of previous efforts to introduce retries into the RemoteClassLoader. These have reportedly resolved some situations; however, we have continued to receive reports of class loading failures. As noted in JENKINS-61103 and the earlier PR, one area that isn't covered by a retry is resource loading as part of a class.

This PR does three things:

  1. Fixes some of the tests so that they work correctly. Some of them weren't really testing what they claimed to. Some could be simplified a little.
  2. Introduces timeouts and sleep to the retries. This should slow down further attempts that may quickly fail and provide a way to eventually terminate if the problem doesn't resolve. I add this to the two existing retry areas and to the new one. I arbitrarily chose 10 minutes as the total timeout time. It's much smaller than the previous value (infinite), but still a significant amount of time.
  3. Adds retries to findResource() using the same pattern.
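
As an illustration of the pattern described in the list above (retry, pause between attempts, give up after a total timeout), here is a minimal sketch. It is not the code from this PR: `RetrySketch`, `Interruptible`, and `callWithRetry` are hypothetical names, and restoring the interrupt flag at the end is an assumption about the desired behaviour.

```java
/** Hypothetical retry helper; names and structure are illustrative only. */
final class RetrySketch {
    /** An operation that may be interrupted, e.g. a blocking remote call. */
    interface Interruptible<T> {
        T call() throws InterruptedException;
    }

    private RetrySketch() {}

    /**
     * Calls the action until it succeeds or the total timeout elapses,
     * pausing between attempts. Only InterruptedException is retried.
     */
    static <T> T callWithRetry(Interruptible<T> action, long pauseMillis, long timeoutMillis)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return action.call();
                } catch (InterruptedException e) {
                    interrupted = true;
                    if (System.nanoTime() - deadline >= 0) {
                        throw e; // total timeout exhausted: stop retrying
                    }
                }
                try {
                    Thread.sleep(pauseMillis); // slow down attempts that fail quickly
                } catch (InterruptedException e) {
                    interrupted = true; // interrupted while pausing: retry right away
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
            }
        }
    }
}
```

With a 10 minute total timeout, a call would look something like `RetrySketch.callWithRetry(() -> fetch(name), 100, 600_000)`, where `fetch` stands in for whatever remote operation is being protected.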

These changes only get involved in truly exceptional conditions. I've never been able to reproduce them directly. Other reports of class loading failures have similarly lacked reproducibility. This has two major impacts:

  1. It's very difficult to test how much this fix will impact the situation, how well it improves things, or whether it will correct any of the reported situations.
  2. Any problems introduced will not have widespread effect. If the retry period is not ideal, it won't affect the great majority of situations. This makes me feel like it is worthwhile to try out this change to see how well it helps.

This change might help for JENKINS-51854 and JENKINS-514910.

Separate out the tests that don't work correctly in all the test runners.
Add the checks for retries.
Open question about whether to sleep. Or wait. Or anything.
@jeffret-b jeffret-b requested a review from jvz April 21, 2020 19:26
Member

@jvz jvz left a comment


Looks like a nice improvement. If this turns out to be helpful, perhaps this can be enhanced later to use a more generic retry strategy?

src/main/java/hudson/remoting/RemoteClassLoader.java Outdated Show resolved Hide resolved
@jeffret-b
Contributor Author

If this turns out to be helpful, perhaps this can be enhanced later to use a more generic retry strategy?

Could be. I'm trying to keep it understandable and not too much of a change. There are other patterns that could be used, including the approach in the earlier PR. I couldn't get that one to work out right, including with the tests.

I'd also like to investigate some retries on channel failures, some sort of automated reconnection. That seems to be an issue that comes up fairly frequently. Might be kind of complicated.

@jeffret-b
Contributor Author

I'm hoping to get a few more reviews before proceeding on this, so please take a look.

@oleg-nenashev oleg-nenashev self-requested a review April 22, 2020 14:25
@jvz
Member

jvz commented Apr 22, 2020

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

@basil
Member

basil commented Apr 22, 2020

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

Failsafe is another option. Since we're on the subject, I love Tenacity's documentation on retrying. It so clearly and concisely explains the pros and cons of various retrying strategies.

Contributor

@res0nance res0nance left a comment


LGTM, infinite retries stop being infinite but it doesn't feel like retries being infinite would have ever helped. Hopefully this can help surface reproducible cases.

@jvz
Member

jvz commented Apr 22, 2020

For retrying, there's always https://github.com/resilience4j/resilience4j though I don't know how heavy that is.

Failsafe is another option. Since we're on the subject, I love Tenacity's documentation on retrying. It so clearly and concisely explains the pros and cons of various retrying strategies.

Failsafe looks really neat!

@jeffret-b
Contributor Author

Thanks for the comments and reviews. As Matt mentioned earlier, I'm trying to keep this change relatively simple for now. Those libraries might be useful later. At a minimum, maybe we can start iterating towards more information on failures and reproducible cases.

@thomasgl-orange
Contributor

I have done a few manual tests of a 4.4-SNAPSHOT built from this PR (with a patched Jenkins 2.222.1), using the procedure I had described in JENKINS-61103:

  • disconnect the agent
  • delete its jar cache (not sure it really matters, I've just assumed it would give me more opportunities to interrupt class loading while doing remoting stuff)
  • reconnect it
  • run a simple Maven job (here I'm running the "clean" target on the jenkinsci/remoting repository, but it doesn't matter)
  • interrupt the job when it starts talking about the POM

After a few tries, I've reproduced a LinkageError caused by a RemotingSystemException of InterruptedException. Subsequent attempts at running the Maven job then fail with a NoClassDefFoundError, for the reasons described in t.
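
The two-stage failure described here (an initialization error first, then NoClassDefFoundError on every later use) is standard JVM behaviour and can be reproduced without Remoting. The following sketch uses hypothetical names (`InitFailureSketch`, `Fragile`), and the thrown IllegalStateException merely stands in for the interrupted remote class load:

```java
/** Hypothetical demo of a class left unusable by a failed static initializer. */
final class InitFailureSketch {
    /** Stand-in for a class whose static initializer depends on a remote load. */
    static final class Fragile {
        static final Object VALUE = load();

        private static Object load() {
            // Simulates the interrupted fetch surfacing as an unchecked exception.
            throw new IllegalStateException("simulated interrupted class load");
        }

        static void touch() {
            // Calling any static method triggers class initialization.
        }
    }

    private InitFailureSketch() {}

    /** Returns "ok" or the simple name of the LinkageError that was thrown. */
    static String tryTouch() {
        try {
            Fragile.touch();
            return "ok";
        } catch (LinkageError e) {
            return e.getClass().getSimpleName();
        }
    }
}
```

The first call to `tryTouch()` reports an ExceptionInInitializerError; every later call reports a NoClassDefFoundError, because the JVM marks the class as erroneous and never retries its initialization. That is why a single interrupted load can break all following builds on the same agent.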

  • Log of the interrupted build:
[...]
23:43:16  > git checkout -f 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:43:18 Commit message: "Merge pull request #378 from jeffret-b/updateRCL"
23:43:18  > git rev-list --no-walk 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:43:20 Parsing POMs
23:43:20 using global settings config with name FaaS network settings
23:43:20 Replacing all maven server entries not found in credentials list is true
23:43:23 Build was aborted
23:43:23 Aborted by Thomas De Grenier De La Tour
23:43:23 Started calculate disk usage of build
23:43:23 Finished Calculation of disk usage of build in 0 seconds
23:43:23 Started calculate disk usage of workspace
23:43:24 Stop LogSizeTimerTask
23:43:24 Finished Calculation of disk usage of workspace in  1 second
23:43:24 [WS-CLEANUP] Deleting project workspace...
23:43:25 [WS-CLEANUP] done
23:43:25 Finished: ABORTED
  • Exception in the agent log:
Apr 22, 2020 11:43:23 PM hudson.remoting.UserRequest perform
WARNING: LinkageError while performing UserRequest:hudson.maven.MavenModuleSetBuild$PomParser@4480bf1e
java.lang.ExceptionInInitializerError
	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:182)
	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:168)
	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:166)
	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:159)
	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:110)
	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:137)
	at hudson.maven.MavenUtil.createEmbedder(MavenUtil.java:211)
	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1323)
	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1126)
	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3069)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:368)
	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: hudson.remoting.RemotingSystemException: java.lang.InterruptedException
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:301)
	at com.sun.proxy.$Proxy5.fetch(Unknown Source)
	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:294)
	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:257)
	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:216)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.eclipse.sisu.inject.Weak.concurrentKeys(Weak.java:89)
	at org.eclipse.sisu.inject.Weak.concurrentKeys(Weak.java:79)
	at org.eclipse.sisu.plexus.ClassRealmManager.<clinit>(ClassRealmManager.java:66)
	... 18 more
Caused by: java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:176)
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)
	... 27 more
  • Log of the next build attempt:
[...]
23:45:04  > git checkout -f 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:45:04 Commit message: "Merge pull request #378 from jeffret-b/updateRCL"
23:45:04  > git rev-list --no-walk 142a5ef53ef80df3bf7b0f10ca9e555238ec33e3 # timeout=10
23:45:04 Parsing POMs
23:45:04 using global settings config with name FaaS network settings
23:45:04 Replacing all maven server entries not found in credentials list is true
23:45:04 ERROR: Failed to parse POMs
23:45:04 java.io.IOException: Remote call on faas-tmp-thomas-21158-2ezsv225r failed
23:45:04 	at hudson.remoting.Channel.call(Channel.java:1004)
23:45:04 	at hudson.FilePath.act(FilePath.java:1069)
23:45:04 	at hudson.FilePath.act(FilePath.java:1058)
23:45:04 	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.parsePoms(MavenModuleSetBuild.java:987)
23:45:04 	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:691)
23:45:04 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
23:45:04 	at hudson.model.Run.execute(Run.java:1856)
23:45:04 	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
23:45:04 	at hudson.model.ResourceController.execute(ResourceController.java:97)
23:45:04 	at hudson.model.Executor.run(Executor.java:428)
23:45:04 Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.eclipse.sisu.plexus.ClassRealmManager
23:45:04 	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:182)
23:45:04 	at org.codehaus.plexus.DefaultPlexusContainer.<init>(DefaultPlexusContainer.java:168)
23:45:04 	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:166)
23:45:04 	at hudson.maven.MavenEmbedderUtils.buildPlexusContainer(MavenEmbedderUtils.java:159)
23:45:04 	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:110)
23:45:04 	at hudson.maven.MavenEmbedder.<init>(MavenEmbedder.java:137)
23:45:04 	at hudson.maven.MavenUtil.createEmbedder(MavenUtil.java:211)
23:45:04 	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1323)
23:45:04 	at hudson.maven.MavenModuleSetBuild$PomParser.invoke(MavenModuleSetBuild.java:1126)
23:45:04 	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3069)
23:45:04 	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
23:45:04 	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
23:45:04 	at hudson.remoting.Request$2.run(Request.java:368)
23:45:04 	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
23:45:04 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
23:45:04 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
23:45:04 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
23:45:04 	at java.lang.Thread.run(Thread.java:748)
23:45:04 	Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to faas-tmp-thomas-21158-2ezsv225r
23:45:04 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1788)
23:45:04 		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
23:45:04 		at hudson.remoting.Channel.call(Channel.java:998)
23:45:04 		at hudson.FilePath.act(FilePath.java:1069)
23:45:04 		at hudson.FilePath.act(FilePath.java:1058)
23:45:04 		at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.parsePoms(MavenModuleSetBuild.java:987)
23:45:04 		at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:691)
23:45:04 		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
23:45:04 		at hudson.model.Run.execute(Run.java:1856)
23:45:04 		at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
23:45:04 		at hudson.model.ResourceController.execute(ResourceController.java:97)
23:45:04 		at hudson.model.Executor.run(Executor.java:428)
23:45:05 Started calculate disk usage of build
23:45:05 Finished Calculation of disk usage of build in 0 seconds
23:45:06 Started calculate disk usage of workspace
23:45:06 Finished Calculation of disk usage of workspace in 0 seconds
23:45:06 [WS-CLEANUP] Deleting project workspace...
23:45:06 [WS-CLEANUP] Skipped based on build state FAILURE
23:45:06 Finished: FAILURE

That's one difference with the approach I had taken in #372 (the dynamic proxy would have retried the interrupted fetch call - although, as mentioned in that PR, I had no unit test to prove it, because it was not obvious how to even reach that line).

Of course, this case can be fixed with your approach too. In master, this RemotingSystemException would come from here:
RemoteClassLoader.java#L264
There is some try/catch/retry logic (that's one you've modified in this PR), but it only applies in case of plain InterruptedException, not in case of "RemotingSystemException of InterruptedException". Catching the right exception there might be enough.
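
To make the distinction concrete, here is a minimal sketch of a check that treats both a plain InterruptedException and a wrapped one as retryable. `InterruptCauseSketch` and `WrapperException` are hypothetical; the latter only stands in for RemotingSystemException, which is assumed here to be an unchecked wrapper that carries the original exception as its cause.

```java
/** Hypothetical helper for recognising wrapped interruptions. */
final class InterruptCauseSketch {
    /** Stand-in for an unchecked wrapper such as RemotingSystemException. */
    static final class WrapperException extends RuntimeException {
        WrapperException(Throwable cause) {
            super(cause);
        }
    }

    private static final int MAX_DEPTH = 10; // guard against unusually long or cyclic chains

    private InterruptCauseSketch() {}

    /** True if the throwable is, or is caused by, an InterruptedException. */
    static boolean isInterrupt(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof InterruptedException) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```

A retry loop could then catch the wrapper type and retry only when `isInterrupt(e)` holds, rethrowing anything else unchanged.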

One other random thought regarding this PR vs #372: with this PR come proper unit tests of the retry logic itself. You can simulate repeated interruptions, because the try/catch/retry blocks are at a high level in some RemoteClassLoader methods, "around" the test plugs (the TESTING_CLASS_LOAD and TESTING_CLASS_REFERENCE_LOAD runnables). In #372, I've not been able to find a way to do the same: all that these test plugs can do is set the interrupted flag before a remote class-loading operation, leading to one exception, but I couldn't test more than one retry this way, because it was implemented at a lower level. It would have required a different approach, and I had no clear intuition of what (something at the Channel level? one more layer of proxy?).

@jeffret-b
Contributor Author

There is some try/catch/retry logic (that's one you've modified in this PR), but it only applies in case of plain InterruptedException, not in case of "RemotingSystemException of InterruptedException". Catching the right exception there might be enough.

I wondered about why that catch was different than the others, but I had poor information as to why things might be different or where the failures occurred. I've just extended the catch in that area in my latest commit to also handle RemotingSystemException similar to how it's being done elsewhere. That may give some nice additional protection.

In #372, I've not been able to find a way to do the same: all that these test plugs can do is set the interrupted flag before a remote class-loading operation, leading to one exception, but I couldn't test more than one retry this way, because it was implemented at a lower level.

I like the cleanness and thoroughness of your approach, but I've struggled with a couple of things with it. I never could get the tests or MAX_RETRIES to work with it. Given the level at which that approach behaves, I wasn't as confident in exactly what it was doing or how much it might apply to other situations. I ended up pursuing this approach because I had more confidence in what it was doing, partly because I was able to get decent tests around it.

@thomasgl-orange
Contributor

I've just extended the catch in that area in my latest commit to also handle RemotingSystemException similar to how it's being done elsewhere. That may give some nice additional protection.

Yes, 5c0bec8 should fix it.
I've not tried yet, but does it by chance allow getting rid of the throw new InterruptedException in invokeClassLoadTestingHookIfNeeded()?

Reviewing the code, I find a few other call paths from ClassLoader public methods to the remoting proxy (IClassLoader) with no safeguard (no try/catch/retry for RemotingSystemException(InterruptedException)), which could thus theoretically lead to similar errors:

Also, I see that in findResource(String), there is no retry in case of InterruptedException, whereas there are retries for that exception in loadRemoteClass(...) (i.e., in findClass(String)). It doesn't sound very coherent, what do you think? (To be clear, things were already like that before your PR, and were still like that with my PR too; I was just wondering if that's something you would consider cleaning up.)

And two last thoughts/questions:

I ended up pursuing this approach because I had more confidence in what it was doing, partly because I was able to get decent tests around it.

Sure, I get that, testability is a big plus here.

Co-Authored-By: Thomas de Grenier de Latour <thomas.degrenierdelatour@orange.com>
@jeffret-b
Contributor Author

I've not tried yet, but does it by chance allow getting rid of the throw new InterruptedException in invokeClassLoadTestingHookIfNeeded()?

No, that's still needed.

* only when `!channel.remoteCapability.supportsMultiClassLoaderRPC()`, so probably not worth fixing

Yes, I see no need to do anything about those. I noticed at least one of those when I was putting my PR together but there's no reason to expend effort there.

* findResources(String)

I agree that we should do something about this one. I'll take a look.

(More responses to come.)

@jeffret-b
Contributor Author

* `findResources(String)` => `IClassLoader.getResources2(String)`

This one should now be covered with the same pattern in e6ddc55.

@jeffret-b
Contributor Author

* an idea I did mention (but not implement) in #372 was, for resource loading (`findResource`/`findResources`), to only retry in case the call was part of some class initialization (`<clinit>` in the stack). Because the overall goal of this retry mechanism is to avoid errors which could leave the classloader in an unrecoverable state with half-loaded classes, but there's no harm in immediately interrupting resource loading in another context. What do you think?

On the idea that it's easier to evaluate ideas like this with working code, I threw together an implementation. It's on another branch because I'm not sure what to do with it yet. So far I'm inclined to not merge it in and keep the retries for resource loading generally. When only catching the set of exceptions we're looking at, these are mostly involved with channel operations. We're not likely to catch ones that aren't involved in channel stuff anyway. Regular resource loading problems should continue to fail quickly. I'm not sure the special handling helps, though it probably doesn't hurt, either. See the branch and the commit diff.
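
For readers who want to picture the idea being discussed, one way to detect "`<clinit>` in the stack" is to inspect the current thread's stack trace. This is only a sketch under that assumption, with hypothetical names (`ClinitCheckSketch`, `Holder`); it is not the implementation on the branch mentioned above.

```java
/** Hypothetical check for "are we inside some class's static initializer?". */
final class ClinitCheckSketch {
    private ClinitCheckSketch() {}

    /** True if any frame on the current thread's stack is a static initializer. */
    static boolean inStaticInitializer() {
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            // The JVM reports class initialization frames under the name "<clinit>".
            if ("<clinit>".equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Demo holder: the field is computed while Holder is being initialized. */
    static final class Holder {
        static final boolean SEEN_DURING_INIT = inStaticInitializer();
    }
}
```

A resource-loading method could consult `inStaticInitializer()` to decide whether an interruption should be retried or propagated immediately. Walking the stack on every call has a cost, which is one more reason the simpler "always retry" behaviour may be preferable.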

@thehustler088

thehustler088 commented Nov 11, 2020

@jeffret-b On some nodes (not all), I see the error below when I abort a running job:
Nov 11, 2020 12:52:04 AM hudson.util.ProcessTree get
WARNING: Error while determining if vetoers exist
java.lang.InterruptedException
at java.base/java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:177)
at hudson.remoting.Channel.call(Channel.java:1000)
at hudson.util.ProcessTree.get(ProcessTree.java:434)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:1164)
at hudson.Launcher$RemoteLauncher$KillTask.call(Launcher.java:1155)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:376)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

@jeffret-b
Contributor Author

Hmmm ... this PR includes some changes in how interrupts are handled. Looks like one now bubbles up in a different way.

Do you see any indication that this causes a problem? (Besides the stack trace. Maybe we can just handle it differently if it doesn't cause any actual issues.)

@thehustler088

thehustler088 commented Nov 11, 2020

@jeffret-b I haven't seen any issues; all looks good except for the above traces on some nodes. Would increasing max retries avoid hitting these exceptions as a temporary workaround?

@jeffret-b
Contributor Author

This sounds like a specific case that needs to be handled in the PR. I'll see if I can scrape together enough time in the next day or two to try and see how to handle it. I doubt increasing the max retries would change it, but you can always give it a try.

@jeffret-b
Contributor Author

@thehustler088, are you using the msbuild or gradle-daemon plugins? (Just checking on something.)

@thehustler088

thehustler088 commented Nov 14, 2020

@jeffret-b I am running a bunch of Python scripts in the background that do deployment and testing in my environment.

@jeffret-b
Contributor Author

@thehustler088, I've gotten another incrementals build of the Jenkins war that may cover the follow-on issue: https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/ . If you can give that a try and report on the results that would help.

@thehustler088

Sure @jeffret-b. I may not be able to load it today since weekend runs are in progress. I will load it on Sunday and let you know the test results by Monday or Tuesday. Thanks!

@thehustler088

@jeffret-b I have loaded the war file (https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/ ) in my environment and have not seen any issues or traces so far. I tried on 3 nodes (running Ubuntu 18.04 LTS) by interrupting running jobs. I may try on different nodes this week.

@thehustler088

Is there a way to make these draft changes (the remoting jar) compatible with LTS versions? I see some UI bugs with the currently loaded build.

@jeffret-b
Contributor Author

@thehustler088, thanks for the testing! This gives more confidence that this change improves the situation, which will help push it forward.

The core build uses the incrementals capability, which requires it to be based on the latest state of the master branch. This can indeed bring in unstable changes. If you want to have a different version you can build jenkinsci/jenkins#5054 on top of whatever branch you want. I don't know where I would host a build like that for you.

@thehustler088

@jeffret-b Thanks for the changes. I am trying to build pipeline jobs for a new requirement with 2.268-rc30582.2308bd563998, and the UI becomes unmanageable with active choice parameters. Is there a way to make the remoting jar (which includes your draft changes) compatible with Jenkins 2.263.1? I did not find a way to make it work.

Also, may I know when these changes will be officially available with Jenkins?

@jeffret-b
Contributor Author

@thehustler088, I cherry-picked the two relevant commits from jenkinsci/jenkins#5054 on top of the jenkinsci/jenkins branch stable-2.263. Then I built with skipping tests. I've uploaded the resulting war file to https://www.dropbox.com/s/1ez4t5pxp30q8tw/jenkins.war?dl=0 . Not sure if you can access that, but you can give it a try or build yourself.

I have no plan yet for merging these in. If you report success in your environment that gives more confidence. I could see about integrating these in sometime in January 2021.

@thehustler088

@jeffret-b Thanks much for building on the stable branch. I have downloaded it and will load it sometime tomorrow. If this solves the UI issues, I will load this build on multiple master servers.

@thehustler088

@jeffret-b I have been using this build for about a week and I did not notice any issues with interrupting running jobs. Also, the UI is working fine with active choice parameters.

@jeffret-b
Contributor Author

Thanks for testing that. The results are very promising. I'll try to get this moving forward early next year.

In the meantime, if anyone else wants to review or especially test, that would be great.

@thehustler088

@jeffret-b Any progress on pushing the changes to an official Jenkins release?

@jeffret-b
Contributor Author

@thehustler088 , I've been busy, which may continue through the end of this month, but I want to get this done as soon as I can.

@jeffret-b jeffret-b changed the title [WIP] [JENKINS-61103] Retry on class resource load failures and introduce timeouts [JENKINS-61103] Retry on class resource load failures and introduce timeouts Feb 8, 2021
@jeffret-b jeffret-b added bug For changelog: Fixes a bug. and removed work-in-progress labels Feb 8, 2021
@jeffret-b jeffret-b marked this pull request as ready for review February 8, 2021 19:13
@jeffret-b
Contributor Author

I'm moving this officially out of draft and work-in-progress. There are a number of reviews. Some people have tested and verified improvements using this reworked approach.

I'd like to get this merged and released soon. I appreciate any other reviews or tests. I'd like to get it merged in and ready for release within the next couple of days.

@jeffret-b jeffret-b merged commit 8e71970 into jenkinsci:master Feb 10, 2021
@jeffret-b
Contributor Author

@thehustler088 , this is now released in Remoting 4.7. I'm not going to push it into the Jenkins packaging immediately. Any testing or use results you can share will help increase confidence and speed of getting it packaged into core.

I'm hoping to get it into a weekly Jenkins release within the next couple of releases.

@Prabhu088

@jeffret-b Thanks much for pushing the changes to the official Remoting 4.7. May I know which versions of Jenkins are compatible with the Remoting 4.7 jar? Can I use it with Jenkins 2.263.1? If yes, please let me know which branches to pull in order to build a Jenkins 2.263.1 war with Remoting 4.7.

@jeffret-b
Contributor Author

This Remoting version should work fine with older versions, especially including 2.263.x. Even older versions should still work fine, though they get little testing. Newer features may not work -- for example, Jenkins 2.217 is required for WebSocket capabilities.

@Prabhu088

@jeffret-b I have added the Remoting 4.7 jar to Jenkins 2.263.1. I will load it and let you know by the end of this week.

@thehustler088

I have loaded this on a couple of my testing machines running Ubuntu 18.04 LTS and CentOS 7, and did not see any issues or exceptions when aborting running jobs.

Labels
bug For changelog: Fixes a bug. ready-to-merge
Projects
None yet
7 participants