-
-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-61103] Retry on class resource load failures and introduce timeouts #379
Conversation
Separate out the tests that don't work correclty in all the test runners. Add the checks for retries. Open question about whether to sleep. Or wait. Or anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice improvement. If this turns out to be helpful, perhaps this can be enhanced later to use a more generic retry strategy?
Could be. I'm trying to keep it understandable and not too much of a change. There are other patterns that could be used, including the approach in the earlier PR. I couldn't get that one to work out right, including with the tests. I'd also like to investigate some retries on channel failures, some sort of automated reconnection. That seems to be an issue that comes up fairly frequently. Might be kind of complicated. |
I'm hoping to get a few more reviews before proceeding on this, so please take a look. |
For retrying, there's always https://github.com/resilience4j/resilience4j thought I don't know how heavy that is. |
Failsafe is another option. Since we're on the subject, I love Tenacity's documentation on retrying. It so clearly and concisely explains the pros and cons of various retrying strategies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, infinite retries stop being infinite but it doesn't feel like retries being infinite would have ever helped. Hopefully this can help surface reproducible cases.
Failsafe looks really neat! |
Thanks for the comments and reviews. As Matt mentioned earlier, I'm trying to keep this change relatively simple for now. Those libraries might be useful later. At a minimum, maybe we can start iterating towards more information on failures and reproducible cases. |
I have done a few manual tests of a 4.4-SNAPSHOT built from this PR (with a patched Jenkins 2.222.1), using the procedure I had described in JENKINS-61103:
After a few tries, I've reproduced a
That's one difference with the approach I had taken in #372 (the dynamic proxy would have retried the interrupted Of course, this case can be fixed with your approach too. In One more other random thought regarding this PR vs #372: with this PR come proper unit tests of the retry logic itself. You can simulate repeated interruptions, because the try/catch/retry blocks are at high level in some |
I wondered about why that catch was different than the others, but I had poor information as to why things might be different or where the failures occurred. I've just extended the catch in that area in my latest commit to also handle RemotingSystemException similar to how it's being done elsewhere. That may give some nice additional protection.
I like the cleanness and thoroughness of your approach, but I've struggled with a couple of things with it. I never could get the tests or MAX_RETRIES to work with it. Given the level at which that approach behaves, I wasn't as confident in exactly what it was doing or how much it might apply to other situations. I ended up pursuing this approach because I had more confidence in what it was doing, partly because I was able to get decent tests around it. |
Yes, 5c0bec8 should fix it. Reviewing the code, I find a few other call paths from
Also, I see that in And two last thoughts/questions:
Sure, I get that, testability is a big plus here. |
Co-Authored-By: Thomas de Grenier de Latour <thomas.degrenierdelatour@orange.com>
No, that's still needed.
Yes, I see no need to do anything about those. I noticed at least one of those when I was putting my PR together but there's no reason to expend effort there.
I agree that we should do something about this one. I'll take a look. (More responses to come.) |
This one should now be covered with the same pattern in e6ddc55. |
On the idea that it's easier to evaluate ideas like this with working code, I threw together an implementation. It's on another branch because I'm not sure what to do with it yet. So far I'm inclined to not merge it in and keep the retries for resource loading generally. When only catching the set of exceptions we're looking at, these are mostly involved with channel operations. We're not likely to catch ones that aren't involved in channel stuff anyway. Regular resource loading problems should continue to fail quickly. I'm not sure the special handling helps, though it probably doesn't hurt, either. See the branch and the commit diff. |
@jeffret-b on some nodes(not all nodes) i see the below error when I abort the running job: |
Hmmm ... this PR includes some changes in how interrupts are handled. Looks like one now bubbles up in a different way. Do you see any indication that this causes a problem? (Besides the stack trace. Maybe we can just handle it differently if it doesn't cause any actual issues.) |
@jeffret-b I haven't seen any issues, all looks good except above traces on some nodes. does increasing max retries would avoid hitting these exceptions as a temporary workaround? |
This sounds like a specific case that needs to be handled in the PR. I'll see if I can scrape together enough time in the next day or two to try and see how to handle it. I doubt increasing the max retries would change it, but you can always give it a try. |
@thehustler088, are you using the msbuild or gradle-daemon plugins? (Just checking on something.) |
@jeffret-b I am running bunch of python scripts in background that does deployment and testing in my environment. |
@thehustler088, I've gotten another incrementals build of the Jenkins war that may cover the follow-on issue: https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/ . If you can give that a try and report on the results that would help. |
Sure @jeffret-b I may not able to load it today since weekend runs are in progress. I will load it on Sunday, I will let you the test results by Monday / Tuesday. Thanks! |
@jeffret-b I have loaded war file(https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.268-rc30582.2308bd563998/ ) in my env, did not see any issues / traces till now. I tried on 3 nodes(running ubuntu 18.04 LTS) by interrupting the running jobs. I may try on different nodes this week |
Is there way to make these draft changes(remoting jar) compatible with LTS versions, as I see some UI bugs with the current loaded build.? |
@thehustler088, thanks for the testing! This gives more confidence that this change improves the situation, which will help push it forward. The core build uses the incrementals capability, which requires it to be based on the latest state of the master branch. This can indeed bring in unstable changes. If you want to have a different version you can build jenkinsci/jenkins#5054 on top of whatever branch you want. I don't know where I would host a build like that for you. |
@jeffret-b Thanks for the changes. I am trying to build piplines job for my new requirement with 2.268-rc30582.2308bd563998 and UI becomes unmanageable with active choice parameters. Is there a way to make remoting jar(which includes your draft changes) compatible with jenkins - Jenkins 2.263.1? I did not find a way to make it work. Also, may I know when are these changes will be officially available with Jenkins? |
@thehustler088, I cherry-picked the two relevant commits from jenkinsci/jenkins#5054 on top of the jenkinsci/jenkins branch I have no plan yet for merging these in. If you report success in your environment that gives more confidence. I could see about integrating these in sometime in January 2021. |
@jeffret-b Thanks much for building on stable branch. I have downloaded it, will load it sometime tomorrow. If this solves UI issues, I will load this build on multiple master servers. |
@jeffret-b I have been using this build for about week and I did not notice any issues with interrupting the running jobs. Also, UI is working fine with active choice parameters. |
Thanks for testing that. The results are very promising. I'll try to get this moving forward early next year. In the meantime, if anyone else wants to review or especially test, that would be great. |
@jeffret-b any progress on pushing the changes to official jenkins release |
@thehustler088 , I've been busy, which may continue through the end of this month, but I want to get this done as soon as I can. |
I'm moving this officially out of draft and work-in-progress. There are a number of reviews. Some people have tested and verified improvements using this reworked approach. I'd like to get this merged and released soon. I appreciate any other reviews or tests. I'd like to get it merged in and ready for release within the next couple of days. |
@thehustler088 , this is now released in Remoting 4.7. I'm not going to push it into the Jenkins packaging immediately. Any testing or use results you can share will help increase confidence and speed of getting it packaged into core. I'm hoping to get it into a weekly Jenkins release within the next couple of releases. |
@jeffret-b Thanks much for pushing the changes to official remoting 4.7. May I know what version of jenkins is compatible with remoting jar 4.7? can I use with Jenkins 2.263.1? If yes, please let me know what branches to be pulled in order to build Jenkins war 2.263.1 with Remoting 4.7 |
This Remoting version should work fine with older versions, especially including 2.263.x. Even older versions should still work fine, though they get little testing. Newer features may not work -- for example, Jenkins 2.217 is required for WebSocket capabilities. |
@jeffret-b I have added 4.7 remoting jar in Jenkins 2.263.1, I will load and will let you know by end of this week |
I have loaded this in couple of my testing machine running ubuntu 18.04lts and cent os 7, did not see any issue / exceptions when I abort running jobs. |
See JENKINS-61103. This is an alternative approach to #372.
There have been a couple of previous efforts to introduce retries into the RemoteClassLoader. These have reportedly resolved some situations, however we have continued to receive reports of class loading failures. As noted in JENKINS-61103 and the earlier PR, one area that isn't covered by a retry is resource loading as part of a class.
This PR does three things:
findResource()
using the same pattern.These changes only get involved in truly exceptional conditions. I've never been able to reproduce them directly. Other reports of class loading failures have similarly lacked for reproducibility. This has two major impacts:
This change might help for JENKINS-51854 and JENKINS-514910.