Handle bad clients #466
Comments
Yeah, this would happen occasionally and was pretty annoying. I thought it would be far better to restart clients that automatically stopped than to go through and invalidate a huge number of jobs. This obviously isn't the most robust fix, but it would probably be straightforward to implement and would ease the annoyance of this relatively rare problem. Anyway, I never got around to implementing this, so there might be problems with this approach. |
@brianmoose - Are you looking for a job yet? 😄 |
Hah! Not looking as yet, just getting back from an Alaska trip. |
Awesome - I'm sure several of us will go have a beer with you. Looking
forward to hearing about your adventures.
…On Fri, Aug 16, 2019 at 11:38 AM brianmoose ***@***.***> wrote:
Hah! Not looking as yet, just getting back from an Alaska trip.
I will be in Idaho Falls in a week or two if anybody wants to get a 🍺!
|
So this is straight out of "fault tolerant systems" issues. It would be really nice to handle "faulty clients" at the server level. When one of our clients breaks in a way that it's still picking up tests but failing them immediately due to problems with keys or the file system, it can wreak havoc on the dashboard. A single client can be handed several jobs that it marks as failed (incorrectly). Solving this robustly means taking a vote from multiple clients (when this happens) so that the server can determine that there is a faulty client taking jobs but not actually performing the work, essentially "disabling" that client or not feeding it any more work.
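To make the voting idea concrete, here is a rough sketch, not existing server code. It assumes the server can rerun a failed job on other clients and collect the outcomes in a plain dictionary; the function name `find_suspect_clients`, the `results` layout, and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter


def find_suspect_clients(results, min_votes=3):
    """Return clients whose result disagreed with a strict majority.

    `results` is a hypothetical structure: job id -> {client name: passed}.
    Jobs with fewer than `min_votes` results, or with no strict majority,
    are ignored because they cannot single anyone out.
    """
    suspects = set()
    for votes in results.values():
        if len(votes) < min_votes:
            continue
        outcome, count = Counter(votes.values()).most_common(1)[0]
        if count * 2 <= len(votes):
            continue  # tie, no strict majority
        # Any client that disagrees with the majority outcome is suspect.
        suspects.update(c for c, passed in votes.items() if passed != outcome)
    return suspects


results = {
    "job1": {"client_a": False, "client_b": True, "client_c": True},
    "job2": {"client_a": False, "client_b": True},  # too few votes
    "job3": {"client_a": False, "client_b": False, "client_c": False},
}
suspects = find_suspect_clients(results)
```

The obvious cost is that every disputed job has to run on at least three clients, so this would probably only be triggered after a suspicious failure rather than for every job.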
Assuming we don't have malicious clients, we could probably take a less rigorous approach to rooting out bad actors. Perhaps we slow down eligibility for clients that perform work too fast (e.g. if they fail 2 or 3 jobs in quick succession, we put them on a cool-down for a while). Of course this won't fix the problem, but it might be a simple improvement. There is no easy path to fixing this problem robustly, but just handing out jobs like we do now is really bad in this situation.
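A minimal sketch of the cool-down idea, assuming the server knows how long each job ran and whether it failed. The class name `ClientThrottle` and every threshold below (3 failures, 30 seconds, 10 minutes) are placeholders chosen for illustration, not values from the project.

```python
import time
from collections import defaultdict


class ClientThrottle:
    """Put a client on cool-down after several fast failures in a row."""

    def __init__(self, max_fast_failures=3, fast_seconds=30.0,
                 cooldown_seconds=600.0, clock=time.monotonic):
        self.max_fast_failures = max_fast_failures
        self.fast_seconds = fast_seconds
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                 # injectable so it can be tested
        self._streak = defaultdict(int)    # client -> consecutive fast failures
        self._blocked_until = {}           # client -> time it becomes eligible

    def record_result(self, client, failed, duration_seconds):
        """Update the streak; a pass or a slow failure resets it."""
        if failed and duration_seconds < self.fast_seconds:
            self._streak[client] += 1
        else:
            self._streak[client] = 0
        if self._streak[client] >= self.max_fast_failures:
            self._blocked_until[client] = self.clock() + self.cooldown_seconds
            self._streak[client] = 0

    def is_eligible(self, client):
        """True if the client may be handed another job right now."""
        return self.clock() >= self._blocked_until.get(client, float("-inf"))
```

The server would call `record_result` whenever a job finishes and check `is_eligible` before handing out the next one. A client that is still broken after the cool-down will fail a few more jobs before being paused again, which matches the caveat above that this eases the problem without fixing it.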