Handle bad clients #466

Open
permcody opened this issue Aug 1, 2019 · 4 comments

Comments

@permcody
Member

permcody commented Aug 1, 2019

So this is straight out of "fault-tolerant systems" territory. It would be really nice to handle "faulty clients" at the server level. Occasionally one of our clients breaks in a way where it is still picking up tests but failing them immediately due to problems with keys or the file system. Such a client can wreak havoc on the dashboard: a single client can be handed several jobs that it then (incorrectly) marks as failed. Solving this robustly means taking a vote from multiple clients (when this happens) so that the server can determine that there is a faulty client taking jobs but not actually performing the work, and essentially "disable" it, i.e. not feed that client any more work.
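For illustration only, here is a minimal server-side sketch of that voting idea (this is not CIVET code; the class, method names, and quorum value are all hypothetical): a job that one client failed gets rescheduled to other clients, and once enough of them pass the same job, the original client is treated as faulty and stops receiving work.

```python
from collections import defaultdict

QUORUM = 2  # hypothetical: passes by other clients needed to flag a suspect


class FaultyClientDetector:
    """Tracks jobs a client failed that other clients later passed."""

    def __init__(self, quorum=QUORUM):
        self.quorum = quorum
        self.failed_by = {}                   # job_id -> first client that reported a failure
        self.contradicted = defaultdict(set)  # client -> job_ids other clients passed afterwards
        self.disabled = set()

    def record_result(self, client, job_id, passed):
        if not passed:
            # remember who failed the job; the server would reschedule it to someone else here
            self.failed_by.setdefault(job_id, client)
            return
        suspect = self.failed_by.get(job_id)
        if suspect is not None and suspect != client:
            # another client ran the same job and it passed: one vote against the suspect
            self.contradicted[suspect].add(job_id)
            if len(self.contradicted[suspect]) >= self.quorum:
                self.disabled.add(suspect)

    def is_eligible(self, client):
        # the scheduler would skip disabled clients when handing out jobs
        return client not in self.disabled
```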

Assuming we don't have malicious clients, we could probably take a less rigorous approach to rooting out bad actors. Perhaps we slow down eligibility for clients that finish work too fast (e.g. if they fail 2 or 3 jobs in quick succession, we put them on a cool-down for "a while"). Of course this won't fix the problem, but it might be a simple improvement. There is no easy path to fixing this robustly, but just handing out jobs like we do now is really bad in this situation.
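A rough sketch of that cool-down, with purely hypothetical thresholds (3 failures inside a minute earns a 30-minute timeout):

```python
import time
from collections import defaultdict, deque

# hypothetical thresholds
FAIL_LIMIT = 3          # failures in the window that trigger a cool-down
FAIL_WINDOW = 60.0      # sliding window in seconds
COOLDOWN = 30 * 60      # seconds the client is ineligible for new jobs


class ClientCooldown:
    """Slows down clients that fail jobs suspiciously fast."""

    def __init__(self):
        self.recent_failures = defaultdict(deque)  # client -> timestamps of recent failures
        self.cooldown_until = {}                   # client -> time it may take jobs again

    def record_failure(self, client, now=None):
        now = now or time.time()
        window = self.recent_failures[client]
        window.append(now)
        # drop failures that fell out of the sliding window
        while window and now - window[0] > FAIL_WINDOW:
            window.popleft()
        if len(window) >= FAIL_LIMIT:
            self.cooldown_until[client] = now + COOLDOWN
            window.clear()

    def is_eligible(self, client, now=None):
        now = now or time.time()
        return now >= self.cooldown_until.get(client, 0.0)
```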

@brianmoose
Contributor

Yeah, this would happen occasionally and was pretty annoying.
I was not a big fan of keeping state on the server, since it would need to be stored in the database and constantly checked. I was going to go with your assumption that we don't have malicious clients, so the clients can take care of themselves.
My pseudo-plan was to mark most of the initial git operations as "No Fail", since that is where most of these sorts of problems happen (network, keys, git server down, etc.). I think the only git operation that should be a real failure is if the merge fails. The "No Fail" operations would exit with a certain code, or touch a file in the filesystem (if the git exit code is wanted). The client would check this, and if something marked "No Fail" failed, the client would add that job ID to a list of jobs not to ask for, then tell the server to reschedule the job. This list of failed jobs would be held only on the client, and would probably be specific to each git server; if GitHub goes down, it shouldn't affect doing GitLab jobs. If too many jobs get added to the list, the client stops doing jobs for that server. There could also be a timeout for each entry (i.e. if a job gets a "No Fail" but then starts working again, the initial "No Fail" expires).
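A client-side sketch of that bookkeeping, with every name, exit code, and threshold invented for illustration: the client keeps a per-git-server list of jobs that hit a "No Fail" step, expires stale entries, and stops asking that server for work once the list grows too long.

```python
import time

# hypothetical values
NO_FAIL_EXIT_CODE = 86    # special exit code marking a "No Fail" step failure
NO_FAIL_TTL = 60 * 60     # seconds before a "No Fail" entry expires
MAX_NO_FAIL = 5           # entries per git server before the client stops asking for work


class NoFailTracker:
    """Client-side list of jobs that hit a 'No Fail' step, kept per git server."""

    def __init__(self):
        self.skipped = {}  # git server -> {job_id: time the 'No Fail' was recorded}

    def _live_entries(self, server, now):
        jobs = self.skipped.setdefault(server, {})
        for job_id, stamp in list(jobs.items()):
            if now - stamp > NO_FAIL_TTL:
                del jobs[job_id]  # stale entry: the job started working again or can be retried
        return jobs

    def record_step_exit(self, server, job_id, exit_code, now=None):
        """Called after a step; returns True if the job should be rescheduled, not failed."""
        now = now or time.time()
        if exit_code == NO_FAIL_EXIT_CODE:
            self._live_entries(server, now)[job_id] = now
            return True  # ask the server to reschedule instead of marking the job failed
        return False

    def should_request_work(self, server, now=None):
        """Stop asking this git server for jobs once too many 'No Fail' steps pile up."""
        now = now or time.time()
        return len(self._live_entries(server, now)) < MAX_NO_FAIL
```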

I thought it would be far better to restart clients that stopped themselves automatically than to go through and invalidate a huge number of jobs. This obviously isn't the most robust fix, but it would probably be straightforward to implement and would ease the annoyance of this relatively rare problem.

Anyway, I obviously never got around to implementing this, so there might be problems with the approach.

@permcody
Member Author

@brianmoose - Are you looking for a job yet? 😄
We are hiring, and yours would come with a decent raise.

@brianmoose
Contributor

Hah! Not looking as yet, just getting back from an Alaska trip.
I will be in Idaho Falls in a week or two if anybody wants to get a 🍺!

@permcody
Member Author

permcody commented Aug 16, 2019 via email
