check_job: check if we really got a job hash, not an error #1410
After updating an openQA instance to current git, I noticed that many workers stopped picking up jobs. Looking at their logs, I see a bunch of errors indicating that first the line modified in this commit (the `$job->{id}` part) and then a line in `websocket_commands` (where it does `$job->{URL}`) are trying to use a string as a hash ref. In most cases the string was some sort of DBus error, but in one case the string was *itself* an error about trying to use a string as a hash ref (holy recursion, batman). It seems like what's going on is that, occasionally, when `check_job` tries to grab a job, it gets back a response that is not a proper job hash ref, but some kind of error string. We can also look at the `grab_job` end of this problem and see if we can stop it from doing this, but making this check more robust should also help (it should cause the `else` clause to kick in properly and make the worker ignore the response and keep polling for a job on the next tick).
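A minimal sketch of the kind of check described above, not the actual openQA worker code: the helper names `is_valid_job` and `handle_grab_response` are hypothetical, and the return strings are only placeholders. It shows how testing `ref($job) eq 'HASH'` before touching `$job->{id}` lets an error string fall through to the `else` path instead of dying under `strict refs`.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the response is a hash ref with an id.
sub is_valid_job {
    my ($job) = @_;
    return (ref($job) eq 'HASH' && defined $job->{id}) ? 1 : 0;
}

# Hypothetical stand-in for the part of the worker that handles a grab response.
sub handle_grab_response {
    my ($job) = @_;
    if (is_valid_job($job)) {
        return "start job $job->{id}";
    }
    else {
        # Error strings, undef and other non-hash responses end up here,
        # so the worker can ignore them and poll again on the next tick.
        return 'ignore response and keep polling';
    }
}

print handle_grab_response({id => 42, URL => 'http://localhost/'}), "\n";
print handle_grab_response('some DBus error string'), "\n";
print handle_grab_response(undef), "\n";
```

Without the `ref` check, dereferencing the string response as `$job->{id}` is what raises the "can't use string as a HASH ref" error seen in the worker logs.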
There's always one "can't use string as a HASH ref" error in
Here's the variant where the string was something else:
The string is always truncated like that, so we can never see the whole thing.
Oh, forgot to mention, @Vogtinator confirmed seeing this on his test instance too.
Codecov Report
@@ Coverage Diff @@
## master #1410 +/- ##
==========================================
- Coverage 87.77% 87.73% -0.04%
==========================================
Files 105 105
Lines 7588 7588
==========================================
- Hits 6660 6657 -3
- Misses 928 931 +3
So as to what caused this, there are a few changes that look potentially relevant. For one thing, the API job grab function - For another thing, the so tagging the person involved with these changes: @mudler
I've updated our staging instance with my proposed patch and restarted all worker services, so I'll keep an eye on whether they're all still taking jobs tomorrow...
Efforts are being made to remove check_job completely ( #1403 ); that work is still in progress and needs refinement. Your patch only stops the error from appearing on the client side, but the error is just a symptom of failures contacting the scheduler over DBus on the server side. This would work around the problem on the server side instead of the client: https://github.com/os-autoinst/openQA/pull/1399/commits
yeah, master is broken with love :(
Well, it's not just a cosmetic problem, because of the thing I noted about the (shared) variable So far, at least, since I applied the patch, no workers seem to have become stuck. I'll check again tomorrow.
if it helps you, we can merge it - otherwise I would revert master to the state before the async changes.
Well, it's always a bit difficult to be 100% sure, of course. I think @mudler's closed PR would do the job equally well. I think either change (or both) is justified, but if you'd rather focus on the
as we made good progress in the nopoll change today, I'm closing this one |