check_job: check if we really got a job hash, not an error #1410

Closed
wants to merge 1 commit

Conversation

AdamWill
Contributor

After updating an openQA instance to current git, I noticed that
many workers stopped picking up jobs. Looking at their logs, I
see a bunch of errors indicating that first the line modified
in this commit (the `$job->{id}` part) and then a line in
`websocket_commands` (where it does `$job->{URL}`) are trying to
use a string as a hash ref. In most cases the string was some
sort of DBus error, but in one case the string was *itself* an
error about trying to use a string as a hash ref (holy recursion,
batman).

It seems like what's going on is that, occasionally, when
`check_job` tries to grab a job, it gets back a response that is
not a proper job hash ref, but some kind of error string. We
can also look at the `grab_job` end of this problem and see if
we can stop it from doing this, but making this check more
robust should also help (it should cause the `else` clause to
kick in properly and make the worker ignore the response and
keep polling for a job on the next tick).
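The fix described above amounts to a type check before dereferencing. A minimal sketch, assuming hypothetical helper names (`grab_job_from_scheduler` and `start_job` are illustrative stand-ins, not the actual worker code):

```perl
# Illustrative sketch only: the real check_job lives in
# lib/OpenQA/Worker/Jobs.pm; the helper names here are made up.
use strict;
use warnings;

sub check_job {
    my $job = grab_job_from_scheduler();    # may return an error string

    # Guard: only treat the response as a job if it really is a hash
    # ref carrying an id. A DBus error arrives as a plain string, for
    # which ref() returns "".
    if (ref($job) eq 'HASH' && $job->{id}) {
        start_job($job);
    }
    else {
        # Discard the bad response so the worker polls again on the
        # next tick instead of getting stuck on the error string.
        $job = undef;
    }
}
```

With the guard in place, an error string falls through to the `else` branch, `$job` is cleared, and the worker keeps polling rather than repeatedly dereferencing the string.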
@AdamWill
Contributor Author

There's always one "can't use string as a HASH ref" error in Jobs.pm line 115 (`check_job`), then a whole series of the same error in Commands.pm line 64 (`websocket_commands`). This is because after the error in `check_job`, `$job` remains set to the string, so `websocket_commands` just keeps trying to use it, over and over. Here's what the errors look like:

Jul 31 19:12:04 qa09.qa.fedoraproject.org worker[16656]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("org.freedesktop.DBus.Error.Disco"...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 64.

Here's the variant where the string was something else:

Jul 30 00:00:00 qa07.qa.fedoraproject.org worker[24095]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("Can't use string ("10.5.124.157""...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 64.

The string is always truncated like that, so we can never see the whole thing.
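For context, this is just Perl's "strict refs" behaviour: dereferencing a plain string as a hash dies, while `ref()` lets you detect the situation first. A tiny standalone illustration (the error string is abbreviated; the real one is truncated in the logs as noted above):

```perl
use strict;
use warnings;

my $job = 'org.freedesktop.DBus.Error.Disconnected';

# A plain string is not a reference, so ref() returns the empty string.
print 'ref: "', ref($job), "\"\n";

# Under "strict refs", dereferencing it dies with:
#   Can't use string ("org.freedesktop.DBus.Error.Disc"...) as a HASH ref ...
my $id = eval { $job->{id} };
print "caught: $@" if $@;
```

This is exactly why the worker keeps emitting the same error on every tick as long as `$job` still holds the string.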

@AdamWill
Contributor Author

Oh, forgot to mention, @Vogtinator confirmed seeing this on his test instance too.

@codecov

codecov bot commented Jul 31, 2017

Codecov Report

Merging #1410 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1410      +/-   ##
==========================================
- Coverage   87.77%   87.73%   -0.04%     
==========================================
  Files         105      105              
  Lines        7588     7588              
==========================================
- Hits         6660     6657       -3     
- Misses        928      931       +3
| Impacted Files | Coverage Δ |
| --- | --- |
| `lib/OpenQA/Worker/Jobs.pm` | 83.91% <100%> (ø) ⬆️ |
| `lib/OpenQA/Worker/Commands.pm` | 76.92% <0%> (-4.62%) ⬇️ |

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c72a17...e760214.

@AdamWill
Contributor Author

AdamWill commented Jul 31, 2017

As to what caused this, there are a few changes that look potentially relevant. For one thing, the API job grab function (`grab` in `lib/OpenQA/WebAPI/Controller/API/V1/Job.pm`) was changed to do its work asynchronously:

c9efa04
92ab1e9

For another thing, the `job_grab` function in `lib/OpenQA/Scheduler/Scheduler.pm` that is called (via IPC) by the API `grab` function, and whose result is passed on to `check_job` and `websocket_commands`, has been fiddled with a bit recently (that is, since the April snapshot I was running before, without this issue):

328f642
85151ec

so tagging the person involved with these changes: @mudler

I've updated our staging instance with my proposed patch and restarted all worker services, so I'll keep an eye on whether they're all still taking jobs tomorrow...

@mudler
Member

mudler commented Aug 1, 2017

Efforts are being made to remove `check_job` completely (#1403); that work is still in progress and needs refinement. Your patch only stops the error from appearing on the client side, but the error is just a symptom of failures contacting the scheduler over DBus on the server side. This would work around the problem on the server side instead of the client: https://github.com/os-autoinst/openQA/pull/1399/commits

@coolo
Contributor

coolo commented Aug 1, 2017

yeah, master is broken with love :(

@AdamWill
Contributor Author

AdamWill commented Aug 1, 2017

Well, it's not just a cosmetic problem, because of the thing I noted about the (shared) variable `$job` staying set: once the error starts occurring, `$job` remains set to the error string until the process is terminated, at least as I understand it. The worker is effectively blocked and will never pick up a job. So my fix isn't just cosmetic; it actually prevents the worker process from becoming stuck, because the `else` clause will kick in, `$job` will be unset, and the worker will try again to grab a job on the next tick.

So far, at least, since I applied the patch, no workers seem to have become stuck. I'll check again tomorrow.

@coolo
Contributor

coolo commented Aug 1, 2017

if it helps you, we can merge it - otherwise I would revert master to the state before the async changes.

@AdamWill
Contributor Author

AdamWill commented Aug 1, 2017

Well, it's always a bit difficult to be 100% sure, of course. I think @mudler's closed PR would do the job equally well. Either change (or both) is justified, but if you'd rather focus on the `check_job` removal, I can easily just patch this downstream until then. So, I mean, I'm relatively easy.

@coolo
Contributor

coolo commented Aug 1, 2017

as we made good progress in the nopoll change today, I'm closing this one

@coolo coolo closed this Aug 1, 2017