check_job: check if we really got a job hash, not an error #1410

Closed
wants to merge 1 commit

Conversation

AdamWill
Contributor

After updating an openQA instance to current git, I noticed that
many workers stopped picking up jobs. Looking at their logs, I
see a bunch of errors indicating that first the line modified
in this commit (the `$job->{id}` part) and then a line in
`websocket_commands` (where it does `$job->{URL}`) are trying to
use a string as a hash ref. In most cases the string was some
sort of DBus error, but in one case the string was *itself* an
error about trying to use a string as a hash ref (holy recursion,
batman).

It seems like what's going on is that, occasionally, when
`check_job` tries to grab a job, it gets back a response that is
not a proper job hash ref, but some kind of error string. We
can also look at the `grab_job` end of this problem and see if
we can stop it from doing this, but making this check more
robust should also help (it should cause the `else` clause to
kick in properly and make the worker ignore the response and
keep polling for a job on the next tick).
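The fix described above amounts to a type check before dereferencing. A minimal sketch, assuming hypothetical helper names (`grab_job_from_scheduler` and `start_job` are illustrative stand-ins, not the actual worker code):

```perl
# Illustrative sketch only: the real check_job lives in
# lib/OpenQA/Worker/Jobs.pm; the helper names here are made up.
use strict;
use warnings;

sub check_job {
    my $job = grab_job_from_scheduler();    # may return an error string

    # Guard: only treat the response as a job if it really is a hash
    # ref carrying an id. A DBus error arrives as a plain string, for
    # which ref() returns "".
    if (ref($job) eq 'HASH' && $job->{id}) {
        start_job($job);
    }
    else {
        # Discard the bad response so the worker polls again on the
        # next tick instead of getting stuck on the error string.
        $job = undef;
    }
}
```

With the guard in place, an error string falls through to the `else` branch, `$job` is cleared, and the worker keeps polling rather than repeatedly dereferencing the string.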
@AdamWill
Contributor Author

There's always one "can't use string as a HASH ref" error in Jobs.pm line 115 (`check_job`), then a whole series of the same error in Commands.pm line 64 (`websocket_commands`). This is because after the error in `check_job`, `$job` remains set to the string, so `websocket_commands` just keeps trying to use it, over and over. Here's what the errors look like:

Jul 31 19:12:04 qa09.qa.fedoraproject.org worker[16656]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("org.freedesktop.DBus.Error.Disco"...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 64.

Here's the variant where the string was something else:

Jul 30 00:00:00 qa07.qa.fedoraproject.org worker[24095]: Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("Can't use string ("10.5.124.157""...) as a HASH ref while "strict refs" in use at /usr/share/openqa/script/../lib/OpenQA/Worker/Commands.pm line 64.

The string is always truncated like that, so we can never see the whole thing.
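For context, this is just Perl's "strict refs" behaviour: dereferencing a plain string as a hash dies, while `ref()` lets you detect the situation first. A tiny standalone illustration (the error string is abbreviated; the real one is truncated in the logs as noted above):

```perl
use strict;
use warnings;

my $job = 'org.freedesktop.DBus.Error.Disconnected';

# A plain string is not a reference, so ref() returns the empty string.
print 'ref: "', ref($job), "\"\n";

# Under "strict refs", dereferencing it dies with:
#   Can't use string ("org.freedesktop.DBus.Error.Disc"...) as a HASH ref ...
my $id = eval { $job->{id} };
print "caught: $@" if $@;
```

This is exactly why the worker keeps emitting the same error on every tick as long as `$job` still holds the string.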

@AdamWill
Contributor Author

Oh, forgot to mention, @Vogtinator confirmed seeing this on his test instance too.

@codecov

codecov bot commented Jul 31, 2017

Codecov Report

Merging #1410 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1410      +/-   ##
==========================================
- Coverage   87.77%   87.73%   -0.04%     
==========================================
  Files         105      105              
  Lines        7588     7588              
==========================================
- Hits         6660     6657       -3     
- Misses        928      931       +3
| Impacted Files | Coverage Δ |
| --- | --- |
| `lib/OpenQA/Worker/Jobs.pm` | 83.91% <100%> (ø) ⬆️ |
| `lib/OpenQA/Worker/Commands.pm` | 76.92% <0%> (-4.62%) ⬇️ |

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c72a17...e760214.

@AdamWill
Contributor Author

AdamWill commented Jul 31, 2017

As to what caused this, there are a few changes that look potentially relevant. For one thing, the API job grab function (`grab` in `lib/OpenQA/WebAPI/Controller/API/V1/Job.pm`) was changed to do its work asynchronously:

c9efa04
92ab1e9

For another thing, the `job_grab` function in `lib/OpenQA/Scheduler/Scheduler.pm` that is called (via IPC) by the API `grab` function, and whose result is passed on to `check_job` and `websocket_commands`, has been fiddled with a bit recently (that is, since the April snapshot I was running before, without this issue):

328f642
85151ec

so tagging the person involved with these changes: @mudler

I've updated our staging instance with my proposed patch and restarted all worker services, so I'll keep an eye on whether they're all still taking jobs tomorrow...

@mudler
Member

mudler commented Aug 1, 2017

Efforts are being made to remove `check_job` completely (#1403); that work is still in progress and needs refinement. Your patch only stops the error from appearing on the client side, but the error is just a symptom of failures contacting the scheduler over DBus on the server side. This would work around the problem on the server side instead of the client: https://github.com/os-autoinst/openQA/pull/1399/commits

@coolo
Contributor

coolo commented Aug 1, 2017

yeah, master is broken with love :(

@AdamWill
Contributor Author

AdamWill commented Aug 1, 2017

Well, it's not just a cosmetic problem, because of the thing I noted about the (shared) variable `$job` staying set: once the error starts occurring, `$job` remains set to the error string until the process is terminated, at least as I understand it. The worker is effectively blocked and will never pick up a job. So my fix isn't just cosmetic; it actually prevents the worker process from becoming stuck, because the `else` clause will kick in, `$job` will be unset, and the worker will try again to grab a job on the next tick.

So far, at least, since I applied the patch, no workers seem to have become stuck. I'll check again tomorrow.

@coolo
Contributor

coolo commented Aug 1, 2017

if it helps you, we can merge it - otherwise I would revert master to the state before the async changes.

@AdamWill
Contributor Author

AdamWill commented Aug 1, 2017

Well, it's always a bit difficult to be 100% sure, of course. I think @mudler's closed PR would do the job equally well. Either change (or both) is justified, but if you'd rather focus on the `check_job` removal, I can easily just patch this downstream until then. So, I mean, I'm relatively easy.

@coolo
Contributor

coolo commented Aug 1, 2017

as we made good progress in the nopoll change today, I'm closing this one

@coolo coolo closed this Aug 1, 2017