Worker hitting QEMU should be dead - WASUP? and dying on job cancel #530

Closed
AdamWill opened this issue Jul 5, 2016 · 7 comments

AdamWill commented Jul 5, 2016

Ever since I updated Fedora openQA staging to recent openQA and os-autoinst git snapshots (fe19b00 for os-autoinst, a08377c for openQA), I've noticed that worker services sometimes die. On closer inspection of the system logs, they appear to die when a job is cancelled (either user_cancelled or parallel_failed), hitting this point in openQA:

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/Pool.pm#L57
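
For context, here's my reading of that check as a runnable sketch; the sub name and details are my paraphrase, not the literal Pool.pm code. After a job ends, the worker looks up the QEMU PID recorded in the pool directory and bails out hard if /proc shows that PID still running a qemu binary:

#!/usr/bin/perl
use strict;
use warnings;

# Paraphrased reconstruction of the worker's post-job sanity check,
# not the literal Pool.pm code: read the QEMU PID recorded in the
# pool directory and die if /proc says it is still a qemu binary.
sub check_qemu_dead {
    my ($pooldir) = @_;
    open(my $fh, '<', "$pooldir/qemu.pid") or return;  # no PID file, nothing to check
    chomp(my $pid = <$fh> // '');
    close($fh);
    return unless $pid;
    my $link = readlink("/proc/$pid/exe") // return;   # PID gone: all good
    die "QEMU ($pid -> $link) should be dead - WASUP?\n"
        if $link =~ m{qemu-(?:kvm|system)};
}

check_qemu_dead('/var/lib/openqa/pool/2');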

So why am I filing this against os-autoinst? I'm not sure, but I'm guessing it may be due to this PR:

#523

Here's a sample case. https://openqa.stg.fedoraproject.org/tests/26770 is the job. Here are the log messages from the worker host:

Jul 01 00:01:18 qa06.qa.fedoraproject.org worker[21937]: killing 29128
Jul 01 00:01:18 qa06.qa.fedoraproject.org worker[21937]: can't open /var/lib/openqa/pool/2/testresults/result-upgrade_run.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 529.
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: cleaning up 00026770-fedora-Rawhide-universal-x86_64-BuildFedora-Rawhide-20160630.n.0-upgrade_minimal_64bit@64bit...
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: setting job 26770 to incomplete (cancel)
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: QEMU (29137 -> /usr/bin/qemu-system-x86_64) should be dead - WASUP?
Jul 01 00:01:27 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@2.service: Main process exited, code=exited, status=1/FAILURE

Now here's an interesting bit from the job's autoinst log:

00:01:18.0239 29128: killing backend process 29132
00:01:18.0338 29128: got sigchld
00:01:18.0339 29128: done with backend process

Note that the worker code seems to be bailing because 29137 was still running, but os-autoinst killed 29132. I'm not entirely sure what that "got sigchld" is about either, but it comes from that same PR (#523). This may not be an issue (I don't entirely follow what's going on there; there's forking and all sorts of other wackiness I haven't parsed out yet), but it seems at least odd.
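
To illustrate the PID relationship I suspect is at play, here's a purely illustrative script (none of this is os-autoinst code): killing the middle "backend" process does nothing to the "qemu" grandchild it already spawned:

#!/usr/bin/perl
use strict;
use warnings;

# Purely illustrative, not os-autoinst code: a parent forks a "backend"
# process, which forks its own "qemu" child. Killing the backend's PID
# (like "killing backend process 29132") does nothing to the grandchild,
# which lives on with a different PID (like 29137).
my $backend = fork() // die "fork failed: $!";
if ($backend == 0) {
    my $qemu = fork() // die "fork failed: $!";
    if ($qemu == 0) {
        sleep 5;    # stand-in for qemu-system-x86_64 still running
        exit 0;
    }
    print "backend $$ spawned 'qemu' $qemu\n";
    sleep 60;       # the backend would normally wait for / reap its child
    exit 0;
}
sleep 1;                   # let both children start
kill 'TERM', $backend;     # kill only the backend...
waitpid($backend, 0);
print "backend $backend reaped; the 'qemu' grandchild is still alive\n";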

It seems that, on one of our two worker hosts, I have the following cases:

https://openqa.stg.fedoraproject.org/tests/26283
https://openqa.stg.fedoraproject.org/tests/26312
https://openqa.stg.fedoraproject.org/tests/26313
https://openqa.stg.fedoraproject.org/tests/26322
https://openqa.stg.fedoraproject.org/tests/26329
https://openqa.stg.fedoraproject.org/tests/26330

All of those, I think, may have occurred while the database was still a bit messed up from os-autoinst/openQA#762. But the following are from after that:

https://openqa.stg.fedoraproject.org/tests/26568 (job ran beyond MAX_JOB_TIME, which I didn't even know was a thing)
https://openqa.stg.fedoraproject.org/tests/26614 (user_restarted)
https://openqa.stg.fedoraproject.org/tests/26657 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/26770 (user_cancelled)
https://openqa.stg.fedoraproject.org/tests/26967 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27062 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27306 (user_restarted)

It seems like every time the message "setting job [job ID] to incomplete (cancel)" is present, this bug happens, but that's only a sample size of 4, so it could be a coincidence, I guess.

There are some other cases on the other worker host; I can link those too if it's helpful. The patterns seem more or less the same.

okurz commented Jul 5, 2016

duplicate of https://progress.opensuse.org/issues/12566

AdamWill commented Jul 5, 2016

It's worth noting, I guess, that kill_qemu() (which is what the pre-#523 code's stop_vm() would have called) does a lot of waiting and cleaning up of child PIDs, while the new code from #523 doesn't seem to take quite as much care.
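
Roughly the kind of care I mean, as a generic sketch (this is not the actual kill_qemu() source): send SIGTERM, poll for the child with waitpid(), and escalate to SIGKILL if it lingers, so the process is actually reaped:

#!/usr/bin/perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

# Generic sketch, not the actual kill_qemu() source: terminate a child
# PID and really wait for it, escalating to SIGKILL if it ignores TERM.
sub kill_and_reap {
    my ($pid, $timeout) = @_;
    kill 'TERM', $pid;
    for (1 .. $timeout) {
        my $reaped = waitpid($pid, WNOHANG);
        return if $reaped == $pid or $reaped == -1;  # reaped, or not our child
        sleep 1;
    }
    kill 'KILL', $pid;  # it ignored SIGTERM; force the issue
    waitpid($pid, 0);   # reap it so nothing is left behind
}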

AdamWill commented Jul 5, 2016

@coolo says "4d9a31f is the fix" and, indeed, that commit pretty much adds back a bmwqemu::stop_vm() call that #523 got rid of. The other two changes are not relevant without the rest of #524, but I tested a patch which simply re-adds the bmwqemu::stop_vm() line, and it does indeed seem to fix this.
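
In diff terms, the patch I tested amounts to restoring this one call in the teardown path (placement paraphrased from my patch, not the exact hunk from 4d9a31f):

# make the backend shut QEMU down before the worker's pool check runs
bmwqemu::stop_vm();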

Would it be worth me sending a PR that just makes that change for now, so we don't have this bug on master while #524 is going through review?

coolo commented Jul 5, 2016

I prefer to get #524 in - the whole teardown is too complicated ;(

AdamWill commented Jul 9, 2016

This seems to be happening again on current git master, even though the change that fixed it (when applied in isolation) was merged as part of #524 :(

Affected job: https://openqa.stg.fedoraproject.org/tests/28028

Log messages from the job:

11:53:51.0978 30389 >>> testapi::_check_backend_response: found desktop_clean, similarity 1.00 @ 3/739
11:53:51.0988 30389 ||| finished _graphical_wait_login tests at 2016-07-09 11:53:51 (200 s)
11:53:51.0991 30389 Creating a VM snapshot lastgood
11:54:01.1243 30391 SAVED lastgood savevm lastgood
11:54:01.1252 30384 sysread failed: 
11:54:01.1272 30384 30384: got sigchld
11:54:01.1274 30384 THERE IS NOTHING TO READ 12 12 11
Can't kill('SIGTERM', '30389'): No such process at /usr/bin/isotovideo line 345
+++ worker notes +++
end time: 2016-07-09 11:54:01
result: died
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

Log messages from the worker host:

Jul 09 11:36:18 qa06.qa.fedoraproject.org worker[10441]: 30384: WORKING 28028
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: child 30384 died with exit status 768
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: killing 30384
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: waitpid returned error: No child processes
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: cleaning up 00028028-fedora-Rawhide-KDE-live-iso-i386-BuildFedora-Rawhide-20160709.n.0-install_default@64bit...
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: duplicating job 28028
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: QEMU (30396 -> /usr/bin/qemu-system-x86_64) should be dead - WASUP?
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Unit entered failed state.
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Failed with result 'exit-code'.

So I guess this looks slightly different in its details...

AdamWill commented Jul 9, 2016

The use of autodie in isotovideo apparently changes the behaviour of kill(): it makes the script die if the target PID doesn't exist. Compare the behaviour of these two test scripts:

use autodie qw(:all);

# under autodie, kill() dies if it signals fewer processes than asked;
# PID 99999 doesn't exist, so this dies with
# "Can't kill('SIGTERM', 99999): No such process"
kill('SIGTERM', 99999);
print "foo";    # never reached

and:

# without autodie, kill() simply returns 0 for a nonexistent PID
kill('SIGTERM', 99999);
print "foo";    # prints "foo"

AdamWill commented Jul 9, 2016

The use of autodie was intentionally added by 8a15617. I think I'll file a new issue for this, as it's clearly something different from the first case: #535. This one I think we can call closed.
