Worker hitting QEMU should be dead - WASUP? and dying on job cancel #530

Closed
AdamWill opened this issue Jul 5, 2016 · 7 comments

AdamWill commented Jul 5, 2016

Ever since I updated Fedora openQA staging to recent openQA and os-autoinst git snapshots (fe19b00 for os-autoinst, a08377c for openQA), I've noticed that worker services sometimes die. On closer inspection of the system logs, they appear to die when a job is cancelled (either user_cancelled or parallel_failed), hitting this point in openQA:

https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/Pool.pm#L57
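
For context, here's my reading of that check as a runnable sketch; the sub name and details are my paraphrase, not the literal Pool.pm code. After a job ends, the worker looks up the QEMU PID recorded in the pool directory and bails out hard if /proc shows that PID still running a qemu binary:

#!/usr/bin/perl
use strict;
use warnings;

# Paraphrased reconstruction of the worker's post-job sanity check,
# not the literal Pool.pm code: read the QEMU PID recorded in the
# pool directory and die if /proc says it is still a qemu binary.
sub check_qemu_dead {
    my ($pooldir) = @_;
    open(my $fh, '<', "$pooldir/qemu.pid") or return;  # no PID file, nothing to check
    chomp(my $pid = <$fh> // '');
    close($fh);
    return unless $pid;
    my $link = readlink("/proc/$pid/exe") // return;   # PID gone: all good
    die "QEMU ($pid -> $link) should be dead - WASUP?\n"
        if $link =~ m{qemu-(?:kvm|system)};
}

check_qemu_dead('/var/lib/openqa/pool/2');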

So why am I filing this against os-autoinst? I'm not sure, but I'm guessing it may be due to this PR:

#523

Here's a sample case. https://openqa.stg.fedoraproject.org/tests/26770 is the job. Here are the log messages from the worker host:

Jul 01 00:01:18 qa06.qa.fedoraproject.org worker[21937]: killing 29128
Jul 01 00:01:18 qa06.qa.fedoraproject.org worker[21937]: can't open /var/lib/openqa/pool/2/testresults/result-upgrade_run.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 529.
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: cleaning up 00026770-fedora-Rawhide-universal-x86_64-BuildFedora-Rawhide-20160630.n.0-upgrade_minimal_64bit@64bit...
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: setting job 26770 to incomplete (cancel)
Jul 01 00:01:24 qa06.qa.fedoraproject.org worker[21937]: QEMU (29137 -> /usr/bin/qemu-system-x86_64) should be dead - WASUP?
Jul 01 00:01:27 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@2.service: Main process exited, code=exited, status=1/FAILURE

Now here's an interesting bit from the job's autoinst log:

00:01:18.0239 29128: killing backend process 29132
00:01:18.0338 29128: got sigchld
00:01:18.0339 29128: done with backend process

Note that the worker code seems to be bailing because 29137 was still running, but os-autoinst killed 29132. I'm not entirely sure what that "got sigchld" is about either, but it comes from that same PR (#523). This may not be an issue (I don't entirely follow what's going on there; there's forking and all sorts of other wackiness I haven't parsed out yet), but it seems at least odd.
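
To illustrate the PID relationship I suspect is at play, here's a purely illustrative script (none of this is os-autoinst code): killing the middle "backend" process does nothing to the "qemu" grandchild it already spawned:

#!/usr/bin/perl
use strict;
use warnings;

# Purely illustrative, not os-autoinst code: a parent forks a "backend"
# process, which forks its own "qemu" child. Killing the backend's PID
# (like "killing backend process 29132") does nothing to the grandchild,
# which lives on with a different PID (like 29137).
my $backend = fork() // die "fork failed: $!";
if ($backend == 0) {
    my $qemu = fork() // die "fork failed: $!";
    if ($qemu == 0) {
        sleep 5;    # stand-in for qemu-system-x86_64 still running
        exit 0;
    }
    print "backend $$ spawned 'qemu' $qemu\n";
    sleep 60;       # the backend would normally wait for / reap its child
    exit 0;
}
sleep 1;                   # let both children start
kill 'TERM', $backend;     # kill only the backend...
waitpid($backend, 0);
print "backend $backend reaped; the 'qemu' grandchild is still alive\n";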

It seems that, on one of our two worker hosts, I have the following cases:

https://openqa.stg.fedoraproject.org/tests/26283
https://openqa.stg.fedoraproject.org/tests/26312
https://openqa.stg.fedoraproject.org/tests/26313
https://openqa.stg.fedoraproject.org/tests/26322
https://openqa.stg.fedoraproject.org/tests/26329
https://openqa.stg.fedoraproject.org/tests/26330

All of those, I think, may have occurred while the database was still a bit messed up from os-autoinst/openQA#762. But the following are from after that:

https://openqa.stg.fedoraproject.org/tests/26568 (job ran beyond MAX_JOB_TIME, which I didn't even know was a thing)
https://openqa.stg.fedoraproject.org/tests/26614 (user_restarted)
https://openqa.stg.fedoraproject.org/tests/26657 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/26770 (user_cancelled)
https://openqa.stg.fedoraproject.org/tests/26967 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27062 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27306 (user_restarted)

It seems like every time the message "setting job [job ID] to incomplete (cancel)" is present, this bug happens, but that's only a sample size of 4, so it could be a coincidence, I guess.

There are some other cases on the other worker host; I can link those too if it's helpful. The patterns seem more or less the same.

okurz commented Jul 5, 2016

duplicate of https://progress.opensuse.org/issues/12566

AdamWill commented Jul 5, 2016

It's worth noting, I guess, that kill_qemu() (which is what the pre-#523 code's stop_vm() would have called) does a lot of waiting and cleaning up of child PIDs, while the new code from #523 doesn't seem to take quite as much care.
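
Roughly the kind of care I mean, as a generic sketch (this is not the actual kill_qemu() source): send SIGTERM, poll for the child with waitpid(), and escalate to SIGKILL if it lingers, so the process is actually reaped:

#!/usr/bin/perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

# Generic sketch, not the actual kill_qemu() source: terminate a child
# PID and really wait for it, escalating to SIGKILL if it ignores TERM.
sub kill_and_reap {
    my ($pid, $timeout) = @_;
    kill 'TERM', $pid;
    for (1 .. $timeout) {
        my $reaped = waitpid($pid, WNOHANG);
        return if $reaped == $pid or $reaped == -1;  # reaped, or not our child
        sleep 1;
    }
    kill 'KILL', $pid;  # it ignored SIGTERM; force the issue
    waitpid($pid, 0);   # reap it so nothing is left behind
}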

AdamWill commented Jul 5, 2016

@coolo says "4d9a31f is the fix" and, indeed, that commit pretty much adds back a bmwqemu::stop_vm() call that #523 got rid of. The other two changes are not relevant without the rest of #524, but I tested a patch which simply re-adds the bmwqemu::stop_vm() line, and it does indeed seem to fix this.
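
In diff terms, the patch I tested amounts to restoring this one call in the teardown path (placement paraphrased from my patch, not the exact hunk from 4d9a31f):

# make the backend shut QEMU down before the worker's pool check runs
bmwqemu::stop_vm();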

Would it be worth me sending a PR that just makes that change for now, so we don't have this bug on master while #524 is going through review?

coolo commented Jul 5, 2016

I prefer to get #524 in - the whole teardown is too complicated ;(

AdamWill commented Jul 9, 2016

This seems to be happening again on current git master, even though the change that fixed it (when applied in isolation) was merged as part of #524 :(

Affected job: https://openqa.stg.fedoraproject.org/tests/28028

Log messages from the job:

11:53:51.0978 30389 >>> testapi::_check_backend_response: found desktop_clean, similarity 1.00 @ 3/739
11:53:51.0988 30389 ||| finished _graphical_wait_login tests at 2016-07-09 11:53:51 (200 s)
11:53:51.0991 30389 Creating a VM snapshot lastgood
11:54:01.1243 30391 SAVED lastgood savevm lastgood
11:54:01.1252 30384 sysread failed: 
11:54:01.1272 30384 30384: got sigchld
11:54:01.1274 30384 THERE IS NOTHING TO READ 12 12 11
Can't kill('SIGTERM', '30389'): No such process at /usr/bin/isotovideo line 345
+++ worker notes +++
end time: 2016-07-09 11:54:01
result: died
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

Log messages from the worker host:

Jul 09 11:36:18 qa06.qa.fedoraproject.org worker[10441]: 30384: WORKING 28028
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: child 30384 died with exit status 768
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: killing 30384
Jul 09 11:54:01 qa06.qa.fedoraproject.org worker[10441]: waitpid returned error: No child processes
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: cleaning up 00028028-fedora-Rawhide-KDE-live-iso-i386-BuildFedora-Rawhide-20160709.n.0-install_default@64bit...
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: duplicating job 28028
Jul 09 11:59:06 qa06.qa.fedoraproject.org worker[10441]: QEMU (30396 -> /usr/bin/qemu-system-x86_64) should be dead - WASUP?
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Unit entered failed state.
Jul 09 11:59:07 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@4.service: Failed with result 'exit-code'.

So I guess this looks slightly different in its details...

AdamWill commented Jul 9, 2016

The use of autodie in isotovideo apparently changes the behaviour of kill(): it makes the script die if the target PID doesn't exist. Compare the behaviour of these two test scripts:

use autodie qw(:all);

# under autodie, kill() dies if it signals fewer processes than asked;
# PID 99999 doesn't exist, so this dies with
# "Can't kill('SIGTERM', 99999): No such process"
kill('SIGTERM', 99999);
print "foo";    # never reached

and:

# without autodie, kill() simply returns 0 for a nonexistent PID
kill('SIGTERM', 99999);
print "foo";    # prints "foo"

AdamWill commented Jul 9, 2016

The use of autodie was intentionally added by 8a15617. I think I'll file a new issue for this, as it's clearly something different from the first case: #535. This one I think we can call closed.
