Worker hitting `QEMU should be dead - WASUP?` and dying on job cancel
#530
Comments
duplicate of https://progress.opensuse.org/issues/12566
It's worth noting, I guess, that |
@coolo says "4d9a31f is the fix" and, indeed, that pretty much adds back a Would it be worth me sending a PR that just makes that change for now, so we don't have this bug on master while #524 is going through review? |
I prefer to get #524 in - the whole teardown is too complicated ;(
This seems to be happening again on current git master, even though the change that fixed it when applied in isolation was merged as part of #524 :(
Affected job: https://openqa.stg.fedoraproject.org/tests/28028
Log messages from the job:
Log messages from the worker host:
so I guess this looks slightly different in details...
The use of
and:
Ever since I updated Fedora openQA staging to recent openQA and os-autoinst git snapshots (fe19b00 for os-autoinst, a08377c for openQA), I've noticed that worker services sometimes seem to be dying. On closer inspection of the system logs, they seem to die sometimes when a job is cancelled (either `user_cancelled` or `parallel_failed`), hitting this point in openQA: https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Worker/Pool.pm#L57
So why am I filing on os-autoinst? Well, I'm not sure, but I'm guessing this may be due to this PR:
#523
Here's a sample case. https://openqa.stg.fedoraproject.org/tests/26770 is the job. Here are the log messages from the worker host:
Now here's an interesting bit from the job's autoinst log:
Note the worker code seems to be bailing because 29137 was still running, but os-autoinst killed 29132. I'm not entirely sure what that `got sigchld` is about either, but it comes from that same PR (#523). This may not be an issue - I don't entirely follow what's going on there, there's forking and all sorts of other wackiness that I haven't parsed out yet - but it seems at least odd.
It seems that, on one of our two worker hosts, I have the following cases:
https://openqa.stg.fedoraproject.org/tests/26283
https://openqa.stg.fedoraproject.org/tests/26312
https://openqa.stg.fedoraproject.org/tests/26313
https://openqa.stg.fedoraproject.org/tests/26322
https://openqa.stg.fedoraproject.org/tests/26329
https://openqa.stg.fedoraproject.org/tests/26330
All of those may have happened, I think, while the database was still a bit messed up from os-autoinst/openQA#762. But the following are from after that:
https://openqa.stg.fedoraproject.org/tests/26568 (job ran beyond MAX_JOB_TIME, which I didn't even know was a thing)
https://openqa.stg.fedoraproject.org/tests/26614 (user_restarted)
https://openqa.stg.fedoraproject.org/tests/26657 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/26770 (user_cancelled)
https://openqa.stg.fedoraproject.org/tests/26967 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27062 (parallel_failed)
https://openqa.stg.fedoraproject.org/tests/27306 (user_restarted)
It seems like every time the message `setting job [job ID] to incomplete (cancel)` is present, this bug happens, but that's only a sample size of 4, so it could be a coincidence, I guess.
There are some other cases on the other worker host; I can link those too if it's helpful. The patterns seem more or less the same.
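To make the PID observation above concrete (the worker bailing because 29137 was still running while os-autoinst killed 29132), here is a minimal Python sketch. It is not the openQA or os-autoinst code, which is Perl; it only assumes that the worker-side check amounts to "is the recorded PID still alive?". The names `pid_alive` and `demo` are hypothetical, and the two `sleep` processes are stand-ins for the two PIDs in the log. POSIX only.

```python
import os
import subprocess
import sys


def pid_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def demo():
    """Kill one stand-in process and report liveness of both.

    Hypothetical illustration: 'killed' plays the role of PID 29132,
    'other' plays the role of PID 29137.
    """
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    killed = subprocess.Popen(sleeper)
    other = subprocess.Popen(sleeper)
    try:
        killed.terminate()
        killed.wait()  # reap it; an unreaped zombie would still look alive
        return pid_alive(killed.pid), pid_alive(other.pid)
    finally:
        other.kill()
        other.wait()


if __name__ == "__main__":
    killed_alive, other_alive = demo()
    print("killed process alive:", killed_alive)
    print("other process alive:", other_alive)
```

If the teardown signals one PID while the liveness check looks at a different one, the check can still see a running process even though the kill itself succeeded, which would match the symptom described here. Whether that is what actually happens in this case is a guess.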