-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make as much of the worker code non-blocking as possible #2270
Conversation
In the meantime i will also deploy this to the staging machines. |
Discussed this PR with @Martchus today, and decided to try to rewrite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also discovered a few new potential risks that need to be assessed, like accidental parallel uploads.
We have
# ensure an ongoing timer is cancelled in case upload_results has been called manually
if (my $upload_results_timer = delete $self->{_upload_results_timer}) {
Mojo::IOLoop->remove($upload_results_timer);
}
right at the start of _upload_results
. So parallel uploads should not be triggered by the timer. Otherwise _upload_results
is only called when starting and stopping a job which should be safe as well. So I guess the current behavior already prevents us from having multiple parallel uploads. We just need to keep it that way.
Looking at the code it seems that inside the subprocesses no variables are mutated. That is instead done afterwards in the callback in the main process. That's how it is supposed to work but I'm wondering why you then need to use this "shared hash" in the worker job test.
But in general - and considering that it is still WIP and needs more testing - it looks good.
@Martchus The |
That makes sense. But you still need some way to trace what's happening inside the subprocesses. |
That is true... maybe it will stay after all. 😆 |
Codecov Report
@@ Coverage Diff @@
## master #2270 +/- ##
==========================================
- Coverage 86.09% 83.56% -2.54%
==========================================
Files 166 164 -2
Lines 10833 10791 -42
==========================================
- Hits 9327 9017 -310
- Misses 1506 1774 +268
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #2270 +/- ##
==========================================
- Coverage 86.78% 85.76% -1.02%
==========================================
Files 167 168 +1
Lines 10848 10870 +22
==========================================
- Hits 9414 9323 -91
- Misses 1434 1547 +113
Continue to review full report at Codecov.
|
47817da
to
1ff0cd7
Compare
ca1eca9
to
14207d7
Compare
This PR should be mostly complete now and ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying this myself I have some problems reported by the worker:
[warn] [pid:4460] Unable to query isotovideo status via "http://localhost:20113/735qvYDWTocdQPln/isotovideo/status": Connection refused
[debug] [pid:4460] REST-API call: POST http://localhost/api/v1/jobs/2642/status
[error] [pid:4460] 400 response: Got status update for job 2642 with unexpected worker ID 3 (expected no updates anymore, job is scheduled) (remaining tries: 2)
[error] [pid:4680] Websocket connection to http://localhost/api/v1/ws/3 finished by remote side with code 1006, no reason - trying again in 10 seconds
[debug] [pid:4460] Collected unknown process with pid 4680 and exit status: 0
However the job still continues, e.g. see
http://lord.arch.suse.de/tests/2642
any ideas?
@@ -187,7 +187,7 @@ sub engine_workit { | |||
session->on( | |||
collected_orphan => sub { | |||
my ($session, $p) = @_; | |||
log_info("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status, | |||
log_debug("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, when the log messages are "annoying" as you state, why make them more verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the solution recommended by @Martchus. To me the whole event doesn't make much sense, but it seems there is nobody left on the team who knows why this was turned into a log message in the first place... so not just outright removing it seems like the safest choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@okurz Why would that change make it more verbose? It in fact will silence the message unless debug messages are turned on. But we can also get completely rid of it and only log an error in case of non-zero exist status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if that event doesn't make sense at all (e.g. because a failure to run a subprocess would be logged elsewhere anyways) we should really just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it a debug message will not help to keep autoinst-log.txt clean which is already an unreadable mess. Now it looks like this:
[2019-09-02T12:49:26.0182 CEST] [info] [pid:18157] Collected unknown process with pid 27786 and exit status: 0
[2019-09-02T12:49:26.643 CEST] [debug] no match: 268.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:27.0271 CEST] [info] [pid:18157] Collected unknown process with pid 27787 and exit status: 0
[2019-09-02T12:49:27.645 CEST] [debug] no match: 267.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:28.0371 CEST] [info] [pid:18157] Collected unknown process with pid 27788 and exit status: 0
[2019-09-02T12:49:28.620 CEST] [debug] no change: 266.9s
[2019-09-02T12:49:29.0454 CEST] [info] [pid:18157] Collected unknown process with pid 27789 and exit status: 0
[2019-09-02T12:49:29.852 CEST] [debug] no match: 265.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:30.0568 CEST] [info] [pid:18157] Collected unknown process with pid 27790 and exit status: 0
[2019-09-02T12:49:30.620 CEST] [debug] no change: 264.9s
[2019-09-02T12:49:31.576 CEST] [debug] no change: 264.0s
[2019-09-02T12:49:31.0632 CEST] [info] [pid:18157] Collected unknown process with pid 27791 and exit status: 0
[2019-09-02T12:49:32.577 CEST] [debug] no change: 263.0s
[2019-09-02T12:49:32.0687 CEST] [info] [pid:18157] Collected unknown process with pid 27792 and exit status: 0
[2019-09-02T12:49:33.577 CEST] [debug] no change: 262.0s
[2019-09-02T12:49:33.0791 CEST] [info] [pid:18157] Collected unknown process with pid 27794 and exit status: 0
[2019-09-02T12:49:34.578 CEST] [debug] no change: 261.0s
[2019-09-02T12:49:34.0856 CEST] [info] [pid:18157] Collected unknown process with pid 27797 and exit status: 0
[2019-09-02T12:49:35.842 CEST] [debug] no match: 260.0s, best candidate: pkcon-finished-20180704 (0.00)
So I'd just remove that log line completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allright, that's 3 votes for remove then. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, remove it. Sorry, I got confused: Of course it's not making it "more verbose" but the opposite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because you have replaced the test I can not easily check myself so I would like to ask you to confirm:
- Does the live view still work?
- Does the developer mode still work?
- When you cancel a job midway (e.g. also after using developer mode) are all results gathered until then uploaded including the current test module during which the job was aborted (this last point might not have worked correctly even in before)?
- Are qemu snapshots still preserved when starting a worker with
--no-cleanup
? - Did you also try uploading qcow files (big files), rather than only the small results from the openposix ltp test case?
Overall looks nice and the results on http://openqa-staging-1.qa.suse.de/tests are already very promising :)
please also check the coverage target, you might want to adjust the target threshold if you are confident with the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To summarize the issues we've already been discussing:
- Web socket server error message
- autoinst-log.txt being spammed (I'd just get rid of the log line completely)
- New worker job tests spams test log with isotovideo log messages (see http://open.qa/docs/#_running_tests_of_openqa_itself#_logging_behavior)
I think 3. is not very important because the messages are very few and there are no confusing errors among them.
Tested your latest commit. It resolves 1. :-) |
if you reduce the log level to fix 3. and fix the conflict we are one step further :) |
… to log them as unexpected anymore
0ead988
to
9fc8435
Compare
…e can be useful for debugging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can approve the PR in the current state as the tests are promising but I would appreciate if you can check the open points #2270 (review) as follow-up.
The test coverage problem here is actually similar to the one from #2289. Because the subprocesses exit with A solution to change the codecov value might have to wait for a followup PR though, since i can't think of a quick hack at the moment. |
you might have misunderstood me. I wanted to actually suggest the "quick hack": In |
…ude subprocesses in stats
commit 439fff1 Merge: ae171c0 dd099b5 Author: Sebastian Riedel <sri@cpan.org> AuthorDate: Tue Sep 3 11:19:26 2019 +0200 Commit: GitHub <noreply@github.com> CommitDate: Tue Sep 3 11:19:26 2019 +0200 Merge pull request #2270 from kraih/async_thumbnails Make as much of the worker code non-blocking as possible
This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've cauht this.
This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've cauht this.
- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.
- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've cauht this.
- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too, and distinguishes between invalid and missing parameters. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too, and distinguishes between invalid and missing parameters. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075
This is a more complete alternative to #2268. It makes all uploads non-blocking and changes almost all continuation-passing style APIs to use the event loop to defer callback execution. So interruptions of worker status updates should be very minimal if everything works as expected.
Downside is that
24-worker-job.t
had to be disabled, since it depends on the worker not using the event loop correctly in very big parts. I can try to rewrite it once i understand what exactly the tests were meant to cover. But that could take some time.