Make as much of the worker code non-blocking as possible #2270

kraih · 2019-08-19T09:31:06Z

This is a more complete alternative to #2268. It makes all uploads non-blocking and changes almost all continuation-passing style APIs to use the event loop to defer callback execution. So interruptions of worker status updates should be very minimal if everything works as expected.

Downside is that 24-worker-job.t had to be disabled, since it depends on the worker not using the event loop correctly in very big parts. I can try to rewrite it once i understand what exactly the tests were meant to cover. But that could take some time.

kraih · 2019-08-19T09:31:25Z

In the meantime i will also deploy this to the staging machines.

kraih · 2019-08-19T13:20:34Z

Discussed this PR with @Martchus today, and decided to try to rewrite 24-worker-job.t, even if it takes some time. Investing into a clearer OpenQA::Worker::Job life cycle now should be worth it for future improvements. Also discovered a few new potential risks that need to be assessed, like accidental parallel uploads.

Martchus

Also discovered a few new potential risks that need to be assessed, like accidental parallel uploads.

We have

    # ensure an ongoing timer is cancelled in case upload_results has been called manually
    if (my $upload_results_timer = delete $self->{_upload_results_timer}) {
        Mojo::IOLoop->remove($upload_results_timer);
    }

right at the start of _upload_results. So parallel uploads should not be triggered by the timer. Otherwise _upload_results is only called when starting and stopping a job which should be safe as well. So I guess the current behavior already prevents us from having multiple parallel uploads. We just need to keep it that way.

Looking at the code it seems that inside the subprocesses no variables are mutated. That is instead done afterwards in the callback in the main process. That's how it is supposed to work but I'm wondering why you then need to use this "shared hash" in the worker job test.

But in general - and considering that it is still WIP and needs more testing - it looks good.

lib/OpenQA/Worker/Job.pm

kraih · 2019-08-19T16:08:14Z

@Martchus The shared_hash was a hack to keep 24-worker-job.t working when asset and log file uploading was changed to use a subprocess. It was needed because the test mocks _upload_log_file and _upload_asset to mutate an array used in the test to record the call history. It will hopefully go away when i rewrite the test.

Martchus · 2019-08-19T16:10:56Z

That makes sense. But you still need some way to trace what's happening inside the subprocesses.

kraih · 2019-08-19T16:19:08Z

That is true... maybe it will stay after all. 😆

codecov · 2019-08-20T13:49:22Z

Codecov Report

Merging #2270 into master will decrease coverage by 2.53%.
The diff coverage is 35.95%.

@@            Coverage Diff             @@
##           master    #2270      +/-   ##
==========================================
- Coverage   86.09%   83.56%   -2.54%     
==========================================
  Files         166      164       -2     
  Lines       10833    10791      -42     
==========================================
- Hits         9327     9017     -310     
- Misses       1506     1774     +268

Impacted Files	Coverage Δ
lib/OpenQA/Worker/Job.pm	`68.14% <35.95%> (-20.66%)`	⬇️
lib/OpenQA/WebAPI/Controller/File.pm	`37.5% <0%> (-54.17%)`	⬇️
lib/OpenQA/WebAPI/Controller/Session.pm	`35.82% <0%> (-23.89%)`	⬇️
lib/OpenQA/WebAPI/Plugin/REST.pm	`75% <0%> (-16.67%)`	⬇️
lib/OpenQA/WebAPI/Controller/Step.pm	`53.3% <0%> (-9.56%)`	⬇️
lib/OpenQA/Schema/Result/Users.pm	`87.5% <0%> (-6.25%)`	⬇️
lib/OpenQA/WebAPI/Auth/Fake.pm	`95.23% <0%> (-4.77%)`	⬇️
lib/OpenQA/Utils.pm	`88.35% <0%> (-4.2%)`	⬇️
lib/OpenQA/Setup.pm	`87.02% <0%> (-3.82%)`	⬇️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95c83e7...b8aa935. Read the comment docs.

codecov · 2019-08-20T13:49:22Z

Codecov Report

Merging #2270 into master will decrease coverage by 1.01%.
The diff coverage is 56.93%.

@@            Coverage Diff             @@
##           master    #2270      +/-   ##
==========================================
- Coverage   86.78%   85.76%   -1.02%     
==========================================
  Files         167      168       +1     
  Lines       10848    10870      +22     
==========================================
- Hits         9414     9323      -91     
- Misses       1434     1547     +113

Impacted Files	Coverage Δ
lib/OpenQA/Worker/Engines/isotovideo.pm	`94.58% <ø> (-0.08%)`	⬇️
lib/OpenQA/Worker/Isotovideo/Client.pm	`100% <100%> (ø)`
lib/OpenQA/Worker/WebUIConnection.pm	`81.33% <20%> (-3.09%)`	⬇️
lib/OpenQA/Worker/Job.pm	`70.54% <47.95%> (-18.25%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae171c0...dd099b5. Read the comment docs.

kraih · 2019-09-02T09:12:48Z

This PR should be mostly complete now and ready for review.

okurz

Trying this myself I have some problems reported by the worker:

[warn] [pid:4460] Unable to query isotovideo status via "http://localhost:20113/735qvYDWTocdQPln/isotovideo/status": Connection refused
[debug] [pid:4460] REST-API call: POST http://localhost/api/v1/jobs/2642/status
[error] [pid:4460] 400 response: Got status update for job 2642 with unexpected worker ID 3 (expected no updates anymore, job is scheduled) (remaining tries: 2)
[error] [pid:4680] Websocket connection to http://localhost/api/v1/ws/3 finished by remote side with code 1006, no reason - trying again in 10 seconds
[debug] [pid:4460] Collected unknown process with pid 4680 and exit status: 0

However the job still continues, e.g. see
http://lord.arch.suse.de/tests/2642

any ideas?

okurz · 2019-09-02T10:47:48Z

lib/OpenQA/Worker/Engines/isotovideo.pm

@@ -187,7 +187,7 @@ sub engine_workit {
    session->on(
        collected_orphan => sub {
            my ($session, $p) = @_;
-            log_info("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status,
+            log_debug("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status,


I don't understand, when the log messages are "annoying" as you state, why make them more verbose?

This was the solution recommended by @Martchus. To me the whole event doesn't make much sense, but it seems there is nobody left on the team who knows why this was turned into a log message in the first place... so not just outright removing it seems like the safest choice.

@okurz Why would that change make it more verbose? It in fact will silence the message unless debug messages are turned on. But we can also get completely rid of it and only log an error in case of non-zero exist status.

And if that event doesn't make sense at all (e.g. because a failure to run a subprocess would be logged elsewhere anyways) we should really just remove it.

Making it a debug message will not help to keep autoinst-log.txt clean which is already an unreadable mess. Now it looks like this:

[2019-09-02T12:49:26.0182 CEST] [info] [pid:18157] Collected unknown process with pid 27786 and exit status: 0 [2019-09-02T12:49:26.643 CEST] [debug] no match: 268.9s, best candidate: pkcon-finished-20180704 (0.00) [2019-09-02T12:49:27.0271 CEST] [info] [pid:18157] Collected unknown process with pid 27787 and exit status: 0 [2019-09-02T12:49:27.645 CEST] [debug] no match: 267.9s, best candidate: pkcon-finished-20180704 (0.00) [2019-09-02T12:49:28.0371 CEST] [info] [pid:18157] Collected unknown process with pid 27788 and exit status: 0 [2019-09-02T12:49:28.620 CEST] [debug] no change: 266.9s [2019-09-02T12:49:29.0454 CEST] [info] [pid:18157] Collected unknown process with pid 27789 and exit status: 0 [2019-09-02T12:49:29.852 CEST] [debug] no match: 265.9s, best candidate: pkcon-finished-20180704 (0.00) [2019-09-02T12:49:30.0568 CEST] [info] [pid:18157] Collected unknown process with pid 27790 and exit status: 0 [2019-09-02T12:49:30.620 CEST] [debug] no change: 264.9s [2019-09-02T12:49:31.576 CEST] [debug] no change: 264.0s [2019-09-02T12:49:31.0632 CEST] [info] [pid:18157] Collected unknown process with pid 27791 and exit status: 0 [2019-09-02T12:49:32.577 CEST] [debug] no change: 263.0s [2019-09-02T12:49:32.0687 CEST] [info] [pid:18157] Collected unknown process with pid 27792 and exit status: 0 [2019-09-02T12:49:33.577 CEST] [debug] no change: 262.0s [2019-09-02T12:49:33.0791 CEST] [info] [pid:18157] Collected unknown process with pid 27794 and exit status: 0 [2019-09-02T12:49:34.578 CEST] [debug] no change: 261.0s [2019-09-02T12:49:34.0856 CEST] [info] [pid:18157] Collected unknown process with pid 27797 and exit status: 0 [2019-09-02T12:49:35.842 CEST] [debug] no match: 260.0s, best candidate: pkcon-finished-20180704 (0.00)

So I'd just remove that log line completely.

Allright, that's 3 votes for remove then. 😉

sure, remove it. Sorry, I got confused: Of course it's not making it "more verbose" but the opposite.

okurz

because you have replaced the test I can not easily check myself so I would like to ask you to confirm:

Does the live view still work?
Does the developer mode still work?
When you cancel a job midway (e.g. also after using developer mode) are all results gathered until then uploaded including the current test module during which the job was aborted (this last point might not have worked correctly even in before)?
Are qemu snapshots still preserved when starting a worker with --no-cleanup?
Did you also try uploading qcow files (big files), rather than only the small results from the openposix ltp test case?

Overall looks nice and the results on http://openqa-staging-1.qa.suse.de/tests are already very promising :)

lib/OpenQA/Worker/Isotovideo/Client.pm

okurz · 2019-09-02T11:06:01Z

please also check the coverage target, you might want to adjust the target threshold if you are confident with the changes.

Martchus

To summarize the issues we've already been discussing:

Web socket server error message
autoinst-log.txt being spammed (I'd just get rid of the log line completely)
New worker job tests spams test log with isotovideo log messages (see http://open.qa/docs/#_running_tests_of_openqa_itself#_logging_behavior)

I think 3. is not very important because the messages are very few and there are no confusing errors among them.

Martchus · 2019-09-02T12:18:37Z

Tested your latest commit. It resolves 1. :-)

okurz · 2019-09-02T12:51:20Z

if you reduce the log level to fix 3. and fix the conflict we are one step further :)

… to log them as unexpected anymore

…e can be useful for debugging

okurz

I can approve the PR in the current state as the tests are promising but I would appreciate if you can check the open points #2270 (review) as follow-up.

kraih · 2019-09-02T16:45:20Z

The test coverage problem here is actually similar to the one from #2289. Because the subprocesses exit with POSIX::_exit there are no stats written for the forked processes and it looks like the whole block has no tests. The actual coverage of the new test should be similar or even a bit more than the old one though.

A solution to change the codecov value might have to wait for a followup PR though, since i can't think of a quick hack at the moment.

okurz · 2019-09-02T16:53:29Z

A solution to change the codecov value might have to wait for a followup PR though, since i can't think of a quick hack at the moment.

you might have misunderstood me. I wanted to actually suggest the "quick hack": In codecov.yml simply change the line target: 85.9 to e.g. target: 85.7 to get a green check mark for at least the "project" coverage :) As soon as we have the coverage increased again, e.g. with the POSIX::_exit change we can also bump the threshold in the file again

…ude subprocesses in stats

commit 439fff1 Merge: ae171c0 dd099b5 Author: Sebastian Riedel <sri@cpan.org> AuthorDate: Tue Sep 3 11:19:26 2019 +0200 Commit: GitHub <noreply@github.com> CommitDate: Tue Sep 3 11:19:26 2019 +0200 Merge pull request #2270 from kraih/async_thumbnails Make as much of the worker code non-blocking as possible

This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've cauht this.

- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.

- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've cauht this.

- Schema filename validation should be done in the controller, for that reason the tests are also updated. - With debugging enabled, unknown parameters result in an error. - $validation->param is used. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.

- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this.

- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075

- All params should be validated before use. - Schema filename validation should be done in the controller, for that reason the tests are also updated. - $validation->param is used, except params which are part of the path get validated through typed routes. - The `validation_error` helper now speaks JSON, too, and distinguishes between invalid and missing parameters. This came up in the context of os-autoinst#2270 where the re-use of the URL object caused unrelated parameters to be merged until it corrupted the response. Disallowing unknown parameters would've caught this. See https://progress.opensuse.org/issues/64075

kraih added the acceptance-tests-needed Needed for code that is required to be tested on a production-like environment label Aug 19, 2019

kraih mentioned this pull request Aug 19, 2019

Make asset and log file uploads non-blocking #2268

Closed

Martchus requested changes Aug 19, 2019

View reviewed changes

lib/OpenQA/Worker/Job.pm Outdated Show resolved Hide resolved

lib/OpenQA/Worker/Job.pm Show resolved Hide resolved

kraih force-pushed the async_thumbnails branch 5 times, most recently from 47817da to 1ff0cd7 Compare August 28, 2019 15:20

kraih force-pushed the async_thumbnails branch 3 times, most recently from ca1eca9 to 14207d7 Compare August 30, 2019 14:43

kraih changed the title ~~[WIP] Make as much of the worker code non-blocking as possible~~ Make as much of the worker code non-blocking as possible Sep 2, 2019

kraih requested review from Martchus and okurz September 2, 2019 09:08

okurz reviewed Sep 2, 2019

View reviewed changes

lib/OpenQA/Worker/Isotovideo/Client.pm Outdated Show resolved Hide resolved

Martchus requested changes Sep 2, 2019

View reviewed changes

kraih added 2 commits September 2, 2019 14:53

Split up upload handling into more manageable parts

bb7b801

Handle asset and logfile uploads in a subprocess

ab9ff0e

kraih added 5 commits September 2, 2019 14:56

Test asset uploads too

f159cc2

Test API and upload errors

17b2480

Make annoying log message debug level instead of info for now

970be1f

Do not try to reconnect the WebSocket connection in subprocesses

f5fe73b

Since we now expect there to be more subprocesses, there is no reason…

9fc8435

… to log them as unexpected anymore

kraih force-pushed the async_thumbnails branch from 0ead988 to 9fc8435 Compare September 2, 2019 12:56

Failed isotovideo status checks are fairly harmless, but a log messag…

6677612

…e can be useful for debugging

okurz approved these changes Sep 2, 2019

View reviewed changes

Martchus approved these changes Sep 2, 2019

View reviewed changes

Reduce coverage requirement a bit until we can make Devel::Cover incl…

dd099b5

…ude subprocesses in stats

kraih merged commit 439fff1 into os-autoinst:master Sep 3, 2019

kraih mentioned this pull request Sep 3, 2019

Fix artefact upload errors to show the message #2301

Merged

kalikiana mentioned this pull request Feb 27, 2020

Consistent validation in JobTemplate update route #2781

Merged

kraih deleted the async_thumbnails branch May 12, 2020 13:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make as much of the worker code non-blocking as possible #2270

Make as much of the worker code non-blocking as possible #2270

kraih commented Aug 19, 2019

kraih commented Aug 19, 2019

kraih commented Aug 19, 2019

Martchus left a comment

kraih commented Aug 19, 2019

Martchus commented Aug 19, 2019

kraih commented Aug 19, 2019

codecov bot commented Aug 20, 2019

codecov bot commented Aug 20, 2019 •

edited

Loading

kraih commented Sep 2, 2019

okurz left a comment •

edited

Loading

okurz Sep 2, 2019

kraih Sep 2, 2019

Martchus Sep 2, 2019

Martchus Sep 2, 2019

Martchus Sep 2, 2019

kraih Sep 2, 2019

okurz Sep 2, 2019

okurz left a comment

okurz commented Sep 2, 2019

Martchus left a comment

Martchus commented Sep 2, 2019

okurz commented Sep 2, 2019

okurz left a comment

kraih commented Sep 2, 2019 •

edited

Loading

okurz commented Sep 2, 2019

Make as much of the worker code non-blocking as possible #2270

Make as much of the worker code non-blocking as possible #2270

Conversation

kraih commented Aug 19, 2019

kraih commented Aug 19, 2019

kraih commented Aug 19, 2019

Martchus left a comment

Choose a reason for hiding this comment

kraih commented Aug 19, 2019

Martchus commented Aug 19, 2019

kraih commented Aug 19, 2019

codecov bot commented Aug 20, 2019

Codecov Report

codecov bot commented Aug 20, 2019 • edited Loading

Codecov Report

kraih commented Sep 2, 2019

okurz left a comment • edited Loading

Choose a reason for hiding this comment

okurz Sep 2, 2019

Choose a reason for hiding this comment

kraih Sep 2, 2019

Choose a reason for hiding this comment

Martchus Sep 2, 2019

Choose a reason for hiding this comment

Martchus Sep 2, 2019

Choose a reason for hiding this comment

Martchus Sep 2, 2019

Choose a reason for hiding this comment

kraih Sep 2, 2019

Choose a reason for hiding this comment

okurz Sep 2, 2019

Choose a reason for hiding this comment

okurz left a comment

Choose a reason for hiding this comment

okurz commented Sep 2, 2019

Martchus left a comment

Choose a reason for hiding this comment

Martchus commented Sep 2, 2019

okurz commented Sep 2, 2019

okurz left a comment

Choose a reason for hiding this comment

kraih commented Sep 2, 2019 • edited Loading

okurz commented Sep 2, 2019

codecov bot commented Aug 20, 2019 •

edited

Loading

okurz left a comment •

edited

Loading

kraih commented Sep 2, 2019 •

edited

Loading