Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make as much of the worker code non-blocking as possible #2270

Merged
merged 21 commits into from
Sep 3, 2019

Conversation

kraih
Copy link
Member

@kraih kraih commented Aug 19, 2019

This is a more complete alternative to #2268. It makes all uploads non-blocking and changes almost all continuation-passing style APIs to use the event loop to defer callback execution. So interruptions of worker status updates should be very minimal if everything works as expected.

Downside is that 24-worker-job.t had to be disabled, since it depends on the worker not using the event loop correctly in very big parts. I can try to rewrite it once i understand what exactly the tests were meant to cover. But that could take some time.

@kraih kraih added the acceptance-tests-needed Needed for code that is required to be tested on a production-like environment label Aug 19, 2019
@kraih
Copy link
Member Author

kraih commented Aug 19, 2019

In the meantime i will also deploy this to the staging machines.

@kraih
Copy link
Member Author

kraih commented Aug 19, 2019

Discussed this PR with @Martchus today, and decided to try to rewrite 24-worker-job.t, even if it takes some time. Investing into a clearer OpenQA::Worker::Job life cycle now should be worth it for future improvements. Also discovered a few new potential risks that need to be assessed, like accidental parallel uploads.

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also discovered a few new potential risks that need to be assessed, like accidental parallel uploads.

We have

    # ensure an ongoing timer is cancelled in case upload_results has been called manually
    if (my $upload_results_timer = delete $self->{_upload_results_timer}) {
        Mojo::IOLoop->remove($upload_results_timer);
    }

right at the start of _upload_results. So parallel uploads should not be triggered by the timer. Otherwise _upload_results is only called when starting and stopping a job which should be safe as well. So I guess the current behavior already prevents us from having multiple parallel uploads. We just need to keep it that way.

Looking at the code it seems that inside the subprocesses no variables are mutated. That is instead done afterwards in the callback in the main process. That's how it is supposed to work but I'm wondering why you then need to use this "shared hash" in the worker job test.

But in general - and considering that it is still WIP and needs more testing - it looks good.

lib/OpenQA/Worker/Job.pm Outdated Show resolved Hide resolved
lib/OpenQA/Worker/Job.pm Show resolved Hide resolved
@kraih
Copy link
Member Author

kraih commented Aug 19, 2019

@Martchus The shared_hash was a hack to keep 24-worker-job.t working when asset and log file uploading was changed to use a subprocess. It was needed because the test mocks _upload_log_file and _upload_asset to mutate an array used in the test to record the call history. It will hopefully go away when i rewrite the test.

@Martchus
Copy link
Contributor

That makes sense. But you still need some way to trace what's happening inside the subprocesses.

@kraih
Copy link
Member Author

kraih commented Aug 19, 2019

That is true... maybe it will stay after all. 😆

@codecov
Copy link

codecov bot commented Aug 20, 2019

Codecov Report

Merging #2270 into master will decrease coverage by 2.53%.
The diff coverage is 35.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2270      +/-   ##
==========================================
- Coverage   86.09%   83.56%   -2.54%     
==========================================
  Files         166      164       -2     
  Lines       10833    10791      -42     
==========================================
- Hits         9327     9017     -310     
- Misses       1506     1774     +268
Impacted Files Coverage Δ
lib/OpenQA/Worker/Job.pm 68.14% <35.95%> (-20.66%) ⬇️
lib/OpenQA/WebAPI/Controller/File.pm 37.5% <0%> (-54.17%) ⬇️
lib/OpenQA/WebAPI/Controller/Session.pm 35.82% <0%> (-23.89%) ⬇️
lib/OpenQA/WebAPI/Plugin/REST.pm 75% <0%> (-16.67%) ⬇️
lib/OpenQA/WebAPI/Controller/Step.pm 53.3% <0%> (-9.56%) ⬇️
lib/OpenQA/Schema/Result/Users.pm 87.5% <0%> (-6.25%) ⬇️
lib/OpenQA/WebAPI/Auth/Fake.pm 95.23% <0%> (-4.77%) ⬇️
lib/OpenQA/Utils.pm 88.35% <0%> (-4.2%) ⬇️
lib/OpenQA/Setup.pm 87.02% <0%> (-3.82%) ⬇️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95c83e7...b8aa935. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 20, 2019

Codecov Report

Merging #2270 into master will decrease coverage by 1.01%.
The diff coverage is 56.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2270      +/-   ##
==========================================
- Coverage   86.78%   85.76%   -1.02%     
==========================================
  Files         167      168       +1     
  Lines       10848    10870      +22     
==========================================
- Hits         9414     9323      -91     
- Misses       1434     1547     +113
Impacted Files Coverage Δ
lib/OpenQA/Worker/Engines/isotovideo.pm 94.58% <ø> (-0.08%) ⬇️
lib/OpenQA/Worker/Isotovideo/Client.pm 100% <100%> (ø)
lib/OpenQA/Worker/WebUIConnection.pm 81.33% <20%> (-3.09%) ⬇️
lib/OpenQA/Worker/Job.pm 70.54% <47.95%> (-18.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae171c0...dd099b5. Read the comment docs.

@kraih kraih force-pushed the async_thumbnails branch 5 times, most recently from 47817da to 1ff0cd7 Compare August 28, 2019 15:20
@kraih kraih force-pushed the async_thumbnails branch 3 times, most recently from ca1eca9 to 14207d7 Compare August 30, 2019 14:43
@kraih kraih changed the title [WIP] Make as much of the worker code non-blocking as possible Make as much of the worker code non-blocking as possible Sep 2, 2019
@kraih kraih requested review from Martchus and okurz September 2, 2019 09:08
@kraih
Copy link
Member Author

kraih commented Sep 2, 2019

This PR should be mostly complete now and ready for review.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying this myself I have some problems reported by the worker:

[warn] [pid:4460] Unable to query isotovideo status via "http://localhost:20113/735qvYDWTocdQPln/isotovideo/status": Connection refused
[debug] [pid:4460] REST-API call: POST http://localhost/api/v1/jobs/2642/status
[error] [pid:4460] 400 response: Got status update for job 2642 with unexpected worker ID 3 (expected no updates anymore, job is scheduled) (remaining tries: 2)
[error] [pid:4680] Websocket connection to http://localhost/api/v1/ws/3 finished by remote side with code 1006, no reason - trying again in 10 seconds
[debug] [pid:4460] Collected unknown process with pid 4680 and exit status: 0

However the job still continues, e.g. see
http://lord.arch.suse.de/tests/2642

any ideas?

@@ -187,7 +187,7 @@ sub engine_workit {
session->on(
collected_orphan => sub {
my ($session, $p) = @_;
log_info("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status,
log_debug("Collected unknown process with pid " . $p->pid . " and exit status: " . $p->exit_status,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, when the log messages are "annoying" as you state, why make them more verbose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the solution recommended by @Martchus. To me the whole event doesn't make much sense, but it seems there is nobody left on the team who knows why this was turned into a log message in the first place... so not just outright removing it seems like the safest choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okurz Why would that change make it more verbose? It in fact will silence the message unless debug messages are turned on. But we can also get completely rid of it and only log an error in case of non-zero exist status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if that event doesn't make sense at all (e.g. because a failure to run a subprocess would be logged elsewhere anyways) we should really just remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it a debug message will not help to keep autoinst-log.txt clean which is already an unreadable mess. Now it looks like this:

[2019-09-02T12:49:26.0182 CEST] [info] [pid:18157] Collected unknown process with pid 27786 and exit status: 0
[2019-09-02T12:49:26.643 CEST] [debug] no match: 268.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:27.0271 CEST] [info] [pid:18157] Collected unknown process with pid 27787 and exit status: 0
[2019-09-02T12:49:27.645 CEST] [debug] no match: 267.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:28.0371 CEST] [info] [pid:18157] Collected unknown process with pid 27788 and exit status: 0
[2019-09-02T12:49:28.620 CEST] [debug] no change: 266.9s
[2019-09-02T12:49:29.0454 CEST] [info] [pid:18157] Collected unknown process with pid 27789 and exit status: 0
[2019-09-02T12:49:29.852 CEST] [debug] no match: 265.9s, best candidate: pkcon-finished-20180704 (0.00)
[2019-09-02T12:49:30.0568 CEST] [info] [pid:18157] Collected unknown process with pid 27790 and exit status: 0
[2019-09-02T12:49:30.620 CEST] [debug] no change: 264.9s
[2019-09-02T12:49:31.576 CEST] [debug] no change: 264.0s
[2019-09-02T12:49:31.0632 CEST] [info] [pid:18157] Collected unknown process with pid 27791 and exit status: 0
[2019-09-02T12:49:32.577 CEST] [debug] no change: 263.0s
[2019-09-02T12:49:32.0687 CEST] [info] [pid:18157] Collected unknown process with pid 27792 and exit status: 0
[2019-09-02T12:49:33.577 CEST] [debug] no change: 262.0s
[2019-09-02T12:49:33.0791 CEST] [info] [pid:18157] Collected unknown process with pid 27794 and exit status: 0
[2019-09-02T12:49:34.578 CEST] [debug] no change: 261.0s
[2019-09-02T12:49:34.0856 CEST] [info] [pid:18157] Collected unknown process with pid 27797 and exit status: 0
[2019-09-02T12:49:35.842 CEST] [debug] no match: 260.0s, best candidate: pkcon-finished-20180704 (0.00)

So I'd just remove that log line completely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allright, that's 3 votes for remove then. 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, remove it. Sorry, I got confused: Of course it's not making it "more verbose" but the opposite.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because you have replaced the test I can not easily check myself so I would like to ask you to confirm:

  • Does the live view still work?
  • Does the developer mode still work?
  • When you cancel a job midway (e.g. also after using developer mode) are all results gathered until then uploaded including the current test module during which the job was aborted (this last point might not have worked correctly even in before)?
  • Are qemu snapshots still preserved when starting a worker with --no-cleanup?
  • Did you also try uploading qcow files (big files), rather than only the small results from the openposix ltp test case?

Overall looks nice and the results on http://openqa-staging-1.qa.suse.de/tests are already very promising :)

lib/OpenQA/Worker/Isotovideo/Client.pm Outdated Show resolved Hide resolved
@okurz
Copy link
Member

okurz commented Sep 2, 2019

please also check the coverage target, you might want to adjust the target threshold if you are confident with the changes.

Copy link
Contributor

@Martchus Martchus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize the issues we've already been discussing:

  1. Web socket server error message
  2. autoinst-log.txt being spammed (I'd just get rid of the log line completely)
  3. New worker job tests spams test log with isotovideo log messages (see http://open.qa/docs/#_running_tests_of_openqa_itself#_logging_behavior)

I think 3. is not very important because the messages are very few and there are no confusing errors among them.

@Martchus
Copy link
Contributor

Martchus commented Sep 2, 2019

Tested your latest commit. It resolves 1. :-)

@okurz
Copy link
Member

okurz commented Sep 2, 2019

if you reduce the log level to fix 3. and fix the conflict we are one step further :)

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can approve the PR in the current state as the tests are promising but I would appreciate if you can check the open points #2270 (review) as follow-up.

@kraih
Copy link
Member Author

kraih commented Sep 2, 2019

The test coverage problem here is actually similar to the one from #2289. Because the subprocesses exit with POSIX::_exit there are no stats written for the forked processes and it looks like the whole block has no tests. The actual coverage of the new test should be similar or even a bit more than the old one though.

A solution to change the codecov value might have to wait for a followup PR though, since i can't think of a quick hack at the moment.

@okurz
Copy link
Member

okurz commented Sep 2, 2019

A solution to change the codecov value might have to wait for a followup PR though, since i can't think of a quick hack at the moment.

you might have misunderstood me. I wanted to actually suggest the "quick hack": In codecov.yml simply change the line target: 85.9 to e.g. target: 85.7 to get a green check mark for at least the "project" coverage :) As soon as we have the coverage increased again, e.g. with the POSIX::_exit change we can also bump the threshold in the file again

@kraih kraih merged commit 439fff1 into os-autoinst:master Sep 3, 2019
coolo pushed a commit that referenced this pull request Sep 3, 2019
commit 439fff1
Merge: ae171c0 dd099b5
Author:     Sebastian Riedel <sri@cpan.org>
AuthorDate: Tue Sep 3 11:19:26 2019 +0200
Commit:     GitHub <noreply@github.com>
CommitDate: Tue Sep 3 11:19:26 2019 +0200

    Merge pull request #2270 from kraih/async_thumbnails

    Make as much of the worker code non-blocking as possible
kalikiana added a commit to kalikiana/openQA that referenced this pull request Feb 27, 2020
This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've cauht this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Feb 27, 2020
This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've cauht this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Feb 27, 2020
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- With debugging enabled, unknown parameters result in an error.
- $validation->param is used.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Feb 27, 2020
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- With debugging enabled, unknown parameters result in an error.
- $validation->param is used.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've cauht this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Feb 27, 2020
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- With debugging enabled, unknown parameters result in an error.
- $validation->param is used.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 3, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 3, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 3, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 3, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 9, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 9, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 9, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too, and distinguishes
between invalid and missing parameters.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
kalikiana added a commit to kalikiana/openQA that referenced this pull request Mar 10, 2020
- All params should be validated before use.
- Schema filename validation should be done in the controller, for that
reason the tests are also updated.
- $validation->param is used, except params which are part of the path
get validated through typed routes.
- The `validation_error` helper now speaks JSON, too, and distinguishes
between invalid and missing parameters.

This came up in the context of os-autoinst#2270 where the re-use of the URL object
caused unrelated parameters to be merged until it corrupted the
response. Disallowing unknown parameters would've caught this.

See https://progress.opensuse.org/issues/64075
@kraih kraih deleted the async_thumbnails branch May 12, 2020 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acceptance-tests-needed Needed for code that is required to be tested on a production-like environment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants