Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure worker updates status throughout asset upload and checksum #844

Closed
wants to merge 4 commits into from

Conversation

AdamWill
Copy link
Contributor

@AdamWill AdamWill commented Sep 1, 2016

Combined with os-autoinst/os-autoinst#574 , this should address poo#13526. The aim here is to make sure the worker process continues to update status throughout the teardown process, while it's uploading files and doing checksums. os-autoinst/os-autoinst#574 removes the checksum operation in os-autoinst, which blocks the worker process for ~1-2 mins on large files.

In this PR, we go back to a non-blocking post for file upload, allowing timers to fire while the upload is in progress. We move the remove_timer call for the update_status timer to after the file upload is done. Together, these changes make the worker keep updating status through file upload.

Finally, we change the way we do the post-upload checksum to read in 100MB of data at a time, and update status every five seconds. I did try finding a way to let the timer keep firing while the checksum is running; you can kinda do it by hitting Mojo::IOLoop->one_tick before each add(), but it's clearly not really 'right' - it makes the checksum significantly slower, and when I played with it in a test script, it seems like it could get completely stuck, depending on how fast the disk reads were and how long the timers take to fire. It seems simpler and more robust to simply accept that the timer is going to be blocked and have the checksum function itself update the status every five seconds while it's running.

I've tested these changes on Fedora's staging instance, by watching the HTTP server logs for POST . /api/v1/jobs/36100/status events. Before these changes, in a typical 'run an install and upload the disk image' test, there's about a 4-5 minute period after the VM shuts down during which there is only 0 or 1 status update POST. With all commits from this patch and os-autoinst/os-autoinst#574 applied, this phase takes only 2-3 minutes, and there are status update POSTs throughout.

There is an extremely new addition to Mojo which looks interesting and could potentially help us make all this cleaner: http://mojolicious.org/perldoc/Mojo/IOLoop#subprocess . This was literally added three days ago, in Mojo 7.04, and is labelled experimental; we should probably wait a bit for it to mature before using it.

removing it before the file uploads results in the dead job
checker deciding the job is dead while the upload is running.
Following on from a50e86a, this completes the revert of f2547e9.
There is a reason to use the non-blocking post and then 'block'
ourselves while we wait for it to complete: this allows Mojo
timers to keep triggering while the post runs. Which means the
worker will continue to send status updates to the server, and
the server's 'dead worker' check won't decide the worker is dead.
just piping the entire file to a 'cksum' system call can block
the worker process for over a minute if the file is large,
again rendering us vulnerable to the 'dead worker' check, as
the worker will not send status updates to the server while
it's blocked. Instead, use Digest::MD5 (we're using it in some
other places too) and read 100MiB at a time from the file into
the digest context, updating the status every ~5 seconds. I
tried to find a way to let the timer fire while this is going
on, but I can't; this is simple enough and works.

The actual digest calculation does not take long at all, it's
reading gigabytes of data from disk that takes time.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 70.749% when pulling cf463bd on AdamWill:upload-blocking-post into 8609e09 on os-autoinst:master.

@AdamWill
Copy link
Contributor Author

AdamWill commented Sep 1, 2016

hmm, this seems to be causing my workers to die with:

Sep 01 05:22:50 qa07.qa.fedoraproject.org worker[19568]: Mojo::Reactor::Poll: Timer failed: recursive api_call is fatal at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 529.

I think this is probably due to moving the remove_timer call too late, I'll try moving it back a bit earlier.

seems this was a bit too late, it causes some kind of conflict
with the `upload_status(1)` calls I think, and results in
workers dying with `recursive api_call is fatal` errors. So
move it back to just after the file uploads are done, but
before all of those calls.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 70.749% when pulling bb47255 on AdamWill:upload-blocking-post into 8609e09 on os-autoinst:master.

@AdamWill
Copy link
Contributor Author

AdamWill commented Sep 1, 2016

Here's an example of the difference this change makes. Here's a job from before it (with basically the current git master code): https://openqa.stg.fedoraproject.org/tests/35222 . From its os-autoinst log we see:

17:26:04.5764 1532 QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1532
17:26:04.5768 1532 sending magic and exit
17:26:04.5769 1527 received magic close
17:26:04.5980 1527 got sigchld
17:26:04.5988 1527 preparing hdd 1 for upload as disk_Server-dvd-iso_64bit.qcow2 in qcow2
17:26:22.6963 1527 got sigchld
17:26:22.6964 1527 sha1sum 9992f0f2d76e8a11b72a24045f477dbe8cb416fb *assets_private/disk_Server-dvd-iso_64bit.qcow2
EXIT 0
17:26:22.6967 1529 sysread failed: Connection reset by peer
17:26:22.6968 1527 awaiting death of commands process
17:26:22.7206 1527 commands process exited: 1529
+++ worker notes +++
end time: 2016-08-30 17:26:24
result: done
uploading disk_Server-dvd-iso_64bit.qcow2
Checksum 1474053147:1474053147 Sizes:3563061248:3563061248
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

if I check my server logs for that job, I see:

10.5.124.157 - - [30/Aug/2016:17:26:03 +0000] "POST /api/v1/jobs/35222/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:26:22 +0000] "POST /api/v1/jobs/35222/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:26:23 +0000] "POST /api/v1/jobs/35222/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:26:24 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 104 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/ack_temporary HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/status HTTP/1.1" 200 86 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:53 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.157 - - [30/Aug/2016:17:27:54 +0000] "POST /api/v1/jobs/35222/set_done HTTP/1.1" 200 15 "-" "Mojolicious (Perl)"

So the actual test finishes at 17:26:03. There's then a gap until 17:26:22 - there's no status update in that ~20 second gap. This is the time it takes os-autoinst to checksum the disk. (For a larger disk image this gap is much longer, it's 1-2 mins with a KDE install...this test is a smaller package set, so smaller disk image. Obviously it also depends on whether other tests are running at the same time and competing for CPU time and IO bandwidth). We then get a status update, but there's then a much longer gap from 17:26:23 to 17:27:53 during which there's no status update again. This is the gap in which the actual image upload and the worker process checksum occur.

Compare to a run with these patches applied: https://openqa.stg.fedoraproject.org/tests/36133 . Here's the autoinst log:

04:44:54.5353 28380 test process exited: 28384
04:44:54.5354 28380 isotovideo done
04:44:54.5365 28380 BACKEND SHUTDOWN 1
waitpid for 28403 returned 0
sending TERM to qemu pid: 28403
waitpid for 28403 returned 28403
04:44:55.6191 28391 QEMU: qemu-system-x86_64: terminating on signal 15 from pid 28391
04:44:55.6193 28391 sending magic and exit
04:44:55.6195 28380 received magic close
04:44:55.6419 28380 got sigchld
04:44:55.6425 28380 preparing hdd 1 for upload as disk_Server-dvd-iso_64bit.qcow2 in qcow2
EXIT 0
04:44:55.6439 28380 awaiting death of commands process
04:44:55.6472 28380 commands process exited: 28383
+++ worker notes +++
end time: 2016-09-01 04:44:55
result: done
uploading disk_Server-dvd-iso_64bit.qcow2
Checksum edb9b49a442be2825ad51f221f3c92b8:edb9b49a442be2825ad51f221f3c92b8 Sizes:3587309568:3587309568
uploading video.ogv
uploading vars.json
uploading serial0.txt
uploading autoinst-log.txt

and the server logs:

10.5.124.156 - - [01/Sep/2016:04:44:40 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 90 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:41 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:50 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:00 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:10 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:20 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:30 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:44:55 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 104 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:31 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:37 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:44 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:50 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:45:56 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:02 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:07 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:14 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:21 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:27 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:33 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:39 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:45 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 56 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/ack_temporary HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/status HTTP/1.1" 200 86 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:47 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/artefact HTTP/1.1" 200 2 "-" "Mojolicious (Perl)"
10.5.124.156 - - [01/Sep/2016:04:46:48 +0000] "POST /api/v1/jobs/36133/set_done HTTP/1.1" 200 15 "-" "Mojolicious (Perl)"

There's now no noticeable gap between the actual test completing and the disk upload starting, because the os-autoinst sha1sum is gone. And while the image is uploading and the openQA worker checksum is happening, status updates continue to happen throughout.

@AdamWill AdamWill mentioned this pull request Sep 1, 2016
@AdamWill
Copy link
Contributor Author

AdamWill commented Sep 1, 2016

To give an idea of the length of time the worker process can be blocked in a real case, https://openqa.stg.fedoraproject.org/tests/36217 is a test which installs from our KDE live image and uploads the disk image; that run was while other tests were running too (many of the tests I did while figuring this out ran alone and so probably ran faster than usual). That test was run with the os-autoinst checksum code removed, but even so, there's 6.5 minutes between the test's recorded 'end time':

end time: 2016-09-01 06:04:49

and when the worker process actually posts set_done:

10.5.124.157 - - [01/Sep/2016:06:11:26 +0000] "POST /api/v1/jobs/36217/set_done HTTP/1.1" 200 15 "-" "Mojolicious (Perl)"

it took those 6.5 minutes for the image to be uploaded and the two checksum operations the worker does to happen (using my Digest::MD5 code rather than the current git code, but it's not significantly slower).

@coolo
Copy link
Contributor

coolo commented Sep 1, 2016

we discussed this in our call today and came too the conclusion that this approach is too fragile and bound to break again in future cases.

So I'll add another job state 'postprocessing' which the worker sets just before it removes the update_status timer and in this case the webui will give the worker 2 hours instead of 10 seconds to finish before declaring it dead.

@coolo
Copy link
Contributor

coolo commented Sep 1, 2016

(ok, as I have to type this more often now - let's go for 'uploading' :)

@aaannz
Copy link
Contributor

aaannz commented Sep 1, 2016

(I can already see the future where we will do more and 'uploading' will be particularly misleading. Like evaluating results by custom neural net and so on)

@coolo
Copy link
Contributor

coolo commented Sep 1, 2016

that's a fair point, but I guess we'll want to visualize the process and need an extra state for that too

@AdamWill
Copy link
Contributor Author

AdamWill commented Sep 1, 2016

sure, that approach should work too (it's what I thought of doing first, but then kinda preferred the idea of keeping the process responsive...)

@coolo
Copy link
Contributor

coolo commented Sep 3, 2016

-> #848

@coolo coolo closed this Sep 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants