New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sure worker updates status throughout asset upload and checksum #844
Conversation
removing it before the file uploads results in the dead job checker deciding the job is dead while the upload is running.
Following on from a50e86a, this completes the revert of f2547e9. There is a reason to use the non-blocking post and then 'block' ourselves while we wait for it to complete: this allows Mojo timers to keep triggering while the post runs. Which means the worker will continue to send status updates to the server, and the server's 'dead worker' check won't decide the worker is dead.
just piping the entire file to a 'cksum' system call can block the worker process for over a minute if the file is large, again rendering us vulnerable to the 'dead worker' check, as the worker will not send status updates to the server while it's blocked. Instead, use Digest::MD5 (we're using it in some other places too) and read 100MiB at a time from the file into the digest context, updating the status every ~5 seconds. I tried to find a way to let the timer fire while this is going on, but I can't; this is simple enough and works. The actual digest calculation does not take long at all, it's reading gigabytes of data from disk that takes time.
hmm, this seems to be causing my workers to die with:
I think this is probably due to moving the |
seems this was a bit too late, it causes some kind of conflict with the `upload_status(1)` calls I think, and results in workers dying with `recursive api_call is fatal` errors. So move it back to just after the file uploads are done, but before all of those calls.
cf463bd
to
bb47255
Compare
Here's an example of the difference this change makes. Here's a job from before it (with basically the current git master code): https://openqa.stg.fedoraproject.org/tests/35222 . From its os-autoinst log we see:
if I check my server logs for that job, I see:
So the actual test finishes at 17:26:03. There's then a gap until 17:26:22 - there's no status update in that ~20 second gap. This is the time it takes os-autoinst to checksum the disk. (For a larger disk image this gap is much longer, it's 1-2 mins with a KDE install...this test is a smaller package set, so smaller disk image. Obviously it also depends on whether other tests are running at the same time and competing for CPU time and IO bandwidth). We then get a status update, but there's then a much longer gap from 17:26:23 to 17:27:53 during which there's no status update again. This is the gap in which the actual image upload and the worker process checksum occur. Compare to a run with these patches applied: https://openqa.stg.fedoraproject.org/tests/36133 . Here's the autoinst log:
and the server logs:
There's now no noticeable gap between the actual test completing and the disk upload starting, because the os-autoinst sha1sum is gone. And while the image is uploading and the openQA worker checksum is happening, status updates continue to happen throughout. |
To give an idea of the length of time the worker process can be blocked in a real case, https://openqa.stg.fedoraproject.org/tests/36217 is a test which installs from our KDE live image and uploads the disk image; that run was while other tests were running too (many of the tests I did while figuring this out ran alone and so probably ran faster than usual). That test was run with the os-autoinst checksum code removed, but even so, there's 6.5 minutes between the test's recorded 'end time':
and when the worker process actually posts
it took those 6.5 minutes for the image to be uploaded and the two checksum operations the worker does to happen (using my |
we discussed this in our call today and came too the conclusion that this approach is too fragile and bound to break again in future cases. So I'll add another job state 'postprocessing' which the worker sets just before it removes the update_status timer and in this case the webui will give the worker 2 hours instead of 10 seconds to finish before declaring it dead. |
(ok, as I have to type this more often now - let's go for 'uploading' :) |
(I can already see the future where we will do more and 'uploading' will be particularly misleading. Like evaluating results by custom neural net and so on) |
that's a fair point, but I guess we'll want to visualize the process and need an extra state for that too |
sure, that approach should work too (it's what I thought of doing first, but then kinda preferred the idea of keeping the process responsive...) |
-> #848 |
Combined with os-autoinst/os-autoinst#574 , this should address poo#13526. The aim here is to make sure the worker process continues to update status throughout the teardown process, while it's uploading files and doing checksums. os-autoinst/os-autoinst#574 removes the checksum operation in os-autoinst, which blocks the worker process for ~1-2 mins on large files.
In this PR, we go back to a non-blocking post for file upload, allowing timers to fire while the upload is in progress. We move the
remove_timer
call for theupdate_status
timer to after the file upload is done. Together, these changes make the worker keep updating status through file upload.Finally, we change the way we do the post-upload checksum to read in 100MB of data at a time, and update status every five seconds. I did try finding a way to let the timer keep firing while the checksum is running; you can kinda do it by hitting
Mojo::IOLoop->one_tick
before eachadd()
, but it's clearly not really 'right' - it makes the checksum significantly slower, and when I played with it in a test script, it seems like it could get completely stuck, depending on how fast the disk reads were and how long the timers take to fire. It seems simpler and more robust to simply accept that the timer is going to be blocked and have the checksum function itself update the status every five seconds while it's running.I've tested these changes on Fedora's staging instance, by watching the HTTP server logs for
POST . /api/v1/jobs/36100/status
events. Before these changes, in a typical 'run an install and upload the disk image' test, there's about a 4-5 minute period after the VM shuts down during which there is only 0 or 1 status update POST. With all commits from this patch and os-autoinst/os-autoinst#574 applied, this phase takes only 2-3 minutes, and there are status update POSTs throughout.There is an extremely new addition to Mojo which looks interesting and could potentially help us make all this cleaner: http://mojolicious.org/perldoc/Mojo/IOLoop#subprocess . This was literally added three days ago, in Mojo 7.04, and is labelled experimental; we should probably wait a bit for it to mature before using it.