
Re-try accepting directly chained jobs to avoid skipping whole chain #4541

Merged
merged 4 commits into master from retry-accept on Mar 7, 2022

Conversation

@Martchus (Contributor) commented Mar 2, 2022

So far, the worker doesn't retry accepting a job. That actually makes sense
because at this point no work has been done anyway and the worker can just
wait for the scheduler to re-assign the job. However, for directly chained
jobs this is not true because the whole chain would need to be restarted in
case of an error. So we should try a little harder - similarly to how it is
already done when the connection is lost during job execution.

See https://progress.opensuse.org/issues/107746
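
For illustration only, here is a minimal sketch of that retry idea in Perl (assuming Mojolicious is available, as in the worker itself). The helper try_to_accept, the attempt limit and the delay are hypothetical stand-ins, not the actual openQA implementation:

    use Mojo::Base -strict, -signatures;
    use Mojo::IOLoop;

    my $max_attempts = 3;     # hypothetical limit, not the real worker configuration
    my $retry_delay  = 10;    # seconds, mirroring the "trying again in 10 seconds" log lines

    # Hypothetical single acceptance attempt; stubbed out so the sketch runs standalone.
    sub try_to_accept ($job) { int rand 2 }

    sub accept_job_with_retry ($job, $attempt = 1) {
        if (try_to_accept($job)) {
            print "Accepted job $job->{id}\n";
            return Mojo::IOLoop->stop;
        }
        if ($attempt >= $max_attempts) {
            # Giving up here means the scheduler has to restart the whole direct
            # chain, which is exactly what the retries are meant to avoid.
            warn "Unable to accept job $job->{id}, giving up\n";
            return Mojo::IOLoop->stop;
        }
        warn "Unable to accept job $job->{id} - trying again in $retry_delay seconds\n";
        Mojo::IOLoop->timer($retry_delay => sub { accept_job_with_retry($job, $attempt + 1) });
    }

    accept_job_with_retry({id => 2521});
    Mojo::IOLoop->start;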

@Martchus Martchus force-pushed the retry-accept branch 3 times, most recently from 45c7add to 2b4b041 on March 3, 2022 13:30
@Martchus Martchus marked this pull request as ready for review March 3, 2022 13:30
@codecov codecov bot commented Mar 3, 2022

Codecov Report

Merging #4541 (5a65367) into master (475752c) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4541   +/-   ##
=======================================
  Coverage   97.97%   97.97%           
=======================================
  Files         374      375    +1     
  Lines       34268    34274    +6     
=======================================
+ Hits        33573    33581    +8     
+ Misses        695      693    -2     
Impacted Files                         Coverage Δ
lib/OpenQA/Worker.pm                   95.64% <100.00%> (+0.08%) ⬆️
lib/OpenQA/Worker/Job.pm               100.00% <100.00%> (ø)
t/24-worker-engine.t                   100.00% <100.00%> (+0.37%) ⬆️
t/24-worker-jobs.t                     100.00% <100.00%> (ø)
t/24-worker-overall.t                  100.00% <100.00%> (ø)
t/24-worker-webui-connection.t         100.00% <100.00%> (ø)
t/lib/OpenQA/Test/FakeWorker.pm        100.00% <100.00%> (ø)
lib/OpenQA/Worker/WebUIConnection.pm   94.55% <0.00%> (+0.49%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 475752c...5a65367.

@Martchus (Contributor, Author) commented Mar 3, 2022

I've tested the case where the web socket connection is aborted when the next job is supposed to be accepted; I reproduced this locally by tampering with the web socket server. The worker behaves as expected:

…
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2520/set_done?worker_id=20
[debug] [pid:24424] Job 2520 from http://localhost:9526 finished - reason: done
[debug] [pid:24424] Cleaning up for next job
[info] [pid:24424] Accepting job 2521 from queue
[debug] [pid:24424] Accepting job 2521 from http://localhost:9526.
[debug] [pid:24424] Setting job 2521 from http://localhost:9526 up
[debug] [pid:24424] Preparing Mojo::IOLoop::ReadWriteProcess::Session
[info] [pid:24424] +++ setup notes +++
[info] [pid:24424] Running on linux-9lzf:1 (Linux 5.16.8-1-default #1 SMP PREEMPT Thu Feb 10 11:31:59 UTC 2022 (5d1f5d2) x86_64)
[debug] [pid:24424] Job settings:
[debug] [pid:24424] 
    ARCH=x86_64
    BACKEND=qemu
    BUILD=20220227
    DESKTOP=minimalx
    DISTRI=opensuse
    EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
    EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
    FLAVOR=DVD
    HDDSIZEGB=20
    ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso
    ISO_MAXSIZE=4700372992
    JOBTOKEN=aICzWHG1_g3tbbeo
    LOG_LEVEL=debug
    MACHINE=64bit
    NAME=00002521-opensuse-Tumbleweed-DVD-x86_64-Build20220227-parallel-slave@64bit
    OPENQA_HOSTNAME=localhost:9526
    OPENQA_URL=http://localhost:9526
    PRJDIR=/hdd/openqa-devel/openqa/share
    QEMUCPU=qemu64
    QEMUPORT=20012
    RETRY_DELAY=5
    RETRY_DELAY_IF_WEBUI_BUSY=60
    SCHEDULE=tests/installation/isosize,tests/installation/bootloader_start
    START_DIRECTLY_AFTER_TEST=parallel-supportserver
    TEST=parallel-slave
    TEST_SUITE_NAME=parallel-slave
    VERSION=Tumbleweed
    VIRTIO_CONSOLE=1
    VNC=91
    WORKER_CLASS=qemu_x86_64,qemu_i686,qemu_i586,foo,pc_qam_azure
    WORKER_HOSTNAME=127.0.0.1
    WORKER_ID=20
    WORKER_INSTANCE=1
    YAML_SCHEDULE=schedule/yast/raid/raid1_sle_gpt.yaml
[debug] [pid:24424] Linked asset "/hdd/openqa-devel/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso" to "/hdd/openqa-devel/openqa/pool/1/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso"
[debug] [pid:24424] Symlinked from "/hdd/openqa-devel/openqa/share/tests/opensuse" to "/hdd/openqa-devel/openqa/pool/1/opensuse"
[info] [pid:24424] Preparing cgroup to start isotovideo
[warn] [pid:24424] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.34.0/Mojo/File.pm line 84.
[info] [pid:24424] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
[info] [pid:24424] Starting isotovideo container
[debug] [pid:24424] Registered process:26516
[info] [pid:26516] 26516: WORKING 2521
[debug] [pid:26516] +++ worker notes +++
[info] [pid:24424] isotovideo has been started (PID: 26516)
[debug] [pid:24424] Running job 2521 from http://localhost:9526: 00002521-opensuse-Tumbleweed-DVD-x86_64-Build20220227-parallel-slave@64bit.
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Websocket connection to http://localhost:9526/api/v1/ws/20 finished by remote side with code 1006, no reason - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (no current module)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[debug] [pid:24424] Stopping livelog
[debug] [pid:24424] Starting livelog
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
…
[debug] [pid:26747] Uploading artefact isosize-1.txt
[debug] [pid:26747] Uploading artefact bootloader_start-1.txt
[debug] [pid:26747] Uploading artefact bootloader_start-11.txt
[debug] [pid:26747] Uploading artefact bootloader_start-12.txt
[debug] [pid:24424] Upload concluded (no current module)
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/set_done?worker_id=20
[debug] [pid:24424] Job 2521 from http://localhost:9526 finished - reason: done
[debug] [pid:24424] Cleaning up for next job

So the job isn't immediately skipped but actually accepted after the ws connection is re-established.
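
For context, the START_DIRECTLY_AFTER_TEST setting visible in the job settings above is what makes a job part of a directly chained cluster. Stripped down to the two test suites from this log, the dependency looks like this:

    # parent test suite, runs first
    TEST=parallel-supportserver

    # child test suite, executed directly afterwards on the same worker
    TEST=parallel-slave
    START_DIRECTLY_AFTER_TEST=parallel-supportserver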

@okurz (Member) commented Mar 4, 2022

Could you please check the decrease of test coverage in t/24-worker-engine.t and t/24-worker-jobs.t?

@Martchus (Contributor, Author) commented Mar 7, 2022

Coverage reports look good now.

@mergify mergify bot merged commit 44ed4c6 into os-autoinst:master Mar 7, 2022
@Martchus Martchus deleted the retry-accept branch March 7, 2022 15:59