
Re-try accepting directly chained jobs to avoid skipping whole chain #4541

Merged
merged 4 commits into master from retry-accept on Mar 7, 2022

Conversation

@Martchus (Contributor) commented Mar 2, 2022

So far, the worker doesn't retry accepting a job. That actually makes sense
because at this point no work has been done anyway and the worker can just
wait for the scheduler to re-assign the job. However, for directly chained
jobs this is not true because the whole chain would need to be restarted in
case of an error. So we should try a little harder - similarly to how it is
already done when the connection is lost during job execution.

See https://progress.opensuse.org/issues/107746
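
For illustration only, here is a minimal sketch of that retry idea in Perl (assuming Mojolicious is available, as in the worker itself). The helper try_to_accept, the attempt limit and the delay are hypothetical stand-ins, not the actual openQA implementation:

    use Mojo::Base -strict, -signatures;
    use Mojo::IOLoop;

    my $max_attempts = 3;     # hypothetical limit, not the real worker configuration
    my $retry_delay  = 10;    # seconds, mirroring the "trying again in 10 seconds" log lines

    # Hypothetical single acceptance attempt; stubbed out so the sketch runs standalone.
    sub try_to_accept ($job) { int rand 2 }

    sub accept_job_with_retry ($job, $attempt = 1) {
        if (try_to_accept($job)) {
            print "Accepted job $job->{id}\n";
            return Mojo::IOLoop->stop;
        }
        if ($attempt >= $max_attempts) {
            # Giving up here means the scheduler has to restart the whole direct
            # chain, which is exactly what the retries are meant to avoid.
            warn "Unable to accept job $job->{id}, giving up\n";
            return Mojo::IOLoop->stop;
        }
        warn "Unable to accept job $job->{id} - trying again in $retry_delay seconds\n";
        Mojo::IOLoop->timer($retry_delay => sub { accept_job_with_retry($job, $attempt + 1) });
    }

    accept_job_with_retry({id => 2521});
    Mojo::IOLoop->start;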

@Martchus Martchus force-pushed the retry-accept branch 3 times, most recently from 45c7add to 2b4b041 on March 3, 2022 13:30
@Martchus Martchus marked this pull request as ready for review March 3, 2022 13:30
@codecov codecov bot commented Mar 3, 2022

Codecov Report

Merging #4541 (5a65367) into master (475752c) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4541   +/-   ##
=======================================
  Coverage   97.97%   97.97%           
=======================================
  Files         374      375    +1     
  Lines       34268    34274    +6     
=======================================
+ Hits        33573    33581    +8     
+ Misses        695      693    -2     
Impacted Files                         Coverage Δ
lib/OpenQA/Worker.pm                   95.64% <100.00%> (+0.08%) ⬆️
lib/OpenQA/Worker/Job.pm               100.00% <100.00%> (ø)
t/24-worker-engine.t                   100.00% <100.00%> (+0.37%) ⬆️
t/24-worker-jobs.t                     100.00% <100.00%> (ø)
t/24-worker-overall.t                  100.00% <100.00%> (ø)
t/24-worker-webui-connection.t         100.00% <100.00%> (ø)
t/lib/OpenQA/Test/FakeWorker.pm        100.00% <100.00%> (ø)
lib/OpenQA/Worker/WebUIConnection.pm   94.55% <0.00%> (+0.49%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 475752c...5a65367.

@Martchus (Contributor, Author) commented Mar 3, 2022

I've tested the case where the web socket connection is aborted when the next job is supposed to be accepted; I reproduced this locally by tampering with the web socket server. The worker behaves as expected:

…
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2520/set_done?worker_id=20
[debug] [pid:24424] Job 2520 from http://localhost:9526 finished - reason: done
[debug] [pid:24424] Cleaning up for next job
[info] [pid:24424] Accepting job 2521 from queue
[debug] [pid:24424] Accepting job 2521 from http://localhost:9526.
[debug] [pid:24424] Setting job 2521 from http://localhost:9526 up
[debug] [pid:24424] Preparing Mojo::IOLoop::ReadWriteProcess::Session
[info] [pid:24424] +++ setup notes +++
[info] [pid:24424] Running on linux-9lzf:1 (Linux 5.16.8-1-default #1 SMP PREEMPT Thu Feb 10 11:31:59 UTC 2022 (5d1f5d2) x86_64)
[debug] [pid:24424] Job settings:
[debug] [pid:24424] 
    ARCH=x86_64
    BACKEND=qemu
    BUILD=20220227
    DESKTOP=minimalx
    DISTRI=opensuse
    EXTERNAL_VIDEO_ENCODER_CMD=ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 0
    EXTERNAL_VIDEO_ENCODER_OUTPUT_FILE_EXTENSION=webm
    FLAVOR=DVD
    HDDSIZEGB=20
    ISO=openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso
    ISO_MAXSIZE=4700372992
    JOBTOKEN=aICzWHG1_g3tbbeo
    LOG_LEVEL=debug
    MACHINE=64bit
    NAME=00002521-opensuse-Tumbleweed-DVD-x86_64-Build20220227-parallel-slave@64bit
    OPENQA_HOSTNAME=localhost:9526
    OPENQA_URL=http://localhost:9526
    PRJDIR=/hdd/openqa-devel/openqa/share
    QEMUCPU=qemu64
    QEMUPORT=20012
    RETRY_DELAY=5
    RETRY_DELAY_IF_WEBUI_BUSY=60
    SCHEDULE=tests/installation/isosize,tests/installation/bootloader_start
    START_DIRECTLY_AFTER_TEST=parallel-supportserver
    TEST=parallel-slave
    TEST_SUITE_NAME=parallel-slave
    VERSION=Tumbleweed
    VIRTIO_CONSOLE=1
    VNC=91
    WORKER_CLASS=qemu_x86_64,qemu_i686,qemu_i586,foo,pc_qam_azure
    WORKER_HOSTNAME=127.0.0.1
    WORKER_ID=20
    WORKER_INSTANCE=1
    YAML_SCHEDULE=schedule/yast/raid/raid1_sle_gpt.yaml
[debug] [pid:24424] Linked asset "/hdd/openqa-devel/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso" to "/hdd/openqa-devel/openqa/pool/1/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20220227-Media.iso"
[debug] [pid:24424] Symlinked from "/hdd/openqa-devel/openqa/share/tests/opensuse" to "/hdd/openqa-devel/openqa/pool/1/opensuse"
[info] [pid:24424] Preparing cgroup to start isotovideo
[warn] [pid:24424] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.34.0/Mojo/File.pm line 84.
[info] [pid:24424] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
[info] [pid:24424] Starting isotovideo container
[debug] [pid:24424] Registered process:26516
[info] [pid:26516] 26516: WORKING 2521
[debug] [pid:26516] +++ worker notes +++
[info] [pid:24424] isotovideo has been started (PID: 26516)
[debug] [pid:24424] Running job 2521 from http://localhost:9526: 00002521-opensuse-Tumbleweed-DVD-x86_64-Build20220227-parallel-slave@64bit.
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Websocket connection to http://localhost:9526/api/v1/ws/20 finished by remote side with code 1006, no reason - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (no current module)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[warn] [pid:24424] Unable to upgrade to ws connection via http://localhost:9526/api/v1/ws/20 - trying again in 10 seconds
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registering with openQA http://localhost:9526
[info] [pid:24424] Establishing ws connection via ws://localhost:9526/api/v1/ws/20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[info] [pid:24424] Registered and connected via websockets with openQA host http://localhost:9526 and worker ID 20
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[debug] [pid:24424] Stopping livelog
[debug] [pid:24424] Starting livelog
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/status
[debug] [pid:24424] Upload concluded (at bootloader_start)
…
[debug] [pid:26747] Uploading artefact isosize-1.txt
[debug] [pid:26747] Uploading artefact bootloader_start-1.txt
[debug] [pid:26747] Uploading artefact bootloader_start-11.txt
[debug] [pid:26747] Uploading artefact bootloader_start-12.txt
[debug] [pid:24424] Upload concluded (no current module)
[debug] [pid:24424] REST-API call: POST http://localhost:9526/api/v1/jobs/2521/set_done?worker_id=20
[debug] [pid:24424] Job 2521 from http://localhost:9526 finished - reason: done
[debug] [pid:24424] Cleaning up for next job

So the job isn't immediately skipped but actually accepted after the ws connection is re-established.
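
For context, the START_DIRECTLY_AFTER_TEST setting visible in the job settings above is what makes a job part of a directly chained cluster. Stripped down to the two test suites from this log, the dependency looks like this:

    # parent test suite, runs first
    TEST=parallel-supportserver

    # child test suite, executed directly afterwards on the same worker
    TEST=parallel-slave
    START_DIRECTLY_AFTER_TEST=parallel-supportserver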

@okurz (Member) commented Mar 4, 2022

Could you please check the decrease of test coverage in t/24-worker-engine.t and t/24-worker-jobs.t?

@Martchus (Contributor, Author) commented Mar 7, 2022

Coverage reports look good now.

@mergify mergify bot merged commit 44ed4c6 into os-autoinst:master Mar 7, 2022
@Martchus Martchus deleted the retry-accept branch March 7, 2022 15:59