Re-try accepting directly chained jobs to avoid skipping whole chain #4541
Codecov Report
@@           Coverage Diff           @@
##           master    #4541   +/-   ##
=======================================
  Coverage   97.97%   97.97%
=======================================
  Files         374      375    +1
  Lines       34268    34274    +6
=======================================
+ Hits        33573    33581    +8
+ Misses        695      693    -2
I've tested locally the case where the web socket connection is aborted at the moment the next job is supposed to be accepted, by tampering with the web socket server. The worker behaves as expected: the job isn't immediately skipped but is accepted once the ws connection is re-established.
Could you please check the decrease of test coverage in t/24-worker-engine.t and t/24-worker-jobs.t?
So far, the worker doesn't retry accepting a job. That actually makes sense because at this point no work has been done yet and the worker can just wait for the scheduler to re-assign the job. However, this is not true for directly chained jobs because an error there means the whole chain needs to be restarted. So we should try a little bit harder, similar to what is already done when the connection is lost during job execution. See https://progress.opensuse.org/issues/107746
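To illustrate the idea, here is a minimal sketch in Python (not the worker's actual Perl code). All names, including `accept_job`, `send_accept`, `is_directly_chained`, and `max_attempts`, are hypothetical and only serve to show the control flow: a regular job gets a single attempt, while a directly chained job is retried a bounded number of times before the chain is given up.

```python
import time
from typing import Callable


def accept_job(
    send_accept: Callable[[], bool],
    is_directly_chained: bool,
    max_attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the job was accepted, False if it should be skipped.

    `send_accept` is a hypothetical callback that tries to send the
    acceptance message and returns False when the connection is down.
    """
    # Regular jobs get one attempt: no work has been done yet, so the
    # scheduler can simply re-assign them. Directly chained jobs get
    # several attempts because a failure would cost the whole chain.
    attempts = max_attempts if is_directly_chained else 1
    for attempt in range(1, attempts + 1):
        if send_accept():
            return True
        if attempt < attempts:
            # Give the connection a chance to be re-established.
            sleep(delay)
    return False
```

With a callback that fails twice and then succeeds, a directly chained job would be accepted on the third attempt, whereas a regular job would be skipped after the first failure.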
Coverage reports look good now.