Allow pinning parallel clusters on one host #5536

Martchus · 2024-03-19T17:53:00Z

See contained documentation change
See https://progress.opensuse.org/issues/135035

A draft because I still need to extend unit tests. Possibly the big function should also be split. It got a little bit involved but I think the algorithm works without being a full-blown SAT solver. You might want to read the test code first (before the implementation) to see what the difficulty here actually is.

github-actions · 2024-03-19T17:53:19Z

Great PR! Please pay attention to the following items before merging:

Files matching docs/*.asciidoc:

Consider generating documentation locally to verify it is rendered correctly using tools/generate-docs

This is an automatically generated QA checklist based on modified files.

okurz

oh geez, I have no idea what your very elaborate algorithm is doing :D

only trivial questions or corrections. The majority of the change I did not really want to dive into … yet

docs/WritingTests.asciidoc

lib/OpenQA/Scheduler/Model/Jobs.pm

t/04-scheduler.t

Martchus · 2024-03-19T22:44:42Z

oh geez, I have no idea what your very elaborate algorithm is doing :D

After thinking about it further I must admit that the algorithm is probably also not able to sort out all sorts of inputs. I guess for that we'd really needed a full SAT solver. However, I hope it is sufficient for our purposes. Actually I suppose we don't even have cases as complicated as the one in the unit test in production. The worst case is that jobs won't be scheduled although they could be scheduled if one mixes up worker classes within a scenario too wildely and then we can still improve the algorithm (with TDD by first adding unit tests according to that production scenaio).

lib/OpenQA/Scheduler/Model/Jobs.pm

codecov · 2024-03-22T14:54:07Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.39%. Comparing base (c4f7601) to head (9a0115e).
Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #5536    +/-   ##
========================================
  Coverage   98.39%   98.39%            
========================================
  Files         391      392     +1     
  Lines       38044    38208   +164     
========================================
+ Hits        37432    37596   +164     
  Misses        612      612

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Martchus · 2024-03-25T13:13:54Z

I added/extended unit tests where needed and refactored the code into smaller functions and moved it into its own file.

I also gave it a real test on my local machine with two worker slots and faking the worker host (by manually editing the worker code before starting the particular worker instances). I tested with and without PARALLEL_ONE_HOST_ONLY=1 and with and without mixed hosts. It seemed to work as expected in all cases.

docs/WritingTests.asciidoc

lib/OpenQA/Scheduler/Model/Jobs.pm

kraih · 2024-03-27T10:40:05Z

It's a very complex feature, but i like how well encapsulated it is. That will help a lot with debugging in the future.

* See contained documentation change * See https://progress.opensuse.org/issues/135035

* Move code to its own file * Split code into smaller functions * See https://progress.opensuse.org/issues/135035

See https://progress.opensuse.org/issues/135035

* Ensure the `PARALLEL_ONE_HOST_ONLY`-setting is considered for all jobs in a parallel cluster when determining scheduled jobs and add according test * Extend tests for picking parallel siblings of running jobs * See https://progress.opensuse.org/issues/135035

okurz

After thinking about this and potential follow-up work I came up with one open point that I would like to clarify before merging: The openQA scheduler already has access to worker classes and possibly further worker properties from workers. So would it be possible to define a special worker class like "tap-parallel_one_host_only" resembling the new test setting and read that from the worker to have the same effect as the test variable which is to not schedule any across-host clusters on that host?

Martchus · 2024-03-27T14:00:53Z

We discussed more features that involve worker classes and dependencies and how they are intertwined. We need to make sure that this PR is in-line with those ideas for future features:

Allow specifying negative worker class matching. So a job with e.g. WORKER_CLASS=foo,-bar would work on slots that provide "foo" unless they also provide "bar".
Allow grouping worker classes using tags, e.g. WORKER_CLASS=host:worker1,region:prg1 would specify the worker classes host:worker1 and region:prg1 as usual. However, openQA would also recognize host and region as tags that come into play when implementing further features.
1. Allow pinning parallel clusters based on tags. This would be the same as the feature proposed by this PR. However, not the host but a list of tags would be specified. So we would add a variable PARALLEL_ONE_TAG_ONLY that could be assigned like PARALLEL_ONE_TAG_ONLY=region and then a cluster of parallel jobs would only be assigned to a set worker slots with a common region (so those parallel jobs would not be assigned across regions).
  1. Allow specifying the same without extra setting directly as part of the WORKER_CLASS setting by introducing a special syntax like WORKER_CLASS=qemu_x86_64,region:prg1,#region:parallel-one-host.
2. Allow pinning other sets of jobs (like a chain of jobs) based on tags. This is probably not needed in the foreseeable future. Having to introduce another variable for this would be ok.
Allow the settings introduced by this PR or future feature ideas to be specified in workers.ini. This would mean communicating any additional settings as "property" like we already do for e.g. WORKER_CLASS (which would be very easy to do). It would also mean that we need to consider those settings in our scheduling algorithm (which is the hard part).

Note that the last point (3.) is maybe the most interesting one for us because in practice we probably really want to put the setting in workers.ini.

Here's the relation of those feature ideas with this PR (how I see it):

I don't think this PR is in conflict with introducing such a feature. We wouldn't need to change the code I'm introducing here to implement such a feature.
I don't think this PR is in conflict with introducing such a feature but there is of course some overlap. We would extend the code I'm introducing here (and/or we introduce for 3.) to implement it. That's fine. It would however leave us with two settings PARALLEL_ONE_HOST_ONLY and PARALLEL_ONE_TAG_ONLY (and thus basically two ways to achieve the same). I can change this PR to introduce tagging right away to only have PARALLEL_ONE_TAG_ONLY. I could also implement 2.1.1 but than it would probably become a bit too much and the syntax is questionable. (The other point 2.2 is probably not relevant.)
This changes things. It is again not in conflict with this PR but I probably need to adjust extend the algorithm further because it actually adds another dimension to the problem when the setting come from worker properties. (When the setting comes from the job it is very easy to handle because then the information is part of the set of jobs we want to schedule which we know upfront. When it comes from the worker properties then it is harder to handle because we only build the list of assigned/allocated workers during the process of scheduling.)

I would tend to merge this PR right now as is (plus maybe a note in the documentation that this feature is still subject to change). Then we have something to work with at all for tickets like https://progress.opensuse.org/issues/157414 and can already gather experience with scheduler changes like this in general. Otherwise the PR might also become even more complicated to review.

okurz

I also agree with this PR as-is except for fixing CI failures and please just add one line in the documentation that you added. "This feature and the naming of the variable is subject to change."

Martchus · 2024-03-27T14:11:53Z

I extended the documentation.

kraih · 2024-03-27T14:17:22Z

You forgot to add the new file it seems.

It is not a model itself so to avoid confusion it is better moved to be in just `OpenQA::Scheduler`.

See https://progress.opensuse.org/issues/135035

okurz · 2024-03-27T14:45:36Z

The CI failure was due to https://progress.opensuse.org/issues/157540, added a comment there, bumped prio and retriggered the CI job here.

okurz requested changes Mar 19, 2024

View reviewed changes

docs/WritingTests.asciidoc Outdated Show resolved Hide resolved

lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved

t/04-scheduler.t Outdated Show resolved Hide resolved

okurz requested changes Mar 20, 2024

View reviewed changes

lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved

lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved

Martchus force-pushed the dependency-pinning branch from e16961a to be533d6 Compare March 22, 2024 14:34

Martchus marked this pull request as ready for review March 22, 2024 15:07

Martchus force-pushed the dependency-pinning branch from 40ee544 to 69a19c2 Compare March 25, 2024 10:33

okurz reviewed Mar 25, 2024

View reviewed changes

docs/WritingTests.asciidoc Outdated Show resolved Hide resolved

okurz approved these changes Mar 25, 2024

View reviewed changes

kraih reviewed Mar 27, 2024

View reviewed changes

lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved

Martchus added 4 commits March 27, 2024 11:46

Allow pinning parallel clusters on one host

c83550e

* See contained documentation change * See https://progress.opensuse.org/issues/135035

Refactor code to support parallel dependency pinning

d42d270

* Move code to its own file * Split code into smaller functions * See https://progress.opensuse.org/issues/135035

Mark uncoverable line in scheduler test as such

3d2ed6d

See https://progress.opensuse.org/issues/135035

Martchus force-pushed the dependency-pinning branch from 113b11e to f19f0ab Compare March 27, 2024 10:51

okurz requested changes Mar 27, 2024

View reviewed changes

okurz reviewed Mar 27, 2024

View reviewed changes

Martchus added 2 commits March 27, 2024 15:25

Move WorkerSlotPicker out of the Model namespace

8c9f879

It is not a model itself so to avoid confusion it is better moved to be in just `OpenQA::Scheduler`.

Document caveats of the PARALLEL_ONE_HOST_ONLY=1 feature

9a0115e

See https://progress.opensuse.org/issues/135035

Martchus force-pushed the dependency-pinning branch from 37d553c to 9a0115e Compare March 27, 2024 14:25

okurz approved these changes Mar 27, 2024

View reviewed changes

kraih approved these changes Mar 27, 2024

View reviewed changes

mergify bot merged commit 7ddbd0d into os-autoinst:master Mar 27, 2024
42 checks passed

Martchus deleted the dependency-pinning branch March 27, 2024 16:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow pinning parallel clusters on one host #5536

Allow pinning parallel clusters on one host #5536

Martchus commented Mar 19, 2024 •

edited

github-actions bot commented Mar 19, 2024 •

edited by okurz

okurz left a comment

Martchus commented Mar 19, 2024

codecov bot commented Mar 22, 2024 •

edited

Martchus commented Mar 25, 2024

kraih commented Mar 27, 2024

okurz left a comment

Martchus commented Mar 27, 2024

okurz left a comment

Martchus commented Mar 27, 2024

kraih commented Mar 27, 2024

okurz commented Mar 27, 2024

Allow pinning parallel clusters on one host #5536

Allow pinning parallel clusters on one host #5536

Conversation

Martchus commented Mar 19, 2024 • edited

github-actions bot commented Mar 19, 2024 • edited by okurz

okurz left a comment

Choose a reason for hiding this comment

Martchus commented Mar 19, 2024

codecov bot commented Mar 22, 2024 • edited

Codecov Report

Martchus commented Mar 25, 2024

kraih commented Mar 27, 2024

okurz left a comment

Choose a reason for hiding this comment

Martchus commented Mar 27, 2024

okurz left a comment

Choose a reason for hiding this comment

Martchus commented Mar 27, 2024

kraih commented Mar 27, 2024

okurz commented Mar 27, 2024

Martchus commented Mar 19, 2024 •

edited

github-actions bot commented Mar 19, 2024 •

edited by okurz

codecov bot commented Mar 22, 2024 •

edited