Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow pinning parallel clusters on one host #5536

Merged
merged 6 commits into from Mar 27, 2024

Conversation

Martchus
Copy link
Contributor

@Martchus Martchus commented Mar 19, 2024


A draft because I still need to extend unit tests. Possibly the big function should also be split. It got a little bit involved but I think the algorithm works without being a full-blown SAT solver. You might want to read the test code first (before the implementation) to see what the difficulty here actually is.

Copy link

github-actions bot commented Mar 19, 2024

Great PR! Please pay attention to the following items before merging:

Files matching docs/*.asciidoc:

  • Consider generating documentation locally to verify it is rendered correctly using tools/generate-docs

This is an automatically generated QA checklist based on modified files.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh geez, I have no idea what your very elaborate algorithm is doing :D

only trivial questions or corrections. The majority of the change I did not really want to dive into … yet

docs/WritingTests.asciidoc Outdated Show resolved Hide resolved
lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved
t/04-scheduler.t Outdated Show resolved Hide resolved
@Martchus
Copy link
Contributor Author

oh geez, I have no idea what your very elaborate algorithm is doing :D

After thinking about it further I must admit that the algorithm is probably also not able to sort out all sorts of inputs. I guess for that we'd really needed a full SAT solver. However, I hope it is sufficient for our purposes. Actually I suppose we don't even have cases as complicated as the one in the unit test in production. The worst case is that jobs won't be scheduled although they could be scheduled if one mixes up worker classes within a scenario too wildely and then we can still improve the algorithm (with TDD by first adding unit tests according to that production scenaio).

lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved
lib/OpenQA/Scheduler/Model/Jobs.pm Outdated Show resolved Hide resolved
Copy link

codecov bot commented Mar 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.39%. Comparing base (c4f7601) to head (9a0115e).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #5536    +/-   ##
========================================
  Coverage   98.39%   98.39%            
========================================
  Files         391      392     +1     
  Lines       38044    38208   +164     
========================================
+ Hits        37432    37596   +164     
  Misses        612      612            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Martchus Martchus marked this pull request as ready for review March 22, 2024 15:07
@Martchus
Copy link
Contributor Author

I added/extended unit tests where needed and refactored the code into smaller functions and moved it into its own file.

I also gave it a real test on my local machine with two worker slots and faking the worker host (by manually editing the worker code before starting the particular worker instances). I tested with and without PARALLEL_ONE_HOST_ONLY=1 and with and without mixed hosts. It seemed to work as expected in all cases.

docs/WritingTests.asciidoc Outdated Show resolved Hide resolved
@kraih
Copy link
Member

kraih commented Mar 27, 2024

It's a very complex feature, but i like how well encapsulated it is. That will help a lot with debugging in the future.

* Move code to its own file
* Split code into smaller functions
* See https://progress.opensuse.org/issues/135035
* Ensure the `PARALLEL_ONE_HOST_ONLY`-setting is considered for all jobs
  in a parallel cluster when determining scheduled jobs and add according
  test
* Extend tests for picking parallel siblings of running jobs
* See https://progress.opensuse.org/issues/135035
Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this and potential follow-up work I came up with one open point that I would like to clarify before merging: The openQA scheduler already has access to worker classes and possibly further worker properties from workers. So would it be possible to define a special worker class like "tap-parallel_one_host_only" resembling the new test setting and read that from the worker to have the same effect as the test variable which is to not schedule any across-host clusters on that host?

@Martchus
Copy link
Contributor Author

We discussed more features that involve worker classes and dependencies and how they are intertwined. We need to make sure that this PR is in-line with those ideas for future features:

  1. Allow specifying negative worker class matching. So a job with e.g. WORKER_CLASS=foo,-bar would work on slots that provide "foo" unless they also provide "bar".
  2. Allow grouping worker classes using tags, e.g. WORKER_CLASS=host:worker1,region:prg1 would specify the worker classes host:worker1 and region:prg1 as usual. However, openQA would also recognize host and region as tags that come into play when implementing further features.
    1. Allow pinning parallel clusters based on tags. This would be the same as the feature proposed by this PR. However, not the host but a list of tags would be specified. So we would add a variable PARALLEL_ONE_TAG_ONLY that could be assigned like PARALLEL_ONE_TAG_ONLY=region and then a cluster of parallel jobs would only be assigned to a set worker slots with a common region (so those parallel jobs would not be assigned across regions).
      1. Allow specifying the same without extra setting directly as part of the WORKER_CLASS setting by introducing a special syntax like WORKER_CLASS=qemu_x86_64,region:prg1,#region:parallel-one-host.
    2. Allow pinning other sets of jobs (like a chain of jobs) based on tags. This is probably not needed in the foreseeable future. Having to introduce another variable for this would be ok.
  3. Allow the settings introduced by this PR or future feature ideas to be specified in workers.ini. This would mean communicating any additional settings as "property" like we already do for e.g. WORKER_CLASS (which would be very easy to do). It would also mean that we need to consider those settings in our scheduling algorithm (which is the hard part).

Note that the last point (3.) is maybe the most interesting one for us because in practice we probably really want to put the setting in workers.ini.

Here's the relation of those feature ideas with this PR (how I see it):

  1. I don't think this PR is in conflict with introducing such a feature. We wouldn't need to change the code I'm introducing here to implement such a feature.
  2. I don't think this PR is in conflict with introducing such a feature but there is of course some overlap. We would extend the code I'm introducing here (and/or we introduce for 3.) to implement it. That's fine. It would however leave us with two settings PARALLEL_ONE_HOST_ONLY and PARALLEL_ONE_TAG_ONLY (and thus basically two ways to achieve the same). I can change this PR to introduce tagging right away to only have PARALLEL_ONE_TAG_ONLY. I could also implement 2.1.1 but than it would probably become a bit too much and the syntax is questionable. (The other point 2.2 is probably not relevant.)
  3. This changes things. It is again not in conflict with this PR but I probably need to adjust extend the algorithm further because it actually adds another dimension to the problem when the setting come from worker properties. (When the setting comes from the job it is very easy to handle because then the information is part of the set of jobs we want to schedule which we know upfront. When it comes from the worker properties then it is harder to handle because we only build the list of assigned/allocated workers during the process of scheduling.)

I would tend to merge this PR right now as is (plus maybe a note in the documentation that this feature is still subject to change). Then we have something to work with at all for tickets like https://progress.opensuse.org/issues/157414 and can already gather experience with scheduler changes like this in general. Otherwise the PR might also become even more complicated to review.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree with this PR as-is except for fixing CI failures and please just add one line in the documentation that you added. "This feature and the naming of the variable is subject to change."

@Martchus
Copy link
Contributor Author

I extended the documentation.

@kraih
Copy link
Member

kraih commented Mar 27, 2024

You forgot to add the new file it seems.

It is not a model itself so to avoid confusion it is better moved to be
in just `OpenQA::Scheduler`.
@okurz
Copy link
Member

okurz commented Mar 27, 2024

The CI failure was due to https://progress.opensuse.org/issues/157540, added a comment there, bumped prio and retriggered the CI job here.

@mergify mergify bot merged commit 7ddbd0d into os-autoinst:master Mar 27, 2024
42 checks passed
@Martchus Martchus deleted the dependency-pinning branch March 27, 2024 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants