Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to not cancel parallel parents with still-pending children #2017

Merged
merged 1 commit into from Mar 19, 2019

Conversation

AdamWill
Copy link
Contributor

@AdamWill AdamWill commented Mar 5, 2019

As discussed extensively in
https://progress.opensuse.org/issues/46295 , openQA job logic
makes an assumption that, any time a parallel child fails or is
cancelled, its parent and any other pending children of that
parent ought to be cancelled. This is the behaviour SUSE's tests
expect, but it is not the behaviour Fedora's tests expect. In
Fedora we have several cases of clusters where a parallel parent
acts as a server to multiple unrelated child tests; if one of
the children fails, that does not mean the parent and all other
children must be cancelled.

This patch adds a global config option to set whether parallel
parents with other pending children (and hence those children)
will be cancelled when one child fails or is cancelled. It
defaults to 1, i.e. the current behaviour.

Signed-off-by: Adam Williamson awilliam@redhat.com

@coolo
Copy link
Contributor

coolo commented Mar 6, 2019

I really would prefer if this wouldn't done site global, but depending on job settings. And it's not like 'SUSE tests' expect this, @asmorodskyi's tests expect this - so the next SUSE engineer might actually expect something else. And I would even add a setting like PARALLEL_CANCEL_UNRELATED_CHILDREN (name to be discussed) to the current SUSE tests

@AdamWill
Copy link
Contributor Author

AdamWill commented Mar 6, 2019

Well, to me the most obvious way to do it based on job settings is to treat PARALLEL_WITH as a child-to-parent dependency; so if B and C are both PARALLEL_WITH A, then if A fails, B and C both get cancelled, but if B fails, A and C do not.

The problem with that is that your tests are not set up for this, so if I send a PR that does that, it's going to break your tests. If you don't mind that, I can take a swing at implementing it. Fixing your tests ideally shouldn't be hard - you'd just have to mark every test suite in the cluster as being PARALLEL_WITH each of the others. (Assuming having mutual PARALLEL_WITH settings doesn't break anything else, of course - not sure we've ever tried that.)

I can think of a couple of other more complicated ways of doing it, but I think they'd be somewhat more complex to implement. We could have an alternative to PARALLEL_WITH - PARALLEL_WITH_DEPENDS or something - which behaves as described above, and keep the previous behaviour of assuming the whole cluster is implicitly interdependent when the 'old' PARALLEL_WITH is used. Or we could implement some kind of separate argument that defines dependencies - DEPENDS_ON or whatever - and if a cluster contains at least one such explicitly-specified dependency, assume all dependencies would be explicitly specified and only cancel things according to what those dependency statements say...but if a cluster contains none of them, use the old 'cancel everything!' behaviour.

I'm not sure about PARALLEL_CANCEL_UNRELATED_CHILDREN because, I mean, where does it go? What if the parent has it set but not any of the children? What if one child has it set but not the parent or any other children? What if we have parent A and children B, C and D, and B and C are interdependent but D isn't interdependent with them? (To be fair my PR doesn't address that case either, but then my PR is the 'quick hack', if we're doing a 'proper fix' let's make it actually proper :>)

@coolo
Copy link
Contributor

coolo commented Mar 6, 2019

mutual PARALLEL_WITH settings are explicitly forbidden as it creates a cycle

@AdamWill
Copy link
Contributor Author

AdamWill commented Mar 6, 2019

fun!

@codecov
Copy link

codecov bot commented Mar 12, 2019

Codecov Report

Merging #2017 into master will decrease coverage by 16.9%.
The diff coverage is 73.68%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #2017       +/-   ##
===========================================
- Coverage   89.11%   72.21%   -16.91%     
===========================================
  Files         156      130       -26     
  Lines       10408     9475      -933     
===========================================
- Hits         9275     6842     -2433     
- Misses       1133     2633     +1500
Impacted Files Coverage Δ
lib/OpenQA/Setup.pm 75% <ø> (-15%) ⬇️
lib/OpenQA/Schema/Result/Jobs.pm 71.74% <73.68%> (-19.6%) ⬇️
lib/OpenQA/Worker/Cache.pm 8.33% <0%> (-85.09%) ⬇️
lib/db_profiler.pm 15.38% <0%> (-84.62%) ⬇️
lib/OpenQA/Parser.pm 21.69% <0%> (-78.31%) ⬇️
lib/OpenQA/Parser/Result/OpenQA.pm 22.72% <0%> (-77.28%) ⬇️
lib/OpenQA/Client/Archive.pm 5.83% <0%> (-75.84%) ⬇️
lib/OpenQA/Parser/Result/Test.pm 33.33% <0%> (-66.67%) ⬇️
lib/OpenQA/Task/Asset/Limit.pm 13.63% <0%> (-65.91%) ⬇️
lib/OpenQA/WebAPI/Controller/API/V1/Asset.pm 34.61% <0%> (-63.47%) ⬇️
... and 78 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3e49dc...b8ab6a0. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 12, 2019

Codecov Report

Merging #2017 into master will increase coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2017      +/-   ##
==========================================
+ Coverage   88.95%   89.03%   +0.07%     
==========================================
  Files         156      156              
  Lines       10412    10422      +10     
==========================================
+ Hits         9262     9279      +17     
+ Misses       1150     1143       -7
Impacted Files Coverage Δ
lib/OpenQA/Schema/Result/Jobs.pm 91.44% <100%> (+0.1%) ⬆️
lib/OpenQA/Utils.pm 94.68% <0%> (+0.17%) ⬆️
lib/OpenQA/Worker/Common.pm 84.5% <0%> (+2.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a3f2415...0ce7bf2. Read the comment docs.

@AdamWill
Copy link
Contributor Author

OK, so I just sent a version that's basically the PARALLEL_CANCEL_UNRELATED_CHILDREN suggestion, only with a slightly different name (PARALLEL_CANCEL_WHOLE_CLUSTER) and with the default behaviour as the current behaviour, so you have to explicitly set the value of that setting to '0' to change it.

The value is read only for the parent of each cluster; if the parent has that setting set to '0' you get the minimal cancellation behaviour, otherwise you get the existing 'cancel everything and let God sort it out' behaviour.

How's that?

@AdamWill
Copy link
Contributor Author

tidy sure wanted some weird changes to this...why does it want PARENT: indented in a way that matches absolutely nothing else?

@coolo
Copy link
Contributor

coolo commented Mar 13, 2019

I approve the idea, but I think we need some documentation about it (taking that the default behaviour suprised you and I know I can ask you to write english :)

And your code in cluster_jobs looks like it can be done differently - but unless @kraih or @Martchus have an idea how to do it, I would let it pass and do it later, we have quite some test cases for it, so I can refactor it later.

# check if the setting to disable cancelwhole is set
my $cwset = $p->settings_hash->{PARALLEL_CANCEL_WHOLE_CLUSTER};
$cancelwhole = 0 if (defined $cwset && $cwset eq '0');
if ($args{cancelmode} & !$cancelwhole) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to be an &&?

# check if the setting to disable cancelwhole is set
my $cwset = $p->settings_hash->{PARALLEL_CANCEL_WHOLE_CLUSTER};
$cancelwhole = 0 if (defined $cwset && $cwset eq '0');
if ($args{cancelmode} & !$cancelwhole) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary & intended here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes to both of you, well spotted, thanks.

@Martchus
Copy link
Contributor

And your code in cluster_jobs looks like it can be done differently

I'm fine with the labels and goto-like usage of next. The code isn't too hard to read actually.

@kraih
Copy link
Member

kraih commented Mar 13, 2019

I don't mind the labels either.

@AdamWill
Copy link
Contributor Author

I approve the idea, but I think we need some documentation about it (taking that the default behaviour suprised you and I know I can ask you to write english :)

Sure - I just didn't want to waste the time writing it if we changed the approach again :) Now you're OK with the approach, I'll add some docs somewhere.

@AdamWill
Copy link
Contributor Author

And your code in cluster_jobs looks like it can be done differently

I'm fine with the labels and goto-like usage of next. The code isn't too hard to read actually.

Yeah, like most people I think I instinctively felt icky about doing it, but I sat there trying to think of a way to avoid it which wasn't actually just being more silly and complicated for the sake of avoiding the labels, and couldn't think of one :) I think this is one of the rare cases where they actually make perfect sense, especially since we can give them really nice names which 'feel' right.

@AdamWill AdamWill force-pushed the parallel-parents-nokill branch 3 times, most recently from def5153 to b5400a7 Compare March 13, 2019 16:44
@AdamWill
Copy link
Contributor Author

OK, there's a version with some doc text added. I also tweaked the variable check to just check that it's set (defined) and not true, rather than that it equals '0' - that way you can set it to any false-y value, including the empty string.

@AdamWill
Copy link
Contributor Author

AdamWill commented Mar 13, 2019

one thing that I wonder about now is: what happens if you restart a running parallel child with this new behaviour active? It won't restart the whole cluster, but will the restarted child correctly pick up the dependency on the parent and will locks and stuff behave as expected? I'm gonna test that out on our staging instance...

edit: so it turns out that restarting a running child does restart the whole cluster even with the change. That's not necessarily bad, in fact, I think I'm fine with that. The case that really annoyed us here was the case where a test failed, not really the case of restarting one. I'll update the docs again for this...

As discussed extensively in
https://progress.opensuse.org/issues/46295 , openQA job logic
makes an assumption that, any time a parallel child fails or is
cancelled, its parent and any other pending children of that
parent ought to be cancelled. This is the behaviour SUSE's tests
expect, but it is not the behaviour Fedora's tests expect. In
Fedora we have several cases of clusters where a parallel parent
acts as a server to multiple unrelated child tests; if one of
the children fails, that does not mean the parent and all other
children must be cancelled.

This patch adds a job setting to set whether parallel parents
with other pending children (and hence those children) will be
cancelled when one child fails or is cancelled. The default is
the current behaviour. For the parent and the other pending
children *not* to be canceled, the parent must have the setting
`PARALLEL_CANCEL_WHOLE_CLUSTER` set to 0 (or anything false-y;
empty string also works).

Signed-off-by: Adam Williamson <awilliam@redhat.com>
@AdamWill
Copy link
Contributor Author

OK, I revised the docs and comments to correctly reflect that this actually only covers cancel/fail cases (not restart at all), and re-ran tidy. I think this should be good now.

@AdamWill
Copy link
Contributor Author

AdamWill commented Mar 15, 2019

@coolo @kraih ? Note this is deployed in Fedora staging now and is working fine, nothing has exploded and it's behaving as intended.

@coolo coolo merged commit 104edc4 into os-autoinst:master Mar 19, 2019
coolo pushed a commit that referenced this pull request Mar 19, 2019
commit 104edc4
Merge: 62e9c50 0ce7bf2
Author:     Stephan Kulow <stephan@kulow.org>
AuthorDate: Tue Mar 19 11:07:12 2019 +0100
Commit:     GitHub <noreply@github.com>
CommitDate: Tue Mar 19 11:07:12 2019 +0100

    Merge pull request #2017 from AdamWill/parallel-parents-nokill

    Option to not cancel parallel parents with still-pending children
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants