Option to not cancel parallel parents with still-pending children #2017
I would really prefer it if this weren't done site-globally, but depended on job settings. And it's not that 'SUSE tests' expect this; @asmorodskyi's tests expect this, so the next SUSE engineer might actually expect something else. And I would even add a setting like PARALLEL_CANCEL_UNRELATED_CHILDREN (name to be discussed) to the current SUSE tests.
Well, to me the most obvious way to do it based on job settings is to treat The problem with that is that your tests are not set up for this, so if I send a PR that does that, it's going to break your tests. If you don't mind that, I can take a swing at implementing it. Fixing your tests ideally shouldn't be hard - you'd just have to mark every test suite in the cluster as being I can think of a couple of other more complicated ways of doing it, but I think they'd be somewhat more complex to implement. We could have an alternative to I'm not sure about |
Mutual PARALLEL_WITH settings are explicitly forbidden, as they would create a cycle.
fun! |
Force-pushed: 9f8bd65 to b8ab6a0
Codecov Report
@@             Coverage Diff              @@
##           master    #2017       +/-   ##
===========================================
- Coverage   89.11%   72.21%   -16.91%
===========================================
  Files         156      130       -26
  Lines       10408     9475      -933
===========================================
- Hits         9275     6842     -2433
- Misses       1133     2633     +1500
Codecov Report
@@            Coverage Diff             @@
##           master    #2017      +/-   ##
==========================================
+ Coverage   88.95%   89.03%   +0.07%
==========================================
  Files         156      156
  Lines       10412    10422      +10
==========================================
+ Hits         9262     9279      +17
+ Misses       1150     1143       -7
OK, so I just sent a version that's basically the The value is read only for the parent of each cluster; if the parent has that setting set to '0' you get the minimal cancellation behaviour, otherwise you get the existing 'cancel everything and let God sort it out' behaviour. How's that? |
Force-pushed: b8ab6a0 to 63a749e
tidy sure wanted some weird changes to this...why does it want |
I approve of the idea, but I think we need some documentation about it (given that the default behaviour surprised you, and I know I can ask you to write English :). And your code in cluster_jobs looks like it could be done differently, but unless @kraih or @Martchus have an idea how to do it, I would let it pass and do it later; we have quite a few test cases for it, so I can refactor it later.
lib/OpenQA/Schema/Result/Jobs.pm
Outdated
# check if the setting to disable cancelwhole is set
my $cwset = $p->settings_hash->{PARALLEL_CANCEL_WHOLE_CLUSTER};
$cancelwhole = 0 if (defined $cwset && $cwset eq '0');
if ($args{cancelmode} & !$cancelwhole) {
Is this meant to be an &&?
lib/OpenQA/Schema/Result/Jobs.pm
Outdated
# check if the setting to disable cancelwhole is set
my $cwset = $p->settings_hash->{PARALLEL_CANCEL_WHOLE_CLUSTER};
$cancelwhole = 0 if (defined $cwset && $cwset eq '0');
if ($args{cancelmode} & !$cancelwhole) {
Binary & intended here?
yes to both of you, well spotted, thanks.
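To illustrate why the reviewers flagged the operator: Perl's binary `&` compares bits, while `&&` tests truthiness. The two agree when both operands are exactly 0 or 1, but diverge for any other truthy value. The sketch below uses stand-in variables; the value 2 for `$cancelmode` is purely hypothetical and is not taken from the openQA code.

```perl
use strict;
use warnings;

# Hypothetical values, chosen only to show where the two operators diverge.
my $cancelmode  = 2;    # truthy, but not exactly 1
my $cancelwhole = 0;    # so !$cancelwhole is true (numerically 1)

# Bitwise AND: 2 & 1 == 0, so the condition is false.
my $bitwise = ($cancelmode & !$cancelwhole) ? 1 : 0;

# Logical AND: both sides are true, so the condition is true.
my $logical = ($cancelmode && !$cancelwhole) ? 1 : 0;

print "bitwise=$bitwise logical=$logical\n";    # prints "bitwise=0 logical=1"
```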
I'm fine with the labels and |
I don't mind the labels either. |
Sure - I just didn't want to waste the time writing it if we changed the approach again :) Now you're OK with the approach, I'll add some docs somewhere. |
Yeah, like most people I think I instinctively felt icky about doing it, but I sat there trying to think of a way to avoid it which wasn't actually just being more silly and complicated for the sake of avoiding the labels, and couldn't think of one :) I think this is one of the rare cases where they actually make perfect sense, especially since we can give them really nice names which 'feel' right. |
Force-pushed: def5153 to b5400a7
OK, there's a version with some doc text added. I also tweaked the variable check to just check that it's set (defined) and not true, rather than that it equals |
one thing that I wonder about now is: what happens if you restart a running parallel child with this new behaviour active? It won't restart the whole cluster, but will the restarted child correctly pick up the dependency on the parent and will locks and stuff behave as expected? I'm gonna test that out on our staging instance... edit: so it turns out that restarting a running child does restart the whole cluster even with the change. That's not necessarily bad, in fact, I think I'm fine with that. The case that really annoyed us here was the case where a test failed, not really the case of restarting one. I'll update the docs again for this... |
As discussed extensively in https://progress.opensuse.org/issues/46295 , openQA job logic makes an assumption that, any time a parallel child fails or is cancelled, its parent and any other pending children of that parent ought to be cancelled. This is the behaviour SUSE's tests expect, but it is not the behaviour Fedora's tests expect. In Fedora we have several cases of clusters where a parallel parent acts as a server to multiple unrelated child tests; if one of the children fails, that does not mean the parent and all other children must be cancelled.

This patch adds a job setting to set whether parallel parents with other pending children (and hence those children) will be cancelled when one child fails or is cancelled. The default is the current behaviour. For the parent and the other pending children *not* to be canceled, the parent must have the setting `PARALLEL_CANCEL_WHOLE_CLUSTER` set to 0 (or anything false-y; empty string also works).

Signed-off-by: Adam Williamson <awilliam@redhat.com>
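A minimal sketch of the setting check described above, assuming the parent's settings arrive as a plain hash reference. The helper name `cancel_whole_cluster` is hypothetical and is not the function used in `lib/OpenQA/Schema/Result/Jobs.pm`; only the setting name and the "defined and false-y" rule come from the description.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 for the default "cancel everything" behaviour,
# 0 when the parent explicitly opts out with a defined but false-y value.
sub cancel_whole_cluster {
    my ($settings) = @_;
    my $value = $settings->{PARALLEL_CANCEL_WHOLE_CLUSTER};
    return 0 if defined $value && !$value;    # '0' and '' both opt out
    return 1;                                 # unset or true: current behaviour
}

for my $case ({}, {PARALLEL_CANCEL_WHOLE_CLUSTER => '0'}, {PARALLEL_CANCEL_WHOLE_CLUSTER => '1'}) {
    print cancel_whole_cluster($case), "\n";
}
```

Note that under Perl's truthiness rules a string such as `'00'` or `'0.0'` counts as true, so only `0`, `'0'` and the empty string opt out in this sketch.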
Force-pushed: b5400a7 to 0ce7bf2
OK, I revised the docs and comments to correctly reflect that this actually only covers cancel/fail cases (not restart at all), and re-ran tidy. I think this should be good now. |
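The cancel/fail behaviour can be sketched with a toy model of a cluster. Everything here is an assumption made for illustration: the job hashes, the state names treated as pending, the function `jobs_to_cancel`, and in particular the choice to cancel the parent once no other pending children remain (inferred from the PR title, not confirmed by the code). It does not model restarts.

```perl
use strict;
use warnings;

# Assumed set of states that count as "pending" in this toy model.
my %PENDING = map { $_ => 1 } qw(scheduled running);

# Returns the names of jobs to cancel after the child named $failed
# has failed or been cancelled. $cancel_whole is the already-evaluated
# PARALLEL_CANCEL_WHOLE_CLUSTER decision for the parent (1 = default).
sub jobs_to_cancel {
    my ($parent, $children, $failed, $cancel_whole) = @_;

    # Siblings of the failed child that are still pending.
    my @others = grep { $_->{name} ne $failed && $PENDING{$_->{state}} } @$children;

    # Minimal mode: other children still need the parent, so cancel nothing.
    return () if !$cancel_whole && @others;

    # Otherwise cancel the parent (if pending) and any pending siblings.
    return map { $_->{name} } grep { $PENDING{$_->{state}} } ($parent, @others);
}

my $server   = {name => 'server', state => 'running'};
my @children = (
    {name => 'a', state => 'failed'},
    {name => 'b', state => 'running'},
    {name => 'c', state => 'scheduled'},
);

print join(',', jobs_to_cancel($server, \@children, 'a', 1)), "\n";    # default mode
print join(',', jobs_to_cancel($server, \@children, 'a', 0)), "\n";    # minimal mode
```

In default mode the failure of `a` takes down the server and both siblings; in minimal mode nothing else is cancelled while `b` and `c` are still pending.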
commit 104edc4
Merge: 62e9c50 0ce7bf2
Author: Stephan Kulow <stephan@kulow.org>
AuthorDate: Tue Mar 19 11:07:12 2019 +0100
Commit: GitHub <noreply@github.com>
CommitDate: Tue Mar 19 11:07:12 2019 +0100

    Merge pull request #2017 from AdamWill/parallel-parents-nokill

    Option to not cancel parallel parents with still-pending children
As discussed extensively in
https://progress.opensuse.org/issues/46295 , openQA job logic
makes an assumption that, any time a parallel child fails or is
cancelled, its parent and any other pending children of that
parent ought to be cancelled. This is the behaviour SUSE's tests
expect, but it is not the behaviour Fedora's tests expect. In
Fedora we have several cases of clusters where a parallel parent
acts as a server to multiple unrelated child tests; if one of
the children fails, that does not mean the parent and all other
children must be cancelled.
This patch adds a global config option to set whether parallel
parents with other pending children (and hence those children)
will be cancelled when one child fails or is cancelled. It
defaults to 1, i.e. the current behaviour.
Signed-off-by: Adam Williamson <awilliam@redhat.com>