
Don't delete gru task from db if minion job has not completed #2008

Closed

Conversation

AdamWill commented Mar 1, 2019

In Fedora openQA, I noticed a problem related to the recent
change to run gru/minion tasks/jobs in parallel. We put guards
in download_asset and several other non-parallel-safe tasks to
prevent multiple instances of them running in parallel with
each other: the tasks use a 'guard' (basically a lock) and if
they cannot acquire it, they restart themselves.

However, it seems that when this happens, the minion execute()
returns and we wind up in this block of our superclassed
execute() which assumes that at this point the job has actually
run, and deletes the corresponding entry from the gru task db.
This was a big problem in Fedora, because we schedule the same
asset as multiple flavors, so each flavor gets its jobs blocked
on a download_asset gru task. One task executes 'normally' and
will not be deleted from the gru task db until the asset is
actually downloaded, but the other task hits the retry mechanism
and is immediately deleted from the db even though the asset has
not yet downloaded...so all the jobs for that flavor run, and
fail immediately.

To protect against this, we can check the state of the minion
job after we call finish and only delete the task from the db
if the job is actually 'failed' or 'finished'. Otherwise we
leave the entry in the db and wait for the job to run again and
actually complete this time.

Related ticket: [poo#48554](https://progress.opensuse.org/issues/48554)

Signed-off-by: Adam Williamson <awilliam@redhat.com>
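The check described above can be sketched roughly as follows. This is illustrative Perl only, not the exact openQA code: the `GruTasks` resultset name comes from the openQA schema, but the surrounding structure and the `$self->app->schema` access are assumptions.

```perl
# Illustrative sketch of the proposed guard in the subclassed execute().
my $state = $self->info->{state};
my $gruid = $self->info->{notes}{gru_id};

# Only remove the gru task row once the minion job has really completed;
# a job that could not acquire its guard and retried itself is still pending.
if (($state eq 'finished' || $state eq 'failed') && $gruid) {
    $self->app->schema->resultset('GruTasks')->search({id => $gruid})->delete;
}
```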


AdamWill commented Mar 1, 2019

So, this does actually seem to solve the bug in question. I have it running on Fedora openQA staging now and it seems to be behaving correctly.

There are a couple of questions, though:

  1. Is it possible for the job state to be 'active' after finish is called? If so, should we delete the task from the db or not at that point?
  2. Would it be better to use $self->is_finished instead of the string check on $self->info->{state}?

I really hope the answer to 1 is "it's not possible", because otherwise it's a really tricky choice to make. If we delete the task from the gru task db while the minion job is actually still running, that's clearly wrong; but if we somehow get here with the state as 'active' and don't delete the job, can we be sure we'll wind up back here later with the state as 'finished' or 'failed'? I'm just not sure. I do think the answer is simply "it's not possible", but I'm not 100% sure. We can obviously make the conditional unless ($state eq 'inactive') instead of if (grep { /$state/ } ('failed', 'finished')), if we prefer it that way.

As to 2 - is_finished basically runs through waitpid. I don't really know if that would be more reliable or less reliable here. Not sure if anyone has an opinion.

# and we do *not* want to delete the corresponding entry from the gru
# task database. We only want to delete that if the job is really done
if (grep { /$state/ } ('failed', 'finished')) {
    if ($gruid) {
AdamWill (author):

in case you're wondering why I nested this: if (grep { /$state/ } ('failed', 'finished') && $gruid) doesn't actually seem to work right for some reason. If anyone knows why and can fix it, let me know. :D

Contributor:

My guess would be that due to precedence rules the last expression (('failed', 'finished') && $gruid) gets evaluated first.

Member:

if ((grep { /$state/ } ('failed', 'finished')) && $gruid) should work though, right?

Member:

Although grep { $_ eq $state } would be better style.

kraih (Member), Mar 4, 2019:

Actually, why not be totally clear with if (($state eq 'finished' || $state eq 'failed') && $gruid)?

Contributor:

I prefer the last version.

AdamWill (author):

ahhh, my old adversary the bracket, it is you once more?!

thanks guys, will tweak.
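For anyone puzzled by the precedence issue discussed above, a small standalone illustration (plain Perl, runnable on its own; the values are made up for demonstration):

```perl
use strict;
use warnings;

my $state = 'finished';
my $gruid = 42;

# Broken: grep is a low-precedence list operator, so this parses as
# grep { /$state/ } (('failed', 'finished') && $gruid) -- in boolean context
# the list collapses and grep actually runs over (42) only.
my $broken = grep { /$state/ } ('failed', 'finished') && $gruid;

# Fixed by parenthesizing the grep, as suggested in the review:
my $fixed = (grep { $_ eq $state } ('failed', 'finished')) && $gruid;

# Or spelled out entirely, which was the preferred version:
my $clear = ($state eq 'finished' || $state eq 'failed') && $gruid;

print "broken=$broken fixed=$fixed clear=$clear\n";
# prints: broken=0 fixed=42 clear=42
```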


AdamWill commented Mar 1, 2019

@Martchus


AdamWill commented Mar 1, 2019

And yes, no tests yet. I am not sure how best to go about testing this, but I'll try and think of something if I can...


codecov bot commented Mar 1, 2019

Codecov Report

Merging #2008 into master will decrease coverage by 0.07%.
The diff coverage is 10%.


@@            Coverage Diff            @@
##           master   #2008      +/-   ##
=========================================
- Coverage   89.08%     89%   -0.08%     
=========================================
  Files         153     153              
  Lines       10358   10367       +9     
=========================================
  Hits         9227    9227              
- Misses       1131    1140       +9
Impacted Files Coverage Δ
lib/OpenQA/WebAPI/GruJob.pm 8.1% <10%> (+0.96%) ⬆️
lib/OpenQA/Worker/Commands.pm 81.17% <0%> (-2.36%) ⬇️
lib/OpenQA/Utils.pm 94.74% <0%> (+0.17%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e27d66d...27a1ef7.


Martchus commented Mar 4, 2019

> There are a couple of questions, though

Maybe @kraih knows.


The change makes sense in general, but yes, tests are missing.


kraih commented Mar 4, 2019

Unfortunately the answer to 1 is yes, Minion allows for a job to be retried again after it reached the state finished. This is not done in openQA to my knowledge, but it is possible for a user to retry a finished job from the Minion admin ui. Also, finished and failed to Minion are not the same. Minion never deletes a failed job automatically. The assumption is that a failed job has to be reviewed by a user or bot (who can then retry or remove it manually).

Edit: Since Gru hijacks the execute method of Minion there are actually many cases where the job will still be active. That means the proposed solution does not work correctly.
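To illustrate the retry behaviour described here, a sketch against the public Minion API (assumes an existing $minion instance and a known $job_id; not code from this PR):

```perl
# A finished Minion job can be put back in the queue under the same id.
my $job = $minion->job($job_id);    # look up an existing job by id
$job->retry if $job->info->{state} eq 'finished';
# The job's state is now 'inactive' again, so a GruTasks row deleted at
# 'finished' would already be gone even though the job will run once more.
```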

@@ -17,6 +17,7 @@ package OpenQA::WebAPI::GruJob;
use Mojo::Base 'Minion::Job';

use OpenQA::Utils 'log_error';
use OpenQA::Utils 'log_debug';
Member:

use OpenQA::Utils qw(log_debug log_error);

my $state = $self->info->{state};
my $jobid = $self->id;
my $gruid;
$gruid = $self->info->{notes}{gru_id} if exists $self->info->{notes}{gru_id};
Member:

You're triple requesting $self->info here, that's two extra Postgres roundtrips for the whole info data structure.

AdamWill (author):

oh, fair point. will fix.
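The fix for the repeated lookups is simply to fetch the info hash once and reuse it (sketch only):

```perl
# One ->info call, one Postgres round trip; reuse the hash afterwards.
my $info  = $self->info;
my $state = $info->{state};
my $jobid = $self->id;
my $gruid = $info->{notes}{gru_id};    # simply undef if the note is absent
```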


kraih commented Mar 4, 2019

And i agree, this problem definitely needs a test case to decide which would be the correct solution.


kraih commented Mar 4, 2019

@AdamWill And regarding 2, ->is_finished is only for Minion job process management and unrelated to the actual state of the job. So definitely a no on using it here.


kraih commented Mar 4, 2019

Think i have an idea for how to test this, will open a separate pull request for it later today or tomorrow.


AdamWill commented Mar 4, 2019

> Unfortunately the answer to 1 is yes, Minion allows for a job to be retried again after it reached the state finished. This is not done in openQA to my knowledge, but it is possible for a user to retry a finished job from the Minion admin ui. Also, finished and failed to Minion are not the same. Minion never deletes a failed job automatically. The assumption is that a failed job has to be reviewed by a user or bot (who can then retry or remove it manually).
>
> Edit: Since Gru hijacks the execute method of Minion there are actually many cases where the job will still be active. That means the proposed solution does not work correctly.

thanks, I'm not entirely sure I'm following the comment, though. Question 1 was about whether we can reach this specific point in the subclassed execute() with the job state as 'active' - just because a job is retried, why would we get here with its state as 'active'? edit: oh, I see, I phrased the question a bit ambiguously. I was only thinking of "could the job state be 'active' when we hit this specific code block", but you read it more as "could the job state ever be 'active' again after this code block has run", right? In which case, see below.

If you're also saying that we shouldn't really delete jobs from GruTasks even after they're 'finished' because they might be re-tried, then indeed that seems a valid concern, but then this PR isn't making things worse there - we're already doing that right now. If that's the concern, then this PR actually makes things better, it just doesn't completely fix them, right?


AdamWill commented Mar 4, 2019

Concerning the "jobs can be retried under the same ID" case - well, what GruTasks really is these days (AIUI) is a table of pending and active jobs. From my poking around in the code, I get the impression that minion emits events, right? Could we perhaps redesign things so jobs get added to and removed from GruTasks based on the events emitted on state changes?
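The event-based idea could look something like this. This is a purely hypothetical sketch: delete_gru_task and mark_gru_task_failed are invented helpers, and it assumes Minion::Worker's dequeue event plus the finished/failed events on Minion::Job.

```perl
# Hypothetical: keep GruTasks in sync from Minion's own state transitions
# instead of deleting rows inside a subclassed execute().
$worker->on(dequeue => sub {
    my ($worker, $job) = @_;
    $job->on(finished => sub { delete_gru_task(shift) });      # invented helper
    $job->on(failed   => sub { mark_gru_task_failed(shift) }); # invented helper
});
```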

As it seems like it's going to be relevant: does anyone know of anything remaining in openQA that actually uses the GruTasks / GruDependencies tables besides this 'openQA tests can be made to wait on gru tasks' mechanism?


kraih commented Mar 4, 2019

@AdamWill Yes, i did not explain that very well, but the subject is rather complicated. 😉 Anyway, i do think i have a solution and am now thinking of more edge cases to test. master...kraih:gru_tasks_cleanup


AdamWill commented Mar 4, 2019

OK, thanks :) I think it might be wise to preserve the checks around whether keys in the {info} hash exist at all? Presumably they're there for a reason (likely you get errors or ugly log messages when trying to use $self->info->{gru_id} if it doesn't exist at all).


kraih commented Mar 4, 2019

@AdamWill Perl autovivifies that, no need to worry. Here you would only use exists if a valid Gru id could be a false Perl value, which i don't believe to be the case.

@kraih kraih mentioned this pull request Mar 4, 2019

kraih commented Mar 5, 2019

This should be resolved with #2011.

@kraih kraih closed this Mar 5, 2019