Don't delete gru task from db if minion job has not completed #2008
In Fedora openQA, I noticed a problem related to the recent change to run gru/minion tasks/jobs in parallel. We put guards in download_asset and several other non-parallel-safe tasks to prevent multiple instances of them running in parallel with each other: the tasks use a 'guard' (basically a lock) and if they cannot acquire it, they restart themselves.

However, it seems that when this happens, the minion execute() returns and we wind up in this block of our superclassed execute() which assumes that at this point the job has actually run, and deletes the corresponding entry from the gru task db.

This was a big problem in Fedora, because we schedule the same asset as multiple flavors, so each flavor gets its jobs blocked on a download_asset gru task. One task executes 'normally' and will not be deleted from the gru task db until the asset is actually downloaded, but the other task hits the retry mechanism and is immediately deleted from the db even though the asset has not yet downloaded...so all the jobs for that flavor run, and fail immediately.

To protect against this, we can check the state of the minion job after we call `finish` and only delete the task from the db if the job is actually 'failed' or 'finished'. Otherwise we leave the entry in the db and wait for the job to run again and actually complete this time.

Related ticket: [poo#48554](https://progress.opensuse.org/issues/48554)

Signed-off-by: Adam Williamson <awilliam@redhat.com>
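As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch. It is not the openQA implementation: `%gru_tasks`, `job_is_done` and `cleanup_after_execute` are hypothetical names, the gru task table is replaced by an in-memory hash, and it assumes that a retried Minion job goes back to the `inactive` state.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the gru task table: gru_id => task name.
my %gru_tasks = (1 => 'download_asset', 2 => 'download_asset');

# Only the two terminal states mean the job is really done.
sub job_is_done {
    my ($state) = @_;
    return $state eq 'finished' || $state eq 'failed';
}

# Delete the gru task entry only if the job reached a terminal state.
# Returns 1 if the entry was deleted, 0 if it was left in place.
sub cleanup_after_execute {
    my ($tasks, $gru_id, $state) = @_;
    return 0 unless defined $gru_id && job_is_done($state);
    delete $tasks->{$gru_id};
    return 1;
}

# Job 1 acquired the guard and completed. Job 2 could not acquire the
# guard and retried itself, so it is assumed to be 'inactive' again.
my $deleted_first  = cleanup_after_execute(\%gru_tasks, 1, 'finished');
my $deleted_second = cleanup_after_execute(\%gru_tasks, 2, 'inactive');

print join(',', sort keys %gru_tasks), "\n";    # prints "2"
```

In this sketch the entry for the retried job stays in the table, so anything blocked on it would keep waiting until the job runs again and finishes or fails.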
So, this does seem to actually work to solve the bug I'm trying to solve. I have it running on Fedora openQA staging now and it seems to be behaving correctly. There are a couple of questions, though:
I really hope the answer to 1 is "it's not possible" because it's a really tricky choice to make. If we delete the task from the gru task db while the minion job is actually still running, that's clearly wrong...but if somehow we get here with the state as 'active', if we don't delete the job, can we be sure we'll wind up back here with the state as 'finished' or 'failed'? I am just not sure. I do think the answer is simply "it's not possible", but I'm not 100% sure. We can obviously make the conditional As to 2 - |
# and we do *not* want to delete the corresponding entry from the gru
# task database. We only want to delete that if the job is really done
if (grep { /$state/ } ('failed', 'finished')) {
    if ($gruid) {
in case you're wondering why I nested this: `if (grep { /$state/ } ('failed', 'finished') && $gruid)` doesn't actually seem to work right for some reason. If anyone knows why and can fix it, let me know. :D
My guess would be that due to precedence rules the last expression (`('failed', 'finished') && $gruid`) gets evaluated first.
`if ((grep { /$state/ } ('failed', 'finished')) && $gruid)` should work though, right?
Although `grep { $_ eq $state }` would be better style.
Actually, why not be totally clear with `if (($state eq 'finished' || $state eq 'failed') && $gruid)`?
I prefer the last version.
ahhh, my old adversary the bracket, it is you once more?!
thanks guys, will tweak.
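The precedence question in this thread can be reproduced in isolation. The sketch below assumes hypothetical values for `$state` and `$gruid` and hypothetical function names. Without the extra parentheses, `grep` takes everything to its right as its list, so the list becomes `('failed', 'finished') && $gruid`, which evaluates to just `$gruid`.

```perl
use strict;
use warnings;

# Form from the first review comment: grep's list is the whole
# expression "('failed', 'finished') && $gruid".
sub should_delete_unparenthesized {
    my ($state, $gruid) = @_;
    no warnings 'void';    # silence a possible "useless use of a constant" warning
    my $ok = (grep { /$state/ } ('failed', 'finished') && $gruid) ? 1 : 0;
    return $ok;
}

# Parenthesized grep, using eq instead of a pattern match.
sub should_delete_grep {
    my ($state, $gruid) = @_;
    my $ok = ((grep { $_ eq $state } ('failed', 'finished')) && $gruid) ? 1 : 0;
    return $ok;
}

# Explicit comparison, the version preferred in the review.
sub should_delete_explicit {
    my ($state, $gruid) = @_;
    my $ok = (($state eq 'finished' || $state eq 'failed') && $gruid) ? 1 : 0;
    return $ok;
}

# Hypothetical values: a finished job with gru id 42.
printf "unparenthesized=%d grep=%d explicit=%d\n",
    should_delete_unparenthesized('finished', 42),
    should_delete_grep('finished', 42),
    should_delete_explicit('finished', 42);
```

With these inputs the unparenthesized form ends up matching `42` against `/finished/`, so it reports that nothing should be deleted, while the other two forms agree that the entry can go.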
And yes, no tests yet. I am not sure how best to go about testing this, but I'll try and think of something if I can... |
Codecov Report
@@ Coverage Diff @@
## master #2008 +/- ##
=========================================
- Coverage 89.08% 89% -0.08%
=========================================
Files 153 153
Lines 10358 10367 +9
=========================================
Hits 9227 9227
- Misses 1131 1140 +9
Maybe @kraih knows. The change makes sense in general, but yes, tests are missing. |
Unfortunately the answer to 1 is yes, Minion allows for a job to be retried again after it reached the state finished. This is not done in openQA to my knowledge, but it is possible for a user to retry a finished job from the Minion admin ui. Also, finished and failed to Minion are not the same. Minion never deletes a failed job automatically. The assumption is that a failed job has to be reviewed by a user or bot (who can then retry or remove it manually). Edit: Since Gru hijacks the |
@@ -17,6 +17,7 @@ package OpenQA::WebAPI::GruJob;
use Mojo::Base 'Minion::Job';

use OpenQA::Utils 'log_error';
use OpenQA::Utils 'log_debug';
use OpenQA::Utils qw(log_debug log_error);
my $state = $self->info->{state};
my $jobid = $self->id;
my $gruid;
$gruid = $self->info->{notes}{gru_id} if exists $self->info->{notes}{gru_id};
You're triple requesting `$self->info` here, that's two extra Postgres roundtrips for the whole info data structure.
oh, fair point. will fix.
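One way to address the review point is to fetch the info hash once and read both values from the local copy. The sketch below is only an illustration: `FakeJob` is a hypothetical stand-in for the Minion job object that counts how often `info` is called, and the returned values are made up.

```perl
use strict;
use warnings;

package FakeJob;    # hypothetical stand-in for the real job class

my $info_calls = 0;

sub new { return bless {}, shift }
sub id  { return 123 }

# Each call here stands for one database roundtrip in the real class.
sub info {
    $info_calls++;
    return {state => 'finished', notes => {gru_id => 7}};
}

sub info_calls { return $info_calls }

package main;

my $job = FakeJob->new;

# Fetch the info structure once, then read from the local copy.
my $info  = $job->info;
my $state = $info->{state};
my $jobid = $job->id;
my $gruid = $info->{notes}{gru_id};    # undef if no gru_id was noted

print "state=$state jobid=$jobid gruid=$gruid info_calls=", FakeJob::info_calls(), "\n";
```

Reading `$info->{notes}{gru_id}` on the local copy may autovivify an empty `notes` hash if the key is absent, but that only affects the local copy.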
And i agree, this problem definitely needs a test case to decide which would be the correct solution. |
@AdamWill And regarding 2, |
Think i have an idea for how to test this, will open a separate pull request for it later today or tomorrow. |
thanks, I'm not entirely sure I'm following the comment, though. Question 1 was about whether we can reach this specific point in the subclassed If you're also saying that we shouldn't really delete jobs from GruTasks even after they're 'finished' because they might be re-tried, then indeed that seems a valid concern, but then this PR isn't making things worse there - we're already doing that right now. If that's the concern, then this PR actually makes things better, it just doesn't completely fix them, right? |
Concerning the "jobs can be retried under the same ID" case - well, what GruTasks really is these days (AIUI) is a table of pending and active jobs. From my poking around in the code, I get the impression that minion emits events, right? Could we perhaps redesign things so jobs get added to and removed from GruTasks based on the events emitted on state changes? As it seems like it's going to be relevant: does anyone know of anything remaining in openQA that actually uses the GruTasks / GruDependencies tables besides this 'openQA tests can be made to wait on gru tasks' mechanism?
@AdamWill Yes, i did not explain that very well, but the subject is rather complicated. 😉 Anyway, i do think i have a solution and am now thinking of more edge cases to test. master...kraih:gru_tasks_cleanup |
OK, thanks :) I think it might be wise to preserve the checks around whether keys in the |
@AdamWill Perl autovivifies that, no need to worry. Here you would only use |
This should be resolved with #2011. |