
Don't delete gru task from db if minion job has not completed #2008

Closed

Conversation

AdamWill commented Mar 1, 2019

In Fedora openQA, I noticed a problem related to the recent
change to run gru/minion tasks/jobs in parallel. We put guards
in download_asset and several other non-parallel-safe tasks to
prevent multiple instances of them running in parallel with
each other: the tasks use a 'guard' (basically a lock) and if
they cannot acquire it, they restart themselves.

However, it seems that when this happens, the minion execute()
returns and we wind up in this block of our superclassed
execute() which assumes that at this point the job has actually
run, and deletes the corresponding entry from the gru task db.
This was a big problem in Fedora, because we schedule the same
asset as multiple flavors, so each flavor gets its jobs blocked
on a download_asset gru task. One task executes 'normally' and
will not be deleted from the gru task db until the asset is
actually downloaded, but the other task hits the retry mechanism
and is immediately deleted from the db even though the asset has
not yet downloaded...so all the jobs for that flavor run, and
fail immediately.

To protect against this, we can check the state of the minion
job after we call finish and only delete the task from the db
if the job is actually 'failed' or 'finished'. Otherwise we
leave the entry in the db and wait for the job to run again and
actually complete this time.

Related ticket: [poo#48554](https://progress.opensuse.org/issues/48554)

Signed-off-by: Adam Williamson <awilliam@redhat.com>
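The check described above can be sketched roughly as follows. This is illustrative Perl only, not the exact openQA code: the `GruTasks` resultset name comes from the openQA schema, but the surrounding structure and the `$self->app->schema` access are assumptions.

```perl
# Illustrative sketch of the proposed guard in the subclassed execute().
my $state = $self->info->{state};
my $gruid = $self->info->{notes}{gru_id};

# Only remove the gru task row once the minion job has really completed;
# a job that could not acquire its guard and retried itself is still pending.
if (($state eq 'finished' || $state eq 'failed') && $gruid) {
    $self->app->schema->resultset('GruTasks')->search({id => $gruid})->delete;
}
```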


AdamWill commented Mar 1, 2019

So, this does actually seem to solve the bug in question. I have it running on Fedora openQA staging now and it seems to be behaving correctly.

There are a couple of questions, though:

  1. Is it possible for the job state to be 'active' after finish is called? If so, should we delete the task from the db or not at that point?
  2. Would it be better to use $self->is_finished instead of the string check on $self->info->{state}?

I really hope the answer to 1 is "it's not possible", because otherwise it's a really tricky choice to make. If we delete the task from the gru task db while the minion job is actually still running, that's clearly wrong; but if we somehow get here with the state as 'active' and don't delete the job, can we be sure we'll wind up back here later with the state as 'finished' or 'failed'? I'm just not sure. I do think the answer is simply "it's not possible", but I'm not 100% sure. We can obviously make the conditional unless ($state eq 'inactive') instead of if (grep { /$state/ } ('failed', 'finished')), if we prefer it that way.

As to 2 - is_finished basically runs through waitpid. I don't really know if that would be more reliable or less reliable here. Not sure if anyone has an opinion.

# and we do *not* want to delete the corresponding entry from the gru
# task database. We only want to delete that if the job is really done
if (grep { /$state/ } ('failed', 'finished')) {
    if ($gruid) {
AdamWill (author):

in case you're wondering why I nested this: if (grep { /$state/ } ('failed', 'finished') && $gruid) doesn't actually seem to work right for some reason. If anyone knows why and can fix it, let me know. :D

Contributor:

My guess would be that due to precedence rules the last expression (('failed', 'finished') && $gruid) gets evaluated first.

Member:

if ((grep { /$state/ } ('failed', 'finished')) && $gruid) should work though, right?

Member:

Although grep { $_ eq $state } would be better style.

kraih (Member), Mar 4, 2019:

Actually, why not be totally clear with if (($state eq 'finished' || $state eq 'failed') && $gruid)?

Contributor:

I prefer the last version.

AdamWill (author):

ahhh, my old adversary the bracket, it is you once more?!

thanks guys, will tweak.
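For anyone puzzled by the precedence issue discussed above, a small standalone illustration (plain Perl, runnable on its own; the values are made up for demonstration):

```perl
use strict;
use warnings;

my $state = 'finished';
my $gruid = 42;

# Broken: grep is a low-precedence list operator, so this parses as
# grep { /$state/ } (('failed', 'finished') && $gruid) -- in boolean context
# the list collapses and grep actually runs over (42) only.
my $broken = grep { /$state/ } ('failed', 'finished') && $gruid;

# Fixed by parenthesizing the grep, as suggested in the review:
my $fixed = (grep { $_ eq $state } ('failed', 'finished')) && $gruid;

# Or spelled out entirely, which was the preferred version:
my $clear = ($state eq 'finished' || $state eq 'failed') && $gruid;

print "broken=$broken fixed=$fixed clear=$clear\n";
# prints: broken=0 fixed=42 clear=42
```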


AdamWill commented Mar 1, 2019

@Martchus


AdamWill commented Mar 1, 2019

And yes, no tests yet. I am not sure how best to go about testing this, but I'll try and think of something if I can...


codecov bot commented Mar 1, 2019

Codecov Report

Merging #2008 into master will decrease coverage by 0.07%.
The diff coverage is 10%.


@@            Coverage Diff            @@
##           master   #2008      +/-   ##
=========================================
- Coverage   89.08%     89%   -0.08%     
=========================================
  Files         153     153              
  Lines       10358   10367       +9     
=========================================
  Hits         9227    9227              
- Misses       1131    1140       +9
Impacted Files Coverage Δ
lib/OpenQA/WebAPI/GruJob.pm 8.1% <10%> (+0.96%) ⬆️
lib/OpenQA/Worker/Commands.pm 81.17% <0%> (-2.36%) ⬇️
lib/OpenQA/Utils.pm 94.74% <0%> (+0.17%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e27d66d...27a1ef7.


Martchus commented Mar 4, 2019

> There are a couple of questions, though

Maybe @kraih knows.


The change makes sense in general, but yes, tests are missing.


kraih commented Mar 4, 2019

Unfortunately the answer to 1 is yes, Minion allows for a job to be retried again after it reached the state finished. This is not done in openQA to my knowledge, but it is possible for a user to retry a finished job from the Minion admin ui. Also, finished and failed to Minion are not the same. Minion never deletes a failed job automatically. The assumption is that a failed job has to be reviewed by a user or bot (who can then retry or remove it manually).

Edit: Since Gru hijacks the execute method of Minion there are actually many cases where the job will still be active. That means the proposed solution does not work correctly.
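To illustrate the retry behaviour described here, a sketch against the public Minion API (assumes an existing $minion instance and a known $job_id; not code from this PR):

```perl
# A finished Minion job can be put back in the queue under the same id.
my $job = $minion->job($job_id);    # look up an existing job by id
$job->retry if $job->info->{state} eq 'finished';
# The job's state is now 'inactive' again, so a GruTasks row deleted at
# 'finished' would already be gone even though the job will run once more.
```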

@@ -17,6 +17,7 @@ package OpenQA::WebAPI::GruJob;
use Mojo::Base 'Minion::Job';

use OpenQA::Utils 'log_error';
use OpenQA::Utils 'log_debug';
Member:

use OpenQA::Utils qw(log_debug log_error);

my $state = $self->info->{state};
my $jobid = $self->id;
my $gruid;
$gruid = $self->info->{notes}{gru_id} if exists $self->info->{notes}{gru_id};
Member:

You're triple requesting $self->info here, that's two extra Postgres roundtrips for the whole info data structure.

AdamWill (author):

oh, fair point. will fix.
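The fix for the repeated lookups is simply to fetch the info hash once and reuse it (sketch only):

```perl
# One ->info call, one Postgres round trip; reuse the hash afterwards.
my $info  = $self->info;
my $state = $info->{state};
my $jobid = $self->id;
my $gruid = $info->{notes}{gru_id};    # simply undef if the note is absent
```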


kraih commented Mar 4, 2019

And i agree, this problem definitely needs a test case to decide which would be the correct solution.


kraih commented Mar 4, 2019

@AdamWill And regarding 2, ->is_finished is only for Minion job process management and unrelated to the actual state of the job. So definitely a no on using it here.


kraih commented Mar 4, 2019

Think i have an idea for how to test this, will open a separate pull request for it later today or tomorrow.


AdamWill commented Mar 4, 2019

> Unfortunately the answer to 1 is yes, Minion allows for a job to be retried again after it reached the state finished. This is not done in openQA to my knowledge, but it is possible for a user to retry a finished job from the Minion admin ui. Also, finished and failed to Minion are not the same. Minion never deletes a failed job automatically. The assumption is that a failed job has to be reviewed by a user or bot (who can then retry or remove it manually).
>
> Edit: Since Gru hijacks the execute method of Minion there are actually many cases where the job will still be active. That means the proposed solution does not work correctly.

thanks, I'm not entirely sure I'm following the comment, though. Question 1 was about whether we can reach this specific point in the subclassed execute() with the job state as 'active' - just because a job is retried, why would we get here with its state as 'active'? edit: oh, I see, I phrased the question a bit ambiguously. I was only thinking of "could the job state be 'active' when we hit this specific code block", but you read it more as "could the job state ever be 'active' again after this code block has run", right? In which case, see below.

If you're also saying that we shouldn't really delete jobs from GruTasks even after they're 'finished' because they might be re-tried, then indeed that seems a valid concern, but then this PR isn't making things worse there - we're already doing that right now. If that's the concern, then this PR actually makes things better, it just doesn't completely fix them, right?


AdamWill commented Mar 4, 2019

Concerning the "jobs can be retried under the same ID" case - well, what GruTasks really is these days (AIUI) is a table of pending and active jobs. From my poking around in the code, I get the impression that minion emits events, right? Could we perhaps redesign things so jobs get added to and removed from GruTasks based on the events emitted on state changes?
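The event-based idea could look something like this. This is a purely hypothetical sketch: delete_gru_task and mark_gru_task_failed are invented helpers, and it assumes Minion::Worker's dequeue event plus the finished/failed events on Minion::Job.

```perl
# Hypothetical: keep GruTasks in sync from Minion's own state transitions
# instead of deleting rows inside a subclassed execute().
$worker->on(dequeue => sub {
    my ($worker, $job) = @_;
    $job->on(finished => sub { delete_gru_task(shift) });      # invented helper
    $job->on(failed   => sub { mark_gru_task_failed(shift) }); # invented helper
});
```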

As it seems like it's going to be relevant: does anyone know of anything remaining in openQA that actually uses the GruTasks / GruDependencies tables besides this 'openQA tests can be made to wait on gru tasks' mechanism?


kraih commented Mar 4, 2019

@AdamWill Yes, i did not explain that very well, but the subject is rather complicated. 😉 Anyway, i do think i have a solution and am now thinking of more edge cases to test. master...kraih:gru_tasks_cleanup


AdamWill commented Mar 4, 2019

OK, thanks :) I think it might be wise to preserve the checks around whether keys in the {info} hash exist at all? Presumably they're there for a reason (likely you get errors or ugly log messages when trying to use $self->info->{gru_id} if it doesn't exist at all).


kraih commented Mar 4, 2019

@AdamWill Perl autovivifies that, no need to worry. Here you would only use exists if a valid Gru id could be a false Perl value, which i don't believe to be the case.

@kraih kraih mentioned this pull request Mar 4, 2019

kraih commented Mar 5, 2019

This should be resolved with #2011.

@kraih kraih closed this Mar 5, 2019