Automatically rerun incompleted jobs because of no space left #3672

Amrysliu · 2021-01-14T12:33:55Z

Relate: https://progress.opensuse.org/issues/80334

lib/OpenQA/Schema/Result/Jobs.pm

okurz

In general I would favor a more general solution. Grepping the log files for certain terms is something that we already support and use with "job done" hooks. If we want to include that into openQA code directly then we should do it in a more generic style that works with multiple, different "known issues" and would need to configure the search terms at best outside the actual source code.

@Martchus introduced the possibility to better automatically retrigger jobs for known causes based on the reason in #3624 . Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing? If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

I hope you understand what I mean. If I can not make myself clear I am happy to discuss the topic in a video call the next time we come together.

okurz · 2021-01-14T13:44:15Z

t/api/04-jobs.t

@@ -1307,6 +1307,27 @@ subtest 'show parent group name and id when getting job details' => sub {
    is $job->{parent_group_id}, $parent_group_id, 'parent group id was shown correctly';
 };

+subtest 're-trigger jobs when it incomplete becase no space left' => sub {


s/becase/because/

Martchus · 2021-01-14T15:43:32Z

If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

This is a very good point so I'd also say you should implement this on the worker-side.

Amrysliu · 2021-01-15T07:53:01Z

In general I would favor a more general solution. Grepping the log files for certain terms is something that we already support and use with "job done" hooks. If we want to include that into openQA code directly then we should do it in a more generic style that works with multiple, different "known issues" and would need to configure the search terms at best outside the actual source code.

@Martchus introduced the possibility to better automatically retrigger jobs for known causes based on the reason in #3624 . Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing? If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

I hope you understand what I mean. If I can not make myself clear I am happy to discuss the topic in a video call the next time we come together.

Thanks for your comments. do you mean that: we should put the no space left into the reason, and configure auto_clone_regex in openqa.ini to re-trigger the job?

Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing?

Could we parse the logfiles on worker side?

Martchus · 2021-01-19T09:34:50Z

Thanks for your comments. do you mean that: we should put the no space left into the reason, and configure auto_clone_regex in openqa.ini to re-trigger the job?

yes

Could we parse the logfiles on worker side?

I think that's the plan. Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

lib/OpenQA/Worker/Job.pm

t/24-worker-jobs.t

okurz · 2021-01-19T12:43:40Z

Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile. We should have everything in the reason field. If the reason field was not updated we should still have a generic "Unexpected premature termination encountered, no reason could be found, retriggering automatically"-reason and automatic retrigger, no log parsing.

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that.

Amrysliu · 2021-01-20T09:18:26Z

Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile. We should have everything in the reason field. If the reason field was not updated we should still have a generic "Unexpected premature termination encountered, no reason could be found, retriggering automatically"-reason and automatic retrigger, no log parsing.

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that.

So do you mean we only need to add a prefix for the reason, such as Unexpected died: terminated prematurely with corrupted state file, without parsing log file?
@Martchus WDYT?

Amrysliu · 2021-01-20T09:24:39Z

@Martchus Could you give me some details, what I need to do to let the CI check pass? I add the File::Map in dependencies.yaml, is there somewhere I need to modify it? Thanks

Martchus · 2021-01-20T09:24:54Z

I still don't understand @okurz comment.

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile.

Where is the information to come from then?

…, retriggering automatically

I would not put "retriggering automatically" into the reason (at least not on the worker-side when the retriggering is actually done by the web UI).

okurz · 2021-01-20T10:10:15Z

I still don't understand @okurz comment.

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile.

Where is the information to come from then?

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that. So if we fail to write a json file which includes a useful reason then the worker should just describe that clearly in the reason field and abort. IMHO the worker should not try to be too intelligent, parsing the logfile, which might fail for the same reason "out of space" because maybe some temporary file or something could not be created.

…, retriggering automatically

I would not put "retriggering automatically" into the reason (at least not on the worker-side when the retriggering is actually done by the web UI).

you can still come up with some text. I just wanted to provide an example.

Martchus · 2021-01-20T12:03:56Z

Of course parsing the log file is hacky. I personally also don't think it is worth it but somehow it seemed very wanted in previous discussions to restart in this particular case. Of course we can alternatively just say that we want to restart all incompletes which can not write the state file. However, that's a different set of incompletes.

Amrysliu · 2021-01-21T07:07:23Z

You know what: By now I think for the time being we should rely on auto-review and not do anything more from the side of the worker

so, as a result, I could close this pr? we do not need to change any code because auto-review already does well.

okurz · 2021-01-21T10:02:42Z

lib/OpenQA/Worker/Job.pm

        }
        else {
-            $reason = "$reason: terminated prematurely with corrupted state file, see log output for details";
+            $reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details";


As discussed we can change the reason to be more specific. I suggest:

Suggested change

$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details";

$reason = "terminated prematurely: Encountered corrupted state file$msg, see log output for details";

okurz

with the new prefix suggestion "terminated prematurely" please add that in etc/openqa/openqa.ini in "auto_clone_regex". Please keep the log parsing. It's ok. Sorry about my previous confusing comment stating otherwise.

okurz · 2021-01-21T10:11:57Z

please see http://open.qa/docs/#_perl_and_other_packages about adding dependencies. And please let us know or correct if it's wrong.

okurz

please also extend auto_clone_regex => '^cache failure: ' in lib/OpenQA/Setup.pm

Amrysliu · 2021-01-22T06:53:43Z

please see http://open.qa/docs/#_perl_and_other_packages about adding dependencies. And please let us know or correct if it's wrong.

The document doesn't say how to let the ci check pass, except that, the rest are fine. I modified the ci-packages.txt, not sure if it's the right way. Seems this file is written by other processes, not handly, right?

codecov · 2021-01-22T07:48:22Z

Codecov Report

Merging #3672 (670e11d) into master (ef6edb0) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3672      +/-   ##
==========================================
- Coverage   96.34%   96.31%   -0.03%     
==========================================
  Files         367      367              
  Lines       31934    31953      +19     
==========================================
+ Hits        30768    30777       +9     
- Misses       1166     1176      +10

Impacted Files	Coverage Δ
lib/OpenQA/Setup.pm	`97.67% <ø> (ø)`
lib/OpenQA/Worker/Job.pm	`75.73% <100.00%> (+0.69%)`	⬆️
t/24-worker-jobs.t	`100.00% <100.00%> (ø)`
t/lib/OpenQA/Test/Utils.pm	`77.74% <0.00%> (-3.76%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef6edb0...670e11d. Read the comment docs.

Related: https://progress.opensuse.org/issues/80334

Martchus · 2021-01-22T10:12:08Z

Seems this file is written by other processes, not handly, right?

Yes. But it is nevertheless correct to simply add the dependency to the file in the way you did to make a new dependency immediately available within the CI.

Amrysliu requested a review from Martchus January 14, 2021 12:34

Martchus requested changes Jan 14, 2021

View reviewed changes

lib/OpenQA/Schema/Result/Jobs.pm Outdated Show resolved Hide resolved

okurz requested changes Jan 14, 2021

View reviewed changes

Amrysliu force-pushed the fix_80334 branch from d555490 to c6d45d2 Compare January 15, 2021 06:12

Amrysliu force-pushed the fix_80334 branch from c6d45d2 to 9c57721 Compare January 19, 2021 09:15

Amrysliu requested a review from Martchus January 19, 2021 09:17

Martchus requested changes Jan 19, 2021

View reviewed changes

lib/OpenQA/Worker/Job.pm Outdated Show resolved Hide resolved

t/24-worker-jobs.t Outdated Show resolved Hide resolved

Amrysliu force-pushed the fix_80334 branch from 9c57721 to e128199 Compare January 20, 2021 09:14

This comment has been minimized.

Sign in to view

Amrysliu force-pushed the fix_80334 branch from e128199 to 0591237 Compare January 21, 2021 04:06

okurz reviewed Jan 21, 2021

View reviewed changes

okurz requested changes Jan 21, 2021

View reviewed changes

Amrysliu force-pushed the fix_80334 branch from 0591237 to e614438 Compare January 22, 2021 06:40

Amrysliu force-pushed the fix_80334 branch from e614438 to c4f4904 Compare January 22, 2021 07:05

Automatically rerun incompleted jobs because of no space left

670e11d

Related: https://progress.opensuse.org/issues/80334

Amrysliu force-pushed the fix_80334 branch from c4f4904 to 670e11d Compare January 22, 2021 08:27

Martchus approved these changes Jan 22, 2021

View reviewed changes

okurz approved these changes Jan 22, 2021

View reviewed changes

okurz merged commit 5b4b748 into os-autoinst:master Jan 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically rerun incompleted jobs because of no space left #3672

Automatically rerun incompleted jobs because of no space left #3672

Amrysliu commented Jan 14, 2021

okurz left a comment •

edited

okurz Jan 14, 2021

Martchus commented Jan 14, 2021

Amrysliu commented Jan 15, 2021

Martchus commented Jan 19, 2021

okurz commented Jan 19, 2021 •

edited

Amrysliu commented Jan 20, 2021

Amrysliu commented Jan 20, 2021

Martchus commented Jan 20, 2021

okurz commented Jan 20, 2021

This comment has been minimized.

Martchus commented Jan 20, 2021 •

edited

Amrysliu commented Jan 21, 2021

okurz Jan 21, 2021

okurz left a comment

okurz commented Jan 21, 2021

okurz left a comment

Amrysliu commented Jan 22, 2021 •

edited

codecov bot commented Jan 22, 2021 •

edited

Martchus commented Jan 22, 2021

	$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details";
	$reason = "terminated prematurely: Encountered corrupted state file$msg, see log output for details";

Automatically rerun incompleted jobs because of no space left #3672

Automatically rerun incompleted jobs because of no space left #3672

Conversation

Amrysliu commented Jan 14, 2021

okurz left a comment • edited

Choose a reason for hiding this comment

okurz Jan 14, 2021

Choose a reason for hiding this comment

Martchus commented Jan 14, 2021

Amrysliu commented Jan 15, 2021

Martchus commented Jan 19, 2021

okurz commented Jan 19, 2021 • edited

Amrysliu commented Jan 20, 2021

Amrysliu commented Jan 20, 2021

Martchus commented Jan 20, 2021

okurz commented Jan 20, 2021

This comment has been minimized.

Martchus commented Jan 20, 2021 • edited

Amrysliu commented Jan 21, 2021

okurz Jan 21, 2021

Choose a reason for hiding this comment

okurz left a comment

Choose a reason for hiding this comment

okurz commented Jan 21, 2021

okurz left a comment

Choose a reason for hiding this comment

Amrysliu commented Jan 22, 2021 • edited

codecov bot commented Jan 22, 2021 • edited

Codecov Report

Martchus commented Jan 22, 2021

okurz left a comment •

edited

okurz commented Jan 19, 2021 •

edited

Martchus commented Jan 20, 2021 •

edited

Amrysliu commented Jan 22, 2021 •

edited

codecov bot commented Jan 22, 2021 •

edited