New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically rerun incompleted jobs because of no space left #3672
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I would favor a more general solution. Grepping the log files for certain terms is something that we already support and use with "job done" hooks. If we want to include that into openQA code directly then we should do it in a more generic style that works with multiple, different "known issues" and would need to configure the search terms at best outside the actual source code.
@Martchus introduced the possibility to better automatically retrigger jobs for known causes based on the reason in #3624 . Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing? If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.
I hope you understand what I mean. If I can not make myself clear I am happy to discuss the topic in a video call the next time we come together.
t/api/04-jobs.t
Outdated
@@ -1307,6 +1307,27 @@ subtest 'show parent group name and id when getting job details' => sub { | |||
is $job->{parent_group_id}, $parent_group_id, 'parent group id was shown correctly'; | |||
}; | |||
|
|||
subtest 're-trigger jobs when it incomplete becase no space left' => sub { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/becase/because/
This is a very good point so I'd also say you should implement this on the worker-side. |
Thanks for your comments. do you mean that: we should put the
Could we parse the logfiles on worker side? |
yes
I think that's the plan. Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it). |
Yes, sorry about that. What I mean is: I think we should not need to parse the logfile. We should have everything in the reason field. If the reason field was not updated we should still have a generic "Unexpected premature termination encountered, no reason could be found, retriggering automatically"-reason and automatic retrigger, no log parsing. Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that. |
So do you mean we only need to add a prefix for the reason, such as |
@Martchus Could you give me some details, what I need to do to let the CI check pass? I add the |
I still don't understand @okurz comment.
Where is the information to come from then?
I would not put "retriggering automatically" into the reason (at least not on the worker-side when the retriggering is actually done by the web UI). |
Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that. So if we fail to write a json file which includes a useful reason then the worker should just describe that clearly in the reason field and abort. IMHO the worker should not try to be too intelligent, parsing the logfile, which might fail for the same reason "out of space" because maybe some temporary file or something could not be created.
you can still come up with some text. I just wanted to provide an example. |
This comment has been minimized.
This comment has been minimized.
Of course parsing the log file is hacky. I personally also don't think it is worth it but somehow it seemed very wanted in previous discussions to restart in this particular case. Of course we can alternatively just say that we want to restart all incompletes which can not write the state file. However, that's a different set of incompletes. |
so, as a result, I could close this pr? we do not need to change any code because auto-review already does well. |
lib/OpenQA/Worker/Job.pm
Outdated
} | ||
else { | ||
$reason = "$reason: terminated prematurely with corrupted state file, see log output for details"; | ||
$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed we can change the reason to be more specific. I suggest:
$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details"; | |
$reason = "terminated prematurely: Encountered corrupted state file$msg, see log output for details"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the new prefix suggestion "terminated prematurely" please add that in etc/openqa/openqa.ini in "auto_clone_regex". Please keep the log parsing. It's ok. Sorry about my previous confusing comment stating otherwise.
please see http://open.qa/docs/#_perl_and_other_packages about adding dependencies. And please let us know or correct if it's wrong. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also extend auto_clone_regex => '^cache failure: '
in lib/OpenQA/Setup.pm
The document doesn't say how to let the ci check pass, except that, the rest are fine. I modified the ci-packages.txt, not sure if it's the right way. Seems this file is written by other processes, not handly, right? |
Codecov Report
@@ Coverage Diff @@
## master #3672 +/- ##
==========================================
- Coverage 96.34% 96.31% -0.03%
==========================================
Files 367 367
Lines 31934 31953 +19
==========================================
+ Hits 30768 30777 +9
- Misses 1166 1176 +10
Continue to review full report at Codecov.
|
Yes. But it is nevertheless correct to simply add the dependency to the file in the way you did to make a new dependency immediately available within the CI. |
Relate: https://progress.opensuse.org/issues/80334