Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically rerun incompleted jobs because of no space left #3672

Merged
merged 1 commit into from Jan 22, 2021

Conversation

Amrysliu
Copy link
Contributor

lib/OpenQA/Schema/Result/Jobs.pm Outdated Show resolved Hide resolved
Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I would favor a more general solution. Grepping the log files for certain terms is something that we already support and use with "job done" hooks. If we want to include that into openQA code directly then we should do it in a more generic style that works with multiple, different "known issues" and would need to configure the search terms at best outside the actual source code.

@Martchus introduced the possibility to better automatically retrigger jobs for known causes based on the reason in #3624 . Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing? If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

I hope you understand what I mean. If I can not make myself clear I am happy to discuss the topic in a video call the next time we come together.

t/api/04-jobs.t Outdated
@@ -1307,6 +1307,27 @@ subtest 'show parent group name and id when getting job details' => sub {
is $job->{parent_group_id}, $parent_group_id, 'parent group id was shown correctly';
};

subtest 're-trigger jobs when it incomplete becase no space left' => sub {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/becase/because/

@Martchus
Copy link
Contributor

If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

This is a very good point so I'd also say you should implement this on the worker-side.

@Amrysliu
Copy link
Contributor Author

In general I would favor a more general solution. Grepping the log files for certain terms is something that we already support and use with "job done" hooks. If we want to include that into openQA code directly then we should do it in a more generic style that works with multiple, different "known issues" and would need to configure the search terms at best outside the actual source code.

@Martchus introduced the possibility to better automatically retrigger jobs for known causes based on the reason in #3624 . Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing? If the reason "No space" is not visible in the reason field then this simply means that we need to ensure this is included in the reason. This work is done by the worker so scales always better than anything the webUI needs to do.

I hope you understand what I mean. If I can not make myself clear I am happy to discuss the topic in a video call the next time we come together.

Thanks for your comments. do you mean that: we should put the no space left into the reason, and configure auto_clone_regex in openqa.ini to re-trigger the job?

Can we find some middle ground of errors that we detect as "nothing that openQA can solve" but worth to retrigger without needing to parse the logfiles but just relying on the logfile parsing?

Could we parse the logfiles on worker side?

@Martchus
Copy link
Contributor

Thanks for your comments. do you mean that: we should put the no space left into the reason, and configure auto_clone_regex in openqa.ini to re-trigger the job?

yes

Could we parse the logfiles on worker side?

I think that's the plan. Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

lib/OpenQA/Worker/Job.pm Outdated Show resolved Hide resolved
t/24-worker-jobs.t Outdated Show resolved Hide resolved
@okurz
Copy link
Member

okurz commented Jan 19, 2021

Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile. We should have everything in the reason field. If the reason field was not updated we should still have a generic "Unexpected premature termination encountered, no reason could be found, retriggering automatically"-reason and automatic retrigger, no log parsing.

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that.

@Amrysliu
Copy link
Contributor Author

Of course the sentence "without needing to parse the logfiles but just relying on the logfile parsing" is self-contradicting and makes no sense (so just ignore it).

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile. We should have everything in the reason field. If the reason field was not updated we should still have a generic "Unexpected premature termination encountered, no reason could be found, retriggering automatically"-reason and automatic retrigger, no log parsing.

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that.

So do you mean we only need to add a prefix for the reason, such as Unexpected died: terminated prematurely with corrupted state file, without parsing log file?
@Martchus WDYT?

@Amrysliu
Copy link
Contributor Author

@Martchus Could you give me some details, what I need to do to let the CI check pass? I add the File::Map in dependencies.yaml, is there somewhere I need to modify it? Thanks

@Martchus
Copy link
Contributor

I still don't understand @okurz comment.

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile.

Where is the information to come from then?

…, retriggering automatically

I would not put "retriggering automatically" into the reason (at least not on the worker-side when the retriggering is actually done by the web UI).

@okurz
Copy link
Member

okurz commented Jan 20, 2021

I still don't understand @okurz comment.

Yes, sorry about that. What I mean is: I think we should not need to parse the logfile.

Where is the information to come from then?

Again, if a host runs out of memory or storage that is not for openQA to handle, the admin+monitoring need to take care about that. So if we fail to write a json file which includes a useful reason then the worker should just describe that clearly in the reason field and abort. IMHO the worker should not try to be too intelligent, parsing the logfile, which might fail for the same reason "out of space" because maybe some temporary file or something could not be created.

…, retriggering automatically

I would not put "retriggering automatically" into the reason (at least not on the worker-side when the retriggering is actually done by the web UI).

you can still come up with some text. I just wanted to provide an example.

@okurz

This comment has been minimized.

@Martchus
Copy link
Contributor

Martchus commented Jan 20, 2021

Of course parsing the log file is hacky. I personally also don't think it is worth it but somehow it seemed very wanted in previous discussions to restart in this particular case. Of course we can alternatively just say that we want to restart all incompletes which can not write the state file. However, that's a different set of incompletes.

@Amrysliu
Copy link
Contributor Author

You know what: By now I think for the time being we should rely on auto-review and not do anything more from the side of the worker

so, as a result, I could close this pr? we do not need to change any code because auto-review already does well.

}
else {
$reason = "$reason: terminated prematurely with corrupted state file, see log output for details";
$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed we can change the reason to be more specific. I suggest:

Suggested change
$reason = "$reason: terminated prematurely with corrupted state file$msg, see log output for details";
$reason = "terminated prematurely: Encountered corrupted state file$msg, see log output for details";

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the new prefix suggestion "terminated prematurely" please add that in etc/openqa/openqa.ini in "auto_clone_regex". Please keep the log parsing. It's ok. Sorry about my previous confusing comment stating otherwise.

@okurz
Copy link
Member

okurz commented Jan 21, 2021

please see http://open.qa/docs/#_perl_and_other_packages about adding dependencies. And please let us know or correct if it's wrong.

Copy link
Member

@okurz okurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also extend auto_clone_regex => '^cache failure: ' in lib/OpenQA/Setup.pm

@Amrysliu
Copy link
Contributor Author

Amrysliu commented Jan 22, 2021

please see http://open.qa/docs/#_perl_and_other_packages about adding dependencies. And please let us know or correct if it's wrong.

The document doesn't say how to let the ci check pass, except that, the rest are fine. I modified the ci-packages.txt, not sure if it's the right way. Seems this file is written by other processes, not handly, right?

@codecov
Copy link

codecov bot commented Jan 22, 2021

Codecov Report

Merging #3672 (670e11d) into master (ef6edb0) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3672      +/-   ##
==========================================
- Coverage   96.34%   96.31%   -0.03%     
==========================================
  Files         367      367              
  Lines       31934    31953      +19     
==========================================
+ Hits        30768    30777       +9     
- Misses       1166     1176      +10     
Impacted Files Coverage Δ
lib/OpenQA/Setup.pm 97.67% <ø> (ø)
lib/OpenQA/Worker/Job.pm 75.73% <100.00%> (+0.69%) ⬆️
t/24-worker-jobs.t 100.00% <100.00%> (ø)
t/lib/OpenQA/Test/Utils.pm 77.74% <0.00%> (-3.76%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef6edb0...670e11d. Read the comment docs.

@Martchus
Copy link
Contributor

Seems this file is written by other processes, not handly, right?

Yes. But it is nevertheless correct to simply add the dependency to the file in the way you did to make a new dependency immediately available within the CI.

@okurz okurz merged commit 5b4b748 into os-autoinst:master Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants