Better detection for stuck job when error is printed in tight loop #7754
Comments
On Thursday, 13 June 2019, 17:17:57 CEST StefanBruens wrote:
# Issue Description
It is a common issue, e.g. for test cases, where a message is printed in a
tight loop, but no actual progress is made.
e.g.:
https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-Redis/openSUSE_Tumbleweed/x86_64
```
[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
```
As any output is considered "progress", the job never gets killed.
It is not never, but it does take quite some time indeed.
However, we cannot simply reduce the timeouts, since this would affect older
codebases.
We have therefore added prjconf settings to influence it. However, only
maxidle time and logsizelimit so far.
More ideas on what to detect are welcome, but repeating lines are unfortunately
not enough. We have way too many of them during good builds...
--
Adrian Schroeter
SUSE Linux Products GmbH, Maxfeldstr. 5, 90409 Nuernberg, Germany
email: adrian@suse.de
"Quite some time" is a joke though: for another target it was more than 13 days. I don't think older codebases will suffer if you limit a build job to a day :)
Just to clarify, I do not propose to treat repeated lines as an immediate failure, but as "no progress made". In the past, I have seen various instances of stuck build jobs where the last message was repeated again and again. IIRC, these were all instances where the test suite executed some malformed code or ran out of memory/disk space, and failed to detect this problem. In a perfect world, the test suites would catch this and bail out; in reality, that does not happen. Can you pinpoint a project where the same message is repeated again and again, for longer than the "no progress" timeout, and the build still ends in a successful state?
Yes, earlier today there were 2 instances of perl-Mojo-Redis at 13 days, one at 12 days, and several at >= 5 hours. A typical successful build of perl-Mojo-Redis takes a few minutes. But perl-Mojo-Redis is just the example of the day; this is a recurring pattern. Just visit https://build.opensuse.org/monitor/old and scroll to the bottom: there are some, like llvm and libreoffice, which require about a day, but these are still at <= 150% of their anticipated build time. Everything else that takes so long is due to some broken state. How expensive is it to query the build time for, say, the last 5 successful builds? Would you expect many false positives if we limited the build time to (1 hour + 2 * max(last 5 build times)) for builds triggered by dependency changes, and to (24 hours + 2 * max(last 5 build times)) for manually triggered and source-change-triggered builds? Heavy jobs are already scheduled to fast workers by _constraints, and light jobs would see more variation, but this would be covered by the 1 hour (24 hours) offset. Dependency changes should not affect the build time considerably, while source changes may (e.g. enabling a test suite, significant code changes).
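To make the proposed limit concrete, here is a minimal sketch of the calculation in Python. It is not OBS code: `build_time_limit`, the `OFFSETS` table, and the trigger names are hypothetical, and it assumes the scheduler can cheaply provide the durations of recent successful builds, which is exactly the open question above.

```python
from datetime import timedelta

# Hypothetical offsets taken from the proposal above; not actual OBS settings.
OFFSETS = {
    "dependency": timedelta(hours=1),   # rebuild caused by a dependency change
    "source": timedelta(hours=24),      # rebuild caused by a source change
    "manual": timedelta(hours=24),      # manually triggered rebuild
}


def build_time_limit(successful_build_times, trigger, history=5):
    """Return offset + 2 * max(last `history` successful build times).

    `successful_build_times` is a list of timedelta values, oldest first.
    Returns None when there is no history, so the caller can fall back
    to the existing timeouts.
    """
    recent = successful_build_times[-history:]
    if not recent:
        return None
    return OFFSETS[trigger] + 2 * max(recent)
```

For a package whose last builds took 3 to 5 minutes, this gives 1 hour 10 minutes for a dependency-triggered rebuild and 24 hours 10 minutes for a source-triggered one.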
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:B/perl-ExtUtils-Helpers/standard/x86_64
Issue Description
It is a common issue, e.g. for test cases, where a message is printed in a tight loop, but no actual progress is made.
e.g.:
https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-Redis/openSUSE_Tumbleweed/x86_64
As any output is considered "progress", the job never gets killed.
Expected Result
If the same message is printed again and again (i.e. only the timestamp changes), the build should be aborted.
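One possible way to detect this, sketched in Python under stated assumptions: every log line starts with a `[NNNNs]` timestamp as in the excerpt above, and a repeated message counts as "no progress" rather than as an immediate failure. `ProgressTracker` and `max_idle_seconds` are hypothetical names, not part of OBS, and the sketch only catches a single repeated line, not a repeating cycle of several lines.

```python
import re

# Matches the leading "[36410s] " style timestamp seen in the build log.
TIMESTAMP = re.compile(r"^\[\s*(\d+)s\]\s?")


def split_line(line):
    """Split a log line into (seconds, message); seconds is None if absent."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None, line
    return int(match.group(1)), line[match.end():]


class ProgressTracker:
    """Treats output as progress only when the message text changes."""

    def __init__(self, max_idle_seconds):
        self.max_idle = max_idle_seconds
        self.last_message = None
        self.last_progress = 0

    def feed(self, line):
        """Feed one log line; return True if the job looks stuck."""
        stamp, message = split_line(line.rstrip("\n"))
        if stamp is None:
            return False  # no timestamp, cannot judge this line
        if message != self.last_message:
            # New text: record it and restart the idle clock.
            self.last_message = message
            self.last_progress = stamp
        return stamp - self.last_progress > self.max_idle
```

Because identical lines never restart the idle clock, a job that prints the same error for longer than `max_idle_seconds` would be flagged, while a build that repeats a line briefly and then moves on would not.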