
Better detection for stuck job when error is printed in tight loop #7754

Open
StefanBruens opened this issue Jun 13, 2019 · 6 comments
Labels: Backend (Things regarding the OBS backend), Feature

@StefanBruens

Issue Description

It is a common problem, e.g. for test cases, that a message is printed in a tight loop while no actual progress is made.

For example:
https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-Redis/openSUSE_Tumbleweed/x86_64

[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.

As any output is considered "progress", the job never gets killed.

Expected Result

If the same message is printed again and again (i.e. only the timestamp changes), the build should be aborted.
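
A minimal sketch of such a check, assuming only that build-log lines carry a leading "[NNNNs]" timestamp as above (hypothetical illustration, not existing OBS backend code): strip the timestamp and compare the remainder against the previous line, so only lines whose content actually changes count as progress.

```python
# Hypothetical sketch: count trailing log lines that repeat the previous
# line once the leading "[36410s] "-style timestamp is stripped.
import re
import sys

TIMESTAMP = re.compile(r'^\[\s*\d+s\]\s*')

def strip_timestamp(line):
    """Drop the leading "[NNNNs] " build-log timestamp, if present."""
    return TIMESTAMP.sub('', line)

def last_progress_index(lines):
    """Index of the last line that differed from its predecessor after
    ignoring timestamps; everything after it counts as "no progress"."""
    last, prev = 0, None
    for i, line in enumerate(lines):
        content = strip_timestamp(line.rstrip('\n'))
        if content != prev:
            last, prev = i, content
    return last

if __name__ == '__main__':
    log = sys.stdin.readlines()
    if log:
        idle = len(log) - 1 - last_progress_index(log)
        print(f"{idle} trailing lines without new content")
```

Such a counter could feed the existing "no progress" timeout rather than trigger an immediate abort.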

@adrianschroeter
Member

adrianschroeter commented Jun 13, 2019 via email

@hennevogel added the Feature and Backend (Things regarding the OBS backend) labels on Jun 13, 2019
@coolo
Member

coolo commented Jun 13, 2019

quite some time is a joke though:
2019-06-13 05:06:03 perl-Mojo-Redis meta change failed 12d 19h 13m 12s lamb55:1

for another target it was > 13d. I don't think older codebases will suffer if you limit a build job to a day :)

@StefanBruens
Author

> ... More ideas what to detect are welcome, but repeating lines are unfortunately not enough. We have way too many of them during good builds...

Just to clarify, I do not propose to treat repeated lines as an immediate fail, but as "no progress made".

In the past, I have seen various instances of stuck build jobs where the last message was repeated again and again. IIRC, these were all cases where the test suite executed some malformed code or ran out of memory/disk space and failed to detect the problem. In a perfect world, the test suites would catch this and bail out; in reality, that does not happen.

Can you pinpoint a project where the same message is repeated again and again for longer than the "no progress" timeout, and the build still ends in a successful state?

@StefanBruens
Author

> quite some time is a joke though:
> 2019-06-13 05:06:03 perl-Mojo-Redis meta change failed 12d 19h 13m 12s lamb55:1
>
> for another target it was > 13d. I don't think older codebases will suffer if you limit a build job to a day :)

Yes, earlier today there were 2 instances of perl-Mojo-Redis @ 13 days, one @ 12 days, and several at >= 5 hours. A typical successful build of perl-Mojo-Redis takes a few minutes. But perl-Mojo-Redis is just the example of the day; this is a recurring pattern. Just visit https://build.opensuse.org/monitor/old and scroll to the bottom - there are some packages like llvm and libreoffice which require about a day, but these are still at <= 150% of the anticipated build time. Everything else that takes this long is stuck in some broken state.

How expensive is it to query the build times of, say, the last 5 successful builds? Would you expect many false positives if we limited the build time to (1 hour + 2 * max(last 5 build times)) for builds triggered by dependency changes, and to (24 hours + 2 * max(last 5 build times)) for manually triggered and source-change-triggered builds?

Heavy jobs are already scheduled onto fast workers via _constraints, and light jobs would see more variation, but that would be covered by the 1 hour (resp. 24 hours) offset. Dependency changes should not affect the build time considerably, while source changes may (e.g. enabling a test suite, significant code changes).
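
A rough sketch of that limit calculation, assuming the durations of the last successful builds are available in seconds (function and parameter names are made up for illustration, not an OBS API):

```python
def build_time_limit(last_build_times, triggered_by_dependency):
    """Proposed ceiling in seconds: a fixed offset plus twice the longest
    of (up to) the last 5 successful build durations."""
    base = 3600 if triggered_by_dependency else 24 * 3600  # 1 h vs. 24 h offset
    return base + 2 * max(last_build_times, default=0)

# Example: a package that normally builds in ~5 minutes would get a ceiling
# of roughly 1 h 10 m when rebuilt because of a dependency change.
print(build_time_limit([280, 300, 290, 310, 295], triggered_by_dependency=True))
```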

