
Better detection for stuck job when error is printed in tight loop #7754

Open
StefanBruens opened this issue Jun 13, 2019 · 6 comments
Labels: Backend (Things regarding the OBS backend), Feature

@StefanBruens

Issue Description

It is a common problem, e.g. for test cases, that a message is printed in a tight loop while no actual progress is made.

For example:
https://build.opensuse.org/package/live_build_log/devel:languages:perl/perl-Mojo-Redis/openSUSE_Tumbleweed/x86_64

[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36410s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.
[36425s] Mojo::Reactor::Poll: I/O watcher failed: Can't use string ("ioloop") as a subroutine ref while "strict refs" in use at /usr/lib/perl5/vendor_perl/5.28.1/Mojo/Promise.pm line 71.

As any output is considered "progress", the job never gets killed.

Expected Result

If the same message is printed again and again (i.e. only the timestamp changes), the build should be aborted.
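
A minimal sketch of such a check, assuming only that build-log lines carry a leading "[NNNNs]" timestamp as above (hypothetical illustration, not existing OBS backend code): strip the timestamp and compare the remainder against the previous line, so only lines whose content actually changes count as progress.

```python
# Hypothetical sketch: count trailing log lines that repeat the previous
# line once the leading "[36410s] "-style timestamp is stripped.
import re
import sys

TIMESTAMP = re.compile(r'^\[\s*\d+s\]\s*')

def strip_timestamp(line):
    """Drop the leading "[NNNNs] " build-log timestamp, if present."""
    return TIMESTAMP.sub('', line)

def last_progress_index(lines):
    """Index of the last line that differed from its predecessor after
    ignoring timestamps; everything after it counts as "no progress"."""
    last, prev = 0, None
    for i, line in enumerate(lines):
        content = strip_timestamp(line.rstrip('\n'))
        if content != prev:
            last, prev = i, content
    return last

if __name__ == '__main__':
    log = sys.stdin.readlines()
    if log:
        idle = len(log) - 1 - last_progress_index(log)
        print(f"{idle} trailing lines without new content")
```

Such a counter could feed the existing "no progress" timeout rather than trigger an immediate abort.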

@adrianschroeter
Member

adrianschroeter commented Jun 13, 2019 via email

@hennevogel added the Feature and Backend (Things regarding the OBS backend) labels on Jun 13, 2019
@coolo
Member

coolo commented Jun 13, 2019

quite some time is a joke though:
2019-06-13 05:06:03 perl-Mojo-Redis meta change failed 12d 19h 13m 12s lamb55:1

for another target it was > 13d. I don't think older codebases will suffer if you limit a build job to a day :)

@StefanBruens
Author

> ... More ideas what to detect are welcome, but repeating lines are unfortunately not enough. We have way too many of them during good builds...

Just to clarify, I do not propose to treat repeated lines as an immediate fail, but as "no progress made".

In the past, I have seen various instances of stuck build jobs where the last message was repeated again and again. IIRC, these were all cases where the test suite executed some malformed code or ran out of memory/disk space and failed to detect the problem. In a perfect world, the test suites would catch this and bail out; in reality, that does not happen.

Can you pinpoint a project where the same message is repeated again and again for longer than the "no progress" timeout, and the build still ends in a successful state?

@StefanBruens
Author

> quite some time is a joke though:
> 2019-06-13 05:06:03 perl-Mojo-Redis meta change failed 12d 19h 13m 12s lamb55:1
>
> for another target it was > 13d. I don't think older codebases will suffer if you limit a build job to a day :)

Yes, earlier today there were 2 instances of perl-Mojo-Redis @ 13 days, one @ 12 days, and several at >= 5 hours. A typical successful build of perl-Mojo-Redis takes a few minutes. But perl-Mojo-Redis is just the example of the day; this is a recurring pattern. Just visit https://build.opensuse.org/monitor/old and scroll to the bottom - there are some packages like llvm and libreoffice which require about a day, but these are still at <= 150% of the anticipated build time. Everything else that takes this long is stuck in some broken state.

How expensive is it to query the build times of, say, the last 5 successful builds? Would you expect many false positives if we limited the build time to (1 hour + 2 * max(last 5 build times)) for builds triggered by dependency changes, and to (24 hours + 2 * max(last 5 build times)) for manually triggered and source-change-triggered builds?

Heavy jobs are already scheduled onto fast workers via _constraints, and light jobs would see more variation, but that would be covered by the 1 hour (resp. 24 hours) offset. Dependency changes should not affect the build time considerably, while source changes may (e.g. enabling a test suite, significant code changes).
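
A rough sketch of that limit calculation, assuming the durations of the last successful builds are available in seconds (function and parameter names are made up for illustration, not an OBS API):

```python
def build_time_limit(last_build_times, triggered_by_dependency):
    """Proposed ceiling in seconds: a fixed offset plus twice the longest
    of (up to) the last 5 successful build durations."""
    base = 3600 if triggered_by_dependency else 24 * 3600  # 1 h vs. 24 h offset
    return base + 2 * max(last_build_times, default=0)

# Example: a package that normally builds in ~5 minutes would get a ceiling
# of roughly 1 h 10 m when rebuilt because of a dependency change.
print(build_time_limit([280, 300, 290, 310, 295], triggered_by_dependency=True))
```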

