Fix build-waiting logic to use polling instead of watcher #5812
Conversation
[testonlyextended][extended:core] This should clean out slow-building flakes like the one mentioned in #5728. It might make more flakes visible. However, I've also made the logs a lot cleaner, making it obvious at a glance that a build timed out, so raising the timeout would fix those. Hopefully we don't have many 30-minute builds.
The watcher logic is prone to not enforcing the timeout. If no event happens, the build can go on for a very long time. Enforce the timeout by using polling instead. Also raise the timeout to 30 minutes for very slow builds and make the errors more specific.
Force-pushed from 993844f to f8162a3 (compare)
I've seen two flakes that I believe are caused by #5816. A couple more flakes for hot deploy; raising the timeout for URL fetching and making the error message more verbose.
Evaluated for origin testonlyextended up to f8162a3
continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7117/) (Extended Tests: core)
lgtm. but i'm surprised about the watcher issue, i thought our watchers timed out pretty reliably.
What in the old code was setting the watch timeout? |
I don't think we were setting any. And looking through the code, I don't see that you can set a timeout on a watch...? GitHub shows me that the Jenkins job is still running, but it has in fact failed. It failed on a flake (working on it).
There is preliminary API support for requesting a server-side timeout, but we don't have it in origin yet (and it's not exact... more of a guideline, really). If you want an exact timeout, you have to select between the watch and a timer of your own choosing, right?
and handle cases where the watch expires before your timeout (and you have to re-establish the watch) |
Yeah, I don't think it's a good idea for tests. Too complex without any real benefit. Just polling is simpler. |
@bparees merge? |
lgtm. |
[Test]ing while waiting on the merge queue |
Evaluated for origin test up to f8162a3 |
continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7118/) (Extended Tests: core) |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4006/) (Image: devenv-rhel7_2706) |
unrelated to extended test changes... [merge] |
Evaluated for origin merge up to f8162a3 |
Merged by openshift-bot
@mfojtik: PTAL. In the end I decided against the channel-timeout approach we
discussed because it seemed too complicated and cumbersome compared to this
approach. I know that polling is technically inferior to watching, but this
code is only used in tests, and I don't consider the (probably negligible)
performance gains worth the added complexity.
Fixes #5728