Fix build-waiting logic to use polling instead of watcher #5812

mnagy · 2015-11-09T22:14:10Z

The watcher logic is prone to not enforcing the timeout. If no event
happens, the build can go on for a very long time.

Enforce the timeout by using polling instead. Also raise the timeout to
30 minutes for very slow builds and make the errors more specific.

@mfojtik: PTAL. In the end I decided against the channel-timeout approach we
discussed because it seemed too complicated and cumbersome compared to this
approach. I know that polling is technically inferior to watching, but this
code is only used in tests after all and I don't see the (probably negligible)
performance gains to be worth the added complexity.

Fixes #5728

mnagy · 2015-11-09T22:20:33Z

[testonlyextended][extended:core]

This should clean out slow-building flakes, like the one in #5728. It might make more flakes visible. However, I've also made the logs a lot cleaner, so it is obvious at a glance when a build timed out, and in that case raising the timeout would fix it. Hopefully we don't have many 30-minute builds.

mnagy · 2015-11-10T10:20:42Z

I've seen two flakes that I believe are caused by #5816. There were also a couple more flakes in hot deploy, so I'm raising the timeout for URL fetching and making the error message more verbose.

openshift-bot · 2015-11-10T10:24:10Z

Evaluated for origin testonlyextended up to f8162a3

openshift-bot · 2015-11-10T12:48:10Z

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7117/) (Extended Tests: core)

bparees · 2015-11-10T14:22:14Z

lgtm.

But I'm surprised about the watcher issue; I thought our watchers timed out pretty reliably.

liggitt · 2015-11-10T14:34:44Z

What in the old code was setting the watch timeout?

mnagy · 2015-11-10T15:20:32Z

I don't think we were setting any. And looking through the code, I don't see a way to set a timeout on a watch?

GitHub shows me that the Jenkins job is still running, but it has in fact failed. It failed on a flake (I'm working on it).

liggitt · 2015-11-10T16:15:59Z

> I don't think we were setting any. And looking through the code, I don't see that you can set a timeout on a watch..?

There is preliminary API support for requesting a server timeout, but we don't have it in origin yet (and it's not exact... more of a guideline, really)... if you want an exact timeout, you have to select between a watch and a timeout of your own choosing, right?

liggitt · 2015-11-10T16:16:33Z

and handle cases where the watch expires before your timeout (and you have to re-establish the watch)

mnagy · 2015-11-10T16:21:22Z

Yeah, I don't think it's a good idea for tests. Too complex without any real benefit. Just polling is simpler.

mnagy · 2015-11-13T15:31:28Z

@bparees merge?

bparees · 2015-11-13T15:47:12Z

lgtm.
[merge]

openshift-bot · 2015-11-13T15:50:40Z

[Test]ing while waiting on the merge queue

openshift-bot · 2015-11-13T15:55:18Z

Evaluated for origin test up to f8162a3

openshift-bot · 2015-11-13T20:10:30Z

continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7118/) (Extended Tests: core)

openshift-bot · 2015-11-13T23:20:26Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4006/) (Image: devenv-rhel7_2706)

bparees · 2015-11-13T23:23:27Z

Error from server: Timeout: timed out waiting for deployment test/failing-dc-1 to start after 10s

unrelated to extended test changes...

[merge]

openshift-bot · 2015-11-13T23:35:27Z

Evaluated for origin merge up to f8162a3

Merged by openshift-bot

mnagy force-pushed the better_build_timeout branch from 993844f to f8162a3 November 10, 2015 10:18

bparees added this to the 1.1.1 milestone Nov 10, 2015

openshift-bot pushed a commit that referenced this pull request Nov 14, 2015

Merge pull request #5812 from mnagy/better_build_timeout

9108c97

openshift-bot merged commit 9108c97 into openshift:master Nov 14, 2015