Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix build-waiting logic to use polling instead of watcher #5812

Merged
merged 1 commit into from Nov 14, 2015

Conversation

mnagy
Copy link
Contributor

@mnagy mnagy commented Nov 9, 2015

The watcher logic is prone to not enforcing the timeout. If no event
happens, the build can go on for a very long time.

Enforce the timeout by using polling instead. Also raise the timeout to
30 minutes for very slow builds and make the errors more specific.

@mfojtik: PTAL. In the end I decided against the channel-timeout approach we
discussed because it seemed too complicated and cumbersome compared to this
approach. I know that polling is technically inferior to watching, but this
code is only used in tests after all and I don't see the (probably negligible)
performance gains to be worth the added complexity.

Fixes #5728

@mnagy
Copy link
Contributor Author

mnagy commented Nov 9, 2015

[testonlyextended][extended:core]

This should clean-out slow-building flakes, like the mentioned #5728. It might make more flakes visible. However,I've also made logs a lot cleaner, making it obvious from a glance that the build timed out, so raising the timeout would fix that. Hopefully, we don't have many 30 minute builds..

The watcher logic is prone to not enforcing the timeout. If no event
happens, the build can go on for a very long time.

Enforce the timeout by using polling instead. Also raise the timeout to
30 minutes for very slow builds and make the errors more specific.
@mnagy
Copy link
Contributor Author

mnagy commented Nov 10, 2015

I've seen two flakes that I believe are caused by #5816. Couple more flakes for hot deploy; raising timeout for url fetching and making the error message more verbose.

@openshift-bot
Copy link
Contributor

Evaluated for origin testonlyextended up to f8162a3

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7117/) (Extended Tests: core)

@bparees
Copy link
Contributor

bparees commented Nov 10, 2015

lgtm.

but i'm surprised about the watcher issue, i thought our watchers timed out pretty reliably.

@liggitt
Copy link
Contributor

liggitt commented Nov 10, 2015

What in the old code was setting the watch timeout?

@mnagy
Copy link
Contributor Author

mnagy commented Nov 10, 2015

I don't think we were setting any. And looking through the code, I don't see that you can set a timeout on a watch..?

Github shows me that the jenkins job is still running, but it has in fact failed. It failed on a flake (working on it).

@bparees bparees added this to the 1.1.1 milestone Nov 10, 2015
@liggitt
Copy link
Contributor

liggitt commented Nov 10, 2015

I don't think we were setting any. And looking through the code, I don't see that you can set a timeout on a watch..?

There is preliminary API support for requesting a server timeout, but we don't have it in origin yet (and it's not exact... more of a guideline, really)... if you want an exact timeout, you have to select between a watch and a timeout of your own choosing, right?

@liggitt
Copy link
Contributor

liggitt commented Nov 10, 2015

and handle cases where the watch expires before your timeout (and you have to re-establish the watch)

@mnagy
Copy link
Contributor Author

mnagy commented Nov 10, 2015

Yeah, I don't think it's a good idea for tests. Too complex without any real benefit. Just polling is simpler.

@mnagy
Copy link
Contributor Author

mnagy commented Nov 13, 2015

@bparees merge?

@bparees
Copy link
Contributor

bparees commented Nov 13, 2015

lgtm.
[merge]

@openshift-bot
Copy link
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to f8162a3

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7118/) (Extended Tests: core)

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4006/) (Image: devenv-rhel7_2706)

@bparees
Copy link
Contributor

bparees commented Nov 13, 2015

Error from server: Timeout: timed out waiting for deployment test/failing-dc-1 to start after 10s

unrelated to extended test changes...

[merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to f8162a3

openshift-bot pushed a commit that referenced this pull request Nov 14, 2015
@openshift-bot openshift-bot merged commit 9108c97 into openshift:master Nov 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants