
Using wait.Until doesn't work for long durations #31345

Closed
wojtek-t opened this issue Aug 24, 2016 · 22 comments · Fixed by #67350
Labels
area/kubectl area/reliability lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

Density test is failing without GC on large clusters with the following error:

• Failure [2058.061 seconds]
[k8s.io] Density
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:783
  [Feature:Performance] should allow starting 30 pods per node [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:673

  Expected error:
      <*errors.errorString | 0xc865089da0>: {
          s: "error while stopping RC: density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b: timed out waiting for \"density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b\" to be synced",
      }
      error while stopping RC: density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b: timed out waiting for "density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b" to be synced
  not to have occurred

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:332

However, this is currently "working as implemented".

The problem is that in this case, underneath we are using the "DeleteRCAndPods" method to delete the RC, and this in turn uses "ReplicationControllerReaper", which ends up calling this:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/scale.go#L199

However, when the watch is closed, "wait.Until" returns "ErrWaitTimeout":
https://github.com/kubernetes/kubernetes/blob/master/pkg/watch/until.go#L63

The test fails after exactly 5 minutes of waiting for the RC deletion, because of the Timeout set on the http.Client:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/util.go#L1749

I think that there are two issues here:

  1. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/scale.go#L199 is not going to work if we intend to wait for a long time, since the watch can be closed at any time. In those cases, we should actually be renewing the watch (a sketch of what that could look like follows this list).
    We may have similar problems in other places in the code.
  2. Why do we set client.Timeout in tests if we are not doing it anywhere in production? We should use a setup as similar to production as possible in our tests. @krousey @lavalamp
    [For this one I'm going to send a PR that stops setting timeouts on http.Client in tests.]
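
To make point 1 concrete, here is a minimal sketch of what renewing the watch could look like. It is purely illustrative, assumes a recent client-go (context-aware Get/Watch), and waitForRCSynced plus its condition are made-up names rather than the code linked above:

```go
// Illustrative only: re-establish the watch whenever the server closes it,
// instead of returning ErrWaitTimeout like a single-shot wait would.
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForRCSynced waits until done(rc) is true (e.g. status.Replicas == 0),
// renewing the watch every time it is closed, until ctx expires.
func waitForRCSynced(ctx context.Context, c kubernetes.Interface, ns, name string,
	done func(*corev1.ReplicationController) bool) error {

	for {
		// Get first, so we have a consistent snapshot and a resourceVersion to watch from.
		rc, err := c.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if done(rc) {
			return nil
		}

		// Watch from the observed resourceVersion; the apiserver may close this
		// stream after a few minutes, which is expected and not an error here.
		w, err := c.CoreV1().ReplicationControllers(ns).Watch(ctx, metav1.ListOptions{
			FieldSelector:   "metadata.name=" + name,
			ResourceVersion: rc.ResourceVersion,
		})
		if err != nil {
			return err
		}
		for ev := range w.ResultChan() {
			if obj, ok := ev.Object.(*corev1.ReplicationController); ok && done(obj) {
				w.Stop()
				return nil
			}
		}
		w.Stop()

		// The watch was closed; back off briefly and renew it, unless we ran out of time.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```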
@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/kubectl area/reliability sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 24, 2016
@wojtek-t
Member Author

@caesarxuchao

@wojtek-t wojtek-t changed the title test/e2e/density.go is failing without GC on large clusters Using wait.Until doesn't work for long durations Sep 6, 2016
@wojtek-t
Member Author

wojtek-t commented Sep 6, 2016

So to summarize, the main problem in my opinion is that:
wait.Until takes a "watch.Interface" as an argument, and watches are broken up every 5-10 minutes. So if you are supposed to wait longer than that, it will simply time out.

Adding @smarterclayton - who added wait.Until IIRC

@kubernetes/sig-scalability

@smarterclayton
Contributor

We should be renewing the watch. However that's not enough - you really need Get then Watch. So there has to be something that abstracts watcher creation (a watch wrapper).
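
For illustration only, the "something that abstracts watcher creation" could be as small as an interface along these lines; all names here are made up, though client-go's cache.ListerWatcher plays essentially this role for the reflector:

```go
// Hypothetical shape of a "watch wrapper" that owns watcher creation, so a
// caller can Get/List first and then re-open the Watch whenever the server
// closes it. None of these names are existing Kubernetes APIs.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
)

// WatcherFactory abstracts how watches are (re)created.
type WatcherFactory interface {
	// List returns the current state plus the resourceVersion to start watching from.
	List(options metav1.ListOptions) (runtime.Object, string, error)
	// Watch opens a fresh watch starting at the given resourceVersion.
	Watch(resourceVersion string) (watch.Interface, error)
}

// ConditionFunc reports whether an observed event satisfies what we are waiting for.
type ConditionFunc func(event watch.Event) (bool, error)

// A wait helper built on top of WatcherFactory would loop: List, check the
// condition, Watch from the returned resourceVersion, and when the server
// closes the watch, go back to List instead of reporting a timeout.
```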

@wojtek-t
Member Author

wojtek-t commented Sep 6, 2016

Yeah - I agree that this is not a trivial change. It should be somewhat similar to what we are doing in the reflector.

I guess it's not a 1.4 change though.

@smarterclayton
Contributor

Agreed that anyone who switched to watch.Until may have done so prematurely.

k8s-github-robot pushed a commit that referenced this issue Oct 7, 2016
Automatic merge from submit-queue

Don't set timeouts in clients in tests

We are not setting timeouts in production - we shouldn't do it in tests then...

Addresses point 2. of #31345
@yujuhong
Contributor

yujuhong commented Feb 7, 2017

This is causing quite a lot of test flakes. Can we replace watch.Until with something else (polling?) in the tests until it's fixed properly?
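
To make the polling idea concrete, a test-side replacement could look roughly like the sketch below; waitForRCDeleted, the 5-second interval, and the deletion condition are illustrative (and assume a recent client-go), not the e2e framework's actual code:

```go
// Hedged sketch of "replace watch.Until with polling" for tests: poll the API
// at an interval instead of holding a single long-lived watch.
package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRCDeleted polls until the ReplicationController is gone or the
// timeout expires. Unlike a single watch, this survives closed connections.
func waitForRCDeleted(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().ReplicationControllers(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // deleted; condition met
		}
		if err != nil {
			return false, err // unexpected error; stop polling
		}
		return false, nil // still present; keep polling
	})
}
```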

@smarterclayton
Contributor

We should maybe just turn the client timeout way up. It's kind of arbitrary and I opened another issue about removing it.

@janetkuo
Member

janetkuo commented Feb 8, 2017

This is causing quite a lot of test flakes. Can we replace watch.Until with something else (polling?) in the tests until it's fixed properly?

Filed #41112

k8s-github-robot pushed a commit that referenced this issue Feb 10, 2017
Automatic merge from submit-queue (batch tested with PRs 41112, 41201, 41058, 40650, 40926)

e2e test flakes: remove some uses of watch.Until in e2e tests

`watch.Until` is somewhat broken and is causing quite a lot of test flakes. See #39879 (comment) and #31345 for more context.

@wojtek-t @yujuhong @Kargakis
@wojtek-t
Member Author

Do we want to do something with this soon-ish?

@smarterclayton
Contributor

It seems like we want a client aware watcher abstraction that is basically a lightweight, low cost informer.
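
For illustration, a shared informer can already be used this way, at the cost of listing and caching more than the single object we care about - which is presumably why something lighter is wanted. A rough sketch (the names and the deletion condition are mine, not an existing helper):

```go
// Rough sketch: let an informer own the List+Watch lifecycle (including
// re-establishing watches) and signal a channel when the condition is seen.
package example

import (
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func waitForRCDeletedViaInformer(c kubernetes.Interface, ns, name string, timeout time.Duration) bool {
	factory := informers.NewSharedInformerFactoryWithOptions(c, 0, informers.WithNamespace(ns))
	inf := factory.Core().V1().ReplicationControllers().Informer()

	done := make(chan struct{})
	var once sync.Once
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Real code would also handle cache.DeletedFinalStateUnknown tombstones
			// and the case where the RC was already gone before the informer started.
			if rc, ok := obj.(*corev1.ReplicationController); ok && rc.Name == name {
				once.Do(func() { close(done) })
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop) // the informer renews its watches internally

	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}
```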

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2018
@wojtek-t
Member Author

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2018
shyamjvs added a commit to shyamjvs/kubernetes that referenced this issue Feb 27, 2018
k8s-github-robot pushed a commit that referenced this issue Feb 28, 2018
Automatic merge from submit-queue (batch tested with PRs 60470, 59149, 56075, 60280, 60504). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Make Scale() for RC poll-based until #31345 is fixed

Fixes #56064 in the short term, until issue #31345 is fixed.
We should eventually move RS, job, deployment, etc. all to watch-based (#56071).

/cc @wojtek-t - SGTY?

```release-note
NONE
```
jingxu97 pushed a commit to jingxu97/kubernetes that referenced this issue Mar 13, 2018
@spiffxp
Member

spiffxp commented Mar 16, 2018

/priority backlog
taking a guess at priority based on how long it's been since this was touched

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Mar 16, 2018
@spiffxp spiffxp removed the triaged label Mar 16, 2018
@tnozicka
Contributor

tnozicka commented Apr 5, 2018

Fix is ready here: #50102

@tnozicka
Contributor

tnozicka commented Apr 5, 2018

/remove-lifecycle frozen

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2018
@wojtek-t
Member Author

wojtek-t commented Jul 4, 2018

/remove-lifecycle stale
/lifecycle frozen

#50102 (once figured out) should fix this issue.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 4, 2018
@gatici

gatici commented Jul 19, 2019

Hi @wojtek-t

My big deployment with Helm is failing with the error below. I am starting the Helm installation with the --timeout 5000 parameter. Is there any way to increase the watch timeout in kube-api?

helm history dev-so -o yaml

  • chart: so-4.0.0
    description: 'Release "dev-so" failed post-install: watch closed before UntilWithoutRetry
    timeout'
    revision: 1
    status: FAILED

akhilerm pushed a commit to akhilerm/apimachinery that referenced this issue Sep 20, 2022
Automatic merge from submit-queue (batch tested with PRs 41112, 41201, 41058, 40650, 40926)

e2e test flakes: remove some uses of watch.Until in e2e tests

`watch.Until` is somewhat broken and is causing quite a lot of test flakes. See kubernetes/kubernetes#39879 (comment) and kubernetes/kubernetes#31345 for more context.

@wojtek-t @yujuhong @Kargakis

Kubernetes-commit: 558c37aee3ae62356bb16068af9973e5489aa86a