
Using wait.Until doesn't work for long durations #31345

Closed
wojtek-t opened this issue Aug 24, 2016 · 22 comments · Fixed by #67350
Labels
area/kubectl area/reliability lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@wojtek-t
Member

Density test is failing without GC on large clusters with the following error:

• Failure [2058.061 seconds]
[k8s.io] Density
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:783
  [Feature:Performance] should allow starting 30 pods per node [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:673

  Expected error:
      <*errors.errorString | 0xc865089da0>: {
          s: "error while stopping RC: density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b: timed out waiting for \"density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b\" to be synced",
      }
      error while stopping RC: density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b: timed out waiting for "density60000-0-9e82e298-69eb-11e6-8157-a0481cabf39b" to be synced
  not to have occurred

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:332

However, this is currently "working as implemented".

The problem is that in this case, underneath we are using the "DeleteRCAndPods" method to delete the RC, and this in turn uses "ReplicationControllerReaper", which ends up calling this:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/scale.go#L199

However, when the watch is closed, "wait.Until" returns "ErrWaitTimeout":
https://github.com/kubernetes/kubernetes/blob/master/pkg/watch/until.go#L63

The test fails after exactly 5 minutes of waiting for the RC deletion, because of the Timeout set on the http.Client:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/util.go#L1749

I think that there are two issues here:

  1. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/scale.go#L199 is not going to work if we intend to wait for a long time, since the watch can be closed at any time. In those cases, we should actually be renewing the watch (a sketch of what that could look like follows this list).
    We may have similar problems in other places in the code.
  2. Why do we set client.Timeout in tests if we are not doing it anywhere in production? We should use a setup as similar to production as possible in our tests. @krousey @lavalamp
    [For this one I'm going to send a PR that stops setting timeouts on http.Client in tests.]
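
To make point 1 concrete, here is a minimal sketch of what renewing the watch could look like. It is purely illustrative, assumes a recent client-go (context-aware Get/Watch), and waitForRCSynced plus its condition are made-up names rather than the code linked above:

```go
// Illustrative only: re-establish the watch whenever the server closes it,
// instead of returning ErrWaitTimeout like a single-shot wait would.
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForRCSynced waits until done(rc) is true (e.g. status.Replicas == 0),
// renewing the watch every time it is closed, until ctx expires.
func waitForRCSynced(ctx context.Context, c kubernetes.Interface, ns, name string,
	done func(*corev1.ReplicationController) bool) error {

	for {
		// Get first, so we have a consistent snapshot and a resourceVersion to watch from.
		rc, err := c.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if done(rc) {
			return nil
		}

		// Watch from the observed resourceVersion; the apiserver may close this
		// stream after a few minutes, which is expected and not an error here.
		w, err := c.CoreV1().ReplicationControllers(ns).Watch(ctx, metav1.ListOptions{
			FieldSelector:   "metadata.name=" + name,
			ResourceVersion: rc.ResourceVersion,
		})
		if err != nil {
			return err
		}
		for ev := range w.ResultChan() {
			if obj, ok := ev.Object.(*corev1.ReplicationController); ok && done(obj) {
				w.Stop()
				return nil
			}
		}
		w.Stop()

		// The watch was closed; back off briefly and renew it, unless we ran out of time.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```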
@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/kubectl area/reliability sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 24, 2016
@wojtek-t
Member Author

@caesarxuchao

@wojtek-t wojtek-t changed the title test/e2e/density.go is failing without GC on large clusters Using wait.Until doesn't work for long durations Sep 6, 2016
@wojtek-t
Member Author

wojtek-t commented Sep 6, 2016

So to summarize, the main problem in my opinion is that:
wait.Until takes a "watch.Interface" as an argument, and watches are broken up every 5-10 minutes. So if you are supposed to wait longer than that, it will simply time out.

Adding @smarterclayton - who added wait.Until IIRC

@kubernetes/sig-scalability

@smarterclayton
Contributor

We should be renewing the watch. However that's not enough - you really need Get then Watch. So there has to be something that abstracts watcher creation (a watch wrapper).
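
For illustration only, the "something that abstracts watcher creation" could be as small as an interface along these lines; all names here are made up, though client-go's cache.ListerWatcher plays essentially this role for the reflector:

```go
// Hypothetical shape of a "watch wrapper" that owns watcher creation, so a
// caller can Get/List first and then re-open the Watch whenever the server
// closes it. None of these names are existing Kubernetes APIs.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
)

// WatcherFactory abstracts how watches are (re)created.
type WatcherFactory interface {
	// List returns the current state plus the resourceVersion to start watching from.
	List(options metav1.ListOptions) (runtime.Object, string, error)
	// Watch opens a fresh watch starting at the given resourceVersion.
	Watch(resourceVersion string) (watch.Interface, error)
}

// ConditionFunc reports whether an observed event satisfies what we are waiting for.
type ConditionFunc func(event watch.Event) (bool, error)

// A wait helper built on top of WatcherFactory would loop: List, check the
// condition, Watch from the returned resourceVersion, and when the server
// closes the watch, go back to List instead of reporting a timeout.
```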

@wojtek-t
Member Author

wojtek-t commented Sep 6, 2016

Yeah - I agree that this is not a trivial change. It should be somewhat similar to what we are doing in the reflector.

I guess it's not a 1.4 change though.

@smarterclayton
Contributor

Agreed that anyone who switched to watch.Until may have done so prematurely.

k8s-github-robot pushed a commit that referenced this issue Oct 7, 2016
Automatic merge from submit-queue

Don't set timeouts in clients in tests

We are not setting timeouts in production - we shouldn't do it in tests then...

Addresses point 2. of #31345
@yujuhong
Contributor

yujuhong commented Feb 7, 2017

This is causing quite a lot of test flakes. Can we replace watch.Until with something else (polling?) in the tests until it's fixed properly?
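
To make the polling idea concrete, a test-side replacement could look roughly like the sketch below; waitForRCDeleted, the 5-second interval, and the deletion condition are illustrative (and assume a recent client-go), not the e2e framework's actual code:

```go
// Hedged sketch of "replace watch.Until with polling" for tests: poll the API
// at an interval instead of holding a single long-lived watch.
package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRCDeleted polls until the ReplicationController is gone or the
// timeout expires. Unlike a single watch, this survives closed connections.
func waitForRCDeleted(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().ReplicationControllers(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // deleted; condition met
		}
		if err != nil {
			return false, err // unexpected error; stop polling
		}
		return false, nil // still present; keep polling
	})
}
```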

@smarterclayton
Contributor

We should maybe just turn the client timeout way up. It's kind of arbitrary and I opened another issue about removing it.

@janetkuo
Member

janetkuo commented Feb 8, 2017

This is causing quite a lot of test flakes. Can we replace watch.Until with something else (polling?) in the tests until it's fixed properly?

Filed #41112

k8s-github-robot pushed a commit that referenced this issue Feb 10, 2017
Automatic merge from submit-queue (batch tested with PRs 41112, 41201, 41058, 40650, 40926)

e2e test flakes: remove some uses of watch.Until in e2e tests

`watch.Until` is somewhat broken and is causing quite a lot of test flakes. See #39879 (comment) and #31345 for more context.

@wojtek-t @yujuhong @Kargakis
@wojtek-t
Member Author

Do we want to do something with this soon-ish?

@smarterclayton
Contributor

It seems like we want a client aware watcher abstraction that is basically a lightweight, low cost informer.
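
For illustration, a shared informer can already be used this way, at the cost of listing and caching more than the single object we care about - which is presumably why something lighter is wanted. A rough sketch (the names and the deletion condition are mine, not an existing helper):

```go
// Rough sketch: let an informer own the List+Watch lifecycle (including
// re-establishing watches) and signal a channel when the condition is seen.
package example

import (
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func waitForRCDeletedViaInformer(c kubernetes.Interface, ns, name string, timeout time.Duration) bool {
	factory := informers.NewSharedInformerFactoryWithOptions(c, 0, informers.WithNamespace(ns))
	inf := factory.Core().V1().ReplicationControllers().Informer()

	done := make(chan struct{})
	var once sync.Once
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Real code would also handle cache.DeletedFinalStateUnknown tombstones
			// and the case where the RC was already gone before the informer started.
			if rc, ok := obj.(*corev1.ReplicationController); ok && rc.Name == name {
				once.Do(func() { close(done) })
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop) // the informer renews its watches internally

	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}
```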

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2018
@wojtek-t
Member Author

/lifecycle frozen
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2018
shyamjvs added a commit to shyamjvs/kubernetes that referenced this issue Feb 27, 2018
k8s-github-robot pushed a commit that referenced this issue Feb 28, 2018
Automatic merge from submit-queue (batch tested with PRs 60470, 59149, 56075, 60280, 60504). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Make Scale() for RC poll-based until #31345 is fixed

Fixes #56064 in the short term, until issue #31345 is fixed.
We should eventually move RS, job, deployment, etc. all to watch-based (#56071).

/cc @wojtek-t - SGTY?

```release-note
NONE
```
jingxu97 pushed a commit to jingxu97/kubernetes that referenced this issue Mar 13, 2018
@spiffxp
Member

spiffxp commented Mar 16, 2018

/priority backlog
taking a guess at priority based on how long it's been since this was touched

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Mar 16, 2018
@spiffxp spiffxp removed the triaged label Mar 16, 2018
@tnozicka
Contributor

tnozicka commented Apr 5, 2018

Fix is ready here: #50102

@tnozicka
Contributor

tnozicka commented Apr 5, 2018

/remove-lifecycle frozen

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2018
@wojtek-t
Member Author

wojtek-t commented Jul 4, 2018

/remove-lifecycle stale
/lifecycle frozen

#50102 (once figured out) should fix this issue.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 4, 2018
@gatici

gatici commented Jul 19, 2019

Hi @wojtek-t

My big deployment with Helm is failing with the error below. I am starting the Helm installation with the --timeout 5000 parameter. Is there any way to increase the watch timeout in kube-api?

helm history dev-so -o yaml

  • chart: so-4.0.0
    description: 'Release "dev-so" failed post-install: watch closed before UntilWithoutRetry
    timeout'
    revision: 1
    status: FAILED

akhilerm pushed a commit to akhilerm/apimachinery that referenced this issue Sep 20, 2022
Automatic merge from submit-queue (batch tested with PRs 41112, 41201, 41058, 40650, 40926)

e2e test flakes: remove some uses of watch.Until in e2e tests

`watch.Until` is somewhat broken and is causing quite a lot of test flakes. See kubernetes/kubernetes#39879 (comment) and kubernetes/kubernetes#31345 for more context.

@wojtek-t @yujuhong @Kargakis

Kubernetes-commit: 558c37aee3ae62356bb16068af9973e5489aa86a