
Watching stops working after 10 minutes #85

Closed
PeterGerrard opened this issue Aug 18, 2020 · 44 comments · Fixed by #150

@PeterGerrard

When watching a set of ConfigMaps, the sidecar stops being notified of new changes after 10 minutes without any changes.

## Repro steps

  1. Install the following configmap into a cluster
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-test-dashboard
  labels:
    grafana_dashboard: "1"
data:
  cm-test.json: "{}"
  2. Install the sidecar into the cluster
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - env:
        - name: METHOD
        - name: LABEL
          value: grafana_dashboard
        - name: FOLDER
          value: /tmp/dashboards
        - name: RESOURCE
          value: both
      image: kiwigrid/k8s-sidecar:0.1.178
      imagePullPolicy: IfNotPresent
      name: grafana-sc-dashboard
  3. Wait 10 minutes
  4. Make a change to the ConfigMap and update it in the cluster

## Expected Behaviour

A modification event is observed.

## Actual Behaviour

Nothing happens.


Tested on AKS with Kubernetes version 1.16.10.

@ahiaht

ahiaht commented Aug 28, 2020

I have the same issue with AKS 1.16.10 and sidecar 0.1.151.
When the sidecar starts up it runs fine, but the watch loop fails after 30 minutes with this error:
[2020-08-27 15:25:47] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@axdotl
Contributor

axdotl commented Sep 17, 2020

@monotek Could you please take care of it?

@monotek
Contributor

monotek commented Sep 17, 2020

Please try with current version 0.1.193.

@auroaj

auroaj commented Sep 17, 2020

Please try with current version 0.1.193.

The same issue.

@axdotl
Contributor

axdotl commented Sep 18, 2020

As a stopgap I'm using SLEEP mode, which polls the kube-api instead.

But what's causing the problem in WATCH mode is currently unclear to me. Maybe the k8s client used in the sidecar has issues...

I created #90, maybe this helps 🤞🏼
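
For anyone not familiar with the two modes: SLEEP mode amounts to periodically re-listing the matching resources, roughly like the sketch below (the list call, label and interval are illustrative, not the sidecar's exact code):

import time
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

# SLEEP-style polling: re-list matching ConfigMaps on a fixed interval
# instead of holding a long-lived watch connection open.
while True:
    configmaps = v1.list_config_map_for_all_namespaces(label_selector="grafana_dashboard")
    for cm in configmaps.items:
        pass  # (re)write every file; deletions are never noticed this way
    time.sleep(60)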

@monotek
Contributor

monotek commented Sep 21, 2020

@auroaj & @PeterGerrard
#90 was merged.
Please try with 0.1.209.

@monotek
Contributor

monotek commented Sep 22, 2020

@axdotl
Since you mentioned the sleep mode: do you think this could help too? https://github.com/kiwigrid/k8s-sidecar/pull/88/files

@qaiserali

Any updates on this? I'm getting the same issue and the sidecar stops working after a couple of minutes.

@monotek
Contributor

monotek commented Sep 29, 2020

No, as you can see there was no feedback on whether it works with image 0.1.209.

@qaiserali

No, as you can see there was no feedback on whether it works with image 0.1.209.

I have tried it with image 0.1.209, and it doesn't work.

@monotek
Contributor

monotek commented Sep 30, 2020

Does anyone of you know what the last working version is?

@axdotl
Contributor

axdotl commented Sep 30, 2020

I think it is a general issue. I ran for a long time with 0.1.20, where the issue also occurred... maybe it is more related to changes in the k8s API.

@monotek
Contributor

monotek commented Sep 30, 2020

So the last working k8s version would be interesting too.

@pulledtim
Contributor

The title says it stops working after 10 minutes. I adjusted the test to add a new configmap after 11 minutes and it works with all k8s versions. Can anyone say after what time they observed the problem?

@axdotl
Contributor

axdotl commented Oct 14, 2020

My assumption is that this is related to interruptions of the kube-api connection, which might cause the resource watching to stop.

@auroaj

auroaj commented Oct 16, 2020

The title says it stops working after 10 minutes. I adjusted the test to add a new configmap after 11 minutes and it works with all k8s versions. Can anyone say after what time they observed the problem?

3-4 hours for me.

@auroaj

auroaj commented Oct 20, 2020

Checked with kiwigrid/k8s-sidecar:1.1.0 and AKS K8s Rev: v1.17.11.
The same issue.

@PoliM
Contributor

PoliM commented Oct 27, 2020

There is an interesting fix in the Kubernetes Python Client v12.0.0. From the https://github.com/kubernetes-client/python/blob/release-12.0/CHANGELOG.md

Retry expired watches kubernetes-client/python-base#133

@PoliM
Contributor

PoliM commented Oct 28, 2020

Thanks for merging. I updated the deployment yesterday to the docker image tag 0.1.259 and this morning, 15 hours later, it still detects modifications on configmaps 👍
There is also a change in the log. About every 30 to 60 minutes there's an entry like:

[2020-10-28 07:29:07] ApiException when calling kubernetes: (410)
Reason: Expired: too old resource version: 194195784 (199007602)

And so the resource watcher gets restarted.
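
The effect is roughly as if the watch loop caught the 410 and re-established the stream from a fresh list, as in this sketch (illustrative only, not the sidecar's or the client's literal code; handle() is a hypothetical callback):

from kubernetes import client, watch
from kubernetes.client.rest import ApiException

v1 = client.CoreV1Api()

while True:
    try:
        # No resource_version is passed, so every (re)start begins from a
        # fresh LIST and cannot reuse a stale version.
        for event in watch.Watch().stream(v1.list_config_map_for_all_namespaces,
                                          label_selector="grafana_dashboard"):
            handle(event)  # hypothetical handler
    except ApiException as e:
        if e.status != 410:  # 410 Gone: "too old resource version"
            raise
        # fall through and restart the watch from scratch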

BTW, the tag 1.2.0 had a build error, that's why I used 0.1.259 from the CI build.

@auroaj

auroaj commented Oct 28, 2020

Hmmm...
Not in my case...
Still the same issue.
Also, I can't find this entry:

ApiException when calling kubernetes: (410)

I've checked twice, with grafana-sc-dashboard kiwigrid/k8s-sidecar:0.1.259, it stops working.

K8s Rev: v1.17.11

I'll try to check it in another location.

@qaiserali

I'm also still getting the issue with version 'k8s-sidecar:0.1.259'. It stops working after a couple of minutes.

@auroaj

auroaj commented Oct 28, 2020

Checked in another location. Unfortunately, it was updated to v1.17.11 too.
I got the same result - just several minutes (instead of hours as before) before it stops.
And I've found this line in logs in one of my envs (the only one):

[2020-10-28 14:03:21] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@djsly

djsly commented Nov 10, 2020

Are we sure that 0.1.259 matches the 1.2.0 release? The deploy stage failed: https://app.circleci.com/pipelines/github/kiwigrid/k8s-sidecar/47/workflows/f0000c91-ba71-42b7-828d-0f235915ab29/jobs/274

@OmegaVVeapon

OmegaVVeapon commented Nov 10, 2020

Are we sure that 0.1.259 matches the 1.2.0 release? The deploy stage failed: https://app.circleci.com/pipelines/github/kiwigrid/k8s-sidecar/47/workflows/f0000c91-ba71-42b7-828d-0f235915ab29/jobs/274

It seems to be.

The only commit in the 1.2.0 release was the kubernetes library bump to 12.0.0 and if you check the libraries in that image, you'll see that the kubernetes library is updated.

❯ docker run --rm -it kiwigrid/k8s-sidecar:0.1.259 sh
/app $ pip list | grep kubernetes
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
kubernetes        12.0.0

However, I still had the same issue as @qaiserali where it stopped working after a few minutes.

@djsly

djsly commented Nov 25, 2020

Any luck from anyone with a more recent version? We moved to the SLEEP method, but now the dashboards are not getting removed on ConfigMap deletion :(

@chadlwilson

chadlwilson commented Nov 25, 2020

@djsly We noticed exactly the same thing about dashboards not getting deleted.

With a specific use of this sidecar (Grafana) earlier today, this behaviour under SLEEP mode caused a denial-of-service on one of our Grafana environments when a config map was effectively renamed (same content, one ConfigMap deleted, another one added). The dashboard with the old name not being deleted, and the new one then being detected as not having a unique uid, led to grafana/grafana#14674. Apart from hundreds of thousands of logged Grafana errors, this seemingly caused the dashboard version to churn in the Grafana DB between two different versions, bloating our DB and eventually running out of space for the grafana DB on our PVs. Cool. :-)

So we went back to WATCH mode; and upgraded the sidecar to 1.3.0 which includes the kube client bump.

In our case WATCH mode only occasionally lost connectivity/stopped watching, as far as we noticed, rather than every 10 minutes/few hours as some people have observed here. Since we run on AWS EKS, one theory was that the watches might get terminated during control plane/master upgrades by AWS and not re-established reliably, but that was just a theory given how infrequently we had experienced the issues with WATCH. Will see how we go.

@vsabella

I think there was a similar issue with the notifications API that affected aadpodidentity as well.
The fix was to update to the latest Kubernetes libraries, which had a fix for this, but I'm not sure whether that applies to the Python libraries as well. Either way, it doesn't work behind a load balancer at the moment.

@stale

stale bot commented Sep 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@visand

visand commented Oct 6, 2021

I'm also seeing this issue. Running quay.io/kiwigrid/k8s-sidecar:1.12.3 on AKS. Did you all work around this issue somehow or should the issue be reopened?

@diversit

diversit commented Nov 5, 2021

Still experiencing this issue on AKS with k8s-sidecar:1.14.2.

Is there still the problem that when using LIST deleted resources are not deleted?

@jekkel
Member

jekkel commented Nov 5, 2021

Is there still the problem that when using LIST deleted resources are not deleted?

Yes. We're not keeping track of resources: with WATCH the sidecar follows the events k8s emits (create/update/delete), whereas SLEEP more or less acts like create/update all the time.
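
For illustration, the WATCH path reacts per event type, roughly like this sketch (write_files/remove_files are hypothetical helpers, not the sidecar's actual functions):

from kubernetes import client, watch

v1 = client.CoreV1Api()

for event in watch.Watch().stream(v1.list_config_map_for_all_namespaces,
                                  label_selector="grafana_dashboard"):
    obj = event["object"]  # the ConfigMap in its current state
    if event["type"] in ("ADDED", "MODIFIED"):
        write_files(obj)   # hypothetical: write/update files from the data keys
    elif event["type"] == "DELETED":
        remove_files(obj)  # hypothetical: only the event stream enables this cleanup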

@gowrisankar22

The issue is reproducible with the 1.14.2 sidecar as well.

[2021-11-10 03:47:48] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@jekkel
Member

jekkel commented Nov 10, 2021

Thanks for the report. Unfortunately I have no idea what's going wrong here. Any ideas? The current state is that we pass a urllib Retry configured by REQ_RETRY_TOTAL, REQ_RETRY_CONNECT, REQ_RETRY_READ, REQ_RETRY_BACKOFF_FACTOR and REQ_TIMEOUT into the k8s client, so in general k8s communication should be retried. But from your report it seems that watching, i.e. streaming updates, is not really subject to those retries.

So I'd be happy to follow any pointers you might have.
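
For reference, the retry wiring amounts to roughly the following (a sketch, not the sidecar's exact code; the defaults shown are made up):

import os
import urllib3
from kubernetes import client, config

config.load_incluster_config()
configuration = client.Configuration.get_default_copy()
configuration.retries = urllib3.Retry(
    total=int(os.environ.get("REQ_RETRY_TOTAL", 5)),
    connect=int(os.environ.get("REQ_RETRY_CONNECT", 10)),
    read=int(os.environ.get("REQ_RETRY_READ", 5)),
    backoff_factor=float(os.environ.get("REQ_RETRY_BACKOFF_FACTOR", 1.1)),
)
v1 = client.CoreV1Api(client.ApiClient(configuration))

# These retries apply per HTTP request; a long-lived watch stream that dies
# mid-connection is a different failure mode, which seems to be what's hit here.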

@diversit

Unfortunately I have no idea either.
It seems to be retrying but each retry seems to always fail with the same ProtocolError (see above).

@bergerx

bergerx commented Dec 17, 2021

Is anyone here familiar with how the network timeouts are configured when using watches/informers with client-go?
I previously missed this comment: #85 (comment)
I strongly believe this is something at the network level.

@bergerx

bergerx commented Dec 17, 2021

Ah, I always thought this repo was in Go. Now I've checked the code for the first time, with @vsliouniaev's kubernetes-client/python#1148 (comment) in mind.
And I see that we can test this out pretty easily.

Apparently, details about these are now covered in https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md.
The server-side timeout has a default in the python library (a random value between 1800 and 3600 seconds), but the client-side timeout seems to default to None:

We can give the below change a try here.

Update this part:

additional_args = {
    'label_selector': label_selector
}
if namespace != "ALL":
    additional_args['namespace'] = namespace

stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)

As this:

    additional_args = {
        'label_selector': label_selector,

        # Tune default timeouts as outlined in
        # https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
        # https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
        # I picked 60 and 66 due to https://github.com/nolar/kopf/issues/847#issuecomment-971651446

        # 60 is a polite request to the server, asking it to cleanly close the connection after that.
        # If you have a network outage, this does nothing.
        # You can set this number much higher, maybe to 3600 seconds (1h).
        'timeout_seconds': int(os.environ.get("WATCH_SERVER_TIMEOUT", 60)),

        # 66 is a client-side timeout, configuring your local socket.
        # If you have a network outage dropping all packets with no RST/FIN,
        # this is how long your client waits before realizing & dropping the connection.
        # You can keep this number low, maybe 60 seconds.
        '_request_timeout': int(os.environ.get("WATCH_CLIENT_TIMEOUT", 66)),
    }
    ...
    stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)

This is also effectively what the alternative kopf-based implementation does; see also nolar/kopf#585 for the historical context on these settings:
https://github.com/OmegaVVeapon/kopf-k8s-sidecar/blob/main/app/sidecar_settings.py#L58-L70

And, ironically, the kopf project (which OmegaVVeapon/kopf-k8s-sidecar is based on) currently has nolar/kopf#847 open, which seems to be related, but I guess that's another edge case. We have been using kopf in our AKS clusters without regular issues, but this particular kiwigrid/k8s-sidecar issue is quite frequent.

I'll try to give my suggestion above a chance if I can reserve some time, but given that we already have a workaround in place (grafana/helm-charts#18 (comment)), it won't likely be soon.

jekkel added a commit that referenced this issue Dec 20, 2021
Add proposed timeout configuration parameters.
@jekkel jekkel reopened this Dec 20, 2021
@stale stale bot removed the wontfix label Dec 20, 2021
@jekkel jekkel linked a pull request Dec 20, 2021 that will close this issue
@jekkel
Member

jekkel commented Dec 20, 2021

@bergerx Thanks a lot for this detailed analysis. I took the opportunity and incorporated your proposal in a PR. Hopefully we can get this issue fixed with it. 🤞
