
Handlers stop working after some time #585

Closed
OmegaVVeapon opened this issue Nov 29, 2020 · 6 comments
Labels
bug Something isn't working

Comments

@OmegaVVeapon

OmegaVVeapon commented Nov 29, 2020

Long story short

The handlers for secret resources seem to silently stop working after some time.
At the start, I'm able to create configmaps and secrets with the findme label and both show events in the logs. However, after 30 minutes to an hour, the secret handlers stop firing completely (and, of course, the finalizer no longer gets added to the resources).
Meanwhile, the configmaps continue to work.

UPDATE: I tried with handlers ONLY for secrets and it still stopped working; the snippet below is amended to show this.

Description

The code snippet to reproduce the issue
import kopf

LABEL = "findme"

@kopf.on.startup()
def startup_tasks(settings: kopf.OperatorSettings, logger, **_):
    # Disable posting of log messages as k8s events
    settings.posting.enabled = False

@kopf.on.resume('', 'v1', 'secrets', labels={LABEL: kopf.PRESENT})
@kopf.on.create('', 'v1', 'secrets', labels={LABEL: kopf.PRESENT})
@kopf.on.update('', 'v1', 'secrets', labels={LABEL: kopf.PRESENT})
def log_resume_create_update(body, event, logger, **_):
    logger.info(f"Got {event.upper()} event for a {body['kind']} resource")

@kopf.on.delete('', 'v1', 'secrets', labels={LABEL: kopf.PRESENT})
def log_delete(body, event, logger, **_):
    logger.info(f"Got {event.upper()} event for a {body['kind']} resource")

Here is the secret I'm using for testing:

---
apiVersion: v1
kind: Secret
metadata:
  name: sample-secret
  labels:
    findme: "yea"
type: Opaque
data:
  # base64 encoded: "my super cool\nmultiline\nsecret"
  secret.world: bXkgc3VwZXIgY29vbAptdWx0aWxpbmUKc2VjcmV0
The exact command to reproduce the issue
kopf run simple.py --standalone --liveness --verbose --debug

Environment

  • Kopf version: 0.28
  • Kubernetes version: v1.18.8
  • Python version: 3.7.9
  • OS/platform: Debian 10.6 (FROM python 3.7)
Python packages installed

(Ran directly on the container)

# pip freeze --all
aiohttp==3.7.3
aiojobs==0.3.0
async-timeout==3.0.1
attrs==20.3.0
certifi==2020.11.8
chardet==3.0.4
click==7.1.2
idna==2.10
iso8601==0.1.13
kopf==0.28
multidict==5.0.2
pip==20.2.4
pykube-ng==20.10.0
python-json-logger==2.0.1
PyYAML==5.3.1
requests==2.25.0
setuptools==50.3.2
typing-extensions==3.7.4.3
urllib3==1.26.2
wheel==0.35.1
yarl==1.6.3
@OmegaVVeapon added the bug label on Nov 29, 2020
@jamiechapmanbrn

We've seen this happen when our k8s masters get migrated: there's an exception in a watcher and then no more updates arrive. In our case we partially worked around it by halting the operator if we stop seeing expected resources coming through, but it's not the nicest fix.
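
(For illustration, a rough sketch of that kind of workaround on top of Kopf's public decorators. The threshold, the watched resource, and the hard exit are all assumptions, and a cluster where secrets legitimately stay idle for longer than the threshold would get restarted needlessly.)

import asyncio
import os
import time

import kopf

LAST_EVENT_AT = {"ts": time.monotonic()}   # updated by any raw watch event
MAX_IDLE_SECONDS = 3600                    # hypothetical threshold; tune per cluster
TASKS = []                                 # keep a reference so the task isn't GC'd

@kopf.on.event('', 'v1', 'secrets')
def track_activity(**_):
    # Any low-level watch event proves the stream is still alive.
    LAST_EVENT_AT["ts"] = time.monotonic()

@kopf.on.startup()
async def start_watchdog(**_):
    async def watchdog():
        while True:
            await asyncio.sleep(60)
            if time.monotonic() - LAST_EVENT_AT["ts"] > MAX_IDLE_SECONDS:
                # Crude "halt": exit so Kubernetes restarts the operator pod.
                os._exit(1)
    TASKS.append(asyncio.create_task(watchdog()))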

@nolar
Owner

nolar commented Nov 30, 2020

Can you please try configuring settings.watching.server_timeout to e.g. 60 or 600 seconds?

settings.watching.client_timeout might help too, both alone and combined with the server timeout. E.g. ask the server to disconnect in 10 mins, but if it does not, disconnect from the client side in 11 mins.

This has been the case in some setups in the past: if the server is not told how long to wait, it waits forever. For reasons unknown, while the connection stays alive, the server stops sending anything over it. And so, the operator becomes "blind".

It is generally a good idea to have a specific value there.

There is no default value only because I am not sure what a good default would be, or for how long servers can guarantee the connection's liveness. That is, if I set it to 10 minutes (600 seconds), is that enough to solve the problem for sure?
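
(For concreteness, a minimal sketch of where these settings live, e.g. inside the same kind of startup handler as in the snippet above; the 600/660 values are only illustrative.)

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Ask the API server to close the watch stream after 10 minutes...
    settings.watching.server_timeout = 600
    # ...and, if it never does, drop the connection client-side a bit later.
    settings.watching.client_timeout = 660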

@OmegaVVeapon
Author

Ok, I've added:

settings.watching.client_timeout = 60
settings.watching.server_timeout = 60

and redeployed the snippet.

I'll play with some permutations of that and see if it helps.

One thing that I find very strange is why I don't see this problem when watching configmaps.
I modified the snippet in this ticket to ONLY watch configmaps and it's been running for hours; everything still triggers as it should. Meanwhile, when I listen for secrets, I can never make it past 30 minutes.
There shouldn't be anything special about how secrets are handled in kopf or in k8s itself... should there?

But yeah... after several hours of just watching configmaps, logs like these still appear:

[2020-11-30 00:49:32,911] asyncio              [INFO    ] poll took 2401.788 ms: 1 events
[2020-11-30 00:49:38,021] asyncio              [INFO    ] poll 4999.396 ms took 5005.137 ms: timeout
[2020-11-30 00:49:40,428] asyncio              [INFO    ] poll took 2404.599 ms: 1 events
[2020-11-30 00:49:45,537] asyncio              [INFO    ] poll 4999.651 ms took 5004.884 ms: timeout
[2020-11-30 00:49:47,939] asyncio              [INFO    ] poll took 2401.089 ms: 1 events
[2020-11-30 00:49:53,046] asyncio              [INFO    ] poll 4999.340 ms took 5001.108 ms: timeout
[2020-11-30 00:49:55,451] asyncio              [INFO    ] poll took 2403.718 ms: 1 events

Meanwhile, when watching secrets, those logs eventually stop showing up altogether.
There's something weird going on there.

It's also worth mentioning that I'm testing in an AKS k8s cluster with only 3 nodes and I'm the only person that has access to it. It's a very controlled environment so there shouldn't be issues related to master migrations and the like.

Anyways, I'll leave the operator running with the secret handlers for a while and see how that goes.

Thanks for the suggestions 👍

@OmegaVVeapon
Author

I'm happy to report that after leaving the operator running all night with those settings, it's still responding this morning!

I'm also not exactly sure what the correct value will be for our real cluster, which has hundreds of configmaps and secrets; I imagine it'd need to be higher?

Guess we'll try a few things and see if the operator becomes "blind" or not.

One thing that may be worth changing is this section in the documentation:

Yet, it feels wrong to assume any “good” values in a framework (especially since it works without timeouts defined, just produces extra logs).

At least in my case, it definitely wasn't working without timeouts defined. Handlers were not firing and secret resources that should have been processed by the operator were being ignored. Not sure how to best communicate that issue to other users of kopf however.

@nolar
Owner

nolar commented Nov 30, 2020

I cannot hint at what a good value would be; 10-60 minutes sounds good enough. The framework uses the resourceVersion= API query param to continue watching from the point where it was interrupted, so reconnections should not add much workload or traffic. But I'm not sure when the server side starts freezing; in the worst case, new events will be processed with a delay of up to this timeout.
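
(Not Kopf's actual internals, just an illustration of the watch mechanics being described: timeoutSeconds bounds each watch request, and resourceVersion lets the next request resume where the previous one stopped. Auth, TLS, and error handling are omitted, and the URL is a placeholder.)

import asyncio
import json

import aiohttp

API = "https://kubernetes.default.svc"   # placeholder; real auth/TLS omitted
RESOURCE = "/api/v1/secrets"

async def watch_forever():
    resource_version = ""                # empty: the server picks a recent starting point
    async with aiohttp.ClientSession() as session:
        while True:
            params = {
                "watch": "true",
                "resourceVersion": resource_version,
                "timeoutSeconds": "600",  # roughly what server_timeout asks for
            }
            async with session.get(API + RESOURCE, params=params) as resp:
                async for line in resp.content:      # one JSON event per line
                    event = json.loads(line)
                    meta = event["object"]["metadata"]
                    resource_version = meta["resourceVersion"]
                    print(event["type"], meta.get("name"))
            # The server closed the stream (timeout or otherwise): reconnect
            # from the last seen resourceVersion so no events are missed.

asyncio.run(watch_forever())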

Regarding the docs — yes, I agree, it is worth writing a note/warning there regarding these timeouts. I also thought about a warning/suggestion in the logs if nothing happens for some long period of time. Something like: "[WARNING] Nothing happened for 60 mins, which is unusual. Consider setting a server or client timeout to avoid server-side freezes. Read more at https://...docs..." — and start the "Troubleshooting" page for such cases.

If I set the default timeout to these same 60 mins, it would cause the same weird behaviour: freezing for tens of minutes and then suddenly unfreezing — not what the users expect.

@OmegaVVeapon
Author

OmegaVVeapon commented Dec 3, 2020

I have not seen this issue ever since I added the client and server timeouts.
Everything works great!
Thanks for the advice (and the framework ofc) @nolar. Wouldn't have thought that was the issue.

@OmegaVVeapon OmegaVVeapon mentioned this issue Feb 10, 2021
legigor added a commit to youscan/ClusterSecret that referenced this issue Sep 28, 2022
rra added a commit to lsst-sqre/gafaelfawr that referenced this issue Mar 17, 2023
Set a timeout on Kubernetes watches in Kopf rather than using an
unlimited watch. nolar/kopf#585 says that
watches without an expiration sometimes go silent and stop receiving
events, which is consistent with some problems we've seen.

Mark Kubernetes object parsing failures as permanent so that Kopf
won't retry with the same version of the object. Mark Kubernetes API
failures as temporary so that Kopf will use its normal delay.
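
(The permanent-vs-temporary split described in that commit message maps onto Kopf's public exception types. A minimal sketch, where parse_object and call_kubernetes_api are hypothetical stand-ins:)

import kopf

def parse_object(body):                        # hypothetical parsing helper
    if "data" not in body:
        raise ValueError("missing .data")
    return dict(body["data"])

def call_kubernetes_api(data):                 # hypothetical downstream call
    ...

@kopf.on.create('', 'v1', 'secrets')
def handle(body, **_):
    try:
        parsed = parse_object(body)
    except ValueError as exc:
        # A malformed object will not fix itself: do not retry this version.
        raise kopf.PermanentError(f"Cannot parse the object: {exc}")
    try:
        call_kubernetes_api(parsed)
    except ConnectionError as exc:
        # The API may be recovering: let Kopf retry after a delay.
        raise kopf.TemporaryError(f"API call failed: {exc}", delay=60)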