Description
I've noticed some odd behavior after some recent changes I made to one of our custom operators. This may be more of a confirmation question than a real issue, but the behavior is unexpected to me at least.
Context: a custom operator deployed to GKE that monitors and updates a set of Deployments, Services and Virtual Services.
What's happening is that a set of deployments managed by our operator (which watches a CRD as a single top-level object) stops getting updated and never finishes a reconciliation loop. Once this starts, any further changes we make to the CRD object are also ignored. There are no errors in the logs or anything saying something went wrong, and the only clue I have is one log message that is simply:
pause
That's it: no context around it, no additional metadata, just "pause". The message does line up with when our resources stop getting updated, so somehow they are related. Also, it pauses only one set of resources, while other managed objects are updated and reconciled just fine during this.
My working theory is that a change I introduced, which uses the Kubernetes client to query for a list of pods related to the deployments, is somehow tripping some kind of rate limiting. The log messages leading up to "pause" all stop right before the point where I make those API calls.
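One way to test this theory would be to log immediately before and after the pod-list call, so that a call which blocks or never returns shows up as a "start" with no matching "end". The following is only a minimal sketch in Python, not the actual operator code: `timed_call` and `fake_list_pods` are hypothetical names, and `fake_list_pods` is a stand-in that would be replaced by the real client call.

```python
import logging
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")
log = logging.getLogger("reconcile")


def timed_call(label: str, fn: Callable[[], T], warn_after: float = 1.0) -> T:
    """Run fn, logging before and after it.

    A "start" line with no matching "end" line means the call never returned.
    Calls that take at least warn_after seconds are logged at WARNING level.
    """
    log.info("start %s", label)
    started = time.monotonic()
    try:
        return fn()
    finally:
        elapsed = time.monotonic() - started
        level = logging.WARNING if elapsed >= warn_after else logging.INFO
        log.log(level, "end %s after %.3fs", label, elapsed)


def fake_list_pods() -> List[str]:
    # Hypothetical stand-in for the real pod-list API call (assumption).
    time.sleep(0.05)
    return ["pod-a", "pod-b"]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    pods = timed_call("list pods", fake_list_pods)
    print(pods)
```

If the "end" line shows up but with a long elapsed time, that would point toward throttling or a slow API response; if it never shows up, the call itself is hanging.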
Is this a thing? Is there any documentation around any rate limits in the operator framework or kubernetes client? My googling hasn't turned up anything useful. I see some "pause" related code in the core operator-sdk, but can't really tell if it's related or not.