This was spotted last night by the devex team. It appeared to surface after a cluster upgrade, at which point the cred operator went into a hot loop for all credentials it manages.
The loop is essentially this block of logging repeated over and over:
The crux of the problem is that the controller cannot see the secret it's supposed to be writing: it claims to write the secret successfully, then re-syncs only to find no secret.
It knows the username, which was probably recorded in the request's status pre-upgrade, before the hot loop began. Because it cannot find a secret, however, it does not know the secret access key, which lives only in the secret and cannot be re-obtained once lost. So it destroys all existing access keys, creates a new one, saves it (supposedly successfully), and then re-syncs.
No error ever occurs, so backoff is never triggered.
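A minimal sketch of that sync path, assuming hypothetical helper functions and field names (this is not the operator's actual code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncCredential sketches the loop described above. deleteAllAccessKeys and
// createAccessKey stand in for the operator's AWS calls (names assumed).
func syncCredential(kube kubernetes.Interface, ns, name, iamUser string,
	deleteAllAccessKeys func(user string) error,
	createAccessKey func(user string) (id, secretKey string, err error)) error {

	if _, err := kube.CoreV1().Secrets(ns).Get(name, metav1.GetOptions{}); err == nil {
		return nil // secret visible: nothing to do
	} else if !apierrors.IsNotFound(err) {
		return err // only a real error like this would trigger workqueue backoff
	}

	// Secret missing: the secret access key is stored nowhere else and cannot
	// be re-obtained, so every existing key is destroyed and a new one minted.
	if err := deleteAllAccessKeys(iamUser); err != nil {
		return err
	}
	id, secretKey, err := createAccessKey(iamUser)
	if err != nil {
		return err
	}

	// This create reportedly succeeds, yet the next sync again finds no secret.
	_, err = kube.CoreV1().Secrets(ns).Create(&corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		StringData: map[string]string{
			"aws_access_key_id":     id,
			"aws_secret_access_key": secretKey,
		},
	})
	// Returning nil on success means the request is requeued without any
	// rate limiting -- hence the hot loop.
	return err
}
```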
@sjenning also reported a 20-second watch that appears to show the secret being recreated repeatedly. (It is unclear why the controllers can't see it.)
Does this imply that something is actually deleting the secrets? What could it be? The credentials operator never deletes the secret itself; it only sets a controller reference on the target secret. It is worth noting, however, that the owner named by this controller reference lives in another namespace (roughly as sketched below). This behavior appeared fine on Kube 1.11; perhaps something changed when we jumped to 1.12?
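For illustration, roughly how such a controller reference gets attached; the API group/version and function name here are assumptions, not the operator's exact code:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// markOwned mirrors, in spirit, what the operator does: the target secret is
// marked as controlled by a CredentialsRequest that lives in the operator's
// namespace, while the secret itself lives in the consuming component's
// namespace.
func markOwned(secret *corev1.Secret, ownerName string, ownerUID types.UID) {
	controller := true
	secret.OwnerReferences = []metav1.OwnerReference{{
		APIVersion: "cloudcredential.openshift.io/v1", // assumed group/version
		Kind:       "CredentialsRequest",
		Name:       ownerName, // OwnerReference has no namespace field at all
		UID:        ownerUID,
		Controller: &controller,
	}}
}
```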
Questions to answer:
- Did this upgrade involve Kube 1.12, either as the starting or the target version?
- Was anything else in the cluster acting unusually?
- How do we trigger one of these upgrades, and does it reproduce the problem?
Current Theory
The controller reference on target secrets, pointing to the credentials request in another namespace, is the likely source of the bug. This appeared to work fine and clean up target secrets correctly, but the upgrade may have changed that behavior: Kube now thinks the controller of the secret is gone and deletes it, causing the cred operator to see nothing and recreate it.
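If the theory holds, the failure can be pictured as the garbage collector's owner lookup done from the secret's side. A speculative sketch of that lookup follows, using an assumed resource name for CredentialsRequest; it is not the GC's actual code:

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ownerVisible approximates the question the garbage collector asks: does the
// owner named in the secret's ownerReference exist in the *secret's* own
// namespace? A CredentialsRequest living elsewhere fails this lookup, so the
// secret looks orphaned and gets deleted.
func ownerVisible(dyn dynamic.Interface, secretNamespace string, ref metav1.OwnerReference) bool {
	gvr := schema.GroupVersionResource{
		Group:    "cloudcredential.openshift.io", // assumed
		Version:  "v1",
		Resource: "credentialsrequests",
	}
	obj, err := dyn.Resource(gvr).Namespace(secretNamespace).Get(ref.Name, metav1.GetOptions{})
	if err != nil || obj.GetUID() != ref.UID {
		return false // owner "missing": GC deletes the dependent secret
	}
	return true
}
```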