Hotloop In Credentials Operator Possible #24

Closed
dgoodwin opened this issue Jan 30, 2019 · 2 comments
dgoodwin commented Jan 30, 2019

This was spotted last night by the devex team. It appeared to surface after a cluster upgrade, at which point the credentials operator goes into a hot loop for all credentials it manages.

The loop is essentially this block of logging repeated over and over:

time="2019-01-29T21:03:33Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2019-01-29T21:03:33Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="target secret does not exist" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="target secret does not exist" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="loading cluster install config"
time="2019-01-29T21:03:33Z" level=debug msg="cluster install config loaded successfully"
time="2019-01-29T21:03:33Z" level=debug msg="loading AWS credentials from secret" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry secret=kube-system/aws-creds
time="2019-01-29T21:03:33Z" level=debug msg="creating root AWS client" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry secret=kube-system/aws-creds
time="2019-01-29T21:03:33Z" level=debug msg="loading AWS credentials from secret" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2019-01-29T21:03:33Z" level=debug msg="creating read AWS client" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-cloud-credential-operator/cloud-credential-operator-iam-ro-creds
time="2019-01-29T21:03:33Z" level=debug msg="loading cluster version to read clusterID" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="found cluster ID" actuator=aws clusterID=8b812d0a-1795-42ff-852f-a50211c47598 cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="user exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry userName=sjenning-openshift-image-registry-5pbs2
time="2019-01-29T21:03:33Z" level=debug msg="desired user policy: {\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Action\":[\"s3:CreateBucket\",\"s3:DeleteBucket\",\"s3:PutBucketTagging\",\"s3:GetBucketTagging\",\"s3:PutEncryptionConfiguration\",\"s3:GetEncryptionConfiguration\",\"s3:PutLifecycleConfiguration\",\"s3:GetLifecycleConfiguration\",\"s3:GetBucketLocation\",\"s3:ListBucket\",\"s3:HeadBucket\",\"s3:GetObject\",\"s3:PutObject\",\"s3:DeleteObject\",\"s3:ListBucketMultipartUploads\",\"s3:AbortMultipartUpload\"],\"Resource\":\"*\"},{\"Effect\":\"Allow\",\"Action\":[\"iam:GetUser\"],\"Resource\":\"arn:aws:iam::269733383066:user/sjenning-openshift-image-registry-5pbs2\"}]}" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="current user policy: {\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Action\":[\"s3:CreateBucket\",\"s3:DeleteBucket\",\"s3:PutBucketTagging\",\"s3:GetBucketTagging\",\"s3:PutEncryptionConfiguration\",\"s3:GetEncryptionConfiguration\",\"s3:PutLifecycleConfiguration\",\"s3:GetLifecycleConfiguration\",\"s3:GetBucketLocation\",\"s3:ListBucket\",\"s3:HeadBucket\",\"s3:GetObject\",\"s3:PutObject\",\"s3:DeleteObject\",\"s3:ListBucketMultipartUploads\",\"s3:AbortMultipartUpload\"],\"Resource\":\"*\"},{\"Effect\":\"Allow\",\"Action\":[\"iam:GetUser\"],\"Resource\":\"arn:aws:iam::269733383066:user/sjenning-openshift-image-registry-5pbs2\"}]}" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="no changes to user policy" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="sync ListAccessKeys" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="secret does not exist" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="access key exists? false" accessKeyID= actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="generating new AWS access key" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="deleting all AWS access keys" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="deleting access key" accessKeyID=AKIAI2UVYWZYBVSEZJZA actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="all access keys deleted" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="access key created" accessKeyID=AKIAI6YRJPMGRMLEDHOA actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=info msg="creating secret" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry targetSecret=openshift-image-registry/installer-cloud-credentials
time="2019-01-29T21:03:33Z" level=info msg="secret created successfully" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry targetSecret=openshift-image-registry/installer-cloud-credentials
time="2019-01-29T21:03:33Z" level=debug msg="updating credentials request status" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2019-01-29T21:03:33Z" level=debug msg="parsed annotation" cr=openshift-cloud-credential-operator/openshift-image-registry
time="2019-01-29T21:03:33Z" level=debug msg="status unchanged" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials

The crux of the problem is that the controller cannot see the secret it's supposed to be writing: it claims to write it successfully, then re-syncs only to find no secret.

It knows the username, which was probably recorded in the request's status pre-upgrade, before the hot loop began. Because it cannot find the secret, however, it does not know the secret access key; that key is effectively lost without the secret and cannot be re-obtained from AWS. So the operator destroys all existing access keys, creates a new one, saves it (supposedly successfully), and re-syncs.
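
The cycle above can be sketched as a minimal simulation (names and helpers hypothetical, not the operator's actual code): because the target secret is never visible on the next sync, every pass takes the "key lost" branch, rotates the access key, and reports success, so nothing errors and nothing slows down.

```go
package main

import "fmt"

// sync is a toy model of one reconcile pass. If the target secret is
// visible, there is nothing to do. If it is not, the saved secret access
// key is unrecoverable, so the only safe move is to delete every existing
// AWS access key and mint a new one, then (re)write the secret.
func sync(secretVisible bool, keys []string, nextKeyID int) ([]string, int, bool) {
	if secretVisible {
		return keys, nextKeyID, false // steady state: nothing to do
	}
	keys = keys[:0] // "deleting all AWS access keys"
	keys = append(keys, fmt.Sprintf("AKIA%04d", nextKeyID))
	return keys, nextKeyID + 1, true // rotated; secret "created successfully"
}

func main() {
	keys := []string{"AKIA0000"}
	next := 1
	rotations := 0
	// If something keeps deleting the secret, every resync rotates again,
	// and since every pass "succeeds", no error-driven backoff ever fires.
	for i := 0; i < 5; i++ {
		var rotated bool
		keys, next, rotated = sync(false, keys, next)
		if rotated {
			rotations++
		}
	}
	fmt.Println(rotations, keys) // prints: 5 [AKIA0005]
}
```

The point of the sketch is the last comment in the loop: the rotation path returns success, so the workqueue's rate limiter never engages and the controller spins at full speed.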

No error is occurring, so backoff is never triggered.

@sjenning also captured a roughly 20-second watch that appears to show the secret being recreated repeatedly. (It is unclear why the controllers can't see it.)

$ oc get secret -w
NAME                                              TYPE                                  DATA   AGE
builder-dockercfg-9cwg8                           kubernetes.io/dockercfg               1      30m
builder-token-48h2v                               kubernetes.io/service-account-token   3      30m
builder-token-6wlxb                               kubernetes.io/service-account-token   3      31m
cluster-image-registry-operator-dockercfg-l2fxw   kubernetes.io/dockercfg               1      30m
cluster-image-registry-operator-token-d2v7l       kubernetes.io/service-account-token   3      30m
cluster-image-registry-operator-token-t9nhg       kubernetes.io/service-account-token   3      31m
default-dockercfg-wr89c                           kubernetes.io/dockercfg               1      30m
default-token-4pfhm                               kubernetes.io/service-account-token   3      31m
default-token-tl4l7                               kubernetes.io/service-account-token   3      30m
deployer-dockercfg-bc8tx                          kubernetes.io/dockercfg               1      30m
deployer-token-fc77j                              kubernetes.io/service-account-token   3      31m
deployer-token-fz74s                              kubernetes.io/service-account-token   3      30m
image-registry-private-configuration              Opaque                                2      30m
image-registry-tls                                kubernetes.io/tls                     2      30m
node-ca-dockercfg-tbshp                           kubernetes.io/dockercfg               1      30m
node-ca-token-m284n                               kubernetes.io/service-account-token   3      31m
node-ca-token-vvhbd                               kubernetes.io/service-account-token   3      30m
registry-dockercfg-zkh7p                          kubernetes.io/dockercfg               1      30m
registry-token-7wn7f                              kubernetes.io/service-account-token   3      30m
registry-token-dsqd4                              kubernetes.io/service-account-token   3      30m
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     1s
image-registry-private-configuration   Opaque   2     30m
installer-cloud-credentials   Opaque   2     2s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     1s
installer-cloud-credentials   Opaque   2     1s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s
installer-cloud-credentials   Opaque   2     0s

Does this imply that something is actually deleting the secrets? What could it be? The credentials operator never deletes the secret itself; it only sets a controller reference on the target secret. It is worth noting, however, that this controller reference points to an owner in another namespace. That behavior appeared fine on Kube 1.11; perhaps something changed and we jumped to 1.12?

Questions to answer:

  • Did this upgrade involve Kube 1.12, either before or after the upgrade?
  • Was anything else in the cluster acting unusually?
  • How do we trigger one of these upgrades and does it reproduce the problem?

Current Theory

The controller reference on target secrets, pointing to the credentials request in another namespace, is the likely source of the bug. This appeared to work fine and cleaned up target secrets correctly, but the upgrade may have changed that behavior: Kube now thinks the controller of the secret is gone and deletes it, causing the cred operator to see nothing and recreate it.
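
For reference, the target secret's metadata would look roughly like this (names taken from the logs above; the API group/version and UID are illustrative, not copied from the cluster). Note that ownerReferences carry no namespace field, so the garbage collector resolves the owner within the dependent's own namespace; an owner that actually lives in openshift-cloud-credential-operator can never be found from openshift-image-registry, which would make the secret look orphaned.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: installer-cloud-credentials
  namespace: openshift-image-registry
  ownerReferences:
  # There is no namespace field here. The GC will look for a
  # CredentialsRequest named "openshift-image-registry" in THIS
  # namespace, fail to find it, and treat the secret as orphaned.
  - apiVersion: cloudcredential.openshift.io/v1   # group/version illustrative
    kind: CredentialsRequest
    name: openshift-image-registry
    controller: true
    uid: "<uid-of-the-CredentialsRequest>"        # placeholder, not a real UID
type: Opaque
```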

dgoodwin commented Jan 30, 2019

In those e2e test runs I found roughly 1500 lines like the following in the openshift-kube-controller-manager logs:

I0130 14:43:27.666216       1 garbagecollector.go:408] processing item [v1/Secret, namespace: openshift-image-registry, name: installer-cloud-credentials, uid: 6780135b-249d-11e9-9f18-0a0fe991104c]              
I0130 14:43:27.676750       1 garbagecollector.go:521] delete object [v1/Secret, namespace: openshift-image-registry, name: installer-cloud-credentials, uid: 6780135b-249d-11e9-9f18-0a0fe991104c] with propagation policy Background

As a result, I'm fairly confident the fix in #26 will resolve the problem.
