
Support workload identity in flush manager service #1630

Merged
whd merged 1 commit into master from flush_workload_identity on Apr 2, 2021

Conversation

whd (Member) commented on Apr 2, 2021:

We use workload identity in newer GKE deployments. In order to actually flush queued messages to pubsub, flush jobs spawned by flush manager need to be annotated with the appropriate k8s service account information (mapped to a GCP SA by workload identity).
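
For context, the workload identity mapping itself lives on the Kubernetes service account: the KSA carries an `iam.gke.io/gcp-service-account` annotation pointing at the GCP SA it should act as. A minimal sketch using the Python kubernetes client (the SA names and namespace here are hypothetical, not the actual ops config):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

# Hypothetical names: the real KSA/GCP SA names live in the ops repo.
ksa = client.V1ServiceAccount(
    metadata=client.V1ObjectMeta(
        name="flush-manager",
        namespace="ingestion",
        annotations={
            # Workload identity uses this annotation to map the k8s SA
            # to the GCP SA that the flush jobs authenticate as.
            "iam.gke.io/gcp-service-account": "flush@my-project.iam.gserviceaccount.com",
        },
    )
)
client.CoreV1Api().create_namespaced_service_account(namespace="ingestion", body=ksa)
```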

In the old stack that the flush manager was developed on, we passed credentials to containers via a GOOGLE_APPLICATION_CREDENTIALS env var and a mounted secret volume. In fact, I'm not sure the flush manager config ever worked in the old stage either, since the current code passes neither GCP SA nor k8s SA information. I vaguely recall discussing this with :relud, so it might have been known and I simply didn't catch that flushes always failed until doing more exhaustive testing (flushes of empty disks do succeed, however, which is the common case).
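
For reference, the old pattern looked roughly like this sketch (container, mount, and secret names are illustrative, not the actual old-stack config): a secret volume mounted into the container, with GOOGLE_APPLICATION_CREDENTIALS pointing at the mounted key file.

```python
from kubernetes import client

# Illustrative sketch of the old credential-passing pattern.
container = client.V1Container(
    name="flush",
    image="flush-image",
    env=[
        client.V1EnvVar(
            name="GOOGLE_APPLICATION_CREDENTIALS",
            value="/etc/gcp/credentials.json",
        )
    ],
    volume_mounts=[
        client.V1VolumeMount(name="gcp-credentials", mount_path="/etc/gcp", read_only=True)
    ],
)
volume = client.V1Volume(
    name="gcp-credentials",
    secret=client.V1SecretVolumeSource(secret_name="flush-gcp-credentials"),
)
```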

See service_account_name. When this field is left unspecified, k8s falls back to the default service account, so leaving it unset shouldn't be a change in behavior. In ops logic we generally prefer not to annotate the default k8s SA, and instead annotate explicit service accounts within namespaces.
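
A sketch of how the spawned Job's pod spec can carry the service account name (the surrounding structure is illustrative, not the actual flush manager code; passing None leaves the field unset and preserves the old fall-back-to-default behavior):

```python
from kubernetes import client

def flush_job(name, image, pvc_name, service_account_name=None):
    # service_account_name=None leaves the field unset, so k8s falls back
    # to the namespace's "default" SA, matching the previous behavior.
    return client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="OnFailure",
                    service_account_name=service_account_name,
                    containers=[
                        client.V1Container(
                            name="flush",
                            image=image,
                            volume_mounts=[
                                client.V1VolumeMount(name="queue", mount_path="/data")
                            ],
                        )
                    ],
                    volumes=[
                        client.V1Volume(
                            name="queue",
                            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                                claim_name=pvc_name
                            ),
                        )
                    ],
                )
            )
        ),
    )
```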

Tested with https://github.com/mozilla-services/cloudops-infra/pull/2695/commits/b92de758dbb751a77dda8d25146b93ddf56a7d4e in stage.

Separately, while testing this I was able to induce data loss simply by deleting a pod associated with a flush job while it was in an induced error state (flush manager deleted the PV even though the flush job did not succeed, or at least should not have reported a successful status). This is mildly concerning, but I may be misunderstanding the expectation here (kubectl delete pod is not going to happen on production stacks in practice). I will double-check this with :relud next week.

whd requested review from relud and jklukas on April 2, 2021 at 03:14
jklukas (Contributor) left a comment:

LGTM

whd merged commit cebe636 into master on Apr 2, 2021
whd deleted the flush_workload_identity branch on April 2, 2021 at 17:55