Support workload identity in flush manager service #1630
Merged
We use workload identity in newer GKE deployments. In order to actually flush queued messages to pubsub, flush jobs spawned by flush manager need to be annotated with the appropriate k8s service account information (mapped to a GCP SA by workload identity).
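For context, the workload identity mapping itself lives on the k8s service account (handled on the ops side in cloudops-infra rather than in this service); a rough sketch of what that binding looks like using the kubernetes Python client, where the SA names, namespace, and project are placeholders rather than the real ops values:

```python
from kubernetes import client, config

# Assumes the code runs in-cluster; use config.load_kube_config() locally.
config.load_incluster_config()

core = client.CoreV1Api()

# Annotate the k8s SA so GKE workload identity maps it to a GCP SA.
# The annotation key is the standard GKE workload identity key; the SA,
# namespace, and project names below are placeholders.
core.patch_namespaced_service_account(
    name="flush-manager",
    namespace="ingestion",
    body={
        "metadata": {
            "annotations": {
                "iam.gke.io/gcp-service-account": (
                    "flush-manager@my-project.iam.gserviceaccount.com"
                )
            }
        }
    },
)
```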
In the old stack that flush manager was developed on, we used `GOOGLE_APPLICATION_CREDENTIALS` via an env var and a mounted secret volume to pass credentials to containers. In point of fact, I'm not sure the flush manager config ever worked in the old stage either, since the current code passes neither GCP SA nor k8s SA information. I vaguely recall discussing this with :relud, so it might have been known and I simply didn't catch that flushes always failed until doing some more exhaustive testing (flushes of empty disks do succeed, however, which is the common case).

See `service_account_name`. When this is left unspecified, k8s falls back to the `default` service account, so this shouldn't be a change in behavior. In ops logic we generally prefer not to use the `default` k8s SA in annotations and instead annotate explicit service accounts within namespaces.
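For illustration, a rough sketch of how a spawned flush job could carry the k8s SA through `service_account_name` using the kubernetes Python client; the function name, image, entrypoint, and volume details below are placeholders, not the actual flush manager code:

```python
from kubernetes import client


def build_flush_job(pv_name: str, service_account_name: str = "") -> client.V1Job:
    """Build a flush Job spec; an empty service_account_name leaves the field
    unset so k8s falls back to the namespace's `default` SA."""
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        # When set, this is the k8s SA that workload identity maps to a GCP SA.
        service_account_name=service_account_name or None,
        containers=[
            client.V1Container(
                name="flush",
                image="mozilla/ingestion-edge:latest",  # placeholder image
                command=["python", "-m", "ingestion_edge.flush"],  # placeholder entrypoint
                volume_mounts=[client.V1VolumeMount(name="queue", mount_path="/data")],
            )
        ],
        volumes=[
            client.V1Volume(
                name="queue",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name=pv_name  # hypothetical PVC bound to the detached PV
                ),
            )
        ],
    )
    return client.V1Job(
        metadata=client.V1ObjectMeta(name=f"flush-{pv_name}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
        ),
    )
```

With `service_account_name=""` the field stays unset and the pod runs as the namespace `default` SA, matching the pre-existing behavior described above.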
Tested with https://github.com/mozilla-services/cloudops-infra/pull/2695/commits/b92de758dbb751a77dda8d25146b93ddf56a7d4e in stage.
Separately, while testing this I was able to induce data loss by simply deleting a pod associated with a flush job while the job was in an induced error state (flush manager would delete the `pv` even though the flush job did not succeed, or at least should not have returned a successful status). This is mildly concerning, but I may be misunderstanding the expectation here (`kubectl delete pod` is not going to happen on production stacks in practice). I will double check on this with :relud next week.