Enable bound SA tokens #718

Merged (4 commits) on Jan 24, 2020


@marun
Contributor

marun commented Jan 16, 2020

Implements operator support for openshift/enhancements#150

TODO

  • Add key management controller and test coverage of same
  • Ensure keys are configured in the apiserver pods
  • Ensure configuration of issuer and audience
  • Unskip test coverage for the TokenRequest API
  • Ensure e2e is passing consistently

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 16, 2020
@marun marun force-pushed the bound-sa-tokens branch 2 times, most recently from e2071a3 to adec557 Compare January 16, 2020 10:49
@marun
Contributor Author

marun commented Jan 16, 2020

/retest

serviceAccountIssuer: auth.openshift.io
apiAudiences:
- auth.openshift.io
serviceAccountSigningKeyFile: /etc/kubernetes/static-pod-certs/secrets/service-account-signing-key/service-account.key
Contributor

Isn't there a need to override this for bootstrapping? Or do we get this key from the render step that early?

Contributor Author

The intent is not to enable bound tokens in the bootstrapping phase, since as per a discussion on the enhancement there does not appear to be a need for that. Will it be necessary to set these values in code that can detect that bootstrapping is complete rather than here?

Contributor

I think we're ok not having bound tokens available from the bootstrap kubeapiserver. We asked in the enhancement.

Contributor

My question was more whether this path is a valid one during bootstrap or whether the process dies if not.

Contributor Author

Ah, ok. Will test tomorrow. Maybe the operator will have to detect that it is past the bootstrap phase and compose the config accordingly in code?

Contributor

There are two override yaml files, one for the bootstrap phase, one for after. Just set sensible values for each.

Contributor Author

Done.

c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
}

certConfigMap, err := c.configMapClient.ConfigMaps(targetNamespace).Get(CertConfigMapName, metav1.GetOptions{})
Contributor

what happens if the secret is changed, but the config map is not?

Contributor Author

If it's the operator secret that is changing, that is not a problem because it is not used directly by the apiserver instances. In the case of the operand secret, notice that promotion occurs only after the configmap has been successfully updated.

Contributor Author

Hopefully the functional separation makes my intent clear?

// Giving time for apiserver instances to pick up the change in public keys before
// changing the private key minimizes the potential for one or more apiservers to
// issue tokens signed by the new private key that apiservers without the
// corresponding public key are unable to validate.
Contributor

so we lack back-pressure again here, right? If rolling update is blocked for some reason, we might get into trouble.

Contributor Author

Is this still a problem now that actual state is considered rather than just waiting for a random amount of time?

Contributor

Will read the new code. Looking at the deployed state should be enough.

// corresponding public key are unable to validate.
//
// TODO(marun) Find a more accurate indication that all apiservers are capable of
// validating tokens signed by the new private key.
Contributor

In the etcd encryption code, we wait until all API servers have settled on the same revision, and that there is no new pending revision. Maybe that approach works here too?

Member

+1 for having the CM revisioned.

Are the service account tokens private and pub keys read dynamically since we're just swapping them here without redeployment?

Contributor Author

Can you point me to the etcd code in question? And should there be a corresponding update to the kcmo token controller?

re: dynamic key reads - afaict it's enough to just update the resource. Any changes to the resources/config that influence the state of apiserver pods prompt a redeployment of a single pod and only if that redeployment is successful will the change be rolled out to all pods.

Contributor

For this code, you're even ok to avoid demanding a stable level, you just need each revision on nodes to include the cert for your key.

Contributor Author

Done, PTAL.

@@ -106,6 +106,11 @@ servingInfo:
requestTimeoutSeconds: 3600
serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
Member

this should be mentioned as a default value in openshift/api#569

Contributor Author

Ok, it wasn't clear to me that a value that wasn't set by default at the API level should be documented for the API type. Done.

Contributor Author

Done.

TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval = 5 * time.Minute

CertConfigMapName = "bound-sa-token-signing-certs"
Member

nit: there will never be actual certs in this CM, will they?

Contributor Author

No, but I'm being consistent with the name of the controller-manager equivalent (sa-tokens-signing-certs) which has effectively the same content.

Member

interesting, ok

}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
newSecret, err := newSigningSecret()
Member

While ApplySecret has certain output, a human-friendly log line might be helpful here

Contributor Author

Done
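
For readers unfamiliar with the keypair check in this thread, the following is a minimal stdlib-only sketch of the idea, not the operator's actual `newSigningSecret`. The RSA key type, the 2048-bit size, the PEM block types, and the data key names `service-account.key` and `service-account.pub` are all assumptions made for illustration; `needKeypair` here only mirrors the "either half is missing" part of the condition in the diff.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// newKeyPairPEM generates an RSA key pair and returns the PEM-encoded
// private and public keys. Key type and encoding are assumptions.
func newKeyPairPEM(bits int) (privPEM, pubPEM []byte, err error) {
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, nil, err
	}
	privPEM = pem.EncodeToMemory(&pem.Block{
		Type:  "RSA PRIVATE KEY",
		Bytes: x509.MarshalPKCS1PrivateKey(key),
	})
	pubDER, err := x509.MarshalPKIXPublicKey(&key.PublicKey)
	if err != nil {
		return nil, nil, err
	}
	pubPEM = pem.EncodeToMemory(&pem.Block{Type: "PUBLIC KEY", Bytes: pubDER})
	return privPEM, pubPEM, nil
}

// needKeypair reports whether either half of the pair is missing from the
// secret data. A nil map (secret not found) also yields true.
func needKeypair(data map[string][]byte, privKey, pubKey string) bool {
	return len(data[privKey]) == 0 || len(data[pubKey]) == 0
}

func main() {
	// Hypothetical data key names, used only for this example.
	const privKey, pubKey = "service-account.key", "service-account.pub"

	data := map[string][]byte{}
	if needKeypair(data, privKey, pubKey) {
		priv, pub, err := newKeyPairPEM(2048)
		if err != nil {
			panic(err)
		}
		data[privKey], data[pubKey] = priv, pub
	}
	fmt.Println("still need keypair:", needKeypair(data, privKey, pubKey))
}
```

In the real controller the generated material would be written to a Secret and, per the review feedback, accompanied by an event or log line.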

go wait.Until(func() {
ticker := time.NewTicker(time.Minute)
defer ticker.Stop()

for {
c.queue.Add(workQueueKey)
select {
case <-ticker.C:
case <-stopCh:
return
}
}

}, time.Minute, stopCh)
Member

Is this correct? The inner ticker looks unnecessary given that wait.Until has its own internal timer, which does the same thing. Doesn't this basically run wait.Until and freeze it on its first loop?

Contributor Author

I think you're right - it should be sufficient to pass func() {c.queue.Add(workQueueKey)} as the argument to wait.Until. Updated.
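
To illustrate the simplification agreed on here, the sketch below uses only the standard library. `until` is a hypothetical stand-in that approximates the documented behavior of `wait.Until` (run the callback, wait one period, repeat until the stop channel closes); it is not the apimachinery implementation. The counter increment stands in for `c.queue.Add(workQueueKey)`.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// until is a hypothetical stand-in for wait.Until: it calls f, waits for
// period, and repeats until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		select {
		case <-ticker.C:
		case <-stopCh:
			return
		}
	}
}

// runDemo passes a callback that only enqueues. The periodic timing comes
// from until itself, so no inner ticker or loop is needed.
func runDemo() int64 {
	var enqueued atomic.Int64
	stopCh := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		until(func() { enqueued.Add(1) }, 10*time.Millisecond, stopCh)
	}()
	time.Sleep(55 * time.Millisecond)
	close(stopCh)
	<-done
	return enqueued.Load()
}

func main() {
	fmt.Println("enqueue calls:", runDemo())
}
```

With the original inner `for`/`select`, the callback would never return until the stop channel closed, so the outer period would never come into play.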

Contributor Author

Done

@@ -106,6 +106,11 @@ servingInfo:
requestTimeoutSeconds: 3600
serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
Contributor

please pass as a flag, not as a struct value. Same with all these values.

Contributor

You're looking for apiServerArguments above. Please also add comments to help future me, who won't remember this as well.

Contributor Author

Done

}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
klog.Infof("Creating a new signing secret for bound service account tokens.")
Contributor

if it's important enough for an info message, it's important enough for an event. If it isn't important enough for an event, then it belongs at a lower level.

Contributor Author

Which would you prefer - event or lower-level log?

Contributor Author

Done

// did not have the latest public key would not be able to
// validate those new tokens.
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval = 5 * time.Minute
Contributor

You have the ability to inspect the revisions to figure out which ones contain the key in question. You can then inspect the levels on the nodes (the kubeapiservers.operator.openshift.io) to know if the nodes actually have the levels required. No need for time.
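
A rough sketch of the state-based check suggested here, under stated assumptions: `nodeStatus` is a hypothetical, trimmed-down view of the per-node status on kubeapiservers.operator.openshift.io, and `keysByRevision` is a hypothetical map from a revision number to the public keys found in that revision's copy of the configmap. How those inputs are gathered from the cluster is not shown.

```go
package main

import "fmt"

// nodeStatus is a hypothetical per-node view: which revision the node's
// apiserver is currently running.
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32
}

// keySyncedToAllNodes reports whether every node runs a revision whose
// public-key configmap contains publicKey.
func keySyncedToAllNodes(nodes []nodeStatus, keysByRevision map[int32][]string, publicKey string) bool {
	if len(nodes) == 0 {
		return false // nothing observed yet, so do not promote
	}
	for _, n := range nodes {
		found := false
		for _, k := range keysByRevision[n.CurrentRevision] {
			if k == publicKey {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	nodes := []nodeStatus{{"master-0", 7}, {"master-1", 8}}
	keys := map[int32][]string{
		7: {"old-key", "new-key"},
		8: {"old-key"}, // this revision predates the new key
	}
	fmt.Println("promotion allowed:", keySyncedToAllNodes(nodes, keys, "new-key"))
}
```

Because the decision depends only on observed state, a blocked rollout simply keeps returning false instead of racing a fixed timer.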

Contributor Author

Done

// passed (see comment above the ready interval constant). Do not return
// immediately to ensure that the new public key can be set in the configmap
// in advance of promotion.
c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
Contributor

if you make your decision based on levels of the configmap actually on the nodes and trigger based on updates to kubeapiserver.operator.openshift.io, you don't need to have this delay.

Contributor Author

Done

return ret
}

func (c *BoundSATokenSignerController) sync() error {
Contributor

Please break this function into logical bits with no data passed between them. I see:

  1. create a secret if needed. depends on nothing outside the secrets
  2. update configmap if needed. This can retrieve the current secret. A stale lister will always get another event notification, so staleness of a cache doesn't matter and this can then be contained.
  3. promoting a key. This is based on the current secret in the operator namespace, the current secret in the operand namespace, the revisions on the nodes, the content of the configmaps on those nodes. All of which can be cached.

If any step has an error, the other steps should still be run.
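
The structure requested above could be sketched roughly as follows. The step names are hypothetical and only mirror the three bullets; the closures stand in for the real work. `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one independent unit of the sync loop. Steps share no data; each
// would re-read whatever it needs from its own (possibly cached) sources.
type step struct {
	name string
	run  func() error
}

// runAll executes every step even if an earlier one failed and returns the
// combined error, or nil if all steps succeeded.
func runAll(steps []step) error {
	var errs []error
	for _, s := range steps {
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return errors.Join(errs...)
}

// demoSync simulates a sync where the first step fails and records which
// steps still ran.
func demoSync() ([]string, error) {
	var ran []string
	steps := []step{
		{"ensureSigningSecret", func() error {
			ran = append(ran, "secret")
			return errors.New("api unavailable")
		}},
		{"ensurePublicKeyConfigMap", func() error {
			ran = append(ran, "configmap")
			return nil
		}},
		{"promoteKeyIfSynced", func() error {
			ran = append(ran, "promote")
			return nil
		}},
	}
	return ran, runAll(steps)
}

func main() {
	ran, err := demoSync()
	fmt.Println("steps run:", ran)
	fmt.Println("error:", err)
}
```

Returning the joined error lets the work queue retry the whole sync while still giving the later steps a chance to make progress.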

Contributor Author

Done

@marun marun force-pushed the bound-sa-tokens branch 2 times, most recently from 2fae090 to 84a6b05 Compare January 21, 2020 08:51
@marun marun changed the title WIP Enable bound SA tokens Enable bound SA tokens Jan 21, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2020
} else {
// Update the operand secret only if the current public key has been synced to
// all nodes.
syncRequired, err = c.publicKeySyncedToAllNodes(currPublicKey)
Contributor

s/syncRequired/syncAllowed/

Contributor Author

Done

- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
# The following path contains the public keys needed to verify bound sa
# tokens. This is only supported post-bootstrap.
- /etc/kubernetes/static-pod-resources/configmaps/bound-sa-token-signing-certs
Contributor

just copy these into the overrides file

Contributor Author

Which overrides file? afaict the only ones available are for bootstrap. Would you prefer that I add another post-bootstrap override file rather than adding this feature-specific one?

Contributor

Contributor Author

This file has been renamed to config-overrides.yaml as requested.

@@ -246,6 +247,7 @@ func manageKubeAPIServerConfig(client coreclientv1.ConfigMapsGetter, recorder ev
"config.yaml",
specialMergeRules,
defaultConfig,
boundSATokenConfig,
Contributor

didn't we have the overrides yaml here as well?

Contributor Author

afaict the overrides yaml are only for bootstrap:

https://github.com/openshift/cluster-kube-apiserver-operator/tree/master/bindata/bootkube/config

I don't know why there are 2 files for bootstrap. I do know that putting these options in either of the bootstrap overrides will break bootstrapping because the bound token keypair is only created post-bootstrap.

Contributor

Contributor Author

I've posted a new PR to document the requirement to put post-bootstrap overrides in the renamed file: #731

@marun
Contributor Author

marun commented Jan 23, 2020

@sttts Updated, PTAL

marun and others added 3 commits January 23, 2020 16:03
This controller is modeled after the one that manages key material for
legacy sa tokens in cluster-kube-controller-manager-operator, but is
simplified by not having to enable bound sa tokens during bootstrap.
@marun
Contributor Author

marun commented Jan 24, 2020

@sttts Updated, PTAL

@marun
Contributor Author

marun commented Jan 24, 2020

/test all

@marun
Contributor Author

marun commented Jan 24, 2020

@damemi I've added the "DO NOT MERGE Bump revision limit in TestRevisionLimit" commit to bump the apparently magic revision limit in TestRevisionLimit. This PR is likely increasing the number of revisions in the process of coordinating the keypair configuration required to enable bound sa tokens. I'd appreciate your input as to whether changing the magic number is an acceptable workaround or if more involved changes are required.

@@ -56,10 +56,10 @@ func TestRevisionLimits(t *testing.T) {

totalRevisionLimit := operatorSpec.SucceededRevisionLimit + operatorSpec.FailedRevisionLimit
if operatorSpec.SucceededRevisionLimit == 0 {
- totalRevisionLimit += 5
+ totalRevisionLimit += 10
Contributor

For a follow-up: @damemi you wrote this original test. Why was this 5? Magic number?

Contributor

I now get the +=5 I believe (it's the default value?). But why does the loop below not wait for the revision pruning controller to do its work, but errors the test immediately?
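
One way a test could give the pruning controller time to work, sketched with the standard library only. `pollUntil` is a hypothetical helper, not the one used in the repository's e2e tests, and the revision counts in `demoPruning` are made up; the decrement stands in for the pruning controller making progress between checks.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls check every interval until it returns true or timeout
// elapses, and reports whether check succeeded in time.
func pollUntil(interval, timeout time.Duration, check func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// demoPruning simulates a revision count that starts above the limit and
// drops by one per check until it is within the limit.
func demoPruning() bool {
	const limit = 11
	revisions := 14
	return pollUntil(time.Millisecond, time.Second, func() bool {
		if revisions > limit {
			revisions-- // stand-in for the pruning controller's work
		}
		return revisions <= limit
	})
}

func main() {
	fmt.Println("within limit before timeout:", demoPruning())
}
```

A test built this way fails only if the count stays above the limit for the whole timeout, rather than on the first observation.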

The addition of bound token configuration - specifically keypair
management - appears to be adding to the number of revisions just
enough to break the test. The test needs to be updated to allow the
pruning controller time to work.
@@ -77,7 +77,9 @@ func TestRevisionLimits(t *testing.T) {
// are InProgress or Unknown (since these do not count toward failed or succeeded), which could indicate zombie revisions.
// Check total+1 to account for possibly a current new revision that just hasn't pruned off the oldest one yet.
if len(newRevisions) > int(totalRevisionLimit)+1 {
t.Errorf("more revisions (%v) than total allowed (%v): %+v", len(revisions), totalRevisionLimit, revisions)
// TODO(marun) If number of revisions has been exceeded, need to give time for the pruning controller to
Contributor

cc @damemi

@sttts
Contributor

sttts commented Jan 24, 2020

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.


@openshift-merge-robot openshift-merge-robot merged commit 303d32d into openshift:master Jan 24, 2020
deads2k added a commit to deads2k/cluster-kube-apiserver-operator that referenced this pull request Jan 27, 2020
deads2k added a commit to deads2k/cluster-kube-apiserver-operator that referenced this pull request Jan 27, 2020
@marun marun deleted the bound-sa-tokens branch February 4, 2020 14:33
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

7 participants